Quick-Tune: Quickly Learning Which Pretrained Model to Finetune and How
The AutoML Podcast · August 08, 2024 · 00:53:04 · 36.46 MB

There are so many great foundation models in many different domains - but how do you choose one for your specific problem? And how can you best finetune it? Sebastian Pineda has an answer: Quick-Tune can help select the best model and tune it for specific use cases. Listen to find out when this will be a Hugging Face feature and whether hyperparameter optimization is even important in finetuning models (spoiler: very much so)!

[00:00:00] Hi everyone and welcome back to the AutoML Podcast. I know it's been a while, but we're back with a bang. Haven't you always wanted to just select a model, let's say a language model or one of the big vision models we have, and be able to fine-tune it to your task, including HPO, out of the box?

[00:00:21] That's not quite reality yet, but Sebastian Pineda is telling us today how to get there with their awesome paper called Quick-Tune. And I think, after talking to him, that we're actually quite far along.

[00:00:35] We discussed a few obstacles still in the way, but it seems like we have all the tools, or almost all the tools, to make this automatic fine-tuning, including model selection and hyperparameter optimization, a reality.

[00:00:51] Which I think is great because in the age of foundation models, that's something where AutoML can make a really big difference.

[00:00:58] So without further ado, I give you Sebastian Pineda.

[00:01:02] Hi, welcome to the AutoML Podcast. I'm here with Sebastian Pineda. He's done really cool work that he's going to present to us today. But before we get into it, do you want to introduce yourself to us?

[00:01:14] Yeah, thank you, Teresa, for the invitation. So yeah, my name is Sebastian Pineda. I am currently doing my PhD at the University of Freiburg.

[00:01:24] My supervisor is Josif Grabocka, but I also work closely with Frank Hutter. And I work basically on AutoML, combining it with meta-learning concepts and also finding ways to do more efficient AutoML.

[00:01:43] So this is basically what I am doing for my PhD.

[00:01:47] Yeah, I mean, that fits well, since all of these are actually in the paper we're talking about. Which, I think, also has a great name.

[00:01:55] Quick Tune just really says everything you wanted to do, right?

[00:01:59] Yeah, that's a good point. We actually brainstormed the name a bit. There were other options, like Auto-Fine-Tune.

[00:02:07] But I think the quick part is interesting, because while I was working on this, I always put myself on the side of the practitioner and tried to think: okay, what does the practitioner want?

[00:02:19] And I think they mostly want quick results, at least in industry, where top performance is sometimes not so critical, right?

[00:02:29] So time is sometimes the critical factor.

[00:02:33] Yeah, and especially from a practitioner's view, I really thought reading this paper, hey, why hasn't this existed for years?

[00:02:41] Just, you know, I know there's a big challenge behind all that of how to quickly fine tune the correct model for a task.

[00:02:48] But just from practitioner's view, I think that's just such a great thing to tackle.

[00:02:55] So is that also how this idea came about? How was the process of coming up with Quick-Tune?

[00:03:00] Yeah. So I follow the machine learning communities very closely, on different social networks like Twitter and LinkedIn.

[00:03:11] And there you can normally see the practitioners' real problems.

[00:03:17] And very often they were wondering, OK, which model should I use?

[00:03:23] Which hyperparameters should I set up?

[00:03:26] There are many forums discussing this.

[00:03:28] And nowadays we actually have so many models that this question is becoming increasingly common.

[00:03:35] Yeah, it's a very, very common question.

[00:03:38] So I also realized that there are not a lot of works trying to answer the question in a principled way, with actual experiments. You can find some forum and blog posts, and also previous works tackling parts of the problem, like just selecting the model, or proposing different fine-tuning strategies.

[00:04:01] But there was no work that was actually joining all these ideas, putting them together, and testing them.

[00:04:10] And since I saw that there is really this problem, or challenge, of selecting the model and the hyperparameters, and there is still not a lot of progress or ongoing work on it,

[00:04:22] I thought this was a very important project to start and to proceed with.

[00:04:29] Yeah, I agree.

[00:04:31] I think it's a really interesting task, and also something that's very useful to a lot of people who are not necessarily always interested in training from scratch, especially since we have so many great pre-trained models nowadays.

[00:04:44] Right.

[00:04:44] For a lot of domains, you don't want to train from scratch.

[00:04:48] That's the next thing.

[00:04:49] And in a lot of domains, you really cannot do that.

[00:04:51] Yes.

[00:04:52] So, right.

[00:04:54] I think I forgot to mention the pre-trained part.

[00:04:56] That's very important as well, because we want to find the best way to fine-tune. Right?

[00:05:01] And of course, you have a big set of pre-trained models: if you follow the communities, you can see that every day there is a new pre-trained model, either for computer vision or for natural language processing.

[00:05:14] And again, still the question, what model should I use?

[00:05:17] What is the best model?

[00:05:18] You will always face the question because different groups and different industries or companies, they release different models, but it's not clear which is the winner.

[00:05:30] The leaderboard changes every week.

[00:05:32] So we don't really have a clear winner that you can take and say: oh, this is my default choice, and I will always do best with this choice.

[00:05:42] But this fine tuning setting then is also quite a bit different than what we often see in AutoML.

[00:05:48] I think the biggest AutoML building block is still hyperparameter optimization.

[00:05:52] And most of what's done there is still fairly black-box in a way, right?

[00:05:56] You just get some performance and you hope to make a better decision based on performance.

[00:06:01] But in model selection, you really can't and don't want to evaluate all models.

[00:06:07] Basically, you can't really do this in a black box way anymore, right?

[00:06:11] You can't directly use all of these super sophisticated black-box optimizers and black-box schedulers that we usually use in hyperparameter optimization and also in a lot of other AutoML applications.

[00:06:25] How do you fill these gaps?

[00:06:27] Oh, that's an interesting point, because Quick-Tune is basically borrowing three ideas.

[00:06:34] First is the gray box optimization.

[00:06:37] Second is the meta learning.

[00:06:40] And third is the cost awareness.

[00:06:42] And the first one, the gray box optimization is exactly tackling what you are mentioning.

[00:06:48] Like we don't want to evaluate a specific model for a lot of epochs and then find out that it's actually not the right model or the hyperparameters were not so good.

[00:07:01] What we want is actually just to fine tune a few epochs or even one epoch and then have feedback.

[00:07:08] And this is what is called the gray box that you see partially the performance or the learning curve.

[00:07:14] And this is especially true for fine-tuning. From my experience, and any practitioner will notice this too: when you fine-tune, you already know in the first epoch whether something is leading to something good, because the model is pre-trained and of course has some knowledge of the problem.

[00:07:32] So if you have a terrible hyperparameter setup, then in the first epochs the loss or the accuracy, or whatever you are monitoring, is not what you expect, or it's really low.

[00:07:45] Then you can start feeling, okay, this might not be a good model or I need to increase the learning rate, things like that.

[00:07:52] Of course, gray-box optimization is applied to other setups in AutoML as well, but for fine-tuning especially, I think it's very valuable to have this feedback within a few epochs.

[00:08:05] Now, that's really interesting.

[00:08:06] I mean, I don't know a lot about fine tuning actually, but that's pretty cool that even as a practitioner, if you're really practiced, you can tell within a short timeframe.

[00:08:17] How's that different from a multi-fidelity evaluation exactly though?

[00:08:20] No, it's exactly the same.

[00:08:22] So here I could use the terms gray-box and multi-fidelity interchangeably, but in the paper, and whenever presenting the idea, I just decided to always refer to this as gray-box.

[00:08:40] But it's a multi-fidelity setup where the fidelities are the epochs.

[00:08:44] Of course, you can also have other types of fidelities, like dataset size and so on.

[00:08:49] So this is a special type of multi-fidelity, let's say, where our fidelities are the epochs.
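To make that concrete, here is a minimal sketch of a gray-box loop in this spirit (not the paper's actual scheduler): every candidate gets one cheap epoch of feedback, and only the most promising candidates are promoted to further, more expensive, epochs. The model names and the random placeholder score are purely illustrative.

```python
import random

def finetune_one_epoch(config, state):
    # Hypothetical stand-in: run one more fine-tuning epoch for `config`
    # and return its current validation accuracy.
    state[config] = min(1.0, state.get(config, 0.0) + 0.1 * random.random())
    return state[config]

def graybox_search(configs, max_epochs=16, keep_fraction=0.5):
    # Successive-halving-style loop where the fidelity is the epoch count:
    # cheap low-fidelity feedback prunes bad models early.
    state, scores = {}, {}
    survivors = list(configs)
    for _ in range(max_epochs):
        if len(survivors) <= 1:
            break
        for cfg in survivors:
            scores[cfg] = finetune_one_epoch(cfg, state)
        # Promote only the top fraction to the next (more expensive) epoch.
        survivors.sort(key=lambda c: scores[c], reverse=True)
        survivors = survivors[: max(1, int(len(survivors) * keep_fraction))]
    return survivors[0]

best = graybox_search([("vit_small", 1e-4), ("resnet50", 1e-3), ("convnext", 3e-4)])
```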

[00:08:56] Okay, so that's the gray box part.

[00:08:58] You had two other parts.

[00:09:00] Yes.

[00:09:01] So there are two other parts.

[00:09:03] The second part, I think, is very relevant.

[00:09:05] It's very important.

[00:09:06] And also in the paper, we could see that it gives an important lift in the performance.

[00:09:12] It's the cost sensitivity or cost awareness.

[00:09:16] And I say that it's very important because, in previous work on multi-fidelity and gray-box optimization, there are amazing algorithms out there that practitioners are using.

[00:09:29] But most of them don't take into account the cost.

[00:09:35] And when you run algorithms that don't take the cost into account in setups where you don't actually evaluate the learning curve, let's say you have a tabular benchmark and you just query a table to get the performance after some epochs,

[00:09:54] then, of course, you don't notice the cost of querying that.

[00:09:58] But if you take into account the cost, you realize that some models are very expensive.

[00:10:05] Some hyperparameters are very expensive.

[00:10:07] And let's say if you increase the batch size or things like that, they are expensive in compute and in time.

[00:10:14] When I'm mentioning here cost awareness, however, I refer mainly to time.

[00:10:20] The time is very important, especially when you have a model hub, a model zoo with a lot of different models, different sizes.

[00:10:28] And some models will be quicker to fine-tune than others.

[00:10:33] Therefore, it's very important to be aware of the cost.

[00:10:36] Like, are you sure you really want to fine-tune that model with 1 billion or 500 million parameters?

[00:10:42] Or would you rather fine-tune this 20-million-parameter model that could actually reach similar, or sometimes, why not, better performance?

[00:10:54] So this cost awareness is an important component.

[00:10:56] And the third component is the meta-learning, where basically we leverage information from previous datasets,

[00:11:06] where we know, okay, these datasets have this learning curve or this loss curve or whatsoever, and this cost.

[00:11:13] Then if we find a way to transfer this information to the new fine-tuning setup and the new dataset,

[00:11:20] then we will hopefully find good configurations faster.

[00:11:25] And all three of these concepts, these three ideas that we borrow, are aiming at finding the answer quickly.

[00:11:36] The meta-learning, the cost awareness, and the gray box allow us to quickly find the model and the hyperparameters.

[00:11:44] Yeah, and if I understood it correctly, what you do is you first meta-train on a pretty big dataset of previous results that's in the domain you then want to quick-tune for?

[00:11:55] And then you basically have this as a knowledge base to use cost-aware gray-box tuning on top of, right?

[00:12:03] Yes.

[00:12:03] Yeah, exactly.

[00:12:04] So we have two predictors, two estimators, that are basically used during the search.

[00:12:12] One is the performance predictor.

[00:12:15] It's a probabilistic model that tries to predict: okay, if I take this model with these hyperparameters and I fine-tune it for one epoch, what will my accuracy or my loss be?

[00:12:27] So this is the performance predictor.

[00:12:30] And we have another predictor that is the cost predictor.

[00:12:33] It's also a model, a neural network, that predicts the time it takes to actually fine-tune the model for these epochs with these hyperparameters.

[00:12:43] And then using both, we proceed with the search.

[00:12:47] And before starting the optimization or the search, we pre-train them on different tasks.

[00:12:54] So that's the main idea.
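As a rough sketch of how two such predictors can drive the search, one can score every (model, hyperparameter) candidate by its predicted improvement per predicted second of fine-tuning and then run the winner for one more epoch. The predictor objects and their `predict` methods here are hypothetical stand-ins; the actual Quick-Tune acquisition is more involved than this.

```python
def select_next(candidates, perf_predictor, cost_predictor, best_so_far):
    """Pick the candidate with the best predicted improvement per second.
    Both predictors are assumed to be pre-trained on learning curves and
    runtimes observed on previous (meta-training) datasets."""
    def score(candidate):
        mean, std = perf_predictor.predict(candidate)  # predicted accuracy after one more epoch
        optimistic = mean + std                        # simple optimism-under-uncertainty bonus
        improvement = max(optimistic - best_so_far, 0.0)
        seconds = cost_predictor.predict(candidate)    # predicted wall-clock time for that epoch
        return improvement / max(seconds, 1e-9)
    return max(candidates, key=score)
```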

[00:12:55] Yeah, can you say more about that pre-training?

[00:12:57] I think there's been a lot of work on pre-training AutoML models generally in the last two years.

[00:13:03] But I'm not sure if there's like a set methodology yet.

[00:13:06] So what did your pre-training dataset look like?

[00:13:09] What domain were you targeting?

[00:13:11] And how big are you able to go at that scale?

[00:13:16] Like if I remember correctly, you're doing computer vision, right?

[00:13:19] Image classification.

[00:13:21] Yeah.

[00:13:21] So the pre-training, as you say, there is a lot of work trying to apply meta-learning or transfer HPO.

[00:13:35] And finding ways to do this transfer of information between tasks.

[00:13:42] So what I apply here: what we have as a predictor is basically a neural network with an output that is a Gaussian process.

[00:13:52] So you could say it's a Bayesian neural network.

[00:13:55] And for this Bayesian neural network, the proper term is a deep kernel Gaussian process.

[00:14:02] The deep kernel Gaussian process has this neural network that you can actually just train on the specific task or dataset.

[00:14:13] But what previous work has shown, and this is not from this work,

[00:14:17] is that in the black-box setup, where you just observe the full model training at the end, where you only observe the final performance,

[00:14:28] you can pre-train this deep kernel Gaussian process with different datasets that are similar to the target dataset, right?

[00:14:40] And this is where the domain aspect comes into play.

[00:14:45] So there is this assumption that the datasets that we use for pre-training should be close to the target dataset, right?

[00:14:54] Because then, like that, the pre-training will be more helpful.

[00:14:58] And the pre-training, again, is just like we just train the neural network sampling randomly the datasets and doing the gradient descent on the Bayesian neural network.
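For readers who want to see the shape of such a model, here is a minimal deep kernel Gaussian process and meta pre-training loop written with GPyTorch. It assumes a `sample_task` helper that returns (configuration encoding, observed score) tensors from one randomly chosen meta-training dataset; the published Quick-Tune predictor is more elaborate than this sketch.

```python
import torch
import gpytorch

class FeatureExtractor(torch.nn.Module):
    """Small MLP that embeds an encoding of (model, hyperparameters, epoch)."""
    def __init__(self, in_dim, out_dim=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, out_dim))

    def forward(self, x):
        return self.net(x)

class DeepKernelGP(gpytorch.models.ExactGP):
    """A GP whose kernel operates on learned features: a 'deep kernel'."""
    def __init__(self, train_x, train_y, likelihood, in_dim):
        super().__init__(train_x, train_y, likelihood)
        self.feature_extractor = FeatureExtractor(in_dim)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        z = self.feature_extractor(x)
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(z), self.covar_module(z))

def meta_pretrain(model, likelihood, sample_task, steps=1000, lr=1e-3):
    """Repeatedly sample observed evaluations from one meta-training dataset
    and do gradient descent on the negative marginal log-likelihood."""
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    likelihood.train()
    for _ in range(steps):
        x, y = sample_task()                     # evaluations from one previous dataset
        model.set_train_data(x, y, strict=False)
        optimizer.zero_grad()
        loss = -mll(model(x), y)
        loss.backward()
        optimizer.step()
```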

[00:15:10] But then the assumption is that the target dataset is similar.

[00:15:16] If it's not similar, then what we saw in experiments and in other setups is that the harm is not much, but the gains are not that high, right?

[00:15:25] So if it turns out that we pre-trained on datasets from a specific domain,

[00:15:32] but then we test on a dataset that belongs to another domain, then there will be no huge gain.

[00:15:42] Okay, so basically you lose what you gained through the meta-training,

[00:15:46] but you also didn't see something like it learning a bad bias in some way.

[00:15:51] No, so we didn't see that.

[00:15:55] And this is probably a line of work that would be interesting to pursue,

[00:16:01] this negative transfer behavior that you can see when fine-tuning other types of models,

[00:16:07] not specifically in AutoML.

[00:16:08] You can see negative transfer behavior, where the transfer is actually hurting the performance.

[00:16:15] We didn't see that, because we did run an ablation with and without meta-learning,

[00:16:20] and we didn't see that using meta-learning actually causes a drop in performance.

[00:16:27] So it's not harming, but sometimes it's not helpful.

[00:16:31] Yeah, if the dataset, the target dataset is too different from the...

[00:16:36] That's still actually a very nice behavior, right?

[00:16:38] Because it really means, yeah, you do need to invest some compute upfront to do this meta-training in the end.

[00:16:45] But if I understand it correctly, you could reuse that.

[00:16:49] So somebody could say, hey, these people already did that.

[00:16:52] I'm just going to grab that somewhere.

[00:16:54] This is another model I can reuse.

[00:16:56] And if it doesn't work, it doesn't work.

[00:16:59] And if it does, I'm at least likely to get some sort of benefit,

[00:17:02] even if maybe my dataset doesn't fit exactly.

[00:17:05] I think that's actually really nice.

[00:17:06] Yes, that's what we actually want.

[00:17:09] Like, if I use it and it happens that the dataset is not similar,

[00:17:15] I hope that at least there is no damage, no drop in performance.

[00:17:21] So what did your pre-training dataset look like exactly?

[00:17:24] Like how big can we imagine that being?

[00:17:27] What exactly is that data?

[00:17:29] The whole experimental setup was based on Meta-Album, which is a huge meta-dataset with more than 80 datasets.

[00:17:40] And they are split in different sizes.

[00:17:43] So we have three versions.

[00:17:45] Like the micro version, mini version, and extended version.

[00:17:50] The extended version has basically the largest datasets.

[00:17:53] And here, if you look at the Meta-Album paper, it was published in the NeurIPS Datasets and Benchmarks track.

[00:18:00] In Meta-Album, there are many, many domains: you can see animals, OCR, airplanes, cars, and so on. So it's actually, I would say, cross-domain. If I remember correctly, there are 10 different domains, but of course there are always domains that are not included. So you could claim it's a diverse meta-dataset, but it doesn't cover all possible domains, of course.

[00:18:31] Wait, just so I understand that correctly.

[00:18:32] That's what you test on?

[00:18:34] Or is that also your meta-training data?

[00:18:36] So, for the experimental setup, this is the whole collection of datasets we used. What we did is, let's say in the mini version, we have 30 datasets. We take a holdout set, a meta-dataset of three or five datasets, for meta-testing, and the rest are used for the meta-training itself. And then we pre-train on these 25 and just test on the holdout set of datasets.

[00:19:11] And this is what we are doing.

[00:19:12] And we do that five times, so that we actually evaluate on all the datasets.

[00:19:17] But always, of course, being careful not to pre-train on the same dataset that we want to test on, because that would be leakage.

[00:19:25] So yeah, we do this split into meta-splits.
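A minimal sketch of that leave-datasets-out protocol (the fold size here is illustrative):

```python
import random

def meta_splits(datasets, test_size=5, seed=0):
    # Yield (meta_train, meta_test) splits over whole datasets, so that a
    # meta-test dataset is never seen during meta-training (no leakage).
    ds = list(datasets)
    random.Random(seed).shuffle(ds)
    for start in range(0, len(ds), test_size):
        meta_test = ds[start:start + test_size]
        meta_train = [d for d in ds if d not in meta_test]
        yield meta_train, meta_test

# With the 30 datasets of the mini version and test_size=5, every split
# meta-trains the predictors on 25 datasets and evaluates on the other 5.
```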

[00:19:28] So that's also why you get some idea of how it works across domains, because you have this big variety: you have this large number of domains, and also bigger datasets, across all of your experiments.

[00:19:43] Okay, then let me see where I want to get into your experiments first, because I think you found some interesting stuff. I mean, I would obviously recommend everyone read the paper, and your results overall look really good and impressive. But you compared against a few different methods, and also against no hyperparameter optimization, in a bunch of different ways. Do you have a favorite result? A favorite thing that you tried where you say: oh yeah, I'm really glad this worked?

[00:20:13] So I would say the whole paper is interesting. I cannot say: oh, I have a favorite point. But I mean, we show that the model hub is important, right? In one experiment, for example, we run Quick-Tune while limiting or constraining the size of the model hub. Let's say we do the search only over a subset of five models, or 10 models, up to 25, because we have in total only 25 models, which were selected because they are on the Pareto front of performance versus number of parameters. So they are efficient in some way. And I like the fact that when you actually increase the size of the model hub, you see a gain in performance. It means that it's important. So this is one experiment that I think is interesting.

[00:21:16] And another experiment that I actually discussed with some people at the conference was when we compared to DINOv2. With DINOv2, we have two approaches. One is linear probing. Linear probing, by the way, is the most common way to fine-tune: basically, you just take the last layer, and since your new dataset will probably have a different number of classes, you have to change that last layer. Then you train only the last layer and freeze the rest of the network. And of course, this is super efficient, because you don't have to back-propagate to the beginning of the network and so on; you just need to train the last layer of the model. You also don't have to save all the gradients and so on for the whole model. So it's somehow efficient.
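In code, linear probing is only a few lines. A minimal PyTorch sketch, with ResNet-50 chosen purely for illustration:

```python
import torch
import torchvision

def make_linear_probe(num_classes):
    # Freeze a pretrained backbone and train only a fresh last layer
    # sized for the new dataset's number of classes.
    model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
    for param in model.parameters():
        param.requires_grad = False                                # backbone stays frozen
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)  # new trainable head
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)   # optimize the head only
    return model, optimizer
```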

[00:22:07] So this was linear probing; and we also compared to DINOv2 using LoRA, low-rank adaptation.
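As a counterpart to the linear-probing sketch above, here is the core idea of LoRA in its simplest form: a trainable low-rank update on top of a frozen linear layer. This is a minimal sketch of the technique, not the implementation used in the paper:

```python
import torch

class LoRALinear(torch.nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update B @ A,
    so only r * (in_features + out_features) parameters are fine-tuned."""
    def __init__(self, linear: torch.nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = linear
        for param in self.base.parameters():
            param.requires_grad = False        # pretrained weights stay frozen
        self.lora_a = torch.nn.Parameter(torch.randn(r, linear.in_features) * 0.01)
        self.lora_b = torch.nn.Parameter(torch.zeros(linear.out_features, r))  # zero init: no update at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)
```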

[00:22:17] And what we can see here is very interesting. So DINOv2 is a large model, more than 600 million parameters, almost 1 billion. And what we can see is that DINOv2 with LoRA, with this additional efficient fine-tuning, actually performs really badly when the datasets are small. You can see that even linear probing is sometimes better: just fine-tuning the last layer is enough, compared to DINOv2 with LoRA. But when the datasets are large, LoRA becomes actually very helpful, although Quick-Tune still outperforms both. What we were discussing, what I found, is that what might happen with DINOv2 with LoRA on small datasets is that there is probably overfitting, even with LoRA. So LoRA arguably has fewer parameters, but there are still probably a lot of parameters compared to the size of the dataset. So DINOv2 might overfit; when you have larger datasets, it's better, although Quick-Tune still outperforms it. But I think one important point, or insight, here is that even with DINOv2 plus LoRA or DINOv2 with linear probing, you still don't know which is best. It depends on the dataset. So you still need to do some search over the model and the hyperparameters.

[00:23:47] That's actually the message of the paper.

[00:23:50] Yeah, and I think that's really interesting, because there's been this trend towards unification for a few years, right? Where we get all these foundation models, and the claim of each model, at least implicitly, is: this is the new best model for computer vision, for NLP, for whatever. And you just need to fine-tune this one model. And I mean, the result that you see better accuracy across all your datasets if you include 25 instead of five models really seems to say: no, that's probably not going to happen.

[00:24:22] Yes. So there are two important aspects here. First, I think it's true that the community, especially industry, wants to push this one big foundation model that will solve everything. I don't know if we will arrive there; probably, yes. But the point is that right now there is not a clear winner. And there are some papers that show that depending on the task, or even depending on the prompt, you might need a different LLM, for example. There is this paper called LLM-Blender, where they show on the first page that depending on the prompt, you might actually have a different ranking over the models that you want to use. So the model that you want to use might depend not only on the dataset, but also on the prompt. So we are still not there: I am talking about language processing, but also in computer vision, there is no model that is the one you could use by default. If you want to take the best that you can, of course, you might need to do some search over the model space and the hyperparameter space.

[00:25:39] And another aspect, and this is more related to GPU-poor organizations, where you don't have this huge amount of resources: you can't afford to fine-tune, let's say, DINOv2, because it's actually very expensive. We even reported in the paper that some tasks don't fit in the single-GPU experiments we were running. So in these GPU-poor setups, where you just don't have A100s or H100s and you have few resources, it is more important than ever to do a proper search over the models, right? Because then you have many small models that might solve the task, but maybe some of them are better depending on the setup or on the dataset, right? Because all of them have different inductive biases; they were trained with different setups, and of course they behave differently. So you will always need to choose; especially in GPU-constrained setups, you will always need to find the model and the hyperparameters.

[00:26:41] Yeah, I think that's a great value proposition for a lot of people, because even if you have the GPUs to fine-tune the model, you might not want to keep the GPUs forever to deploy the model afterwards, right? I mean, I don't know how expensive the forward passes on your fine-tuned DINO were, but I still assume you wouldn't want to run them on a really slow CPU node.

[00:27:04] Yeah, so I think this is actually a really interesting result of the paper, that you kind of confirm this. At least for the landscape as it is right now in research, and for domains where we have these big models, in computer vision and NLP, it's probably smart to select. And you also made sure that you can do that, right? That's the next thing: if you increase the number of models you can choose from, you could also potentially just see that the selection method doesn't work. But actually, and I obviously can't show listeners the plot, it's quite a big difference: if I remember correctly, it performs really well if you just give it access to more models.

[00:27:45] Well, the coolest future version of this would be if people could just call get_model from Hugging Face for their dataset, right? How hard would it be to accomplish this? Is this something that's 10 years away? Or is it something where you think: hey, if we just do the meta pre-training for a few more domains, we could basically do it right now? How realistic is something like that?

[00:28:11] It's a good question. And connecting to our previous discussion about the different model sizes and so on: what we did was make a selection of the efficient models on the Pareto front, the ones that actually have the best performance while at the same time trying to have the fewest parameters. In the end we used only 25 models, but there are actually 700 in the timm library model hub. So we could actually use 700, but searching over these 700 would be infeasible in our setup, where we actually wanted to select one model, fine-tune it for a few epochs, and so on, because maybe we would end up selecting many models at just one epoch. So that's why we constrained the search space to efficient models, also because, again, we care about time.
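The Pareto-front filter described here is straightforward to sketch. The model names and numbers below are illustrative, not the paper's actual hub:

```python
def pareto_front(models):
    # Keep models not dominated by any other: dominated means some other
    # model has at least this accuracy and at most this parameter count,
    # with at least one of the two strictly better.
    front = []
    for name, acc, params in models:
        dominated = any(
            a >= acc and p <= params and (a > acc or p < params)
            for _, a, p in models)
        if not dominated:
            front.append((name, acc, params))
    return front

hub = [("beit_large", 88.6, 305e6), ("convnext_tiny", 82.1, 28e6),
       ("resnet18", 69.8, 12e6), ("vgg16", 71.6, 138e6)]
print(pareto_front(hub))  # vgg16 drops out: less accurate than convnext_tiny and larger
```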

[00:29:06] But I think what we could do now, answering your question, is exactly what you mentioned: have a stage where we do a pre-selection of the models, the most efficient, most accurate ones. In this case we did the selection based on ImageNet: we built the Pareto front using the performance on ImageNet, because that's where the models were pre-trained. But the nice thing would be that this could be done per task. So for every task, we select a subset of efficient models, and then we perform the search over these efficient models.

[00:29:42] And for this intermediate step, I think there is already some scientific work and evidence that you could do this model selection. This is purely model selection; what I'm saying is to first do model selection, and then do the joint model selection and hyperparameter optimization. And this model selection can be done, because there are already some works proposing transferability measures. That is the scientific term: given my new dataset, you want to find how transferable a model is, and then just select the top 500 or top 50 models that are transferable, and then perform the search over these, or over only one, as you proposed. But in the case of only one, it means you are assuming the default hyperparameters, and probably you also want to tune the hyperparameters.
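A hedged sketch of that two-stage idea, using a cheap linear probe on frozen features as a stand-in for a proper transferability measure (published measures such as LEEP or LogME avoid even this small fit, but the goal is the same: rank models without fine-tuning any of them). The `extract_features` helper, one forward pass of a candidate model over the data, is hypothetical:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def proxy_transferability(features, labels):
    # How well a linear classifier separates the new dataset in a
    # candidate model's frozen feature space.
    clf = LogisticRegression(max_iter=200)
    return cross_val_score(clf, features, labels, cv=3).mean()

def shortlist_models(models, extract_features, inputs, labels, k=50):
    # Rank candidates by the proxy and keep only the top k for the
    # joint model/hyperparameter search.
    scored = [(m, proxy_transferability(extract_features(m, inputs), labels))
              for m in models]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [m for m, _ in scored[:k]]
```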

[00:30:37] Now, the thing is, Hugging Face, for example, right now has more than 500,000 models. Anyone can submit a new model, and then we will have so many models that it will be quite infeasible to search for proper models within reasonable time. So maybe there should be smart strategies for pruning the search, or for making a subset of relevant models to select or to search over. Because in a sense, these transferability measures can be computed efficiently, but even if one takes one second, if you have to compute it for 500,000 models, that's a lot, right?

[00:31:23] Yeah, that sounds like a challenging thing to do.

[00:31:27] Just to summarize: I think it's possible, but we need more work on efficient methods to do that.

[00:31:34] It also sounds like this is something that is probably feasible if you do it inside a model hub like Hugging Face, but if you do it from the outside, and you need to keep track of and recompute all of these metrics and evaluations all the time, it can become very hard, very quickly.

[00:31:52] But it's really interesting that you say that this is basically a question of how we do it efficiently at scale, not even from a machine learning or AutoML aspect: it's just logistically hard, right? Because that really says that we should probably put a bit more thought into that. That's really impressive; I really would have thought you were now going to tell me: okay, now we need to refine model selection, we need to refine the meta-training.

[00:32:24] No, I think it's more, how do you say, a problem on the efficiency side.

[00:32:29] Also, I just thought about this: if you want to do that, leaving aside the fact that you probably have to do some forward passes over the whole dataset per model, to get an idea of which model is transferable, for example. Even if you neglect that, you still have to store a lot: you need a lot of memory, disk and also RAM, because you will load a lot of models and do forward passes, at least one forward pass per model, to know whether a model is good or not. So it's also very demanding in memory.

[00:33:14] Yeah. Okay, so here's a challenge for people who are interested in efficiency: if you can solve this, it's likely that you can actually do something like Quick-Tune at scale. I think that's actually really cool.

[00:33:29] I think AutoML is still underused in the machine learning community, even if we think about these, let's call them a bit more basic, approaches that do things like HPO and NAS. But I think Quick-Tune is something that doesn't only do one useful thing, like hyperparameter optimization; this could actually be people's whole workflow for a lot of things. I think that's actually extremely interesting. This is like the version of full pipeline search for you, with that.

[00:34:06] Yes, I agree with that. I would like to see how this performs with in-context learning, selecting the model before doing the in-context learning. There are many possibilities that we could not address in the paper because of time and space, but I think this opens up a lot of questions and interesting research work that can be done. I mean, again, I think efficiency should be the central point. That's why I think the quick adjective is important.

[00:34:45] Yeah, but something else that stuck out to me in your results was actually the benefit that HPO brought. I mean, you just compared what happens if I fine-tune without HPO and with HPO, and AutoML people will not be surprised to read that HPO helps. Yeah, great. But what struck me is that you did this, as you said, for the different sizes of the meta-dataset, and HPO seemed to become more important the more data you had, which would really say to me: okay, HPO becomes much more important if you look at a practical use case, where someone could potentially give you something fairly out of domain. Is that also your conclusion?

[00:35:27] Yeah, so I think you mean that in the extended version, for example, HPO seems to be more helpful. Extended has the largest datasets, so there it makes sense that the cost awareness, the meta-learning, all these recipes are important, because one epoch costs a lot on these datasets. I don't have a precise number, but one epoch might take more than 10 or 20 minutes, and you should be aware of what you select, of what you try, right?

[00:36:00] You probably don't have that problem when you have small datasets, because one epoch can take 30 seconds or one minute, and then you can afford to try many things, and many optimizers find good approaches. And what we could even see is that, as you say, using default parameters is enough there. I think it's because it's enough to get acceptable performance; if you want to push it further, of course you need HPO, but it's enough for acceptable performance. And this is interesting, as you said, because what I think it shows is that the transfer learning is working on these small datasets. It means that pre-training the model is very helpful, even if the dataset, or the model, is small, right?

[00:36:47] And actually, transfer learning was initially researched a lot for small datasets, because you cannot afford to train a model from scratch; you want to take a pre-trained model and then just run a few epochs to find a good model. That's why these results with no HPO on small datasets are good: the transfer learning is helping.

[00:37:12] Actually, one other experiment that supports this is the one where we change the model hub size, where we can see that on the micro version of the datasets, even one model is enough to get very good performance: one model is competitive with a model hub of size 10 or 15. So it's competitive using only one model. It means that, again, the power of transfer learning with only one model is very clear in small-data regimes.

[00:37:47] So basically what you're saying is that the model can transfer so well that it doesn't really matter whether the fine-tuning works at all?

[00:37:56] But fine-tuning works for sure. If the data is small, fine-tuning works really well, let's say. But again, let's not lose the big picture: we still need HPO if we want to push the performance a bit.

[00:38:10] Absolutely. So yeah, I might not have said that clearly: HPO still improves at each dataset size, definitely. It's just a bigger improvement with the bigger datasets, which I found interesting, because it kind of also says: well, okay, a bigger dataset is not necessarily always a more complex task, but it might very well be. And I mean, we get more data from everything these days than we used to even five or ten years ago. So apparently you should really look into HPO, which I don't think is always the way people think about it, right?

[00:38:45] I think it's often more tempting to say: oh, I'm going to use HPO for a small application, because I expect it to be cheaper anyway; and then for the big stuff, I can't afford HPO.

[00:38:55] But looking at how the benefit of HPO progresses, it really seems like you should consider it even more for the harder stuff, because if you do it in a cost-aware manner, you can actually save yourself a lot of time.

[00:39:08] Yeah, that's true. I mean, there are also some factors here, like the meta-learning and the cost-awareness. So in the experiment where we compare HPO against the defaults, we are already using meta-learning and cost-awareness, and it shows that if you are efficient, then on these large datasets HPO can be a very good option, right? So the important point here is probably: if your dataset is large, yes, HPO is a very good option, but try to do it efficiently, via meta-learning, cost-awareness, and so on.

[00:39:47] So I really think the results look really convincing overall. I mean, I read a lot of AutoML papers, but I think even if I didn't, if I would not necessarily know the HPO tools you compare against or something, I would still be fairly convinced to say: okay, I need to fine-tune something complex, I can use this reliably. But are there any weaknesses where you think: oh, okay, this is not yet great, this is still a limitation of Quick-Tune, this is still something where there are going to be some improvements? Except for this logistics and efficiency thing we talked about.

[00:40:25] Yeah, so during the research, we wanted to have a nice setup that solves the clear problem we mentioned at the beginning: how to select the model and the hyperparameters. And we wanted, of course, or we were aiming, to create an approach that is a bit simple, in the sense that it's understandable and avoids a lot of unnecessary complexity. We could create a more sophisticated cost-awareness model or predictor, or a more sophisticated way to select the pipeline, meaning the combination of model and hyperparameters. But we were striving for something simple first, something that works, of course, but keeping it simple just to test that it works and to show the community: look, take a look at this, this is a promising direction that we need to research further.

[00:41:20] And that's what we did, especially with this cost-awareness part, where we basically just trade the performance off against the cost by dividing the performance by the cost.

[00:41:32] But I think this operation of trading off the cost could be improved in other ways, because if the cost is very small, then we will have a huge preference for this super cheap model. And actually there are other works in black-box optimization where they find other ways to trade off the cost. There is something called cost cooling, where basically they raise the cost to an exponent that is proportional to the remaining budget, so that early on, when little of the budget has been spent, you care about cheap configurations, but later, when you have less budget left, you go for configurations that are more expensive but that you think are really good in terms of performance.
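
A minimal sketch of cost cooling under the same illustrative assumptions as above: the cost exponent is tied to the remaining budget fraction, so cheap pipelines win early and high-EI pipelines win late. Function and variable names are hypothetical:

```python
import numpy as np

def cost_cooled_acquisition(ei, cost, spent, total_budget, eps=1e-8):
    """Cost cooling: the cost exponent shrinks as the budget is used up.

    Early (spent ~ 0)      -> alpha ~ 1 -> strong preference for cheap pipelines.
    Late  (spent ~ budget) -> alpha ~ 0 -> cost barely matters; pick the best EI.
    """
    alpha = max(0.0, (total_budget - spent) / total_budget)  # remaining fraction
    return ei / np.power(cost + eps, alpha)

ei = np.array([0.10, 0.08, 0.30])
cost = np.array([5.0, 0.5, 60.0])

early = cost_cooled_acquisition(ei, cost, spent=0, total_budget=100)
late = cost_cooled_acquisition(ei, cost, spent=95, total_budget=100)
print(int(np.argmax(early)), int(np.argmax(late)))  # cheap first, then best EI
```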

[00:42:25] And with this component, I think we can play more: we can find more sophisticated ways, or even run, for example, multi-objective optimization. I think this could bring an improvement in performance if we find better ways to trade this off, because in the end, this trade-off, or this acquisition function that we are proposing, is basically a multi-objective problem. We didn't frame it as a multi-objective problem, but you have two objectives: you want to maximize performance, but also try to keep the cost low. And I think there are plenty of approaches to improve this in different ways.
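
To illustrate the multi-objective framing (a sketch, not something proposed in the paper): instead of collapsing the two objectives into one score, one could keep only the Pareto-optimal pipelines, i.e., those that no other pipeline beats on performance and cost at once:

```python
import numpy as np

def pareto_front(perf, cost):
    """Return indices of pipelines not dominated on (higher perf, lower cost)."""
    idx = []
    for i in range(len(perf)):
        dominated = any(
            perf[j] >= perf[i] and cost[j] <= cost[i]
            and (perf[j] > perf[i] or cost[j] < cost[i])
            for j in range(len(perf))
        )
        if not dominated:
            idx.append(i)
    return idx

# Hypothetical candidates: (predicted accuracy, predicted cost).
perf = np.array([0.80, 0.78, 0.92, 0.70])
cost = np.array([5.0, 0.5, 60.0, 80.0])
print(pareto_front(perf, cost))  # [0, 1, 2]; index 3 is dominated by all others
```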

[00:43:07] Yeah, I can see how the scaling by cost would not be everyone's preference, and that you might want to change that. But yeah, I think this is a really cool idea, and I love to see that it also works this well. As you said, it's still fairly simple. Obviously, having parameterized kernels and neural networks in there is not the simplest thing you can do, but it's also definitely less complex than a lot of the other approaches we've seen for really efficient tuning or for working with big models. If you, for example, think about this µTransfer idea of tuning a smaller version and then transferring the configuration, you have the scaling issue. And the cool thing I see with QuickTune is actually that you can scale it nicely, as long as you are able to do the model selection somehow. Is there anything else you'd like to talk about, something we haven't said yet but that you found really interesting in this project?

[00:44:13] So I think a practitioner might wonder: oh, what is the actual cost of doing the meta-tuning? Meaning, finding the right hyper-hyperparameters, the hyperparameters of the components that we are using. And we didn't try many hyperparameters, because of an assumption, and this is something that could also be a research line. The assumption is that at the meta level, the sensitivity of the results to the meta-level hyperparameters is lower. Like, if you change some hyperparameter, it doesn't affect the end result so much. Of course, there are some sets of meta-hyperparameters that are more appropriate, but the assumption here is that we don't need to tune them a lot, and therefore, if you change the domain, the same setup will work out of the box.

[00:45:15] But this is something that we assume, and that many of the transfer-HPO approaches assume, and it would be interesting to research what happens at the meta level in transfer HPO or AutoML. If you change the optimizer, or if you change one hyperparameter of the optimizer, how does that affect things? It would be interesting to see whether it actually is the case that the hyperparameters at the meta level affect the loss less, right? But yeah, I am not aware of a work that explored that, but I think it could be interesting.

[00:45:54] Yeah, I think that's an interesting question in general. And, I mean, at least my personal theory of why this might be true in a lot of AutoML is that there's been some work, especially on hyperparameter optimization landscapes, and they look really nice. They're really benign, actually, in the sense that what works well is pretty close together, and you can probably tell that you have at least a few regions where performance is good, but those regions are broad, so you can hit them easily. Basically, it's not a hard optimization task in most cases. And then, if you change the domain and it's still not a hard optimization task, those landscapes might look similar. But I'm not sure if this is even true for something like fine-tuning in the first place, right? And I'm also not sure for how many different domains this has even been tried in HPO, or whether it has been replicated in architecture search, for example. So yeah, I agree that's definitely super interesting. An open question, I think.

[00:47:01] But it could be interesting to see what happens here.

[00:47:08] Actually, that's another good question. How many of the important hyperparameters of QuickTune do you think are in the meta-pre-training part, the part someone could do for you, where it might make sense to actually spend compute tuning? And how many are in the part that a user would then need to handle themselves if they want to use a pre-trained cost predictor? Which hyperparameters should they play with?

[00:47:36] That's a good point. What would actually be necessary? Because if we talk about the influence of meta-hyperparameters in QuickTune, there's a part of QuickTune where it's very feasible to say: okay, someone just invests the compute, does it once, we make it work well, and then people don't do it again but just reuse it. Especially if your experience has been that you tend to benefit even if the meta-training hasn't been completely on your domain. And in that case, you could say that if that person, that group, needs to invest a bit of time in HPO, that's more fair than if everyone who uses it had to run HPO.

[00:48:17] Yeah, yeah, that's a good point. So I think in QuickTune the most sensitive hyperparameters, I would say, are the neural networks, meaning the predictors. Actually, if you choose, for example, a very large network and you don't have enough data for meta-learning, it might perform really badly, because it will overfit and so on. A very shallow network, on the other hand, will also be very problematic. So I think selecting these networks properly is very relevant. And I would also say, because I previously worked with these deep-kernel Gaussian processes, that the network size might affect the performance. So if we had, and this is what we expect to achieve, a lot of pre-training data or meta-learning data, then we can afford a very large predictor, meta-learn it on all the data we have, and that will probably improve the performance. So yeah, playing with these networks is very relevant, I would say.
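
For context, a deep-kernel Gaussian process places a standard kernel on features produced by a neural network, so that network's width and depth become exactly the kind of sensitive meta-level hyperparameters discussed here. A minimal numpy sketch with illustrative sizes, not the actual QuickTune predictor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature extractor: a small random-weight MLP standing in for a
# learned network. Its width and depth are the sensitive meta-level choices.
W1 = rng.normal(size=(8, 32))   # input dim 8  -> hidden 32
W2 = rng.normal(size=(32, 16))  # hidden 32    -> feature dim 16

def features(x):
    """Map raw pipeline descriptions to learned features phi(x)."""
    return np.tanh(np.tanh(x @ W1) @ W2)

def deep_rbf_kernel(xa, xb, lengthscale=1.0):
    """RBF kernel evaluated on the network's features, not on raw inputs."""
    fa, fb = features(xa), features(xb)
    sq = ((fa[:, None, :] - fb[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

# GP posterior mean on toy data: 20 observed pipelines, 5 candidates.
x_train, y_train = rng.normal(size=(20, 8)), rng.normal(size=20)
x_test = rng.normal(size=(5, 8))
K = deep_rbf_kernel(x_train, x_train) + 1e-3 * np.eye(20)  # jitter for stability
mean = deep_rbf_kernel(x_test, x_train) @ np.linalg.solve(K, y_train)
print(mean.shape)  # (5,): predicted performance for each candidate pipeline
```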

[00:49:32] And there is a very straightforward lever: as we see across machine learning, the best way to improve performance is almost always adding more data. So I think if we scale the meta-dataset, we might see a bigger lift, probably for the extended version or even for mini and micro. So yeah, that's what I think could be interesting.

[00:49:57] Yeah, I think that's really interesting, and it again speaks to how the method is built, because you talked about something that AutoML experts like yourself can do, and that people can then reuse. So the user doesn't necessarily have to think about these network sizes as much as you do, since you already have an intuition for them. And I think that's always a really good dynamic for something that people are actually supposed to use.

[00:50:23] Yeah, I think so, and this actually brings me to another nice point. So we are working right now, because this work, QuickTune, was a research work, right, and of course we want to deliver something that people can use out of the box, without having to find the right optimizer and the right cost predictor all over again. So we are working on a QuickTune tool that will be available soon as an open-source project, and we hope that people can use it and play around. Initially it will be available for image classification, because that was the direct task from the research paper, but the hope is that we can extend it to other types of domains and modalities, like language and, yeah, segmentation, image segmentation. So that's the idea: to have something out of the box, because, again, the community is like the central customer here, the main target. To build something that people can use out of the box without having to think a lot about how to fine-tune, so that there is finally a clear answer to those questions in forums, like: which model should I use, and with which hyperparameters? QuickTune tells us.

[00:50:50] And I think that's actually a great point to end on. So, everybody, you know what you have to look forward to: you can actually do the thing I just wished for, or something similar, probably not calling it on Huggingface yet, as I said, but QuickTune: give me the model that I want. Awesome, thank you for being here. Where can listeners find you if they want to know more about either QuickTune or your next follow-up projects?

[00:52:15] So if they want to know more about QuickTune, people can find the paper on arXiv. The name is "Quick-Tune: Quickly Learning Which Pretrained Model to Finetune and How". There is also a repo with the research code on GitHub, and, as I said, also the QuickTune tool that people can try; the package is called QuickTune tool, and they can find it there as well. And yeah, you can follow me on LinkedIn or Twitter to learn more about the things that I am working on with my group.

[00:52:53] Awesome, thank you. And thank you again for being here. I think this was really interesting.

[00:52:57] Thank you, Teresa, for the invitation. I really enjoyed it, and I hope we got a nice message across to the community.