Colin White, head of research at Abacus AI, takes us on a tour of Neural Architecture Search: its origins, important paradigms and the future of NAS in the age of LLMs. If you're looking for a broad overview of NAS, this is the podcast for you!
*note: this is auto-generated, so errors are possible
Theresa: [00:00:00] Hi, and welcome back to the AutoML podcast. This episode, we're not looking at one specific research result or one specific AutoML system. We're taking a bird's eye view of neural architecture search. Neural architecture search has been around for a while, is really important, and has been really successful as well.
And there's a really interesting paper that has been released called Insights from a Thousand Papers on Neural Architecture Search. And one of the authors, Colin White, is with me today to talk about that paper. Colin has also done a bunch of work on architectures. If you work in NAS, you probably know his name, you probably know his work, so he's a great person to chat about this with.
I learned a lot from him, because I'm by no means an expert in that space, and I hope you will as well: on where neural architecture search came from, what kind of methods exist, what's promising right now, [00:01:00] and what will happen to neural architecture search in the age of large language models.
I hope you enjoy it.
Theresa: Hi, everyone. I'm here with Colin White. Colin, do you want to introduce yourself?
Colin: Hi. Yeah, it's great to be here. Thank you for having me. I'm Colin. I'm the head of research at Abacus AI. This is a cloud AI platform; we sell end-to-end machine learning solutions to other companies. It's a startup based in San Francisco. And one of the interesting parts about Abacus AI is that there's a research team that does pure research, publishes papers, and gives out open source research to the community.
Theresa: Yeah, I know you do quite a bit of research, as we're here to talk about a lot of NAS research, actually. [00:02:00] But how would you characterize your general research interests, or maybe even what Abacus is doing in general?
Colin: So I'd say I have quite a diverse range of research interests, as can be seen from my publication history. My PhD was actually in machine learning theory, and I graduated six years ago, I think. And yeah, after I graduated, I got more into deep learning and AutoML and neural architecture search.
That's probably where I've spent the most effort or have the most work. But I also have these other interests like AI for science and debiasing machine learning. And more recently, I've also gotten more into large language models, but definitely AutoML and neural architecture search are the biggest threads of my overall [00:03:00] research.
Theresa: Yeah, but it's good to know that people can look up other things besides what we're going to talk about today, which is a really interesting paper, I think. I read the title, and I'm not a big NAS person, so I read NAS papers, but I don't usually go searching them out. But I thought, hey, learning from a thousand NAS papers, that sounds like something really cool.
How did that idea come about?
Colin: Yeah, I think it was Frank Hutter's idea, the senior author. He had a survey paper from a few years before we started this one, but in that time there had been over a thousand NAS papers that came out. So he's like, okay, we really need another survey. And yeah, I was lucky to be involved in it and play a major role.
Yeah.
Theresa: Man, it's quite thorough, I think. It really starts at, what are the [00:04:00] beginnings? So, let's start at the beginning. How and when would you say, NAS started?
Colin: Yeah. So even before that question, maybe I can talk about the progression of machine learning as a whole to put NAS into context.
So machine learning has been around since at least the 1950s, if not longer, but machine learning back in the 1950s looked very different from what it looks like today.
So back then it was quite a labor-intensive process just to train a perceptron once on a small dataset, whereas nowadays any of us with an internet connection can pull up ChatGPT and have access to one of the largest state-of-the-art models. And I think one of the driving forces, or one of the things that's been happening consistently, is more and more automation [00:05:00] in machine learning as a whole.
And so one of the biggest leaps in automation was deep learning itself. So before, before deep learning which came or came about right at the start of the 2010s like for example, for computer visions specifically before deep learning, people were spending a lot of effort manually designing features.
Whereas nowadays, convolutional neural networks sort of do this automatically for us, or, and vision transformers as well. Yeah, I don't know if PhD students these days know, like, ho know what hog means, like, histogram of Orient and gradients, but this was one of the leading techniques before convolutional neural nets.
Theresa: There's actually still a lecture here in Hannover, Computer Vision, where you have to know a lot of these concepts, and I think some students are quite dissatisfied with it for the reason that they say, hey, why can't we just [00:06:00] use a CNN to do this?
Colin: Yeah, I mean, it makes sense to still know these concepts, but yeah, now CNNs perform very well and are much less manual work. They more or less automate the process of feature design. And so that was one big leap in automation. And so now we might ask, what is the next big step in automation?
And to answer that, we can look at all the different complex architectures people are designing these days, or that have been released these days. And so I think one of the next big steps in automation is automatically designing the architectures themselves. And so this is what I'd say is ongoing.
And now to actually answer your question, I think the modern age of [00:07:00] NAS started at the end of 2016 or beginning of 2017. And one of the prominent reasons it started was this paper by Barret Zoph and Quoc Le. I guess there have been papers all the way since at least the late 1980s, but this paper in 2017 was like the start of the modern era of neural architecture search.
Theresa: So we're actually talking about a quite small time frame, if you think about it. Between the last survey and your, well, fairly new paper, we had over a thousand NAS papers, and the whole thing only started less than a decade ago, at least in this modern wave as you describe it.
How would you distinguish that? If you say the current modern wave of neural architecture approaches, what's different about them?
Colin: You mean compared to the [00:08:00] approaches before that, starting in the 1980s? Well, definitely, I mean, we have GPUs now and we have deep learning. So I guess there's been work in deep learning before the 2010s, and there have been people claiming they have some result and so on, although I feel like for deep learning, now that we have GPUs, we can actually run these at scale and show that they work.
Whereas before, I guess, deep learning wasn't the prominent technique; there were other techniques before deep learning. So yeah, I guess the modern era of deep learning came about, and then we have neural architecture search for these new types of architectures.
Theresa: Okay, and what does that look [00:09:00] like? So, I mean, we're gonna talk about this in detail, but which research directions or research communities would you say NAS primarily draws from?
Colin: Definitely there's been a big link between NAS and computer vision historically, and now, with large language models and NLP being more popular, there's a lot of work that straddles NAS and large language models. So I guess you could say that NAS sort of stays close to whatever the hottest topics are, since that's where the most research is coming from, the most architectures are being designed, and there's the most need for automating the design of these architectures.
I guess one of the things I point to when people ask about neural architecture search and whether it's [00:10:00] useful is that there's this website, Papers with Code, that keeps track of the state-of-the-art models, or state-of-the-art results, for popular datasets.
And maybe one of the most popular datasets is ImageNet. I would say, especially between 2010 and 2020, it's maybe the most competitive dataset out there. And starting from 2017 until 2022, I count seven times that the state-of-the-art model was an automatically searched architecture, not a human-designed architecture, which is the majority, both in terms of the number of state-of-the-art results and also time spent as state of the art.
So yeah, definitely neural architecture search has helped the community [00:11:00] achieve these strong results.
Theresa: Yeah, I think in computer vision especially, it's quite impressive how small automatically searched architectures can often get versus what a lot of people hand-build. I think that's also a really relevant point, especially if we think about computer vision applications that we want to run on really small devices.
I mean, phones nowadays are quite powerful already, but edge computing is also something that I often hear in combination with NAS, for the reason that people say, hey, NAS is a really good way to transfer our cool applications to edge devices.
Colin: Absolutely. I think that would probably be my number one, or very high up, application of NAS, or way that NAS is being used today: designing efficient architectures that both have high performance and can fit on edge devices, or whatever device we [00:12:00] want them to fit on. I'd say maybe two of the most popular lines of NAS-designed architectures are the EfficientNet series and also the MobileNet series, and both of these are searching not only for the best accuracy, but for the best accuracy and efficiency trade-off.
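To make the accuracy-and-efficiency trade-off concrete, one common way to frame it (a minimal sketch, not the specific formulation used by EfficientNet or MobileNet) is to keep only the Pareto-optimal architectures over the two objectives, for example accuracy versus parameter count:

```python
# Sketch: keep the Pareto front over (accuracy, parameter count) so NAS can
# return a menu of accuracy/efficiency trade-offs rather than one architecture.
def pareto_front(candidates):
    """candidates: dicts with 'accuracy' (higher is better) and 'params' (lower is better)."""
    front = []
    for c in candidates:
        dominated = any(
            o["accuracy"] >= c["accuracy"] and o["params"] <= c["params"]
            and (o["accuracy"] > c["accuracy"] or o["params"] < c["params"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

archs = [
    {"name": "A", "accuracy": 0.76, "params": 5.3e6},
    {"name": "B", "accuracy": 0.74, "params": 2.1e6},
    {"name": "C", "accuracy": 0.73, "params": 4.0e6},  # dominated: B is both more accurate and smaller
]
print([a["name"] for a in pareto_front(archs)])  # -> ['A', 'B']
```

A search method can then return this whole front, so a practitioner picks the point that fits their device.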
Theresa: So if we're looking for an architecture, what are we actually looking for? Because I think for people who maybe have heard of the concept, but never looked into neural architecture search in detail, it might not be quite obvious what a search space for neural architecture search actually looks like.
Can you give us an example?
Colin: Yeah, definitely. So maybe even before that: historically, there are like three pillars of neural architecture search, although I'm sure I'll [00:13:00] talk later about how it's not always that cut and dried. There's the search space, which is the set of all architectures we're searching over.
The search strategy, which is what optimization algorithm we're running to find the best architecture in that search space. And then there's the performance estimation strategy, which is how we're evaluating the architectures as we're running this search algorithm. And so yeah, I would say the search space is maybe the most important pillar of these three.
For one, I think it's the pillar that makes NAS most distinct from other areas like hyperparameter optimization. Some people have even called NAS a subset of hyperparameter optimization, but really the techniques can be quite different, because hyperparameter optimization [00:14:00] often has a search space that's just a simple product space of different hyperparameters, whereas in neural architecture search, the search spaces are often graphs, like discrete graphs or directed acyclic graphs.
So yeah, now I'm sort of leading into answering your question, which is: typically when we start to do NAS, we want to design a search space, and it often looks like some type of big set of directed acyclic graphs. There are other types of techniques to make this more efficient.
One of the most popular techniques is called the cell-based search space, where we search over a relatively small set of directed acyclic graphs and then duplicate the cell many times. And when I talk about these [00:15:00] graphs, we can think of every node as some operation, like convolution or pooling or batch norm or something like that.
And then the edges are the connections between the operations, like the direction that the gradients flow in the architecture. This is just one simple idea; there are many other proposals people have made. Some papers put the operations on the edges, and the nodes just show the connections between them.
So yeah, there are a lot of different ways to set up the search space, but overall we're looking to search over all these structures of operations and how they're connected, in some graph structure.
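As a concrete illustration of the graph picture (a toy sketch with made-up operation names, not a search space from any particular paper), one cell can be written down as a small DAG whose nodes carry operations and whose edges say which earlier nodes feed each node:

```python
# Minimal sketch: one cell as a labelled DAG.
# Nodes carry operations; edges list which earlier nodes feed each node.
from dataclasses import dataclass, field

OPS = ["conv3x3", "conv1x1", "max_pool", "skip_connect"]  # example operation set

@dataclass
class Cell:
    ops: list                                   # ops[i] is the operation at node i
    edges: dict = field(default_factory=dict)   # edges[i] = predecessor node indices of node i

# A tiny 4-node cell: input -> conv3x3 -> max_pool -> output,
# with a skip connection from the input straight to the output.
cell = Cell(
    ops=["input", "conv3x3", "max_pool", "output"],
    edges={1: [0], 2: [1], 3: [2, 0]},
)

# In a cell-based search space, the full architecture is this small cell
# stacked many times, so the search only has to decide the graph above.
```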
Theresa: Which makes [00:16:00] sense, because oftentimes, if you imagine a neural network in your head, it will usually look something like a graph. So having the search space, or the result of your architecture encoding, be a graph makes a lot of sense to me at least. But yeah, I mean, you said the cell-based approach was something to simplify that.
Is that just a technique to shrink down the search space, so that we only have to find a smaller graph instead of a, I don't know, 15-billion-parameter graph? Or is there another reason why we want to use a simple search space?
Colin: Yeah. So I guess designing the search space is a very important part, as I already said, and I think at the heart of designing the search space is this trade-off: how much bias, or how much structure, do we want to add to the search [00:17:00] space? Ideally we would want to have very little structure, or not give much structure at all, but then it becomes extremely computationally intensive to search and find anything.
And then on the other hand, if we give too much structure, maybe, if we want to find an architecture and we know ResNet is already a good baseline, we make a search space that's very similar to ResNet, and then all we find are ResNet-like architectures. So we don't really find anything truly novel or anything significantly better than ResNet.
So yeah, it's this big trade-off: how much human bias do we want to add? And I guess it depends on our compute budget and how novel the dataset is, [00:18:00] and so on. Like, if we're searching over ImageNet, it still is possible, but it might be very hard to find a brand new technique that gets like a 5 percent improvement.
But people have had quite a lot of success eking out another percent improvement using a search space that has a good amount of structure, but still room for more optimization. And so yeah, in the NAS community there have been works on many parts of the spectrum between less structure and more structure.
And so I think the cell-based search space is a really nice place in the middle, where the search is still pretty tractable, because often the cells are of size eight or ten nodes or something like that. So it's still pretty tricky to search over a graph of [00:19:00] size ten, but doable, and then we duplicate the cell many times, which still gives us the power of a really deep architecture, even though we're doing a search over a relatively smaller space.
Yeah, and I think some of the most popular search spaces are size 10 to the 20, or I think the DARTS search space is size 10 to the 22, if I remember correctly. And then other search spaces are much larger; I guess some are also much smaller. Although maybe the size alone is not the most important number, because, for example, lots of DARTS architectures from the search space are pretty good architectures. They're not that terrible.
Theresa: But I think it's still interesting to hear that for scale, right? Because if you're just [00:20:00] imagining in your mind's eye a graph of size 8 to 10, it might at first glance sound easy and really small. But if we then put it in terms of search space size, obviously it becomes significant. So this is not an easy optimization problem.
Colin: Absolutely. Yeah. It depends on how many edges there are and how many operations. For example, I think DARTS has seven or eight operations, depending on whether we count the null operation as one. And then there are constraints; I think there can be at most two edges per node. And so if we go up to size 10 or so, it's actually quite large; I think that alone is 10 to the nine or 10 to the 11, somewhere around there.
And then there are two of them that we search over, the convolutional cell and the reduction cell. So then that gets to around 10 to the 20. [00:21:00]
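As a rough back-of-the-envelope version of that counting (my own arithmetic under common assumptions about the DARTS setup: four intermediate nodes per cell, two input edges per node, about eight candidate operations per edge, and two cells searched jointly; exact conventions vary):

```python
# Back-of-the-envelope count of a DARTS-like cell search space.
from math import comb

num_ops = 8          # candidate operations per edge (assumed)
num_nodes = 4        # intermediate nodes in one cell (assumed)

cell_count = 1
for i in range(num_nodes):
    predecessors = i + 2                  # two cell inputs plus earlier intermediate nodes
    cell_count *= comb(predecessors, 2)   # choose which 2 incoming edges to keep
    cell_count *= num_ops ** 2            # choose an operation for each kept edge

total = cell_count ** 2                   # normal cell x reduction cell
print(f"one cell: ~{cell_count:.1e}, both cells: ~{total:.1e}")
# -> one cell: ~3.0e+09, both cells: ~9.1e+18,
#    the same rough order of magnitude as the 10^20 figure mentioned above.
```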
Theresa: Yeah, that's definitely a lot of possible architectures you have in there then. So basically, you have to find a search space based on what you want to do, if I understand you correctly. You always have to see, okay, what's my compute budget? Do I have a good existing bias for this problem?
And there's probably going to be a search space for that. Or are there gaps you see right now where you say, ah, we haven't really figured out how to build a good search space for this problem?
Colin: Yeah, I think in general the community has been moving to search spaces that are the best of both worlds, where over time the community has come up with clever ways of designing search spaces that still allow for finding more novel architectures while not being a hopelessly high computational [00:22:00] task.
For example, hierarchical search spaces are more popular now, and these, I think, definitely fit that bill: we can search them, but the diversity in the architectures is so much higher than even cell-based search spaces. So at a high level, hierarchical search spaces are like: we have some top-level hyperparameters, maybe the number of cells or something.
Actually, to go beyond cell-based search spaces by just one level, we can have a top-level hyperparameter that's just the number of cells, but then people have gone beyond this too and made hierarchical search spaces of four or five levels, and so on, where there are more and more top-level hyperparameters that control the overall high-level structure [00:23:00] of the neural net.
And then the low-level hyperparameters control all the small details at each individual operation and connection. And actually there's this really interesting paper that just came out in the last couple of months, I believe. It's called einspace. I think it's by some people at the University of Edinburgh, and they design a search space in terms of a context-free grammar.
And so this seems like an even more powerful way of designing a search space with many diverse, interesting types of neural nets inside it. So yeah, I think as a community, there have been more and more interesting search spaces that [00:24:00] have come up that really increase the diversity while still being tractable.
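To illustrate the grammar idea (a toy grammar made up for this example, not the actual einspace grammar), a search space can be written as production rules and an architecture sampled by expanding them until only operations remain:

```python
# Toy illustration of a grammar-defined search space (not the einspace grammar).
# Each non-terminal expands into one of several right-hand sides; sampling until
# only terminals remain yields one architecture description.
import random

GRAMMAR = {
    "NET":   [["BLOCK"], ["BLOCK", "NET"]],                 # one or more blocks
    "BLOCK": [["conv3x3", "BLOCK"], ["conv3x3"],            # a plain chain of convs ...
              ["residual(", "NET", ")"]],                   # ... or a nested residual group
}

def sample(symbol="NET", depth=0, max_depth=6):
    if symbol not in GRAMMAR:                # terminal: an operation or a bracket
        return [symbol]
    # past max_depth, always take the shortest rule so sampling terminates
    rules = GRAMMAR[symbol] if depth < max_depth else [min(GRAMMAR[symbol], key=len)]
    out = []
    for s in random.choice(rules):
        out.extend(sample(s, depth + 1, max_depth))
    return out

random.seed(0)
print(" ".join(sample()))   # one sampled architecture, e.g. convs with nested residual groups
```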
Theresa: Yeah, that's really interesting. So a context-free grammar definitely seems like a very nice approach to this as well. But you also write about a concept called search space encodings. To me it sounds like hierarchical search spaces in themselves are actually already a pretty great idea, however you set them up, right?
Why would I want to encode them? And is this even something that you would say is a mature idea yet, or is this maybe something that's pretty much still in the beginning phases?
Colin: So yeah, I think encodings are a little bit different from the search spaces themselves. That actually gets us into the next pillar of NAS. So we have the search space, [00:25:00] and then we ask, okay, now what optimization algorithm do we use? And depending on the optimization algorithm, we might need to encode each architecture in the search space somehow, especially if we do something like a performance prediction method that's surrogate based.
So say we want to predict the performance: say we've already evaluated 50 architectures and we know their accuracy, and then we want to predict the performance of 50 more. Then we have to encode these architectures somehow and run a quick model to predict the other 50. So this is where encodings get very important.
And yeah, actually, I've done some work on encodings before writing the survey paper, and we found that even changing the encoding a little bit can have a big change in performance [00:26:00] in some of these NAS subroutines, like using a surrogate. So yeah, encodings overall are very tied to what algorithm we're using.
So they're tied to both the search space and what algorithm we're using. There are other types of methods, like one-shot approaches, where I guess the encoding is more built into the algorithm, so it's not something we think about as much. So that's an example of a whole area where the encoding is less important, or maybe not less important, but there's just less freedom in our design choices. Whereas for running a surrogate to predict the performance, we need to think a lot about how we encode these architectures.
And so, as I said before, a lot of [00:27:00] search spaces are DAG based, like they're graphs, and so there are some standard ways to encode graphs, like the adjacency list or a matrix of the graph. But sometimes these can be hard to learn, because, say for example, we have one architecture that's the start of the architecture, then an identity operation, then a convolution, then the end.
This would be identical to: the start of the architecture, a convolution, an identity, and the end. But depending on our encoding, these two might look very different, and it takes work for the algorithm to learn that these are actually the same architecture that should have the same accuracy. So yeah, there are other types of encodings, like the path encoding, where we look at the [00:28:00] paths, the types of operations the tensors interact with from the start to the end of the architecture, and each path is a feature itself. So this is actually an encoding that I have studied myself. Although the path encoding worked really well for the types of search spaces that were most popular at the time, with search spaces progressing much further, it's not really scalable and isn't as good an approach these days. But yeah, these days, with better surrogates and better methods for predicting performance, we can get away with the more common encodings, just regular ways of encoding the graph structure. [00:29:00]
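As a simplified sketch of the two families of encodings described here, assuming a tiny three-node cell (the operation list and helper names are made up for illustration):

```python
# Sketch of two ways to encode a small DAG cell as a vector.
# Node 0 is the input, the last node is the output, intermediate nodes carry ops.
import numpy as np

OPS = ["identity", "conv3x3", "max_pool"]

def adjacency_encoding(ops, edges, num_nodes):
    """Flattened adjacency matrix concatenated with one-hot op labels."""
    adj = np.zeros((num_nodes, num_nodes))
    for dst, srcs in edges.items():
        for src in srcs:
            adj[src, dst] = 1.0
    onehot = np.zeros((num_nodes, len(OPS)))
    for node, op in ops.items():
        onehot[node, OPS.index(op)] = 1.0
    return np.concatenate([adj.ravel(), onehot.ravel()])

def path_encoding(ops, edges, num_nodes):
    """One binary feature per possible sequence of ops on an input->output path."""
    def walk(node):                               # op sequences from `node` to the output
        if node == num_nodes - 1:
            return [()]
        tails = []
        for dst, srcs in edges.items():
            if node in srcs:
                prefix = (ops[dst],) if dst in ops else ()
                tails += [prefix + t for t in walk(dst)]
        return tails
    present = set(walk(0))
    # toy feature universe: all paths of length 0, 1, or 2
    universe = [()] + [(a,) for a in OPS] + [(a, b) for a in OPS for b in OPS]
    return np.array([float(p in present) for p in universe])

# Example: input -> conv3x3 -> output, plus a direct skip from input to output.
ops = {1: "conv3x3"}              # node 1 is the only intermediate node
edges = {1: [0], 2: [1, 0]}       # node 2 (output) reads from node 1 and node 0
print(adjacency_encoding(ops, edges, 3))
print(path_encoding(ops, edges, 3))
```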
Theresa: Yeah, because I was thinking, especially if you have a hierarchical search space that has quite a few hierarchical hyperparameters on the top level, representing that as a graph could become quite a bit more difficult, if I conceptualize that correctly. But I mean, if surrogates can actually kind of make up for the difference here, that's pretty good.
So would you say then that with better methods on the performance prediction level, actually encodings might not be the most important topic or will they come back in full force?
Colin: I think it really depends on the type of search strategy. And I think people have been designing more interesting search strategies where encodings are less important, or maybe it's obvious which encoding to use. So yeah, I mean, they'll always [00:30:00] be an important part of the whole process of NAS, but maybe there's less need to write a paper fully on just the encodings, as there maybe was more need for that a few years ago. But yeah, especially with one-shot methods, there's less need for this.
Theresa: And I mean, that's great, right? One less thing to focus on. Yeah, but since you mentioned search strategies and especially one-shot methods a few times now, what kind of search strategies are employed generally?
Colin: Yeah, so another quick brief history of neural architecture search. There have been a couple of major developments, I'd say, since 2017. So the first paper, by Zoph and Le, had this basically black-box optimization algorithm, [00:31:00] a reinforcement learning algorithm, and it was really slow, or really computationally intensive.
I think actually maybe that even helped the field, because then everybody wanted to write a paper on improving it, speeding it up and making it work.
Theresa: Reinforcement learning. Yeah, it's not, it's not known to be the fastest thing in the world.
Colin: Yeah. I mean, it worked really well for some of Google's papers and really got the field going. But yeah, now there have been many other techniques that have been designed, and a lot of new developments. So other black-box optimization techniques are Bayesian optimization, which I think is a very powerful black-box optimization technique.
And there have been a lot of evolutionary search methods. I think evolutionary search can perform really [00:32:00] well, depending on what we're doing. So yeah, these are probably three of the main black-box optimization techniques from the earlier days in NAS: reinforcement learning, Bayesian optimization, and evolutionary search.
And then, yeah, one of the main developments in the field was the introduction of these one-shot methods. So when I say black-box optimization, I mean we have this big search space: we try one architecture, we evaluate it, so now we know its accuracy; then we try another architecture, we evaluate it, and we know its accuracy. It can become a lot more complicated than this, and there are a lot of interesting algorithms for which architecture to pick next, but at the end of the day, that's what I'm talking about: a very sequential process. [00:33:00] And so the one-shot idea is, rather than doing this process, what if we just train one single architecture, and try to do NAS in just one shot of training? And the way to do this, this was the technique introduced by the DARTS paper, differentiable architecture search.
So as the name implies, it's not a black-box optimization method; it's a gradient-based optimization technique. The tricky part is that NAS search spaces are discrete, they're not continuous, but the paper found a clever way of turning NAS into a continuous problem.
And the way they do this is: say we're running NAS and we [00:34:00] have a search space, and there's one slot where we can have some operation, like convolution or pooling or a skip connection or something. So we give each of these choices a weight, alpha, and the weight can be from zero to one.
And then there we go, we just made this continuous, because now these alphas are continuous.
Theresa: So, so basically each connection now has as many weights as we have operations and we then, you know, can take the one with the highest weight in the end. But we try to learn a probability or weight for all of them at the same time.
Colin: Exactly, yeah. And so throughout the search, we try to make one of these alphas get close to one and the rest close to zero, hopefully. And then at the end, we just pick the operation that is the highest. And [00:35:00] crucially, this allowed the authors of the DARTS paper to run a gradient-based optimization method.
So just like in normal machine learning, when we optimize the parameters, we use gradient-based optimization: we run backpropagation and then optimize the parameters with gradient descent. And so they do the same thing now for the architectures themselves, because these alphas are continuous and can be optimized using gradient-based techniques.
So what they did is, in every step of gradient descent, they [00:36:00] alternated updating the parameters and the architecture hyperparameters. So this is this big bi-level optimization procedure, and they showed that this works and converges to a strong architecture.
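A rough sketch of that continuous relaxation for a single edge, in PyTorch-style code (simplified; the real DARTS implementation handles many edges, weight sharing across a whole supernet, and a more careful approximation of the bi-level problem):

```python
# Simplified sketch of a DARTS-style continuous relaxation for ONE edge.
# Each candidate op gets an architecture weight alpha; the edge outputs a
# softmax-weighted mixture, so the architecture choice becomes differentiable.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),   # "conv3x3"
            nn.MaxPool2d(3, stride=1, padding=1),          # "max_pool"
            nn.Identity(),                                 # "skip_connect"
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture weights

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

edge = MixedOp(channels=8)
x = torch.randn(2, 8, 16, 16)
y = edge(x)   # differentiable w.r.t. both the op weights and alpha

# Bi-level idea, very roughly: alternate updates of the ordinary weights on
# training data and of alpha on validation data, e.g. with two optimizers.
w_opt = torch.optim.SGD([p for n, p in edge.named_parameters() if n != "alpha"], lr=0.01)
a_opt = torch.optim.Adam([edge.alpha], lr=0.001)
# ... step w_opt on a training batch, then a_opt on a validation batch, repeat.
# At the end, the op with the largest alpha on each edge is kept as the
# discrete architecture.
```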
Theresa: So that sounds really good, especially since it sounds, at first glance at least, a lot more computationally affordable than if I use, I don't know, 50 evaluations in Bayesian optimization. Is that actually the case though? I mean, one shot sounds like, you know, I really do it once, but now you already say, okay, we switch between updating the hyper parameters and parameters, which to me says we already need a few more epochs to get there.
Colin: So it's definitely fast. The bigger question is how good the accuracy is at the end. Black-box optimization methods are a lot more like anytime algorithms, where we can just run them as long as we want and then [00:37:00] stop when we're happy with the architecture, or if it's gone on too long.
But these one-shot methods are a bit more like: we train something once and it's going to converge, or something will happen, maybe the gradients will explode, so it'll stop eventually, and we don't have much control there. We can't really run it forever.
So we run it once, and usually it's not that much longer than just training a single architecture. And so now the big question is: did it work? Is the accuracy strong? The initial results from the very first paper were pretty decent, but then there was a huge body of work on improving it further.
Especially getting it to work for other search spaces and other datasets. And even for the original [00:38:00] paper, there were many improvements that people found could be made, because it is a very tricky optimization problem now, whereas the black-box optimization methods are much simpler and we have a lot more control over everything. The one-shot methods are much faster, but much trickier to get the optimization correct.
Theresa: Yeah, it reminded me of
Colin: Oh, sorry.
Theresa: Metagradient optimization for hyperparameters as well, right? Because I think in practice, most people use black-box optimizers for that instead of metagradient methods, for that exact reason: they seem like the safe option that works a lot more reliably than metagradients for hyperparameters.
It sounds like it's similar for architectures.
Colin: Yeah, definitely. I think, for example, if somebody who isn't in NAS has a new [00:39:00] problem, a new dataset, and they want to run NAS, it's actually a bit of work to set up a one-shot method. It's not like you just download a package and run it; it takes a bit of design work to run a one-shot method on a new dataset or a new search space.
But yeah, after that design, the benefits are clear: it's much faster and can converge to something that's a really strong architecture.
Theresa: So, are those methods the focus of the field right now, would you say, or is there parallel development or even some combinations between black box and one shot, though that sounds a bit complicated to me?
Colin: Yeah, definitely. Since like 2019, 2020, the one-shot approaches have been one of the [00:40:00] main areas of research in the field, for sure. And there have been a lot of really interesting new techniques that have come out in this space. And even as the bulk of the work moved from computer vision to NLP, well, there's still a lot of computer vision, but as NLP came to feature more prominently in NAS works, I've noticed that a lot of these are one-shot approaches, just because that's one of the hottest areas now in NAS.
And so, yeah, I thought it was interesting that as more NLP works come up, a lot of the methods are one-shot based, reflecting the current hot areas in the field.
Theresa: Yeah, I guess it makes sense. I mean, if your ultimate goal would be to [00:41:00] automatically find an architecture for a fairly big language or a fairly capable language model, even if that could be a lot smaller than the ones we've seen, one shot does seem the way to go computationally.
Colin: Well, yeah, that's a whole other very interesting question, which is that nowadays the machine learning paradigm is sort of shifting. Even for computer vision, like five years ago, the architectures were really large, but still we could run NAS algorithms; especially big tech companies could run NAS algorithms, even ones that take a thousand iterations or more. Whereas nowadays, the biggest large language models can be a hundred billion parameters or maybe even more. And so it really is impossible to run most [00:42:00] NAS algorithms on these. Black-box optimization, absolutely not; we're not going to run that for GPT-4, GPT-5. And I assume even one-shot methods, oftentimes they have a huge memory increase, even though there's not a runtime increase. Actually, there are some methods that get around this memory increase. So, just thinking right now, maybe it's possible to run one-shot, but still, it seems much too complicated, so my initial reaction is we probably don't want to do that either.
But so then the question is, well, will NAS ever be applied to large language models? Actually, I think there's some new paradigm, which unfortunately is largely [00:43:00] proprietary, because OpenAI, or the top companies, don't want to share their secret sauce. But now I think there's more, at least I imagine this is what it is:
More human in the loop, and more transferring from a smaller 7B model to a 100B model. So it's more important to run NAS on smaller models and then show that they transfer, and also to have a human in the loop making sure that what is supposed to happen on the 100 billion model is actually working, and having intuition to stop it before it gets trained for a month, and so on.
So, more human in the loop, and also more transferring up to larger architectures.
Theresa: Do you have a gut feeling which parameter size, roughly, would probably still be doable? Like, you just said 7 billion; it sounded like 7 billion was something that [00:44:00] you think at least a big tech company could do architecture search on.
Colin: Oh, yeah, I think they can probably do architecture search on 7 billion, maybe even more. I mean, at my company it takes maybe a day to do some fine-tuning, but big tech companies have a ton of resources, and they can probably do a big search. Yeah, and 7 billion parameter models are already quite performant, so it might actually work to run a big search and then scale it up.
I guess the hard part would be making sure that whatever we find at 7 billion fully transfers to the hundred billion and is one of the best architectures there as well. I guess also the hundred billion [00:45:00] would look very different from 7 billion in its capabilities. And of course, people say there could be emergent abilities, and so then it seems more complex how the architecture here would transfer to a much bigger model.
But yeah, I guess hopefully there will be more public work answering these questions.
Theresa: Yeah, I imagine that's an area that's really hard to investigate without really big tech compute. Are you aware of any work that tries to investigate this? I know there's something for this on hyperparameters, although this seems to be a bit more complicated than transferring hyperparameters from smaller to larger models.
Do you know if someone has looked at this for neural architecture search?
Colin: Yeah, I believe there's work on this. I'm not extremely familiar with some of the recent work on this [00:46:00] question. I do think that even before the ChatGPT moment, there was a lot of work on performance prediction generally, for neural architecture search and hyperparameter optimization.
And so it's nice to see that this same thread of questions persists even nowadays. Yeah, there are lots of different performance prediction methods, all with various trade-offs in runtime and accuracy, and whether they work for neural architecture search or hyperparameter optimization.
And so, yeah, I think there must be some more recent work nowadays, although I'm a bit less familiar with this.
Theresa: Yeah, I mean, you've been mentioning [00:47:00] performance prediction quite a bit. Can you maybe give us a brief overview of what methods are used? I mean, conceptually, you already said we have some sort of encoding of the architecture we want to know the performance of, we put it into some performance predictor, and we get out the accuracy, or whatever we're looking for, that we think this architecture would have, instead of actually running it, right?
Colin: Yeah, exactly. So it's any method that we want to use to predict the accuracy of an architecture without fully training it. And there are lots of different ways to do this. For example, one of the simplest performance prediction methods is: we train an architecture halfway, for half as many epochs as it takes to converge.
And then we look at the accuracy there. That can give us a pretty decent prediction of what the accuracy would be if we fully trained it, especially if [00:48:00] we want to compare ten architectures: train all of them halfway and then predict which one will be the best. The simplest thing would be to just pick the architecture with the highest accuracy halfway, but we could also look at the curve of performance and see, oh, maybe this one had a slower start, but it looks like it'll have a higher ceiling.
So that's one type of performance prediction. There are also surrogate-based methods. If we take 50 architectures and fully train them, and then encode the architectures into features, we can train a model to predict the accuracy of other architectures.
And that's where, as I mentioned, encodings are a bit more important. And yeah, there are also other methods too. There's [00:49:00] one really interesting method called zero-cost proxies, and it's pretty much exactly what it sounds like: a type of method that's so fast it basically takes zero time.
The idea is to look at statistics of the architecture from just running a single mini-batch of data through it, or even just looking at the parameters and the connections themselves without running any data through it, and then trying to guess the performance of the architecture.
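As an illustration of the single-mini-batch flavour, here is a minimal sketch of a gradient-norm-style score (one simple member of the zero-cost proxy family; proxies such as synflow or jacob_cov are more involved):

```python
# Minimal sketch of a gradient-norm style zero-cost proxy: score an untrained
# network from a single mini-batch, with no training at all.
import torch
import torch.nn as nn

def grad_norm_score(model, batch, targets, loss_fn=nn.CrossEntropyLoss()):
    model.zero_grad()
    loss = loss_fn(model(batch), targets)
    loss.backward()
    # sum of gradient norms over all parameters, used as a cheap fitness signal
    return sum(p.grad.norm().item() for p in model.parameters() if p.grad is not None)

# Toy usage: score two candidate architectures on one random mini-batch.
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
cand_a = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                       nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
cand_b = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                       nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
print(grad_norm_score(cand_a, x, y), grad_norm_score(cand_b, x, y))
```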
Theresa: That sounds almost incredible. How well does that work in practice?
Colin: Yeah, so actually I've done a bit of work analyzing these types of techniques. And I think the takeaway, for me at least, is that they're really [00:50:00] interesting, and they do work better than I would expect, but they're probably not meant as a standalone method, but rather to enhance the methods we have already.
Like, if I wanted to predict the performance of some architectures, I wouldn't use only this; I would use some other method and include zero-cost proxies. For example, for surrogate-based methods, these are perfect as additional features we can use, very strong features, probably the best features the model will have.
And so it can really boost the performance of surrogate-based methods.
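Putting the last two ideas together, a rough sketch of a surrogate that predicts accuracy from an architecture encoding plus a zero-cost proxy score as one extra feature (toy random data stands in for real evaluations; any off-the-shelf regressor would do):

```python
# Sketch: a surrogate model over architecture encodings, with a zero-cost
# proxy score appended as an extra feature (toy random data for illustration).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
enc = rng.integers(0, 2, size=(50, 18)).astype(float)   # 50 encoded, already-evaluated architectures
zc  = rng.random((50, 1))                                # their zero-cost proxy scores
acc = rng.random(50)                                     # measured accuracies (toy values)

X = np.hstack([enc, zc])
surrogate = GradientBoostingRegressor().fit(X, acc)

# Predict accuracy for new, not-yet-trained architectures.
new_enc = rng.integers(0, 2, size=(5, 18)).astype(float)
new_zc  = rng.random((5, 1))
print(surrogate.predict(np.hstack([new_enc, new_zc])))
```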
Theresa: Do you have any idea what ballpark amount of data one usually needs to train a surrogate model for a dataset? Because [00:51:00] I know surrogates in neural architecture search work. I mean, there's NAS-Bench-301, which is, you know, purely surrogate based, and they actually showed that their surrogates work quite well.
So in principle, we can learn to predict these performances. Does that mean we need to spend a lot of evaluations up front? So if I want to go to a fairly new domain, is that something that's realistic to do?
Colin: Yeah. So I think it really depends on the search space and also what we're trying to do. So NAS-Bench-301, I guess that was a very different problem than NAS itself, where they wanted to have full coverage over an entire search space. Whereas in NAS, really, I guess we zone in on the areas that are most promising to have the highest accuracy, and we want to be much more efficient.
Whereas for NAS-Bench-301, I guess it was less about efficiency and more about getting [00:52:00] full coverage. So yeah, often these surrogate-based methods are run during the course of a NAS algorithm. And so maybe they don't work perfectly, but the goal is not to get the highest accuracy of the surrogate; it's to return a high-performing architecture at the end.
So oftentimes the surrogates are run in successive rounds, and the surrogate is not very good at the start. But then as we partially train more architectures, or have more evaluations, the surrogate gets better and better, and it's pretty good near the end.
Theresa: So basically with all of these performance prediction methods, it's not as important that we get an actual accurate performance prediction, but more important that we can discriminate the quality of the proposed architecture and [00:53:00] can just say, okay, this is probably better than something else we've seen.
This is probably worse, and it's not really important if the number is completely correct.
Colin: Yeah, I mean, less important than for NAS-Bench-301. But we do want the performance prediction method to be pretty accurate. I guess we care most about the NAS algorithm as a whole returning a very strong architecture at the end, and often that does entail having a pretty accurate performance prediction method, especially for the top end of the search space; it should be most accurate for the best architectures.
And there are lots of techniques and tricks to ensure that this is the case for the more recent methods.
Theresa: But I mean, that does sound like if I want to apply an algorithm [00:54:00] to a new domain, surrogate based methods are actually fine. It's not, you know, I think performance prediction is something, if I think of that in an AutoML context, I mean, always sounds like, Oh, I need to get a lot of data up front. What you're basically saying is that's not necessarily the case.
And we can just do the surrogate, learn the surrogate while we actually run our algorithm anyway.
Colin: Yeah, I see what you're saying. So yeah, I guess that is the case. The surrogate will not be as accurate as something where people go out to design a very strong performance prediction method using tons of data; I guess NAS methods try to be much more efficient than that, but they can still do quite a good job of, in the end, being used to return a strong architecture.
Yeah, actually, one thing came up in this most recent topic, which is NAS-Bench-301, and [00:55:00] I think that could be a good segue, because we haven't talked yet about benchmarks, this whole topic. I think I said before, one of the major developments in the area of neural architecture search was the introduction of one-shot methods, starting with DARTS. And I would say another major development in neural architecture search was NAS-Bench-101. And that's because, before NAS-Bench-101, the state of comparison and science in the area was not perfect.
People were comparing the final accuracies themselves, rather than what I think they should be doing, which is comparing [00:56:00] the search spaces and making sure all the hyperparameters, and the whole evaluation pipeline, stay consistent, so we can really see what the best technique is.
But yeah, for a little bit of time, the area of neural architecture search was in this state where people cared about the final accuracy numbers. And even training for more epochs can push the accuracy higher. So even if someone had a worse algorithm, if they tweaked the hyperparameters of the final evaluation pipeline a bit, maybe they'd get a better result in the end.
So
Theresa: it's really hard to read a neural architecture search paper and trust what's actually going on there.
Colin: Yeah, until 2019, I guess it was harder, because you would have to look at all the hyperparameters [00:57:00] and everything, and hope that you can fairly compare, or hope that the hyperparameters are the same as the other methods you want to compare against. But yeah, in 2019, there was this great paper by the University of Freiburg and Google, which was NAS-Bench-101.
And this was, I believe, they decided on a search space, and then they trained every single architecture in the search space, which was like 400,000 architectures or something like that, and then released all of the architectures and accuracies to the community. And so what this allowed us to do is: we can now simulate our own NAS methods using, basically, this [00:58:00] lookup table.
So we know that this architecture has this accuracy, and so on. So we can simulate our NAS algorithms by running them, and whenever we're supposed to train an architecture, we just look up its accuracy. This allows us to run lots of iterations of our NAS algorithm, and also to run our NAS algorithm many times and get error bars on its performance.
It was much harder to do that before. And so that's why a lot of NAS papers before NAS-Bench-101 maybe just ran their NAS algorithm once, because it took so much time. But yeah, now with NAS-Bench-101, we can simulate the algorithms. And that started a lot of other work on [00:59:00] other benchmarks.
There was NAS-Bench-201, NAS-Bench-301, and other types of benchmarks. I think there's NAS-Bench-NLP and many more now; I think there are like dozens now, with various search spaces and datasets. And so this has really helped the community become more scientific in the comparisons and the claims on what's state of the art.
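The lookup-table idea is simple enough to sketch (a made-up toy table here; the real NAS-Bench APIs expose queries over hundreds of thousands of trained architectures):

```python
# Sketch of simulating a NAS algorithm on a tabular benchmark: every "training"
# is replaced by a dictionary lookup, so we can repeat the whole search many
# times and report error bars. (Toy table with made-up numbers.)
import random

TABLE = {          # architecture spec -> final test accuracy
    ("conv3x3", "conv3x3"): 0.91,
    ("conv3x3", "max_pool"): 0.89,
    ("max_pool", "conv3x3"): 0.90,
    ("max_pool", "max_pool"): 0.85,
    ("skip", "conv3x3"): 0.88,
    ("skip", "max_pool"): 0.84,
}

def random_search(n_evals=4):
    archs = random.sample(list(TABLE), k=n_evals)
    return max(TABLE[a] for a in archs)          # best accuracy found in this run

runs = [random_search() for _ in range(100)]     # 100 simulated searches, near-free
mean = sum(runs) / len(runs)
std = (sum((r - mean) ** 2 for r in runs) / len(runs)) ** 0.5
print(f"best-found accuracy: {mean:.3f} +/- {std:.3f}")
```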
Theresa: Yeah, would you say that the state of benchmarking right now is actually to a point where there's a benchmark for most things you would want to do and you can actually have really good scientific comparison to other approaches?
Colin: It's definitely a lot more diverse now. I mean, it's hard to say that we have benchmarks for most things we want to do, because what we want to do will always be evolving. But the benchmarks we have are definitely very diverse, and so [01:00:00] hopefully, say a new NAS algorithm comes out, the benchmarks can give a good picture of where it stands.
There is actually one caveat, though, which is that the benchmarks are best suited to the cleanest black-box optimization algorithms, but one-shot methods, and many other types of methods that are more gray-box, or fully differentiable methods, are much harder to use with these NAS benchmarks.
Because if I just run evolutionary search, it's very easy: I can start with this architecture, look at the accuracy, evolve and mutate it a little bit so I have a new architecture, then look at its accuracy. But one-shot is this bi-level optimization problem, and we [01:01:00] start to train it.
And so throughout the whole training, it's not like there's one architecture we can look up. We have this super network where all the weights are being trained at the exact same time, and so it's kind of hard to look at it. I mean, people have come up with partial solutions to this, which is that at any moment while we're training the one-shot method, we can look at which alphas are highest, and convert that into an architecture. Even though that architecture doesn't actually exist, we can still return it and then look up its accuracy.
But yeah, it's not a perfect solution, and it doesn't allow us to run a one-shot method a thousand times, like [01:02:00] we could with an evolutionary algorithm, using this lookup table alone.
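The partial solution described here, converting the current one-shot state into something a tabular benchmark can score, boils down to an argmax over the architecture weights on each edge (a minimal sketch with made-up edge names):

```python
# Sketch: discretize a one-shot supernet state by taking, for each edge, the
# candidate operation with the highest architecture weight (alpha).
import numpy as np

OPS = ["conv3x3", "max_pool", "skip"]
alphas = {                      # per-edge architecture weights read off the supernet
    "edge_0_1": np.array([1.3, 0.2, -0.5]),
    "edge_1_2": np.array([-0.1, 0.9, 0.4]),
}
discrete_arch = {edge: OPS[int(np.argmax(a))] for edge, a in alphas.items()}
print(discrete_arch)   # {'edge_0_1': 'conv3x3', 'edge_1_2': 'max_pool'}
# This spec can then be looked up in a tabular benchmark, even though the
# supernet never trained this stand-alone architecture to convergence.
```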
Theresa: And obviously, building a proper lookup table for something that's gradient based, that actually needs the gradient information for each operation, does not sound tractable, at least at the moment. I'm not sure if there have been ideas of building something like a gradient surrogate that predicts the gradients, but that seems like a very hard task, actually.
So, I guess even though one-shot methods are more time efficient, it seems like, for a lot of the benchmarks, they're still expensive.
Colin: Yeah, definitely. It's interesting, I hadn't thought about that. I guess with one-shot methods we need a complete picture of all the parameters, and so, yeah, I think it would be very challenging to make a NAS benchmark for that. So I guess for these other NAS benchmarks, there's [01:03:00] the assumption that when we pick an architecture, we run it to convergence.
And, I mean, there are slight variations between runs, but mostly the accuracy is similar. But yeah, with one-shot methods, depending on the choices made in the search and what type of algorithm we're running, the weights for both the parameters and the alpha architecture parameters will drastically change.
And so that makes it extremely tricky to build a lookup table.
Theresa: Actually, it's very interesting. I mean, given, as we talked about, that one-shot, especially for NLP, is really popular at the moment, maybe someone will come up with a really smart way of reducing the cost here a bit, but who knows. It's good to see that benchmarking is really alive and well in NAS, however, [01:04:00] because we see this in a bunch of areas of machine learning off and on again, right?
Where without good benchmarks and evaluation protocols, it's very hard to make good comparisons. And then it's not always clear what good forward directions are. So I think that's a really valuable thing to have in a field.
Colin: Absolutely. Yeah, I said at the start that I've looked into, or done research in, a good number of fields. Unfortunately, I think a lot of fields have some problems or are not perfect; some are maybe better than others. Like, for example, NAS was just in a very tough spot because all the algorithms take so long to run.
So it's much harder to have scientific comparisons, or comparisons involving several runs of the same algorithm. But yeah, definitely there are not many, or maybe no, areas of machine [01:05:00] learning that are absolutely perfect. But it's great to see that NAS has come so far, and now, I think, is in a really good spot.
Especially compared to some other areas.
Theresa: Yeah. Speaking of good spots, how would you say the spot of NAS is if we're talking about wider machine learning? You already said computer vision and natural language processing are two areas where it's applied a lot. How widely do you think it's used in general? And what are the most interesting use cases at the moment?
Colin: Yeah. So I think one of the biggest successes is definitely using NAS to find high-accuracy and high-efficiency models, like, say, optimizing accuracy versus parameter count. And yeah, I think [01:06:00] I mentioned before, some of the best examples of architectures designed by NAS that saw a lot of real-world use are the EfficientNet series and the MobileNet series.
So yeah, both of these have been applied in the real world in many cases, and are even built upon more and more, far after their initial release, and there's been follow-up work and so on. So yeah, I would say these are some of the best examples, and in general, accuracy plus efficiency is where NAS has really shined in the real world.
I think there are also some examples, I don't know if I can think of an exact one off the top of my head, but [01:07:00] like some new problem. Say in computer vision, we have classification, but then there's also semantic segmentation, or object detection, or many of these other use cases. And sometimes there's a novel type of use case for some specific dataset.
So NAS can be helpful here to come up with the best architecture for some new type of computer vision problem, or a new type of problem in general. Yeah. And then nowadays, I guess we already talked a little bit about how machine learning as a whole, with NLP, the architectures are getting larger, and how NAS may fit into this, with more transfer from smaller to larger models and more human-in-the-loop methods.
Theresa: I actually also [01:08:00] think this kind of age of big models, and also, you know, foundation models is a big term in all kinds of fields, not just NLP. I always felt like that should be a pretty good trend for NAS, right? Because the idea of a foundation model is that you have this big model, but you really only train it once.
And if you only train it once, it might not make that much of a difference if you spend a bit more compute to make it as good as possible. Am I right in this intuition or do you think that might not work quite as well?
Colin: Yeah, it's a good point. I mean, there's definitely the thought that, in some sense, the architectures have become more simple, and we've just scaled up their size and their [01:09:00] compute and training data. Actually, yes, people talk about removing inductive biases as compute increases. But still, I mean, there's still a place for NAS then, to get the type of architecture with, yeah, not so much inductive bias, but the exact perfect amount, and the exact perfect architecture with this inductive bias.
So, even if architectures are getting less complex, there's still definitely a need to make sure we have the best architecture for our current compute and data.
Theresa: And there's also been a lot of demand for, like, small large models, which seems like a weird term actually, but like these 3 billion or 7 billion parameter models that have come out. The smaller versions of Llama, for example, or Gemini, have gotten a lot of attention. And I think it's because they're easier to run locally [01:10:00] and easier to build upon.
So, I mean, since you already said you think 7 billion parameters, at least for big companies, would be feasible to do NAS on, I can also see this might be worth it if we try to get maybe the capabilities of a 15 billion model into a 7 billion parameter model, if the trade-off is that I can run it on my toaster or whatever.
Colin: Yeah, definitely. I think for sure the big companies can run NAS on 7 billion, and it'd be very interesting, and hopefully they could transfer it to the bigger architectures. I know that some labs have claimed that data is really maybe the most important thing, more so than the architectures themselves.
I mean, a lot of the techniques developed in the NAS community could also apply to [01:11:00] other hyperparameters; there could be many hyperparameters that apply to the data that we want to do performance prediction with. And so there's still a place for all these NAS techniques.
And also, I think, yeah, if we want to create the absolute best 7 billion model, we do need to think about the best architecture and how it's designed, and especially how it fits with the data used to train it.
Theresa: So what we've been talking about, right, is a lot of refining an architecture. Like, if we think about a big architecture category, like a CNN or a transformer or whatever, I can build a search space and refine that. Something that people who haven't read a lot of neural architecture search sometimes come up with is the question of, okay, but why doesn't NAS actually discover the next transformer?
Is that [01:12:00] something that you think will happen, or is that, in your eyes, just not the role of NAS in the research space?
Colin: Yeah. I mean, it would be great if NAS did come up with the next transformer, but I think that's very hard to do, maybe harder than the NAS community thought or hoped. So yeah, I think at least the techniques today are not ready to develop the next transformer. Maybe as Moore's law continues, we can run NAS in the future and find some brand new architecture that is the current best. I think NAS has had more success in finding new types of components at a smaller scale: not finding the best transformer, but finding the best smaller-scale [01:13:00] types of operations or components. But yeah, I guess I would say some of the biggest success stories in neural architecture search, as I maybe said a couple of times, are efficiency plus accuracy, with EfficientNet, MobileNet, and these types of architectures.
Theresa: And I mean, that's incredibly important if we actually want to use deep learning in real world applications a lot more. So maybe to finish off with, what do you personally think are some of the most interesting things to look out for in NAS right now?
Colin: Yeah, I mean, my own research has gotten a bit more of a foundation model bent to it. So I'm definitely excited to see both AutoML and neural architecture search, how these ideas can be integrated with [01:14:00] the new age of large language models, and I think, yeah, there's already some exciting work, and maybe there's some work going on that's more proprietary.
But I'm sure, just as there are more open source models, there will also be more open techniques that come out around this space. Yeah, I think I mentioned einspace in the middle of this, this new exciting search space based on context-free grammars.
So that's another work I'm excited about, and I think following up on this, continuing it, or coming up with better algorithms for it is also an exciting area. Yeah. So what have I said so far? einspace and large language models. Yeah, I think these are definitely two answers I'm happy with [01:15:00] as the most exciting places to be right now.
Theresa: That already sounds like quite a lot of space to explore. So if people want to read more from you or hear more from you, where can they find you?
Colin: Yeah, well, I'm based in San Francisco, I'm on social media, and yeah, I have a personal website. So, yeah, feel free to email me at colin at abacus.ai.
Theresa: Great. Then thank you for this chat. I think I learned a lot, especially about the history of NAS that I wasn't aware of, and I hope everyone else did as well.
Colin: Great. Yeah. Thank you very much. It was a very great discussion and I'm very happy to have the opportunity.