Oscar Beijbom talks about what it's like to run an AutoML startup: Nyckel. Beyond that, we chat about the differences between academia and industry, what truly matters in applications, and more.
Check out Nyckel at: https://www.nyckel.com/
Nyckel
===
[00:00:00]
Theresa: Hi, and welcome to the AutoML podcast. This week, we're not necessarily talking about research directly, but we're still talking about AutoML. My guest is Oscar Beijbom. He's actually a founder. He's founded a startup that wants to be, in their own words, a fast, powerful, and simple machine learning API, which is kind of the goal of a lot of AutoML research.
Theresa: And from the outside, it looks quite nice. So I'm asking him some questions on how it worked for him to translate AutoML research and machine learning research in general into this practical setting where it's really tested on real customer data. But also what he thinks is important to focus on if that's our goal.
Theresa: And how we can also maybe work towards that as a research community. I [00:01:00] thought that was a quite interesting chat. Because I'm a researcher. I'm at a university. Most people I meet, even if they're from industry, they do research. And often quite fundamental research. Even though AutoML is supposed to be somewhat of an applied field, right?
Theresa: I think a lot of people in that field don't necessarily have a lot of contact with a real data use case that often. So this check-in was quite nice. I enjoyed that conversation, and I hope you do too. I'm also curious if you agree with our views of what we can do with AutoML right now, what's maybe missing, and what are interesting topics to look at next.
Theresa: So if you agree or disagree, feel free to let me know. But now, on to the show. I hope you enjoy my conversation with Oscar.
Mhm.
Theresa: So I'm here today with Oscar Beijbom, said that almost correctly, I think. And before we get on topic, Oscar, [00:02:00] do you want to briefly introduce yourself to the audience?
Oscar: Sure. Hi everyone. So my name is Oscar. I live in the Bay Area, California. I run a company called Nyckel. Oh, I guess they can't see my t-shirt. Yeah, Nyckel is a classification-as-a-service company, and the big thing of what we do is the AutoML engine there, which I have a feeling we'll talk more about. I did my PhD in San Diego and then a postdoc in Berkeley after that.
Oscar: And originally I studied in Sweden. I did engineering physics there. And my first job was actually with Hövding, the airbag bicycle helmet company. I was the first engineer in that company and actually developed their whole accident detection system using an RBF kernel SVM.
Theresa: That's really interesting. Company secrets.
Oscar: Yeah, yeah. No, that was cool. I mean, I can talk a lot about it, there's a lot to say about that. It was a wild, wild time. But yeah. I also run a site called CoralNet, which is a resource for [00:03:00] marine biologists to automatically annotate their images. You upload images, start annotating, and then the system picks up and helps you.
Oscar: I keep it going. It was my PhD project. And as far as I know, it's the biggest site for that kind of data analysis right now. It's a very small world, but it's the biggest fish in that small pond.
Theresa: Yeah, but still, that's really nice. And you've been keeping that going still. That's also cool to have your PhD project kind of last for a while, right?
Oscar: Yeah, I mean, we ended up with this US agency called NOAA, the National Oceanic and Atmospheric Administration. Among other things, they monitor all the coral reefs. So they basically became a user of this, and they've been sponsoring it. It's obviously nonprofit, but they've been sort of keeping it alive.
Oscar: And then we hired this developer, I guess an undergrad at UCSD. [00:04:00] It's like the best thing we ever did, because he's very happy to just keep doing that. It doesn't seem like he wants a full-time job. So he's been doing CoralNet part time for the last 10 years, seriously.
Oscar: So the site is still up, still going. People are still using it, but it's very much, I don't know, it's run on a shoestring budget, basically. Yeah.
Theresa: Still, I think I'd be very happy if in 10 years someone still looks at my dissertation project
Oscar: Yeah, yeah, yeah. No, it's cool.
Theresa: That's really nice. And now you're mainly doing Nyckel?
Oscar: Yes. Yeah. Now, since about two years back. Oh, I kind of forgot: after Berkeley, I did like six years in the, it was called the self-driving car industry.
Theresa: Mm.
Oscar: So I started as just me, I guess the first sort of AI person in one of the startups, an MIT startup. And then in the end I actually had a team of a hundred that was running AI for the whole organization.
Oscar: But yeah, I did that for six years, and then I decided to start Nyckel with two [00:05:00] friends. Yeah, it's my full-time job.
Theresa: That's really interesting. So you've seen coral reefs, accident detection on bicycles, and self driving cars.
Oscar: Yeah. Yeah. Yeah.
Theresa: Were those experiences the thing that motivated you to do something AutoML focused, or was that just part of your plan all along?
Oscar: Well, I mean, you know, in every job I got, because I've always been working in startups, I was always the first, the most senior, and the only ML engineer. So you end up setting up the same things: you set up the data engine to annotate and clean up your data.
Oscar: You set up the AutoML system to train and so on, and you set up your deploy system so you can run the models. So when my buddy Dan started talking about how he needed something like AutoML for his other tech company, I was like, that's kind of interesting. I realized how hard it is
Oscar: for him, as a sort of normal general-purpose [00:06:00] developer. This was before GPT, right? So that became Nyckel. And to answer your question, a big motivation for me was to see if I could solve this problem in a systematic way, so that if I get another job, I don't have to rebuild everything from scratch. And of course there's a huge landscape of MLOps platforms, there are so many companies. What we tried to do is take it to an extreme, what we call no-dials ML. We take care of everything. We split the data using cross validation. We don't even talk about what models we use or deploy.
Oscar: You just get the one that works best, period. You don't even get to pick the metrics. We just pick them for you.
Theresa: So basically I just give you a data set, and I assume I will have to tell you what I want to do with it, though, right?
Oscar: Yeah, you define the labels, you provide a few examples for each class, and then that's it, basically. It trains in a few seconds, it deploys [00:07:00] instantaneously, and it scales too. Our biggest customer is hitting it with, let me see, I want to get this right, yeah, 100 million invokes per month. So it's like 3 million per day, which is tens of invokes per second. I've done the math before. It's a lot. And it's all running models from that system. It just scales; we're not doing anything special for them. Yeah,
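A quick back-of-the-envelope check of the traffic figures quoted above (the numbers are Oscar's; the arithmetic below is only illustrative):

```python
# Rough average throughput implied by the quoted traffic volume.
invokes_per_month = 100_000_000
invokes_per_day = invokes_per_month / 30               # ~3.3 million per day
invokes_per_second = invokes_per_day / (24 * 60 * 60)  # ~39 per second on average
print(f"{invokes_per_day:,.0f} per day, ~{invokes_per_second:.0f} per second")
```

Peak traffic is of course burstier than this average.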
Theresa: Right. That is quite a lot. So basically you really take care of this whole pipeline, for, as I heard it, classification as of now. And
Oscar: yeah.
Theresa: are there any limitations to that, like in terms of models, in terms of complexity, in terms of something? It doesn't seem to be invokes per day.
Oscar: Yeah, I mean, there are limitations. The model zoo that we use, for example: as I find new models that seem [00:08:00] good, that are open source and have the right license, we just kind of add them to the zoo, right? And it grows slowly but surely. So that's one limitation.
Oscar: And also, we want it to be fast. So we do some tricks, like if the data is really, really imbalanced, we subsample the biggest class, so we don't get quite as many training samples.
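Nyckel's exact balancing logic isn't described beyond this, but a minimal sketch of the trick Oscar mentions, capping over-represented classes at a fixed per-class budget, could look like the following (the cap value and the flat-list data layout are assumptions):

```python
import random
from collections import defaultdict

def cap_per_class(samples, labels, max_per_class=1000, seed=0):
    """Downsample over-represented classes so no class exceeds max_per_class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    capped_samples, capped_labels = [], []
    for label, items in by_class.items():
        if len(items) > max_per_class:
            items = rng.sample(items, max_per_class)  # keep a random subset
        capped_samples.extend(items)
        capped_labels.extend([label] * len(items))
    return capped_samples, capped_labels
```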
Theresa: Okay. So basically your target audience is exactly like you said: someone trying to do a startup, something with data, who either can't or doesn't want to build this whole data stack on their own. And they really just need a system that doesn't only do something like hyperparameter optimization, which you could find open source pretty easily, but actually takes care of the whole stack from data to deployment.
Oscar: Yeah. Yeah. There's a package called AutoGluon, I'm sure you're aware of it. I haven't looked very closely at it, but it looks like a very well-built stack. So I think that's an example of doing the AutoML part very well, but [00:09:00] then you still typically need to iterate on your data and annotations.
Oscar: And we help you with that. Because we do cross validation, you can do stuff like: okay, show me where your annotations disagree with the predictions, sorted by the highest-confidence prediction. And a lot of the time, if you have a couple of thousand data points, those high-confidence disagreements are
Oscar: just human errors, right?
Theresa: Yeah, yeah, it makes sense.
Oscar: Yeah, so you sort and then you kind of fix those and then it retrains, yeah.
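A minimal sketch of that label-auditing trick using scikit-learn; Nyckel's implementation isn't public, this only illustrates the idea of sorting cross-validated disagreements by the model's confidence:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def suspicious_labels(features, labels, n_folds=10):
    """Indices where the cross-validated prediction disagrees with the given
    label, sorted by the model's confidence in its own (disagreeing) prediction."""
    labels = np.asarray(labels)
    classes = np.unique(labels)  # column order used by predict_proba
    clf = LogisticRegression(max_iter=1000)
    proba = cross_val_predict(clf, features, labels, cv=n_folds,
                              method="predict_proba")
    predicted = classes[np.argmax(proba, axis=1)]
    confidence = np.max(proba, axis=1)
    disagree = np.where(predicted != labels)[0]
    # Highest-confidence disagreements first: these are often annotation errors.
    return disagree[np.argsort(-confidence[disagree])]
```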
Theresa: Okay, how much of that is manual? Is that fully automated, especially in the data part?
Oscar: It's, yeah, it's fully automated.
Theresa: That's quite nice. That's quite nice.
Oscar: Yeah.
Theresa: Yeah. So is there anything in there, let's call it, especially fancy? I mean, you mentioned some tricks, and that absolutely makes sense. But would you say this is basically things that are available open source, like the ones you use, just brought together in a well-engineered way? Or [00:10:00] do you think there's something fancier in there?
Oscar: Yeah, I think so. I mean, I know your lab does a lot of work on AutoML, but I'm sort of skeptical of any fancier method. I think all my years of doing applied machine learning have made me a little bit jaded about methods that become too fancy. So what we do is basically we try everything.
Oscar: So we have a roster of deep nets, like say BERT, or DistilBERT, or CLIP, or, what's it called, FLAG, a more recent one. And we have a separate set, obviously, for text and for images, right? And then we run feature extraction with those. We spend a lot of time on engineering, so we actually do that using AWS Lambda instead of a GPU.
Oscar: The cool thing about AWS Lambda is you can spin up nodes in maybe one second. So we spin up 500 of those and chug out the feature extraction, which means that even 10,000 samples you [00:11:00] extract in like 20 seconds or something. You get all the features, you write them to disk, and then you start the training part, what we call the shallow training. We try a bunch of things like logistic regression, XGBoost and so on, with different hyperparameters, and each one of those is trained on a different node. So you spin up hundreds of nodes that each train some shallow learner on some feature set. They all report back with their accuracy, we use class-balanced recall as the key metric, and then we pick the best one. And like I said, we don't even tell the user which one. Our hypothesis is, and this is just basic product design,
Oscar: that there are people who don't want that. They get anxiety if you share too many details: oh, what is this? What [00:12:00] if I change that? And then you get kind of stressed about it. So we just pick the best one and give it to them.
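None of Nyckel's code is public, so the following is only a schematic sketch of the pipeline as described: features are extracted once with a pre-trained backbone, many cheap shallow learners with different hyperparameters are trained in parallel (Nyckel fans this out on AWS Lambda; a local process pool stands in for that here), each is scored with class-balanced recall, and the winner is kept. The candidate model and its hyperparameter grid are illustrative assumptions:

```python
from concurrent.futures import ProcessPoolExecutor

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def _evaluate(args):
    """Train and score one shallow candidate on pre-extracted features."""
    features, labels, C = args
    clf = LogisticRegression(C=C, max_iter=1000)
    # balanced_accuracy is the macro-averaged (class-balanced) recall.
    scores = cross_val_score(clf, features, labels, cv=10,
                             scoring="balanced_accuracy")
    return scores.mean(), C

def pick_best_shallow_model(features, labels):
    """Fan out candidates in parallel and return the best configuration."""
    candidates = [(features, labels, C) for C in (0.01, 0.1, 1.0, 10.0)]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(_evaluate, candidates))
    best_score, best_C = max(results)
    return best_C, best_score
```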
Theresa: So basically it's the power of parallelism more than anything?
Oscar: Yeah, we just work really hard on that. So, Oracle Cloud has an equivalent to AWS Lambda, I think their service is called Functions. My co-founder actually led the team that built that at Oracle, so he's really familiar with how to get these things to run fast and take advantage of that architecture. That helped a lot. So he and I together were able to spin something up. And then the cool thing is that when you retrain, it's just really fast, because now you just extract features for the new data points and then retrain the logistic regression, which, as you know, is super fast.
Theresa: Yeah. That shouldn't be an issue at all.
Oscar: Yeah, it's amazingly fast, actually. At some point recently, I started toying with an idea, because [00:13:00] George, my co-founder, keeps telling me: why are you retraining from scratch every time? Like the logistic regression.
Theresa: Mm.
Oscar: And I told him, I don't know, I don't know how to do lifelong learning.
Oscar: There's this field of research called lifelong learning, where you don't retrain the model from scratch, you kind of keep updating the same model. But I don't know how to do it. I haven't seen a single algorithm that is robust for that. Maybe you can educate me.
Theresa: I also don't know. I think it's just really hard. It's really hard.
Oscar: I think it violates all the assumptions around IID, how the samples come in, the statistics of the data, right?
Oscar: When you start doing that. One thing I tried recently was to initialize the new training with the previous training. Couldn't get that to work, either.
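For what it's worth, scikit-learn's LogisticRegression does expose a warm_start flag that reuses the previous coefficients as the starting point of the next fit. That is only a convergence speed-up, not lifelong learning, and it assumes the label set stays the same; a sketch with synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)

# warm_start=True keeps the coefficients from the previous .fit() call and
# uses them to initialize the next optimization (lbfgs/sag/saga solvers).
clf = LogisticRegression(warm_start=True, solver="lbfgs", max_iter=1000)
clf.fit(X[:1000], y[:1000])   # first training run on the data available so far

# Later, new labeled data arrives; refit on everything, starting from the old
# weights. Statistically this is still a full retrain on all the data, so it
# sidesteps the IID issues mentioned above rather than solving them.
clf.fit(X, y)
print(clf.score(X, y))
```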
Theresa: Yeah. The question is also, is it worth it? Because if you can train so quickly... I think lifelong learning is something that becomes much more relevant if we're in the territory of these [00:14:00] foundation-like models that just take ages to train and that you obviously don't want to refit from scratch. But if you're able to get really good performance with something like a logistic regression, where you really don't spend GPU days, what's the point, right?
Oscar: Yeah, I mean, we are seeing that when we have customers with, say, 100,000 data points across 300 classes, logistic regression is starting to take
Theresa: Okay.
Oscar: Tens of minutes. And that's like, you know,
Theresa: Yeah, maybe I'm too far in the research world; tens of minutes is still quite short to me. But obviously, as a customer experience, it feels much nicer if I just sit here, get a coffee for two minutes, and then I have my result and don't have to wait for an hour. I get that part.
Oscar: Yeah, for a while we were pitching Nyckel as a prototyping tool. Because the usual workflow in a machine learning team is: you start a training job before you go home, right, in the [00:15:00] afternoon. Then it trains overnight, you come in, and you spend kind of the whole day interpreting the results and setting up the new experiments.
Oscar: With Nyckel, it trains in like five seconds. So there's this feeling of iterating on, usually, the labels: what sort of labels should I use? Because when you start a new problem, it's not obvious how to define even something as simple as spam or not spam. It's not until you actually annotate data that you think through: what do I mean by spam? Is this actually spam? So as you're working through the data, it's just really helpful to see what the machine picks up, what gives me high accuracy right away.
Oscar: So I don't have to spend a couple of days on something I think I should call spam, train it, and have it not work. It's really cool when it just happens right away.
Theresa: Basically, you can focus on the [00:16:00] data part and on specifying the task correctly through the labels.
Oscar: Yeah, yeah, exactly.
Theresa: Yeah. So then, in essence, that system is something I think a lot of the AutoML community writes in their papers that they want to build with the things they actually research. But obviously, as you said and as you described, the methods nowadays are a lot, yeah, let's stick with the word fancy, a bit fancier than what you seem to be using.
Theresa: Do you think that's just your specific target, where, for example, classification has been around a lot, right? Classification is obviously not solved, but it's quite well researched compared to something like time series. Is that why it's an area where, you know, simple methods just do really well? Or are we getting too fancy in some way and not focusing enough on the, I don't know, maybe even application side of making these [00:17:00] things work in practice?
Oscar: Yeah, I don't know. My instinct is, when I first built this system as I just described, I always thought of it as the first level of the stack, which you can think of as super-shallow fine-tuning, right? You just add a new layer and you tune it. I always envisioned building a second layer, which would be deep fine-tuning,
Oscar: where you actually tune the whole network, but you still take networks off the shelf and kind of tune them. You try them all. And then the third layer would be something like network architecture search. I guess, is that what you mean, as one example of fancier methods?
Theresa: Yeah, I mean, there's also a lot of meta-learning going on in the last few years, increasingly in AutoML, right? Things like trying to transfer hyperparameters or architectures, trying to take more advantage of dataset features, or even trying to do things with in-context learning these days.
Oscar: Interesting. Yeah. [00:18:00] Another thing I envisioned when I started building this was more of a meta-learning thing, for when a new data set comes in. Imagine you have a zoo of, now, a thousand different neural networks and maybe a hundred different shallow learners; it just becomes impractical to run everything on every data set.
Oscar: So I had in mind that there would be a meta component that says: okay, this type of data set is most similar to these three over here that I already know, and then I pick the models and so on from there. I just haven't had the need to build that yet. But if that's what you mean by meta learning, I totally buy that when the search space becomes too big. But yeah, I don't know.
Oscar: So, okay, getting back to your question, I think one of the reasons we haven't felt the need is maybe that the type of learning we do is relatively small sample, like up to 100,000 training samples. It turns out that, I mean, we did a couple of [00:19:00] benchmarks against Vertex AI. As far as I know, they have a really nice network architecture search built into their AutoML system.
Oscar: Are you familiar? Do you think your listeners are familiar with Vertex AI?
Theresa: I am not. So a quick intro would be nice.
Oscar: Yeah, Vertex AI is, if you go to Google Cloud Platform, one of their main machine learning products. And it's essentially like Nyckel. I mean, their data annotation, their data engine part, is pretty
Oscar: basic, but their AutoML is really nice. And then they also have a deploy system. It takes maybe a couple of days instead of a couple of hours to get going, but it's still fundamentally the same thing. So we did benchmarks with them, and I fully expected them to do much better.
Oscar: But they didn't. And I still think they would if we took much, much more data, if it was a complex problem with really, really big [00:20:00] amounts of data. Then of course you need a high-capacity model, essentially a higher-capacity learning system.
Oscar: So again, our hypothesis is that for the types of customers we target, it's not that complex. It's just bespoke. It's a thing you can't quite use off the shelf for; you need your own custom thing. But it's fundamentally not that hard.
Theresa: Okay. So that means it's definitely possible to stick to simple basics that are then also expected to work reliably.
Oscar: Yes. Yeah. I mean, it's just a small thing, but one cool thing about using logistic regression is that it's so stable. So we actually run 10-fold cross validation on this stuff, right? For every combination of feature extractor and model parameters, we train 11 times: a 10-fold cross-validation split,
Oscar: and then the 11th [00:21:00] time we train on all the data. And of course the 11th one is the one we deploy, and that one hasn't been tested at all, there's no validation for it at all. But I feel okay about that because logistic regression is so stable. So if the cross validation based on the other 10 runs gives you high results, then why not deploy the one that's trained on all the data?
Oscar: But I would never do that if it was fine-tuning of a deep net, because who knows what's going to happen, right? Yeah,
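A compact sketch of that "10 folds plus one deploy model" scheme, using scikit-learn for illustration (again, Nyckel's actual code isn't public):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def train_and_estimate(features, labels):
    """Estimate quality with 10-fold CV, then fit the deploy model on all data."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    # Ten held-out estimates of class-balanced recall (macro recall).
    cv_scores = cross_val_score(LogisticRegression(max_iter=1000),
                                features, labels, cv=cv,
                                scoring="balanced_accuracy")
    # The "11th" fit: trained on every sample and never directly validated;
    # the CV scores above serve as a proxy for its quality.
    deploy_model = LogisticRegression(max_iter=1000).fit(features, labels)
    return deploy_model, cv_scores.mean()
```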
Theresa: As I said, I'm in reinforcement learning; I would never do that. I wouldn't even be sure if that's a good hypothesis for reinforcement learning, which is a bit of an outlier even among learning methods. But yeah, I think you're right. If you look at anything deep, I would be concerned.
Oscar: I mean, it's very bad practice, right? It's terrible practice. But it makes a difference. If you have a thousand data points, normally you would split them, say, 70/30. Then you're training on [00:22:00] 700, and we're training on a thousand. That actually matters for performance in the low-to-medium data regime.
Theresa: absolutely. I can totally see how that would be super relevant.
Oscar: Yeah
Theresa: I actually want to get back to something you said before, which was: do you think at least some customers, the ones you're targeting, maybe don't want to or don't need to see a lot of stats about the process, about the model and everything?
Theresa: I think that's quite interesting, because this topic of interpretability, explainability, is obviously becoming more important in machine learning in general, but also in AutoML. And I think that's a really interesting discussion, to think about what people actually need to see and what they want to see about models.
Theresa: And I am kind of on your side in that. I think a lot of people don't actually need or want a lot of information about what they're doing. And then there are people who always tell me: oh, but what about these new regulations? Like in the EU now, you certainly need to show things about your models, and obviously you need to have them [00:23:00] be very interpretable for that.
Theresa: Do you think that's actually true, that it's really helpful to most people? Or do you think that's something that maybe only a research scientist or a big organization would want, and apart from that it's less useful?
Oscar: Yeah. Just to make sure I understand you correctly, when you say interpretability, I should know this, but what do you mean exactly? What is an example?
Theresa: Yeah, so that's another discussion, obviously, what counts as interpretable, right? But in this case, to me it's about things like what the performances and maybe performance splits look like, or which model classes or hyperparameters were important for the process.
Oscar: Mm. I see. I see that kind of interpretability. Yeah. Or like what feature is the most important?
Theresa: yeah, that as well.
Theresa: Things like that.
Oscar: Yeah. I mean, before deep [00:24:00] learning, for the type of service that we provide, there were a lot of, well, I don't know about open source, but I know there were a lot of companies. There was one called, what's it called? Okay, I can't think of it right now. But there was a company that was specifically doing basically this for tabular data, where you just get floats or categoricals in.
Oscar: And then I think this is almost exclusively regression, because then you really can do interpretation of which data matters. And I think for certain applications, certainly business applications, business analysts, people that sit there with their spreadsheets, someone forecasting demand or those kinds of things, it seems like those people definitely want that.
Oscar: And not so much the model details, but which features matter. So that's one thing. But then for the type of systems we're building, where it's quote-unquote unstructured data, raw text or images, at least for me [00:25:00] there's nothing we can tell them that will help them interpret the results, nothing relevant. I mean, I could tell them this model, the BERT-this, the DistilBERT-that, worked 1 percent better than this CLIP-blah-blah-blah,
Oscar: and they're like, what does that mean? It doesn't mean anything. So maybe that's one of the reasons we didn't do it: there's not much to tell that is relevant. Still, I think maybe an enthusiast would kind of like that. If they're into tinkering with this, they want to plug in their own models.
Oscar: But I don't think that's our target audience for Nyckel. They might go to AutoGluon or something where they can really do it themselves.
Theresa: Yeah. Or even just build their own stack if they really want to tinker.
Oscar: Yeah, yeah, exactly.
Theresa: Yeah. And what about process details? You said something about the model class, but [00:26:00] even things like: this took us this many iterations, stuff like that. Do you even see anyone besides enthusiasts that you think would be super interested in those details?
Oscar: The number of iterations of like training or
Theresa: Yeah. For example, like down to the nitty gritty things.
Oscar: Yeah, I don't know. But I think most platforms like Vertex AI would share all those details, right? You would click on the model artifact or whatever, and it would give you a lot of details: here's how long it trained, here's how many parameters it has. It would actually disclose that.
Oscar: So, you know, who knows, maybe we're doing something wrong. Maybe people really want to know. But
Theresa: Yeah, I would be interested, because I'm never quite sure what people would do with that information. It feels to me this could end up leading somewhere where someone says: oh, my output was not generated by, I don't know, BERT, for [00:27:00] example, but I think BERT is really great, so why was that not the result of the search process?
Oscar: Yeah, right. It's totally possible. And, I mean, OpenAI has a fine-tuning product now. I don't know if you've played with it, but they sort of expose all the hyperparameters of their fine-tuning system, but then they have defaults for everything. So I'm guessing 99 percent just go with the defaults and move on with their lives.
Theresa: Yeah,
Oscar: I don't know, you know, yeah.
Theresa: But OpenAI is actually a good keyword. So, you mentioned you started Nyckel before ChatGPT and LLMs in general became this big. Do you think anything changed for your target audience? Do they use tools like ChatGPT over Nyckel sometimes?
Oscar: I think so. I think when we started, like I said, the spectrum went from [00:28:00] do-it-yourself, which is like months, to using something like Vertex AI, or, there are more: Hugging Face has an AutoML product, there's Clarifai, there's something called Roboflow. They brought it down to like a day or two, and then we tried to really take it down to an hour or even minutes.
Oscar: So we were by far at one end of the spectrum, and now GPT is basically even faster than Nyckel, right? Because you just ask and you get the results. So I think it does make our positioning in the market a little trickier.
Theresa: Yeah, I would assume that's even worse for open source tools, though, right? Because at least with Nyckel, the deployment part, but also the way you set up the parallelization, could still take quite a while to match at the same speed. But if the alternative is using something like AutoGluon, you're obviously going to have to work a lot [00:29:00] to beat something like Claude or ChatGPT in setting up your system now.
Oscar: Right. Yeah, totally. I mean, and we've done benchmarks. We do better than OpenAI zero-shot; Nyckel does better after, I think the average was, eight examples per class. We tried it on like 12 datasets that we had in our production database. I didn't want to use public data because I figured they were all trained on that.
Theresa: Probably, yeah. Mm.
Oscar: But yeah, the zero-shot was obviously better than Nyckel at, you know,
Theresa: Ha ha
Oscar: at a few samples. We actually ended up adding OpenAI to our AutoML system and used that to boost our performance in the really low data regime. So, you know, nowadays Nyckel is as good as OpenAI, because we use it. [00:30:00]
Theresa: ha!
Oscar: It's also how people get off the ground; it brings a little bit of magic into the process, where you just define the labels and upload your data, and all of a sudden it's already annotated or predicted for you. But it's interesting. I mean, with starting a company, so much is about
Oscar: marketing, obviously.
Oscar: But you know, we have customers that say: I don't even want to try Nyckel, because I know OpenAI is better. It will obviously be better for my use case. And I'm like, I mean, again, maybe I'm jaded, but after all these years in applied machine learning, the only way to know if anything is going to be better is to try it.
Oscar: It's the only thing you can do. Maybe you have some good theory in reinforcement learning, but in supervised learning there's no theory that actually helps you do anything useful. So you have to try everything. That's the only thing; you just do sweeps of everything left and right. But still, customers come to us like, [00:31:00] yeah, no, for sure, Nyckel wouldn't work. And vice versa: they're like, oh yeah, ChatGPT would never work for this, so I'm really, really excited to use Nyckel. And I'm like, are you sure?
Theresa: Seems like a very extreme opinion, either way.
Oscar: Yeah, yeah. I think as researchers we're both trained not to make those sorts of assumptions, but people do, more than I expected.
Theresa: No, you hear that in research a lot nowadays, that language models will be kind of the death of AutoML in a way.
Oscar: really interesting.
Theresa: Yeah, there's a bunch of discussion going on that these big models will just have all the information, and since they can do reasoning internally, especially in context with a few examples, they should be better than something like,
Theresa: I don't know, classical hyperparameter optimization or whatever.
Oscar: Really? I mean, but [00:32:00] even classical hyperparameter optimization is typically done in a fine-tuning way, right, where the base model also quote-unquote knows about everything, or maybe not quite as much. And even OpenAI does fine-tuning now; you fine-tune models.
Oscar: How is that different from AutoML? It's just another thing to try in your AutoML system, isn't it?
Theresa: Yeah, I think there are sometimes somewhat artificial splits, especially in research. It might be less so in application, I don't know, because there you obviously have a thing you need to get working. In research, you often don't really have that thing you need to get working; you can just compare to baselines from your niche.
Theresa: And then you suddenly get a case where it's, I don't know, in-context learning people versus Bayesian optimization people, who all try to do the same thing in principle but suddenly start arguing about the way to do it. I think that happens more often if you don't have a concrete [00:33:00] case to focus on.
Theresa: And now there's a lot of uncertainty about what can be done with large models and where we are right now, so I think these things just come up in conversation a lot.
Oscar: Right, yeah, I can see that. But I guess, I mean, the zero-shot performance is really cool, but you had that before with models like CLIP. You could do zero-shot stuff, right? Your labels were the text and then you searched by similarity. I mean, I guess I'm,
Theresa: It took it over the edge, in that you don't have to, you know, set up an interface to use it, like you would with CLIP. I think most people just didn't do that and never tried it.
Oscar: Yeah. No, absolutely. I mean, there's no question, it's a fantastic product. I just still think that once the dust settles, it's just another big language model that you will fine-tune to your use case. Among other things, you might try a BERT thing, you might try something else, right?
Oscar: And then whatever works [00:34:00] better is what you end up grabbing. To me, the really cool breakthrough is LoRA. Yeah.
Theresa: Hmm.
Oscar: That stuff is crazy; it just enables fine-tuning in a practical way, right? The model artifacts are small, so you can actually deploy them. Do you think your audience knows what LoRA is?
Theresa: I'm not completely sure. I mean, I'm never quite sure what the audience actually looks like for these podcasts, because obviously AutoML is a huge topic. But I think a lot of people in AutoML will have thought about bigger models these last one or two years, so I think a substantial number of people will know what LoRA is.
Oscar: Right. So, to me, it makes it practical to use, I mean, obviously you can't use GPT because it's closed [00:35:00] source, but you can use something like a big Llama model as part of an AutoML system, because you can amortize the cost of running the base model, right?
Oscar: And then just have these LoRA adapters spun up and down as needed.
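A hedged sketch of what that looks like with Hugging Face's peft library: a frozen base model plus a small LoRA adapter that can be trained, saved, and swapped independently. The base model name, target modules, ranks, and label count below are placeholder choices, not anything Nyckel or OpenAI is known to use:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

# Load a shared base model once; its weights stay frozen.
base = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=5)

# LoRA injects small low-rank matrices into the chosen attention projections;
# only those (plus the classification head) are trained.
config = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16,
                    lora_dropout=0.05, target_modules=["q_lin", "v_lin"])
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically a small fraction of the base model

# ...fine-tune `model` as usual, then save just the adapter (a few megabytes):
model.save_pretrained("my-task-adapter")
# At serving time the base model is loaded once and adapters are attached per
# task, which is what makes spinning them up and down cheap.
```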
Theresa: Yes, I agree. We even had, I think, a master's thesis fine-tuning one of the probably medium-sized Llamas with LoRA. If we couldn't have done that in such a sparse way, I don't think we could have given that to a master's student. That's just not something we could have done in academia.
Oscar: Right, right. It's amazing. I would say it's up there with the top five papers or breakthroughs for deep learning, at least as far as the practical impact of deep learning goes, right?
Theresa: Yeah, because then you don't have to rely on the zero-shot performance, which obviously is great for zero-shot, [00:36:00] but as you already said, most of your customers really need something custom in a way.
Oscar: Yeah. And I can share, I did a pretty big study, maybe I can send it to you if you want to put it in the notes. We did a comparison of zero-shot GPT, fine-tuned GPT, GPT with few-shot prompts, and a couple of different best practices for GPT, and then we compared them with our baseline, transfer learning if you want to call it that. So I can share that, and at least the cost-benefit of fine-tuning is very obvious.
Theresa: Yeah, that's really cool. I think it makes sense intuitively; probably everybody who's ever tried to have a model generalize knows it's just so much easier if you can fine-tune it. And the fact that we now actually have the tools to do that efficiently is great.
Oscar: Yeah, and there are whole startups that are [00:37:00] providing that on top of OpenAI. There's obviously a big rush there and a big ecosystem of LoRA-specific, LLM-LoRA-specific ML. And again, my hypothesis from starting Nyckel was:
Oscar: we don't talk about models, because models will always change. We talk about function types, or function prototypes: input is image, output is categorical. What happens in there shouldn't be a first-class component of an API, because it will always have to change. So we shouldn't talk about specific models or specific fine-tuning technologies.
Oscar: It's ephemeral; over time the best practice will change. But, you know, right now it certainly seems like an LLM foundation model plus LoRA fine-tuning is a very, very good stack. [00:38:00]
Theresa: Yes. And it might still change, but the idea of having this, yeah, basically amalgamation of a large amount of data there to give you a really good base for predictions, I'm not sure that's going to go away so quickly, because we'd have to do extremely well to beat that out of the box somehow, right?
Oscar: Yeah, I think that's fair. And I guess I always made this parallel to self-driving cars, right? The challenge with self-driving cars, at least when I was working in the field, was always this trade-off: okay, we know that fundamentally, if you make a bigger-capacity model, it can handle more complex data, and the models get bigger and bigger every year. So we thought, okay, cool.
Oscar: Even though we can't have the self-driving car right now, the models will become bigger and we'll be able to handle all the [00:39:00] complexity of the reality of driving around in a city. But on the other hand, you have this combinatorial set of things you can experience in the real world: all weather conditions, all different types of road actors, and the combinations of all of those, right?
Oscar: Then you have all the lighting conditions, and all the different types of occlusions, and so on and so forth. And every one of those is combinatorial, right? And I sort of convinced myself that we can't beat that, no matter how big the models are, because you have a combinatorial reality and you basically need training examples for all of those bins,
Oscar: right, to give the model a fighting chance to deal with it, or at least to be able to measure it. Even if you assume that [00:40:00] the model will generalize smoothly across some of these dimensions, you want to measure it to make sure you can deploy it safely.
Theresa: Especially in something like a car, which is definitely a domain where you want to be sure.
Oscar: Yeah, so when I left the field, I was like, I don't know if a pure AI approach is going to work, because reality is so complex and I don't know if the models will scale to it. And obviously the same thing is true with LLMs. I mean, they're super big, so they handle 90 percent of the situations.
Oscar: But then you have edge situations, and I think that happens more with images than text. It seems like with text, there are only so many things you do with it: spam, sentiment, toxicity, blah blah. I mean, there's a lot, right? But it feels like a smaller set than images, where it's just like, holy cow, we have people inspecting sewer systems, like: how full is my sewer tank?
Oscar: And then another one asking: how is my indoor plant looking? And another one: is this a real face or not? It's just [00:41:00] like, maybe at some point the set is so big that it falls outside of the pre-training.
Theresa: Yeah, that makes sense. I mean, we even see that for language already, that the pre-training just can't handle everything; we see LLM failures on the internet every day. And I'm tempted to agree that it's much easier to find strange-looking images, even for people, than strange-looking text.
Theresa: I think most people can handle a lot of text in a meaningful way, but I've definitely seen images which I couldn't parse at all. And I think it's not that difficult to find them, right? I mean, if you gave me an image of someone checking a sewage system, I think it would be very easy to confuse me.
Oscar: Yeah. It would be hard to even know what's up and down in that image. Yeah. And I guess we're stating the obvious: of course images have higher dimensionality [00:42:00] than text. Perhaps not surprising.
Theresa: Yeah, possibly. And that's images and text. You can obviously also make videos, then you have another axis to take care of, and combinations of both. I mean, we already have text in images sometimes. And you already mentioned CLIP; there are a few multimodal models that combine text and images, or even video,
Theresa: in different ways these days. So there are really a lot of combinations you can add on top of even just image complexity.
Oscar: Yeah. You're right. You're right. Yeah. Yeah.
Theresa: So quite a way to go.
Oscar: Yeah. No, but it is interesting. I've been a little bit curious; I haven't been to CVPR, one of the biggest computer vision conferences, in a couple of years. I'm sort of curious what they're all talking about these days. I mean, I hope what [00:43:00] they are researching is completely new architectures that are not transformer based, you know, the next-gen thing.
Oscar: But if they're all just fine tuning some model for some application, it doesn't seem that interesting.
Theresa: I can actually tell you. I haven't been to CVPR at all, but I've been to ICML this year, and it's a mixture. I mean, there are definitely people thinking about what the next transformer is. There are a few different contenders, I think, but none of them really beating transformers yet.
Theresa: But yeah, obviously, people are also trying to fine-tune models in different domains, which I guess in some sense makes sense. I saw a lot of natural science things, like physics and biology,
Oscar: All right. Yeah.
Theresa: and those are also domains where, I think, if you want a really good model for many biological problems, you're going to run into data issues pretty quickly,
Theresa: and [00:44:00] also into the fact that they're really complex.
Oscar: Yeah.
Theresa: So I think it actually makes sense to try fine-tuning a transformer of some sort, whether that be a language model or vision or combined.
Oscar: Yeah. You might even have to start from scratch; language might not be the right representation there at all. A friend of mine has a startup called Hooktheory, a music education, or sort of music creation, company. And they trained a generative model for music where, of course, you still use a transformer, but you don't use a language tokenizer.
Oscar: It's a completely different embedding, a different representation of the data. I imagine that happens a lot in, say, biology.
Theresa: Yeah. I think that's actually a very interesting question: how much of this language infrastructure do you actually want to reuse? How much is maybe just convenience, because we got it to work in that one domain and it might kind of work in other domains, [00:45:00] or it might look easier to just transfer it without tinkering with it?
Theresa: But I assume that we should actually think of different solutions for different types of data. So I think there's a lot of thinking still to be done around that, because in the first phase, obviously, the task was more or less: how do we live with this technology?
Theresa: How do we deal with it? How relevant is it for us? But I think there's probably still a ways to go.
Oscar: Yeah. And I mean, there's just no way the transformer is the end-all be-all. I'm sure there's something else.
Theresa: I mean, I'm pretty sure too, actually. The problems I've seen around conferences, especially this year, have been really large scale. They really haven't been on the same level as you described, where we have a few thousand images and try to solve that. It's more like: how do we, I don't know, predict how proteins work?
Theresa: That's a big one, and that's not something that I think [00:46:00] most companies, or at least smaller companies, would try to solve on their own. That sounds more like: we manage this once, and then a company would have a version of that internally. Do you think the problems startups like Nyckel want to solve will grow?
Theresa: Or do you think that's actually something that will stay fairly small, because the big stuff we only really want to solve once?
Oscar: That's a good question. My instinct is actually that most practical use cases have relatively little data. I was talking to a guy, I actually played tennis with him when I was in Sweden, and he runs a company in Lund that does assessment of trees. It creates, like, tree avatars, so he helps cities monitor their trees.
Oscar: So basically [00:47:00] every single tree gets a quote-unquote digital clone of it, and then, with aerial footage, he can go back to that tree and see how it's doing over time.
Theresa: Nice.
Oscar: So in his case, he needs something like Nyckel. I mean, he needs a lot of different types of ML, but he needed classification for just an aerial photograph of a tree, a little crop, classified into maybe the type of tree, but also the health of the tree.
Oscar: And it just seemed like such a weird, bespoke use case. Maybe not the data itself, of course, there are a lot of aerial photographs of trees floating around, but the specific taxonomy that he's interested in, that he thinks he can possibly resolve from the photographs he has at hand, is pretty bespoke, I think. And of course, he's there with his data and has no annotations to begin with, and it's a small company.
Oscar: It's just him and maybe a developer. So [00:48:00] by definition, or by design, he's just going to have no training data to begin with. He's going to start in the low-data regime. So the trade-off there becomes: annotation is such a big cost compared to compute and all the other stuff.
Oscar: So any product that gives him high performance at low data will probably be more interesting than a product that gives him really high accuracy with a lot of data. Sorry, this is, yeah. Do you know what I mean?
Theresa: Yeah, totally, absolutely, I see that. I have a friend who works in a biology lab, and they now have a new machine that kind of does a similar thing for fungus and can track fungal growth under a microscope, which is great for them because they don't have to manually count out the hundreds of fungi.
Theresa: But on the other hand, the amount of data they produce in a year is really limited. I mean, she told me [00:49:00] they had to send data to the company that produces that machine, so that their fungus would actually be recognized. And she was like, yeah, we sent them like 4,000 images or something.
Theresa: And that's just not a lot,
Oscar: right,
Theresa: but that's about what a lab, I mean, it's a medium-sized lab, can do, and they actually have to grow the fungi. They can't just mass-produce that.
Oscar: Yeah, yeah, right. So that's another example where the actual data is scarce and expensive to collect. So, I mean, what do people do? Maybe people don't work on classification for AutoML. But last time I looked, there were these really, really old datasets, like the one called the GLUE dataset,
Theresa: I don't even know it, to be honest.
Oscar: It's from UC Irvine.
Oscar: It's a combination of, I think, 12 datasets, most of them classification, and it's [00:50:00] already a feature vector; you just do something on top of it. But yeah, maybe it would be interesting for us to publish something like that. Like, if we had a thousand datasets that are all odd, long-tail tasks, maybe that would be interesting for the community. I don't know.
Theresa: I think it's definitely interesting, because there are things like OpenML, you know, where there are tons of datasets, but a lot of the datasets are also strange, and you never know: is this a normal kind of strange? Is this something you could encounter in the wild? Or is this just someone who hasn't really thought about what they're uploading?
Theresa: Because it's so much, right? It's really hard to navigate. And then it's also really hard to know: is this a real use case that I want to solve, or is this just someone's data from their computer that they haven't really focused on annotating?
Oscar: Oh, yeah, that's a good point. Actually, it's embarrassing to admit, but I've never looked at OpenML before. It [00:51:00] looks interesting.
Theresa: I mean, it's a lot. I sometimes find it a bit overwhelming, especially since I don't really work with datasets like that. But there are definitely gems in there. I just think you sometimes need to put in some effort to find what you want.
Oscar: Right, right, right, right.
Theresa: But yeah, as I said, it's always great to have confirmation that this data is actually a relevant problem someone's trying to solve.
Oscar: Yeah, I guess as researchers we all want to solve real problems. So there's always this trade-off: you don't have a lot of resources, you don't have a lot of engineering capacity, typically. So you want something real, but not too real.
Theresa: Yeah, exactly. Exactly.
Oscar: And it's sort of unfair too. Even if we did publish, say, 10,000 datasets, maybe that's a bad [00:52:00] benchmark for academia, because no one wants to sit there and run their things on 10,000 datasets; it'll be costly and it'll slow down iteration speed and so on.
Oscar: Yeah.
Theresa: Yeah, I feel like it's kind of an inherent trade-off, right? But having some element of actual real-world data is certainly helpful. I think this is an ongoing discussion, which datasets are actually relevant to test on, because even if you say, yeah, I'm low on resources, just using one dataset that's 20 years old and super trivial might not actually be a signal, even if you perform really well on that. Is that something we can take seriously? Right?
Oscar: Oh my God.
Oscar: Yeah, when I was doing my PhD, there was this field, or subfield, called domain transfer learning
Theresa: no.
Oscar: that I always sort of liked to pick on, because it's obviously an extremely important problem. [00:53:00] It's maybe the only problem in machine learning, because we're always going across domains.
Oscar: We're functioning across domains. But for some reason, there was this one dataset that had become the standard benchmark. It was the Office dataset from Berkeley. Are you familiar with it? Yeah. It was the smallest dataset you've ever imagined. It's like 20 categories: keyboard, mouse, monitor, lamp, and so on.
Oscar: And it had three domains. One was pictures taken with a cell phone, one was with the DSLR, the digital SLR, and one was sort of stock photos from Amazon. And it was so few examples per class, like five. So the whole dataset was, I don't know, 100 samples per domain.
Oscar: And the number of papers that were published on this dataset was completely absurd. Because you have all these splits, you can train on Amazon and go to the DSLR, you can train on the DSLR and go to [00:54:00] the webcam. So there was always some way to beat state of the art. And it's like, God, how is this possible?
Oscar: And it just kept going. As far as I know, they're still publishing on this dataset. And everyone I talked to in the field was sort of embarrassed about it. They're like, yeah, it's a terrible dataset, but it has sort of entrenched itself as the default. And here we go, a real cottage industry, just a paper mill.
Oscar: Yeah.
Theresa: Yeah. One of those things where, if you don't include it, a reviewer will ask: why didn't you test on the Office dataset?
Oscar: Yeah. Totally. And these methods were so complicated. This was before deep learning. So, God, it was these huge graphs of: this, and then this, and then this, one thing fancier than the other. And the complexity of those algorithms compared to the size of the dataset just made no [00:55:00] sense at all.
Oscar: Anyway.
Theresa: Yeah. I think the dataset problem definitely hasn't gone away, though. There's actually been an interesting paper this year looking at which datasets are used, I think even in AutoML specifically. And that's where I got the 20-year-old dataset thing from, because they actually did find a fairly old dataset,
Theresa: I think it's about 20 years old, that was still in use. Things like that.
Oscar: You have to share that paper with me. I'd be interested. Right.
Theresa: Yeah, I can look it up. Yeah, I think that's really hard. That's also where I sometimes have the feeling that the AutoML research that happens within the research community is a bit disconnected from what applied work actually looks like, even though the pitch is always: we do this for people who apply ML. Because you test on such different problems.
Oscar: Yeah, yeah, I [00:56:00] mean, we've all, you know, been there and tried something on one data set and then come to another data set and it just doesn't work. Or even on one data set. I mean, there's one figure that I keep not seeing in academia. At my previous job, I told the team: you should always do this sort of backwards ablation on your data set.
Oscar: You can't magically conjure up more data, but you can conjure up less data. Again, maybe I'm a little bit jaded, but to me, all machine learning comes down to: pick a hyperparameter dimension, like learning rate or number of convolutional layers or something else.
Oscar: Then you try different things along that dimension, and whatever works best, you pick it, and then you try another dimension. All you get from machine learning experience is some sort of intuition for which dimension to try. Other than that, you just have to be [00:57:00] a good engineer and try things as quickly as possible.
Oscar: Okay, so that's my very jaded take on this whole field. But one thing you definitely should do is the reverse ablation, where you don't just look at which method is best at the current amount of data, but you look backwards and see how it performs at half the data, a quarter of the data, an eighth of the data.
Oscar: And then you plot that out. Because it could well be that the second-best method has a much steeper trajectory: if the x-axis is the amount of data and the y-axis is performance, it starts slow but then slopes up. You can very clearly see, if you draw it out, that once I have twice as much data, which I will have in a month at a company, especially a car company, that will actually be the better method.
Oscar: So you should try to select the method for the data volume you will deploy on, the volume you will actually have, rather than the volume you arbitrarily [00:58:00] happen to have. And that's certainly true of academia. These data sets, the size of the data set is treated as some sort of universal constant.
Oscar: Right. And people always evaluate on the full test set or the full train set, sorry, but that's just a completely arbitrary data volume. It's much more interesting to look at the full sweep backwards and see how your method actually generalizes across data volumes, and how you would expect it to work with twice the amount of data, you know? I mean,
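(A minimal sketch of the backwards ablation Oscar describes, purely for illustration: it assumes scikit-learn and its built-in digits data set, and trains the same method on an eighth, a quarter, half, and all of the training data to expose the trend. Comparing two candidate methods would mean producing one such curve per method and looking at which one slopes up faster.)

```python
# Backwards data ablation: train the same model on increasing fractions of
# the training data and report test accuracy at each data volume.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
X_train, y_train = shuffle(X_train, y_train, random_state=0)

for fraction in (1 / 8, 1 / 4, 1 / 2, 1.0):
    n = int(len(X_train) * fraction)
    model = LogisticRegression(max_iter=5000)
    model.fit(X_train[:n], y_train[:n])
    acc = model.score(X_test, y_test)
    print(f"{n:4d} training samples -> test accuracy {acc:.3f}")
```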
Theresa: Yeah. That's a really good point because I mean, obviously then if you say, I want something that works in low data regimes, you pick a data set that's used in these regimes, but then it's a different data set as well. And then you don't really get a reliable estimate of that trajectory. You just have two data sets of different sizes and [00:59:00] see that it works differently for them.
Theresa: That's much less reliable.
Oscar: Yeah, and it's so easy to do, and I see it very seldom. I think people in academia are not that interested in making those sorts of arguments. They want to make the argument, at least when I was publishing, that ultimately what the reviewer cared about, because there's no theory, is just what works best.
Oscar: So you have to show that, and then the rest is just, you sort of make it sound like you know what you're talking about. But ultimately what matters is performance on a certain benchmark. So maybe they're less interested in that.
Theresa: I'm not even sure it's always interest, because obviously, when you get started in academia, you're kind of brought up in an academic environment, in a sense. And if these questions just aren't asked, I don't think you learn to ask them yourself. Like, I think I wouldn't have thought about [01:00:00] this, because I don't think I've ever really seen it in the papers I read.
Theresa: Right. So I think it's then very hard to reinvent this question if it isn't posed to you, in a way.
Oscar: Yeah. And I mean, you know, at least you should have the academic honesty. A lot of times, and I've done this myself: when we published what we called PointPillars, which is one of my most cited papers, it's for detecting objects in 3D point clouds, we made a statement effectively saying PointPillars is the fastest and most accurate method for object detection in point clouds, period.
Oscar: But the correct statement is: it's the fastest and most accurate method for detecting objects in point clouds on this data set, with exactly this amount of data,
Theresa: Yeah.
Oscar: right? Because you go to another data set, who knows? You get twice as much data, who knows? [01:01:00] But, you know, that's because it's an empirical science at this point.
Oscar: You kind of can't overstate the importance of the data you're running on.
Theresa: And it's really hard to then make these, these full stop statements in a way.
Oscar: Yeah, you really can't. Maybe back in the day, when they were studying convex optimization theory for different kernels of SVMs. And I mean, obviously in machine learning there's machine learning theory. But as far as I understand the history of machine learning, that's why ANNs, neural networks, fell out of favor in the nineties: empirically they did roughly as well as support vector machines, but support vector machines have
Oscar: all these beautiful theoretical properties that people could really study and feel good about, you know. But [01:02:00] obviously you can't argue with performance, and now we're using neural networks. And now we just, I think,
Theresa: Yeah. I sometimes even think that application is kind of the true proving ground of research these days. Even if something is published at a conference, it's sometimes hard to believe it actually works until you see it in application. Would you agree with that?
Oscar: Yes. I mean, another way to tell whether a method really works, I would say, is the citation count. In a way, that will tell you if a method really works. Take something like batch normalization, right? I'm not a theory person, but I think there are some dubious statements in there about why it works. But it works, you know. And I think [01:03:00] you don't need to see it in an application to know that it works. The fact that sort of everyone adopts it to make their own results better also speaks to that.
Oscar: It works,
Theresa: Is that how you select new models for your service?
Oscar: Yeah, well, no, because we don't want to wait until the citation counts are high enough. We typically just look at leaderboards. So Hugging Face, that's a great place for leaderboards, for classification, for example.
Theresa: That makes sense.
Oscar: Yeah. And then you have to pick a small enough model that makes sense for our infrastructure, and so on and so forth, and
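(A rough, generic illustration of picking a model off a leaderboard and trying it out locally; this sketch assumes the Hugging Face transformers library and an arbitrary small example model, and is not a description of Nyckel's actual stack.)

```python
# Hypothetical example: load a small pretrained text-classification model
# from the Hugging Face Hub and run it, to check that it fits your latency
# and infrastructure budget before committing to it.
from transformers import pipeline

# The model name below is just one small, well-known example; in practice
# you would shortlist candidates from a leaderboard for your particular task.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("The onboarding flow was quick and painless."))
```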
Theresa: I actually also wanted to ask, and this is, I guess, more of a personal question: what made you want to go down this applied machine learning lane? Because you did your PhD, you saw a good bit [01:04:00] of academia during that, I assume. Why not that? Startup life, I assume, is also quite a bit different from big industry or academia or something else entirely.
Oscar: Yeah, I think that's just me. I was always an engineer at heart, always interested in applications and making things work. Even CoralNet, my PhD project, is very, very applied. It's literally a web service, right? And I had a couple of publications, but they weren't particularly good.
Oscar: To be honest, it was just: here's how you get machine learning to work on these coral images. It was very applied, didn't have a lot of reach. So I've always been interested in making things work in practice. And I also think academia, if you're really on the cutting edge, the bleeding edge, of applied machine learning research, really the mainstream, can be really [01:05:00] cutthroat.
Oscar: You have to be so fast, the papers come out really, really quickly, arXiv just, you know. There's a saying: once you think of something, 10 other people think of it, because the field has advanced far enough that the idea is within people's reach. And then you have to move so quickly, and it's a very stressful process, because you have to run all these experiments, write it up, try your very best to maintain academic integrity and honesty and not say things you can't support, run the experiments in a fair way. And then you rush out and get an arXiv preprint out, and then you look around and there's another paper out that did basically the same thing.
Oscar: So now you've spent six months or a year, and it's not wasted, obviously, in terms of what you learned, and maybe you have a slightly different angle, but you might still not get any paper out of it because it was already published. [01:06:00] So that stress, I don't know if I liked that stress.
Oscar: Even though, like, the feeling is awesome of thinking, okay, I came up with the best method, at this point in time, for something. So, yeah, I think if I go back to academia, I'm not even going to try to compete on any of the standard benchmarks or anything like that. I'm going to do something very different. It would have to be the most academic type of research, the deepest type of research.
Theresa: It's interesting, though. I think you're the first person who's told me like, founding a startup is less stressful than academia.
Oscar: Yeah, I always thought doing a PhD was brutal. Because when you work at a company, people around you are really invested in your success, right? If you're not successful, the company is not successful. [01:07:00] When you're doing your PhD, I mean, your advisor, I had a great advisor.
Oscar: I think he's a fantastic person, but fundamentally he's not incentivized to make me graduate. It doesn't really matter for him, essentially. I'm very cheap, you know, I teach his classes, he gives me some grant money, but I'm very cheap labor.
Oscar: And it's sort of on me to come up with breakthroughs that are significant enough to get a publication, and the university has a requirement that you publish a certain number of papers before you can get your degree. So I thought that was very stressful, this feeling that you have to invent really new things and get them accepted by the community in order to proceed.
Oscar: And I don't know how to invent new things. It's super hard, right? Whereas in a company, even if you have a boss, the boss is invested: oh, you're struggling? Okay, we'll help you, we'll bring in this colleague who can [01:08:00] help you. It feels more like a team effort.
Oscar: But maybe that was just the lab I was in. I'm sure there are labs that are much more collaborative.
Theresa: Yeah, but I think fundamentally you're also right in a way, right? You are always forced to be new, to work on these common benchmarks and convince the community of something you think is good. Whereas if you pick an application and really say, okay, I'm solving this, and not just doing well on a benchmark that is somewhat theoretical, if not actually useless in practice, then it's a lot more tangible, right?
Theresa: I don't know.
Oscar: Yeah. And, I mean, with Nickel we are selling services to people. It's sort of the ultimate proof that it's useful. People use the service, they're super happy with it, they give us money. I don't have to wake up in the morning and ask myself, am I adding something? Am I helping anyone in this world?
Oscar: Right. [01:09:00] I mean, I'm helping in a very capitalist way, I guess. It's not a nonprofit. But with a paper, I would find myself asking, why am I doing this? It's a little bit selfish sometimes, where I'm doing it for my own fancy title. Which is weird, because it should be the other way around: if you're in academia, it's a life of service, you're accepting maybe a lower pay and you're advancing the field. And I think that's totally true in a lot of areas of academia. It's just, I think, in our type of field, where the work is so applied,
Oscar: like I said, very little of it holds water in practice. I don't know, maybe I'm making too much of it, but yeah.
Theresa: I mean, it's an interesting perspective. Especially what you said about Nickel being a really well-engineered but, in [01:10:00] the end, simple solution built from existing things that just does well in practice, versus the everyday life of someone doing research in a more academic setting, who would work on much more elaborate methods and settings but wouldn't have this direct feedback, and would maybe at times even have to fight to find a niche where their ideas actually work.
Theresa: That's a pretty big discrepancy. And I think it's very easy to lose track, in your own mind, of this bigger thing you're supposed to be solving, supposed to be working towards, because it's really far away. I mean, if we're talking about this idea of AutoML solving huge problems, that's very far away from where we are right now.
Theresa: We're good at solving simple problems, so that's too boring, in a way, for some academic research, and then you're stuck in a weird middle.
Oscar: Yeah. Yeah. Yeah. Yeah.
Theresa: That makes it really [01:11:00] interesting to hear your perspective on that, because I am one of these researchers working on things that are very far away from practice.
Oscar: Are you planning to stay in academia, then?
Theresa: Let's see. I mean, academia... I had a good PhD time. I had a really good supervisor, I had a good group, and I think for what I've done during my PhD, that suited me really well. So I was quite lucky there. But I don't want to do this whole do-five-postdocs-and-then-see-if-you-can-get-a-permanent-position thing.
Theresa: So I'm seeing where life takes me,
Oscar: Yeah. Yeah.
Theresa: and if that's academia, I think I would really enjoy that. I think there are definitely some things, as you mentioned, that are less than great, but thankfully reinforcement learning is not quite natural language processing, so the speed and the pressure are a bit lower.
Theresa: And if it's not, then it's not, because I can definitely see that. That's really [01:12:00] something I'm missing, this idea of: I just did something concrete, and someone was really helped by that, someone accomplished something with that. You know, that's really...
Oscar: Yeah, we all want to be useful, you know,
Theresa: Yeah. And I mean, I always love it when I hear about these cool AI tools being deployed somewhere. Like, for example, when I talked to my friend about this lab machine, that was great. And at the same time, I then sometimes feel like, ah, what am I doing with the compute and ideas I have? So I think I could also totally see it being satisfying to work in a more applied way.
Oscar: Yeah, reinforcement learning is cool, though. I always felt like that's the only real machine learning, like the true north star, because the formulation is really compelling, the way the problem is set up. It just feels more real, like it can do more.
Oscar: I'm very fascinated by it. I [01:13:00] haven't studied it very carefully, but
Theresa: Yeah, it's, it's sort of, it would be really great if it worked.
Oscar: yeah.
Theresa: That's why I'm kind of doing reinforcement learning in an AutoML group. I think I have it much easier than a lot of people here, because there's really limited prior work, let's put it like that.
Oscar: What does that even mean?
Theresa: So yeah, a big problem with reinforcement learning, as it's often done now, is that reinforcement learning as a problem setting is super general, right?
Theresa: Everything is reinforcement learning. The thing is, the algorithms constructed for reinforcement learning oftentimes also try to be that: they try to be an algorithm that can do everything. And then all the details of how that actually looks are in the hyperparameters and the models and,
Oscar: objective functions, right?
Theresa: and the objective function.
Theresa: Yeah, that's also a really big one. And they all interact: how you specify the objective interacts with at least five hyperparameters. And ideally you would also have a schedule for all of them, because your data acquisition is [01:14:00] totally dependent on your update quality at each step. So it's super complicated, all of it.
Theresa: And since it's so complicated, actually getting it right is really hard.
Oscar: Hmm.
Theresa: And then we have the situation where we have Atari, which at this point is over a decade old, right? It's one of these original benchmarks where a lot of people, even including myself, I would say, ask: practical relevance of Atari?
Theresa: Questionable. But then, on the other hand, we can't even select the best algorithm to use for each Atari game. Currently that's not something we can do, which is a bit embarrassing, let alone having one algorithm that actually performs even just
Oscar: You can't select, you mean you can't select without trying? Or you can't select even after trying, you're still not sure?
Theresa: Yeah, if you try, you can select. But I mean, training on an Atari game is pretty expensive, and that includes stuff like hyperparameter optimization, which is also expensive. So it's not a great situation, because then for each new problem we would have to [01:15:00] do the whole combined algorithm selection and hyperparameter optimization pipeline, which
Oscar: Yeah.
Theresa: is expensive, especially if you consider that real-world data in reinforcement learning is not quite as easy to collect as real data for a one-step problem like classification, because you then
Oscar: Yeah. I mean, you can't. Right. Right.
Theresa: Yeah, you basically have to fix the policy you collected with. So you can get expert data, but then the expert doesn't match what your policy has learned, and then you have this shift that
Oscar: Yeah.
Theresa: has been shown to be really hard to overcome.
Oscar: Yeah. I bet. I bet.
Theresa: Yeah, that's what offline reinforcement learning tries to do, but it's just miles behind.
Theresa: And all of that is just very hard. So there's a lot of focus, obviously, on trying to get either zero-shot generalization or at least online adaptation to work, and I think there's progress on that. And in a lot of ways there are these efforts to, during training, just [01:16:00] make the training process fit wherever the policy is at right now.
Theresa: But that's still kind of limited, and that's a lot of what I try to work on: this idea of hyperparameter optimization not as in, oh, let's try a learning rate, see how well it does, and hopefully do better next time, but have it be adaptive, as in: oh, we improved in this way, and maybe even something like, our predictions on this are stable.
Theresa: Let's adjust in one way or another. It's really hard so far, but hopefully that could really solve a lot of these stability issues in reinforcement learning. A lot of work to do.
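(To make the contrast concrete, here is a toy sketch of adaptive hyperparameter control during training, as opposed to fixing a value up front. This is a generic, hypothetical illustration with a stubbed-out evaluation function, not Theresa's method or any specific published algorithm: every few updates, recent progress is checked and the learning rate is halved when improvement stalls.)

```python
# Toy illustration of online hyperparameter adaptation in a training loop.
import random

def evaluate_policy(lr: float, step: int) -> float:
    """Stand-in for a real rollout: returns a noisy 'average return'."""
    return 100.0 * (1.0 - 1.0 / (1 + 0.01 * step)) - 50.0 * lr + random.gauss(0, 1.0)

def train(num_updates: int = 200, window: int = 10) -> float:
    lr = 1e-2
    returns = []
    for step in range(num_updates):
        # ... the actual policy update using the current lr would go here ...
        returns.append(evaluate_policy(lr, step))
        if step % window == 0 and len(returns) >= 2 * window:
            recent = sum(returns[-window:]) / window
            previous = sum(returns[-2 * window:-window]) / window
            # Crude online schedule: back off the learning rate when progress stalls.
            if recent <= previous:
                lr *= 0.5
            print(f"step={step:4d}  avg return={recent:6.2f}  lr={lr:.5f}")
    return lr

if __name__ == "__main__":
    train()
```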
Oscar: One of my fondest, or strongest, memories from Berkeley, so I was there for just a year, I did a postdoc for a year on that floor, and that floor is amazing. You have Jitendra Malik, you have Pieter Abbeel, you have Leo Ross, you have crazy high-skill [01:17:00] faculty.
Oscar: But the facilities themselves are really sad. Compared to any European lab, I was shocked when I got there. It's just crammed people, tiny desks, cubicle-looking things, carpet. It's just not a very fancy-looking room or whatever.
Oscar: So, crammed in behind me were three reinforcement learning researchers, and they were trying to get a physical robot, like this big, to walk, right? And this was maybe eight years ago, and they were just working day and night. It's a little thing, and they had it hanging by the scruff of its neck because it couldn't stand, so it was hanging on a tripod most of the time. And they would code and code and code, and then they would put it down, and every time, they just couldn't get it to work. And then late one night, probably just before their paper deadline, they were all sitting around it, the robot hanging on the tripod, all sitting like [01:18:00] this, in a circle around it, completely quiet, like they'd just given up, there was nothing more to say. It was just such a moment, and I realized how hard reinforcement learning, especially on physical systems, seems to be.
Theresa: Yeah, that's a whole other issue. I mean, we now have a robot that could walk, like a dog.
Oscar: And they can learn from scratch using reinforcement learning?
Theresa: Yeah, but it's really fresh. We only got it a few weeks ago, so I think that joy is still ahead of us: actually getting it to work outside of the simulation.
Oscar: I see, right, okay, yeah. No, I mean, it's such an awesome thing to work on. Just any object that can move its limbs, getting it to propel itself forward in the real world. It's like, God. Maybe you can educate me: is that solved? Like, are there [01:19:00] algorithms, or can I buy a robot and a dev kit that gets it to walk in a rudimentary way?
Theresa: I think so. I would say, I mean, solved is maybe a strong word. The thing is, if you just want the robot to learn one simple thing, like walk straight ahead or do a handstand or whatever, it's probably doable in some way. You might need to do some research and make some research code work for your setting or something like that.
Theresa: But I think that's doable. The problem is more if you then say, oh, I want this thing walking the halls and greeting everyone with a handstand. Combining things like that, and then making it work in the physical world without hurting anyone or itself, that's a whole different issue. But yeah, robotics has gone really far.
Theresa: Is there anything you feel that you haven't talked enough about yet or anything that you want to bring to the audience?
Oscar: Not really. I really [01:20:00] enjoyed the conversation. It's great. I mean, most of these podcasts, I haven't done one in a while, but most of the ones I'm on, it's like someone is just trying to sell something, you know, they're an agency recruiting ML talent, so they just do this. It's like I'm not talking to someone who actually knows the field.
Oscar: Actually, the previous one I did was great as well, with the Anarchy AI guy. But it's just great to talk to someone who knows the field. Yeah.
Theresa: Yeah. Actually, for a second when you first wrote, I was like, hmm, do I want to answer? I'm not sure how much I want companies on here, thinking it might end up being one of these sales things. But no,
Oscar: I didn't sell too much.
Theresa: No, no. And I think it's really interesting. I still rarely get to talk to people who actually work on more practical things.
Theresa: It's always great to get that perspective. So yeah, thanks for being on here.
Oscar: Yeah. Thanks for the conversation.