Discovering Temporally-Aware Reinforcement Learning Algorithms
The AutoML Podcast · June 24, 2024 · 00:51:15 · 35.22 MB


Designing algorithms by hand is hard, so Chris Lu and Matthew Jackson talk about how to meta-learn them for reinforcement learning. Many of the concepts in this episode are relevant to meta-learning approaches as a whole, though: "how expressive can we be and still perform well?", "how can we get the necessary data to generalize?" and "how do we make the resulting algorithm easy to apply in practice?" are problems that come up for any learning-based approach to AutoML, and they are some of the topics we dive into.

[00:00:00] Hi and welcome back to the AutoML Podcast. Today's episode we're talking about something I really enjoy personally which is automated reinforcement learning. Now, I know that reinforcement learning is maybe not the most well-known target domain of AutoML. We often talk about supervised learning

[00:00:19] There's a lot of work on that, a lot of work on tabular data on efficient settings and in RL You could say we're not quite as far because it's a complicated setting. There's still a lot of uncertainty. It's often expensive It's unreliable but

[00:00:36] it's still a very interesting AutoML domain, and actually our guests for this episode can tell you how it's not as inefficient, not as unreliable and not as brittle anymore as you might remember it from maybe even one or two years ago. Specifically, we're talking about

[00:00:54] meta-learning reinforcement learning algorithms to fix a lot of these issues: to get more stable algorithms, to get better progress, to be quicker, and in this specific work to get a better trade-off when it comes to exploration

[00:01:09] and exploitation. And I think that part is very universal for AutoML tools, right? That's something we're usually very interested in, and that's something my guests Matthew Jackson and Chris Lu have really managed to learn into an algorithm. So we talk about that

[00:01:28] But a lot of our conversation is actually also about How much do we want to try to learn and how much do we want to optimize? And I think that's also a very current topic in the world of AutoML if we think about newer innovations like PFNs

[00:01:47] that try to meta-learn a lot of the things that we would have done with Bayesian optimization. And the question then becomes: is one better than the other? Where is the trade-off in between? How much flexibility do we want to give this end-to-end learned pipeline, or how much more

[00:02:06] reliability and generalizability do we get if we really optimize for our results? And I think that's very interesting, and that's something they've both thought deeply about in this meta-learning setting, this meta-algorithmic idea of learning something about algorithms. Do we learn a small part?

[00:02:27] Do we learn the whole algorithm? Do we learn a component? How can we do that? We also talk about what kind of data basis you would need to meta-learn something like that. So I think even if you're not very interested in reinforcement learning, these parts of our conversation

[00:02:46] could be transferred to AutoML as a discipline right now, where meta-learning is becoming increasingly powerful and increasingly important. So I hope you bear with us

[00:02:57] diving into the world of reinforcement learning, and let me know if you would be interested in more reinforcement learning content in the future. I hope you enjoy. Hello everyone and welcome to the AutoML Podcast. I'm Theresa Eimer, and today we're talking about something I'm very excited about:

[00:03:14] reinforcement learning, more specifically how we can discover temporally aware reinforcement learning algorithms. And the people that will tell me how that is done are Matthew Jackson and Chris Lu. Could you two, before we get into your paper, maybe quickly introduce yourselves? Tell us what you're working on.

[00:03:29] Hey, I'm Matt and I'm a PhD student at Oxford with Jakob Foerster and Shimon Whiteson. I started my PhD off working on this topic, so I've been doing meta reinforcement learning since the start.

[00:03:45] I've recently been looking a bit at world modeling with diffusion models, but I don't think we'll have time to get into that today. Yeah, I'm Chris. I'm also a PhD student at Oxford.

[00:03:57] I'm with Jakob Foerster, and most of my research is also around kind of meta-learning. In particular, a lot of my work has been focused on hardware acceleration for RL, and this enables a lot of new meta-learning techniques

[00:04:10] So in particular one of them is what we're about to talk about which is this idea of evolving new RL algorithms I think is a very promising direction Yeah, and since you gave me this nice transition

[00:04:21] What kind of meta-learning are we talking about when you say discovering new algorithms? How can I imagine that working? You can kind of think of Normal meta-learning as trying to like learn across a distribution of tasks

[00:04:35] But what we're talking about is maybe a more broader version of this Which is just generally learning how to learn as an algorithm And so if you kind of look at the history of RL A lot of it is like algorithmic improvements over time, right?

[00:04:48] So you start with something like DQN, and then one follow-up direction was this idea of TRPO, which is trust region policy optimization, and so on and so forth

[00:04:57] There's this idea of like we keep developing better and better algorithms just through maybe theory or like just empirical results Or maybe just like intuition But instead we could try to just automate this step, right?

[00:05:07] Can we maybe try to evolve better RL algorithms that then outperform all of our existing ones on our benchmarks and tasks? Yeah, and you did that in two different ways So before we get into the temporal part of the discovery which is

[00:05:23] somewhat of an extension to both the learning algorithms you examine, can you maybe start with how you meta-learn algorithms in both cases? So you have one case that I would more or less call meta-learning from scratch and one that is kind of a more warm-started approach

[00:05:41] Right, how does that work? Yeah, sure So to take a bit of a step back and give the big picture on this When meta reinforcement learning sort of started as a field I think around 2017 it was done by effectively training just like RNNs as policies and

[00:06:03] training an RNN, just a recurrent model or any model with some form of memory, to solve a bunch of different tasks at meta-train time. And so the key insight there was to maintain your recurrent state

[00:06:19] Throughout your experience with the task so you never reset it unlike in normal RL where every time your episode resets you reset your memory and you start from scratch and The idea behind that was that your agent has

[00:06:36] access to all of its interactions with the task at hand. And so since it's conditioning on its whole history, it can theoretically represent a learning algorithm: it can internally condition on the embedding it's inferred from that history and transform its own policy based on that.
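For readers following along, here is a minimal sketch of the rollout loop Matt describes; `rnn_policy` and `env` are hypothetical stand-ins, and the one detail that matters is that the hidden state survives episode resets within a task:

```python
import numpy as np

def meta_rollout(rnn_policy, env, num_episodes, hidden_size):
    # The hidden state persists across episode resets within a single task,
    # so the policy can infer the task from its history and adapt "in context".
    hidden = np.zeros(hidden_size)
    history = []
    for _ in range(num_episodes):
        obs, done = env.reset(), False      # the episode resets...
        while not done:                     # ...but the memory does not
            action, hidden = rnn_policy(obs, hidden)
            obs, reward, done, info = env.step(action)
            history.append((obs, action, reward))
    return history  # hidden would only be re-initialised for a new task
```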

[00:07:00] That relates really to in-context learning, if there's a simple buzzword that corresponds to what you're talking about? Yeah, yeah, exactly. In-context learning is the new and improved meta-RL, but yeah, originally this was done with, I mean, typically RNNs, and

[00:07:19] They got some really cool results. So things like you train a Robot morphology like a cheetah over a load of different values of gravity and like friction coefficients and then you deploy it on a

[00:07:35] New value that's completely unknown and you see that in a few steps the agent like probes its environment and tests for What the value of these parameters are and then can instantly adapt its policy to get this really strong like adaptation and in distribution transfer

[00:07:54] But the limit of this approach is is as you start getting to more extreme values of these parameters the agent is much less able to adapt and If you try and move it onto a completely new environment, so you go from like

[00:08:10] from, I don't know, a robot to an Atari game, it's completely hopeless. So the thing that me and Chris work on now, which was your original question, is meta-learning just the objective function for an agent.

[00:08:26] So rather than trying to represent the policy and the update rule all with an RNN We try and just learn this one component that takes in Things like the reward of the agent and the action probabilities

[00:08:39] And it outputs the objective that gets backpropagated through the agent. And so, because me and Chris kind of build off different algorithms for this, the one I build off is Learned Policy Gradient, or LPG.

[00:08:57] And in the LPG paper they showed that you can train an objective function on completely toy environments like grid worlds and then deploy on Atari tasks and get some like actually impressive performance that's Somewhat competitive with like handcrafted objective functions
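To make the interface of a learned objective concrete, here is a hedged sketch of what the inner-loop agent update can look like; `log_prob_fn` and `objective_fn` are hypothetical stand-ins, not the paper's code, and the point is simply that the objective only sees task-agnostic features:

```python
import jax
import jax.numpy as jnp

def agent_loss(policy_params, objective_params, batch, log_prob_fn, objective_fn):
    """Sketch of an inner-loop agent update with a learned objective.

    The learned objective never sees raw observations, only task-agnostic
    quantities (scalar rewards, action log-probabilities, done flags), which
    is what lets it transfer across very different tasks.
    """
    log_probs = log_prob_fn(policy_params, batch["obs"], batch["actions"])
    features = jnp.stack(
        [batch["rewards"], jax.lax.stop_gradient(log_probs), batch["dones"]], axis=-1
    )
    per_step_signal = objective_fn(objective_params, features)  # small meta-learned net
    # The agent follows this learned signal; the objective's own parameters are
    # trained in an outer loop (meta-gradients or evolution), not shown here.
    return -jnp.mean(per_step_signal * log_probs)

policy_grad = jax.grad(agent_loss, argnums=0)  # inner update w.r.t. the policy only
```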

[00:09:16] Yeah, that's interesting. I think that's really nice context you gave us there with a short history. Both of you: would you say this generalization came from the fact that we can explicitly

[00:09:29] make a separation between the policy learning and the objective function? Is that the helpful part here? Yeah, I think, to take a similar kind of big-picture approach, there are multiple ways we can get generalization, right? One really common one nowadays

[00:09:47] Is to just have a ton of data, right? And then you can kind of like interpolate between your data and you'll do well On like, you know In distribution like unseen things and if the distribution is really big

[00:09:57] Then you can cover like everything you possibly see and this works well for domains We have a lot of data but in domains like For example, this type of very broad meta rl We don't have so much data, right?

[00:10:09] You can think of like in something like language modeling like each document is a data point Maybe or each token is a data point and you have like, you know, trillions of these in rl

[00:10:18] Each game is a data point, right? And we have like a few dozen popular rl environments, right? It's nowhere near enough to really properly generalize And so one approach to kind of get this generalization out of there

[00:10:30] Or to get this generalization into our system is to try to you know Find good inductive biases or good kind of architectural structures that enable generalization And so kind of like you can think of what Matt mentioned LPG has one take on this

[00:10:44] And what I was building on called LPO is another take on this where Basically, we can try to find some theoretical framework for existing rl algorithms So for example like ppo trpo all under this one theoretical framework, which we call mirror learning

[00:11:01] And the gist is we can then search in this space of theoretically sound algorithms And so it's more restrictive since we're only meta learning, you know Something like the objective function like you're saying or a small part of the algorithm

[00:11:13] But also means that we get more kind of guarantees with respect to generalization It's kind of ties into this bigger picture of kind of the difference between maybe auto ml and meta learning Where you can imagine something like hyper parameter tuning

[00:11:27] Is a scenario where you can imagine like a wide Range of like hyper parameters would probably work decently well on your task And you're only tuning maybe a few parameters on you know a handful of tasks

[00:11:37] On the other end is like this black box meta learning type of idea We have a ton of data and a ton of parameters And so kind of you can think of our approach is kind of taking a middle ground here Where we're

[00:11:49] We're evolving small neural networks that correspond to the objective function. We're taking a very tactical approach to picking what to evolve or what to meta-learn, in order to get this trade-off between generalization and expressivity

[00:11:59] Right, so you can imagine as you tune your hyper parameters. It's not very expressive You can only do a handful of things but it will probably generalize to unseen Tasks whereas in general black box meta learning

[00:12:09] It won't generalize the things that hasn't seen before but it's very expressive and you can really outperform Anything handcrafted And so that's the trade-off we're looking at here, which is as we restrict it it will Generalize better, but perhaps be less expressive and

[00:12:24] That's a long way of answering a question about objective functions. Yeah, it's a targeted trade-off between these two things, which is an interesting intersection, because I think a lot of people in AutoML are also more and more exploring meta-learning and how we can

[00:12:38] Fit that together with our traditional optimization regimes and how we can find a good trade-off that generalizes, right? So it's a bit of a side note But there was a paper by Louis Kirsch

[00:12:49] who is a co-author on the paper we're talking about today, that he did a few years ago on introducing symmetries to meta-RL. And the way he put it was that you want your algorithm to be symmetric, or effectively invariant, to all the things that

[00:13:08] You don't care about for generalization So you don't want it to learn from like from task Specific patterns that aren't generally informative and so in his case that was things like the Ordering of action and observation dimensions and I mean even the number of them

[00:13:26] like if you can train an architecture, or you have an inductive bias, that doesn't encode that information in any way, then in theory it means it can generalize outside of that. And that's something you get from objective functions, because

[00:13:45] it doesn't matter what the input and output space of the policy is; all you're getting in is a scalar reward and action probability and maybe a few other features, like time, which we discuss today, but that's all you're

[00:14:01] given, and so you can't overfit to these task-specific details. Yeah, and I imagine the only alternative to that would be to just have enough data, quote-unquote, whatever that would mean, to be able to still filter out the noise introduced by these task-specific features, which, as Chris said, is

[00:14:20] yeah, something we're probably not gonna have, at least in the next year or two. Let's see where synthetic data generation goes. Yeah, but Chris, you mentioned mirror learning, and can you very briefly just give us an intuition on what that is and how that makes LPO different

[00:14:38] from what Matt described with LPG? Yeah, so I guess one way to kind of think about RL is through the lens of one of the simplest algorithms, called REINFORCE. And the gist of REINFORCE is, hey, if you did something and it was good

[00:14:56] Do that more and if it was bad do it less. It's like the very high level idea there And One thing is that if I keep updating if I have a fixed Set of data like oh, I collected some data and I'm just like keep doing

[00:15:09] Oh, this was good to keep doing anything was bad do it less And I just kept updating a bunch of times on the same batch of data I would probably really massively over fit to this like small batch of data collected from the environment, right?

[00:15:20] and so what algorithms like ppo do is Basically introduce this idea of like a clipping ratio, which is saying hey like after I've updated enough on this data I should probably stop updating And you can think of mirror learning as kind of generalizing this concept of

[00:15:37] One you should maybe stop updating or how much you should update on a fixed set of data And so it's just kind of like a conservative Force for a new loss function that stops you from kind of like taking extreme updates on like a limited set of data

[00:15:50] Is the high level idea there and a lot of algorithms kind of fit in this framework and are just small variations on How to do this conservative function Yeah, and you'll use this to get some nice guarantees about

[00:16:02] what the outcome of LPO will be in the end, right? As far as I'm aware, if you simply meta-learn the whole function, as in LPG, you won't really have any guarantees on the outcome, but in LPO, because you don't actually learn the whole function,

[00:16:17] You learn a drift function right you have these guarantees. What is the drift function? What guarantees are we talking about? Yes, exactly. Sorry. So the drift function is this conservative force I was just talking about I should have made that more clear

[00:16:30] And the theoretical guarantees are just like in the you know, they're very like, you know Big guarantees that you can only get in rl just like in the infinite limit you're guaranteed to converge to the optimal policy this is

[00:16:40] obviously a statement with a lot of asterisks and things like that. But if you want to learn more about the theoretical framework, it was developed by one of our past lab members, Kuba Grudzien, at ICML 2022 I believe, and he just introduces the mirror learning framework

[00:16:57] And yeah, the gist is a lot of existing algorithms fit in this framework, and all the algorithms within this space have these theoretical guarantees. And what's really cool is he even shows that some algorithms people proposed that don't work do not fit this framework

[00:17:09] And so it seems to be like a decent rule of thumb for maybe what does and doesn't work In rl. So it seems to have like some amount of predictive power as well, even though it's a very maybe loosely applied theory
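For readers who want to see PPO through the mirror-learning lens Chris describes, here is an illustrative sketch (not the paper's code): PPO's familiar clipped surrogate can be rewritten as plain policy gradient minus a non-negative "drift" penalty, and LPO replaces that handcrafted penalty with a small meta-learned network.

```python
import jax.numpy as jnp

def ppo_drift(ratio, advantage, eps=0.2):
    """PPO's clipped objective in mirror-learning form: the usual
    min(r*A, clip(r, 1-eps, 1+eps)*A) equals r*A minus this non-negative
    penalty, which is zero for small policy changes and grows for large ones."""
    clipped = jnp.clip(ratio, 1.0 - eps, 1.0 + eps)
    return jnp.maximum((ratio - clipped) * advantage, 0.0)

def ppo_policy_loss(ratio, advantage, eps=0.2):
    # Identical to the standard -min(r*A, clip(r)*A) PPO surrogate.
    return -jnp.mean(ratio * advantage - ppo_drift(ratio, advantage, eps))
```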

[00:17:21] So we have these two approaches: in one case learning an objective from scratch, and in the other case learning a drift function starting from, in this case, the original PPO objective, right?

[00:17:32] That's not very time aware just yet. So what's the change? How are you looking at time as a factor? Sure, so the thing we introduce in this paper Is the idea of like making a very simple change to to the feature space

[00:17:51] Like to the input of both of these objective functions to try and make them adaptive to an agents learning progress So all we do for this is we tell the objective functions How far the agent is through training as like a percentage of its total budget?

[00:18:09] and how many steps it's going to have in absolute terms to update the agent. And so the motivation for this was kind of that in reinforcement learning the fundamental trade-off is between exploration and exploitation,

[00:18:26] and to be able to do that to know like when you should be Increasing the entropy of your policy so it explores more And when you should be starting to hone in on the things You know have worked and like the areas of higher reward

[00:18:40] You need some idea of like how long you have left effectively So based on how many steps I've taken and How many I have left to explore like if i'm coming to the end of my training period

[00:18:52] I want to just give you the optimal policy based on what i've seen so far So to be able to even represent this trade-off you need You need access to time as an input and that's that's what we call temporal awareness
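In sketch form, the extra inputs being described look something like this; the exact encoding of the budget (a log transform here) is an assumption for illustration, not necessarily the one used in the paper:

```python
import jax.numpy as jnp

def temporal_features(update_idx, total_updates):
    """Temporal-awareness features: how far through its budget the agent is,
    plus an encoding of the total budget itself, so the learned objective can
    trade off exploration against exploitation over the lifetime."""
    progress = update_idx / total_updates                       # fraction of lifetime used
    budget = jnp.log(jnp.asarray(total_updates, jnp.float32))   # scale-friendly budget encoding (assumed)
    return jnp.array([progress, budget])

# These features get appended to the learned objective's usual inputs
# (rewards, action probabilities, ...) at every update.
```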

[00:19:08] So effectively the method we propose in this is adding temporal awareness to these methods. There is one other technical change we make, which is that there's a divide between the way you optimize LPO and LPG, which is that LPO uses evolution strategies and LPG uses meta-gradients.

[00:19:31] So Chris can talk more about evolution strategies because he's very well versed in that, but for meta-gradients, you train LPG originally by updating the underlying agent for a handful of steps and then backpropagating through time,

[00:19:49] and like taking your your the derivative of your like meta parameters of the LPG parameters with respect to How well the updated policy did And we found that to learn how to condition on time we had to change to

[00:20:04] evolution strategies, like we had to optimize both methods with this approach. Yeah, a little bit about temporal awareness: existing RL algorithms also kind of have temporal awareness. So PPO for example, and TRPO and DQN, these things are all fixed algorithms, right?

[00:20:22] Throughout the agent's lifetime these things don't really change in their objective function, but there is kind of one thing that does change, which is the learning rate schedule, right?

[00:20:30] So one popular implementation detail in PPO is you can, you know, anneal the learning rate from an initial value to zero throughout its lifetime, and this seems like a pretty weak update to the objective function, or not even really to the update rule, right?

[00:20:45] It's maybe doing slightly smaller updates throughout your lifetime towards the end. And so this is trying to kind of expand on this, where it's saying, hey, why don't we be more temporally aware, like aware of the agent's lifetime?
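As a point of reference, the "weak" temporal awareness mentioned here is usually just this:

```python
def annealed_lr(update_idx, total_updates, lr_init=2.5e-4):
    """The one bit of temporal awareness most PPO implementations already
    have: linearly anneal the learning rate to zero over the training budget."""
    return lr_init * (1.0 - update_idx / total_updates)
```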

[00:20:58] Right and it seems as though in our paper We found that this has a huge difference and we can really discover new algorithms that vastly or that very heavily outperform PPO on unseen tasks Yeah, and I think That's really interesting that you mentioned that I mean if we

[00:21:13] think back to this sliding scale where we have traditional black-box optimization on one end and black-box meta-learning on the other end, what would correspond to this on the optimization side, wouldn't that in the end be something like a dynamic

[00:21:28] scheduler, like population-based training, optimizing to find a schedule that would, probably very problem-specifically or even potentially seed-specifically, find a similar trade-off that you're now able to encode in the algorithm? Yeah, and it definitely runs into a similar

[00:21:47] Part of this like broader trade-off we talked about whereby adding temporal awareness into these algorithms We are Making it more expressive which allows it to do a lot better At the same time you kind of risk overfitting to individual tasks, right?

[00:22:00] Since we just don't have so many tasks and so much data and so It's very possible that for maybe a completely unseen task or like a task that's very different The schedule that it learns no longer applies so well

[00:22:11] We don't think that seems to be the case so far based on some of the ablations that we've run, but it's very possible that for different tasks you'd want different-looking schedules. And definitely, this is just a very targeted approach at this expressivity-generalization trade-off

[00:22:25] Yeah, but back to evolution strategies. So I find it interesting that you went for evolution strategies on both. What were your observations on why meta-gradients maybe didn't have the same quality in the time-aware case? So the key with meta-gradients versus ES:

[00:22:46] meta-gradients require you to do backpropagation through time when you update your objective function, and that's because we represent it as an LSTM.

[00:22:57] And so we have to roll out the agent for a few steps and then backprop through all of those updates and parameters. And so if we're trying to optimize

[00:23:09] With respect to time and we're trying to optimize for the performance all the way at the end of training That would mean that we have to train an agent from the start all the way to completion and then

[00:23:21] Store all those interactions and all those gradients and back prop through all of them to update our method parameters And like this isn't we're not close to being able to fit that in in memory is the problem

[00:23:35] So I I think for a bit more context. It's the point is that like to be able to Teach your objective function how it should be updating the agent when like the time is Like near zero and you're on your very first steps

[00:23:50] you have to see how those updates affect the performance right at the end of training. And so yeah, meta-gradients don't give you that because of this truncated backprop, and ES gives you a really nice, natural mechanism for this,

[00:24:06] because the way you train these algorithms with ES is you initialize an agent and you train it all the way from initialization to the end of its lifetime with your meta-parameters plus a slight perturbation, and all you use to evaluate that

[00:24:25] perturbation is the final fitness of the agent, its final performance on that task. And so that means you're evaluating the quality of your learned objective at every point in time based on just the final performance. So you're naturally learning parameters that are good

[00:24:45] Throughout training not just not just at the end One really nice thing about the setups also It's kind of like analogous to what happens on earth right where we evolved to do reinforcement learning

[00:24:56] from an outer loop: there's this kind of broad natural-selection evolution happening in the outer loop, and the inner loop is the reinforcement learning algorithm. And that's kind of what's happening here, where on the outer loop we're doing this very loose evolution type of setup

[00:25:07] that has not so much signal, but if you do it enough times then we can properly update the reinforcement learning algorithm, and in the inner loop we just run the algorithm, right? You know, the human learning algorithm probably doesn't change so much genetically throughout the human's lifetime, right?

[00:25:21] It's something we generally think of as pretty fixed. And I guess, if you're kind of interested in understanding the trade-offs between maybe evolution and meta-gradients, one of our collaborators on the original LPO paper,

[00:25:35] Luke Metz, wrote up a really nice blog post kind of explaining how, when you basically take gradients through these repeated computations, you end up with very high variance and kind of nasty updates. And Luke has a lot of work on

[00:25:52] evolving new optimizers, and that was kind of a big inspiration for evolving these objective functions for RL. Yeah, that's a good point, also with the accumulated variance that you get. Plus, ES is obviously something that is nicely parallelizable, because you only need the black-box result.

[00:26:09] But something that is a question is how you do it in this meta setting where you want transfer, right? That is, the evaluation, and I think Matthew kind of alluded to that. How did you solve that?

[00:26:20] I think there's some people who might be interested in how to do es if you trying to generalize at the same time Sure, right. Yeah, so that that's kind of the The problem we faced applying es that we hadn't seen anywhere else was trying to do multitask training

[00:26:37] with ES. So the way you usually do ES is you sample a bunch of perturbations to your meta-parameters,

[00:26:56] The objective function for all these permutations is the same And so like you can compare their fitness and you can derive an update from that And so we were instead looking at the multitask setting

[00:27:08] Where we're trying to like train LPG with es to solve multiple different tasks at once And so that meant we kind of had this like bias variance trade-off where We could either evaluate All of our permutations and all the different like candidates for es

[00:27:30] on the same tasks and like a few different tasks But that would be like very inefficient because I mean we'd have to for n tasks. It's like Scaling up the number of the agent trainings by n On the other side of the spectrum, we could evaluate each

[00:27:49] Each permutation or like each candidate On its own task But then that introduces this variance where If you have if you sample like a bad permutation of the meta parameters

[00:28:01] That happens to be on a really easy task, then it's going to look like the fitness is really good So what we're trying to look for is effectively a way of like Doing multitask training whilst normalizing the fitness across these different permutations And not just like

[00:28:19] evaluating on a single task every step because there is always that solution like We just sample one task and like use that each time set And so we found this like nice intermediary Where we evaluate We sample a batch of tasks

[00:28:37] And we evaluate each task on an antithetic pair of candidates And so an antithetic pair of ES candidates just means that like starting from the current value of the meta parameters You sample a permutation in one direction and that's one candidate

[00:28:55] And then you take the negative and you also Sample it in the other direction. And so that's your antithetic pair And so this has already been shown in ES to Like it has the same theoretical guarantees, but practically it reduces variance a lot because you have this like

[00:29:11] Symmetry in your like sample space But we found that this means you have a really natural way of normalizing task fitness Because you can evaluate this antithetic pair, which is like saying Okay, if I update my meta parameters forwards or backwards in this direction

[00:29:30] On a specific task, which which direction is better? So we do that and then you can just like rank transform or effectively like mask the The candidate in the pair that gets the higher fitness

[00:29:45] And so that means if, say, moving forwards in the direction of this perturbation gets better performance on this particular task, we'll add that direction to our ES update. And so when we do

[00:30:03] We kind of balance evaluating on multiple tasks with evaluating a lot of different permutations and like being relatively Sample efficient because every task gets the same like Waiting in our updates Yes, I think then we have all the building blocks

[00:30:22] Yes, I think then we have all the building blocks. We have the algorithms, we know how you update them, and we also know what you did, namely just add where we're at at that point of training. So obviously, since we're talking about it and you wrote a paper about it, we already know it works.

[00:30:37] But what are your impressions of the improvements that happen? How the time aware versions compared to the baseline versions? Very surprised how well it works Yeah, I think one really cool thing we do in the paper is we actually analyze what it learned and try to interpret it

[00:30:56] and so for LPO in particular You can really like Visualize what the objective function looks like since it's a very low dimensional function And what we find is it kind of learns this very general rule

[00:31:09] which is: early in the lifetime be really optimistic, and if something looks good, explore as much as possible, try things out as much as possible.

[00:31:20] And then as it gets later in the lifetime it becomes much more conservative its update and it becomes more safe And so what this kind of corresponds to Is like a different way of looking at the drift function

[00:31:34] I probably can't go into too many details on the podcast, since it's a very visual process, but in particular the high-level idea is basically that it increases entropy towards the beginning of its lifetime. It's saying, in general,

[00:31:50] Like you know if something was bad don't overfit to bad examples But if something was good, you know try to try it out more And then towards the end of its lifetime it's saying hey if something was bad definitely Don't do that anymore

[00:32:02] And if something's good, you know don't update too much on that right? It could just be like a you know high variance point and I think this kind of general idea of starting very optimistically and then ending up being much more conservative is like a

[00:32:17] Very powerful one that I think has not been explored so much In like handcrafted schedules for rl objectives and things like that um I think in general the really transferable takeaways on this paper are These types of

[00:32:32] Well, first of all like the method and approaches that we use to do this The second one like the actual discovered insights seem to be pretty useful

[00:32:39] I can't imagine that people will take the weights that we evolved and apply them to their problem, but maybe the high-level ideas that it discovered might be really relevant for developing algorithms.

[00:32:49] The other thing so lpo is really nice in the sense that you get this like Interpretable component out of it, which is the drift function that chris was talking about LPG obviously doesn't have this because it's it's completely black box and they can represent things like

[00:33:06] outside of that that might be like completely Sort of gibberish to us But the analysis we did in LPG was kind of like reverse engineering it or like looking at the Looking at like the outcome of its update over training

[00:33:22] And so what we found with that, or the coolest result on the LPG side in my mind, was that it managed to learn by itself to do things like schedules in the update size and schedules in policy entropy.

[00:33:38] And so what that means is that as you increase the value of T, so as you tell it that you have more total steps to update the agent, both of these algorithms take longer to decay the agent's entropy or increase its update size.

[00:33:57] And the final values they reach are much different as well. Like, when it doesn't have as long to update, it doesn't reach as low an entropy value as when it has a really long time.

[00:34:10] As that means like when you don't have as much evidence the final policy isn't as confident Which is like it's very intuitive and it's very close to A lot of human designs like algorithms and insights

[00:34:23] But it also says it's not just a linear schedule, right? It also says yes, it's beyond that It's not just some decay that ends up in the same spot Yeah, precisely. Yeah, so it looks so much like things that we designed

[00:34:37] But there was no regularization towards it. So The cool part is like we didn't tell it to learn these things at all We just gave it access to the information it needed to learn them

[00:34:48] And then purely from the data it saw, it discovered these were good techniques. So it's kind of nice in that it validates RL researchers as a field, in that maybe we were doing something right, because just given compute and data the algorithm discovered something similar.

[00:35:06] But at the same time, to push on this point, it's not like we used very adaptive schedules in RL most of the time, right? So it's at the same time discovering things that we maybe already kind of knew, but also suggesting new

[00:35:19] potential directions that are very promising and seem to really increase performance. Yeah, that's true. I think the biggest, and possibly the only, example of a really adaptive schedule that's used often is in SAC, I think, the temperature adaptation there.

[00:35:34] But except for that, I've seen some relatively linear scheduling of learning rates and of values for exploration, I think. So that's a really good point. I think it's also really important because I can see how many

[00:35:49] different hyperparameters or design decisions in RL actually contribute to this exploration-exploitation trade-off, right? So it's actually a great way to show that, yeah, this is something you might want to look into more.

[00:36:02] Then the big question of course is still, it's meta-RL, how well does it generalize? So in our paper we kind of separated it into two. On the LPG side we copied the training paradigm, or the training environments, from the original paper, which are these

[00:36:20] very toy grid world environments and then we evaluate Uh, we evaluate things on the fringe of that distribution effectively So we evaluate on like mazes on these like very stochastic and delayed reward like holdout environments And we also evaluate over very different training horizons

[00:36:42] So we evaluate on tasks where the agent is only given like a handful of updates and far fewer than It saw metatrain time And when it's given a really long time to train and it should be explorative for for a long time

[00:36:56] And yeah, almost universally you find that temporal adaptation improves generalization to all of these, and that it can actually handle values of big T, like training horizons, that are very different to the ones that it saw at

[00:37:14] meta-train time. So we have pretty big generalization in this horizon space. But since it's the more complex learning paradigm, it's also within the same domain, just generalizing over the time dimension basically, right?

[00:37:25] Yes, yeah, but yeah as chris as you were starting to say in lpo. It's a bit different, right? Yeah, exactly. So for lpo It's really hard for it to over fit to any set of environments since it's like not particularly expressive

[00:37:38] And so for lpo we can actually just metatrain on one environment and it will transfer to other environments And that's what we do in the paper. So we actually just metatrain lpo. I believe on Was it space invaders?

[00:37:51] one of the MinAtar Space Invaders environments, and this transfers to other MinAtar environments. It even also transfers to Brax and these continuous control settings. And the big reason for this is largely this mirror learning framework we talked about,

[00:38:03] Which is that anything it discovers we know will generally work And so basically like in the worst case scenario does maybe similarly to ppo And the little extra gains it finds on any particular environment are likely to be fairly general

[00:38:18] And so it's why also the temporal awareness seems to generalize quite well Quite well too and it seems to have learned like broadly true rule Which is like, you know in the beginning explore more and then explore less and this seems to

[00:38:31] apply very broadly in any type of RL setting. And I think it has to do with just the fact that it's very hard for LPO to overfit. You can get away with this, training on one environment and seeing if it generalizes.

[00:38:42] Then that's a pretty big jump, right? I mean, we're talking from, what, MinAtar is a relatively small space, like I think it's 16 by 16 pixels or something, video games, to robot learning,

[00:38:54] robotic simulations. That's pretty significant. Do you think, with temporally aware LPO, we could actually do something like extracting the algorithm and shipping it as an out-of-the-box thing that is in a library and that you would recommend people use?

[00:39:09] Yeah, so actually one thing we did for the original LPO paper was we really looked very closely at the discovered function and we handcrafted an analytical version of it. We're like, hey, this looks like, you know,

[00:39:23] we can kind of analyze each of these components and maybe try to handcraft an algorithm that tries to replicate it. And we did that in the paper, and it actually outperformed PPO on a lot of unseen environments. We didn't do it for this paper,

[00:39:33] But I think it should be really possible and doable since it's such a low dimensional function in particular the Output would be you know just like an equation that you can then you know replace in your ppo code

[00:39:43] You know you place this like a few lines of code with these other few lines of code and it we I expect it would work quite well one thing is That we wanted to try but didn't have time to get around to was this idea of symbolic optimization

[00:39:56] So can we maybe try to like maybe evolve or learn? You know like an algorithm or like an equation that outputs, you know the lpo function That we learned and I think that would be a really promising direction

[00:40:08] But that would result in what you're talking about which is like, you know a new algorithm that people can try and use So yeah the lpo paper we did this and we call it discovered policy optimization

[00:40:18] And if you can try it out yourself. It's just a few lines of code different from ppo And one of the reasons I think why it makes sense to look at things like approximation of the function

[00:40:30] in coding terms and mathematical terms, or symbolic expressions, is obviously that if you keep the learned drift function, even though it's a simple function, there's some added inference time. And this I think has been one of the reasons I've heard floating around why learned optimizers aren't used that much.

[00:40:45] You're saying this idea of trying to meta-learn things, but meta-learn them in a symbolic way, is going to be an important future direction, or do you think we are just limited in what we can express that way? So it's definitely

[00:41:00] It's definitely more limiting. I mean compared to compared to neural networks. They're like completely universal function approximators However, if we're looking at like practical deployment of These like learn objectives or like learn optimizers It does seem like at least before you have

[00:41:20] black-box objective functions solving every task, you're more likely to have these symbolic ones. So a really good example of this was Lion, and that was a learned optimizer that came out quite recently that was fully symbolic, and so

[00:41:37] they do a very similar approach where they like meta optimize in this space of effectively just like mathematical operations and they find a They find a function or like a short program That is a really effective optimizer over all the tasks they demonstrate

[00:41:59] And so that means the thing they discovered has already been added to optimization libraries, things like Optax in JAX, and it is very competitive with Adam and outperforms it on a lot of tasks.

[00:42:14] And as well as that it's about as efficient as adam So in terms of like user experience as a researcher It's sort of plug-and-play because you're just changing a hyper parameter And it isn't likely to affect the training performance very much
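To give a feel for what "fully symbolic" means here, this is roughly the Lion update rule as published (a sketch from memory, so treat details such as default coefficients as approximate; in practice you would use an off-the-shelf implementation, e.g. from Optax):

```python
import jax.numpy as jnp

def lion_update(params, grads, momentum, lr=1e-4, b1=0.9, b2=0.99, wd=0.0):
    """Lion in a few lines: take the sign of an interpolated momentum,
    add decoupled weight decay, then update the momentum."""
    direction = jnp.sign(b1 * momentum + (1.0 - b1) * grads)
    new_params = params - lr * (direction + wd * params)
    new_momentum = b2 * momentum + (1.0 - b2) * grads
    return new_params, new_momentum
```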

[00:42:32] So I think practically like these symbolic methods do have a bit of a leg up for now But as you say the expressivity is limited So in the long run, it would be nice to believe that these black box methods are

[00:42:46] are going to be able to find things that the symbolic ones can't, that manage to outperform them. There is a nice intermediate between these two things, which is a neurosymbolic approach, where you learn a combinatorial program

[00:43:03] That can make calls to a small neural network To represent like Certain functions And so you could imagine that as maybe like a stepping stone towards completely black box methods Or quite possibly like the end goal would be Would be that that you download

[00:43:22] Like a few like a small number of neural network weights And like a few lines of code that make use of them and that is your optimizer Yeah, that's I think that's a really interesting direction And I think we've really seen from the adoption that's come from

[00:43:38] Lion compared to something like VeLO, that usability is a big factor in just getting a broader test bed for the optimizer, right? Because you can always check something as much as you want in your own settings,

[00:43:53] but blind spots are really going to be found by other people in the community. But if you think about how you would envision maybe meta-learned RL algorithms in five years, how far do you think

[00:44:06] we're going to get in the black-box paradigm? Because we already said that it's hard to train them, data limitations are real in the RL space. Do you think the next years are really going to see progress in a

[00:44:19] more neurosymbolic or symbolic direction, where we may stay a bit less expressive but figure out how to do that well? Or do you think, given that synthetic data is

[00:44:30] Becoming a much bigger topic these days. We can actually also push a lot in the black box direction Yeah, I think a big kind of you know, generally true Pattern is just scale, right and a lot of like what enabled our work is actually an increase in scale

[00:44:49] In particular, what we use is basically what we call pure-JAX reinforcement learning. So what we do is we run the entire RL pipeline on the GPU, and we actually can train thousands of agents at the same time on a single, you know, A100, and

[00:45:06] this increase in scale is what enables us to do this meta-evolution. It's why it wasn't so popular before, and now, with the ability to train thousands of agents at the same time,

[00:45:14] it enables this paradigm, because we can just, you know, scale up the amount of compute that we use massively.
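For context on what "running the entire RL pipeline on the GPU" buys you, the pattern looks roughly like this; `make_train` and `config` are assumptions standing in for a real setup:

```python
import jax

# Conceptual sketch: if the whole training run is a pure JAX function of a
# random seed, vmap turns "train one agent" into "train thousands of agents
# in parallel on one GPU".
train_fn = make_train(config)                          # seed -> training metrics
seeds = jax.random.split(jax.random.PRNGKey(0), 2048)
metrics = jax.jit(jax.vmap(train_fn))(seeds)
```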

[00:45:26] I think in general, as we scale up our compute use and the amount of compute we have available, it becomes more and more feasible to do purely black-box, synthetic-data-style approaches. In particular, Matt was talking about one of Louis Kirsch's works on encoding symmetries in the architectures

[00:45:38] of neural networks, but if you look at some of his recent work and actually some of work that I also do recently That's quite similar We investigate this idea of just using like synthetic augmentations on the data

[00:45:50] To encode these symmetries. So one thing we do is for example We just generate random permutations of the observation or random linear projections of the observation And we say hey, this is a new environment We'll learn this new environment and this seems to work decently well

[00:46:03] Especially as we try out different architectures and try out just increasing the amount of compute we have Like maybe we don't need to be so smart about the architecture and design and just throw in more data and more compute and it will work better
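A minimal sketch of that kind of synthetic augmentation (illustrative only; the exact transforms and scaling are assumptions):

```python
import jax
import jax.numpy as jnp

def make_synthetic_task(key, obs_dim):
    """A fixed random permutation or random linear projection of the
    observation is treated as a 'new environment' for the whole lifetime of
    that synthetic task, encouraging invariance to arbitrary details."""
    perm_key, proj_key = jax.random.split(key)
    perm = jax.random.permutation(perm_key, obs_dim)
    proj = jax.random.normal(proj_key, (obs_dim, obs_dim)) / jnp.sqrt(obs_dim)

    def permute_obs(obs):      # obs: flat observation vector
        return obs[perm]

    def project_obs(obs):
        return proj @ obs

    return permute_obs, project_obs
```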

[00:46:15] Yeah, so building on what Chris says, with increasingly more things scale does seem to be the answer to this as well. I should mention, the thing that originally got me into this topic was actually looking at this problem, but

[00:46:31] scaling the environment space, the task space that you're learning your algorithm on. So I mentioned briefly earlier that in the original LPG paper they showed that you can generalize from toy grid world environments to Atari games and do pretty well.

[00:46:50] But the result that really stood out to me was they had this, I mean, actually a three-point curve where you gradually grow your train set. So you start with tabular grid worlds, and then they also trained an LPG variant with tabular and random grid worlds,

[00:47:08] And then they trained another one that also had access to some like delayed reward mdp environments And they showed that as they grew their meta training task space the performance on Atari games improved so they were able to solve more Atari games past human performance

[00:47:27] So that kind of gave this inspiration, or suggested there might be this scaling law, as in that as you add these archetypal problems to your meta-training set and your

[00:47:42] learned optimizer sees things like stochasticity and delayed reward and the need for exploration, it gets better on completely unseen tasks, because real-world environments tend to have these same core challenges. So this was actually the work I did before the temporally aware paper,

[00:48:02] which was trying to push that method further using unsupervised environment design. So what that is, as a brief intro: UED is a field that tries to train robust RL policies over really broad and really diverse task spaces. So you can imagine,

[00:48:23] You can imagine a grid world where you can randomly place walls Or like a like a walker environment where you can design all the blocks in the terrain and What you're trying to do is train a robust agent by designing environments or designing levels in this space

[00:48:40] that are like Informative for the agent in training and where the agent still has things to learn So the idea kind of being that like as you increase your task space and you can represent more things

[00:48:52] You're likely to end up with a larger and larger portion of tasks and like the mode of the distribution Being just like kind of random and not very meaningful levels or maybe not very challenging

[00:49:05] But there'll be a few tasks at the fringe that are super interesting and present these generally useful problems. So again, what I did before this was apply this to LPG and

[00:49:18] use a UED method where we compared the performance of LPG on these randomly sampled levels to a handcrafted algorithm, so A2C in our case. And so we biased our meta-training towards levels where we were currently underperforming A2C.

[00:49:37] So we purposefully selected levels where our learned objective function still had room to improve against the handcrafted algorithms we knew before. And again, just doing this in the toy grid world space, we found pretty massive improvements in downstream Atari performance.
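In sketch form, the selection rule just described looks something like this (hypothetical inputs; the real UED method is more involved):

```python
def select_training_levels(levels, lpg_returns, a2c_returns, k=16):
    """Score each candidate level by how much the learned objective (LPG)
    currently underperforms a handcrafted baseline (A2C), and bias
    meta-training towards the highest-regret levels."""
    regret = [a2c - lpg for lpg, a2c in zip(lpg_returns, a2c_returns)]
    ranked = sorted(zip(regret, levels), key=lambda pair: -pair[0])
    return [level for _, level in ranked[:k]]
```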

[00:49:58] Which is really cool, because that result suggests that you can find not just things that are challenging for you, but that are generally informative and represent something meaningful about the space of real-world environments

[00:50:12] that your objective function can then capture and handle at test time. So basically we have all the ingredients: we have better scaling, and that's where people should definitely check out PureJaxRL, it's really great, and we have ways of doing intelligent data scaling like UED.

[00:50:32] So yeah looking forward to see where we end up in five years then um if people want to read more about All of what we talked about where can they find you on the internet? I think like most ML researchers twitter is probably probably the best place

[00:50:49] So yeah. Yeah, same here, Twitter is good. I also have just, you know, an academic website, chrislu.page, and you can just see papers and blog posts that I post there. And you also have repos that are hopefully useful to the community?

[00:51:03] They have been useful to me before, so I would say you should check them out. Then thank you two, it was really interesting talking to you. I hope you listening at home also enjoyed our conversation. And yeah