Juan Sebastian Rojas - A Differential Perspective on Risk Aware Reinforcement Learning

Cohere · 13.04.2026


Video description
The field of reinforcement learning has long been dominated by discounted methods, wherein a decision-making agent aims to optimize a potentially-discounted sum of rewards over time. In this talk, we explore a fundamentally different and under-explored decision-making framework, in which a decision-making agent aims to optimize the reward received per time step. Methods associated with this framework are typically referred to as differential or average-reward methods. In particular, we will show how differential methods have unique structural properties that make it possible to circumvent some of the typical challenges and non-trivialities associated with risk-aware decision-making, in which the agent is tasked with learning and/or optimizing a performance-based measure other than the typical (risk-neutral) mean. In the first half of the talk, we will show how the differential framework admits a more scalable family of distributional RL algorithms compared to discounted methods. In the second half of the talk, we will show how we can leverage the unique structural properties of differential RL to optimize, for the first time, the well-known conditional value-at-risk (CVaR) risk measure in a fully-online manner, without the use of an explicit bi-level optimization scheme or an augmented state-space.

Juan Sebastian Rojas is a PhD student at the University of Toronto, where he conducts research as part of the Dynamic Optimization & Reinforcement Learning Lab. His research interests lie in the theory and application of reinforcement learning agents. His current research focuses on developing theoretical frameworks and algorithms that incorporate the notions of risk and longevity into the learning, planning, and decision-making processes of reinforcement learning agents operating in dynamic, uncertain, and safety-critical environments. Juan has over five years of experience in industry, where he’s contributed to projects that span a broad range of topics, including machine learning, robotics, software engineering, and data science.

This session is brought to you by the Cohere Labs Open Science Community - a space where ML researchers, engineers, linguists, social scientists, and lifelong learners connect and collaborate with each other. We'd like to extend a special thank you to Rahul Narava and Gusti Winata, Leads of our Reinforcement Learning group, for their dedication in organizing this event. If you’re interested in sharing your work, we welcome you to join us! Simply fill out the form at https://forms.gle/ALND9i6KouEEpCnz6 to express your interest in becoming a speaker. Join the Cohere Labs Open Science Community to see a full list of upcoming events (https://tinyurl.com/CohereLabsCommunityApp).

Table of contents (12 segments)

Segment 1 (00:00 - 05:00)

My name is Juan Rojas. I'm a PhD student at the University of Toronto, and I'm really excited to be here today to share with you some of the research that I have been working on as part of my PhD. The title for today's presentation is "A Differential Perspective on Risk-Aware Reinforcement Learning." If I had to summarize this talk in one sentence, it would be this: differential reinforcement learning makes risk-aware decision-making tractable. In today's talk, I'm going to try to convince you of why this is the case. To do that, I've broken the talk into four parts. First, I'm going to start by motivating the need for risk-aware decision-making, and why we might want to consider moving away from the typical risk-neutral approach. Then I'm going to get our bearings on differential reinforcement learning, so I'll introduce what that's all about. Then I'm going to present two case studies in which I'll try to argue that previous point: that if we're open-minded and willing to embrace the differential RL perspective, we'll see how we can derive some really nice algorithms for distributional RL, as well as for optimizing the conditional value at risk, or CVaR, risk measure. With that, I'll get started with the first section, where I'm going to try to convince you that risk-aware decision-making is something that really matters. To start, I want to take a step back and think about decision-making in general. We make a lot of decisions every day, and for the most part the decisions we make are fairly low stakes, meaning they don't have a big effect on the world around us. This includes things like choosing what to have for breakfast or what clothes to wear. However, we might also be put in situations where we have to make decisions in high-stakes scenarios, where whatever we choose might have a really big impact on the world around us. There's an epic example of such a decision in the movies, in a film you might be familiar with called The Matrix. In this movie, the main character, Neo, has to make a really important decision: he has to choose between taking a red pill or taking a blue pill. I won't spoil the movie in case anyone wants to watch it, but essentially the decision that Neo makes has a really big impact on the world around him. What's really striking about this scenario is that Neo is given very little information about the choice he has to make, and ultimately he ends up making the decision based on gut instinct or intuition. Since this is a movie, it's thrilling and exciting that the main character has to make such a big decision with no information. But when we talk about the real world, and autonomous agents that we want to deploy in the real world, we don't want that to be the way our agents make their decisions. In the real world, we want our agents to make informed decisions based on data and experience.
For example, we might have some agent out in the world, and that agent might have some goal or purpose in mind: maybe it's to go on a walk and sniff as many flowers as possible, maybe it's to escape the Matrix, or anything in between. Along the way, that agent might be faced with certain decisions. Maybe it has to choose between taking a red leash, which symbolizes taking one specific course of action, or taking the blue leash, which symbolizes taking a different course of action. Unlike in The Matrix, we want our agents to make decisions based on data and experience, and not just on intuition or gut instinct. To that end, a natural place to start is by considering risk-neutral decision-making. When we think about reinforcement learning systems, this is the approach that typical RL agents use to make their decisions. As we're all familiar with, in risk-neutral decision-making we make our decisions by first quantifying some expected measure of reward that we will get by following certain courses of action. So we might quantify the expected amount of reward we will get if we take the red leash versus the blue leash, and in risk-neutral decision-making we would pick the option that has the highest expected measure of reward.

Segment 2 (05:00 - 10:00)

To make this concrete, let's suppose that we know somehow that if we take the red leash, we're going to have some expected measure of reward of -1/3, whatever that means in this context. Conversely, if we take the blue leash, we'll end up with an expected measure of reward of zero. If we believe that risk-neutral decision-making is the right approach, we would say that taking the blue leash is the best option, because it has the higher expected measure of reward. In some situations this might be a perfectly valid way to make decisions. But when we talk about this idea of risk-aware decision-making, what it really comes down to, and where it all starts, is a simple question: is making decisions based on expected values enough? Is making decisions based on the expected value appropriate, or even safe? When we talk about making decisions in high-stakes and safety-critical domains, making such highly consequential decisions purely on the basis of a single number, the expected value, may be insufficient. If we indulge the idea that maybe the expected value is not enough, then a natural next thing we might want to consider is what is going into this expected value; maybe we want a more granular idea of what the agent is experiencing that results in these expected values. For example, we might want to know what rewards are going into this expected value. We could say that taking the red leash is symbolic of taking a particular path along some park, and if we take this path, then maybe we'll encounter certain rewards along the way that average out to that -1/3 we saw previously. Alternatively, if we take the blue leash, maybe that means we take a slightly different path where we encounter different rewards. Here maybe we encounter really high rewards compared to the ones we saw with the red leash, but then we run into some hazard area and get some really negative rewards, such that these rewards average out to the zero we saw at the start. When we talk about this idea of wanting a more granular picture of what our agent is experiencing, that's something we can refer to as distributional awareness, where we essentially shift the focus away from just the expected values and instead try to get a better idea of the distribution that corresponds to whatever measure of reward we're considering. More formally, for our two options here we could have these distributions: for the red leash, maybe we have a distribution of some measure of reward that is heavily centered around that -1/3, and conversely, for the blue leash, maybe we have a different distribution that has a high concentration of really high rewards, but also some low rewards, so that it averages out to zero.
So again we can ask the question: do we now have enough information to make our decision, and are we still comfortable with the recommendation from risk-neutral decision-making? In some cases the answer might be yes. Now that we know the distribution that is producing this expected value, we might say that's fine, we're okay with taking this course of action and choosing the blue leash. But in other situations, particularly if you're operating in the real world, this might be really concerning: you have this high concentration of really negative rewards, which in the real world could translate to potentially catastrophic or really undesirable outcomes for our agent. So then the question becomes: how do we steer our agent to account for this? How do we steer the behavior of our agents to act in a more cautious manner and avoid this high concentration of low rewards? Because if we just do regular risk-neutral decision-making, it's still going to want to choose the blue option. And that's really where we get into this idea of risk-aware decision-making. At a high level, we essentially replace the expected value with a more generic measure that is called a risk measure. The basic idea is that we want our risk measure, which we denote by ρ (rho) here, to capture, in a principled and mathematically rigorous way, that in this case we prefer the red leash over the blue leash, because we're able to avoid that high concentration of really negative rewards, even if it comes at the cost of a slightly lower average.
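To make the contrast concrete, here is a small illustrative Python sketch (not from the talk; the sample distributions are made up to mirror the red-leash/blue-leash example) that compares the mean with a simple tail-based risk measure, the average of the worst 10% of rewards, for two reward distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative reward samples: "red" is narrow around -1/3,
# "blue" averages ~0 but has a heavy left tail of bad outcomes.
red = rng.normal(loc=-1/3, scale=0.1, size=100_000)
blue = np.where(rng.random(100_000) < 0.9,
                rng.normal(loc=0.6, scale=0.1, size=100_000),
                rng.normal(loc=-5.4, scale=0.3, size=100_000))

def lower_tail_average(rewards, tau=0.1):
    """Average of the worst tau-fraction of rewards (a tail-based risk measure)."""
    cutoff = np.quantile(rewards, tau)          # the tau-quantile of the rewards
    return rewards[rewards <= cutoff].mean()    # expected reward within the tail

for name, r in [("red", red), ("blue", blue)]:
    print(f"{name}: mean={r.mean():+.2f}  worst-10%-average={lower_tail_average(r):+.2f}")
# A risk-neutral agent (mean) prefers blue; the tail-based measure prefers red.
```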

Segment 3 (10:00 - 15:00)

In a mathematical sense, the way we would actually accomplish this is as follows. When we think about risk-aware decision-making compared to risk-neutral decision-making, you can think of it like this: first we look at the distribution of whatever measure of reward we're considering. In risk-neutral decision-making, we essentially want to optimize the average of this distribution; we want to find some policy that is going to give us the distribution with the best average. Conversely, in risk-aware decision-making, the way you can think of it at the highest level is that we want to optimize some aspect of this distribution that is not just the mean. For example, maybe what you care about is optimizing the worst-case reward, or maybe you care about optimizing the reward variance, or maybe it's some nth-percentile reward. It could really be anything, and this is obviously not an exhaustive list of risk measures; there's actually an entire subfield dedicated to deriving risk measures that satisfy useful mathematical properties, and we'll look at one at the end of this presentation. But the thing that is a little specific to RL is just how hard it is to do risk-aware decision-making, because it turns out that when you use the standard reinforcement learning formulation, optimizing anything that is not the average is actually really difficult. That brings me back to the point I made at the start: what I'm going to show you today is that if we embrace differential reinforcement learning, we'll see how that allows us to do risk-aware decision-making in a really scalable, efficient, and ultimately tractable manner compared to prior methods. Okay, so that was just a little taste of what's to come, at a really high level, and hopefully at this point I've piqued your interest and maybe slightly convinced you that risk-aware decision-making might be something we're interested in doing. So now I'm going to go into differential reinforcement learning and give an introduction to what that is. To do that, I'm actually going to park our discussion of risk for a little bit and look strictly at differential reinforcement learning, and then we'll put everything together in the subsequent sections when we look at our two case studies. So with that, I'll go ahead and introduce differential reinforcement learning. In differential reinforcement learning, we're still dealing with our typical agent-environment interaction, in which our agent is trying to find ways to output actions, based on whatever state the agent is in, that will produce a favorable sequence of rewards from the environment. More formally, the way the agent chooses its actions is through what's called a policy, and the way the agent chooses or learns its policy is with respect to optimizing some measure of performance. As we talked about previously, in RL that is typically some expected measure of performance, and more specifically, in the standard reinforcement learning formulation, that measure of performance is the term shown here on the slide. This term, as you all probably know, is called the expected discounted return.
Essentially, we're trying to optimize the expected value of a discounted sum of rewards over time. That's the standard RL formulation, but in this talk I'm going to try to get you excited about differential RL. In differential reinforcement learning, we optimize a slightly different objective: instead of the discounted return, our primary optimization objective is the expected value of our one-step rewards, which for this presentation I'll denote with the shorthand r̄ (r-bar). This might sound a little familiar if you know reinforcement learning, and if it does, that's because this is also sometimes referred to as average-reward reinforcement learning. In the next few slides I want to dive a little deeper into this objective, because I really want us to build our intuition about what it means: it's not just that we're optimizing the expected value of the one-step rewards, but that we're really trying to optimize the long-run average of these rewards. So, to get an idea of what's happening, we can again consider our agent-environment interaction. The way you can think of it is that whenever our agent is following a given policy, say policy one, what's really going on under the hood is that this policy is inducing what's called a Markov chain: a sequence of states that the agent will visit based on the policy. As the agent visits these states, it's going to receive certain rewards. The basic idea is that if you have this policy and follow it asymptotically, as time goes to infinity, then given some standard assumptions, this will induce a stationary distribution of rewards whose average corresponds to that r̄ value, which is our objective here.
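For reference, the objectives being contrasted here can be written as follows in standard average-reward RL notation (not necessarily the exact symbols on the slides); the third expression, the differential return, is the surrogate introduced in the next segment.

```latex
% Standard (discounted) objective:
J_{\gamma}(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\right],
\qquad 0 \le \gamma < 1

% Average-reward (differential) objective:
\bar{r}(\pi) = \lim_{T \to \infty} \frac{1}{T}\,
\mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} R_{t}\right]

% Differential return, optimized as a surrogate for \bar{r}:
G_{t} = \sum_{k=0}^{\infty} \bigl(R_{t+k+1} - \bar{r}(\pi)\bigr)
```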

Segment 4 (15:00 - 20:00)

In the same way, we might have a slightly different policy, say policy two, and that policy might induce a slightly different Markov chain. If we follow that policy asymptotically, it might yield a slightly different stationary distribution of rewards, whose performance we can again quantify by its average, and in that way we can compare the performance of two policies based on their average-reward values. So in differential reinforcement learning, our goal is to find the policy that induces the stationary distribution of one-step rewards with the best average. Hopefully at this point I've built up our intuition about our primary objective in differential RL. But you might have a question: a few slides ago I mentioned that this is sometimes called average-reward reinforcement learning, and the reason it's called that is fairly obvious. So why do we sometimes call it differential RL? Where does that come from? The reason is that even though this average reward is our primary learning objective, in practice the way we actually optimize it is indirect: we optimize it by first optimizing a surrogate objective, which I'll show here. You'll see that this objective looks very similar to the standard RL objective. We still have a sum of rewards, but we no longer have a discount factor, and at every time step we subtract our average-reward primary optimization objective. As I mentioned, the reason we do this is that optimizing this expectation indirectly optimizes our true objective, the average-reward value. The term inside this expectation is referred to as the differential return, and that's where the name differential RL comes from. Another reason this formulation is useful is that it allows us to derive algorithms that are very similar to what we're used to seeing with the standard RL formulation: we're able to derive TD algorithms, Q-learning algorithms, actor-critic algorithms, you name it, for differential RL. So, to again build our intuition, I'm going to go through a basic Q-learning algorithm for differential RL. This is an algorithm that was proposed in prior work, so we did not come up with it, but we're going to use it as a baseline algorithm throughout the talk, and I really want to make sure we understand what's happening here, so we'll go through it line by line. We still have the concept of the temporal-difference (TD) error, and it's structured to mimic our differential return: you'll notice that we no longer have a discount factor, as in the regular TD error, and we now have a term that serves as an estimate of that average-reward objective. We then use this TD error to update our value function in exactly the same way as with regular Q-learning; here we're showing a tabular update. What is different is that we now have an extra line, which we use to update our estimate of the average reward.
You can think of the first two terms in this update as step-size parameters, and then we update our average-reward estimate, in this case, with the TD error. In prior work it was shown that, at least with tabular algorithms, updating the average reward with the TD error will actually converge to the optimal average reward, again in tabular settings. I see we have some questions in the chat; I will get to all the questions at the end, just so we get through everything, but I'm happy to answer them then. Thank you. Okay, so now we have our differential Q-learning algorithm, and we'll use this as our baseline for the methods we'll talk about in a little bit. At this point we've motivated the need for risk-aware decision-making on one end, and we've introduced differential reinforcement learning on the other. Now we're going to combine the two, to hopefully begin proving the point I made at the start: if we embrace differential RL and this formulation, we're going to see some really exciting results as they relate to risk-aware decision-making. To get a flavor of that, we'll start with a case study where we just want to learn our distribution of rewards. That's what we'll do next. For the remainder of this talk, we're basically going to follow along with a very simple toy environment that we'll use to check our intuition and make sure that our algorithms are performing as we would expect.
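As a reference for the baseline just described, here is a minimal tabular sketch of a differential Q-learning update in Python. It follows the structure in the talk (an undiscounted TD error with the average-reward estimate subtracted, a regular Q-learning value update, and an extra line that updates the average-reward estimate with the TD error); the variable names and step sizes are illustrative rather than taken from the original paper's pseudocode.

```python
import numpy as np

def differential_q_learning_step(Q, r_bar, s, a, r, s_next,
                                 alpha=0.1, eta=0.1):
    """One tabular differential Q-learning update.

    Q     : 2-D array of action-value estimates, Q[state, action]
    r_bar : current estimate of the average reward (a float)
    alpha : step size for the value update
    eta   : relative step size for the average-reward update
    """
    # TD error: no discount factor, and the average-reward estimate is subtracted.
    delta = r - r_bar + np.max(Q[s_next]) - Q[s, a]

    # Value update, exactly as in regular Q-learning.
    Q[s, a] += alpha * delta

    # Extra line: update the average-reward estimate with the TD error.
    r_bar += eta * alpha * delta

    return Q, r_bar
```

In a full agent this would be called once per environment step, with actions chosen, for example, epsilon-greedily from Q.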

Segment 5 (20:00 - 25:00)

Here we have an environment that we proposed, called the red-pill environment. It's a very simple two-state environment where the agent has two options: it can take a red pill, which will take it to the red state, or a blue pill, which will take it to the blue state. This is inspired by that initial example at the start. When the agent is in one of these states, it's going to receive rewards that are distributed according to some predetermined distribution. Here we see the distribution of rewards the agent would experience if it stays in the red state: a simple Gaussian-like distribution centered around -0.7. Alternatively, if the agent is in the blue state, it will receive rewards distributed according to the distribution shown on the slide, which is bimodal and centered at around -0.6. We'll use this environment to test our intuition throughout the talk. So when we talk about this idea of distributional RL, we can ask a first question: how can we learn the per-step, or one-step, reward distribution induced by a given policy? That's like what we talked about at the start, where maybe expected values are not enough and we want a richer idea of what's happening. So in this section we want to devise a method that will allow us to learn these distributions. When we talk about distribution learning in reinforcement learning, that brings to mind a subfield of reinforcement learning called distributional RL. In the standard RL formulation, as I'm sure most of you know, there's a rich set of algorithms that have been around for many years that allow us to learn the distribution of the discounted return. Essentially, as we saw previously, we shift the focus from the expected value to learning the distribution of the discounted return. There are many well-known algorithms that do this, such as C51, QR-DQN, and many other modern ones. At a very high level, the way these algorithms work is that, given some target distribution, they approximate it at certain points, like atoms or quantiles, and the basic idea is that if you're able to get adequate estimates at these points, and you have enough of them, then you'll eventually recover the shape of your target distribution. When we talk about our goal of learning the distribution for differential RL, we're going to follow a very similar approach. The key difference is that instead of learning the distribution of the discounted return, we're going to learn the distribution of our one-step rewards. That's exactly where we'll go now. So again, our goal is to learn the distribution of one-step rewards, and the method I'm going to talk about is one we proposed in a recent publication at the most recent AAAI conference.
If you want to follow along by looking at the paper where we proposed these methods, I've put the QR code on the slide here, so feel free to take a picture and follow along if that's your preferred way of digesting information. Okay. So essentially, the way we're going to learn our target distribution of the one-step rewards is by leveraging a method called quantile regression. This is something that has been around for many years; it's not something we proposed. The way quantile regression works is that it allows you to learn a predetermined quantile of a distribution just by observing samples from that distribution. Here's a very generic update rule for quantile regression, and I'll go through it term by term. Theta (θ) here denotes our quantile estimate. We then have some step-size parameter, as with any learning rule. And then we have this parameter called tau (τ). τ is how we specify which quantile we want to learn; for example, if τ is 0.1, then we're interested in learning the 10th percentile of our distribution. We also have a term that is an indicator function, which takes a sample from our target distribution and checks whether it's less than, or greater than or equal to, our current estimate. Altogether, this term is sometimes referred to as the quantile loss. It's been shown in prior work that if this update rule converges to some stationary value, then that stationary value corresponds to our desired quantile. I'm going to go into a little more detail to show why that is.
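Written out in standard quantile-regression notation (rather than the slide's exact symbols), the update rule and the fixed-point argument sketched in the next segment look like this:

```latex
% Generic quantile-regression update for the \tau-quantile estimate \theta,
% given a sample R from the target distribution and step size \alpha:
\theta \leftarrow \theta + \alpha \left( \tau - \mathbb{1}\{ R < \theta \} \right)

% If the update converges to a fixed point \theta^{*}, the increment must
% vanish in expectation:
\mathbb{E}\left[ \tau - \mathbb{1}\{ R < \theta^{*} \} \right] = 0
\;\Longrightarrow\;
\Pr\left( R < \theta^{*} \right) = \tau
% i.e. \theta^{*} is the \tau-quantile of the reward distribution.
```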

Segment 6 (25:00 - 30:00)

The key idea is that if this update rule converges, then that implies that this term must go to zero in expectation, and that's where this expression comes from. Here we'll denote the fixed value that we converge to as θ*. We can then solve this equation to see what that stationary value is. We do some simple algebra: we apply the expectation, use the well-known identity for the expected value of an indicator function, and then solve for that fixed point to see that it indeed corresponds to our desired quantile value at the predetermined τ. Okay. So learning our distribution is not going to be too complicated: we can basically apply this generic rule to our target distribution, where again we're interested in learning the distribution of the one-step rewards for some number m of quantiles. That's great. So we can leverage this to learn our distribution, but as you'll see, this is not exactly a reinforcement learning algorithm yet; this is just quantile learning. We need to find a way to integrate it into an overall Q-learning or TD-learning type algorithm. In the paper, the key insight that allows us to integrate this into an RL algorithm comes down to where we choose these τ values. What we show in the paper is that if we're clever about where we pick the τ values, specifically if they are evenly spaced between zero and one, or more formally, evenly spaced along the CDF of your target distribution, then the average of these quantile estimates serves as a really accurate estimate of your average reward. What that implies is that we can actually use these quantiles to update our estimate of the average reward, and that basically allows us to put everything together into one algorithm. This is the algorithm we propose. It's called differential distributional Q-learning, or D2 Q-learning for short. You'll see that it's very similar to the baseline algorithm we saw at the start: the first three lines on the slide are exactly the same as in differential Q-learning. You might recall that in differential Q-learning we updated the average-reward estimate with the TD error; in D2 Q-learning, we instead update that estimate using our quantiles, as we just discussed. So essentially we put everything together: we learn our quantiles, use them to update our average reward, use that to calculate our TD error, and then learn the value function, and so forth. In the paper we prove that, at least in tabular settings, D2 Q-learning learns the optimal risk-neutral policy as well as its per-step reward distribution. Okay, so let's check our intuition. Maybe take a few seconds and think to yourselves: in the red-pill environment, which distribution should D2 Q-learning learn? Okay, hopefully that was not too difficult.
If you thought to yourselves that D2 Q-learning should learn the distribution for the blue state, you are correct, and the reason is that this distribution has a higher average than the distribution for the red state: about -0.6 for the blue distribution compared to about -0.7 for the red. Because D2 Q-learning learns the distribution of the risk-neutral optimal policy, it's going to learn the distribution of the blue state, since staying in the blue state gives the highest average reward. So in theory, this is what our algorithm should learn, and the only thing left to do is to check whether that is actually what happens. Here again is the distribution that, in theory, we should learn. You'll notice there's a little bump there; that's because we're going to use an epsilon-greedy policy, so this little bump incorporates the exploration aspect of our algorithm. Given this distribution that we want to learn, we can pre-calculate the quantiles that we should learn, so we can check whether that is actually what happens. We have not run the algorithm yet here; we just want to get an idea of what quantiles our algorithm should learn. We can then apply our algorithm, and here are the results shown in this plot.
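Putting the pieces from the last two segments together, here is a minimal tabular sketch of the kind of update D2 Q-learning performs. It is illustrative, reconstructed from the description in the talk rather than copied from the paper, and the handling of the quantile step sizes and bookkeeping is simplified.

```python
import numpy as np

def d2_q_learning_step(Q, theta, s, a, r, s_next, alpha=0.1, beta=0.01):
    """One tabular D2 Q-learning-style update (illustrative sketch).

    Q     : action-value table, Q[state, action]
    theta : length-m array of quantile estimates of the one-step reward
            distribution, for quantile levels evenly spaced in (0, 1)
    """
    m = len(theta)
    taus = (np.arange(m) + 0.5) / m               # evenly spaced quantile levels

    # Quantile-regression update of each reward-quantile estimate.
    theta += beta * (taus - (r < theta).astype(float))

    # Average-reward estimate taken as the average of the quantile estimates.
    r_bar = theta.mean()

    # Differential TD error and the usual Q-learning value update.
    delta = r - r_bar + np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * delta

    return Q, theta
```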

Segment 7 (30:00 - 35:00)

In this experiment, we set all our initial guesses for our estimates to zero: the average-reward estimate was set to zero, and the estimates for all of our, in this case, ten quantiles were set to zero. As learning progressed and the agent interacted with the environment, through D2 Q-learning it was able to learn the quantiles as well as the average reward, and we can see that the estimates do converge to the correct values. Just to clarify: when we run this algorithm, at the start the agent has no idea what the distributions are; it has no idea what the optimal policy even is. So it is learning, in parallel, what the optimal policy is and, at the same time, the distribution that is induced by that policy. To say that more formally, D2 Q-learning learns the optimal policy and its per-step reward distribution. Okay, I now want to take some time to highlight a really nice implication of this algorithm. It's hinted at in the statement I showed on the previous slide, shown here again: D2 Q-learning learns the optimal policy and its per-step reward distribution. The part of this statement that's really interesting and exciting, shown here in purple, is that this distribution is singular: there is just one of them. To see why this is such a nice thing, we have to go back to a previous slide where we were comparing distributional approaches for standard RL versus differential RL. I'll admit that when I presented that slide, it was actually underspecified mathematically, so I'm going to correct that now. The part that was underspecified was related to the distributional objective for standard RL. To correct it, I'll move everything to one side so I have a little more space. The key thing I left out is that in the standard RL formulation, when we're trying to learn our distributional objective, we actually have to learn a distribution for every single state-action pair. What that means is that when we count the number of distributions we need to learn in the standard RL formulation, that number depends on the size of the state and action spaces; the larger those spaces are, the more distributions there are to learn. Conversely, with differential RL, it doesn't matter how big or small your state and action spaces are, or what kind they are: under the standard assumptions, the number of distributions to learn is always one, whether the spaces are discrete or continuous. You always just have to learn a single distribution. So there are two exciting implications. The first, which I've already alluded to, is scalability: this scales really well with the size of your state and action spaces, and it doesn't increase the number of learnable parameters if you have a really complex state and action space. The second is related to interpretability: this is really easy to interpret.
On the previous slide, we saw how we could put our single distribution in a single plot and get an idea of the rewards the agent is experiencing. If we wanted to do the same for the standard RL formulation, it's not as straightforward, because we have so many distributions to choose from: do we pick some representative state-action pair? It becomes more ambiguous. With differential RL, we're no longer faced with that dilemma or that extra complication. So here we're really starting to see how embracing differential RL can make the whole risk-aware aspect of things a lot easier to accomplish. That's more or less what I wanted to say about distributional RL, but I do want to quickly mention that in the paper we perform some experiments on more complex environments. We tested our D2 algorithm on three Atari environments, and we found that, in addition to the scalability and interpretability benefits, you do get better performance by using D2 Q-learning than by using a deep-learning version of the baseline differential Q-learning algorithm we talked about previously. So there are some benefits related to performance as well, but in this talk our focus is primarily on doing more responsible decision-making from a risk-aware perspective. Okay. So now we've arrived at our final section. We've gotten a taste of how embracing differential RL can make risk-aware decision-making a bit more tractable, and now I really want to take the point home and show you a really exciting case study.

Segment 8 (35:00 - 40:00)

We're going to optimize the well-known conditional value-at-risk measure in a really efficient, essentially first-of-its-kind way for reinforcement learning. Just to refresh our minds: with risk-aware decision-making, we still want our agent to pursue rewards and try to maximize rewards, but to do so in a more cautious manner, by trying to avoid potentially catastrophic outcomes. As we talked about previously, the way you do that mathematically is by optimizing some aspect of your distribution beyond the mean. A really popular approach in the literature is to optimize the tail of your distribution: we want to find the policy that induces the distribution of rewards with the, quote-unquote, best tail. Mathematically, a risk measure that accomplishes this is the conditional value at risk. You can think of it as follows: first you figure out where you define your tail, so you pick some quantile to define the tail, and the reward at that quantile is called the value at risk. Then the conditional value at risk, often abbreviated CVaR, can be thought of as the expected value within this tail. CVaR is a very popular risk measure: it's used quite a lot in the RL literature and also widely outside of reinforcement learning, in all sorts of industries for risk management, such as finance and insurance. However, in reinforcement learning contexts, it turns out that optimizing CVaR is an incredibly difficult problem, because you essentially have a chicken-or-the-egg problem: in order to optimize CVaR, you need to know the optimal value at risk beforehand, but you don't know the optimal value at risk until you've optimized CVaR. There have been a lot of really clever approaches in the literature that do allow you to provably optimize CVaR, but to date they've all come at a cost, and that cost is efficiency and scalability. In particular, in the standard RL formulation, all prior algorithms that provably optimize CVaR require either augmenting the state space and/or performing some sort of explicit bi-level optimization. For example, to provably optimize CVaR, you might have to solve a bunch of different MDPs and then pick the best solution as your final one. There are other approaches that let you get away with solving a single MDP, but at every single iteration or step of your algorithm you have to use a solver to solve a standalone optimization. So to date, optimizing CVaR has come at the cost of efficiency and scalability.
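For concreteness, one common way to write the two quantities just described, for the lower tail of a reward distribution at level τ, is shown below; this is standard notation, not necessarily the exact expressions on the slides.

```latex
% Value at risk: the \tau-quantile of the reward distribution
\mathrm{VaR}_{\tau}(R) = \inf\{\, x \;:\; \Pr(R \le x) \ge \tau \,\}

% Conditional value at risk: the expected reward within that lower tail
% (for a continuous reward distribution)
\mathrm{CVaR}_{\tau}(R) = \mathbb{E}\bigl[\, R \;\big|\; R \le \mathrm{VaR}_{\tau}(R) \,\bigr]

% An equivalent variational (Rockafellar--Uryasev style) form, whose
% maximizer over b is the value at risk:
\mathrm{CVaR}_{\tau}(R) = \max_{b \in \mathbb{R}} \Bigl\{\, b - \tfrac{1}{\tau}\,
    \mathbb{E}\bigl[(b - R)_{+}\bigr] \Bigr\}
```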
Another thought we might have is this: in the previous section we talked about distributional RL, and we looked at D2 Q-learning, which we could use to learn quantiles. After all, the value at risk is a quantile, so surely there must be some way to leverage the gains we get from distributional RL to optimize CVaR. But it turns out, at least to date, that it's not as easy as it seems, and fundamentally the core issue is that distributional algorithms, including D2 Q-learning, are fundamentally risk-neutral algorithms, and a lot of the theoretical guarantees associated with them break down the moment you try to deviate even slightly from risk neutrality. So the situation looks a little hopeless. But it turns out that, in differential reinforcement learning, the TD error is all you need, and for the remainder of this presentation I'm going to show you that we're able to solve this really hard problem of optimizing CVaR essentially just by minimizing the TD error. This follows a paper that we published at last summer's Reinforcement Learning Conference, where we proposed a framework called reward-extended differential RL and showed how we could use it to optimize CVaR in a really efficient manner. As before, I've put the QR code on the slide in case you want to follow along in the paper. Okay. So, in differential RL, the TD error is all you need. What in the world do I mean by this? To get an idea of where this comes from, I want to go back to our trusted differential Q-learning baseline and focus on a part that I briefly alluded to at the start, but that you might have found a little curious: the average-reward update, and in particular the fact that you can update this estimate using the TD error.

Segment 9 (40:00 - 45:00)

Just to formalize it a little on the slide: in differential RL we have to maintain an estimate of our average reward, and what's really nice about differential Q-learning is that it was shown in prior work that if you update this estimate with the TD error, then, at least in tabular settings, this estimate will converge to the optimal average reward, just by minimizing the TD error. Another way you could phrase this is that you have two learning objectives that are both solved by minimizing the TD error, so you could say that in differential Q-learning, the TD error is all you need. In the paper we published at last summer's RLC, we wanted to see how far we could take this idea. Is it just a fluke that you happen to be able to solve these two objectives by just minimizing the TD error? Or is the TD error actually a lot more powerful than we thought, and can we use it to solve all sorts of interesting learning problems? In the paper we explore this idea, and we find that it's not just a fluke: the TD error really is a lot more powerful, and we use the slogan that in differential RL the TD error is all you need to capture the idea that it can be used to solve a variety of learning problems. So I'm going to present the framework that allows us to leverage the TD error in this way. But before I do, I want to briefly tie it back to our original goal of optimizing CVaR, so we don't lose sight of what we're working towards. I claim that in differential RL the TD error is all you need. That means that if we want to optimize CVaR in the differential RL setting, we should be able to do so with just the TD error, and that's going to inspire the following approach. This approach is going to leverage the framework I'll present on the next slide, but I want to present it at a high level first, just so we see how we're going to apply it. In this approach, we're going to maintain estimates of our CVaR and our value at risk, and then, in a similar way to what we saw with differential Q-learning, we're going to see that we can get these estimates to converge to their optimal values: the CVaR estimate will converge to the optimal CVaR, and the value-at-risk estimate will converge to the optimal value at risk, just by minimizing the TD error. So now I'm going to present the framework that allows us to accomplish this, and I'll present it in a generic way. This framework, which we call reward-extended differential RL, or RED RL for short, is actually not specific to CVaR; there's nothing inherent about CVaR that makes it work. Of course, we derived it with optimizing CVaR in mind, but it's not strictly confined to the realm of CVaR. Here we have the Q-learning version of this framework, which we call RED Q-learning.
You'll see that we've taken our baseline algorithm and added a line at the top and a line at the bottom. In the line at the top, we've taken our regular one-step reward and, quote-unquote, extended it with n learning objectives that we call subtasks, denoted by these z values. So we have an extended reward that is essentially a linear, or potentially piecewise-linear, combination of the reward and these learning objectives. Then we have the line at the bottom, which we use to update our subtask estimates; essentially, that is how we solve the subtasks. You'll see that, similar to before, we have an update rule, and as before, you can think of the first two terms as step-size parameters. We then update each subtask with a term that, in the paper, we call the reward-extended TD error. We call it that because, in the paper, we derive this quantity in such a way that it is coupled to the TD error, and this is the key result in the paper: we show that the reward-extended TD errors go to zero, for all of your subtasks, as the regular TD error goes to zero. What this means is that if you just minimize the TD error, which is something you have to do anyway to learn your value function in the first place, then you're able to solve all of these subtasks simultaneously, essentially for free. The way I think of it in my mind is that the TD error has a kind of gravity: as the TD error goes to zero, it pulls the reward-extended TD errors to zero as well. If you're interested in how we derive the reward-extended TD error, it's shown explicitly in the paper, so I encourage you to check it out; it's a little more detailed than we have time for today.

Segment 10 (45:00 - 50:00)

But this now gives us a way to optimize CVaR, where we can use the value at risk as a subtask, and that's what we'll do now: we'll apply this generic framework to optimize CVaR. To apply the framework, you basically have to do two things. The first is to define your extended reward function, and the second is to derive the reward-extended TD errors. To define the extended reward function, we can leverage what you might call domain knowledge about CVaR, because again, CVaR is extremely popular and has been studied quite rigorously in many domains, so there are a lot of expressions for CVaR; I've shown two of them here. The point of this slide is not necessarily to understand where these expressions come from; it's to motivate how we derive our extended reward function. We're going to use this expression here to motivate where that extended reward function comes from, like so. The only other thing we need is to derive the reward-extended TD errors for our CVaR and value-at-risk estimates. In the paper we provide a formula you can apply; here we've applied that formula, done a little algebraic simplification, and arrived at the final algorithm, which we call RED CVaR Q-learning. Here again we have our reward-extended TD errors, which for the value at risk are derived in a piecewise manner. In the paper we prove that this converges to the optimal CVaR and value at risk, given some assumptions, in the tabular setting. So now we're ready. Admittedly, this looks like a strange algorithm, so I would understand if you're a little skeptical, and we should probably check that this strange-looking algorithm actually works. To do that, we'll start with our trusted red-pill environment. Similar to before, I'll ask you to take a few seconds to think to yourselves: what is the optimal CVaR policy for red pill, blue pill? Okay, I was actually a little mean here, because the answer is that it depends. It depends on where you define the tail of your distribution. But the environment is defined in such a way that, for most values of τ, and really for any values we would actually be interested in from a risk-aware perspective, staying in the red state is the optimal CVaR policy. So that's what we're going to check now: whether RED CVaR Q-learning actually steers our agent to stay in the red state, because that state has the better CVaR. First, we'll see what happens when we apply our trusted differential Q-learning in this environment, and that's what's shown on this slide. On the y-axis we have the rewards experienced by the agent. The light blue line denotes the average of these rewards, and we see that it's at around -0.6, which indicates that the agent has chosen to stay in the blue state; that makes sense, because that is the optimal risk-neutral policy. And the dark blue line shows the CVaR associated with following that policy.
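To give a feel for how the pieces fit together, here is an illustrative, fully-online sketch in the spirit of what's described above: the variational CVaR expression supplies a per-step extended reward, the value-at-risk estimate is treated as a subtask (updated here with a simple quantile-regression step), and everything else is the differential Q-learning update. This is a reconstruction from the talk's description, not the paper's RED CVaR Q-learning pseudocode; the exact reward-extended TD errors derived in the paper differ.

```python
import numpy as np

def cvar_q_learning_step(Q, cvar_est, var_est, s, a, r, s_next,
                         tau=0.1, alpha=0.1, eta=0.1, beta=0.01):
    """One illustrative online update for CVaR-oriented differential Q-learning.

    cvar_est : running estimate of the CVaR of the one-step reward distribution
               (plays the role of the average reward for the extended reward)
    var_est  : running estimate of the value at risk (the tau-quantile),
               treated as a subtask
    """
    # Extended reward from the variational CVaR expression:
    #   b - (1/tau) * max(b - r, 0), evaluated at the current VaR estimate b.
    r_ext = var_est - max(var_est - r, 0.0) / tau

    # Differential TD error on the extended reward, and the usual value update.
    delta = r_ext - cvar_est + np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * delta

    # The "average reward" of the extended problem is the CVaR estimate.
    cvar_est += eta * alpha * delta

    # Subtask: quantile-regression-style update of the VaR estimate.
    var_est += beta * (tau - float(r < var_est))

    return Q, cvar_est, var_est
```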
Next we can check what happens when we apply RED CVaR Q-learning here. If we applied this algorithm and it did the same thing as our risk-neutral algorithm, that would mean our algorithm doesn't work, because we would converge to the blue state, which is not the optimal CVaR behavior. But thankfully, it turns out that our algorithm does work, and the agent chooses to stay in the red state, because it yields a higher CVaR, as evidenced here. Here we have the light pink line, which shows the average of the rewards the agent gets, and we see that it does have a slightly lower average compared to staying in the blue state, but it has the better CVaR, as shown in the dark red line. We also know, because we know what the distributions are, that this corresponds to the red state, since that's the one with the -0.7 average. So what's shown in this example is that our RED CVaR algorithm successfully allows our agent to choose the optimal CVaR policy, even if it comes at the cost of a slightly lower average reward. Another way to say it is that the agent finds the optimal risk-aware policy using RED CVaR Q-learning. Okay. So we've now seen in this really simple environment that the algorithm works. But maybe you're still skeptical; this is just a very simple environment, so you might demand that I present a slightly more complicated experiment. Okay, let's do it. Now we're going to consider the inverted pendulum environment, where the aim of our agent is to keep the pendulum standing perfectly upright.

Segment 11 (50:00 - 55:00)

As before, I'll ask the same kind of question, and you can take a few seconds to think to yourselves: what is the optimal CVaR policy in the inverted pendulum environment? This isn't so easy; it's a more complex environment, at least compared to red pill, blue pill, so I understand if it's not entirely obvious what the optimal policy would be. The way to think about it is to think about how the environment is designed. Typically, in the inverted pendulum, whenever the agent gets the pendulum into the perfectly upright position, it gets a reward of zero, and any time the pendulum is not perfectly upright, the agent gets increasingly negative rewards. So when we think about the best possible distribution of rewards, the one that gives the best CVaR, it's going to be a distribution of rewards made up almost entirely of zeros: something like a distribution that is almost entirely concentrated around zero. There might be some jitteriness that causes some slightly negative values, but you can basically think of the distribution that gives you the best CVaR as a distribution made up entirely of zeros. What's really interesting in this case is that the same distribution is also the distribution that gives you the best average. So for this experiment, if we squint a little, we can more or less treat the optimal CVaR policy as being almost the same as the optimal risk-neutral policy. What this means is that we can directly compare the performance of our risk-aware approach to a risk-neutral one and see what happens, and that's what we'll do here. As before, we'll start by seeing what our baseline differential Q-learning does. Here we see that it is able to successfully find the policy that balances the inverted pendulum, and it takes just under 4,000 steps to do so. We can then compare what happens when we apply RED CVaR Q-learning, and here are the results. There are two things to see in this plot. The first is that the algorithm is able to find the optimal CVaR policy, which, as we argued, is to keep the pendulum perfectly upright. The other interesting thing is that, in this case, acting in a risk-aware manner just so happened to be more advantageous for the agent: it actually converged to the optimal policy a bit faster than the risk-neutral approach, which was neat to see. So now we've seen two examples of this algorithm working the way we expect it to, and hopefully I've more or less convinced you that it works. But maybe there's one more thing to check, which relates to our estimates. The two plots I've shown are showing the actual rewards experienced by the agent, the actual reward averages and CVaRs. But as you'll recall, we had estimates for the value at risk and the CVaR that we had to maintain, and we might want to check the quality of those estimates.
And so we can do that. We'll first note that for red pill, blue pill, the CVaR that we get for the policy is around -0.8, and as we argued previously, it's about zero for the inverted pendulum. So now we can check how accurate our estimates are with respect to these true values. Here I'm showing the plots of our value-at-risk and CVaR estimates for both environments. We can see that for red pill, blue pill, our CVaR estimate, which is denoted by the red line, is around -0.8 as we would expect, and the CVaR estimate for the inverted pendulum is around zero, again as we would expect. The value-at-risk estimates also broadly make sense: they sit a little above the CVaR, and in both cases all of the estimates converge in a very stable manner. I'll write it out more explicitly on the slide: the CVaR and value-at-risk estimates converge to the actual CVaR and value-at-risk values. To me, this is the coolest plot in the entire presentation, because when we ran the algorithm, at the start the agent had no idea what the optimal CVaR policy was; it had no idea what the CVaR or the value-at-risk of either policy was. Just by learning, just by interacting with the environment in a fully online manner, just from experience, it was able to maintain a reliable estimate of both the value-at-risk and the CVaR, and it was able to act in an entirely risk-aware manner, all of this just by minimizing the TD error. To me, that's a really cool thing. And we also saw how this translated to a more complex environment with the inverted pendulum. The thing I really want to highlight is that we achieved all of these really cool risk-based results,
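To illustrate how value-at-risk and CVaR estimates can be maintained fully online from a stream of per-step rewards, here is a hedged sketch of a generic estimator: a stochastic quantile (pinball-loss) step for the VaR and a running average of the Rockafellar-Uryasev expression for the CVaR. This is only a sketch of the general idea under my own assumptions; the function name `online_var_cvar`, the step sizes, the tail level, and the synthetic reward stream are all made up, and this is not necessarily the exact update used in RED CVaR Q-learning.

```python
import numpy as np

def online_var_cvar(reward_stream, tau=0.1, alpha=0.01, beta=0.01):
    """Track lower-tail VaR_tau and CVaR_tau of per-step rewards fully online.

    Generic online estimator (illustrative), not the speaker's exact algorithm.
    """
    var_est, cvar_est = 0.0, 0.0
    for r in reward_stream:
        # Quantile (VaR) step: stochastic-gradient step on the pinball loss,
        # which drives var_est toward the tau-quantile of the rewards.
        var_est += alpha * (tau - float(r < var_est))
        # CVaR step: running average of b - (1/tau) * max(b - r, 0) with b = current VaR
        # (the Rockafellar-Uryasev form of the lower-tail CVaR).
        target = var_est - max(var_est - r, 0.0) / tau
        cvar_est += beta * (target - cvar_est)
    return var_est, cvar_est

# Example with a synthetic reward stream (illustrative values only).
rng = np.random.default_rng(1)
rewards = rng.normal(-0.5, 1.0, size=200_000)
print(online_var_cvar(rewards, tau=0.1))
```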

Segment 12 (55:00 - 56:00)

and we did all of this without the use of an augmented state space or an explicit bi-level optimization. That has a really exciting implication: RED CVaR Q-learning enables truly scalable and efficient risk-aware decision-making, which is really a first in reinforcement learning. We're excited in future work to see how far we can take this algorithm, how much we can scale it, and what kinds of behaviors we can get our agents to exhibit when we want them to act in a more risk-aware manner. My time is basically up, so I'll briefly state a final key takeaway, which hopefully I've convinced you all of: differential reinforcement learning methods have unique structural properties that make it possible to circumvent some of the typical challenges and non-trivialities associated with risk-aware decision-making. Very briefly, I want to provide some acknowledgements. I want to thank everybody at the Mechanical & Industrial Engineering department at the University of Toronto for providing such a stimulating and welcoming environment. Of course, I want to express my gratitude to my PhD adviser, Professor Lee, for his support, as well as to my labmates at the Dynamic Optimization and Reinforcement Learning Lab, some of whom I think are on the call. Lastly, I want to thank Ensur for supporting this research and for making this kind of more foundational research possible. With that, I want to thank all of you for tuning in and listening to this presentation. If any of this sounded interesting to you, or if you strongly disagree with some aspect, either way, please send me an email and I'm happy to chat. Thank you to everybody who tuned in, and I'm happy to take any questions from folks on the call.
