# Policy Gradient Methods | Reinforcement Learning Part 6

## Metadata

- **Channel:** Mutual Information
- **YouTube:** https://www.youtube.com/watch?v=e20EY4tFC_Q
- **Source:** https://ekstraktznaniy.ru/video/38932

## Transcript

### Segment 1 (00:00 - 05:00)

No course on reinforcement learning would be complete without a discussion of policy gradient methods. As we'll see, these techniques take a more direct approach to the problem statement of RL, and as a result many of the most effective models are from this category. For example, proximal policy optimization is a type of policy gradient method, and that's OpenAI's go-to RL algorithm; in fact, that's what they use to incorporate human feedback into ChatGPT's training. Considering how much that product has grown, it's pretty clear these techniques have serious real-world value. Now, I'm not going to cover proximal policy optimization; that's beyond the goal here. Rather, I'll cover the basics of policy gradient methods, which you'll need to know to understand the more sophisticated techniques. We have to walk before we can run.

But before we start walking, I should mention this is the sixth and final video in my series on reinforcement learning. If you find this video confusing, it may help to start from the beginning.

Now let's take the first step. As mentioned, policy gradient methods are a more direct approach to our problem. That problem is: we'd like to determine a policy π that will achieve a lot of reward. To refresh, a policy gives the probabilities of taking actions in any state. For example, if we were faced with a 2D continuous state space, and in any state we had four actions to choose from, then the policy gives us the agent's probability of taking each action in that state. Also, as the state changes, so do the action probabilities. So to reiterate: we need to determine a policy that will achieve high reward for whatever this environment is. Also, we're still in the world of function approximation, meaning there's a hopelessly huge number of atomic states we could be in, so we'll have to assume the policies of nearby states are similar.

Now, in the previous video our goal was to select a parameter vector w, which we would plug into some function q̂ such that it estimates the expected return of taking action a in state s. That is then used to determine a policy: in what we've seen, that policy would select the highest-value action while occasionally selecting an action at random for the sake of exploration. But now, with policy gradient methods, we're going to go completely around these value estimates. Our goal will be to select a parameter vector θ which directly determines our policy. So selecting a θ doesn't tell you any expected returns; instead, it directly specifies the probability of every action in every state, and as we'll see, this comes with some nice properties.

Now, for an easy explanation of the first policy gradient algorithm, the REINFORCE algorithm, I'll emphasize how it shares a very familiar shape with the previous algorithms we've covered. First, it's a Monte Carlo algorithm, so no bootstrapping: we'll be waiting until the end of each episode to observe full returns before making updates. Second, we'll need to specify a few familiar things before running the algorithm. Item number one is the functional form of our policy; that is, we need the mathematical function which gives us the probability of each action given any state and a parameter vector θ. This is where you declare whether you're using a neural net, a simple linear model, or something else. Next, we need to specify the initial θ and the step size α. So this suggests a familiar routine, one in which we'll be applying repeated updates to θ, which will move it from an arbitrary position to somewhere that gives us high rewards. Also, α will dictate how aggressively we apply that rule, and the best α, as we've seen in the past, will be problem dependent.

Okay, knowing this, and knowing the general shape of Monte Carlo algorithms, we can guess the algorithm will look something like the following. First, we'll loop over M episodes, and for each episode we'll sample a trajectory of states, actions, and rewards until the episode terminates. I should emphasize that this trajectory depends on the current policy dictated by θ. Next, for each step of the episode, we'll apply some update rule to θ, where we're adding in an α fraction of something, and that something is what we'll figure out. In preparation, what can we already say about this thing? Well, first, since we're adding it to θ, we can say it's a vector of the same length as θ. Second, it should make high rewards more likely; after all, that is our goal.

Okay, we'll keep these in mind as we try to fill in this missing piece. To do that, let's again consider a 2D continuous state space. This time, let's say that in any state there are three actions available to us: left, right, and down. For now, that's all I'll tell you about this environment. For the animation, I'll say that the lengths of these arrows give us the probabilities of taking each action, so this means there's a 50% chance of going left or right and no chance of going down. So how should we form our policy? Remember, we need to be able to produce these action probabilities for any state we may run across. Let's use an idea similar to what we did in the last video: let's fix two

### Segment 2 (05:00 - 10:00)

proto-points in the state space. Call them proto-points, and say each has its own fixed action probabilities. We can accomplish that by declaring each action probability to be a θ value, so θ has length six, and we can use these to form action probabilities for any given point we might be considering. We'll do that by looking at the distances between the state and these proto-points, and we can use those to determine weights; the action probabilities are then just the weighted average of the proto action probabilities. The closer the point is to a proto-point, the more similar their action probabilities are.

So this is a workable idea, but there is a complication. Because our θs are themselves action probabilities, they're subject to sum-to-one constraints, and those constraints will be nasty things to manage as we move over the space of θs, so let's avoid them. To do that, we need to go on a bit of a tangent. We need a new function, one which will map to a probability vector holding our action probabilities for left, down, and right. And what makes it a probability vector? That's a vector where the values sum to one, like we want, and all values are non-negative. Now, since we're creating three values under one constraint, we'll be mapping from a θ vector of length 2.
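As the transcript goes on to explain, the function that accomplishes this mapping is the softmax. Here's a minimal sketch of that parameterization in Python. One assumption on my part: the video's description ("setting both to zero makes all actions equally likely") is consistent with the down action's unconstrained value being pinned at 0, so two free parameters span every probability vector over three actions.

```python
import numpy as np

def softmax(logits):
    """Map unconstrained reals to a probability vector (non-negative, sums to 1)."""
    z = logits - np.max(logits)  # shift by the max for numerical stability; result unchanged
    e = np.exp(z)
    return e / e.sum()

def action_probs(theta):
    """theta = (theta_left, theta_right). Down's logit is pinned at 0 (my assumption),
    so a length-2 theta covers all probability vectors over (left, down, right)."""
    theta_left, theta_right = theta
    return softmax(np.array([theta_left, 0.0, theta_right]))
```

With `theta = (0, 0)` all three actions come out at 1/3, matching the "all actions equally likely" case, and pushing both values very negative makes down dominate, matching the bars-below-the-arrow picture.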
And I'm leading you a bit with the notation here: there's a parameter for selecting left and one for selecting right. To emphasize, the reason we're doing this is that these two values should be totally unconstrained, and they should be able to produce any probability vector we might want. Okay, so what function gets all this done? Well, I bet many of you could have guessed: it's the softmax function. It does exactly what we want by using the exponential function to map positive or negative θs to positive numbers, and then it does some normalization to make sure everything sums to one.

Okay, with this tool, let's get back to our original problem. What we've done is justify a different parameterization. Now, the question is: how will I represent it here? For a particular proto-point, I only have two θ values to represent, one for left and one for right. One idea is to represent them with bars, where the height gives the value and whether the bar is above or below the arrow tells us whether the value is positive or negative. What's important is that these bars determine action probabilities: this setting makes down very likely, this one makes left very likely, and setting both to zero makes all actions equally likely. I'll admit it's not the most natural visual, but it gets the θ values on screen, and that'll be important.

Okay, now let's say we have two proto-points. Once again, we distance-weight-average them to give us two values for the state we're considering, and those get passed through a softmax to give us action probabilities. This is how we construct our action probabilities for any state we may come across.

All right, in this setup I'll ask you a question. First, we initialize all θs to zero. Now let's say that during the run of the algorithm we run into this state, the left action is taken, and following that, a return of 10 is realized over the remainder of the episode. Here's the question: seeing this, how should we nudge our θs? That is, what should we have as our update rule? Hmm. Well, I certainly think the left action should become more likely, since it yielded a positive return. I also think the nudge should scale with the return: if we had only observed a return of 5, it makes sense our update should be half as large. With this, and if you know some of your multivariable calculus, you might guess this. What I've introduced is the gradient of the action probability at this state with respect to θ. For this video, it's useful to think of the gradient as the direction to nudge θ that maximally increases the action probability in this state. Whenever you see this, I want you to think that, so I'll say it again slightly differently: the gradient is the direction to nudge θ such that the probability of action a in state s is maximally increased. So what this rule is saying is: increase the probability of a positive-return action in proportion to the return; if the return is negative, decrease the probability. That sounds reasonable. In this case, we can picture the nudging like this. By design, it scales with the return, which matches our intuition. Also, if we had observed this return at a state much closer to a particular proto-point, then that proto-point's θs would get a larger nudge. That's excellent: this model knows that proto-points should be informed by observations of nearby returns, and that happens because the gradient flows through our model, which

### Segment 3 (10:00 - 15:00)

has a sense of distance. Okay, now let's run this update rule in a rather contrived circumstance. Let's say the agent will repeatedly encounter exactly this state. If the agent selects left, it'll always get a return of 10; if it selects down, it'll get a return of 5; and if it selects right, it'll get a return of −7. Now let's begin. The first randomly selected action is right, so we nudge in this direction. The next action is down, and this continues. If we wait long enough, eventually the strategy will learn to always pick the left action, the best action. Looks like we have a working rule.

Except we don't: our update rule is wrong. We can see this if we start with a wacky initialization. To show this, I'll put the evaluation point here, just to keep the animation decluttered. Now we'll start with this initialization, which corresponds to action probabilities of 20%, 60%, and 20% for the left, down, and right actions, and from here let's do the exact same thing. Okay, this is weird: it seems down is becoming the most frequently selected action. That's not right. What's happening is that even though the best action, selecting left, results in a 2x larger step than selecting down, that larger step doesn't make up for the fact that at initialization the left action is selected far less often than the down action, so the left nudge is applied less frequently. That means as we apply these updates, the down action probability will start to dominate, because it gets nudged often enough to compensate for its smaller step size. To correct this, we actually use this update rule, where we divide by the action probability. That scales updates to account for the frequency of their application. In fact, if you know your calculus, this can be equivalently written like this, where the natural log has snuck in. Just for good measure, let's see if this solves the problem. We'll restart with the imbalanced initialization once again, and now we'll run it with the new update rule. Excellent: we seem to be converging to the right answer of always selecting left, and again, that's because this new rule compensates for how often each action is selected.

So now we can finally complete the REINFORCE algorithm we mentioned earlier; we can see the algorithm in its full, fairly simple glory. For each time step within each of M episodes, apply the update rule as we just covered. Also, I snuck in the discount parameter γ, which doesn't change the spirit of the algorithm.

Okay, now let's apply it to something. This is an example I made up, so don't look for it anywhere else. I call it Windy Highway, because I'm bad at naming things. Here's the setup. Like before, we have a 2D continuous state space, and we'll represent the agent's state as a point. To view the actions and transitions, we'll need to zoom in a bit. We have three actions: up-left, straight up, and up-right. If the agent selects a particular direction, they'll move roughly in that direction, plus a small bit of randomness. For example, if up-left is selected, the state may move here; on a different selection of the same action, the agent may land here, which is slightly different from before. The same applies for the other actions.

Okay, now let's talk rewards. First, there's no discounting. Second, the rewards are fully determined by the agent's horizontal position, which means we can picture the rewards like this: if the agent lands here, they'll get a reward of one half. So the agent wants to spend a lot of time in this region. Next, I lied about how actions yield the next state. That was the no-wind case, but sometimes there is wind. That is, there's a horizontal wind, which is again only a function of the horizontal position. The magnitude of that wind is given by this curve, where a positive number indicates leftward wind and a negative number rightward wind. So the wind is pushing the agent in these directions. Finally, every episode begins at a random position at the bottom of the state space, within this area, and the episode ends when we get to the top of the state space. To get a feel for this, I'll show a few episodes where we select the same action every time, along with the returns at each state. Here's what happens when we only select up-left, and here's what happens when we only select up-right. Okay, let's solve this game using REINFORCE.
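Before watching it solve Windy Highway, the corrected rule is easy to check numerically on the contrived single-state example from earlier (left/down/right returns of 10, 5, and −7). This is my own sketch, not the video's notebook. The policy is the two-parameter softmax assumed earlier (down's logit pinned at 0); for that parameterization, the gradient of log π(a) with respect to (θ_left, θ_right) works out to (one-hot(a) − π), keeping only the left and right coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def action_probs(theta):
    # probabilities for (left, down, right); down's logit pinned at 0
    return softmax(np.array([theta[0], 0.0, theta[1]]))

returns = np.array([10.0, 5.0, -7.0])  # fixed returns for left, down, right
theta = np.array([-0.5, -0.5])         # skewed start: down is the most likely action
alpha = 0.02

for _ in range(10_000):
    p = action_probs(theta)
    a = rng.choice(3, p=p)             # sample an action from the current policy
    G = returns[a]                     # observed return
    # gradient of log pi(a | theta) for this softmax: (one-hot(a) - p),
    # restricted to the left/right coordinates (down's logit has no parameter)
    grad_log_pi = (np.eye(3)[a] - p)[[0, 2]]
    theta += alpha * G * grad_log_pi   # the corrected, log-probability update

print(action_probs(theta))             # left should end up dominating
```

Dividing by the action probability (equivalently, taking the gradient of the log) is what compensates for down being sampled more often at the start, which is exactly the failure mode the naive rule had.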

### Segment 4 (15:00 - 20:00)

And remember, we have a few things to specify before we can use it. Let's start with the model, and fortunately I basically already explained it: I'll be littering the space with proto-points, each with its own fixed pair of θ values. Like earlier, our policy at any point will come from taking a distance-weighted average of θ values and passing those into a softmax to give action probabilities. It's just like what we did earlier. Also, we'll initialize all θs to zero, and α will be 0.02.

Now, this proto-point θ view isn't very intuitive, so instead I'll show the actual policy at many points over the state space; this will be easier to interpret. Here I've colored the actions differently, and like earlier, their lengths tell us the action probabilities. Let's see this in action: what I'll do is flash an episode, and you'll see the policy slightly change. As is typical of RL, there'll be a lot of episodes to get through, so I'll be skipping quite a few; you can see the episode index here. All right, let's learn. All that's happening is the algorithm is learning to do more of what yielded high returns in the past, and less of what yielded negative returns. In this case, the policy we learn is to approach the high-reward region and to try to stay within it while fighting the wind. That makes sense; it seems our algorithm works. Another way of seeing it work is to plot the return G₀ for every episode. This is pretty noisy, so we can add a moving-average line to get a sense of the trend. In fact, we can reduce the noise further by re-running this 30 times and then averaging those moving averages. Okay, we can pretty safely say this algorithm works.

But let's break it. The REINFORCE algorithm is never used in the simple form I presented, because it suffers from an annoying problem. To see it, let's make a seemingly benign change: let's add 4 to all rewards. That shouldn't change too much, right? Well, it does: it slows down learning. I'll show you that slowdown later, when I can compare it to an algorithm that fixes the issue, but for now I'll explain why we have this problem. It comes from the fact that, in this case, the return is always large and positive. To see why that's no good, recall our update rule. Remember, this is the direction we need to nudge θ to make this action in this state more likely, and since the return is always positive, all actions in every state visited will be made more likely. In other words, the policy never learns to not do things; it only learns to do more of the good things, which is everything in this case. The optimal policy is only discovered when the really good things have had a long time to overwhelm the pretty good things.

So what's the solution? Well, the problem is the always-positive return, so let's replace it with this, where I'm introducing a new function called the baseline. Here's what you need to know about it. First, it can be almost any function without breaking the algorithm's convergence properties. Second, I said almost: it cannot depend on the action, and I'll explain why in just a second. Third, it's chosen to improve speed; that is, we'll choose it to solve our all-actions-get-more-likely issue. It turns out one natural choice for the baseline is an estimate of the value, the expected return, of being in state s. This is just like the estimates we've constructed in previous videos. The reason this is good is that then this quantity, whose sign determines whether actions become more or less likely, is positive for actions generating returns above the expected return from merely being in state s, and negative otherwise. In other words, we evaluate actions relative to what is expected in the state where they were taken. That solves our problem. Also, hopefully it's now clear why the baseline can't be a function of the action: if it were, it could favor some actions for us, and we want the environment to do that.

Okay, now to bring it together: the solution was to rewrite our update rule to this. To reiterate, this is useful because this is the direction to nudge θ to make an action more likely in a certain state, so we need this quantity to be positive for good actions and negative for bad ones, where good and bad are judged relative to some baseline. And this makes the next algorithm, REINFORCE with baseline, pretty easy to guess: it'll be just like the earlier algorithm, except we'll use the new update rule and we'll have to simultaneously learn a value function for the baseline. That would look like this. I'm not going to walk through this in
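The two-update structure just described, nudging the policy by how much better the return was than the baseline while simultaneously learning that baseline, can be sketched in code. This is my own illustration, not the video's notebook: the function takes the model-specific pieces (the policy's log-probability gradient, and a learned value function with its gradient) as arguments, since those depend on the chosen parameterization.

```python
import numpy as np

def reinforce_with_baseline_episode(episode, theta, w,
                                    grad_log_pi, v_hat, grad_v_hat,
                                    alpha_theta, alpha_w, gamma=1.0):
    """Apply REINFORCE-with-baseline updates for one finished episode.

    episode      : list of (state, action, reward) tuples, in time order
    theta, w     : policy and value-function parameter vectors
    grad_log_pi  : (state, action, theta) -> gradient of log pi(action | state)
    v_hat        : (state, w) -> estimated state value (the baseline)
    grad_v_hat   : (state, w) -> gradient of that estimate w.r.t. w
    """
    T = len(episode)
    rewards = [r for (_, _, r) in episode]
    for t, (s, a, _) in enumerate(episode):
        # Monte Carlo return from step t (discounted)
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        delta = G - v_hat(s, w)  # how much better than expected was this?
        # improve the baseline, then nudge the policy by the *relative* return
        w = w + alpha_w * delta * grad_v_hat(s, w)
        theta = theta + alpha_theta * (gamma ** t) * delta * grad_log_pi(s, a, theta)
    return theta, w
```

Note the sign logic the transcript cares about: once the baseline is learned, adding a constant to all rewards shifts G and v̂(s) together, so δ, and hence which actions get more or less likely, is unaffected.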

### Segment 5 (20:00 - 25:00)

detail; it's exactly as I described. But I'll call out a few things. First, we now need to specify two functional forms: one for our policy and one for the value function. That's certainly a heavier lift. Second, we have two step sizes to choose, and we should expect that the choice of one will impact the best choice for the other, which complicates things a bit.

Okay, now let's apply this to the all-positive-rewards version of Windy Highway. In this case, we'll define the policy exactly as we already did, and we'll define the value function in an analogous fashion. The parameters will be initialized to zero, and the step sizes will be this and this. Since the value function needs to be learned before we can properly judge actions, we give the value function a larger step size. Now, with all that, we can bring back Windy Highway and rerun the algorithm. This time we have something, the value function, to compensate for all those positive rewards. To emphasize: the value function is learning the value of being in any state, so the policy can be updated according to the relative value of actions in those states. But this view doesn't show how much the baseline has helped us. To see that, we rerun the no-baseline version 30 times, to give us this average of moving-average returns, and here is the same thing for the algorithm with a baseline. As you can tell, including the baseline creates quite an improvement. And if you're still curious, I've included a notebook in the description; there you can find details that should extinguish any lingering confusion.

Okay, everything I've said so far has been very example-motivated, but there's a theoretical perspective behind it. What I'm talking about is the policy gradient theorem, and that starts with a question. What are we actually trying to do when selecting θ? That is, what is our objective to optimize with respect to θ? Well, in the episodic case, it must be the state value of the starting state, which is the expected return from that state. For simplicity, we're assuming there is a fixed starting state s₀; generalizing to a random starting state doesn't change anything important. The point is, we'd like to choose θ so that we get a high expected return. That's been the goal all along. Also, to emphasize, this is a true theoretical quantity; in practice we never have access to it, only estimates of it.

Now, the theorem states something remarkably useful. It tells us about the gradient of that state value with respect to θ. In particular, it says it's proportional to something, and proportionality is all we'll ultimately need, since how far we step in a direction is already a choice determined on a problem-specific basis. Okay, and that something is this, which I'll break down. First, this is something we've seen before: it gives us the probability of being in a particular state if we bounce around according to the policy, and it means we're taking an average of something, weighted by these state probabilities. And what is that something? It's this: a weighted sum over actions in a particular state. The weights are the true action values; remember, action values are the true, impossible-to-actually-know expected returns of taking a given action in a given state. And the things we're weighting are the gradients: the direction to move θ if you want to make a given action more likely in a given state. So think of this whole inner sum as the weighted-sum direction to move θ if you'd like to make high-return actions more likely when in state s, and the entire expression is just that, averaged over states, again weighted by the probability of being in each state. More simply, it's a big weighted average of θ directions that make high-return actions more likely. Considering the left side is the direction to maximally increase the starting expected return, that sounds pretty reasonable.

Now, why is this useful? Because it allows us to develop algorithms that will optimize the objective. In effect, if an algorithm approximates this direction, it'll approximately optimize the objective. In other words, if any algorithm we develop nudges θ in a direction that on average equals this theoretical direction, then it'll move us toward high-return policies. In fact, if you look at the derivation of some policy gradient methods, you'll see they manipulate these terms to permit sampling of gradients whose expectation equals this theoretical direction. All right, and why is this remarkable? Because it only involves terms we can handle: this we can calculate exactly, this we can estimate with the methods we've seen, and this will naturally get factored in, since we'll be applying updates as we bounce around the state space according to this distribution. To spark some
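For reference, written out in the episodic notation of Sutton and Barto (whose book this series follows), the proportionality just unpacked is:

```latex
\nabla_\theta \, v_{\pi}(s_0) \;\propto\; \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a)\, \nabla_\theta \, \pi(a \mid s, \theta)
```

where μ(s) is the on-policy state distribution (the probability of being in state s while bouncing around under π), q_π(s, a) are the true action values, and ∇_θ π(a | s, θ) is the direction to nudge θ to make action a more likely in state s.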

### Segment 6 (25:00 - 29:00)

appreciation: it doesn't involve anything nasty like the gradient of the state distribution. The state distribution depends on our policy, so that gradient might have appeared, and estimating how the state distribution changes with θ would be extremely messy and noisy. So it's good news we don't need to.

All right, as a final note, I'll make some general comments on policy gradient methods, which I'll now call PGMs. First, PGMs always involve smoothly evolving action probabilities. So if these are the action probabilities of a PGM model at some state, and these are the action probabilities of an ε-greedy value-based method like we've covered in previous videos, then as we process episodes we might see an evolution like this. Notice the value-based method can make large jumps in its probabilities. That's because it assigns high probability to the highest-value action, and that's something that can change abruptly with changes in value estimates. This has a few consequences. First, it can make naive PGMs slow and inefficient: the value-based method is willing to make big leaps in action probabilities, while the PGM approach is much more patient. Second, it gives PGMs nicer convergence properties, since they're always able to make small advantageous adjustments; a value-based method, on the other hand, can get stuck due to its discretely determined, extreme action probabilities. So I'll summarize these ideas like this.

Next, a PGM solves only for the policy. In a sense, it's doing the least it can to solve the problem, and that can be either a bad thing or a good thing. It could be bad, since it may ignore relevant information: for example, a model-based method might learn the overall dynamics of the environment, which might not impact the immediate policy but may be useful for generalizing the policy to future circumstances. Or it could be good, in that it may ignore irrelevant complexity: there may be environment dynamics that have no impact on the best policy and would otherwise soak up modeling capacity.

Next, PGMs can learn stochastic policies. Remember, in our ε-greedy value-based approaches, the level of stochasticity was determined by ε, which gave us the probability of selecting an action at random. We needed to fix the choice of ε up front, which meant it wasn't learnable. But the level of stochasticity might be a component of the optimal policy: for example, if the agent were learning to play poker, there might be a circumstance where it needed to genuinely flip a coin. That wouldn't be possible with the ε-greedy approaches.

And the last one I'll mention: PGMs avoid needing to argmax over actions. Remember, in our value-based methods we have to repeatedly determine the max action value, and with many possible actions that could be a slow operation. With PGMs, there's no need for this, which makes them friendly to high-dimensional action spaces.

And that's it; that's the end of my series on reinforcement learning. I really hope it's made RL concepts intuitive, interesting, and maybe useful to you one day. Now, before I sign out, I'd like to show some appreciation to my sources. The most significant was Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto. I started this series because I read this book in 2021 and thought these ideas were so general and useful they needed to be animated. Also, it's a phenomenally well-written and self-consistent textbook; if you're serious about RL, it's an absolute requirement. DeepMind also has a great lecture series that was quite helpful; there's a lot more information in those lectures than here, so check them out if you're still hungry. Lastly, OpenAI had some excellent material; in particular, their Spinning Up documentation was important for connecting these theoretical ideas to existing RL systems. I've included links to all of these in the description, so if you're eager, there's plenty of food there. But that's it. Again, I hope you enjoyed it, and I'll see you again soon. All right, I'm done.
