Stanford CS221 | Autumn 2025 | Lecture 9: Policy Gradient




Table of contents (15 segments)

Segment 1 (00:00 - 05:00)

So we're in the midst of talking about reinforcement learning. Last time we introduced the basic reinforcement learning algorithms, and this is going to be the second lecture on reinforcement learning, where we generalize to larger state spaces and explore policy-based algorithms that allow you to learn the policy directly. So this is going to bring us closer to the state of the art of what people are actually doing in reinforcement learning. We started with MDPs. We understood what they are. We looked at Q-learning and SARSA and all these things. And now we're building up so that hopefully, by the end of this lecture, you'll have the key ingredients to understand basically any cutting-edge RL algorithm. Okay, so let's start by reviewing reinforcement learning. Remember, reinforcement learning is more of a setting than necessarily a particular algorithm, and the setting is very elegant. It's in some sense the most general way, I think, to think about AI. You have an agent. The agent takes actions in an environment, and the environment gives rewards and observations back to the agent, and along the way the agent is meant to make sense of the rewards and observations and choose better actions for subsequent iterations. In what we'll study, the environment is defined by a Markov decision process. Remember the flaky tram MDP problem, where the goal is to go from location 1 to location 6, and the tram can fail with probability 0.1. Walking takes you from i to i+1; the tram takes you from i to 2i. And the agent in this setting is going to be an RL algorithm. We looked at multiple RL algorithms last lecture. In particular, we looked at Q-learning, where you define some sort of exploration policy, which might walk or tram randomly. And then we looked at epsilon-greedy, where the get-action function would either choose to explore according to this policy or exploit the policy that it's learning along the way.
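The epsilon-greedy rule just described can be sketched in a few lines. This is a minimal illustration, not the course's actual code; `Q`, `ACTIONS`, and `get_action` are illustrative names, with a dict of Q estimates standing in for whatever the learner maintains.

```python
import random

ACTIONS = ["walk", "tram"]

def get_action(Q, state, epsilon, rng=random):
    if rng.random() < epsilon:
        # Explore: follow the exploration policy (here, uniformly random).
        return rng.choice(ACTIONS)
    # Exploit: act greedily with respect to the current Q estimates.
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))

Q = {(1, "walk"): -3.0, (1, "tram"): -2.0}
print(get_action(Q, 1, epsilon=0.0))  # "tram": pure exploitation picks the higher Q
```

With epsilon = 1 it always explores; with epsilon = 0 it always exploits, which is the knob the lecture refers to.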
And then the learning rate governs how quickly the RL algorithm takes steps to estimate its Q values. Okay, so here's what an example interaction looks like, from the agent's perspective. Remember, last time there was a simulate function which called the environment and called the agent, but I'm inlining this trace so you can see what the agent would be called with. Initially you start in state 1. So the environment says: give me the action at state 1. The action goes into the environment, and then the environment comes back and says: okay, here's what happened. You were in state 1, you walked to state 2, you got a reward of -1, and you haven't ended yet. So now the environment calls the agent with get-action at state 2, and the agent might return, let's say, walk. And the environment would say: okay, agent, here's what happened. You were in state 2, you walked, you got to state 3, reward -1, you still haven't ended. Then it calls the agent again: "Agent, in state 3, what would you do?" And the agent might say, "I'm going to take the tram now." And the environment updates: "Okay, agent, in state 3 you took the tram, you ended up in state 6, you got a reward of -2, and now you're done." Okay, so this is a particular rollout, and each rollout generates a utility, which is a number representing the discounted sum of rewards. You take these rewards, and if the discount is 1, then the utility is -1 - 1 - 2 = -4. In particular, it's important to remember the isolation here: the RL algorithm doesn't see the MDP. We're not passing the MDP into the RL algorithm. Everything it sees is through get-action and incorporate-feedback. In particular, incorporate feedback.
That's the glimpse of the world it sees. And it's using that feedback to try to figure out what to do next. Okay. So if you generate a ton of rollouts, let's say 20 rollouts and you average the utilities of all those

Segment 2 (05:00 - 10:00)

rollouts, that defines the value of this RL algorithm against this MDP. Okay, so remember a policy is a fixed mapping from states to actions. An RL algorithm is a dynamic object that will update its internal policy as it learns from interacting with the environment. All right, so that's the RL setting. In order to define RL algorithms, it's useful to define a bunch of different types of values. We saw all four values; let me just put them here so we have a reference. A value is synonymous with expected utility. So we first saw V^pi(s). This is a number that represents the value of following policy pi from state s: if I got dropped into state s, followed policy pi, and did an infinite number of rollouts, what would be the expected utility? That's what that number represents. Q^pi(s, a) is the value, in state s, of first taking action a and then following policy pi. Remember, the Q's are on the chance nodes and the V's are on the state nodes. And if you replace the pi with a star, we're basically saying: take the optimal policy and do that. So V*(s) is the value of following the optimal policy from state s, and Q*(s, a) is the value of first taking action a and then following the optimal policy. Okay. So now that we've defined all these values, there's the question of whether your algorithm models the MDP or not. If you're model-based, that means you first estimate the MDP, and we saw how to do that last time: you just count the number of transitions and record the rewards that you see along the way. And then you compute the optimal policy using value iteration. And then what we saw was the first type of model-free method, which is a value-based approach, where we don't bother to estimate the transitions and rewards directly. We don't estimate the MDP.
Instead we estimate the Q values directly, and once you have the Q values, you basically look at the action that gives you the highest Q, and that's your policy. So going from Q values to a policy is relatively straightforward; going from rewards, or the MDP, to Q values is not as straightforward. Okay. So for the value-based methods, the general form of the algorithms, and this applies to model-free Monte Carlo, SARSA, and Q-learning, is that we're estimating Q(s, a), and when we get a piece of feedback, we have some sort of new point of view on what Q(s, a) should be. No one tells us Q(s, a) directly, but we have some sort of noisy estimate, and we're going to call that a target. A target is an estimate of the utility. Then, when we do the update, we simply take our old Q(s, a) and move it in the direction of the target by an amount that depends on the learning rate. If the target already equals Q(s, a), then this doesn't do anything; but if the target is above Q(s, a), then we're going to nudge Q(s, a) toward the target. Okay. So, a few more things. There's the question of on-policy versus off-policy. What does this mean? In general, in an RL algorithm, you are following some policy, call it the exploration policy, to generate rollouts. This is the policy that's giving rise to the feedback. Now, using that feedback, the algorithm is going to estimate the Q values of a policy which we call the estimation policy. On-policy just means that the exploration policy and the estimation policy are the same: the data that you get comes from the policy that you're trying to estimate. And off-policy just means that the data that you generate

Segment 3 (10:00 - 15:00)

is from a policy that's not the estimation policy. So when we looked at SARSA, that is an on-policy algorithm, because SARSA tries to estimate Q^pi, the Q values of the policy that you're following, and Q-learning was off-policy because you're trying to estimate Q*. In Q-learning, the really interesting thing is that you could be following a random policy, but you're estimating the Q values corresponding to the optimal policy. Off-policy algorithms are very nice because then, in some sense, you can take data that comes from any source and use it to estimate the policy that you want, which is often going to be the optimal policy. Okay. So what do we use as targets? Remember, the general form of the algorithm is that you're going to estimate Q and update towards the target. There are two options. One is full rollouts, which means that from state s you actually roll out all the way to an end state, take the discounted sum of rewards, and use that utility as the target. The problem with that is that you have to wait until the end to get a number. Bootstrapping says: let's not wait until the end. I'm just going to take the immediate reward, and then, instead of rolling out to the end, I'm going to use my current estimate of the future utility. So I'm bootstrapping my utility from a model estimate. Okay. So in light of all these concepts, we can now think about the four algorithms that we looked at last time. We first looked at model-based value iteration, where we estimate the MDP first and then compute the optimal policy using value iteration. We then looked at model-free Monte Carlo, which says: you follow some policy pi, look at the full rollouts from a particular state taking a particular action, and set the target equal to that utility.
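The general value-based update described above, nudging the old estimate toward a noisy target by an amount set by the learning rate, can be sketched in the tabular setting. A minimal illustration; `Q`, `eta`, and `update` are illustrative names, not the lecture's actual code.

```python
def update(Q, s, a, target, eta=0.1):
    """Nudge the current estimate Q[(s, a)] toward a noisy target.

    Q      : dict mapping (state, action) -> estimated Q value (tabular setting)
    target : a (noisy) estimate of the utility of taking a in s
    eta    : learning rate; eta=0 ignores the target, eta=1 jumps straight to it
    """
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + eta * (target - old)  # equivalently (1 - eta)*old + eta*target
    return Q[(s, a)]

Q = {}
print(update(Q, 1, "walk", target=-4.0, eta=0.5))  # -2.0: moved halfway from 0 toward -4
```

Model-free Monte Carlo, SARSA, and Q-learning all share this update and differ only in what they plug in as `target`.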
So there's no bootstrapping; I'm just using the Monte Carlo estimate of the utility. SARSA says: I'm still going to try to estimate Q^pi, so I'm on-policy, but I'm going to use bootstrapping, where instead of the full utility, I take the immediate reward plus the discounted Q value of the successor state and action. I don't bother to roll out and figure out what happens at the end; I just use the Q value. And Q-learning is almost the same thing but with one small change: now I'm estimating the value of the optimal policy, so it's an off-policy algorithm. It's still using bootstrapping, which is why it looks like SARSA, and the only difference is that I stick a max over a' in here instead of letting a' be pi(s'). Okay, I know that was a lot. What I'm trying to do is break it down into the principal components: off-policy versus on-policy, model-based versus model-free, bootstrapping versus not bootstrapping, and so on, so that you can see each of these algorithms not as an isolated thing, but see the rich interconnections between them. Yeah, so the question is: when would you choose SARSA versus Q-learning? The difference between them is on-policy versus off-policy, so the short answer is that it depends on your particular application. But the nice thing about off-policy is that you can explore more aggressively and still be guaranteed that you're estimating the optimal policy, whereas if you use SARSA, as you run the algorithm you essentially have to play much closer to the optimal policy, because otherwise you would be estimating the Q^pi of the wrong pi, not the optimal one. So the question is: what's the difference between SARSA and Q-learning, and what does this target really mean? Remember, this target is what is computed in

Segment 4 (15:00 - 20:00)

the incorporate-feedback step: in incorporate-feedback you compute the target and then you update the Q values. The get-action function is actually the same for SARSA and Q-learning: if you do epsilon-greedy, you either explore or you take the current Q values and choose the best action from there. Now, if you look at this, it's in some sense the same difference as when we looked at policy evaluation versus value iteration. In policy evaluation, the recurrence essentially takes the single action given by the policy, and in value iteration you take a max over possible actions, and this reflects the objective. If you're interested in what the target is supposed to represent: the target says, I'm at a particular point, I get some reward, and then what? Q^pi means I'm following the policy pi, so that's why a' has to come from policy pi, whereas Q* means I'm going to follow the optimal policy, and that's why there's a max over a': if this really were the true Q*, the max over a' would pick the best action, which is the action taken by the optimal policy. So Q-learning is trying to estimate the optimal Q values, Q*. And the actions it takes, well, that depends on the exploration policy, but they need not be optimal. Okay, let me move on. So far we're estimating a Q value for every state-action pair. This is called the tabular setting in reinforcement learning, and the reason is that you essentially have a lookup table, literally a dictionary, the way we implemented it last time, with an entry for every state-action pair. But in real-world settings you can't do this. For example, if you're trying to learn a robot policy, your state might be an image of what the robot sees and your action is how you're going to move the robot arm.
Or if you're doing things with language models, your state might be an entire sentence or a paragraph, let's say, and then your action is to generate the next token. In each of these cases the state space is huge: all possible images, or all possible sentences. You certainly can't have a lookup table that indexes by an image; it just makes no sense. So how should we deal with this? Any ideas? Yeah. So the suggestion is that you have to reduce the state space. Yes, you could definitely reduce the state space to something smaller and then do Q-learning or whatever on top of that. That's certainly one thing you could do. What we're going to do is try to get even more directly at the problem, but I think the intuition of state-space reduction is right. Okay. So function approximation, in the context of reinforcement learning, is used to contrast with tabular learning. Nowadays, with deep RL and all these things, function approximation is just a given, because whenever you have a large state space there's no other choice. Remember, in the tabular setting you have a lookup table: for every (s, a) pair you store a number Q(s, a). Function approximation says: wait a minute, what did we learn from machine learning? That you can actually learn very complicated functions of very high-dimensional inputs. So s and a, let's say they're an image or a sentence, are very large objects, and we know how to map them to a number, right? So why don't we just parameterize this function with some parameters theta, and then try to estimate the parameters theta directly, instead of trying to estimate Q directly?
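A parameterized Q like this could look as follows. This is a pure-Python sketch under assumed names (`linear_q`, `mlp_q`, `theta`, `W1`, `w2`); it operates on an already-computed feature vector f(s, a), with a linear model and a small multi-layer perceptron as the two simplest choices.

```python
def linear_q(theta, f):
    # Q_theta(s, a) = theta . f(s, a): a dot product with the parameter vector.
    return sum(t * x for t, x in zip(theta, f))

def mlp_q(W1, w2, f):
    # Multi-layer perceptron: multiply by a matrix, apply a nonlinearity (ReLU),
    # then map the hidden vector down to a single scalar Q value.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, f))) for row in W1]
    return sum(w * h for w, h in zip(w2, hidden))

f = [1.0, 0.0, 0.0]                   # some feature vector f(s, a)
print(linear_q([-2.0, 1.0, 0.5], f))  # -2.0
```

In practice you would build these with a framework and learn theta (or W1, w2) by gradient descent; the point here is only the shape of the hypothesis class.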

Segment 5 (20:00 - 25:00)

Okay, so this is the idea behind function approximation, and we'll see some examples. Now recall the machine learning lecture. What do we have to figure out? What are the possible functions: that's the hypothesis space, or hypothesis class. How do we know whether a particular Q_theta is good or not, in the context of reinforcement learning? And what algorithm do you use to reduce the loss? This should all seem familiar, and essentially we'll be able to incorporate many of the ideas from machine learning into the RL context. Take everything we learned from RL and machine learning, and we're just going to put them together now. Okay. So what possible functions can you choose here? s and a can be sentences, they could be images. Typically what we're going to do is map this into a vector of features, because machine learning likes taking vectors and matrices and tensors, and then we can use a linear function, simply a dot product between some parameter vector theta and a feature vector f(s, a), or we can use a multi-layer perceptron, which we saw before: we take this vector, multiply by a matrix, apply some nonlinearity, then multiply by another vector or matrix, and keep going. I'm not going to talk too much about the choices here; just think about this as a place where you drop in your favorite model architecture. Okay, so let's work an example. Define the same environment as before, with this MDP, and now we're going to define the same exploration policy and do parameterized Q-learning. For simplicity, we're going to assume that the state-action pair just maps to indicator features: one feature for every state-action pair. This will allow us to use machine learning to simulate the tabular setting.
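The indicator features just described can be sketched directly: one one-hot feature per (state, action) pair, so a linear model over them reproduces the tabular setting. The constants and names (`NUM_STATES`, `ACTIONS`, `feature_vector`) mirror the lecture's description but are assumptions, not its actual code.

```python
NUM_STATES = 6
ACTIONS = ["walk", "tram"]
NUM_FEATURES = NUM_STATES * len(ACTIONS)

def feature_vector(state, action):
    # Map (state, action) to an index in 0 .. NUM_FEATURES - 1: states are
    # numbered 1..6, so shift to 0..5, scale by the number of actions, and add
    # the integer representation of the action.
    index = (state - 1) * len(ACTIONS) + ACTIONS.index(action)
    # Return the corresponding one-hot vector.
    one_hot = [0.0] * NUM_FEATURES
    one_hot[index] = 1.0
    return one_hot

print(feature_vector(1, "walk"))  # the 1.0 sits in position 0
print(feature_vector(6, "tram"))  # the 1.0 sits in position 11, the last slot
```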
So we can just check that everything is working, but hopefully you can use your imagination to drop in other, more powerful machine learning models. Okay. Here in the class I'm going to have my linear model, which maps from the number of features to a single scalar representing the Q value, and I'm going to have my optimizer handy; it's just going to be SGD. Okay, so let's figure out what this feature vector should look like. Conceptually, I'm just going to map the state-action pair to a number between 0 and (number of states times number of actions) minus 1. States are numbered 1 through 6, so I make that 0 through 5, multiply by the number of actions, and add the integer representation of the action. In this case that's 0, and I create a one-hot vector of length number-of-features. So this is going to be the feature representation of (1, walk), (1, tram), (2, walk), (2, tram), and so on, up to (6, tram). Every state-action pair is just a different feature vector. This is all hand-coded; no learning here. All right. Now I'm going to compute Q values of a state-action pair, and to do that, I first map it into feature-vector space and then pass it through the model, and I get some number out. That's how I compute the Q value of a particular state-action pair. And now, given that I know how to compute Q values, I can compute policies

Segment 6 (25:00 - 30:00)

and the policy that I get at, say, state 1 is: compute the Q values for all the actions, and then choose the action that has the highest Q value; that's the action I return for my policy. Okay, I'm showing you essentially how to compute the feature vector, the Q values, and the policy pi given a fixed set of parameters. So now we have the building blocks, and we can implement get-action and incorporate-feedback, which is what every RL algorithm needs to do. What happens in get-action? This is the same as before: I run epsilon-greedy, where with probability epsilon I explore, and with probability 1 minus epsilon I don't. Now let's incorporate the feedback. The environment says I walked to state 2; what should I do with that? First I compute the target. To do that, I look at self.pi of the next state; remember, pi takes the argmax over the Q values, which gives me the best action under my current Q estimates. Then I compute the target to be the immediate reward plus the discounted Q value of the next state and next action; I'm using bootstrapping, so I'm just plugging in that Q value. Okay, so now I have the target. I also have the predicted value: if I'm in this state and take this action, what value do I expect? And I want this prediction to be close to the target. Remember, in regression, what do we do when we want two things to be close to each other? We define the loss to be the standard squared loss, which penalizes deviations between the predicted value and the target. Now that I've defined a loss, I can compute the gradients with respect to that loss (this is just standard PyTorch) and update the parameters. So hopefully you can see now how the machine learning enters the reinforcement learning.
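The incorporate-feedback step above can be sketched without PyTorch so the gradient is explicit. For the squared loss (prediction - target)^2 with prediction = theta . f, the gradient with respect to theta is 2(prediction - target) f, and SGD steps against it; this mirrors what `loss.backward()` plus `optimizer.step()` do in the lecture's PyTorch version. All names here are illustrative.

```python
def incorporate_feedback(theta, f, r, q_next, gamma=1.0, eta=0.1):
    """One Q-learning update for a linear Q with parameters theta.

    f      : feature vector f(s, a) for the transition's state-action pair
    r      : immediate reward
    q_next : bootstrapped estimate of the future, e.g. max_a' Q_theta(s', a'),
             or 0.0 if s' is an end state
    """
    target = r + gamma * q_next                        # bootstrapped target
    prediction = sum(t * x for t, x in zip(theta, f))  # Q_theta(s, a)
    residual = prediction - target
    # Gradient of (prediction - target)^2 is 2 * residual * f; step downhill.
    return [t - eta * 2.0 * residual * x for t, x in zip(theta, f)]

theta = incorporate_feedback([0.0, 0.0], f=[1.0, 0.0], r=-1.0, q_next=0.0, eta=0.5)
print(theta)  # [-1.0, 0.0]: the prediction was 0, the target -1, so theta[0] moves to -1
```

With the one-hot features from before, each update touches exactly one coordinate of theta, which is why this reduces to the tabular update.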
Basically, the skeleton of get-action and incorporate-feedback is shared, and what makes it Q-learning is specifying the target. Given the target, all the loss calculation and model architecture is just within the realm of machine learning. I did this for Q-learning; you could imagine doing this for SARSA or model-free Monte Carlo as well, and it just plugs in. Okay. So now we go back to get-action, and with probability 1 minus epsilon I choose the best action I have so far. Then I incorporate feedback: compute the target, compute the predicted value, form the loss, compute the gradient, and update the parameters. Then I get an action again; this time I'm exploiting. Then I incorporate feedback again. Now I'm at the end, so there's no next state and no next action; the target is just the reward, but that's still a target I can update against: compute the loss, compute the gradient of the loss, and update the parameters. And if you simulate this 100 times, you get some weights and you get some value. Let's look at the optimal policy that has been computed. What I'm doing here is, for every state, calling the RL algorithm's pi function to get the policy it has in store for me, and, well, I guess this time it didn't work out; maybe I got unlucky here. This is stochastic, so it didn't actually end up with the right policy: it should have walked, walked, and then taken the tram, but it didn't do that.

Segment 7 (30:00 - 35:00)

But if I ran it for longer, it should do something reasonable. Okay. What I'm doing here with this policy is also looking at the estimate of the optimal values, which is taking Q of (state, pi(state)), and I get these values. Let's compare them with the true values by solving the MDP. Remember, I have the original MDP, so I can just call value iteration on it. The original MDP is not available to the RL algorithm, so I can't do this inside the RL algorithm, but I can do it outside, and I get the result. If you look at the actual values: these are the true optimal values of the MDP, and this is what the RL algorithm came up with. It's doing something reasonable, but it's somewhat off. And this is the actual optimal policy. Well, I think I got unlucky here, because in the previous run it did find the optimal policy. So, to summarize: we're talking about function approximation. You have large state or action spaces, and you can't use the tabular setting; you can't have a lookup table over all possible images or sentences. So we parameterize the Q function by some linear or nonlinear mapping from state-action pairs to a value. We first map a state-action pair into a vector of features, and then we define either a linear function or an MLP to map that feature vector into a single number. When we do incorporate-feedback, we define the squared loss between the predicted Q value and a target, which is given to us by virtue of doing Q-learning, and then we take the gradient of this loss with respect to theta and take a step in that direction. So the question is: why is the loss function the squared difference between the prediction and the target?
The intuition here, and maybe the easiest way to explain it, is this: suppose your target, as in model-free Monte Carlo, is just the utility that you get from a full rollout. If you did that a million times and took the average, that average would be the value you want. So the target is, in general, a noisy version of what you want. When you form the squared loss, you're doing regression: you're saying, I want the value to be close to the target. When you update the parameters of the model, the loss goes down, and the only way the loss can go down is for the value to move closer to the target. So the question is: does this method converge to the optimal value? In general, no. In the tabular setting there are some conditions under which Q-learning will converge; usually they look like: your exploration policy needs to visit everywhere, and if that's the case, then generally it will be fine. But once you put in function approximation, in general there are no guarantees. You might be able to show convergence to a local optimum, but that's less exciting than showing it finds the optimal policy. So now we're going to move on to a different way of doing RL, and this is the space of policy gradient, or policy-based, methods. If you remember where we started: we started with model-based RL, which means we estimate the MDP first and then use that MDP, via value iteration, to compute the optimal policy. Then we said, wait a minute, we don't need the MDP; why don't we just compute the Q values directly, and then using the Q values we can get the policy by taking the action that gives

Segment 8 (35:00 - 40:00)

us the largest Q value in a given state. These are called value-based methods. But now, if you're paying attention, you might say: wait a minute, we don't even need the Q values; we can just directly estimate the policy. Why don't we just do that? Am I just wasting your time here? The short answer is no, but the long answer will hopefully become clear later. So, the way to think about a policy is that it is just a classifier, right? What is a classifier? It maps inputs to outputs; everything's a classifier in some sense. The input here is a state, and the output is an action. If you remember the machine learning linear classification lecture, we said that these classifiers are actually really hard to optimize, because you end up with the 0-1 loss, whose gradient is zero everywhere; you can't do that. So what did we do? We made probabilistic classifiers. A probabilistic classifier takes a state s and defines a distribution over possible actions, and distributions are continuous, and therefore we can optimize them. Okay, so that's the prelude. Now, let's talk about imitation learning for a little bit. If we had demonstrations of the policy, we could actually just go home. And what is a demonstration? It's basically having an expert that gives you a rollout. So an expert came along and said: okay, here's what you do. You walk to state 2, then 3, then you take the tram to 6. If you had that data, you could turn it into a bunch of examples for your training algorithm: if the input is 1, walk; at 2, walk; at 3, take the tram. Three examples. And then you could define an objective function J(theta), which sums over all the examples and looks at the log probability of an action given the corresponding state.
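The objective J(theta) just described, the sum of log probabilities the policy assigns to the expert's actions, can be sketched with a softmax policy over per-(state, action) scores. The `scores` dict stands in for the policy parameters theta; all names are illustrative.

```python
import math

ACTIONS = ["walk", "tram"]

def pi_theta(scores, state, action):
    # Probabilistic classifier: a softmax distribution over actions given a state.
    exps = {a: math.exp(scores.get((state, a), 0.0)) for a in ACTIONS}
    return exps[action] / sum(exps.values())

def imitation_objective(scores, demonstration):
    # J(theta) = sum over expert (state, action) examples of log pi_theta(a | s).
    return sum(math.log(pi_theta(scores, s, a)) for s, a in demonstration)

# The expert demonstration from the lecture: walk at 1, walk at 2, tram at 3.
demo = [(1, "walk"), (2, "walk"), (3, "tram")]
print(imitation_objective({}, demo))  # with empty scores the policy is uniform: 3 * log(1/2)
```

Training would just be gradient ascent on J(theta), i.e. ordinary supervised learning of a classifier on the expert's (state, action) pairs.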
So this is called imitation learning, because you have an expert that shows you what to do, and you just train the classifier, slash policy, to do whatever the expert did. This is actually done in contexts that would otherwise be RL problems. In robotics, you have people who teleoperate a robot: okay, here's how you pick up the cup and put it on the counter without breaking it. And in math, companies will hire people who know math to go and solve a bunch of math problems and write nice solutions, and those are demonstrations of what an expert would do in a given state. So you don't need RL algorithms for imitation learning; it's just supervised learning. Okay, that was easy. But the problem is: what happens if you don't have demonstrations? Remember the RL framework: the environment is not nice enough to tell you what you should do. The environment is cruel; it'll just give you reward or no reward. So all we have is this reward function. What should we do? Yeah, you can explore. Great. So all you can do, essentially, is roll out your policy and get some data. But you don't want to just roll out and then clone that data; you don't want to do whatever that data tells you. But along with that data, the environment told you how good each outcome was. So a crude thing you might think of is: I'm going to roll out, some of the trajectories are going to be good and some of them bad, and I'm just going to take the good ones and do imitation learning on them. And you wouldn't be too far off from what actually happens. So if you just want to remember the zeroth-order

Segment 9 (40:00 - 45:00)

thing in policy gradient: you basically roll out your policy, collect a bunch of rollouts, take the ones that have high reward, and do imitation learning on them. Okay, in the rest of the lecture I'm going to give you something a bit more nuanced than that, with more math, but the intuition hopefully is clear. All right, so let's get going. There's going to be more math in this section than in previous sections, just so you're aware; put on your math thinking caps. I'm going to use tau to denote a trajectory. I've been using the term rollout; rollout, trajectory, episode, you can think of them interchangeably. A rollout is basically state, action, reward, state, action, reward, state, and so on. A concrete rollout looks like the thing we've been looking at before, with walk and tram, and remember each rollout produces a utility. That's standard. Now, what is the probability of a rollout? What I mean is: what is the probability that this particular sequence happened? By the chain rule, we can write this down as a product of conditional probabilities. It's the probability of starting in s0 (in all our settings that's been deterministic, so this is just 1, assuming the first element of tau is s0), and then, since there is a policy pi_theta, the probability of taking the first action a1 given s0, times the transition probability of going from s0 under a1 into a state s1. And now it's the agent's turn again: the probability of a2 given s1, then the environment's turn: the probability of s2 given s1 and a2, and so on, alternating the agent's turn and the environment's turn. So the probability of a trajectory is this interleaved sequence of probabilities: agent policy, environment transition, agent, environment.
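Written out in the indexing used here (s0, a1, s1, a2, s2, ...), the chain-rule factorization is, as a reconstruction (the symbol T for the MDP's transition distribution is my notation, not the lecture's):

```latex
p_\theta(\tau) \;=\; p(s_0)\, \prod_{t \ge 1} \underbrace{\pi_\theta(a_t \mid s_{t-1})}_{\text{agent's turn}}\; \underbrace{T(s_t \mid s_{t-1}, a_t)}_{\text{environment's turn}}
```

with p(s_0) = 1 when the start state is deterministic, as in the tram example.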
Okay, so the components, just to say it again: there's a policy term and there's a transition-distribution term. Now, if you think about what we're trying to do in reinforcement learning, all we're doing is trying to find a policy that maximizes the expected utility of a rollout, right? So I can just write that down directly. V(theta), where V stands for value and theta, remember, are the parameters of the policy, is the expectation of the utility of tau: the expected utility. And if I write that out, it's an expectation over all possible trajectories: the sum over trajectories of the probability of that trajectory times the utility of that trajectory. So there could not possibly be a more direct statement of what RL should be doing. You're literally saying "expected utility," and we just want to optimize it. No MDPs, no Q-values; this is, in some sense, the most direct thing you can do. Now, how do you optimize this? We've only learned one way to optimize a function: gradient descent. Okay, so let's just take the gradient. Yeah, there's an expectation here, which might be a little bit annoying, but let's do it anyway. So let's take the gradient of V(theta) with respect to theta. I'm just

Segment 10 (45:00 - 50:00)

writing out the definition of V. So far so good. I'm going to expand the definition of the expectation into an explicit sum (let me just make this larger to help the audience in the back). So this is summing over all trajectories: the probability p_theta of the trajectory times the utility of the trajectory. These are just definitions. Next, I'm going to move the gradient inside the sum, because the gradient is a linear operation. Okay, so now I'm going to do something. What is the gradient of p_theta? You're probably thinking, well, it depends on what p_theta is; it's just the gradient of p_theta. But what I'm going to do might look a little strange at first: I'm going to rewrite it as p_theta times the gradient of log p_theta. Where did that come from? Remember that the gradient of log of a quantity is one over that quantity times the gradient of whatever is inside. All I do is move the one-over term to the other side, and that lets me write the gradient of p_theta as p_theta times the gradient of log p_theta. And there's a reason for this madness. Why am I doing it like this? Because now I have a sum over trajectories of p_theta of the trajectory times something, and I can neatly wrap that back into an expectation. Whenever you see an expectation like this, you should think of it as a sum over trajectories weighted by the probability of each trajectory. So now I have the gradient of the thing above, which, you know, wasn't obvious how to differentiate.
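Written out, the derivation he just walked through is, using $u(\tau)$ for the utility of a trajectory:

```latex
\nabla_\theta V(\theta)
  = \nabla_\theta \sum_\tau p_\theta(\tau)\, u(\tau)
  = \sum_\tau p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, u(\tau)
  = \mathbb{E}_{\tau \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, u(\tau) \right],
```

using $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$. And since $p_\theta(\tau) = \prod_t \pi_\theta(a_t \mid s_{t-1})\, T(s_t \mid s_{t-1}, a_t)$, the transition terms do not depend on $\theta$ and drop out of the gradient, leaving

```latex
\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_{t-1}).
```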
Now I've written it as this expectation of the gradient of log p_theta times the utility, and this identity is called the policy gradient theorem, though it's sort of elementary, so I'd just call it an identity; it doesn't really deserve to be called a theorem. This is actually good, because in general, whenever you see an expectation you want to compute and can't, because it requires summing over all possible trajectories, you can think: maybe I can just take a random sample. So that's what we're going to do here. We're going to replace this expectation with a sample. Instead of summing over all possible trajectories and weighting by the probability of each one, I'm just going to sample a trajectory. Then I define an objective function J, which is just the thing on the inside, log p_theta(tau) times the utility u(tau), and I update theta by taking a step in the direction of this gradient. Okay, let me just pause and summarize. We started with the indisputable goal of maximizing expected utility. I said, okay, right now we only know how to optimize things by taking the gradient, so let's take the gradient. The gradient is itself an expectation of stuff, and I said you can just take a sample to approximate the full expectation

Segment 11 (50:00 - 55:00)

and now we're going to define this objective function whose gradient is exactly the thing on the inside. So, just to foreshadow some stuff: what this is basically saying is, go ahead, explore, do your rollouts. Once you get a rollout, I look at this log-probability term, weight it by the utility of that rollout, and update. If I didn't have the utility, I would just be doing imitation learning: just maximizing the probability of the trajectories I saw. But with the utility, the update is sensitive to how good the trajectory is. Okay, so I think something got a bit messed up in the formatting here, but if you unpack this gradient term, it's the utility times a sum over all time steps of the gradient of the log probability of each action given the preceding state. This is known as the REINFORCE algorithm, which comes from Williams, 1992. It's an instance of this general class of policy gradient algorithms, but REINFORCE is perhaps the earliest and simplest one. And like I said before, the intuition is that we're essentially performing imitation learning on demonstrations, where the demonstrations come from our own policy, but they're weighted by the utility. If the utility were just zero or one, basically success or failure, then this is just imitation learning on the successful demonstrations. So the intuition I mentioned earlier, which is to do a bunch of rollouts of your policy, take the good ones, and do imitation learning, is exactly what REINFORCE does in the special case where all the utilities are zero or one. Now, if they're not zero and one, you have a little more nuance: if the utility is 0.5, then it's saying I put a weight of 0.
5 on that example trajectory, and if the utility is minus 10, then I'm running away from it; I'm saying, don't do that. So what is this utility, where is it defined? The utility of a rollout, if you remember, is the sum of the discounted rewards: for a rollout it's r1 plus the discount gamma times r2 plus gamma squared times r3, and so on. Hopefully this is pretty intuitive. Policy gradient, just to summarize: optimize the policy directly to maximize expected utility. Due to the policy gradient theorem, or identity, we can compute an unbiased estimate of the gradient by sampling a single trajectory and then updating on it. The algorithm is: sample a rollout, which means you're on-policy, and then update on each state-action pair in the rollout, weighted by the utility. Okay, so now that we've gone through the math, let me try to implement it. We start with the same MDP, and now I have REINFORCE. So for REINFORCE I now have a model that is going to classify, that is, map, from states to actions. It's a probabilistic classifier over the number of actions, which is why the parameter is a matrix now: I'm multiclass. And then I have the optimizer as before. Like in model-free Monte Carlo, we're going to keep track of the current rollout as we incorporate feedback, because in the algorithm as I defined it, you have to wait until you get the utility. Notice that we don't need an explicit exploration policy, because

Segment 12 (55:00 - 60:00)

we're defining these probabilistic policies now, and that should give us some exploration. You could also add additional exploration if you want, but we don't absolutely need to here. Okay, so let's peek at what get_action and incorporate_feedback do. In get_action, I compute a feature representation of the state, which is going to be a one-hot vector representing the state. I pass this feature vector through my linear matrix, which gives me some logits; then I take the softmax, which gives me a probability distribution over the actions; and I sample according to those probabilities and return the action. So all get_action is doing is what you would do in classification, where you compute the logits and get the distribution, except now I'm sampling from that distribution. Now, what does incorporate_feedback do? Like in model-free Monte Carlo, we're just going to remember the rollout: remember the start state, keep track of the utility, add to the rollout. If I'm not at the end, I don't do any updates; it's get_action, incorporate_feedback, get_action, and so on. The last time I call incorporate_feedback, I'm at the end, so here is my rollout, and now I want to update the parameters. I keep track of the loss and go through each step of the rollout; each step goes from some state to a particular action. The first time, it will be state 1 taking walk. Remember, the state is the input and the action is the output of the classifier. So I'm going to compute the model's prediction of the action given the state.
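Before walking through the loss computation, here is a minimal NumPy sketch of what get_action and the end-of-rollout update are doing. This is my reconstruction under hypothetical names and sizes, not the lecture's actual code, and it includes the utility weighting:

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES = 7                         # states 1..6, index 0 unused (hypothetical)
N_ACTIONS = 2                        # 0 = walk, 1 = tram
W = np.zeros((N_ACTIONS, N_STATES))  # linear policy parameters theta (a matrix: multiclass)
LR, GAMMA = 0.1, 1.0

def phi(s):
    """One-hot feature vector for state s."""
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

def action_probs(s):
    """Softmax over the logits W @ phi(s): a distribution over actions."""
    logits = W @ phi(s)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def get_action(s):
    """Sample an action from the current policy pi_theta(. | s)."""
    return rng.choice(N_ACTIONS, p=action_probs(s))

def reinforce_update(rollout):
    """One REINFORCE step on a finished rollout [(s, a, r), ...]:
    ascend u(tau) * sum_t grad log pi_theta(a_t | s_t)."""
    global W
    u = sum((GAMMA ** t) * r for t, (_, _, r) in enumerate(rollout))
    grad = np.zeros_like(W)
    for (s, a, _) in rollout:
        onehot = np.zeros(N_ACTIONS)
        onehot[a] = 1.0
        # for a softmax-linear model: grad log pi(a|s) = (onehot(a) - probs) phi(s)^T
        grad += np.outer(onehot - action_probs(s), phi(s))
    W += LR * u * grad               # the utility-weighted imitation step

# A rollout with positive utility makes its own actions more likely:
p_before = action_probs(1)[0]
reinforce_update([(1, 0, 0.0), (2, 0, 20.0)])   # walked twice, got reward 20
assert action_probs(1)[0] > p_before
```

With the sign as written this is gradient ascent on the objective; equivalently you can minimize utility-weighted cross-entropy, which is the formulation in the lecture.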
So: compute the feature vector, compute the logits, and, remember, in multiclass classification I compute the cross entropy between the logits and the target, which is just a one-hot encoding of the action, in this case walk, and I add that loss. I want these logits to match the target. The format is the same for each step: you go through the second step and add its cross-entropy term, then the third step and add its cross-entropy term. Okay, I should be multiplying the loss by the utility. So now you update your parameters: compute the gradients of this loss and take a step, and those are the updated parameters. Let's see what happens if I run it. So I get some weights, some value. It seems to have worked, even though I messed up the algorithm. Interesting. So it learned a distribution where in state 1 I should walk, state 2 walk, state 3 take the tram, state 4 walk, state 5 walk; state 6 is the end state, so those values are just arbitrary. In summary, get_action samples from the policy, and incorporate_feedback updates the parameters on each state-action pair in the rollout, weighting by the utility (which I didn't do, but you should). Yeah, this is online. RL algorithms are generally online, unless it's an offline RL algorithm, because whenever

Segment 13 (60:00 - 65:00)

you're doing this, it's online: you get an action, you incorporate the feedback, get an action, incorporate feedback, and so on. Here it's just one rollout and I'm already updating the parameters. Sometimes you might want to batch, that is, collect a bunch of rollouts and batch them up, so it looks a little more offline, but the spirit is online. The question is: does the initialization of theta matter? In general it does. Even in the standard supervised learning case, if you're training a nonlinear neural network, initialization will affect the outcome. I think in reinforcement learning it matters even more, because as currently implemented, the initial exploration policy is given by pi_theta, so the data you collect is also governed by the goodness of your initial policy. In general you want the thetas to be small, so that you start with a roughly uniform distribution over actions. Okay, so let me talk through a bunch of bells and whistles. Recall the objective: maximize expected utility. We can just write it down, and the policy gradient theorem says: here is the gradient. We can't compute the gradient exactly, because that would involve summing over all possible trajectories. What we've done with REINFORCE is get an estimate of that gradient by taking one sample and computing the gradient on that one sample. But can you get better estimates? To talk about what "better" means, I'm going to take a quick detour through variance reduction. So consider a simple setting where you're just trying to estimate the mean of a set of numbers. i is the index into the list, and every number has a particular probability. There are four numbers (okay, this should be 0.25, so the probabilities sum to one), and the points are minus 4, minus 6, and 8.
And the true mean is basically the dot product between these two vectors; we'll just pretend the probabilities sum to one for now, I'll fix that later. So, to be formal, an estimator is any random variable that tries to get close to mu. Remember, mu is the thing that's expensive to calculate. Here it's easy, because you sum over four things, but in general you're taking the mean over all possible trajectories, which is hard. Each estimator has some bias and variance, and maybe also some computational cost. The simplest estimator of mu: sample a single index and return the point at that position. So every time you compute the estimator, you might get a different value. How do we know if an estimator is good? We compute its bias and variance. So let's just try it 1,000 times: call the estimator function and look at the empirical mean and variance. Here, for this particular estimator, the mean is 1, which is what it should be (so I guess everything normalizes; just imagine that probability is 0.25). And the variance is 36, which is not great if you think about it. Normally, if you're doing stochastic gradient descent, high variance means it's just going to take more iterations to get to the right answer. Okay, so here's another estimator: why don't we just sample two points and average them? If you look at that estimator, the mean is roughly the

Segment 14 (65:00 - 70:00)

same, but the variance is a lot smaller. So in some sense this is a better estimator of the mean. You can also make it worse: say you choose a point and then just add some random noise to it. If you do that, the mean will still be roughly right (it's unbiased), but the variance gets even worse than the original. Okay, so now comes the fun part. Suppose I have this magic function offset(i), and it has mean zero. That means if I subtract the offset, then by linearity of expectation the estimator has the same mean, because the offset's contribution is zero. So let's suppose I just define this offset to particular values. I sample a particular index, and instead of returning just that point, I subtract the offset and return that. If I do this, you'll notice that the mean is still good and the variance suddenly goes down a lot. So this is magic, right? I managed to reduce the variance by a huge amount without affecting the mean. It's easy to make the variance go down if you don't care about the mean: just return zero, but then you get the wrong answer. Here, as long as I pick a function whose mean is zero, I can add it to my estimator and hope to reduce the variance. In statistics this is called a control variate, a general idea that's quite powerful. Okay, so going back to the reinforcement learning setting: how do we find an offset that has zero mean and hopefully reduces the variance? And just to provide some intuition for why this works: the mean is 1, right? So the points over here are too far down, and the points over here are too far up.
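The whole detour fits in a short simulation. The numbers here are hypothetical (his exact points and offset values differ), but the structure matches: three estimators, all unbiased, with very different variances:

```python
import numpy as np

rng = np.random.default_rng(0)
points = np.array([-4.0, -6.0, 6.0, 8.0])   # hypothetical points, true mean = 1.0
probs = np.full(4, 0.25)

def sample_one():
    """Estimator 1: return a single sampled point."""
    return points[rng.choice(4, p=probs)]

def sample_two():
    """Estimator 2: average two independent samples (halves the variance)."""
    return 0.5 * (sample_one() + sample_one())

# A zero-mean offset function of the index: pulls down points above the mean
# and pulls up points below it. Its expectation is exactly zero.
offsets = 0.9 * (points - points.mean())

def sample_cv():
    """Estimator 3: subtract the zero-mean offset (a control variate)."""
    i = rng.choice(4, p=probs)
    return points[i] - offsets[i]

for f in (sample_one, sample_two, sample_cv):
    xs = np.array([f() for _ in range(10000)])
    print(f.__name__, round(xs.mean(), 2), round(xs.var(), 2))
# All three empirical means are close to 1.0; the variance drops sharply
# from sample_one to sample_two to sample_cv.
```

Note that `offsets` uses prior knowledge of where the mean is; the closer the offset function tracks the deviation of each point from the mean, the more variance it removes.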
So what the offset is doing is pulling up the points that are too far down and pulling down the points that are too far up. If I have some prior knowledge about where the mean might be relative to my points, I can hope to reduce the variance. So here's the key identity. I'm going to define a function B(s); it can be any function that depends only on the state, not the action. Then I claim that this quantity is zero: the expectation of the gradient of the log policy times this baseline, and this holds for all s, with the action a drawn from the policy. I haven't proven this, but I want you to stare at it for a moment. Why is this helpful? The policy gradient theorem gives us this gradient expression, right? So if this baseline term is zero in expectation, I can add or subtract it without affecting the mean, while having a chance of reducing the variance. If you remember A* search, there was this idea that heuristics use domain knowledge to improve the algorithm: if I can estimate the future cost, I get a very fast algorithm. This is analogous: if I can figure out the right baseline (this B is going to be called the baseline function), I can vastly reduce the variance as well. Okay, so the proof is fairly straightforward. It goes like this: let's take the expectation involving B(s), which is some constant

Segment 15 (70:00 - 73:00)

okay, it doesn't depend on theta in particular, and now take the gradient with respect to theta of this quantity; the gradient of a constant is zero. In the interest of time I'm just going to skip the rest; trust me that it works out. It's basically the same math as the policy gradient theorem, just run in the other direction. So using this idea, there are a few things I want to highlight, though I won't have time to go into them in detail. The first is using baselines: I can subtract a baseline from the utility, and as long as that function doesn't depend on the action, I'm free to do whatever I want (it can be a model, it can be some heuristic), and that can reduce the variance. Another thing I can do: instead of using the full utility at each step, I can use only the rewards from that point on, and this is valid because the past doesn't affect the future. And finally, I can plug in a biased estimate of the value function. So I'm going to bring back Qs: I'm doing policy gradient, but instead of using the actual utility, which is what model-free Monte Carlo did, I bring back the Q-values, like in the bootstrapping methods. These methods are called actor-critic methods, because they estimate a value and a policy at the same time; in deep learning you can even tie the parameters together. To summarize: the game here is to find an estimate of the gradient of the expected utility that has low bias and low variance. Using baselines is a pretty general-purpose strategy to reduce the variance while staying unbiased, which is good. The bias is zero.
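The step he skips is short. For any fixed state $s$ and any baseline $B(s)$ that does not depend on the action:

```latex
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, B(s)\big]
  = B(s) \sum_a \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
  = B(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
  = B(s)\, \nabla_\theta\, 1
  = 0,
```

which is the same log-derivative trick as the policy gradient theorem, run in reverse: the policy's probabilities always sum to one, and the gradient of a constant is zero. So subtracting $B(s)$ from the utility leaves the gradient estimate unbiased.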
And you can also use bootstrapping to reduce the variance at the cost of some bias. Finally, let me summarize this whole unit on MDPs and RL. We looked at three classes of methods for doing reinforcement learning. Model-based: estimate the MDP, then compute the optimal policy. Value-based: directly estimate the Q-values and derive the optimal policy. Policy-based: directly estimate the policy. The form of the algorithm is similar in all three cases: you roll out with some exploration policy, you form some sort of loss function (in terms of the parameters of the MDP, the Q-values, or the policy), and you update the parameters with a gradient step. Okay, so that concludes MDPs and RL, where we were trying to maximize the expected utility. Next we'll look at what happens when there are adversaries in the environment, and you can't really use an expectation anymore: you have to consider what happens in the worst case.
