Stanford CS221 | Autumn 2025 | Lecture 8: Reinforcement Learning


Segment 1 (00:00 - 05:00)

Okay, let's get started. Last time we started talking about Markov decision processes; this time we're going to talk about reinforcement learning. But first, let's review what a Markov decision process is. When you define an MDP, you specify a start state; successors, which specify the actions, probabilities, rewards, and next states; an isEnd test; and a discount factor. We looked at this flaky tram MDP where the goal is to go from state 1 to state 10, and at each point you can either walk from i to i+1, or take a tram from i to 2i, except that 40% of the time the tram fails and you end up in the same state. So the start state is 1. The successors of a particular state are a set of entries: if the state is 1, you can walk, and with probability 1 you go to state 2 with a reward of minus 1. Remember, reward is just negative cost, so that's a cost of 1. And you can take the tram: with probability 0.6 you go to state 2, and with probability 0.4 you stay in state 1, and in both cases it costs 2, i.e., a reward of minus 2. And isEnd is a test that, given a state, determines whether you're at the end and the game is over. MDPs are best visualized as a graph where nodes are states and edges are actions or transitions. There are two types of nodes: state nodes and chance nodes. The agent makes decisions at the state nodes, for example to walk or to tram, and from the chance nodes nature decides what happens. So if you took the tram, then with probability 0.4 you go back to state 1 and with probability 0.6 you go to state 2, and when you transition you pick up your reward. So this is a graph, and our goal is to figure out how to go from 1 all the way to the end, 10.
We also defined what a solution to an MDP is, which is a policy. A policy maps a state to an action. In search problems we could just have a sequence of actions as a solution, but in MDPs there's randomness, so you have to know what to do in each state. We looked at one particular policy, which is to take the tram whenever possible; call that the tram-if-possible policy. When you run a policy on an MDP, you can generate a rollout, which is a sequence of steps where you use the policy to determine what actions to take, and you choose each successor from the MDP based on its probabilities. We also looked at the utility of a rollout, which is the discounted sum of rewards. If the discount is 1, the utility is just the sum of the rewards; if the discount is 0.5, then you weight each successive step by 0.5, 0.5 squared, and so on. The value of a policy is its expected utility, meaning that if you did an infinite number of rollouts, the value is what the average utility would be. We looked at an algorithm called policy evaluation, which computes the value of a given policy. Policy evaluation defines a recurrence, kind of like dynamic programming, where you compute the value of each state as a function of its successor states. And then we also looked at value iteration where, if you just insert a max in there, you can compute the value of the optimal policy, and you can also extract the optimal policy, which happens to be: walk all the way to state 5 and then take the tram. Okay, I'll pause there. Hopefully
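The flaky tram MDP described above can be sketched in Python roughly as follows. This is a minimal sketch under my own naming assumptions (`TramMDP`, `succ_prob_reward`, etc.), not necessarily the lecture's exact code:

```python
class TramMDP:
    """Sketch of the flaky tram MDP. States are 1..n; walk goes
    i -> i+1 (reward -1); tram tries i -> 2i but fails (stays at i)
    with probability 0.4 (reward -2 either way)."""
    def __init__(self, n=10):
        self.n = n
        self.discount = 1.0

    def start_state(self):
        return 1

    def actions(self, state):
        # tram is only valid when it would not overshoot state n
        acts = ["walk"]
        if 2 * state <= self.n:
            acts.append("tram")
        return acts

    def succ_prob_reward(self, state, action):
        # returns (next_state, probability, reward) triples
        if action == "walk":
            return [(state + 1, 1.0, -1)]
        return [(2 * state, 0.6, -2), (state, 0.4, -2)]

    def is_end(self, state):
        return state == self.n
```

A rollout is then generated by repeatedly calling a policy for an action and sampling a successor from `succ_prob_reward` until `is_end` is true.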

Segment 2 (05:00 - 10:00)

that should all be review. — Yeah, what's the partial in the policy? — So the question is, why do we have this partial? This is just a syntactic trick. The tram-if-possible policy takes two arguments, an MDP and a state, so this is a partial application of that function, where I've called it with only one argument. The resulting object, policy, is still a function that takes the second argument, the state. So it allows you to do partial application. Okay, any questions about MDPs? So I'm going to give you another way to think about this. If you like code, the previous view is good; if you like math, then hopefully this part will help as well. The way to think about where value iteration and policy evaluation come from is the recurrences they're computing. In particular, the policy evaluation recurrence can be visualized as follows. The idea is that you're at a state s: what is the value of that state under the policy pi? Remember, the value at a state is the expected utility you get if you start at s and follow policy pi. This can be expressed simply: you follow policy pi, you call pi(s), that gives you an action, and that takes us to the chance node, which is Q^pi(s, pi(s)). So this node is (s, pi(s)), a state and a particular action. And from there, what is the Q-value? The Q-value of a state and an action in general is the value of taking action a in state s and then continuing to follow the policy pi. This can be expressed as: consider all possible successor states s'; I look at the probability of going to that successor state s', and use it to weight the reward I get from making the transition, plus the discount factor gamma times the value under pi of that state s'. So hopefully this visual helps you understand the recurrences.
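The policy evaluation recurrence just described, V^pi(s) = Q^pi(s, pi(s)) with Q^pi(s, a) = sum over s' of T(s, a, s') (R(s, a, s') + gamma V^pi(s')), can be sketched as iterated updates on the tram MDP. The dynamics are inlined here as standalone functions for self-containment; the names are mine:

```python
N, DISCOUNT = 10, 1.0  # hypothetical constants matching the tram example

def succ_prob_reward(s, a):
    # flaky tram dynamics: walk i -> i+1 (reward -1);
    # tram i -> 2i with prob 0.6, stays put with prob 0.4 (reward -2)
    if a == "walk":
        return [(s + 1, 1.0, -1)]
    return [(2 * s, 0.6, -2), (s, 0.4, -2)]

def policy_evaluation(policy, iterations=100):
    # V(s) = Q(s, pi(s)); Q(s, a) = sum_{s'} T(s,a,s') * (R + gamma * V(s'))
    V = {s: 0.0 for s in range(1, N + 1)}
    for _ in range(iterations):
        V = {s: 0.0 if s == N else
                sum(p * (r + DISCOUNT * V[sp])
                    for sp, p, r in succ_prob_reward(s, policy(s)))
             for s in range(1, N + 1)}
    return V

# "always walk" takes nine -1 steps from state 1, so its value there is -9
V = policy_evaluation(lambda s: "walk")
```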
If you want to calculate the value here, you take an action, and then to understand the value there you consider all possible transitions. Okay, let me pause to see if there are any questions about the recurrence for policy evaluation. Value iteration is very similar. I'm using "opt" and "star" kind of interchangeably here. Value iteration, remember, computes the optimal value V*(s), which is the value of following the optimal policy from state s. So if I ask what the optimal value at state s is: well, I want to choose the best action, so I take a max over a of Q*(s, a). Instead of following a particular policy pi, I take the max over the possible actions, and then this part is the same as before, except I have Q* instead of Q^pi. So what happens when you're at a particular chance node? You consider all possible successors, weighted by the transition probability, of the reward plus the discounted future value. All right. So then for the optimal policy, once you have Q*, you're basically done: you take the action with the highest Q-value. So pi*(s) is just the argmax over a of Q

Segment 3 (10:00 - 15:00)

*(s, a). So you're basically asking, what is the optimal policy from here? You look at which of the successors gives you the most value, and that's the action you choose. Okay, questions about these recurrences? One way to think about them is that once you have a very crisp understanding of what a quantity is, you just write it in terms of other quantities. And then policy evaluation and value iteration are essentially writing these recurrences down in code and iterating, as we did last time. Okay. So now we're going to do something new: can we find the optimal policy if we don't know the MDP? If we know the MDP, we can run value iteration and compute the optimal policy. But what if we don't know it? That is going to be reinforcement learning. Reinforcement learning, in some sense, is MDPs where you don't know the MDP; hopefully that'll make more sense in a second. So here's the general reinforcement learning setting: you have an agent and you have an environment. The agent takes actions, and the environment gives the agent rewards and observations, and you iterate: agent produces an action, environment produces a reward and an observation, agent produces another action, and so on. It's very simple, which is one of the beautiful things about reinforcement learning. And intuitively, what is happening? You might think about how to design an agent that does well. A good agent should try various actions to find the good ones, and then it should learn to keep on doing those actions; in other words, those actions are reinforced. So the word "reinforcement" shows up in the way I've described what an agent should be trying to do. An important aspect of reinforcement learning is that in the beginning the agent doesn't know anything.
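The value iteration recurrence and the argmax extraction of pi* can be sketched together; again this is a minimal self-contained sketch with my own names, inlining the tram dynamics:

```python
N, DISCOUNT = 10, 1.0  # hypothetical constants matching the tram example

def actions(s):
    # tram is only valid when it would not overshoot state N
    return ["walk", "tram"] if 2 * s <= N else ["walk"]

def succ_prob_reward(s, a):
    # walk i -> i+1 (reward -1); tram i -> 2i w.p. 0.6, stays w.p. 0.4 (reward -2)
    if a == "walk":
        return [(s + 1, 1.0, -1)]
    return [(2 * s, 0.6, -2), (s, 0.4, -2)]

def value_iteration(iterations=100):
    # V*(s) = max_a Q*(s, a); Q*(s, a) = sum_{s'} T(s,a,s') * (R + gamma * V*(s'))
    V = {s: 0.0 for s in range(1, N + 1)}
    for _ in range(iterations):
        new_V = {}
        for s in range(1, N + 1):
            if s == N:
                new_V[s] = 0.0
            else:
                new_V[s] = max(
                    sum(p * (r + DISCOUNT * V[sp])
                        for sp, p, r in succ_prob_reward(s, a))
                    for a in actions(s))
        V = new_V
    # extract pi*(s) = argmax_a Q*(s, a)
    pi = {s: max(actions(s),
                 key=lambda a: sum(p * (r + DISCOUNT * V[sp])
                                   for sp, p, r in succ_prob_reward(s, a)))
          for s in range(1, N)}
    return V, pi

V, pi = value_iteration()
```

Running this recovers the policy the lecture mentions: walk up to state 5, then take the tram.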
So the agent really has to try different actions, and along the way hopefully the agent will learn what's good and what's bad, as well as where actions take it. One thing I like to think is that reinforcement learning is really a metaphor for life. In MDPs it's already kind of tricky, because you don't know what the outcomes are going to be, but at least you know their probabilities. For example, you don't know whether a die roll is going to come up three or four or five, but, assuming it's a fair six-sided die, you know each outcome has probability 1/6. In reinforcement learning, we don't even know what the probabilities are. But that's kind of real life, right? You take some action, but you don't even know the chance of success of a particular action, so you sometimes just have to try it out and then learn from your mistakes. Okay. In this lecture we're going to specialize a little and assume that the environment is defined by a Markov decision process, and that the observation is the state you get when you take an action in the MDP. In real life you don't get the full state of the world; you only observe part of the state, and that falls into the category of partially observed MDPs, or POMDPs, which we're not going to talk about. I would say that reinforcement learning is actually fairly general: maybe not these exact algorithms, but the algorithms I'll talk about next time can be generalized to POMDPs,

Segment 4 (15:00 - 20:00)

but we're going to stick with MDPs because they're simpler to think about. Okay, so let's define an example and see how RL works. Again, we have the flaky tram MDP; that's going to be our environment, and we're going to define an agent, which I'll use interchangeably with "RL algorithm." This agent isn't to be confused with the language-model agents you hear about all the time; this is an RL agent. So let me define the simplest agent possible. You have some policy, such as take-the-tram-if-possible, and I'm going to define this static agent, which just takes a policy. The agent just remembers the policy, and there are two things you can do with it: you can call get_action on it, and you can call incorporate_feedback on it. That corresponds to these two arrows here: get_action gets an action, and incorporate_feedback lets the agent consume the reward and observations. So what does get_action do in this static case? It just calls its internal policy. And incorporate_feedback? Here's an example of feedback that the agent might get: I was in state 1, I walked, I got reward minus 1, I ended up in state 2, and I'm not at the end. That's a little fragment of data, an observation, that the environment sends back to the agent, and the agent can do whatever it wants with it. In this case it's a simple static agent, so it doesn't do anything. So, given an agent, in this case the static agent, and the environment, I can simulate by generating rollouts. We already did this a few times in this class; last lecture we used rollouts to estimate the value of a policy, and we're going to do that here.
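The static agent interface described above can be sketched as follows; the class name and the tram-if-possible lambda are my own illustration of the two-method interface, not necessarily the lecture's exact code:

```python
class StaticAgent:
    """Simplest possible agent: wraps a fixed policy and ignores
    all feedback."""
    def __init__(self, policy):
        self.policy = policy  # just remember the policy

    def get_action(self, state):
        # defer entirely to the internal policy
        return self.policy(state)

    def incorporate_feedback(self, state, action, reward, next_state, is_end):
        pass  # a static agent never learns

# tram-if-possible on the 10-state tram MDP
agent = StaticAgent(lambda s: "tram" if 2 * s <= 10 else "walk")
```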
This function simulate is going to take a number of trials; I'll just run multiple trials and take the average. The environment state is the state of the MDP, and while I haven't reached the end, I ask the agent for an action. The agent says tram; okay, great. Then I ask the environment to sample a particular successor state, given the state and the action, and now I have a piece of feedback that I'm going to give back to the RL agent: I was in a particular state, I took action tram, I got this reward, I ended up in this new state, and isEnd is false here. I'm also going to keep track of all these steps. So this goes round and round; we saw this last time. One rollout gives me tram, tram, tram, tram, walk, and so on, and that gives me a utility of minus 10.
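The simulate loop just described can be sketched like this. The tram environment's dynamics are inlined rather than calling an MDP object, and `WalkAgent` is a hypothetical stand-in agent, both my own assumptions for self-containment:

```python
import random

class WalkAgent:
    """Hypothetical stand-in for a static agent: always walks,
    ignores feedback."""
    def get_action(self, state):
        return "walk"
    def incorporate_feedback(self, state, action, reward, next_state, is_end):
        pass

def simulate(agent, num_trials=10, seed=0):
    """Roll out the agent in the flaky tram environment (states 1..10)
    and average the utilities over the trials."""
    rng = random.Random(seed)
    utilities = []
    for _ in range(num_trials):
        state, utility = 1, 0
        while state != 10:
            action = agent.get_action(state)
            if action == "walk":
                next_state, reward = state + 1, -1
            else:  # tram: succeeds w.p. 0.6, otherwise stays put
                next_state = 2 * state if rng.random() < 0.6 else state
                reward = -2
            # feed the (s, a, r, s', isEnd) fragment back to the agent
            agent.incorporate_feedback(state, action, reward,
                                       next_state, next_state == 10)
            utility += reward
            state = next_state
        utilities.append(utility)
    return sum(utilities) / len(utilities)
```

Always walking takes nine -1 steps per trial, so every trial's utility is -9.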

Segment 5 (20:00 - 25:00)

And if I do this over multiple trials, then each trial gives me a different utility. This should be very familiar; it's exactly what we did in policy evaluation. In some sense this piece of code, simulate, defines the rules of the game. Okay, so the simulation is going to yield some estimated value of the expected utility. Notice again that the agent doesn't do anything with the feedback. So the main difference between a policy, which we talked about last time, and an agent, which we're talking about today, is that the policy is static: it just maps a state to an action and doesn't change over time. Whereas an agent, in other words an RL algorithm, also takes states and maps them to actions via the get_action function, but it can change over time; it's dynamic. In this case this particular agent didn't change over time, but we call incorporate_feedback and the agent could do something with it. Now, what we'd really like is for the RL algorithm to use that feedback to improve its own internal policy. That's the whole point of reinforcement learning. So how should we do this update? That's the question. Okay, I'll stop there. Any questions about the reinforcement learning setup, what an agent is? — Yeah. For this current setup, we're using only the negative of the cost as the reward. But don't we have to set up a reward for reaching the end goal? — So the question is about the cost being the negative reward here. You can think either in terms of costs or rewards; all the algorithms, all the math, will be the same. You just have to negate it. — And we currently just define the reward for each action, but don't we have to define the reward for reaching the end goal? — So do you need a reward for the goal?
You do, because reaching the end goal via an action is just another transition. For example, here, reaching the end goal has some reward; it's not really special. Every transition in the MDP has some reward. You could say there are no rewards anywhere except at the very end; some problems are like that, and those are called sparse-reward problems. But in general you can have reward at any point in time. — Yeah. Can I just understand the agent as changing the policy? — So, can you understand the agent as changing the policy dynamically? I think that's a great way to think about it. Maybe I'll draw a picture here. Here is the agent and here is the environment, and you can think of there being some policy inside the agent; we'll talk about what's inside the agent later. So far we've seen this picture where there's some policy, and the logic of the agent in general is: when you call get_action, it goes and asks the policy for the action, and when it gets feedback, it does some stuff to update the policy. Hopefully this will become clear once I actually describe some RL algorithms that don't just do nothing. Okay, so I'm going to talk about four different RL algorithms: model-based value iteration, model-free Monte Carlo, SARSA, and Q-learning. All of them are very similar in some sense, but there are some important differences, which I'll try to motivate along the way. Okay. So the first thing: if you have

Segment 6 (25:00 - 30:00)

already studied MDPs, like we have, then this model-based approach is the natural one. What makes RL hard is that we don't know the MDP; otherwise, if we did know the MDP, we could just use value iteration to compute the optimal policy. So here's an idea: why don't we just learn the MDP, or estimate the MDP, from the feedback that we have? The feedback tells us: I was in a state, I took an action, I got some reward, I ended up in a different state. That information should allow us to get an idea of what the MDP is. This approach is called model-based value iteration, and there are three stages. One is an exploration stage, where we use some exploration policy, which might be as simple as choosing random actions, and use that to estimate the MDP. Then I compute the optimal policy of the estimated MDP using value iteration; once I have an MDP, I just go back to last lecture, and that's what I do. And finally, I follow the policy that I just learned. So there's an exploration phase and then an exploitation phase. Okay, so let's define this MDP again. What should the exploration policy do? I'm going to call this the walk-tram policy, and it's going to be a policy that just samples a valid action. For example, from state 1, if I can take the tram, then I choose either walk or tram with equal probability; sometimes I get tram, sometimes I get walk. From state 6 I can't take the tram, because six times two would go over 10, so I can only walk. The reason I'm using this as an exploration policy is that I want to explore; this is the exploration phase, so I want to try out all the actions. Okay, so now I'm going to define the agent.
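The walk-tram exploration policy just described can be sketched as sampling uniformly over the valid actions; the function name and signature are my own:

```python
import random

def walk_tram_policy(state, n=10, rng=random):
    """Exploration policy: choose uniformly among the valid actions
    (tram is only valid when 2*state <= n)."""
    actions = ["walk"]
    if 2 * state <= n:
        actions.append("tram")
    return rng.choice(actions)
```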
This is an instance of the RL algorithm, and I'm going to keep track of the exploration policy, an exploitation policy, which is going to be estimated soon, and an MDP. Now I can try to flesh this out. In the case of model-based value iteration, what I have is the exploration policy, this MDP that I'm trying to estimate, and this exploitation policy. So what is this MDP? It's this class. Normally when I defined an MDP it was essentially hardcoded: I just said, here are the states, here are the actions, and so on. But now I want to learn this MDP, so I'm going to make it a little more flexible. In particular, there's some start state, which I don't know. There are some rewards, and those are going to be a mapping from (state, action, next state) triples to a reward. There are transitions, which I also don't know, and those are going to be a mapping from (state, action, next state) to a count of how many times I've witnessed that transition. And the end states are going to be just a set of states I saw where isEnd was true. Okay. So initially I don't know

Segment 7 (30:00 - 35:00)

anything, so all of these are just going to be empty, and the discount is just a number. Okay. So the model-based RL algorithm starts with the exploration policy, an empty MDP, and an exploitation policy that is also empty right now. Let me give some examples of calling get_action and incorporate_feedback on this RL agent. If I call get_action on state 1, what does it do? If I have an exploitation policy, then I use it; if I don't, I call the exploration policy, which just chooses a random valid action. And incorporate_feedback, which before did nothing: how do I incorporate feedback here? I call the underlying MDP's incorporate_feedback, just passing all the feedback through to the MDP, and there I update the MDP. If I haven't seen a start state, then I set the start state. I set the reward: I saw this piece of feedback, where I was in a state, took an action, got to a next state, and got this reward, so I just remember that reward. Then I note that I saw this transition from (state, action) to next state, and I add one to its count. And if I'm at the end, then I add the state to the end states, but I'm not at the end here. Okay. So, after incorporating this particular feedback, this is my representation of the MDP: I know that I started in 1, I know the reward of this particular transition is minus 1, and I saw that when I was in state 1 and walked, one time I got to 2. Any questions about what's happening here? So, just to flesh this out: every agent has two functions, get_action and incorporate_feedback, and I went through what they mean here. I'm going to define a bunch of agents, and for each agent I have to tell you what its get_action does.
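The learnable MDP representation just described (unknown start state, reward map, transition counts, end-state set, all filled in by feedback) can be sketched like this; the class and field names are my guesses at the lecture's code:

```python
from collections import defaultdict

class EstimatedMDP:
    """Learned model of the world: everything starts empty and is
    filled in incrementally as feedback arrives."""
    def __init__(self, discount=1.0):
        self.start_state = None
        self.discount = discount
        self.rewards = {}                    # (s, a, s') -> observed reward
        self.transitions = defaultdict(int)  # (s, a, s') -> visit count
        self.end_states = set()

    def incorporate_feedback(self, state, action, reward, next_state, is_end):
        if self.start_state is None:
            self.start_state = state          # first state seen is the start
        self.rewards[(state, action, next_state)] = reward
        self.transitions[(state, action, next_state)] += 1
        if is_end:
            self.end_states.add(next_state)

# the feedback fragment from the lecture: in state 1, walked, reward -1, got to 2
mdp = EstimatedMDP()
mdp.incorporate_feedback(1, "walk", -1, 2, False)
```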
In this case, it just follows the exploration policy as long as we're in the explore phase, and incorporate_feedback just updates the MDP. — Yeah. Do we not need to worry about probabilities when we're doing this? — So the question is, do we have to worry about probabilities? We will, and hopefully I'll show that; it's going to come in through the transition counts. Walking is deterministic, so there's only one possible outcome there. But if you take the tram, you'll see that, say, four times you end up in the same state and six times you end up in twice the current state, and we'll use that to estimate the probabilities. Okay. So let's see what happens when I simulate the MDP using RL. I'm not going to step through this function, since I already showed it; this is basically rolling out this RL algorithm on this MDP ten times, and I get an expected utility of about minus 9 here. What I can do now is compare the true MDP, which was used to simulate, with the internal MDP. You can think of it this way: the environment has an MDP which it's not telling the agent, and the agent is trying to make its internal MDP match the true MDP of the environment. So in this case, I think

Segment 8 (35:00 - 40:00)

let's look at the internals of this MDP after doing 10 rollouts. The start state is 1, and here are all the rewards that I saw; I probably saw everything, but maybe not everything. And here are the transition counts. To the earlier question: if I was in state 2 and I took the tram, then two times I ended up in state 2 and five times in state 4. So later we'll see that you can estimate the probabilities of these transitions as five out of seven and two out of seven. Notice that the counts are not fully accurate: the probabilities should be 60% and 40%, but because these are small numbers, sometimes the estimate is bigger and sometimes smaller. Okay, so that's the internal representation of the MDP; that's this guy over here. Now let's compare with the true MDP. The true start state is 1, and for the estimated MDP, if I want its estimated start state, I just return the start state I stored; that's the easy case. Now, the true MDP has these successors from state 1; let's see what the estimated ones are. How do I compute the estimated successors? This is where I compute the probabilities, because the MDP class currently only stores the counts. What I do is: I'm at a particular state, and for every action I look at where I end up. In the case of walking, I always end up in 2, so the probability is 1, and I add walk with probability 1. For the action tram I have these counts: one time I end up in 1, two times in 2. So when I compute those successors, the estimated fractions will be 0.333 and 0.667. This is how I use the counts to define the estimated successors.
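Normalizing the transition counts into estimated successor probabilities can be sketched as follows, using the lecture's example of counts 5 and 2 for (state 2, tram) becoming 5/7 and 2/7; the function name and the flat dict representation are my own:

```python
def estimated_successors(transitions, rewards, state, action):
    """Turn (s, a, s') -> count entries into (s', probability, reward)
    triples by normalizing the counts for this (state, action)."""
    counts = {sp: c for (s, a, sp), c in transitions.items()
              if s == state and a == action}
    total = sum(counts.values())
    return [(sp, c / total, rewards[(state, action, sp)])
            for sp, c in counts.items()]

# counts observed in the lecture's run: 5 times to state 4, 2 times back to 2
transitions = {(2, "tram", 4): 5, (2, "tram", 2): 2}
rewards = {(2, "tram", 4): -2, (2, "tram", 2): -2}
succ = estimated_successors(transitions, rewards, 2, "tram")
```

The estimated probabilities (5/7 and 2/7) differ from the true 0.6 and 0.4 exactly because the counts are small; more exploration shrinks that gap.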
You'll see that, comparing the true MDP with the estimated MDP, they're somewhat close but not exactly the same, because it's just an estimate. And then you can check that the isEnd function works as well; isEnd is generally pretty easy. So, the more the agent explores, the closer the estimated MDP will be to the true MDP, assuming that the exploration policy actually tries all valid actions. Obviously, if the exploration policy only tries one thing, then you'll never learn about the actions you didn't try. Okay, so that's stage one: we've estimated the MDP. Now we need to compute the optimal policy of this estimated MDP, and we're just going to run value iteration. Remember what value iteration does from last time: it computes the optimal value as well as the optimal policy. In this case it actually turns out to be the optimal policy, even though the MDP was slightly wrong. That sometimes happens. Actually, I ran another one earlier today with a different random seed, and the policy

Segment 9 (40:00 - 45:00)

was wrong as well, so there's definitely some stochasticity. Okay. Notice that when you run value iteration, this policy is optimal for this particular estimated MDP. If the estimated MDP is wrong, then the policy is also going to be wrong; but if the estimated MDP is close to the true MDP, then hopefully the policy will be close to the true optimal policy as well. So let's compare the policies. I think in this case the two policies are the same, which is great; that comment in the code is from the run with a different random seed. Oh well. Using this estimated policy, now we can go and exploit. What I do here is run the same agent, but now inside the agent it has the estimated MDP and the exploitation policy, and I get a much higher reward of minus 6.6. So this minus 9 was from purely random exploration, which of course is not going to be very optimal; once I estimated the MDP and computed the optimal policy on it, I'm doing much better. Okay, questions about this? — Yeah, is this called hidden Markov, or partially observed, or something? — The question is, is this a hidden Markov model or a partially observed MDP? This is fully observed. Everything we're doing here is fully observed, because the RL agent, if you remember, when it gets the feedback, actually gets the state. If you only got part of the state, then it would be partially observed. Okay. Some notes: notice that the utility you get in the exploration phase is suboptimal, minus 9. But this is critical, because we're learning: you invest in learning, and then you can exploit later. In practice you don't have to have these two separate phases where you're purely exploring and then purely exploiting; I'm just doing this for simplicity right now.
What you can do is continually refine your MDP, and once in a while call value iteration and update the policy; and, as we'll see later, you can even use the exploitation policy to explore, as long as you're careful. Okay, so, summary. Model-based RL is what we're doing here; model-based means estimating the MDP, the model of the world, from feedback. Once you've estimated the MDP, you use value iteration to compute the optimal policy of the estimated MDP; it's not the optimal policy overall, but you take what you can get. And once you have the estimated policy, you just exploit it. — Yeah. Do we know all the states that are possible? Like, at the very beginning, we just don't know the start or the end state. — So the question is, do you know all the states ahead of time? You don't. If you remember, up here, when you initialize this MDP there are no transitions, there are no rewards, there's nothing; you only start updating it incrementally when you get incorporate_feedback calls from the environment. — Is it possible that

Segment 10 (45:00 - 50:00)

— Right, yeah. So that's a good question. When you run your policy after you've estimated your MDP, it's possible that you encounter new states, and in that case you just generate a random action; or, in this case, the code probably just crashes. Okay, any other questions? So now let's think about the next thing, which is: can we estimate the optimal policy more directly? At the end of the day we just care about the policy, and the MDP we estimated was just a means to an end; we did that because we knew how to get a policy from an MDP. So can we do this more directly? The next three algorithms are going to do that. In general, these methods are called model-free: model-based means you're estimating the MDP, and model-free means you're not estimating the MDP. Before, we estimated the MDP first and then used value iteration to compute the optimal policy of the estimated MDP. So let's just review the recurrence we saw 30 minutes ago. The optimal policy is the argmax over the Q-values, and the Q-values are this other recurrence in terms of transitions, rewards, and future values. The thing to note, based on these recurrences, is that all you need to know is Q. If you know Q, then you can just read off the optimal policy. When we did the model-based approach, we estimated T and R and then used those to compute Q. But maybe we can just get Q directly. Okay, so let's try to do that. The idea is actually very simple; we saw it last time when we were doing policy evaluation. The first thing we did when we wanted to evaluate policies was just roll out a bunch of times and average, and that's exactly what we're going to do here. There's going to be a slight nuance, but we'll get to that.
Suppose you did a rollout: you walked, then you took the tram and the first time the tram got stuck so you ended up in the same state, and the second time you ended up in state 4. For every step, I'm going to compute a utility. Normally when we compute the utility of a rollout, we just compute a single number, which is the utility at the beginning. But for reasons that will become clear, I also want the utility at each step, from that point on. So at step zero I have minus 1, which is this reward, plus the discount times minus 2, plus the discount squared times this minus 2; that's my utility there. Then at step one, I just pretend step zero doesn't exist, and the utility is minus 2 plus the discount times minus 2. And for the last step, I assume the first two steps don't exist, and it's just minus 2 plus zero. So there happens to be a nice recurrence between the utilities: the utility at step zero is the reward at step zero plus the discount times the utility at step one, and similarly, at each step the utility is the reward at that step plus the discount times the utility at the next step.
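The recurrence u_t = r_t + gamma * u_{t+1} lets you compute all the per-step utilities in one backward pass; a minimal sketch, using the example rollout's rewards (-1 for the walk, -2 for each tram attempt):

```python
def step_utilities(rewards, discount=1.0):
    """Compute u_t = r_t + discount * u_{t+1} for every step of one
    rollout, via a single backward pass over its rewards."""
    utilities = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        utilities[t] = running
    return utilities

# rewards from the example rollout: walk, tram (stuck), tram
u = step_utilities([-1, -2, -2])  # with discount 1: [-5, -4, -2]
```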

Segment 11 (50:00 - 55:00)

Okay, so this recurrence is going to be helpful when we generalize the algorithm. All right, now let's turn to the question of which policy we should use to roll out. If we get a bunch of rollouts, we can estimate Q. For model-based value iteration we had a purely random exploration policy. Here we're going to do something a little more sophisticated, called epsilon-greedy. The way epsilon-greedy works is that with probability epsilon, say 0.1, you choose a random action according to the exploration policy, and with probability one minus epsilon you choose the best action according to the current estimate of the Q-values. Epsilon is generally small, and it controls how much exploration you do: the larger epsilon is, the more exploration; the smaller it is, the less. So let me draw another one of these diagrams. Now we have an agent that's model-free Monte Carlo. Inside it, I'm still going to have the exploration policy, but instead of an estimated MDP I'm going to have an estimate of Q pi. Okay, so let's see how I do that. Same MDP as before, and the exploration policy is the same. In model-free Monte Carlo I'm going to keep track of my estimate of Q. And the way I'll do that is: whenever I see a state-action pair, I'm going to remember all the utilities I've gotten at that point. I'm going to sum them, and I'm also going to remember how many I have, which gives me a way to keep a running average. So sum-utilities is, for every state and action, the sum of all the utilities I've gotten, and count is the number of times I've seen that state and action. I'm also going to keep track of the current rollout.
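The epsilon-greedy rule described above is essentially a one-liner; a sketch, assuming `exploration_policy` and `best_action` callables exist (both names are illustrative, not the lecture's code):

```python
import random

def epsilon_greedy(state, epsilon, exploration_policy, best_action):
    """With probability epsilon explore; otherwise exploit the current Q estimate."""
    if random.random() < epsilon:
        return exploration_policy(state)
    return best_action(state)
```

Setting epsilon to 0 recovers pure exploitation, and setting it to 1 recovers the purely random exploration policy.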
The reason I keep the rollout is that the feedback I'm getting comes in chunks: I only get one step of feedback at a time, and in order to compute the utilities I need the whole sequence. So this is a bit of an annoyance, but it's fine. All right. Let's see what happens when I call get action on this particular agent with state one. In the case where I haven't seen the state before, or haven't taken any actions from it, I just call the exploration policy. And what happens when I incorporate feedback? This part is the same: I keep track of what the start state is, and I add this particular step to my rollout. I record the action (I'm just storing probability one because I don't actually need the probability here), the reward, and the next state. If I'm at the end I'll actually update my Q-values, but I'm not at the end yet. So I incorporate a second piece of feedback; still not at the end. Now the third piece of feedback: my rollout is walk, tram, tram, and assuming the end state is four, just to keep things simple, I'm now at the end of this episode and I update the statistics needed for computing the Q-values. So I look at the utilities, and I want to compute a utility for each element of the rollout, where the last element is just

Segment 12 (55:00 - 60:00)

like at the very end, always going to be zero. So I walk backwards and compute the utilities for every step, using the recurrence I had: the utility at step i is the reward at step i plus the discount times the utility at step i + 1. So at this step, the discount is one, so I have -2 plus 0; then I update the sum-utilities with the utility I've seen, and update the counter to say I've visited this state-action pair once more. Here the reward is -2, so this is -2 plus -2, which is -4; update the running sums. Finally, this reward is -1, giving -1 plus -4. So the utility at each position is the discounted sum of all the rewards from that point on. This gives you an unbiased estimate of the utility at a particular state-action pair if you were just rolling out the policy. When the episode ends, I reset the history: when I reach an end state, I wipe out the rollout and can start over. Okay, so now, once I have these Q-values, let's see what get action does. Before, I hadn't seen any counts, so I just followed the exploration policy; now I'm doing epsilon-greedy, which means with probability epsilon I choose from the exploration policy and otherwise the best action according to the Q-values. Okay, so let's dive in to see what pi does. For each action, I compute the Q-value of that state and action and take the best one. How do I compute the Q-value? It's just the sum over the utilities I've seen at that state-action pair, divided by the number of times I've seen that state-action pair. I do that for all the actions, and I take the action that gives me the maximum Q-value.
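The bookkeeping just walked through (sum of utilities, counts, and the running average) can be sketched as follows; the class and method names are illustrative, not the lecture's code:

```python
from collections import defaultdict

class MonteCarloQ:
    """Estimate Q^pi by averaging utilities observed at each (state, action)."""

    def __init__(self, gamma):
        self.gamma = gamma
        self.sum_utilities = defaultdict(float)  # sum of utilities per (s, a)
        self.counts = defaultdict(int)           # visits per (s, a)

    def end_of_episode(self, rollout):
        """rollout is a list of (state, action, reward); walk it backwards."""
        u = 0.0  # utility past the end of the episode is zero
        for state, action, reward in reversed(rollout):
            u = reward + self.gamma * u          # u_t = r_t + gamma * u_{t+1}
            self.sum_utilities[(state, action)] += u
            self.counts[(state, action)] += 1

    def q(self, state, action):
        """Running average of observed utilities: the Q^pi estimate."""
        return self.sum_utilities[(state, action)] / self.counts[(state, action)]
```

Feeding in the walk-tram-tram rollout from the example (rewards -1, -2, -2 with discount 1) gives a Q estimate of -5 for walking from state 1, and, since the tram was tried twice from state 2 with utilities -4 and -2, an estimate of -3 for the tram from state 2.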
Okay, so rewinding back, that's what get action does. Inside, I have Q-values, and the Q-values let me quickly get to a policy. So let me try to flesh out the rest of this. Here's the environment and get action. This pi is going to be used by get action as well as alongside this exploration policy. And incorporate feedback is going to do the updating of these counts. So let me do it this way: I have these counts that I'm remembering; if you normalize the counts you get the Q-values, from the Q-values you get the policy, and the policy drives get action. And notice that the MDP still exists. It's the same MDP, but I never have to explicitly estimate it. Just to make this more symmetric: from an MDP I can also get a policy, which is what I've been calling the exploitation policy. Either in the first phase I'm using explore to get an action, or I use this

Segment 13 (60:00 - 65:00)

exploitation policy, and then incorporate feedback goes to this MDP. Okay. All right, so let me skip these other calls for now. Okay, maybe I'll pause for questions. — Yeah. — Is Q just another style of MDP? — So the question is: is Q another style of MDP? It has information that's kind of like what you have in an MDP, but it's not an MDP. An MDP has transitions and rewards and all that. You can think of Q as maybe a compiled version of the MDP that tells you what the value of certain actions is. — If so, why would the model-free one be more efficient than the model-based one? — So the question is: why is the model-free one better than the model-based one? This is actually a really interesting question. I'll give you one piece of intuition: often you might know what the right thing to do is, but you can't necessarily predict what's going to happen. If you think about the real world, the state is super complicated, so in general it might be easier to just learn a policy that takes the current observations and tells you what to do. If you're playing tennis, you might have the policy to go and hit the ball, but you can't really predict everything that's going to happen in the scene. Now, the Q-value and pi you should think of as essentially synonymous with each other, right? You can quickly get pi from Q. Q just tells you what to do, or at least the value of each action. MDPs are more general and powerful: well, they don't tell you what to do, but they tell you, if you did something, what would happen.
So these days a lot of people talk about world models, and the idea of a world model in the context of RL is that you actually want to go back and build the MDP, or some sort of model of what's happening in the world. Even though it's harder, now that we have foundation models and better models of the world, that can in the long term potentially give you better results. But model-free versus model-based is an ongoing debate; there are cases where you want to do one versus the other. — Yeah. — I thought we were trying to estimate Q star, but we're taking the average of the utilities, and that corresponds to Q pi, not Q star. — Yeah, that's a good point. So you caught me on this one: we're not even solving the same problem here. I was trying to pull a fast one. We are computing Q pi. So what is Q pi? Let me be very clear about it: Q pi of (s, a) is the value if you start in state s, take action a, and then follow pi. This is not what you actually want. It helps you estimate how well you're doing, but later we're going to see how we can actually get to Q opt; it's going to take two more steps. Okay, so let's simulate this MDP using this model-free Monte Carlo agent, and we get -9. I think this relates to your point: you see that the value here actually isn't very good. It's basically in the exploration phase, and that's because we're not actually computing the optimal policy. So the way you would actually use this is you estimate the

Segment 14 (65:00 - 70:00)

current policy. Q pi gives you some information, because you can take the argmax over it, and that essentially gives you a one-step improvement. This is called policy improvement, which we didn't talk about: it gives you a slightly better policy, and then you can estimate the value of that policy and improve it again, and so on. So there is a path to improving the policy if you can estimate Q pi. Okay, so to summarize: model-free Monte Carlo estimates the Q-values of the current policy. It uses rollouts directly, bypassing the need to estimate the MDP, and we're using epsilon-greedy to balance exploration and exploitation. Okay, so there's one problem here: in life you only get one rollout. You go to the end, and you die, and that's it. So you can't really do model-free Monte Carlo in real life. Now the question is: can you update the Q-values before the rollout is over? The problem with model-free Monte Carlo is that you'd be 100 years old, looking back across your life to estimate the Q-values, and it would be too late. You want to estimate the Q-values as you go. So that's going to be this new algorithm called SARSA, where we estimate the Q-values as we roll out. This seems a little tricky, because if you don't roll out completely, how do you even get the utility? The utility is defined as going all the way to the end. The trick here is an idea called bootstrapping, which is common in RL: you combine the immediate reward you got with a model estimate of the future. Okay, so let me talk about SARSA. Remember, what incorporate feedback gives you is (state, action, reward, next state) tuples, right? You don't see everything. So let's see if we can do something with that knowledge.
So in model-free Monte Carlo we compute a utility, which is just the discounted sum of the rewards. The idea in bootstrapping is that we keep the immediate reward and substitute the rest with our current estimate: gamma times Q pi. And remember, Q pi is essentially our current estimate of the value. Algorithmically, what we're going to do is perform essentially a gradient update to move our current estimate of Q pi toward this new piece of information we got. Okay, so let's go through our favorite example again. In SARSA, we keep track of these Q-values directly, and when you call get action on SARSA it looks the same as before: if you haven't seen anything, you try the exploration policy. Now, incorporate feedback is the core of the SARSA algorithm. The idea is that you have your state, action, reward, and next state; that's part of a rollout. And what we're going to do is take this next state and simulate getting a next action. So at this point we also have an action that we sort of simulated. And now hopefully you see why this is called the SARSA algorithm: state, action, reward, state, action. So then you can compute your utility as

Segment 15 (70:00 - 75:00)

the reward plus the discount times Q of the next state and the next action. Okay. Remember, before, in Monte Carlo, all incorporate feedback did was add to the rollout, and only at the end could I compute the utilities; now I'm going to compute the utility as best I can immediately. And now this update is: take my current estimate of Q, and nudge it with some learning rate in the direction of utility minus Q. So if the utility is equal to the current estimate, I don't move it at all, and if the utility is greater, I move it toward the utility. Okay. So once I incorporate the feedback... in this case I got unlucky and I'm just exploring as well. But let's see pi. Pi is determined from Q simply as the argmax over Q of (state, action): choose the action that gives me the best Q. Okay, so let's try it out again: -9.6, which is around the same. And again, the observation is that what I'm doing is estimating the Q-values of the current policy, just as you roll out. So it's not going to do better; it just allows you to update sooner. We were able to do that by using bootstrapping, substituting part of the utility calculation with the model itself, and then estimating the Q-values with these gradient updates. So now, in the last minutes, I'm going to try to get you to the optimal policy. This sounds familiar from last week. This is called an on-policy algorithm, because the Q-values you're computing are those of the current policy. But what we really want is to estimate the optimal policy. In order to do that we need an off-policy algorithm, because the policy we're following has to have some exploration, and we don't know what the optimal policy is. So we want to estimate the optimal policy, but not follow it.
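The SARSA update just described fits in one function; a minimal sketch, where `Q` is a plain dict keyed by `(state, action)` and `eta` is the learning rate (names are illustrative, not the lecture's code):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, gamma, eta):
    """Nudge Q(s, a) toward the bootstrapped target r + gamma * Q(s', a')."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + eta * (target - Q.get((s, a), 0.0))
```

Note that the target uses the next action the policy actually took, which is exactly what makes this an on-policy method.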
Okay, so that's going to be Q-learning. It's basically a few characters changed from SARSA. SARSA estimates the Q-values of the current policy; Q-learning estimates those of the optimal policy. So everything is going to be the same here. In this implementation I've used subclassing: Q-learning inherits everything from SARSA, except that I'm overriding incorporate feedback, just to show you how similar they are. Okay, so what am I doing here? The only difference is that now I'm using pi instead of get action, and pi represents, if you remember, the optimal policy given these Q-values; the rest of the update is identical. So let me write out Q-learning here: in Q-learning I'm updating Q opt, and I'm doing an update with the learning rate, where the utility is the reward

Segment 16 (75:00 - 78:00)

plus gamma times the value, which is the max over a' of Q opt(s', a'), and then minus Q opt(s, a). So the Q-learning update is: Q_opt(s, a) ← Q_opt(s, a) + eta * [(r + gamma * max over a' of Q_opt(s', a')) - Q_opt(s, a)]. In SARSA, I'm estimating Q pi of (s, a), and the update is: Q_pi(s, a) ← Q_pi(s, a) + eta * [(r + gamma * Q_pi(s', a')) - Q_pi(s, a)]. In the code, what you're seeing is that in SARSA the next action comes from get action, the action your policy actually gives you, and in Q-learning it comes from self.pi, the greedy policy. So SARSA is on-policy and Q-learning is off-policy. Okay, I realize that was a little bit rushed, so maybe next time I'll try to review this a little more. Okay, so just to summarize what we learned today. In reinforcement learning we learn the optimal policy by interacting with the environment, instead of being handed the MDP. The contract here is that we have an RL algorithm that takes actions in the world and incorporates feedback. We looked at four algorithms. First, you can estimate the MDP and then compute the optimal policy. You can estimate the Q-values directly from rollouts using model-free Monte Carlo. You can use SARSA to do the same thing but with bootstrapping, so you don't have to wait until the end. And finally, you can switch everything over to estimating the Q-values of the optimal policy with Q-learning. Okay. So next time, we'll consider the case where you have huge state spaces. Here everything assumed the number of states is small, because we're indexing and computing a value for every state, but in a much more realistic setting you can't see all the states, so what do you do in that case? Okay, that's it for today. I'll see you next time.
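The Q-learning update written out above differs from the SARSA one only in the target: the max over next actions replaces the action actually taken. A minimal sketch, assuming `Q` is a dict keyed by `(state, action)` and `actions` lists the actions available in `s_next` (names are illustrative, not the lecture's code):

```python
def q_learning_update(Q, s, a, r, s_next, actions, gamma, eta):
    """Q_opt(s, a) += eta * ((r + gamma * max_a' Q_opt(s', a')) - Q_opt(s, a))."""
    v_next = max((Q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
    target = r + gamma * v_next
    Q[(s, a)] = Q.get((s, a), 0.0) + eta * (target - Q.get((s, a), 0.0))
```

Because the target ignores which action the exploration policy actually chose, the estimate can converge toward the optimal Q-values even while following an epsilon-greedy policy, which is what makes Q-learning off-policy.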
