# Stanford CS221 | Autumn 2025 | Lecture 7: Markov Decision Processes

## Metadata

- **Channel:** Stanford Online
- **YouTube:** https://www.youtube.com/watch?v=2ZtF1j3n6XE

## Contents

### [0:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE) Segment 1 (00:00 - 05:00)

Okay. Uh, thanks everyone for coming on this rainy day. This week we're going to talk about Markov decision processes and reinforcement learning. Just to recall where we are: last week we talked about search. Search was our attempt to get into agents that could reason, and we formalized it as a search problem, where you define a start state; a set of successors from each state, which includes what actions you can take, what their costs are, and where you end up; and finally an IsEnd criterion. The main thing about a search problem is that an action is deterministic: you're in a state, you take an action, you go deterministically to one particular state. So remember this tram problem that we had before. The start state was 1 and you're trying to get to location 10. If you look at the successors of a given state, you have different actions; each action has some sort of cost, and associated with each action is, deterministically, one particular state. Okay. And now we're going to make things more interesting and talk about Markov decision processes. These generalize search problems, and the key difference is that now, when you take an action, it can have a stochastic outcome. This shows up in a lot of different places. For example, you're playing a game and rolling dice, or you're trying to figure out how to get to your grocery store and you want the optimal decision: do you walk, bike, or drive? You need to take into account the uncertainty, which includes how much traffic there will be on the road, the time to find parking, and so on. So most of the time in the real world there's actually uncertainty, and search problems won't be able to capture that, because everything's deterministic; Markov decision processes will, because they show you how to incorporate uncertainty. Now, a lot of the structure that we have from search problems will carry right over, so don't forget last week; we're going to build on top of that, but we'll see some interesting things result. Okay. So let's jump in and define what an MDP is. Markov decision processes have a very long history, going back to operations research in the 1950s. The word Markov comes from the theory of Markov chains in probability and statistics, and the idea of a Markov chain is that the past and the future are independent given the state. Right? Remember, in a search problem the state captures everything you need about the past so that you can evaluate actions and costs going forward. So that's the Markov part of MDP. The D part is decision, which means that unlike a Markov chain, which is pure simulation (you're just running a chain and seeing what happens), here you have an agent that takes actions and is trying to find the best action to take. And finally, process just means something that happens sequentially over time; it's a term from probability. So that's breaking down what Markov decision process means. To define our first MDP, let's take the tram problem from before. Remember, we had locations 1 through n; walking from i to i + 1 takes 1 minute, and taking the tram from i to 2i takes 2 minutes. This is all the same, but we're going to add one twist, which is that the tram can break down with some probability. In a search problem, we can't model this, because either it breaks down or it doesn't.
But in the real world, we know that things can break down with some probability, hopefully low. So the goal is to go from 1 to n in the least time, in expectation. Again, the wording is chosen carefully here. When we had a search problem, we could just ask: what is the minimum-cost path? There's a path and that's it. But now things can break down; sometimes it breaks down, sometimes it doesn't, and so we have to take an average over possible outcomes. Okay, so let's define the flaky tram MDP. We have 10 locations and the failure probability is 0.4. So this is a pretty bad tram, but I guess if it's a magic tram, it's probably not going to be so reliable. Okay, so we have a start state, which is 1.

### [5:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=300s) Segment 2 (05:00 - 10:00)

So now let's look at the successors; this is where a lot of the action is. I've defined the interface to be essentially the same, to highlight the parallel between the search problem and the MDP, but it's going to look a little bit different. In the search problem, remember, we had walk: if you could stay within bounds, then you had a walk action. This part is the same, except for a few things that are different. The action is walk, and the probability is 1. When I say probability 1, that's basically my probabilistic way of saying deterministic: if something happens with probability 1, it's sure to happen. Instead of saying cost equals 1, I'm going to say reward equals -1. That's just by convention; cost is the negative of reward. And from state i I go to state i + 1. Okay, so that's the walking action. Now what happens if I take the tram, assuming I can stay within bounds? If I take the tram, remember, I go to 2i. There are two successors that I create from this one action. The first is what happens if the tram is actually successful: with probability one minus the failure probability, which is 0.4, I move to state 2i and I get a reward of -2, so a cost of 2. That means with probability 0.6 I can go to state 2. Now what happens if the tram fails? Well, if the tram fails, then I just stay in the same place. So with probability 0.4, I am still stuck in state 1. Interesting note: this is actually a cycle, which might or might not cause you problems in the search case, but we'll show you algorithms that can handle cycles, no problem. Okay, so those are the successors: I have three successors for the two actions. Because I have non-determinism, the number of possible places I can end up might be larger (not necessarily, but could be larger) than the number of actions. And finally, IsEnd: this is the same as before, just test whether the state is n or not. So we can draw this out; let me try to make this a little bit bigger momentarily. It gets a little bit confusing, but I'll try to explain it. The start state is 1. From 1 you can walk, and I'm using these dotted circles to represent "I'm at a state and I've taken an action." I call these chance nodes, because at this point there could be multiple places I go, right? In this case, if I take the walk action from state 1, then with probability 1 I end up in state 2 with a reward of -1. And if I take the tram action, then notice that there are two edges coming out of tram: I can go to state 2 with probability 0.6 and reward -2, or I stay in state 1 with probability 0.4 and reward -2. And then from 2 I can walk, which goes deterministically to 3, or I can try to take the tram, in which case I either stay in the same place or I go to 4, and so on. You can see the graph, and eventually everything goes to 10. Okay, let me pause there to make sure everyone's on board with MDPs and how they're defined. The outline is pretty much the same: I have a start state, I have a set of successors (now each successor has a probability associated with it), and the IsEnd is the same. Okay, any questions? — Uh, can you explain where the probability comes from for each state? — So the question is: where does the probability come from?
The probabilities are part of the problem definition. When I define what the MDP is, I have to specify the probability for every possible successor. In this case, I just made it up: I put 0.4, but you could put 0.3. It's just an example.
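To make the interface concrete, here is a minimal Python sketch of the flaky tram MDP as just described. The class name, the `(action, prob, reward, next_state)` tuple layout, and the parameter names are illustrative choices, not the lecture's exact code:

```python
class TramMDP:
    """A sketch of the flaky tram MDP: walk i -> i+1 (cost 1),
    or tram i -> 2i (cost 2), which fails with some probability."""

    def __init__(self, num_locations=10, fail_prob=0.4):
        self.num_locations = num_locations
        self.fail_prob = fail_prob

    def start_state(self):
        return 1

    def is_end(self, state):
        return state == self.num_locations

    def successors(self, state):
        # Each successor is (action, prob, reward, next_state).
        result = []
        if state + 1 <= self.num_locations:
            # Walking is deterministic: probability 1, reward -1 (cost 1).
            result.append(('walk', 1.0, -1, state + 1))
        if 2 * state <= self.num_locations:
            # The tram succeeds with probability 1 - fail_prob...
            result.append(('tram', 1 - self.fail_prob, -2, 2 * state))
            # ...and fails with probability fail_prob: pay the cost, stay put.
            result.append(('tram', self.fail_prob, -2, state))
        return result
```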

### [10:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=600s) Segment 3 (10:00 - 15:00)

On Wednesday, I'll show you how, in general, when you don't know the probabilities, you can try to estimate them, or use reinforcement learning, which doesn't require you to know what the probabilities are. Any other questions? Okay, let's go on. So far I've done this in code; on the assignment, you'll see a bit of mathematical notation, so let me translate between the two. From a given state s, I'm going to let Actions(s) denote the possible actions; in this case, walk or tram. I'm going to use Reward(s, a, s'). To explain this: suppose I'm in a state s, I take action a, and I manage to end up in state s'. What is the reward that I get? It's a number that tells me: if I took this step, what would happen? Now notice that in general these triples encapsulate two things: from s to a, that's the agent making a decision, and from a to s', that's nature making the decision. The agent, importantly, doesn't control whether the tram is going to succeed or not; that's up to chance. And the probabilities of chance are governed by this transition function T, with the same type signature: I'm in state s, I take action a, and I end up in state s' with this probability. So in this example, T(5, tram, 5) = 0.4 and T(5, tram, 10) = 0.6. And because this is a probability, something has to sum to one. What sums to one is this: for every state-action pair, if you sum over all the possible subsequent states, those probabilities have to sum to one. So in this example, if I look at every state-action pair, which is every chance node here (everything that's dashed), and I look at all the outgoing arrows, those probabilities have to sum to one: 0.4 plus 0.6 is one. Just a good sanity check. All right, so we'll go back and forth between code and this mathematical notation, but conceptually we're talking about the same thing. All right. So let's consider another game just for fun. This is going to be a little dice game, and it proceeds in rounds. Each round, you decide whether you're going to quit the game or stay in the game. If you decide to quit, you get $10, and then we send you on your merry way. If you decide to stay, you only get $4, but then I'm going to roll a six-sided die, and if the die comes up one or two, we end the game; otherwise, we continue to the next round. So if you decide to stay, you get less money, but there's a possibility that you get to continue, and in the next round you might get four more dollars; or you can always quit and get 10. Okay, so let's define this formally. The start state is going to be this constant called "in", which means I'm in the game. And the successors of this state are: I can quit, in which case with probability 1 I pick up my $10 and go to the end state. If I decide to stay, then with probability 1/3 (the die coming up one or two) I pick up $4 and head out the door, but with probability 2/3 I pick up $4 and I get to stay in the game. Okay, so this is just formally writing down the game. And then IsEnd is, as you might imagine, just a test of whether the state is equal to the end state.
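In the same sketch style (again, the class name and the successor-tuple layout are my own choices, not necessarily the lecture's code), the dice game looks like this:

```python
class DiceGameMDP:
    """A sketch of the dice game: quit for $10, or stay for $4 and
    keep playing unless the die comes up 1 or 2 (probability 1/3)."""

    def start_state(self):
        return 'in'

    def is_end(self, state):
        return state == 'end'

    def successors(self, state):
        # Each successor is (action, prob, reward, next_state);
        # only the 'in' state has successors in this game.
        return [
            ('quit', 1.0, 10, 'end'),    # quit: collect $10, game over
            ('stay', 1 / 3, 4, 'end'),   # stay, die shows 1 or 2: $4, game over
            ('stay', 2 / 3, 4, 'in'),    # stay, die shows 3-6: $4, play again
        ]
```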

### [15:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=900s) Segment 4 (15:00 - 20:00)

There are only two states in this MDP. And if we draw it out, this is what it looks like. We have "in" (you're in the game), and you have two actions you can take. You can quit, in which case with probability 1 you get a reward of 10 and you end up in "end". If you decide to stay, then with probability 1/3 you end up in "end" and pick up a reward of 4, and with probability 2/3 you go back to "in". Okay, so just out of curiosity, how many of you would decide to stay in the game? Okay, some really strong bettors here. How many of you would just take your $10 and get out of here? Okay, smart move, too. How many of you would do something else, maybe stay in for three turns and then quit, or something like that? Okay. No. Okay, well, we'll come back to that point. It turns out that there's no point in this case in choosing some hybrid strategy: either you're going to stay or you're going to quit. And I'll maybe explain why towards the end, or if I forget, ask me. Okay. So hopefully we've shown two examples of MDPs. Just to reiterate the comparison with search problems: there are a lot of similarities. Both have the idea of states and actions. Both have a start state and an IsEnd function. Both have this successors function that returns possible actions and their consequences. There's a superficial difference between them, which is that the search problem uses costs and the MDP uses rewards. But again, I stress that you should be able to easily flip back and forth between them in your head: if I say a reward of -2, that means a cost of 2. This is just pure convention, because the folks who developed MDPs liked to think about rewards, and the search folks were all about minimizing cost. Just like in machine learning, sometimes people look at error rates and sometimes at accuracy; it's really two sides of the same coin. Now, the deep difference is that in search problems each action has one next state, while in MDPs each action has a distribution over next states, and that is the difference that's going to cause us to have a lecture on this, as opposed to not a lecture on this. Okay. So that's MDPs. Now let's talk about policies. In fact, we've already introduced what a policy is (remember when we talked about best-of-n), but let me introduce it again in the context of MDPs; it's actually the same definition. So here we have our flaky tram MDP, and the first question is: what does a solution to an MDP look like? In a search problem, a solution was just a sequence of actions. Everything's deterministic: you start at the start state and you can pre-plan everything, take those actions, and you're done. But for an MDP, this is not going to work, because when you take an action, you could go either to state 3 or state 5, and depending on where you end up, different actions might be applicable. So this isn't going to work. In an MDP, instead, we define a solution as a policy. A policy, remember, is a function that maps a state to an action. And this is important, because now no matter where I end up, whichever state I'm in, I always look at the policy, and the policy is going to be my guiding light and tell me what action to take. That's what the solution looks like: not just one action path, but really, for every state, what should I do? Okay, so let's explore some simple policies.
So here's a policy for the tram case: always walk. Remember, a policy takes a state and returns an action. In this case, it just ignores the state and always returns walk. Here's another policy: take the tram if it's possible to. This policy says: if I'm able to stay within bounds, then I take the tram; otherwise I walk. So if I pass in state 3, I take the tram, but if I pass in state 6, I have to walk, because 6 × 2 is greater than the number of locations.
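Since a policy is just a function from state to action, these two policies might look like this in code (a sketch; the bound of 10 locations is hard-coded here for illustration):

```python
def always_walk(state):
    # Ignores the state entirely and always walks.
    return 'walk'

def tram_if_possible(state):
    # Take the tram whenever doubling stays within the 10 locations;
    # otherwise fall back to walking.
    return 'tram' if 2 * state <= 10 else 'walk'
```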

### [20:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=1200s) Segment 5 (20:00 - 25:00)

Okay. So that's what a policy is, and there are potentially other policies you can imagine. So now you know what a policy looks like. How do you evaluate a policy? How do I know if it's any good or not? A rollout is a simulation of a policy in an MDP: take your policy, run it on the MDP, and that gives you a sequence of steps, and then we can see what happens. So let's take an example: the policy that always walks. I'm going to have this function generate_rollout that takes an MDP and a policy, and then I'm just going to roll out that policy. You might have heard the term rollout in the context of reinforcement learning; it's the same idea: you run a policy on your environment and you get some sequence of steps. So I start with state 1, and while I haven't reached the end state, I choose an action, in this case walk, because it's the always-walk policy, after all. The policy does the choosing of the action; what does the MDP do? The MDP is responsible for actually choosing the successor according to that action. So I look at all the successors and keep only the ones with the action that was chosen, the subset of successors with this action. I look at the probabilities (there's only one successor here, so it has probability 1), I sample a successor according to the probabilities, and that's my step. In this case it's going to be somewhat boring, because sampling from one choice is just choosing that option. But you can see how the algorithm goes, and at the end I get a sequence of walk actions, each action has a reward, and in general you sum the rewards and you get the utility. Now there's one caveat. You see this discount thing: MDPs all have some discount factor, and I'll explain what that discount does. Each rollout, which is a sequence of steps, generates a utility, and that utility is a discounted sum of rewards. Let me explain what that means. Discounting is a way to capture how important we think the future is, and it's represented by a number between 0 and 1, often written as gamma. In the normal case you would say discount equals 1, which corresponds to no discounting, and that says the future is just as important as the present. So if I look at compute_utility, I give it a sequence of steps and discount 1. What does that do? I look at each of the steps, take its reward, and multiply that reward by discount to the i-th power for the i-th step. Step zero will just have the reward, step one will have the reward times the discount, step two will have the reward times the discount squared, and so on. In this case, because the discount is 1, this term doesn't do anything: I take my rewards, sum them up, get my utility, and that's it. Notice that every step has a reward of -1, and each step contributes to the utility just the same as any other step. So this is no discounting. Now what happens if I discount with 0? This is full discounting; in this extreme, the future doesn't matter at all. If you compute the utility (remember, I'm multiplying by discount to the i), then if the discount is zero, for every i that's not zero I get a bunch of zeros. It zeros out the future.
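Here is a rough sketch of the rollout and discounted-utility machinery just described, assuming the `(action, prob, reward, next_state)` successor tuples from the earlier sketches (the function names are illustrative):

```python
import random

def generate_rollout(mdp, policy):
    """Simulate one episode: the policy chooses the action, and the MDP
    samples the successor according to the transition probabilities."""
    state = mdp.start_state()
    steps = []  # list of (action, reward, next_state)
    while not mdp.is_end(state):
        action = policy(state)
        # Keep only the successors consistent with the chosen action.
        candidates = [s for s in mdp.successors(state) if s[0] == action]
        weights = [prob for _, prob, _, _ in candidates]
        _, _, reward, next_state = random.choices(candidates, weights=weights)[0]
        steps.append((action, reward, next_state))
        state = next_state
    return steps

def compute_utility(steps, discount=1.0):
    # Discounted sum of rewards: step i contributes reward * discount**i.
    return sum(reward * discount ** i
               for i, (_, reward, _) in enumerate(steps))
```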

### [25:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=1500s) Segment 6 (25:00 - 30:00)

Okay, so when I sum that, I get a utility of -1. And then finally, usually you put the discount somewhere in between. If you do half discounting, the next step will matter half as much as the present step. You might think of this as: a dollar today is only worth 50 cents tomorrow. So if you compute the utility with a discount of 0.5, you'll see that for every subsequent step I hit the reward with another factor of 0.5. So even though this used to be a reward of -1, it's been squashed down to something quite small. Yeah? — Do you always use discount raised to a power, or are there other functions you can use? — So the question is whether the discount always looks like this. The form here is the discount raised to the power of the time step, and certainly in this class that's the only type of discounting we're going to look at. It happens to be mathematically convenient for some algorithms. There are other forms of discounting that don't factor into this form, but I'm happy to talk about that after. Okay, so just to recap: I have an MDP, I have a policy, I smash those together, I get a rollout, and the utility of that rollout is the discounted sum of the rewards, where discounting captures how much I pay attention to the future. Normally the discount is going to be 1 or very close to 1. You usually don't see things like a discount of 0.5 or smaller, because that's like saying I don't really care about the future, and the whole point of MDPs is that we're trying to make long-term decisions, so the future does matter. — Yeah. — Sorry, what do you mean by whether the future matters, or how important it is? — So the question is: what do I mean by the future mattering or not? Okay, let's go back here. When I roll out a policy, I have a reward at every time step, and generally you would look at the sum of all the rewards; that's basically how good the policy is. Now, when you have discounting, you're saying that the rewards you get later on don't matter as much, because they're artificially squashed down by multiplying by the discount. So suppose you had to make a decision; you're trying to find the optimal policy. A policy could say: I'm going to pick up $2 now, or I can pick up $2 in the future. The more discounting you do (that is, the smaller the value of gamma), the more I'm going to try to capture the reward now, as opposed to later, because it's not going to be worth as much later. — Yeah. Can you give an example of another use case, apart from calculating the present value, for the discount? — Can I come up with a different use case besides calculating present value? Well, every MDP has a notion of a discount, so it depends on your application. I think the present-value calculation is the economic motivation for this, and that's the right intuition to have, but we're doing this in the context of trying to find the optimal policy. — Yeah. — So in some definitions of an MDP, the reward function is based on the state itself. Is there an advantage to doing it this way, where the reward depends on the state, the action, and the next state? — Yeah.
So the question is (let me go back to the definition of this MDP) that sometimes you see a formulation of an MDP where the reward just depends on the state, or on the state and the action.

### [30:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=1800s) Segment 7 (30:00 - 35:00)

I've written it to be dependent on s, a, and s'. It turns out that these are all more or less equivalent, and what I mean by that is: if I had written down the simpler form, I could also make the reward not depend on s' by introducing other states. I can make a dummy state and say, okay, I'm going to transition to that dummy state and then pick up my reward. So there are games you can play to essentially recover the simpler formulation. The reason I do it this way is that it's much more natural to define MDPs, because you're given more flexibility, and it doesn't change the algorithms. Yeah? — Sorry, is the discounting factor always right? Like, if it's a problem where we really need the steps far in the future, and we apply a lot of discounting, does that mean we will deviate from the right answer? — So the question is: is discounting always right, or a good idea? Suppose you're going to get the right answer far in the future; imagine you need a policy that thinks a lot and comes up with an answer to a problem. You probably don't want discounting, or you want the discounting to be very small. Sorry, it's always confusing: more discounting means a smaller value. So you want the discount to be, say, 0.99. A discount of 0.99 says you're not discounting the future that much, but it applies a little bit of pressure to say: come on, let's get on with life and make your action sequence as short as possible. So it's kind of a length penalty; that's another way to think about it, if you want. Okay, good questions. Let's move on. So we have an MDP; we take a policy, we smash them together to create a rollout, and that generates a sequence of steps, which defines a utility. If I do this multiple times, I'm going to get potentially different utilities. For example, this time I took the tram and got stuck in state 1, but this time I managed to get through. And you can imagine there are exponentially many different things that can happen. So notice that the same policy yields different rollouts. Now, going back to the question: how do you evaluate a policy? The answer can't just be a random number; I can't just say -10 here. So what should I do? — You can look at the expected value. — Yeah, so you can look at the expected value, and in this case it's very intuitive: you roll out many times and you just average. Formally, the expected utility of a policy, also known as the value of the policy, is defined as the average or expected utility. You might have heard of value functions in reinforcement learning, which I'm going to talk about later and also next time; this is basically that value. Notationally, in math we often write this as V_π(s): the expected utility, or value, of starting in state s and following policy π. So I'm generalizing a little bit to say I can start in any state, not just the start state. Some of you who remember last week's dynamic programming should be thinking this is starting to look a little bit like future cost.
It's not quite future cost, for various reasons, but it's the same idea of defining, from a given state, some quantity that involves going all the way to the end. Okay. So let's implement a very naive

### [35:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=2100s) Segment 8 (35:00 - 40:00)

algorithm for estimating the value of a policy. This is called Monte Carlo policy evaluation. You pass in an MDP, a policy, and the number of rollouts you want to do, and it basically calls that generate_rollout function that we saw before, takes the utility (doing this the given number of times), and then just computes the average. That's it. So for walk, I only did one rollout. Why is that? — It's deterministic. — Yeah, it's deterministic, so I only need one rollout. Okay. For the tram-if-possible policy, I'm going to do 20 rollouts. And just in case: this partial notation basically says this function takes two arguments; partial applies it to the first argument, and the result is a function that takes the second argument. Okay, so let's do this. I do a number of rollouts, and each time I get a different value; those are the utilities for each of the rollouts. Then I average them, and I get -11.7. And if I increase the number of rollouts, eventually this will get more and more accurate. All right. So maybe just one comment on this: it will converge to the correct answer, but at what rate does the estimate converge? If you remember probability and statistics, the error in your estimate is roughly one over the square root of the number of rollouts. So if you have 100 rollouts, then your error is like one over 10, roughly. So it's not bad, but we're going to see that we can do a lot better. Yeah? — Um, I can't really understand why it always converges to one number. Like, if an MDP is going to have two results with the same probability, then it's likely to keep producing the two results separately, right? — So the question is: why does this Monte Carlo policy evaluation always return one number? Maybe just to simplify this: imagine you're trying to estimate the probability of heads of a coin. Every time you flip a coin, you're going to get either heads (let's say 1) or tails (which is 0). If you flip this 100 times and compute the number of heads, that gives you an estimate of the probability of heads. Does that make sense? — But let's take this example: if you get a head and then a tail, next time it's going to be half probability tail, half probability head. So it won't converge to one or zero, right? It keeps jumping between one and zero. — Okay, so first of all, this is the exact same principle as the heads-and-tails example, so I'm glad we have something simpler to think about. It is true that the sequence of heads and tails doesn't converge, right? You run it a million times, and it's still going to be heads, tails, heads, tails, whatever it is. But if you average over that time horizon, if you run a million times and look at the number of heads you have, it's probably going to be something around 500,000 if it's a fair coin. I'm interested in the fraction of heads, not in any individual outcome. — Okay, I'm happy to address that later. Yeah. The average utility is just the average of the utilities, naturally. Okay. So that was the flaky tram example. Now let's do this for the dice game. This is going to be the moment where we reveal which is the better option: do you quit or stay? People can change their mind if they want. Okay, so let's say we do always quit.
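As a sketch, the Monte Carlo estimator described above is just a few lines on top of the `generate_rollout` and `compute_utility` sketches from before (names illustrative):

```python
def monte_carlo_policy_evaluation(mdp, policy, num_rollouts=20, discount=1.0):
    # Estimate the value of a policy by averaging utilities over rollouts.
    # The error shrinks roughly like 1 / sqrt(num_rollouts).
    utilities = [compute_utility(generate_rollout(mdp, policy), discount)
                 for _ in range(num_rollouts)]
    return sum(utilities) / len(utilities)

# For example, something like
#   monte_carlo_policy_evaluation(TramMDP(), tram_if_possible, num_rollouts=20)
# produced an estimate around -11.7 in the lecture's run.
```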

### [40:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=2400s) Segment 9 (40:00 - 45:00)

So what's the value of always quitting? Actually, we don't even need to run this, but hopefully it should be fairly clear. — 10, right? — Okay, but just to go through the algorithm: you generate a rollout, you collect a utility of 10, and I only need to do it once because it's deterministic. That's it. Okay, so what about stay? I'm going to do this 20 times, and sometimes I get 4: that's when I stay and get kicked out immediately, which happens with probability 1/3. Eight happens with probability 2/3 times 1/3: I stay, get to continue, then stay again and get kicked out. And sometimes I get, wow, 60; I got really lucky there. But often I get 4. Okay, so if you take the average, you get 12.8. So how many of you would still stay? I mean, sorry, quit? Wait. Okay. So staying gives you an average utility of 12.8 and quitting gives you a value of 10. So how many of you would quit? Okay, maybe you don't trust the simulation here. Okay, fine, that's fine. If you trust this, then I think you should probably stay; that is, if you're trying to maximize utility. If you want to lose money, that's another story. Okay. So this is just an example; in some sense we haven't introduced any algorithms yet. It's just trying to introduce the concepts: a policy plus a rollout gives you a utility, and you can average utilities. And if you are only comparing two policies, you can just estimate the utilities, the values, of these two policies and choose the better one. Now, of course, these are only estimates, and so: can we compute them exactly, or more exactly, if that's a thing you can say? All right. Let's talk about policy evaluation. So far, we've performed multiple rollouts of a policy; that gives us multiple utilities; I average them, and that gives me the value of the policy. Is there a way to compute this more efficiently? The answer is yes, and the key is going to be using recurrences. Remember how we looked at dynamic programming for search problems and the idea of future cost? It's going to be the same form here. The details are going to be different, because I have probabilities involved, but the structure of the recurrence is the same. So first I'm going to introduce the idea of a Q value. A Q value takes three arguments: a state, an action, and a value function. This is a little bit non-standard, but hopefully it'll become clear why I do it this way. It's the value of taking action a in state s and then obtaining value based on whatever state I end up in. Okay, so let me try to draw this here. The idea is that I'm in state s, I take an action a, and that leads to the chance node (s, a). And remember, this can lead to multiple places, to some state s'. So the Q value is a function of these chance nodes. Q(s, a) is going to be whatever reward I pick up by taking this action and going somewhere, plus whatever value I'm going to pick up next. So I get some reward, and then I get whatever happens afterwards.

### [45:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=2700s) Segment 10 (45:00 - 50:00)

Okay. So remember, with future cost it was always: you take a first step, something happens, and then there's the rest of the problem. Here the first step is picking up some reward, and then there's the V, which specifies how much value I'm going to get afterwards. Hopefully this will become clear with an example. Let's consider our flaky tram example again, and suppose we just come up with an initial set of values that corresponds to doing nothing, essentially. For every state: if it's an end state, I get 0, because at the end state I'm done and I just get reward zero; and I'm going to put -100 (this is a little bit arbitrary) for every other state that is not the end. It's -100 because I haven't even figured out how to get to that state, or what to do from that state, so -100 is just a way of saying it's invalid. Okay. So let's warm up by computing the Q value for just one state. Let's choose state 9, which is right before the end state, and let's consider the policy of always taking the tram if possible. In state 9, that action is going to be walk, because I can't actually take the tram. So the successors are: one successor, going to state 10. And now I'm going to have this function compute_q_value, which takes the successors, the discount, and the values. What does this do? It looks at all the successors. Remember, think of this as: I'm at a state, I took an action, and here are the successors; each of these arrows is a successor, three successors on the board. For each one, I compute the utility if that successor were to be chosen, which is the reward that I get plus the discount times the value of the subsequent state: this reward plus gamma times this value. So I'm considering all the possibilities that can happen. Those are the utilities; I weight each utility by its probability, sum them all up, and that gives me some value, -1. Okay, let me also write this in math, just to make sure people are aligned. Q(s, a) is supposed to be the value of taking action a in state s and then seeing what happens. In order to figure out what happens, I have to sum over all possible successor states: the probability of that successor state happening, which is T(s, a, s'), times the immediate reward, Reward(s, a, s'), plus gamma times V(s'):

Q(s, a) = Σ_{s'} T(s, a, s') [Reward(s, a, s') + γ V(s')]

— Yeah. — Each of these s' are different s', right? — Yeah, each of these s' are different, and each of these rewards is different; I could write s'₁, s'₂ as well. — Yeah. — Can you just explain again how this is different from how we were evaluating before? — So the question is: how is this different from how we were evaluating before? What we were doing before was simply: you're at a state, you take an action, and I just pick one successor

### [50:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=3000s) Segment 11 (50:00 - 55:00)

based on those probabilities; just choose one. And now I'm looking at all of them. Yeah. And the reason you might hope that this does better is that if you choose just one, then you have sampling noise, whereas here I'm looking at all of them, and that gives me a more exact estimate. Okay. So, any questions about this equation on the board? I'm at a chance node; the policy has already committed to an action, and I'm trying to figure out how good this is: consider all possible futures, weighted by the probability of each outcome, take the reward of ending up in each of those successors, and then add this V, which is the value of whatever happens afterwards. Okay, so we're almost there for policy evaluation. One thing I need to do is compute this for all the states, instead of just state 9. And then the key step is that I need to repeat this: instead of V being this terminating set of values, I'm going to use these values to compute a new set of values, replace the old values with the new set, and continue iterating. Let me explain this a bit better. There's this idea called bootstrapping. After zero steps, which is the very beginning, these values denote essentially the value of just terminating, and that's why they look like this: if I terminated, then I get zero reward at the end state, and everything else is invalid. After one step, what I want is for the values to represent the value of following the policy for one step and then terminating; after two steps, the value of following the policy for two steps and then terminating; and so on. Okay. Before I run the algorithm, I want to talk about one other thing, which is convergence. If I'm running this algorithm, how do I know when to stop? In gradient descent there's also the problem of knowing when to stop, and I just sort of said: well, run it for 100 iterations or whatever. In MDPs you can actually do a little bit better, because the optimization is more exact here. Whenever you have an iterative algorithm that computes a set of values and then another set of values, you look at the maximum change between iterations. So you compare these values to those values: element-wise, you look at the distance, and you take the maximum distance between them. This is also called the L-infinity distance. It happens to be 2 here, and that corresponds to the fact that despite state 1 having a really good estimate, state 3 is still pretty far away. You want to make sure all the state values are close; that's why you take the maximum. And when the maximum distance is less than some tolerance, say 10⁻⁵, then we say we're done. This primitive will be useful for both policy evaluation and value iteration later. Okay. So I'm finally ready to introduce the policy evaluation algorithm. Hopefully you've seen all the moving parts.
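Two of those moving parts, the Q-value computation and the convergence check, might look like this in code (a sketch reusing the successor-tuple convention from the earlier sketches; the names are illustrative):

```python
def compute_q_value(successors, discount, values):
    # Q(s, a, V): for one chance node, weight each successor's
    # (reward + discounted value of the next state) by its probability.
    return sum(prob * (reward + discount * values[next_state])
               for _, prob, reward, next_state in successors)

def max_distance(values, new_values):
    # L-infinity distance between successive value estimates; we stop
    # iterating once this falls below a tolerance such as 1e-5.
    return max(abs(new_values[s] - values[s]) for s in values)
```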
I just need to assemble the pieces.

### [55:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=3300s) Segment 12 (55:00 - 60:00)

Okay. So let's take our flaky tram MDP, look at the tram-if-possible policy, and run policy evaluation. Let's jump in. First I initialize the values: to every state I assign an initial value, which is 0 if IsEnd and -100 if not. This array is just going to keep track of the distances between successive iterations, to see how much progress we're making. So now I start iterating. One turn of the loop is going to take whatever is in values, use it to compute new values, and replace values with the new values. Values, remember, is supposed to be the value of the policy at a given state. So I go through all the states. If a state is the end, then its value is always 0. If not, then I consider only the policy's action: apply the policy to the state to get an action (in this case it's tram here), look at all the successors of taking action tram in state 1, which gives me the two possibilities, and then call compute_q_value, which I presented before, and that gives me the new value of the state. So compute_q_value considers these two successors, weights them by the probabilities, and looks at the rewards plus the values of the resulting states. Okay, so I do this for all the states. After I'm done, I compute the distance; the distance is pretty high, so I'm not going to break out of the loop. And now the updated values look like this. Let's pause a moment and try to understand why the values look like this. Before, they looked like this; after one step, which, remember, corresponds to following the policy once and then terminating, they look like that. Notice that 9 has now updated to -1. What does that correspond to? That corresponds to walking, right? If you're in state 9, you can walk, get a reward of -1 (cost 1), and you end up in state 10, so you're picking up the 0 from before. And notice that 5 also got updated. Why is it -42? Any ideas? — There's a chance to just take the tram and go straight to 10. — Yeah. There's a 60% chance of taking the tram and going to 10; that's where the -2 comes from. And there's also a 40% chance that you end up in the same state, and remember, the value there was -100 before. So it's 0.4 × (-2 + -100) plus 0.6 × (-2 + 0), which gives you -42. Okay. So if I keep on iterating, you'll see that over time the values get updated. And notice that the ones over here are kind of interesting, because these are the states where you can only walk, so there's no stochasticity, no randomness. It's just: okay, I'm at, say, 6; I walk, and if I keep on walking I get a reward of -4 and end up in the end state. These other ones take a bit of time to converge, but eventually they should converge. You see the distance going down over time, which means the values are changing less and less.
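Putting those pieces together, the loop just walked through might look roughly like this (a sketch under the same assumptions as the earlier ones; `states` is passed in explicitly for simplicity):

```python
def policy_evaluation(mdp, policy, states, discount=1.0, tolerance=1e-5):
    # Bootstrapping: start from "terminate immediately" values, then
    # repeatedly apply V(s) <- Q(s, pi(s), V) until the values stop moving.
    values = {s: 0 if mdp.is_end(s) else -100 for s in states}
    while True:
        new_values = {}
        for s in states:
            if mdp.is_end(s):
                new_values[s] = 0
            else:
                action = policy(s)
                succ = [t for t in mdp.successors(s) if t[0] == action]
                new_values[s] = compute_q_value(succ, discount, values)
        if max_distance(values, new_values) < tolerance:
            return new_values
        values = new_values

# Usage might look like:
#   policy_evaluation(TramMDP(), tram_if_possible, states=range(1, 11))
```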

### [1:00:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=3600s) Segment 13 (60:00 - 65:00)

Eventually I get to about -12, and at some point I just exit: after 27 iterations, the distance has fallen below the 10⁻⁵ tolerance. Okay. And so the value of this policy is -12, which means that it costs 12 to go from the start to the end, on average. Remember the estimate I got before, which was -11.7 or something like that? So you see the estimate is somewhat in the ballpark, but it's not exactly right. Okay, maybe pause there. So this is policy evaluation. Let me review this one more time. Policy evaluation takes an MDP and a policy, starts with some initial values, and then in each iteration updates the values: I iterate over all the states, consider the action the policy takes, and do this Q-value computation. Okay, actually, let me write this down in math; I think sometimes it's easier to explain in math than in code. So, policy evaluation at a particular iteration t (the superscript is the iteration, the subscript is the policy) computes:

V_π^(t)(s) = Σ_{s'} T(s, π(s), s') [Reward(s, π(s), s') + γ V_π^(t-1)(s')]

What should I put in for the action a? I should put in π(s), because I'm looking at the value of a particular policy. This is what's being computed in the code here, and this quantity is exactly the Q value Q(s, π(s)), where, using the notation I had, I'm plugging in V_π^(t-1) for the value function. So, to make things explicit: if I want to see how good this policy is in state s, I look at the Q value where the action is the action under the policy, π(s); I plug that action in and compute this, which is equivalent to that expression. Any questions about the recurrence here? — Would convergence take at most a number of iterations equal to the number of steps to reach the end? — Yeah, I think the question is: how long does it take to converge?

### [1:05:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=3900s) Segment 14 (65:00 - 70:00)

Actually, this is a good segue into the next thing I'm going to talk about. If you look at the distance over the number of iterations, you see that the distance starts at 100, just because I've initialized the values to be -100, and it doesn't move for a long time; then it starts going down exponentially. So there are two phases here. In the first phase (and I think this is what you were getting at) you're essentially just trying to reach all the states: if you can't even reach the end state from a given state, then the value of that state is going to stay at essentially -100. Once you can reach everything, the second phase is refining all the values, and in that phase the distance decreases exponentially. So if you remember back here, if we rewind time: okay, so this is where it started, right? Look at the -100s. I don't know how to go from state 1 to state 10 in two steps; that's impossible. So this is still around -106, and then eventually I figure out how to get there, and things start decreasing exponentially from that point. Okay, let me go on now. We can also take the dice game and run policy evaluation; let's just see what it looks like. The code is the same. Remember, here there are only two states, in and end, and initially end gets a value of 0 and in gets -100. I'm just going to skip down and see what the values look like. And you see that "in" converges to 12. Remember, the estimate we had was around 12; it wasn't exactly 12, but policy evaluation gives you the exact answer in the limit. Indeed, the exact value V of staying satisfies V = (1/3)(4) + (2/3)(4 + V), which solves to V = 12. All right. So, to summarize: policy evaluation computes the value of a given policy, and remember, the value is defined as the expected utility over rollouts. But we don't have to do the rollouts; we can just use math and write down a recurrence. The key quantity in the recurrence is the Q value, which is the value of taking action a in state s and then following V in the subsequent state. And this idea of bootstrapping, where you start with a set of values, iterate a recurrence to produce a new set of values, take those, and keep iterating, is a common thing in reinforcement learning algorithms. Okay. So, we've only got 10 minutes left in the class, and I still haven't told you how to solve the MDP. You might wonder: okay, Percy's really bad at time management; he's going to run out of time. That might also be true, but value iteration is actually going to be quite similar to policy evaluation. We've actually done most of the work here, and I've argued that it's only a four-character change. Actually, okay, I lied; it's maybe a one-line change to policy evaluation to get value iteration. Policy evaluation computes the value of a given policy, and value iteration is going to compute the value of the optimal policy.

### [1:10:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=4200s) Segment 15 (70:00 - 75:00)

Okay. And then, in the process, it constructs the optimal policy as a side effect. You might think this is pretty hard, right? Think about a function f(x): computing it can be trivial, while finding the minimizer can be intractable. So how is it possible that, for this problem, evaluating a policy and finding the optimal policy are actually not that different? Value iteration goes back to Richard Bellman's work on dynamic programming in the 1950s. Let me just declare what the recurrence is. This is the recurrence for policy evaluation, which I have on the board, and this is the recurrence for value iteration. Notice that the only thing that's different is that now I have this max over a. Before, I had a fixed policy, so the actions I chose were just given by the policy; now no one's giving me a policy, so I'm just going to take the maximum, the best action in some sense. For every state, I take the best action. So all I have to do is implement this recurrence, and I'm done. Okay, so let's do that. For the flaky tram MDP, I get the initial values; this part is the same as before. The key operation is going to be computing the value of a given state. Let's do it for state 9, which is one step from the end state. Another way to write the optimal value is: V*(s) is just the maximum over actions a of Q(s, a, V*). So what I'm going to do is, for every action, consider how good that action is, and put those values in an array. For every action I compute the Q value of that action; then I take the max over all these Q values, and I take the argmax to get the action, and that's it. Just to write this on the board (this is value iteration):

V*(s) = max_{a ∈ Actions(s)} Q(s, a, V*)

This is the fixed-point version; if you're writing an iterative algorithm, then the V* on the right-hand side would be at iteration t - 1 and the one on the left at iteration t. So like I said, I've done most of the work here, because computing the Q values is where most of the action is, and then, for policy evaluation you just plug in the policy's action, and for value iteration you take the max over actions, and that gives you the right answer. Okay, so now let's run value iteration. Get the initial values. I'm going to have a pi that records what the best action is for each state. Then I loop over all the states and call this value-iteration-for-a-state function, and I check for convergence, to see whether the new values I've computed are close enough to the old values. If they are, I stop; otherwise I keep on going. And you can see the values; let me track the values that are computed. This is for the flaky tram problem. Now I'm computing the optimal policy.
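As a sketch, value iteration differs from the `policy_evaluation` sketch only in taking the max (and argmax) over actions instead of plugging in a fixed policy's action (same assumptions and conventions as before):

```python
def value_iteration(mdp, states, discount=1.0, tolerance=1e-5):
    values = {s: 0 if mdp.is_end(s) else -100 for s in states}
    pi = {}  # best action found so far for each state
    while True:
        new_values = {}
        for s in states:
            if mdp.is_end(s):
                new_values[s] = 0
                continue
            # Accumulate the Q value for each available action at this state.
            q = {}
            for action, prob, reward, next_state in mdp.successors(s):
                q.setdefault(action, 0.0)
                q[action] += prob * (reward + discount * values[next_state])
            pi[s] = max(q, key=q.get)   # argmax: record the best action
            new_values[s] = q[pi[s]]    # max over the Q values
        if max_distance(values, new_values) < tolerance:
            return new_values, pi
        values = new_values
```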

### [1:15:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=4500s) Segment 16 (75:00 - 80:00)

And over time, you'll see that the optimal policy converges to this. Okay, I guess I didn't update this, but you can see the values: the optimal policy's value is -7.3, which is better than the value of taking the tram if possible, which in turn is better than always walking. That's to be expected, because we can find a better policy using value iteration. And if you look at the convergence for value iteration, it looks very similar to policy evaluation: the distance stays high until it figures out how to even do anything, and then it decreases exponentially. Okay. So here is the actual policy which is optimal for the flaky tram problem: walk until you get to 5, and then take the tram, or try to take the tram. Intuitively this makes sense, because remember, the tram is flaky. If you try to take the tram early on, you're probably going to waste a lot of time, and it also takes two minutes each try. Whereas if you're at 5, the best thing to do is to risk the tram, because you get multiple tries, and there's a 0.6 probability of it succeeding each time, which saves a lot of time. So the math works out such that this is the optimal thing to do. In general this is not necessarily obvious for a given problem; that's why we have these algorithms that can just compute the values for us. Okay. Any questions about value iteration? — Yeah. So the resulting optimal policy is always deterministic? — So the question is whether the resulting policy is always deterministic, and the answer for MDPs is yes. The reason is that the value is just a max over a, and you can always attain the max at some single action. Now, it's possible that a stochastic policy could reach the same optimal value: if you have two actions with the same value, then you can take any combination of those, and that's fine. It turns out that for games, next week, that's not always the case, and randomness does help in games; but for MDPs you're fine with just choosing a deterministic policy. Okay, so to summarize: value iteration computes the value of the optimal policy. We take a max over actions, compared to policy evaluation, which just takes the policy's action. And it uses the same idea of bootstrapping: you start with an initial set of values, you iterate a recurrence to produce a new set of values, and you keep going until the distance between successive value computations is below some threshold; then you stop. That part is the same. Okay. To summarize this lecture: Markov decision processes generalize search problems, because actions can now result in a distribution over next states, as opposed to just a single next state. The solution to an MDP cannot be a sequence of actions; it has to be a policy that maps each state to an action. We defined the utility of a policy: if you take an MDP and a policy and perform a rollout, that produces a number called the utility, and if you take the expected, or average, value of those utilities, you get the value of that policy, which is a non-random number. And we saw two algorithms: policy evaluation, which computes the value of a given policy, and value iteration, which computes the value of the optimal policy. As we've stated before, both algorithms are very similar.
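Written side by side, the two recurrences differ only in where the action comes from:

```latex
% Policy evaluation: the action is fixed by the policy pi.
V_\pi^{(t)}(s) = \sum_{s'} T(s, \pi(s), s')
    \left[ \text{Reward}(s, \pi(s), s') + \gamma \, V_\pi^{(t-1)}(s') \right]

% Value iteration: take the best action at every state.
V^{*(t)}(s) = \max_{a \in \text{Actions}(s)} \sum_{s'} T(s, a, s')
    \left[ \text{Reward}(s, a, s') + \gamma \, V^{*(t-1)}(s') \right]
```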
So what I recommend is to really study these recurrences and understand how they work, as opposed to just memorizing the algorithms as-is. Okay, so that's all for this lecture. Next time we're going to do

### [1:20:00](https://www.youtube.com/watch?v=2ZtF1j3n6XE&t=4800s) Segment 17 (80:00 - 80:00)

reinforcement learning and talk about how we can find policies when we don't know the transitions, the rewards, or the successor states. Okay, see you next time.

---
*Source: https://ekstraktznaniy.ru/video/20914*