# Stanford CS221 | Autumn 2025 | Lecture 10: Games I

## Metadata

- **Channel:** Stanford Online
- **YouTube:** https://www.youtube.com/watch?v=SMOD_GiRzb8

## Contents

### [0:00](https://www.youtube.com/watch?v=SMOD_GiRzb8) Segment 1 (00:00 - 05:00)

So today's lecture is going to be about games. Remember, last time we talked about Markov decision processes and reinforcement learning, and the time before that we talked about search; games will complete our tour of these state-based models. So let's dive in with an example. We're going to play a game, called game one for lack of a better term. There are three bins here: A, B, and C. You are going to choose one of the three bins, then I choose a number from that bin, and your goal is to maximize the chosen number. Pretty straightforward. So let's do a poll. How many of you would choose A? Okay, a few. How many of you would choose B? Wait, I think some of you are voting twice, but that's okay; maybe a stochastic policy. And how many of you would choose C? Okay, most of you would choose C. So what should your strategy be? This is very ill-defined, on purpose. The whole point of games is that your strategy should depend on what I'm going to do. For example, if someone chose A, I might be lenient and just give you 50, or I could really try to screw you over and give you -50. So the challenge in games is that you don't know what the opponent is going to do. How do you guard against that? I'm actually surprised more people didn't choose B; I guess you don't think I'm as adversarial as I might be.

In MDPs, the agent is trying to maximize utility, and the environment is random and possibly unknown. If it's not known, then we do reinforcement learning and it becomes known, either implicitly or explicitly. In games, the agent is still trying to maximize utility, but there is an opponent in the environment, and the opponent's strategy is unknown. So here is the road map for today. We're going to talk about modeling for a while: what is a game? We're going to see how we play a game and what the notion of game value is. Then we're going to look at a bunch of different recurrences that help us understand optimal values and optimal policies. And the last part of the lecture will be about how we speed things up, either exactly or approximately.

So let's start with the notion of a game. The key object you should have in your head is a game tree. I'm going to use this example throughout the lecture, so let me draw it on the board. The root of the tree is the starting state of the game, and the edges coming out of a node are the particular actions that can be taken. In this case you can go A, B, or C, and then it's my turn and I choose a number: the leaves are -50 or 50 under A, 1 or 3 under B, and -5 or 15 under C. Every root-to-leaf path is a rollout, or trajectory, or outcome of the game. At each node a player has to make a decision: at some of the nodes the agent is making a decision, and at some the opponent is.

In this lecture, and in this class in general, we're going to focus on two-player zero-sum games. Two-player means the number of players is two — usually an agent and an opponent. Zero-sum means the utility of the agent is equal to the negative utility of the opponent, so if you add the two together you get zero; that's why it's called zero-sum.
So in particular, if the agent gets a utility of 1, the opponent gets a utility of -1, and everyone, we assume, is trying to maximize their own utility. The total utility is zero.
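In symbols — my formalization, consistent with the definition above — for any game outcome (end state) $s$:

$$
\text{Utility}_{\text{agent}}(s) + \text{Utility}_{\text{opp}}(s) = 0
$$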

### [5:00](https://www.youtube.com/watch?v=SMOD_GiRzb8&t=300s) Segment 2 (05:00 - 10:00)

Okay, so let's look at this example in more detail; we can define the game formally. A game formally has a start state, as before — here that's just going to be the string "root". At each state, I can call a function isEnd to see if the game is over, i.e., whether it's a leaf node. This part is new: at every state I also have a player function that tells me whose turn it is. That is in general a property of the state — the state knows whose turn it is — and it's either going to be "agent" or "opp" for opponent. And finally, each state gives me a set of successors, which I've written here as a mapping from an action to a state; taking action A takes me to state A.

Let's do one other node. From state A — just to be clear, A is this one; let me write it in for reference — the opponent is choosing between actions that I'll just label 1 and 2. So the player at state A is "opp", and the successors are: the opponent can choose action 1 to get to A1, or action 2 to get to A2. And if you're in A1, the game is over and the utility is -50. So I can call these functions on any state, and they tell me what the utilities and successors are.

So what's special about a game? One thing is that all the utility is at the end state. This is not a hard restriction: sometimes games give you points along the way, and if utility is the number of points, that's fine. But usually in a game what matters at the end is who wins, or whether it's a draw; even if you have a lot of points you might still lose the game, and the utility is trying to capture what you actually want. This is going to make things challenging: in the context of reinforcement learning, this is known as a sparse reward problem, because you don't really know what's going to happen until you get to the very end. You might seem to be doing really well, and then everything crumbles. The more important thing is that different players are in control at different states. In Markov decision processes we saw this a little bit: there were agent-control nodes, where the agent makes a decision, which go to chance nodes, where the environment transitions to a successor state. Now we're going to see other types of nodes, where the opponent controls the decision.

In terms of notation, a policy is still a mapping from states to actions. A deterministic policy pi_p(s) is the action that player p will take in state s. A stochastic policy, which in general we're going to assume, is pi_p(s, a): the probability that player p takes action a in state s. We'll see some examples of this.

Here's another example of a game, a very simple one called the halving game. I start with a number n, say 11, and the agent and opponent take turns choosing actions; an action either decrements n or replaces it with n halved, rounded down. The person who is left with zero wins. So what this might look like is: you start at the starting state where n is 11, and it's the agent's turn.
So here we're seeing that the state marks whose turn it is explicitly, as a field; the player here is obviously the agent.
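As a concrete sketch of this interface, here is how the halving game might be encoded in Python. The method names follow the lecture (startState, isEnd, utility, player, successors); the state encoding as a (player, n) pair is my assumption:

```python
class HalvingGame:
    """Sketch of the halving game: players alternate, each move either
    decrements n or halves it (rounding down); whoever is left with
    n == 0 wins."""
    def __init__(self, n):
        self.n = n

    def startState(self):
        return ('agent', self.n)          # the agent moves first

    def isEnd(self, state):
        player, n = state
        return n == 0

    def utility(self, state):
        # From the agent's perspective: +1 if the agent wins, -1 otherwise.
        # The player whose turn it is at n == 0 is the winner.
        player, n = state
        return +1 if player == 'agent' else -1

    def player(self, state):
        return state[0]

    def successors(self, state):
        player, n = state
        other = 'opp' if player == 'agent' else 'agent'
        # mapping from action to successor state, as in the lecture
        return {'decrement': (other, n - 1), 'halve': (other, n // 2)}
```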

### [10:00](https://www.youtube.com/watch?v=SMOD_GiRzb8&t=600s) Segment 3 (10:00 - 15:00)

The successors of this state are: I can decrement, which takes me to the state where n is now 10 and it's the opponent's turn, or I can halve, where I take 11, divide by two, and floor it, which gives me 5; and here the game hasn't ended yet. You can think about what the optimal strategy is — I don't think it's obvious, unless you can do the minimax recurrence in your head — but this is just an example we'll come back to.

All right, so now you know what a game is. A question: do you have to consider discounting in the utility? In this lecture we assume all the utility is at the end, which means there's just one number. Discounting makes sense when you're accumulating rewards over time and discounting the future, but here there's only one number, and that number is the utility.

So let's talk about game evaluation. Given a game, which we've defined, and a policy for each player, the value of the game is the expected utility over all possible rollouts. This should be very familiar from MDPs, where we defined the value of a policy at a particular state as essentially the expected utility. So let's look at a particular policy, still on game one. We're going to assume the agent always chooses A. Because in general I'm going to consider stochastic policies, the policy returns a dictionary mapping each action to the probability of that action; for a deterministic policy you just see {'A': 1}, meaning with probability one I choose A. The opponent, let's say, chooses randomly between 1 and 2, which is represented as {1: 0.5, 2: 0.5}. I then pack these into a policies dictionary that determines the policies of all the players.

Now let's play the game. Game play is given by the simulate function, which is very similar to what we've seen for search problems and for MDPs. You start with the start state, and while the game is not over, you see whose turn it is, get the policy of that player, pass in the state, and get a distribution over actions. You sample an action from that distribution, and once the action is chosen, the game dynamics take over: you look at the successors of that state — remember, a mapping from action to successor state — pass in the action, and that gives you a new state; and you keep track of all the steps. At the very end you get a utility, which corresponds to a particular rollout. In this case, that corresponds to choosing A, and then the random policy does a coin flip and happened to choose 2, which gave me a utility of 50. That is one rollout. If I do it multiple times and average, you'll see that sometimes I get -50 and sometimes 50, and the average of all that is 15. But 15 is not the actual value of the game, because there's sampling error: I'm only trying it 20 times. If you increase this number to, say, a million, what do you think the mean utility will turn out to be? It should be zero, right? Because half the time you're choosing -50 and half the time 50.
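A minimal sketch of that simulate loop, assuming the game interface above, with each policy given as a function from state to an {action: probability} dictionary:

```python
import random

def simulate(game, policies):
    """Roll out one game and return the utility of the rollout.
    policies maps 'agent' and 'opp' to a function state -> {action: prob}."""
    state = game.startState()
    while not game.isEnd(state):
        player = game.player(state)             # whose turn is it?
        dist = policies[player](state)          # distribution over actions
        actions = list(dist)
        action = random.choices(actions, weights=[dist[a] for a in actions])[0]
        state = game.successors(state)[action]  # game dynamics take over
    return game.utility(state)

# Hypothetical usage: both players of the halving game play uniformly at random.
game = HalvingGame(11)
uniform = lambda s: {a: 1 / len(game.successors(s)) for a in game.successors(s)}
print(sum(simulate(game, {'agent': uniform, 'opp': uniform}) for _ in range(20)) / 20)
```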
So, can we compute the game value exactly, without simulation? The answer is hopefully yes, because we can just define a recurrence to do this. I'm going to use this picture as a pictorial representation of what the recurrence is doing.

### [15:00](https://www.youtube.com/watch?v=SMOD_GiRzb8&t=900s) Segment 4 (15:00 - 20:00)

Schematically, it represents that at a particular state the agent moves, then the opponent moves, and so on; we're going to see variants of this picture. Given that, I can define the recurrence as follows. V_eval(s) is the game value at state s, with respect to the policies pi_agent and pi_opp. If I'm at an end state, I just return the utility of that state — that's the base case. If it's the agent's turn to move, I look at all possible actions the agent could take, the probability of the agent taking each action, and I recurse on the game value of the successor state. If it's the opponent's turn, I have the same sum over all actions: the probability that the opponent takes the action, times the game value of the successor state. The only difference is agent versus opponent. This should seem familiar from MDPs, where I also had a sum over actions and a sum over successor states weighted by probability; this is essentially a very similar calculation.

So let's see what this looks like in code. This is V_eval: if I'm at the end, I return the utility. Otherwise I see whose turn it is, get the player, look up the policy, and then loop over all the actions, where each action happens with some probability prob. I see where the game takes me if I take that action in this state, recurse on the game value of the successor state, and multiply by the probability; summing all these values gives the expected value. It's just the same thing, but in code.

One note: this computation is exact, but it can take exponential time. In particular, if at every node you have two actions and the game runs for 100 steps, it takes 2^100 time, which is not great. But that's what it is.

So let's look at this example now. The agent always chooses A, and the opponent is the random policy that chooses 1 or 2, each with probability one half. Conceptually, the recurrence computes a value for every node, from the bottom leaves to the top, which I've denoted in green here. For this node, the value is 0.5 * (-50) + 0.5 * 50 = 0. Here it's 0.5 * 1 + 0.5 * 3 = 2. And here it's 0.5 * (-5) + 0.5 * 15 = 5. Finally, the root chooses the action according to the agent's policy, which always chooses A, so it's 1 * 0 + 0 * 2 + 0 * 5, which gives me zero.

So if I hand you a policy for the agent and a policy for the opponent in a game, you should be able to compute, at the root node, the value of the game, which is the expected utility of the agent. And remember it's zero-sum, which means the expected utility of the opponent is just minus that.

Summary: the value is the expected utility of the game for the agent. We can simulate the game by rolling out and averaging the utilities, which provides a Monte Carlo estimate of the value. We can also compute the value exactly, but this could take exponential time. This is analogous to policy evaluation in MDPs.
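For reference, here is the evaluation recurrence described at the start of this segment, written out (my reconstruction from the verbal description):

$$
V_{\text{eval}}(s) =
\begin{cases}
\text{Utility}(s) & \text{if } \text{IsEnd}(s) \\
\sum_{a} \pi_{\text{agent}}(s, a)\, V_{\text{eval}}(\text{Succ}(s, a)) & \text{if } \text{Player}(s) = \text{agent} \\
\sum_{a} \pi_{\text{opp}}(s, a)\, V_{\text{eval}}(\text{Succ}(s, a)) & \text{if } \text{Player}(s) = \text{opp}
\end{cases}
$$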

### [20:00](https://www.youtube.com/watch?v=SMOD_GiRzb8&t=1200s) Segment 5 (20:00 - 25:00)

Remember policy evaluation: you're given the agent's policy, and you're also given the environment transition dynamics, which you can think of as the opponent if you want, and you're just computing the value. Now, you might wonder why you would ever do this, since it takes exponential time; for game evaluation you probably wouldn't — you would just sample a bunch of rollouts and average, and that converges relatively quickly. But we'll see in a moment that sampling won't always work.

Okay, so now we know how to evaluate a game given fixed policies. Let's try to actually compute optimal policies. This is going to be called expectimax. Instead of evaluating a fixed agent policy, we're going to find the optimal agent policy, but with respect to a fixed opponent. The schematic looks like this: I use a triangle to denote a max node, where the agent is in control, and a circle to denote a node where a fixed policy is being run. The agent goes first, then the opponent, then the agent, and so on.

The expectimax value of a state has a very similar recurrence, and the only thing that's changed is the part in red. But to go through everything: if I'm at an end state, I return the utility — the base case. If it's the opponent's turn, this is the same as before: I consider the probabilities of the opponent's policy taking various actions and use them to weight the values of the successors. But if it's the agent's turn, instead of taking a sum over actions, I take a max, because the agent wants to maximize expected utility: the agent should choose the action that maximizes the expectimax value of the successor states. That's it. And notice that for expectimax you can't just sample naively and get a Monte Carlo estimate, because of this max. When you have a sum, sampling gives you an unbiased estimate; with a max, it doesn't.

So let's see what this looks like for our example. I'm going to bring back the random policy, which chooses 1 or 2 with probability one half; that corresponds to the opponent policy down here. Intuitively, expectimax says: at the opponent nodes I do the same thing as before, just averaging, but at a max node I take the maximum value, which here is 5 — larger than 0 or 2. So some of you were doing expectimax: the people who said "I'm going to choose C" probably thought I was just going to do a coin flip, so 15 seems pretty good; -5, well, hope that doesn't happen. And you chose C.

In code, this is what expectimax looks like. Once you've seen a few of these recurrences they get somewhat repetitive, so hopefully the marginal cost of understanding a new one goes to zero. If the game has ended at the particular state, return the utility. Otherwise, see whose turn it is. If it's the agent's turn, choose the action that maximizes utility: look at the next states — here in this game, A, B, or C — recurse to figure out their expectimax values, and choose the maximum one. The opponent case is the same thing as before.
I've written it slightly differently, but it just takes an average with respect to the opponent policy.
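Written out, the expectimax recurrence described above is (the only change from $V_{\text{eval}}$ is the max in the agent case):

$$
V_{\text{exptmax}}(s) =
\begin{cases}
\text{Utility}(s) & \text{if } \text{IsEnd}(s) \\
\max_{a} V_{\text{exptmax}}(\text{Succ}(s, a)) & \text{if } \text{Player}(s) = \text{agent} \\
\sum_{a} \pi_{\text{opp}}(s, a)\, V_{\text{exptmax}}(\text{Succ}(s, a)) & \text{if } \text{Player}(s) = \text{opp}
\end{cases}
$$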

### [25:00](https://www.youtube.com/watch?v=SMOD_GiRzb8&t=1500s) Segment 6 (25:00 - 30:00)

Okay. So the optimal action is to choose C in this case. Expectimax finds the optimal agent policy with respect to a fixed opponent policy, and this is analogous to value iteration in MDPs: game evaluation was analogous to policy evaluation, and expectimax is analogous to value iteration.

So far I haven't actually done anything interesting; this is all essentially MDPs, just described in game language. But now I'm going to do something different, and this is where minimax comes in. Before, the opponent policy was known — but the whole point of games is that we don't know the opponent's policy. So what should we do? There is a principle we can follow, called minimax: assume the opponent is out to get you. The opponent will play their best possible strategy and minimize your utility.

Here is the minimax recurrence that describes this principle. As before, I have an upward triangle to denote where I take a max, and now a downward triangle for a min. First the agent moves — that's a max. The opponent moves — that's a min. The agent moves — a max, and so on. The recurrence is actually simpler than the other recurrences we've seen. The minimax value of a state is: if I'm at the end, the utility; if it's the agent's turn, the max over actions of the minimax value of the successor states; and if it's the opponent's turn, the min over actions. That's why it's called minimax: I have a max and I have a min.

This is what it looks like for game one. The opponent nodes are now min nodes, so I take the min over -50 and 50 and get -50; take the min and get 1; then -5. The agent then takes the max over these min values, and the max is 1. So for those of you who initially chose B, you were probably running minimax in your head. And those of you who chose A — well, I don't have an explanation for that.

All right, this is what it looks like in code. Minimax computes the minimax value of a state: if I'm at an end state, I return the utility. Here I'm going to take a little liberty and also return the associated action, not just the value, because I'll want to actually know what to do. I see whose turn it is, recurse on all possible next states, and get a set of values — here that's -50, 1, and -5. If I'm the agent, I take the max over these values; if I'm the opponent, I take the min. In this case that gives me action B, with value 1.

So the minimax value of a state is the expected utility for the agent if the agent plays optimally, maximizing utility, and the opponent plays optimally, minimizing utility. Now let's see what happens with the halving game. First, I'm going to turn this minimax recurrence into a policy, and the way I do that is fairly simple. A policy, remember, takes a state and returns a distribution over actions. So the policy takes the state, computes the minimax value of that state and its associated action, and just returns that action.
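A sketch of the minimax recurrence in code, returning both the value and the associated action as the lecture describes, plus the policy wrapper (the tuple-based tie-breaking is my simplification):

```python
def minimax(game, state):
    """Return (minimax value, best action) for a state; action is None at leaves."""
    if game.isEnd(state):
        return game.utility(state), None
    # (value, action) pairs for all successor states
    choices = [(minimax(game, succ)[0], action)
               for action, succ in game.successors(state).items()]
    if game.player(state) == 'agent':
        return max(choices)   # agent maximizes the value
    else:
        return min(choices)   # opponent minimizes the value

def minimax_policy(game):
    """Turn the recurrence into a policy: state -> {action: prob}."""
    def policy(state):
        value, action = minimax(game, state)
        return {action: 1.0}  # deterministic: the minimax action w.p. 1
    return policy
```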

### [30:00](https://www.youtube.com/watch?v=SMOD_GiRzb8&t=1800s) Segment 7 (30:00 - 35:00)

Okay. And let's say the opponent plays the random policy, which chooses decrement or halve, each with probability one half. Now when I simulate, I get something like this: the agent decrements, and now we're at 10. The opponent decides to halve, and we're at 5. The agent decrements; the opponent decrements; the agent halves; the opponent decrements; we reach zero, and the agent wins. This shouldn't be surprising, because the agent should win if it's possible to win. Let's simulate 10 times and see what happens: the agent crushes the random policy — it wins every time.

A few notes. One is that when you're actually using the minimax policy, you are computing this recurrence at every state. And just for fun, I can show you the minimax value for each of the first 11 states. The minimax value for 11 is 1, which means the agent will win if it plays optimally, even if the opponent plays optimally, and the best action is decrement. What does a minimax value of -1 mean? The opponent is guaranteed to win if the opponent plays optimally — but that doesn't mean all hope is lost, because against a weaker opponent you might still win. On these states the agent is going to win; at 6 the opponent, if it plays optimally, is going to win; and so on. So what can I say here? If the value is 1, the agent is guaranteed to win no matter what the opponent does — we'll make this formal later. If the value is -1, the opponent is guaranteed to win if they play optimally.

A few more notes, more of an aside. When the agent and opponent both play optimally, this is known as perfect play. You might have heard people say that checkers is "solved"; what that means is that the outcome of the game under perfect play is known — we know the minimax value of the root. That's what solving a game means. Some games are strongly solved, meaning we know the minimax value for every single state: tic-tac-toe, Nim, Connect Four. Weakly solved means that we know the minimax value just from the initial position. And then there are unsolved games, like chess and Go. There's an important distinction here: people think chess and Go are quote-unquote solved because we have computers that can beat humans, but we don't know the minimax value of the root, and we don't have the optimal policy — we just have policies that are better than humans.

To summarize: in minimax, the agent maximizes utility and the opponent minimizes it. Unlike expectimax, there's no fixed policy given — expectimax is only optimal with respect to a fixed opponent policy.

### [35:00](https://www.youtube.com/watch?v=SMOD_GiRzb8&t=2100s) Segment 8 (35:00 - 40:00)

Minimax, as we'll see, is agnostic to any sort of prior over what the opponent is doing, and this is the part that is unique to games. There's no analogy in MDPs, because in MDPs we only max and average — we don't min.

This next part is going to take stock of all these policies we can compute and draw some relationships between them. It's a little more involved and requires a bit more thinking, but hopefully it gives you a lot more insight into the relationship between expectimax, game values, and minimax.

So far, we've computed recurrences, and the recurrences give us policies. The minimax value V_minmax gives us pi_max, the minimax agent policy, and pi_min, the minimax opponent policy: in minimax we have a min and a max, and taking the arg max and arg min gives pi_max and pi_min. We can also do expectimax, which gives us an expectimax policy. I'm using a parenthesized subscript 7 here: pi_7 is an arbitrary fixed policy that the opponent is playing, and expectimax(7) means expectimax computed with respect to pi_7.

You have to be very careful when you talk about an optimal policy: you always have to ask, optimal with respect to what? Each optimal policy bakes in assumptions about how the other player is playing. In minimax, pi_max is optimal against pi_min, and pi_min is optimal against pi_max, because the min and max are symmetric. And pi_exptmax(7) is optimal against whatever we computed expectimax with respect to, which is pi_7.

So what happens when we play these policies against each other? I'm going to use the notation V(pi_agent, pi_opp) for the value of the game when pi_agent plays pi_opp. There are two agent policies, pi_max and pi_exptmax(7), and two opponent policies, pi_min and pi_7, and each cell of this table is the value when the row policy plays the column policy. The question is whether there is any relationship between these values; let me draw this on the board and mark things off as we go. We have pi_max and pi_exptmax(7) against pi_min and pi_7, and I'm trying to find the relationship between these four values.

So what can we say? There are three properties. The first property: pi_max is the best policy against pi_min. In particular, this says that V(pi_max, pi_min) is at least as great as if I had substituted any other agent policy: if I substitute some pi_agent that does something else, it can only decrease the value.
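In symbols, property 1 is (my formalization of the statement above):

$$
V(\pi_{\text{agent}}, \pi_{\min}) \le V(\pi_{\max}, \pi_{\min}) \quad \text{for all } \pi_{\text{agent}}
$$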

### [40:00](https://www.youtube.com/watch?v=SMOD_GiRzb8&t=2400s) Segment 9 (40:00 - 45:00)

Since this holds for all agent policies, and pi_exptmax(7) is just another policy, we can conclude that V(pi_max, pi_min) >= V(pi_exptmax(7), pi_min): this value is at least that one.

Second property: pi_min is the best policy against pi_max. That means that if I'm the opponent, I shouldn't choose anything other than pi_min if I want to minimize the agent's utility. Another way to interpret this is that it provides a lower bound against any other opponent: if I'm playing pi_max, facing pi_min is the worst I can do, and if the opponent does anything else, I can only do better. Here, I'm guaranteed a 1: if the opponent does something different, that isn't going to decrease my 1, and it might make it go up — if the opponent picks the 3 under B, I get a 3. Applying this property: if I change the opponent's policy to something else, the value can only increase, so V(pi_max, pi_min) <= V(pi_max, pi_7). Hopefully this makes sense: starting from the minimax value, if I make the agent worse, the value gets smaller; if I make the opponent worse, it gets bigger.

One important corollary: if the minimax value is 1, the agent is guaranteed to win no matter what the opponent does. I claimed this earlier, but now you can see formally why it's true: whatever the opponent does, the value is at least 1, and in these games the maximum utility is 1, so it has to be exactly 1. That's why you can say with confidence that if you're playing the minimax policy and the computed value is 1, you're guaranteed to win.

Property three: remember, pi_exptmax(7) is optimal against pi_7. One consequence is that the minimax policy is not optimal when the opponent is known. Expectimax is optimal against pi_7, which means that if I change the agent policy to anything else — in particular to pi_max, which is just another policy — the value can only go down. You can see this here: expectimax was choosing C, and if I instead play the minimax policy, I choose B, and 2 is worse than 5. This is because the opponent is pi_7, the random policy. So V(pi_max, pi_7) <= V(pi_exptmax(7), pi_7).

An important realization, then, is that you can actually do better than minimax if you know your opponent. If you know you're playing a weak opponent, you can do all sorts of things you wouldn't dare do against a really strong one.
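Chaining the three properties together gives the ordering that the lecture assembles on the board next (my compact rendering):

$$
V(\pi_{\text{exptmax}(7)}, \pi_{\min}) \;\le\; V(\pi_{\max}, \pi_{\min}) \;\le\; V(\pi_{\max}, \pi_{7}) \;\le\; V(\pi_{\text{exptmax}(7)}, \pi_{7})
$$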

### [45:00](https://www.youtube.com/watch?v=SMOD_GiRzb8&t=2700s) Segment 10 (45:00 - 50:00)

Putting everything together, this is what we have on the board. The minimax value is 1. If you play pi_max against pi_7, that's 2. If you play expectimax, which chooses branch C, against pi_min, you get -5. And if you play expectimax against pi_7, you get 5. And it checks out: -5 is less than 1, which is less than 2, which is less than 5. I think it's really cool how all these values are related.

So let's summarize. When you think about optimality, you should be very careful about what you're optimal with respect to. Minimax admits two interpretations. One: it is the optimal policy against the worst possible opponent, pi_min. Two: it is a lower bound against any unknown opponent. The way I've introduced games, you don't know the opponent's policy, and I generally like the second interpretation better, because it's perhaps rare that you'll face the absolute worst opponent — but you still want minimax when you want guarantees against anything that could happen.

Now I'm going to talk about expectiminimax. This is in some sense dessert: it's not that deep, but it gives you another example of what you can do with recurrences. Suppose you have this game: the three bins A, B, C, and you choose one of them, but now I flip a coin. If it's heads, I move one bin to the left, wrapping around, and choose a number from that bin; if it's tails, I stick with the same bin. Your goal is still to maximize the chosen number.

To model this game, we just draw the game tree. I'm going to introduce a quote-unquote policy called "coin", which chooses tails with probability one half and heads with probability one half. What's happening here is: suppose you choose A. This is a fixed policy — that's why it's a circle — choosing either heads or tails. If it's heads, I move one to the left and end up in C, which is why I have -5 and 15 there; if it's tails, I stay in the same bin and have A's numbers here. If I choose B, I either stay or move to the left, and so on.

By now, if I gave you this tree, you should just be able to compute the value at the top. This is known as expectiminimax, because there are expectations, there are mins, and there are maxes — you have everything. It's pretty self-explanatory: at the min nodes you min, at the chance nodes you take the average, and at the max nodes you max. Here's the recurrence: there are three recursive cases now. If it's the agent's turn, take the max; if it's the opponent's turn, take the min; and if it's the coin's turn, compute the weighted average with respect to pi_coin.

In general you might imagine all sorts of different recurrences; hopefully this unlocks your imagination. You might imagine more than two players: in the Pac-Man assignment, you'll have Pac-Man and multiple ghosts, and you'll have to reason about the multiple ghosts. You can also have multiple agents in the setting. In some sense it can all be handled by this general recursion framework. You can even do things like players taking extra turns, choosing who goes next, or losing a turn.
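For reference, the expectiminimax recurrence just described, written out (abbreviating $V$ for $V_{\text{exptminmax}}$):

$$
V(s) =
\begin{cases}
\text{Utility}(s) & \text{if } \text{IsEnd}(s) \\
\max_{a} V(\text{Succ}(s, a)) & \text{if } \text{Player}(s) = \text{agent} \\
\min_{a} V(\text{Succ}(s, a)) & \text{if } \text{Player}(s) = \text{opp} \\
\sum_{a} \pi_{\text{coin}}(s, a)\, V(\text{Succ}(s, a)) & \text{if } \text{Player}(s) = \text{coin}
\end{cases}
$$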

### [50:00](https://www.youtube.com/watch?v=SMOD_GiRzb8&t=3000s) Segment 11 (50:00 - 55:00)

All of these variations work because all we're doing is taking a state and transitioning to some successor state; whose turn it is is baked into the state, and you can transition to any state you want. So if you brainstorm a game with some crazy rules, you can just define your recurrence like we've done before, and you're off to the races.

There are a few things that aren't covered by this game-tree recurrence framework. One is games with imperfect information, such as poker, or most card games, I think. The reason is that in these game trees everything passes through the state, and the state has to carry all the information — this is the distinctive property of MDPs, search problems, and these games. If players have different information, you can't really represent that in a tree, because as you pass from a min node to a max node, there's no way to bypass the state, if that makes sense. We're also considering only zero-sum games, but there are also non-zero-sum games, such as the prisoner's dilemma, and collaborative games, and so on. And there are non-turn-based games: here someone takes a turn, then someone else takes a turn, but some games, like rock-paper-scissors, are simultaneous. If you played turn-based rock-paper-scissors, it would be pretty boring.

Okay, so that concludes the modeling section of the lecture. Now I'm going to focus on minimax and try to speed it up. Let's talk about alpha-beta pruning. Revisit the basic minimax formulation: no dice, no weird stuff, just a classic minimax game with max nodes and min nodes — you max at the max nodes and min at the min nodes. This takes exponential time in general: exponential in the depth, that is. It's linear in the number of nodes, but the number of nodes can be exponential.

So how can you speed this up? The key behind this very classic technique from games, alpha-beta pruning, is that you don't have to visit all the states: some of them are simply not worth looking at. The question is how we know something is suboptimal if we don't even look at it. It may seem a little magical that you can just prune out parts of your tree — maybe the optimal action is sitting there, who knows? To see why this is possible, consider the following case. Suppose you have two possible actions, A and B. I tell you A gives you a utility somewhere between 3 and 5 — I don't know exactly what — and B delivers a value somewhere between 5 and 100. Which one would you choose? How many of you choose A? B? Okay, why B? It turns out you don't need to know the exact utilities of A and B: no matter what happens, B will be at least as good.
There's an idea here called branch and bound, a general principle used in combinatorial optimization: maintain intervals that keep track of where particular values might lie, and prune anything whose interval is strictly dominated by some other option's interval.
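Before the worked example, here is a minimal sketch of alpha-beta pruning on a toy tree representation (a leaf is a number, an internal node is a list of children; the representation is mine, not the lecture's):

```python
import math

def alphabeta(node, is_max, alpha=-math.inf, beta=math.inf):
    """Minimax with alpha-beta pruning on a toy tree.

    alpha = best value the max player can guarantee so far (lower bound),
    beta  = best value the min player can guarantee so far (upper bound).
    """
    if not isinstance(node, list):   # leaf: just return the utility
        return node
    if is_max:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:        # a min ancestor would never allow this
                break                # prune the remaining children
        return value
    else:
        value = math.inf
        for child in node:
            value = min(value, alphabeta(child, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:        # a max ancestor already has something better
                break
        return value

# The example from the lecture: a max node over two min nodes.
# Exploring [3, 5] gives 3; at the second min node, the 2 already caps it
# below 3, so the remaining (arbitrarily large) leaf is pruned.
tree = [[3, 5], [2, 10**6]]
print(alphabeta(tree, is_max=True))  # -> 3
```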

### [55:00](https://www.youtube.com/watch?v=SMOD_GiRzb8&t=3300s) Segment 12 (55:00 - 60:00)

So here is an example. Consider the following game tree, and suppose I'm doing minimax search with a depth-first traversal. I come down here: I have a 3 and a 5, take the min of 3 and 5, and get 3. I go up here, and at this point my best so far is 3. Then I come over here, come down, and find a 2. So what can I do at this point? I claim that I can prune this off and don't need to explore anything else over here. The reason is that the root is the max of 3 and whatever is over here, and whatever is over here is the min of 2 and some x, so the root is going to be 3 no matter what x is. x could be a million: this min is still at most 2, and the root is still 3. x could be minus a million: still 3. So in this particular situation, hopefully you can convince yourself you don't need to look over here, no matter what this value is.

Let me generalize that principle a bit. In general, when we're doing minimax search, I'm at some node, and intuitively every node carries an interval, or bounds: the max nodes have lower bounds, and the min nodes have upper bounds. This is because if a max node has discovered a 6, its value can only increase, and if a min node has discovered an 8, its value can only decrease. So max nodes have lower bounds and min nodes have upper bounds, all the way down, for every node in the tree. Now think about where the minimax value comes from: it must come from some leaf, because I'm just taking maxes and mins of successor values, so some leaf gives rise to that value. I'll call the path taken by the min and max policies to that leaf the optimal path — that's the thing I'm trying to find. And this path has to be consistent with the lower and upper bounds I have. Which means: if a node's bounds don't overlap with the bounds of every one of its ancestors, I can prune it. For example, if the ancestors say "at least 6", "at most 8", "at least 3", and this node is somehow "at most 2", then I can prune it, because an ancestor is at least 3, so there's no way this node can be on the optimal path. And here's a question: can you prune a node that says "at least 6" when an ancestor says "at most 5"? Yes, because those intervals don't overlap — and you have to check against the entire ancestor path.

So let's do an example on the board. I have a max node at the root with three min nodes below it: the first min node has leaves 9 and 7, the second starts with a 6, and the third has an 8 and a 3 (notice the tree doesn't have to have the same number of levels everywhere). Suppose we're doing alpha-beta pruning. Initially we don't know anything about the intervals. We go down the first branch, come down here, and discover the 9.

### [1:00:00](https://www.youtube.com/watch?v=SMOD_GiRzb8&t=3600s) Segment 13 (60:00 - 65:00)

At this point I come up here and say: this min node is at most 9. Then I come over and see a 7, so I update it — actually it's just equal to 7, because I've explored everything under it. Now I come back up to the root; I'll just write it down here: the root is at least 7. (I forgot to say this earlier, so let me write it down: for the max nodes I have alpha values, which are essentially lower bounds on the value, and for the min nodes I have beta values, which are upper bounds.) So the root is >= 7. I come down the second branch, find a 6, and come up: this min node is at most 6. What does that mean — can I prune something? We should be able to, right? In this case I can just stop and prune out everything else here, because the root is already at least 7 and this node is at most 6, so no matter what happens over here, it can't change the root's value. Then I come over to the third branch, with the root still >= 7, and I find the 8, so this min node is now at most 8. Can I prune this? I can't, because these intervals still overlap — I don't have enough information, so I have to explore. I find the 3, so this min node is 3, and finally I update the root's value to 7, and I'm done. Notice that alpha-beta can be pretty powerful: it's not just pruning one node; you prune an entire subtree if you're lucky.

Question: does this work for expectimax? The short answer is no, because with an expectation you have to consider all the possible things that could happen. You might still be able to do some pruning — for instance, in expectimax, if you have bounds on the utility, you can still bound how much the expectation could move; I'd have to think about it exactly, but that's a good question. Another question: how do I get the lower bound of 7 at the root? Because when I saw that the value of that first subtree is 7 — whenever I'm at a node, I go through each of the children and update the max. It's a top-down recurrence, but you go to the bottom and then come back up.

Okay, so how do you get lucky? Let's talk about ordering. The order in which you visit the successors of a state matters for how much you prune. For example, in the case we saw before: if the leaves come in the order [3, 5] and then [2, 10], the 10 gets pruned. However, if we have [2, 10] and then [3, 5], then when you compute the first min node you get a 2, so the root is only >= 2; then you go over to the second min node, see a 3, and you won't be able to prune the 5.
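The ordering effect is easy to check with the alphabeta sketch from above, extended with a leaf counter (again a toy sketch, with assumed names):

```python
import math

def alphabeta_count(node, is_max, alpha=-math.inf, beta=math.inf, count=None):
    """Alpha-beta as before, but counting visited leaves so the effect
    of child ordering is visible."""
    if count is None:
        count = [0]
    if not isinstance(node, list):
        count[0] += 1                     # visiting a leaf
        return node, count[0]
    value = -math.inf if is_max else math.inf
    for child in node:
        v, _ = alphabeta_count(child, not is_max, alpha, beta, count)
        if is_max:
            value = max(value, v)
            alpha = max(alpha, value)
        else:
            value = min(value, v)
            beta = min(beta, value)
        if alpha >= beta:                 # bounds crossed: prune the rest
            break
    return value, count[0]

# Good ordering: the 10 is pruned, only 3 of the 4 leaves are visited.
print(alphabeta_count([[3, 5], [2, 10]], is_max=True))   # -> (3, 3)
# Bad ordering: nothing can be pruned, all 4 leaves are visited.
print(alphabeta_count([[2, 10], [3, 5]], is_max=True))   # -> (3, 4)
```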

### [1:05:00](https://www.youtube.com/watch?v=SMOD_GiRzb8&t=3900s) Segment 14 (65:00 - 70:00)

So in practice, at a max node you want to visit first the children that are likely to be large, because the larger they are, the tighter the bound, and the more you can prune. If I had been able to visit the 3-subtree first rather than the 2, I could have pruned, but not the other way around. In practice you can use a heuristic evaluation function — which we'll talk about shortly — to order the children: for max nodes, order the children by decreasing value, and for min nodes, by increasing value. In other words, you're really trying to guess the optimal path with the heuristic, and if you guess well, you can be faster. This is a general pattern we've seen several times in this course: you can have exact algorithms that still use a heuristic. If the heuristic is good — like in A* — you're very fast; if it's bad, you're not so fast, but you're still correct.

So, to summarize alpha-beta pruning: it speeds up minimax search, it's exact, and how much you save depends on the ordering and on your problem; in general it's still exponential time — it's not like we've proved P = NP. The intuition is that you want to order your actions to shrink the bounds as fast as possible.

The final thing for today is evaluation functions, which are an approximate way to speed up search. Here's the minimax recurrence again. The idea of an evaluation function is to say: forget search, I'm just going to take a guess at how good this state is. It's true that in a game the only thing that matters is who wins at the end, but nonetheless — if you're playing chess, you can often tell in some positions who's doing well just by looking at the board. The evaluation function tries to capture this prior knowledge about what might be good.

In the case of chess, here's an example of the type of evaluation function you might have. In general, an evaluation function is a sum of terms, each capturing some intuitive property — think of each as a feature if you like. You want a lot of pieces (material); you want mobility, the ability to move; you want your king to be safe; you want to control the center of the board; and so on. For material, you might count the number of kings you have versus your opponent — kings are very important — with queens worth 9, rooks worth 5, and so on.

So what do you do with an evaluation function? You could just choose the action that produces the state with the highest evaluation. Whether that works depends on the evaluation function, but with something like the one above it's probably not going to be a very good chess player: it'll likely be too greedy, trying to save its pieces, and not do very well. But you can also use the evaluation function inside search, and the way to do this is called depth-limited search. It basically says: take minimax, add a depth parameter — an integer, say 5 — and only search five steps. If the depth is zero, meaning I've exhausted my search budget, I just call the evaluation function: okay, I'm not going to search anymore.
I'm just going to vibe-check the state and give you a number. And if I still have search budget — the depth is not zero — then when I recurse: if I'm the agent, the depth stays d, and if I'm the opponent, the depth becomes d - 1. This is just to make sure that one unit of depth corresponds to both the agent and the opponent moving, so I don't subtract twice.
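A sketch of depth-limited minimax with an evaluation function, using the game interface from earlier (evaluate is a user-supplied heuristic; all names are illustrative):

```python
def minimax_limited(game, state, depth, evaluate):
    """Depth-limited minimax: search `depth` rounds deep, then fall back
    on evaluate(state) as a guess of how good the state is."""
    if game.isEnd(state):
        return game.utility(state)
    if depth == 0:
        return evaluate(state)            # out of budget: just estimate
    values = []
    for action, succ in game.successors(state).items():
        # decrement depth only after the opponent moves, so one unit of
        # depth corresponds to a full agent-plus-opponent round
        d = depth if game.player(state) == 'agent' else depth - 1
        values.append(minimax_limited(game, succ, d, evaluate))
    if game.player(state) == 'agent':
        return max(values)                # agent maximizes
    return min(values)                    # opponent minimizes
```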

### [1:10:00](https://www.youtube.com/watch?v=SMOD_GiRzb8&t=4200s) Segment 15 (70:00 - 73:00)

Okay, so that's basically it. This is something like the sixth recurrence you've seen, so hopefully you're used to them by now. There are no guarantees that you'll find the optimal policy — it's completely heuristic, and how effective it is depends on how good your evaluation function is. But the nice thing is you have a knob: the deeper you search, the less sensitive you are to the evaluation function being bad, whereas with a shallow search your evaluation function had better be good. So: an evaluation function is prior knowledge that evaluates a state quickly; you search to a certain depth and then apply the evaluation function. This is depth-limited search, and the analogy is a bit like the future cost in search problems, where you look a step ahead and then approximate.

To wrap up — actually, one other note about evaluation functions, to prime you for next time. What do evaluation functions remind you of in reinforcement learning? Q-values, or more simply, values. Q-values in reinforcement learning — let's go with those, since that's what we studied — are a sort of approximation of the true value. What this lets us think about is that the evaluation function can also be learned, from supervised data or by reinforcement learning, and that's exactly what we'll do next lecture: learn the evaluation function.

Now to summarize. We've introduced games: the agent is still trying to maximize utility, but the opponent's strategy is unknown. The minimax principle says to assume the worst case — the opponent is out to get you and to minimize your utility. We defined a bunch of different recurrences: given a game and policies, game evaluation; expectimax; minimax; expectiminimax; and you can imagine others — expecti-mini-mini-expectimax or whatever you want. And we looked at two ways to speed things up: alpha-beta pruning, which is exact, and evaluation functions, which are approximate. Next time we're going to look at how you can do learning in the context of games, with something called TD learning.

---
*Source: https://ekstraktznaniy.ru/video/20911*