Stanford CS221 | Autumn 2025 | Lecture 11: Games II


Segment 1 (00:00 - 05:00)

So, welcome back everyone. Last lecture we talked about two-player zero-sum games, so we're going to review that a little bit, and this time we're going to start bringing together reinforcement learning and games. And then we're going to actually play a few games in this class, so hopefully that'll be fun.

Just as a review, last time we talked about the minimax principle, and this is the way that we solve games. Remember that in a game, different players are in control at different states, which means you can't just max over everything or take expectations; you actually have to alternate between max and min. This was the game tree we looked at last time, where at the root node the agent is trying to maximize its utility, and at the next level the opponent is trying to minimize the utility. So at the min nodes you take the minimum over the outcomes of each of the subtrees, and then at the max node you take the max. The minimax value of this game is 1, which corresponds to the agent choosing action B here and then the opponent choosing action 1.

We talked about ways of speeding up minimax search. If you want to keep it exact, you can use alpha-beta pruning. In alpha-beta pruning, remember, you keep track of lower and upper bounds on the possible values of the states along your ancestral path. For example, after you've explored this part of the tree you know this node is worth at least 1, and when you come over here you know this one is worth at most -5. Those intervals don't overlap, so you don't even need to look at the 15.

We also looked at evaluation functions. This is what you use if you want to bring in domain knowledge and you don't mind being approximate: you define an evaluation function that assesses whether a state of the game is good for the agent or not. We looked at one for chess which combines a variety of factors related to how many pieces you have, how much mobility and control of the center you have, and so forth. So this was very much about manual heuristics.

Now the question I want to pose is: can we learn the evaluation function? Because if we can learn it, then maybe we can do better than a hand-coded system. Indeed, that's what we're going to do today: we're going to use reinforcement learning, in particular TD learning, to learn the evaluation function. So bring back your memories of reinforcement learning, and let's dive in.

Just as a refresher, in value-based reinforcement learning there are two quantities of interest. There is V^pi(s), the value, which is the expected utility you would obtain if you followed policy pi from state s. So in state s you follow some fixed policy, and this is how much value you would pick up. And there are also Q-values, Q^pi(s, a), which have an extra argument, the action: this is the expected utility when you first take action a in state s and then continue to follow policy pi. These are the two quantities that show up over and over again. We saw them first for policy evaluation in MDPs, then we saw them in SARSA, and now they're going to come back again. So remember the SARSA algorithm from reinforcement learning.
SARSA was an on-policy algorithm, which means it estimates the Q-values of the current policy you're following, as opposed to the optimal policy, which might not be the one you're following. And the main insight in SARSA was bootstrapping: you update the Q-values towards a target, and the target, rather than being just the empirical utility of a full rollout, is an immediate reward plus an estimated future reward. Okay, so here's the SARSA algorithm.
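(A rough sketch, in Python, of the update that is about to be walked through; the tabular dictionary representation, the step-size name eta, and the epsilon-greedy helper are illustrative assumptions, not the exact code shown in lecture.)

```python
import random
from collections import defaultdict

class SARSA:
    """Tabular SARSA: on-policy estimate of Q^pi for the policy being followed."""

    def __init__(self, actions, discount=1.0, eta=0.1, epsilon=0.2):
        self.Q = defaultdict(float)      # Q[(s, a)] -> estimated value
        self.actions = actions           # callable: s -> list of legal actions
        self.discount = discount
        self.eta = eta                   # step size
        self.epsilon = epsilon           # exploration probability

    def getAction(self, s):
        # Epsilon-greedy with respect to the current Q estimates.
        if random.random() < self.epsilon:
            return random.choice(self.actions(s))
        return max(self.actions(s), key=lambda a: self.Q[(s, a)])

    def incorporateFeedback(self, s, a, r, s_prime, a_prime):
        # Bootstrapped target: immediate reward + estimated value of (s', a').
        target = r + self.discount * self.Q[(s_prime, a_prime)]
        prediction = self.Q[(s, a)]
        # Nudge the prediction toward the target by the step size eta.
        self.Q[(s, a)] += self.eta * (target - prediction)
```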

Segment 2 (05:00 - 10:00)

Every time, as you run the reinforcement learning algorithm, you see a state s, you took an action a, you got a reward r, you ended up in state s', and then you took another action a'. Then you perform the following update. First you look at the immediate reward you just got, and then you form an estimate of how much reward you would get after that: after this reward you ended up in state s' and you're now taking action a', so Q(s', a') packages up your belief about how much reward you're going to get from there on. And then what you do is nudge Q^pi(s, a), your current Q-value estimate, in that direction by some step size eta. And you keep doing this over and over again.

In code it looks something like this. I instantiate an RL algorithm, SARSA. Remember, the RL algorithm implements two functions. getAction: I'm in state 1 and I get an action; then the environment does something with that action; and then incorporateFeedback tells the agent what happened: in state 1 I walked, I got a reward of -1, I ended up in state 2, and I haven't finished. And so on: getAction, incorporateFeedback, getAction, incorporateFeedback.

One thing that's interesting, which I want to emphasize, is the strong connection between Q-values and policies. SARSA computes Q^pi, the Q-value of the particular policy pi. But given this Q-value I can read out a policy, and the policy I read out is simply: take the best action, the argmax over a of Q^pi(s, a), and that gives me a particular action to follow. You can think of this as one step of policy improvement: here is the value of this particular policy, and the new policy just myopically optimizes over the next action and then thinks about following whatever values come after that. So the whole point is that given a Q-value, you can read out a policy just by argmaxing over a.

Now I'm going to invoke another thing we looked at: policy evaluation for MDPs. Remember, in MDPs you are given the transitions and the rewards; you know how the world works. In reinforcement learning you don't. So in MDPs we had this set of equations. We defined the value of a policy to be simply the Q-value at s of the action given by that policy. And separately, you can define a recurrence for Q^pi(s, a), which sums over all possible successor states when you take action a, weighting by the probability of that transition, the immediate reward plus a discounted estimate of the future.

So just as a recap: SARSA is a reinforcement learning algorithm; it computes Q-values, a function of (s, a), and I can get a policy out very easily. Now the question is: if I just gave you V^pi, could you compute a policy from that directly? Can you read out the policy? If I only told you how good various states are, can I get a policy out? Well, what do you need to get a policy out? In some sense, you look at this and say: if I'm given V^pi(s), I can just compute Q^pi(s, a) using this formula, and then I take the argmax over a, and that's my policy. So I'm only given V; I don't have Q, but Q is, in some sense, a one-step transition away from V.
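(To write out the recurrences just described in symbols, using standard MDP notation matching the verbal description above:)

$$\pi_{\text{new}}(s) = \arg\max_{a} Q^{\pi}(s, a)$$

$$V^{\pi}(s) = Q^{\pi}(s, \pi(s))$$

$$Q^{\pi}(s, a) = \sum_{s'} T(s, a, s')\,\big[\text{Reward}(s, a, s') + \gamma\, V^{\pi}(s')\big]$$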

Segment 3 (10:00 - 15:00)

But here's the catch: computing this requires knowledge of the MDP. This is why, in general, you can't just estimate the values of states: if I don't know how the world works, you can tell me this state is great and that state is bad, but that doesn't tell me which actions to take. The thing that connects actions and states is the transitions, but I assumed I don't know the transitions, so that doesn't help me act. Whereas if I have a Q-value, that binds the action into the value: I know that in this state, not only is it good, but taking a particular action gives me some value. That's why, when we started with reinforcement learning in its full generality, we only looked at SARSA and Q-learning, because we need policies.

Now, however, for games we actually do know the transitions and the rewards, unless you're in some cruel game where you don't know the rules. Most of the time you're playing a game, you know the rules perfectly. Things might even be deterministic, as we've defined them: chess, Go, everything is deterministic and fully observed, so there's no uncertainty. So then, given the V values, it's actually pretty trivial to compute the best policy. You're in a state s; you consider all actions and see where they go. Remember, there's a deterministic successor function Succ(s, a) that takes a state and an action and tells you what successor state you end up in, and you just evaluate how good the resulting state is.

Well, wait a minute. If we already know the MDP, then why are we using reinforcement learning? I thought the whole point of reinforcement learning is that you don't know the MDP; if you know the MDP, just do value iteration. In principle you could do value iteration for games, because in some sense there's no need to learn anything; you already have everything in front of you. But the problem is that for games, beyond tic-tac-toe or so, the number of states is just vast: it's exponential, and you won't be able to do value iteration, whose cost grows with the number of states. Therefore we're going to use reinforcement learning, and it's important to know why. It's not because we don't know the MDP; we actually know the MDP quite well. It's because the number of states is exponential, and reinforcement learning lets you use function approximation and effectively reduce the number of states.

Okay, let me pause there. We did a bit of review of Q-values, V-values, policy evaluation and SARSA, and some motivation for MDPs and reinforcement learning. [Question:] How does the first policy update simplify to this one? Is that the question? Yeah. So in a game, the transition probability is 1 for one particular s', and that particular s' is Succ(s, a); it's a function, so the summation goes away. And in a game all the reward is at the end, so the reward is zero for essentially all transitions, and the discount is 1, so you end up with just V^pi of the successor.

Okay. So now let's actually talk about turning this into an algorithm: TD learning. The way to think about it, if you have to remember just one thing, is that TD learning is to V^pi as SARSA is to Q^pi. SARSA estimates Q^pi; TD learning estimates V^pi.
And basically that's the only difference. In SARSA, Q^pi tells you how good each action is from a given state. In TD learning, all V^pi tells you is how good each state is.

Segment 4 (15:00 - 20:00)

And since we know the action-to-next-state mapping, that's all we need, because we just need to know how good states are. So this is good, and it simplifies things, because we have a function of just the state rather than the state and the action. It's a minor point, but it does make some things easier.

Now we're going to assume we do function approximation. Remember, in function approximation in reinforcement learning, instead of keeping a tabular representation, meaning a separate value for every state that has nothing to do with any other value, we define a parameterized function, also written V, which takes as input s and depends on some weights w. These are the parameters of that function: as you vary w, you get different functions. In deep reinforcement learning, V(s; w) is typically called a value network.

So here's TD learning in a nutshell. You get a piece of experience: you were in state s, you took action a, you got reward r, and you ended up in s'. The model predicts that state s has a particular value V(s; w), given the current parameters. Based on this new piece of information, how can we compute a refined estimate of that value? We define a target, which incorporates this piece of experience by bootstrapping: the immediate reward plus a discount gamma times the value of the successor state, under the same set of weights. This is very similar to Q-learning; you just have V's instead of Q's, and I don't need an a' because this is a value function, not a Q-function.

Now, to move the model's prediction towards the target, I'm going to define a loss function; I'll just use the square loss, the squared difference between the model's prediction and the target. And given that loss function, I can take a gradient step by moving the weights in the direction opposite to the gradient of the loss. To summarize that equation: on each piece of experience, I look at the model prediction and the target, and if you differentiate the loss you'll find a natural form: the difference, times the gradient of what's inside, which is just the gradient of the value function with respect to w.

One important point is that the target term also depends on w, but I'm going to treat it as a constant. Remember, in the second lecture or so, when we were doing computation graphs, you could detach; this is where you would detach the target, because when I backprop through this, I only want the gradient with respect to the prediction. So let me explain that again: think of the target as just a constant value. All you're doing is saying: take your model's prediction and move it towards the target; the target is fixed, you don't change it, and that corresponds to this gradient. The target does depend on w, implicitly or actually explicitly, but for the purposes of computing the gradient, I'm not going to differentiate through it.
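(In symbols, the update just described looks roughly like this, written out from the verbal description above; the 1/2 in the loss is only there so the gradient comes out tidy:)

$$\text{prediction} = V(s; \mathbf{w}), \qquad \text{target} = r + \gamma\, V(s'; \mathbf{w}) \;\;(\text{treated as a constant})$$

$$\text{Loss}(\mathbf{w}) = \tfrac{1}{2}\big(V(s; \mathbf{w}) - \text{target}\big)^2$$

$$\mathbf{w} \leftarrow \mathbf{w} - \eta\, \big(V(s; \mathbf{w}) - \text{target}\big)\, \nabla_{\mathbf{w}} V(s; \mathbf{w})$$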
Okay, so let's see a code implementation. We're going to do TD learning for this tram problem. I'm going to define the MDP and an exploration policy, which either walks or takes the tram, and then for TD learning I'm going to pass in the MDP as well.
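(A rough sketch of what that TD-learning class might look like; the MDP interface names below, succAndProbReward, actions, and discount, are assumptions made for illustration rather than the exact code shown in lecture.)

```python
import random
from collections import defaultdict

class TDLearning:
    """Tabular TD learning: estimates V^pi, and uses the known MDP to act."""

    def __init__(self, mdp, eta=0.1, epsilon=0.2):
        self.mdp = mdp                # known model: successors, probabilities, rewards
        self.V = defaultdict(float)   # V[s] -> estimated value of state s
        self.eta = eta                # step size
        self.epsilon = epsilon        # exploration probability

    def getAction(self, s):
        # Epsilon-greedy: explore, or act greedily with respect to V.
        if random.random() < self.epsilon:
            return random.choice(self.mdp.actions(s))
        # Turn V into Q using the known transitions:
        # Q(s, a) = sum_{s'} T(s, a, s') * [Reward(s, a, s') + gamma * V(s')]
        def q(a):
            return sum(prob * (reward + self.mdp.discount() * self.V[s_prime])
                       for s_prime, prob, reward in self.mdp.succAndProbReward(s, a))
        return max(self.mdp.actions(s), key=q)

    def incorporateFeedback(self, s, r, s_prime, is_end):
        # Bootstrapped target; no function approximation here, for simplicity.
        target = r + (0 if is_end else self.mdp.discount() * self.V[s_prime])
        prediction = self.V[s]
        self.V[s] += self.eta * (target - prediction)
        # With function approximation, this last line would instead be a gradient
        # step on the squared loss, with the target treated as a constant.
```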

Segment 5 (20:00 - 25:00)

So if I just want to evaluate the value of a policy, I don't need the MDP, but I need the MDP to be able to compute the policy, as I argued earlier. And this is for a general MDP, not just games. The RL algorithm implements two things as usual: getAction and incorporateFeedback. When I get an action in TD learning, if I'm doing epsilon-greedy exploration, then with probability epsilon I explore, and with probability 1 - epsilon I return pi. And what is pi? Pi is the policy given by V^pi, the values I'm estimating. What I'm going to do is compute the Q-values, because I only have V. For every action, I sum over all the possible successors of that action, weighting by the probability of the transition, the immediate reward plus the discounted future value. This is where V comes in: I take V, I combine it with my knowledge of how the world works, the transitions, and I get a Q-value as a result. That gives me a set of Q-values, and then I just take the action with the highest Q-value; in this case, that's walk. So given these V values I can compute a policy, and that allows me to act.

And now, when I act and the environment sends me feedback, this is what I do: I look at the predicted value of the given state, I look at the target, which is the reward plus the discounted value of the next state, and I perform the update. Notice I'm not doing any function approximation here, for simplicity; if there were function approximation, I would form a loss function and take the gradient, but I'm keeping it simple for now. In the tabular setting, this has a simple interpretation: I'm just moving the value towards the target, away from the prediction, at some learning rate. In this example I predicted 0, the target was -1, so I nudge the value of state 1 to -0.1, which is in that direction. And then I do this over and over again, and that is essentially TD learning.

To summarize: TD learning estimates V^pi from experience. Remember, V^pi is the value of a particular state under some policy pi, so this tells me, if I'm following a policy, how good various states are. And it is on-policy: notice that we're only rolling out with the policy pi, and we're not considering any other actions. Whenever you see a max over actions, that usually means something like off-policy or Q-learning; if you don't see it, it's probably on-policy. That's a good rule of thumb. And we're using bootstrapping: the target contains the model's own estimate, which is the sign of bootstrapping.

Now let's specialize TD learning to games. What we did is TD learning in general, for any MDP; now we're going to adapt it to games. So what's special about a game? Well, as we discussed earlier, transitions are deterministic, at least in the games we've looked at so far: when you make a move, it does one particular thing. There's no utility until the end of the game. And then there are two players, an agent and an opponent. So in some sense maybe I lied a little bit: it's not purely a special case.

Segment 6 (25:00 - 30:00)

The fact that you have deterministic transitions, that is a special case, and so is having no utility until the end. But this idea of two players, an agent and an opponent — remember, that is the thing that's actually special about games. So how do you deal with that? The way you deal with it, in the context of adapting reinforcement learning, is actually quite simple and elegant: you just let the agent and the opponent use the same value function. This is known as self-play, where you're essentially saying: agent and opponent, your minds are wired together, in some sense, and you're both going to use the value function that I'm estimating to drive your decisions. But of course the agent and the opponent are not doing the same thing: the agent is trying to maximize and the opponent is trying to minimize, which means there's a small tweak we need to make. We saw earlier that, given an estimate of V^pi, we can read out a policy for the agent by choosing the action that leads to the best successor state under V^pi. What is the opponent trying to do? The opposite: it's trying to minimize. So you just replace the argmax with an argmin, and that's the opponent's policy. It's actually pretty nice that all you're focusing on in TD learning is estimating this V^pi, and once you have V^pi you can play either the agent or the opponent. This is generally true: if you know what's good and what's bad, you can choose to do the good thing, or you can do the bad thing.

Let me instantiate this, just as a fun example, for the game of backgammon. Here's how backgammon works. You have a board consisting of these triangles called points. There are two players in this picture, the red player and the white player, and each player has a set of pieces. The goal is to move your pieces off the board. If I'm the red player, I have my pieces here, and I'm trying to move all of them clockwise around to my home board and then off; white is going to take their pieces and try to move them counterclockwise and off the board. How do I move pieces? There are dice involved. I roll two dice — say it's white's turn — and the dice determine the number of spaces I can move. If I roll a three and a five, I could take this piece and move it three and then another piece five, or I could take this piece and move it five and that one three. What makes the game interesting is that if you land on a point with exactly one opponent piece (I guess there's no example of that here), you move that piece to the bar — you sort of crush it, it goes to the bar and has to start over. That's bad for your opponent, good for you. But you can't land on a point with more than one opponent piece. So generally the strategy is to put your pieces in tall stacks so that they're safe from your opponent, while at the same time you have to move everyone across. So this is probably the better move, because it leaves you with no exposed single pieces. I'm not sure whether there's another rule I'm forgetting, but that's basically the gist.
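(Coming back to the self-play idea for a moment: a minimal sketch of how both players' policies can be read off the same learned value estimate, assuming a deterministic successor function succ(s, a) and a legal-move function actions(s); these names are illustrative.)

```python
def agent_policy(s, V, actions, succ):
    """Agent: pick the move leading to the successor state with the highest value."""
    return max(actions(s), key=lambda a: V(succ(s, a)))

def opponent_policy(s, V, actions, succ):
    """Opponent: same value function, but minimize instead of maximize."""
    return min(actions(s), key=lambda a: V(succ(s, a)))
```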
Notice that this game has randomness in it, so it's a little bit different from an MDP or the classic games we've seen. But remember, we shouldn't be afraid when there are slight variations: as long as we understand the principles, we can write down the recurrences and everything makes sense. So the first thing is to think about the dynamics of the game. We're going to use pi_dice to denote the dice as, in some sense, another player.

Segment 7 (30:00 - 35:00)

The dice "player" is a fixed policy that just generates the values of the two dice. So pi_dice moves first, then it's the agent's turn; then, when the opponent rolls, that's another pi_dice move, and the opponent moves; then the agent rolls, the agent moves, and so on. The reason the dice have to come before the agent is that the agent's move depends on what the outcome of the dice was. So when we do TD learning for backgammon, we are trying to learn pi_agent and pi_opp at the same time — remember, this is self-play — but pi_dice is fixed. If you were in a cruel world, you could also play against adversarial dice, but I imagine that would be a much harder game to win.

One example of how you could proceed: if you want to apply TD learning, you have to define how the value function is parameterized. Typically you have a game with a game state, and you have to turn it into a vector for machine learning to work. Here is an example of how you could define a vector for each state; remember, we did this in the machine learning section of the class. You have a state, say this simplified state, and you define a set of features that capture it. For example, a feature might be: is the number of O's in column 0 equal to 1? That's true, so that feature is 1. The number of O's on the bar: 1. The fraction of O's that have been removed: two out of four, so 0.5. Is the number of X's in column 1 equal to 1? Yes, so 1. Is the number of X's in column 3 equal to 3? Yes, 1. And is it O's turn? Yes, so that's 1. So every state of the board gets mapped to a feature vector. And once you have a feature vector, you can do many things: you can define a linear value function, which we've talked about — this is basically linear regression — or you can do something fancier and define a multi-layer perceptron with multiple layers. And then you just apply TD learning.

Next I'm just going to go through — this part isn't technical, but I think it's interesting — some historical examples where machine learning and reinforcement learning have been useful in game playing. I think I mentioned on the first day of lecture that checkers was a very early example of where learning was helpful. Arthur Samuel, in 1959, trained a checkers program, and it learned by playing itself repeatedly, so this is perhaps one of the earliest examples of self-play in action. At that time compute was limited, so you couldn't really do much deep learning; you had to invest time in engineering features. I gave you a very simple set of features for backgammon, but you can define much more complicated ones. They used a linear evaluation function — a linear value function — and also defined some intermediate rewards. Remember, in games the true reward only arrives when you win or lose, and that is a very hard thing to learn from because it's a delayed-reward problem, so they essentially added intermediate rewards to make the problem easier. They also did alpha-beta pruning, with some additional domain-specific heuristics; alpha-beta pruning we've talked about. And this reached an amateur level of play. So it could probably beat me. Probably.
And, you know, you've got to remember this was 1959: this ran in about nine kilobytes of memory. So, pretty good. Okay, so backgammon, which we just talked about: the big result in backgammon was this program called TD-Gammon, made by Gerald Tesauro in 1992.

Segment 8 (35:00 - 40:00)

And this was an agent that also learned by self-play, except it was 1992 instead of 1959, so it could play on the order of a million games. Because compute was more plentiful and you could get more data, they actually used fairly simple features, they used a neural network, and they had no intermediate rewards. So you can see the direction: as you get more data and more compute, you don't have to engineer features quite as much. And this reached a human expert level of play, not just an amateur level, and it's one of the earliest examples of a program providing new insights into the game — for example, how the first move should be played.

And then there's Go, which I think is more in recent memory. In 2016 there was AlphaGo; I want to highlight AlphaGo Zero because it's much more bullish on reinforcement learning. This also learned by self-play — they played about 5 million games. But at this point the features were also very simple: you just had the stone positions. In Go there's a lot of complexity about which pieces are alive, how everything is connected, how many degrees of freedom there are, and previous programs tried to encode all that knowledge into features; here you're just saying: machine learning algorithm, these are where the pieces are, go figure it out. This was obviously using a deep neural network, with no intermediate rewards. They did have to use Monte Carlo tree search, which I'm not going to talk about, but it's a search algorithm that lets you get a better policy than just greedily taking the action that looks best one step ahead; you can think about it as a less greedy version of the types of policies we've been looking at. This beat AlphaGo, the 2016 version, which had beaten Lee Sedol in 2016. Of course, at this point Go-playing systems are much better than humans, and similarly this provided new insights into the game.

Okay, so those are three examples that ground out how the techniques we've been seeing in class map onto real-world game playing. That's all I'm going to say about TD learning. Next we're going to look at two generalizations, or extensions if you will, of the types of games we've been looking at. We've been looking at turn-based games; we're going to look at simultaneous games. And we're also going to go beyond the zero-sum assumption.

So far we've been talking about turn-based games, and now we're going to talk about simultaneous games. You know simultaneous games: a famous example is rock-paper-scissors, where the two players have to play at the same time. So here's a question for you to think about: can you still play optimally if you reveal your strategy to the other player? Obviously, if I'm playing rock-paper-scissors and I say "I'm going to play rock", that is not very smart. But is there a way to play where you still reveal your strategy and don't suffer for it? The answer turns out to depend on what kind of strategies you're allowed to use. In turn-based games we have this nailed: we just compute the minimax policies using a game tree, and the game tree naturally models the turn-taking, where at each node only one player is moving. But in simultaneous games, two players have to move at the same time, so we can't use a game tree anymore. You can't have this.
If two players are trying to move at the same node, then it's confusing: are you maxing? Are you minning? It doesn't work. So we're going to focus first on zero-sum simultaneous games, and then we'll do the non-zero-sum setting. We're going to start with a few definitions, and then there's a theorem that essentially solves the problem.

Segment 9 (40:00 - 45:00)

So I'm going to present to you another game, which is hopefully more obscure than rock-paper-scissors. Here's how the game works. You have two players, A and B, and each is going to show either one or two fingers. If both show one, then B gives A $2. If both show two, B gives A $4. Otherwise, A gives B $3. So intuitively, A wants both players to show the same number of fingers, because that's where A wins, and A would really rather it be two than one. But if A always plays two, then B is going to play one, and A loses money. So there's a balancing act here. While I'm going through this, maybe in the back of your mind you can think: hmm, what's the strategy here?

All right, so let me introduce some formal notation. This is called a single-move simultaneous game: you play once and you're done; there's no state. There are two players, A and B. We're going to define a payoff matrix, which is essentially the equivalent of the value function or the reward function. It takes two arguments, a and b, and for each action pair it gives you a value, which is A's utility; because the game is zero-sum, B's utility is just the negative of that. And again, there's no state; there's only one move.

So this is what it looks like for two-finger Morra. Let me just write it on the board as a handy reference. A can play either one finger or two fingers, and so can B. If they both play one, A gets 2; if they both play two, A gets 4; otherwise A loses 3. That's the payoff matrix. Just to ground things out: there are two actions, one and two, and V(1, 1) = 2, and so on.

Now, what is the equivalent of a policy here? In game theory it's known as a strategy. A pure strategy is a deterministic policy, and a deterministic policy here is just a single action. A mixed strategy, also known as a stochastic policy, is a distribution over actions. For example, in two-finger Morra, "always one" is a strategy, and it means the action "one finger" has probability 1. Another way to write that in math is that the policy pi is a vector of probabilities over actions: for one I have probability 1, for two I have probability 0, so pi = [1, 0]. "Always two" is the strategy where all the probability mass is on two fingers: pi = [0, 1]. And if I were playing uniformly at random, I'd put probability 0.5 on each action: pi = [0.5, 0.5].

So I've defined the payoff matrix and the policies, and now I can define the value of the game, which combines them. In math, the value of the game under two mixed strategies pi_A and pi_B is: for every pair of actions a and b, take the probability of A playing a times the probability of B playing b (notice that the two choices are independent), times the payoff V(a, b), and sum it all up.

Segment 10 (45:00 - 50:00)

So in code: I have this payoff matrix V that I'm going to pass in, with, let's say, the uniform policy for both players. This is pretty straightforward: for every possible action a with probability pi_A(a), and for every action b with probability pi_B(b), I just sum up pi_A(a) times pi_B(b) times the corresponding entry in the payoff matrix. If I do that, I get a value of zero, and that makes sense: if I take the average over all four entries, I should get zero. So if everyone plays uniformly, the value of the game is zero.

Now, what happens if player A plays "always one" and player B plays uniformly? The value is going to be -0.5: player B is uniform between the two columns and player A only plays one, so it's the average of 2 and -3, which is -0.5. "Always two" against uniform means I'm always in the bottom row, averaged over B's two choices, which gives the average of -3 and 4, which is 0.5. If A is playing uniformly and B plays one, it's the average of that column, -0.5. And the fourth case: A plays uniformly and B plays only two, which is the average of that column, 0.5.

So hopefully game evaluation is clear. Just to build some intuition, it makes sense that player A wants to play two more often, and if you know B is playing uniformly — a sort of generic strategy — you can get away with playing two and collecting some value. In some sense this is much simpler than having states and actions and transitions and rewards and discount factors, but the solution is actually more subtle here because of the simultaneous nature.
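(A small sketch of the game-evaluation computation just described; the payoff-matrix encoding and function name are illustrative assumptions.)

```python
# Payoff matrix for two-finger Morra: payoff[(a, b)] is player A's utility.
payoff = {
    (1, 1): 2, (1, 2): -3,
    (2, 1): -3, (2, 2): 4,
}

def game_value(pi_A, pi_B, V=payoff):
    """Value of the game for mixed strategies pi_A, pi_B (dicts: action -> probability)."""
    return sum(pi_A[a] * pi_B[b] * V[(a, b)]
               for a in pi_A for b in pi_B)

uniform = {1: 0.5, 2: 0.5}
always_one = {1: 1.0, 2: 0.0}
always_two = {1: 0.0, 2: 1.0}

print(game_value(uniform, uniform))      # 0.0
print(game_value(always_one, uniform))   # -0.5
print(game_value(always_two, uniform))   # 0.5
```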
All right, so how do you come up with the optimal strategy here? Just to have a bit of fun, I want everyone to find a partner, try playing a few rounds, and see if you get any insights. All right, how did people do? Did anyone figure out a good strategy that worked for them? It's not obvious — this is fairly subtle — so I'd be very impressed if you came up with the optimal strategy. One other caveat is that this is meant to be a single-move game: you play once and you're done. What happens when you play an iterated game, obviously, is that now there is state, because you have the history of what the other person has been doing, and that changes the dynamics; it's not quite the same game anymore. But anyway.

Okay, so let's try to solve the game. You're probably dying to know the optimal strategy. Remember, the game value is very simply defined as V(pi_A, pi_B). The problem is that player A wants to maximize this quantity, player B wants to minimize it, and they want to do it at the same time. So you get into a jam: how do you move forward? Well, you just have to let someone go first. So let's see what we can do here. Let someone go first, and first, let's consider pure strategies.

Segment 11 (50:00 - 55:00)

So when players are just playing pure strategies, what happens? This case we know how to handle: it's just normal minimax. Say player A goes first. Player A is trying to maximize, player B is trying to minimize. If player A goes first, then whatever A does, player B can just play the opposite number and push A down to -3. So the minimax value of this game is -3; pretty bad for player A. But if player B goes first, then player A can just match the number: if B shows one, A plays one; if B shows two, A plays two. And B, knowing this, is going to play one, because B is minimizing and 2 is less than 4. So we've solved these somewhat weird sequential versions of the game, which we don't really care about in themselves, but the process gives us some insight, and there's something worth commenting on: in this setting, and in general, going second is no worse and often better. In particular, this value is less than or equal to that one: -3 <= 2. More generally, the max over a of the min over b is less than or equal to the min over b of the max over a. So going second gives player A a better (higher) result here. This makes intuitive sense: if you go second, you see what the opponent does and you can react. And since there's no state, it's not as if player A takes a card away from you so that you can no longer take some action; you can take the same set of actions, and the payoff matrix is the same. So all you're doing by going second is getting more information.

Okay, so that's pure strategies. Let's see what happens with mixed strategies. Suppose player A plays a mixed strategy. This is a slightly strange thing: what it means is that player A announces their strategy — "I'm going to flip a coin; if it's heads I'll play one, if it's tails I'll play two." They're not flipping the coin yet; they're announcing the strategy. Now suppose player B knows player A is going to do this. What is the best thing for player B? I'm going to do some very simple algebra to figure out the answer. Just expand the definition of the game value into four terms, where each term corresponds to a different combination of actions: each term is a probability under pi_A, times a probability under pi_B, times the payoff entry. And then I'll start simplifying. Let me actually do this on the board, since I think it'll be easier. In this case player A is uniform — a half on each action — and the question is: what is the best thing for player B? If B plays one, B just gets the average of that column, which is -1/2; if B plays two, B gets the average of the other column, which is +1/2. So what should the optimal pi_B be? Player B wants to minimize, so B should always play one, because -1/2 is less than +1/2; in vector form, pi_B = [1, 0]. And this points out another important fact: the second player can always play a pure strategy in this case, right?
So whether the first player plays a pure strategy or a mixed strategy, if the second player is trying to minimize this, it can always be done with a pure strategy. And the reason is that the argument didn't depend on the half: suppose A's mixture were anything else.

Segment 12 (55:00 - 60:00)

Then I can do the same math: I take the weighted average for one of B's actions and get some number, I take the weighted average for the other action and get some other number, and I look at which one is preferable — player B wants to minimize, so B picks the smaller one — and puts all its eggs in that one basket. So no matter what the first player does, the second player can always respond with a pure strategy.

Okay, so hopefully we're making progress. We've made two observations: one, going second is no worse; and two, if you go second, you can just choose a single action — you don't have to randomize. So now let's do this more generally. I'm going to do minimax, but where the actions are mixed strategies. Player A is going to go first now and play a strategy parameterized by p: with probability p, play one finger; with probability 1 - p, play two fingers. The way to think about this is that player A has an uncountably infinite number of possible "actions", one for each p. And at the next node, player B — remember, from the observation above, the optimal thing for the second player is a pure strategy — has only two choices, one or two.

So now we just have to compute the values of the leaves. If player A plays (p, 1 - p) and player B plays one, then the game value is p * 2 + (1 - p) * (-3), which is 5p - 3. And if player B plays two, the value is p * (-3) + (1 - p) * 4, which is -7p + 4. Hopefully that matches what I have on the slide. So I have the game values at the leaves; player B is minimizing, so B takes the min over these two, and then player A takes the max over p. That's the minimax value of the game — exactly what we did before, just with p instead of plugging in one half.

So, consulting this equation: the minimax value of the game is the max over p (a continuous value) of the min over these two expressions. It's a bit weird because there's an infinite number of choices for p, but hopefully you can generalize the minimax intuition to this setting. Jumping ahead, the way you solve one of these is to notice that the minimum is attained where the two quantities are equal. Let me try to draw it — definitely not to scale. If you plot both expressions as functions of p, 5p - 3 slopes upward and -7p + 4 slopes downward; taking the min erases the upper pieces and keeps the lower envelope, and the max over p of that envelope is at the peak, where the two lines cross.

Segment 13 (60:00 - 65:00)

So when you set them equal, 5p - 3 = -7p + 4, you find that p = 7/12. And if you plug that p back into either expression, you get that the minimax value of the game is -1/12. So, a bit of algebra, but the upshot is this: we made the game turn-based, with player A going first with a mixed strategy and player B going second (where a pure strategy suffices, because B goes second), and we computed the minimax value of the game, which is -1/12, with player A playing one with probability 7/12. This favors player B, right? Because player B goes second.

So now, just to be fair, let's give player A the chance to go second: let player B go first and do the same calculation. If you do it — which I'm not going to repeat — you'll notice something remarkable: the minimax value of the game is exactly the same, -1/12, and the probability is again 7/12. You might think I rigged all these numbers and we just got lucky, but this is actually not a coincidence. It turns out there is a very important theorem in game theory, the minimax theorem, due to John von Neumann in 1928, which says that for every simultaneous two-player zero-sum game with a finite number of actions (we have two actions here, so that counts), going first and going second give the same value. In other words, the max over pi_A of the min over pi_B equals the min over pi_B of the max over pi_A. It's important that pi_A and pi_B range over mixed strategies: as we saw in the first example, if they are restricted to pure strategies this is definitely false — with pure strategies, going second can be strictly better.

The proof of this is actually linear programming duality. It turns out this is a linear program, and there's some really nice math that shows the theorem is true; furthermore, you can compute the optimal mixed strategies efficiently using linear programming as well. We just solved the linear program manually for this example, but you can solve it for any payoff matrix and any number of actions. The upshot of all of this is that revealing your optimal mixed strategy doesn't hurt you. So, answering the poll question from before: for two-finger Morra, you can go ahead and announce to all your friends that this is the strategy you're going to play — 7/12 on one, 5/12 on two — and you won't suffer. It's sort of bizarre that it works out, but it's math: you can check it, and once you've checked it, you kind of have to trust it. Okay, I'll skip that part.
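(To collect the two-finger Morra algebra in one place, here is the derivation described above, written out:)

$$V\big((p, 1-p),\; b=1\big) = 2p - 3(1-p) = 5p - 3$$

$$V\big((p, 1-p),\; b=2\big) = -3p + 4(1-p) = -7p + 4$$

$$\max_{p}\ \min\{\,5p - 3,\ -7p + 4\,\}: \quad 5p - 3 = -7p + 4 \;\Rightarrow\; 12p = 7 \;\Rightarrow\; p = \tfrac{7}{12}$$

$$\text{value} = 5\cdot\tfrac{7}{12} - 3 = \tfrac{35 - 36}{12} = -\tfrac{1}{12}$$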

Segment 14 (65:00 - 70:00)

Okay, so if you go back to the minimax principle, which we covered before for turn-based games, the same thing applies here. If these are minimax strategies, then if your opponent deviates from their strategy, it can only be good for you; and furthermore, if you deviate from your own strategy, you can only make things worse for yourself. So minimax is this really nice setting: you want to be playing minimax so that there's nothing your opponent can do to hurt you, and you can't do any better against a perfect opponent. And of course, the same caveat applies: minimax is optimal against a perfect player. If you knew, for example, that player A were always going to play two fingers, then of course you could do better — you'd just play two fingers as well. Remember, expectimax is the best thing if you know what your opponent is going to do; but if you don't, playing minimax guards you against any potentially huge downside.

So, let me cover non-zero-sum games. This is going to be a brief overview, just to give you some exposure; we won't have time to go through it in depth. The idea with non-zero-sum games is, well, that they're not zero-sum: the utilities of the players are arbitrary. So far in this class we have assumed that the utility of player A plus the utility of player B is zero; in other words, the utility of A is minus the utility of B. There are many possible settings you could be in. If you're playing competitive games, that's zero-sum: if you win, I lose; if I win, you lose. There are cooperative games, where the utilities of all the players are the same; there's nothing much to be said there, because it's just pure maximization — it's just search, everything is a max. And real life is somewhere in between.

So here's an example. How many of you know about the prisoner's dilemma? Okay, so this one is more popular. The idea is that you have two players, both accused of doing something bad, and the prosecutors ask each of them to testify against the other. If both testify, then both are sentenced to five years in jail. If both refuse to testify, then both are sentenced to one year in jail, which is, I guess, better than five years. And if only one testifies, then that one gets out for free and the other one gets a ten-year sentence, which really sucks. This is captured by a payoff matrix, where I now have to add a subscript p to denote the utility of a particular player. If the utility is minus the number of years in jail, this is what the payoff matrix looks like for the prisoner's dilemma: if they both testify, -5 each; if they both refuse, -1 each; and on the off-diagonals, if A testifies and B refuses, A gets 0 and B gets -10, and vice versa. We don't have time to let you play this one, but that's fine.

So how do you solve non-zero-sum games? We can't apply von Neumann's minimax theorem anymore, because the game isn't zero-sum. But it turns out we can get something, and that something is called a Nash equilibrium. A Nash equilibrium is a pair of strategies pi_A* and pi_B* such that no player, A or B, has an incentive to change his or her strategy. In other words, if pi_A changes unilaterally, that only makes things worse for A, and if pi_B changes unilaterally, that only makes things worse for B. And in 1950, Nash proved a really beautiful existence theorem: as long as you have a finite number of players and a finite number of actions, there exists at least one Nash equilibrium.
So there is at least one solution.
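(As a concrete illustration of the definition, here is a small sketch that brute-forces the pure-strategy Nash equilibria of a two-player payoff table; the encoding of the prisoner's dilemma utilities follows the numbers above, and the function name is illustrative.)

```python
# utilities[(a, b)] = (utility of player A, utility of player B)
prisoners_dilemma = {
    ("testify", "testify"): (-5, -5),
    ("testify", "refuse"):  (0, -10),
    ("refuse",  "testify"): (-10, 0),
    ("refuse",  "refuse"):  (-1, -1),
}

def pure_nash_equilibria(utilities):
    """Return the (a, b) pairs where neither player gains by deviating unilaterally."""
    actions_A = {a for a, _ in utilities}
    actions_B = {b for _, b in utilities}
    equilibria = []
    for a, b in utilities:
        u_A, u_B = utilities[(a, b)]
        a_ok = all(utilities[(a2, b)][0] <= u_A for a2 in actions_A)
        b_ok = all(utilities[(a, b2)][1] <= u_B for b2 in actions_B)
        if a_ok and b_ok:
            equilibria.append((a, b))
    return equilibria

print(pure_nash_equilibria(prisoners_dilemma))  # [('testify', 'testify')]
```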

Segment 15 (70:00 - 73:00)

So here are some examples, just to give you some intuition about what Nash equilibria look like. We looked at two-finger Morra before, and the minimax strategy is also a Nash equilibrium, because no player has an incentive to move off what they're doing. If you consider a fully collaborative version of that game, where both A and B get the same reward, then there are two Nash equilibria: either both play one or both play two. Both are stable, because even if "both play one" is not optimal, if B tries to move off unilaterally, it just makes things worse for everyone. So it's a local optimum — local optima in search problems are an analogue of Nash equilibria. And finally, the Nash equilibrium for the prisoner's dilemma is that both players testify, because no one has an incentive to move off of it. The point here is that both refusing is not a Nash equilibrium: if you're in that state, both A and B would individually rather switch to testifying and get out for free, so it's not stable. The off-diagonal states aren't stable either, because the player facing ten years would rather testify too. So both-testify is the only stable point, and the whole point of the prisoner's dilemma is that, unfortunately, if you have agents that are only after their own selfish interests, you can end up in situations that are suboptimal: both refusing is clearly better for both of them, but it is not a Nash equilibrium.

Okay, so just to contrast: for simultaneous zero-sum games we have von Neumann's minimax theorem, and we can produce a single game value; for non-zero-sum games we have Nash's existence theorem, which can lead to multiple Nash equilibria and multiple game values. The way to think about Nash equilibria is that they're just stable — locally stable: no individual agent can improve by changing alone. It's not really about optimality in the same way we've been pursuing in the rest of this class, and that sort of makes sense: everyone has different utilities, so what would "optimal" even mean with multiple objectives? But fortunately, Nash equilibria do exist for non-zero-sum games.

Okay, to summarize: we looked at simultaneous games, where two players move at the same time. In general the second player has an advantage and can play a pure strategy, but the minimax theorem says that if you allow mixed strategies, then the second player doesn't actually have an advantage: both players can be optimal. Then we looked at Nash equilibria, which exist for non-zero-sum games but are not unique. In general there's a huge literature on this in game theory and economics, so if you liked the second half of this lecture, you should go take a game theory course and learn more.

Okay. So that concludes the games unit, and all our work on state-based models. Next time we're going to start a brand new unit on Bayesian networks. Okay, see you next time.
