# What is Q-Learning (back to basics)

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=nOBm4aYEYR4
- **Date:** 25.11.2023
- **Duration:** 45:44
- **Views:** 114,539
- **Source:** https://ekstraktznaniy.ru/video/12284

## Description

#qlearning #qstar #rlhf

What is Q-Learning and how does it work? A brief tour through the background of Q-Learning, Markov Decision Processes, Deep Q-Networks, and other basics necessary to understand Q* ;)

OUTLINE:
0:00 - Introduction
2:00 - Reinforcement Learning
7:00 - Q-Functions
19:00 - The Bellman Equation
26:00 - How to learn the Q-Function?
38:00 - Deep Q-Learning
42:30 - Summary

Paper: https://arxiv.org/abs/1312.5602
My old video on DQN: https://youtu.be/rFwQDDbYTm4

Links:
Homepage: https://ykilcher.com
Merch: https://ykilcher.com/merch
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://ykilcher.com/discord
LinkedIn: https://www.linkedin.com/in/ykilcher

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https:/

## Transcript

### Introduction [0:00]

All right. With the recent advances in AGI, notably the Q* algorithm by OpenAI, I thought we'd dive into a little bit of the background of what people suspect that it is. Now keep in mind, Q* could have nothing to do with anything we're talking about, but everyone's speculating about what it could be, and I thought it'd be good to go back to the basics and look at: what is Q-learning? What is it, what does it do, and how has it been used recently, or not so recently — namely in this paper right here, "Playing Atari with Deep Reinforcement Learning" by DeepMind, which was one of the first papers that really catapulted DeepMind to the forefront of everyone's attention. So it's not going to be state-of-the-art research today; it's more a basic introduction to Q-learning. I've actually done a video on this particular paper before, if you want to know more about the paper itself; today it's really going to be about Q-learning in general. So let's dive in.

Q-learning: we're in the field of so-called reinforcement learning, and we'll keep an eye on how that might apply to a language model or something like this as we go. In the basic sense, in reinforcement learning you have your agent over here — that is the ugliest "A" I have ever seen — and you have an environment. The environment gives the agent what's called an observation o. The agent reacts to that observation and replies with an action a. The environment takes that action and gives the agent a reward and a

### Reinforcement Learning [2:00]

next observation. This is the basic cycle in reinforcement learning, the most basic concept: the agent gets observations from the environment and responds with actions based on those observations, and the environment, after the first step, always gives a reward for how good the last action was.

Now, there are multiple configurations of this. The one simplification we're going to make today is that the observation is always equal to the state. This is a so-called Markovian, fully observable decision process, and that's going to simplify things a lot — so when we say observation or state today, we mean the same thing. Think of a chessboard: you observe the chessboard as it is. Here is maybe a simplified chessboard; you observe that the rook is here and the queen is here and so on. You observe it once, and that's everything you need to know — that's the whole state of the game. There's nothing hidden, nothing dependent on something in the past. (Which is actually not true in chess: whether you can castle or not depends on the past history, but today we'll just ignore that.)

So you're in a given state, you're asked to make an action, and that will lead to some reward. Now in chess, you already see that this isn't really true in the colloquial sense. If I move the rook over here, maybe that was a really bad move or a good move — it doesn't matter: unless it's a checkmate, I don't get any reward at all. My reward from the environment in chess is whether I have won or lost the game, and that only happens at the very end. No matter how good my moves are in between, I will not get any reward from the environment until the very last step, and that's a problem in reinforcement learning. I can go from state to state, step to step, through this whole game, and I only get a reward here at the end. Now this is a very tricky situation for a system to learn from, because how are you going to know whether the move you made here was ultimately the correct move?

You have a very similar challenge when you do sequence-to-sequence learning with language models: first you go through a sequence, and then you output the sequence over here. This would be the encoding stage, and from here on you decode. It's as if you wouldn't get a loss for each token that you produce, but only at the very end someone would look at the whole sentence and say, well, that's pretty good, or no, that's pretty bad. And in fact, that's what we do when we do reinforcement learning, even for RLHF. Reinforcement learning doesn't necessarily mean you only get reward at the end, but many environments are structured such that only after you've completed the entire episode do you get a reward, and you need to figure out what was good and what was bad among the things you've done in the meantime. Q-learning is one method that is good at mitigating that — at doing this credit assignment, the assignment of "this outcome actually depended on moves I made before."

So how do we go about this formally? Well, let's not go too formal. This is called a Markov decision process, fully observable, as we said. That means I'm always in some kind of a state; I take an action; I go to a next state. From state one I transition to state two by doing action one; if I did action two instead, I would transition to state three; maybe from state three, if I do action one, I transition to state two, and so on. I can move around these states by performing these actions. We'll also just assume that you always have the same actions available no matter what state you're in — also not entirely true. So I'm in a given state, and I need to figure out: what action should I do? And one way to go about that is to have a function that takes in a state and gives you an action.

### Q-Functions [7:00]

That is what you get if you do any sort of actor-critic reinforcement learning or policy-gradient methods: they learn a function where you input a state and it gives you an action, and it doesn't explain how or why. Today we're going to do something different. We're going to define a function Q, and what Q does is take in a state and a proposed action. So you're in state s and you're asking: if I did action a, what would my total reward be? You're in some state here; you could go to many different places, and at some point there will be end states — this is a good one, and this here might be a bad one — and you don't know how to get to them. So you're in this state and asking yourself: which action should I do, action one or two? Well, if you had this function, it would tell you: I'm in state s, I plug in a₁, and Q(s, a₁) gives me 5; Q(s, a₂) gives me 10. Obviously I'm going to choose 10, so a₂. The Q-function — don't worry yet how it comes about — is a function where, in the current state, you give it a proposed action, and it tells you how much your total reward from here on out will be if you now do whatever action you plugged in.

That is missing a little bit of information, if you've noticed. Namely, even if I do action a₂ here, I'm going to land in this state, and I will still have options after that. Just because I now decide to do a₂, after that I can still — and probably will — have to choose many more times what to do until I finally reach some good or bad end state. So in order for this Q-function to be well defined, I also have to specify what I do after that, and that's called a policy. I previously called it f, but commonly it's called π. So π(s) is a policy: you give it a state s and it tells you to do action a. Sometimes this is formulated probabilistically: the probability of doing action a given state s, π(a | s). But we can also go the deterministic route: a policy, you give it a state, and it tells you to go left, go right, go up, go down. Note the policy says nothing about reward. The policy is simply — you could write it down, you could make a table and say, if you're in this state, do this. And there can be good policies which lead to high reward and bad policies which lead to low reward. In fact, policies can be partially ordered: policy one is better than policy two if, in any state you can be in, following policy one gets you a better reward than following policy two. Obviously this is not a total ordering, because for some pairs of policies it really depends on the state. But never mind — I hope you see the difference: a policy is simply a function that tells you what to do, and there can be good and bad policies.

Now we can fully define the Q-function. In order to fully define a Q-function, I also have to supply it with a policy, so usually you write Q^π. Q^π(s, a₁) tells you: you're in state s; if you were to now take action a₁ in this state, and after that do whatever policy π tells you to do, then you would get a reward of 5. However, if you do action a₂ now, and after that equally just do what policy π tells you to do, then you would end up with a reward of 10. So in a sense, the Q-function says: we'll commit to following a particular policy after this step, but for this step we don't really know what to do. We plug each action into the Q-function — "after this step I'm going to follow this policy right here, but please tell me what to do right now" — and the Q-function evaluates each action and tells you the total reward it expects you'll get until the end of the episode, after taking this action now and then following the policy.

By the way, we haven't really specified reward and episodes. In classic reinforcement learning you often assume episodes go on for an infinite time horizon, so your total reward would be the reward at step one, plus the reward at step two, plus the reward at step three, and so on. You can see that if this goes on for an infinitely long time, you can get infinite reward, so what people do, even in the finite case, is introduce a discount factor γ: r₁ + γ r₂ + γ² r₃ + … Essentially, as time progresses, the rewards are worth less and less in the now. That's similar to how humans perceive reward, I guess: if it's further in the future, it's worth less to me now. If you promise me $100 tomorrow, that's worth less than $100 right now — and not because I'm unsure about tomorrow; I'd just like to have it right now. There are other discount factors that model things like uncertainty, either in the problem or in our estimation of it, but the basic total reward can be calculated as a sum: R = Σᵢ γⁱ rᵢ, where i goes from zero to possibly infinity, or to T. This total reward across the episode is what I want to maximize.

And as we said, if our episodes are such that, like in chess, I only get a reward once at the very end, we might as well set the discount factor to one, because the sum unrolls to just the last reward anyway: R = r_T. If we only get one reward, and we don't care how long the episodes are — in chess I don't care how long a game is; a win doesn't matter whether it comes in 20 or 60 moves — we might as well set the discount factor to one and just take the last reward. But in normal, regular reinforcement learning there's a discount factor.

Why did I go through that? It's important for understanding the framing of the Q-function. We could ask: why is this 5 and this 10? Picture yourself on a chessboard with only two moves available, and you ask the Q-function. You tell it: okay, I have, I don't know, Garry Kasparov on the phone, or Magnus Carlsen, and I can call him after this move — this is like a good policy, I know — but I just need to know what to do in this particular move right now; after that, I'll call Magnus Carlsen and he'll tell me what to do. So, should I move my rook or my queen? And the Q-function tells you: if you move your rook, even if you call Magnus Carlsen after that, your total reward will be 5; but if you move your queen right now and then call Magnus Carlsen, your total reward will be 10.

It's important to understand why these numbers could be different, and there are two reasons. Reason one, which is probably what you're thinking of first, is that the current action of moving the queen just gives me a higher immediate reward. Actually, chess isn't really a good example for this: it could be that in a reinforcement learning problem, the current action a₂ just gives you an immediate reward of 5, and after that you go on, whereas action a₁ gives you no reward right now, and that's why the values differ. But there is another reason — and if you're playing chess, that for sure is the reason — namely: even though you're following the same policy after that, you will end up in a different state depending on which of the two actions you take. If you take action one, you'll be in this state; if you take action two, you'll be in that state. These actions could be, on their own, relatively neutral, but because you're in a different state, you will have a much different chance of ending up in a bad or a good

### The Bellman Equation [19:00]

state, depending on the state you're in. So the Q-function encompasses both of those things: how good the current action is in the current step, and also how good the current action is looking ahead, via the state you'll end up in. And that leads us to one of the fundamental recurring equations in Q-learning: the Bellman equation.

The fundamental idea is this: the Q-value in state s of action a, following policy π after the next step — what is that going to be? Well, it's the reward of performing action a in state s, the immediate reward of that step. Before, we used capital R; let's use lowercase r, so capital R is the total reward, which is the sum of discounted lowercase r's — the rewards I get in the individual steps. So the Q-value starts with: what reward do I get in the immediate step I'm taking? Maybe that's zero, but there could be a reward right there — if I deliver a checkmate it's a one; if I'm in Doom and I shoot some monster, I do get actual points. Note that π does not appear in this term, because π only matters after the next step. The Q-value is the current step's reward plus whatever reward I'm going to make from the state that I end up in. So let's say that if I'm in state s and perform action a, I end up in state s′ — we'll understand s′ (or maybe s′(s, a), something like this) to mean the state I end up in if I'm in state s and take action a.

So the reward I'm going to achieve in the whole rest of the game from that state — what is it? Isn't it just going to be the reward from s′? And what action am I going to take there? I take whatever action π(s′) tells me to do, and I do the same for all the s′-like states that come up. I hope that's clear: in the next step, I ask what π tells me to do; that's the action I take; I get some reward and land in a new state; then I ask π again, take that action, get some reward, and so on until the end. And we've split our problem into two parts: the left part here does not depend on π, only on the action a right here.

Now we can think of something — sorry about the Apple Pencil confusion. I just said I have Magnus Carlsen on the phone; why don't I call Magnus Carlsen right now? That's one option; that would just solve the problem. The other thing I could do is say: well, if I have this Q-function available — if I can ask it, for a given action, what reward I will get across the whole game — that means I can define a policy, and we'll call that policy π_Q for now. The policy takes in a state and gives me an action, and I'm just going to say: whatever state I'm in, I always ask my Q-function what to do, and I always take whatever the maximum is. I plug every a into the Q-function, it tells me a number, and I do whatever action leads to the highest number. And that is a policy: if at every step I just do that — just ask my Q-function — that is a policy, because I can plug in a state and I get out an action.

Now again, this Q-function is incomplete; we have to supply a policy, and here is where the recursion comes in. I tell the Q-function: what happens if I do action a right now, and in the future I act according to the policy where, at each step, I ask you, Q-function, what the best action is? I'm making it sound more complicated than it is, but essentially we're defining a policy that is defined by always asking the Q-function what to do right now, given that I intend to follow the Q-function in the future as well. And if the Q-function is actually good, then that is the optimal policy. So we call it Q*: select the action with the maximum value according to the Q-function, while following the optimal policy after that action. And that is called the Bellman equation.
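Written out in standard notation (this formalization is mine — the video only sketches it on the tablet), the recursion discussed above, for deterministic transitions s′ = s′(s, a), is:

```latex
% Q-value of taking action a in state s, then following policy \pi:
Q^{\pi}(s, a) = r(s, a) + \gamma \, Q^{\pi}\!\bigl(s', \pi(s')\bigr)

% For the optimal policy, acting greedily with respect to Q^* itself
% (the Bellman optimality equation):
Q^{*}(s, a) = r(s, a) + \gamma \max_{a'} Q^{*}(s', a')

% which induces the greedy policy:
\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
```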

### How to learn the Q-Function? [26:00]

And it gives us hints on how we can learn these things. Q-learning can be done in many different ways, but essentially it boils down to the following. I am in a state s; I ask my Q-function for max over a of Q(s, a), and it gives me an action — actually, I only do that 1 − ε of the time, and otherwise I take some kind of random action. Explore/exploit, right? Sometimes I want to do something else, because my Q-function might not be the best right now. But wait — I did not set this up correctly — imagine I don't have the Q-function. So far we've imagined this magical Q-function that knows everything about our problem and can tell us accurately what will happen in the future. Obviously, if I had that, I could always ask it which action is best — and that's precisely what we can't do if we don't know anything about the problem. Therefore Q-learning is all about: can we learn this Q-function? The Q-function is now parameterized somehow, and it needs to take a state and an action and give some sort of estimate of R.

Imagine I'm in a reinforcement learning problem: my little figure is here, it goes this way, and — I don't know, let's say it's the dinosaur game, so it's not a person, it's a dinosaur — there's a bird coming at me, and I ask: should I jump or should I duck? Those are the two actions, and let's say I choose either one. As far as I'm aware, Q-learning doesn't even require you to always choose the maximum-Q action; you can do a bit of exploration here. So you've performed some action and gotten some reward for that step. Now what does the Bellman equation tell us? It tells us that the Q-value of state s and action a (for the policy π_Q) should be the reward I get in state s with action a, plus the discounted future reward — the discounted Q-value of being in s′ (did I discount over here earlier? I probably did not). So I've done action a, that led to reward r — maybe the reward is positive because I ducked, got a little further, and that gives me a few points — and I'm in a new state s′: ducked under here, under the bird.

Now what I can do is regress onto myself. Q-learning can be done in multiple ways. The easiest is to have a literal table over states s and actions a: it would tell me, if you're in state 1 and do action 1, your Q-value is 5; state 1, action 2, Q-value 10; state 2, action 1, Q-value 8; and so on. If we don't know anything yet, we just fill it with random values. There's obviously also a different method, which you might be more used to: we build a neural network that takes in a representation of the state — the state goes in as some sort of embedding or image — and has one output for each possible action, each giving you a value. There are also variants where the state and the action are both inputs, encoded together, and the neural network gives you just one output, one number. However you do it, you initialize with random numbers, and then you simply use your own estimate as a target.

You're essentially saying: whatever the Q-function is, if I ask it for the Q-value in this state given an action a — and actually this right-hand side here isn't quite correct — it should give me the same as the reward I'm actually getting by doing this action, plus the future reward from the new state. You'll notice there's an action a here and an action a there; the second one is actually taken according to the policy, it's not the same a as here. But you can see the basic principle: the Q-function needs to fulfill the fundamental Bellman recursion, which means the Q-value in this state must be the reward I get plus the Q-value from the next state onwards. And even if I don't know the Q-function yet, I can regress onto myself. Even if it's a crappy estimate right now: if I know that in state 1, I did this, I went here, that gave me a reward of, say, 4, and my estimate from the next state is 8, then 4 plus 8 is 12 — so this value here really underestimated the total reward I'll get by the end of the episode, and I should change it to 12. If we iterate and iterate, we get better and better estimates. That might fluctuate — that's one of the problems of Q-learning — but you keep taking your own estimates as targets and combining them with the actual rewards you get from the world. You always say: the reward I'm getting, plus what I estimate my future reward to be — that's my target for the total current reward. Thereby you reduce the whole problem to estimating single steps: the reward you get from the world, a real number, needs to be the Q-value in this state minus the (discounted) Q-value in the next state. You train your Q-function so that this difference matches the reward you're getting. You usually only train one of the two terms and keep the other fixed — I guess you could train both at the same time, but I'm not sure that's a good idea.

And that's how we get to this paper. Deep Q-learning considers Atari games, which lend themselves quite well to the combination of deep learning and Q-learning: the input state is quite complex — lots of pixels — so you can't do this with tabular Q-learning; you need a neural network. But the output action space isn't that big: on an Atari you can go left, right, up, down, press a button, and that's kind of it. There are some combo actions, but the action space is very limited, and thus they can actually do Q-learning. You can see here the fundamental equations we've gone over: the Q-function with the star is the one following the optimal policy — my reward from the current step, plus the value of following the best action from the next step forward. This is the actual correct way of writing it; I botched it a little bit earlier, so I hope it becomes clearer now. The Q-function of the optimal policy is the next-step reward plus the Q-function of the optimal policy from the next step on out. Here is their loss function, with the target y_i — there we go — being this reward plus this max term, essentially what we've just seen. The right-hand side must equal the left-hand side: Bellman says that for the optimal policy this must be true — so why don't we just make it true using gradient descent? The gradient of the loss function they consider: you differentiate through the Q-function of the current state, you keep the Q-function of the next state fixed (the reward is fixed anyway), and you take whatever you currently estimate and move it closer to what it should be. Again, that includes your own estimate of the future, and that's why it's a little bit different from supervised learning.

If we think of this in language modeling: you have a sequence of tokens, and if you do autoregressive language modeling, then for each token you output — offset by one position — you actually have a target. From here you predict this, but you have a target; then from these two you predict this one, but you have a target. That's why you can do supervised learning in autoregressive language modeling: you always predict the next token, and that's defined to be either correct or not correct. If we talk about reinforcement learning in language modeling, it's more like the whole sequence is going to be

### Deep Q-Learning [38:00]

predicted first, and maybe at the end you get some reward, and then you need to figure out which of the tokens were the good ones, which were the bad ones, and what you should do next. That's where, currently in RLHF, proximal policy optimization is used — but it's conceivable that Q-learning could be used as well, because in language modeling we have a sequence, or a partial sequence, as the input, and we can encode those really efficiently using Transformers; and we have a fixed set of outputs, namely our token vocabulary. Even though that's quite a big set of outputs, it's fixed, and it defines our action space, so to say. So Q-learning is not too far out there — except that with such large action spaces, in my understanding, it tends to become a little bit brittle.

Let's quickly go through what they do right here. They have a few tricks up their sleeve; one is experience replay, so they don't just always learn on-policy, but keep a buffer of past transitions. In deep Q-learning for Atari, they run episodes: with probability ε they select a random action; otherwise they select the action that maximizes the current estimate of the Q-function. Again, we don't have it; we need to learn it, so at the beginning it's going to be crap, but as we go on it improves, and thus we play better and better games, which leads us into states where we can explore the good paths more and more. Technically, we could just run a purely random exploration in this environment and learn from that; however, that tends not to work well when the state space is really big. If there are many, many states, we don't want purely random exploration, because with high probability we will never reach the interesting states, or not often enough. Therefore we'd rather already start going along the good trajectories as we learn the Q-function, and only with a small probability do something random, so that we explore a little bit and don't get deceived by always heading in what we currently think is the maximum-reward direction — because it could be that right now I have to make a slightly suboptimal move, but beyond it there is much more reward, and that's what exploration is supposed to handle. It doesn't work in all environments; there have been whole books written about the deception of rewards and the deception of goals, notably even in the machine learning field — I don't recall exactly what it's called, but you'll find it.

What they do then is: execute the action, observe the reward and the next image, and set the next state — that's one step. They store the transition, and eventually they sample a mini-batch of transitions. The transitions include the current embedding of the state, the action, the reward, and the next embedding of the state. If it's terminal, they set the target to just the reward; if it's non-terminal, they set the target according to the Bellman recursion, and then they perform a gradient step on the loss function we've seen before. And that's essentially it — and that's what gets you a giant super-duper company that gets acquired by Google. I mean, it takes a bit more than that, but this is really what put them on the map back

### Summary [42:30]

then. So I hope this was a little bit of a dive into Q-learning, into what it does and what it means. Again, we consider the Q-function to be something that tells us: if you did a given action right now, in the state you're currently in, what would the reward be — not just now, but until the end of the game — if you commit to following a certain policy afterwards. Then we said: what if my policy is simply "always ask the Q-function what I should do"? That led us to the realization that this is possible, and in fact it gives us a nice recurrence relation that says the Q-function now is simply a combination of the reward you'll get next step and the Q-function of the next step. That, in turn, led us to a method: okay, but what if I don't know the Q-function? Well, then I can use this recurrence relation to learn it. "Learn it" meaning I can break down the whole problem — what if the reward is far away, and there are many steps, and which piece of reward is responsible for which action, this whole credit-assignment problem — into just learning from single steps. Because whatever else happens, the recurrence relation must hold for a single step: the reward I get must be the difference between what the Q-function told me last step and what it's going to tell me next step. And that I can use to learn the Q-function, at least the Q-function of the optimal policy.

We also discussed that initially Q-learning was really done with tabular Q-values, but more and more, as we move into the modern age, it's done with neural networks, where you encode the state into a neural network and ask it, for each of the actions, what the output would be. This can be done by having one head with an output for each available action, or by also encoding the action and then having one output. Also encoding the action allows you to go to a lot more complex actions, even continuous actions, and so on, but we won't go over that right now. I hope this was at least somewhat informative. And before everyone goes out and runs around and hypothesizes about Q*: keep in mind, it's entirely possible that this has nothing to do with Q* at all — but it could. So stay hydrated, bye-bye.
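To tie the pieces together, the tabular version of everything discussed above — ε-greedy action selection plus the single-step Bellman update — can be sketched in a few lines. This is my own toy illustration (the chain environment, function names, and hyperparameters are made up; they are not from the video or the DQN paper):

```python
import random

# Toy chain MDP: states 0..4, start at 0; action 0 = left, 1 = right.
# Reaching state 4 is terminal and gives reward 1; every other step gives 0,
# so reward only arrives at the very end -- the credit-assignment setting
# discussed in the video.
N_STATES, GOAL = 5, 4

def step(s, a):
    """Deterministic transition: returns (next_state, reward, done)."""
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # tabular Q[s][a], init at 0
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy: explore with prob. epsilon, else act greedily
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda x: Q[s][x])
            s_next, r, done = step(s, a)
            # Bellman / TD update: move Q(s,a) toward r + gamma * max_a' Q(s',a');
            # for terminal transitions the target is just the reward.
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

Q = train()
policy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(policy[:GOAL])  # the greedy policy should learn to go right everywhere
```

Note how the target bootstraps from the table's own current estimate, exactly the "use your own estimate as a target" idea from the video; with γ = 0.9, the learned values approach 1, 0.9, 0.81, 0.729 for the right action as you walk back from the goal.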
