Temporal Difference Learning (including Q-Learning) | Reinforcement Learning Part 4




Table of Contents (6 segments)

Segment 1 (00:00 - 05:00)

If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal difference learning. That's coming from Richard Sutton and Andrew Barto, some of the pioneers of modern reinforcement learning, so yes, I think this is worthy of a video. In fact, this is part four of my six-part series on RL fundamentals. My approach will be to present temporal difference learning and the related algorithms of Q-learning, SARSA, and Expected SARSA as the result of making a series of adjustments starting from Monte Carlo methods, which we're already familiar with. The hope is to compress the subject so we can understand a lot in a short amount of time, and I do mean a lot. Among other things, we'll do a careful walkthrough of the mechanics, understand and compare performance, and get down to the fine-grained actions the algorithms take in different circumstances. If you're trying to understand RL, this will be a pretty efficient use of your time, so in the spirit of that efficiency, let's get started.

Now, normally I do a review of the earlier videos, but there's just too much. Parts one to three altogether are a little over an hour, which is the equivalent of, and I did the math, 73,000 Wikipedia articles. You just can't summarize that. So instead of a review, I'm going to just say: watch the previous videos, or at least watch the video right before this one. Otherwise you'll be completely lost. From this moment, I'll assume you've done that.

Okay, now, as mentioned, my strategy is to propose temporal difference learning as an adjusted version of Monte Carlo, specifically the plain on-policy constant-alpha MC algorithm we covered in the last video. If you recall, which you'd better, almost all RL algorithms fall within the generalized policy iteration paradigm, where there are two components: policy evaluation, where we learn the value function of a fixed policy, and policy improvement, where we use the policy's value function to create a slightly improved policy. Now, the adjustment we'll make sits entirely within evaluation. That means we can study a Markov reward process, which is essentially a Markov decision process without actions. Also, we may reference state values instead of action values, knowing that these evaluation methods can be applied to both. If that doesn't make sense to you, I think you know what to do.

Okay, the adjustment addresses a disabling feature of Monte Carlo: an episode must complete before values can be updated. That's a problem because, if episodes are long, learning can be really slow. The transitions and rewards within an episode contain useful information which MC delays learning from. If we could learn from those, that would change our policy within an episode, which would change, and likely improve, how we explore that episode. This changes the algorithm's behavior to something that's often more efficient than MC. So addressing this is important, but I should note we'll see the adjustment has other effects beyond merely fixing this issue.

Now, to visualize the adjustment to evaluation, I'll use a simple Markov reward process. We have four possible states: s1, s2, s3, and sT, where sT is a terminal state. The possible rewards are the integers from minus three to three, and there's no discounting. Just like in Monte Carlo, we don't get to know the distribution function. From here, we can visualize our sequence of states over time with this grid, and our rewards with this. Let me explain by walking through an episode. For time t = 0, we sample our first state, which happens to be s3. Then, with a distribution that depends on that state, we sample a pair of a reward and the next state. We do this again and again until the end of the episode. The reason I'm representing things this way is that it makes sums of rewards clear. Each of these horizontal lines indicates the reward accumulated so far in the episode, and that makes it easy to picture the returns. For example, the return G3 is just the difference between the levels at t = 3 and at the end of the episode.

Next, let's remember our goal in evaluation: it's to estimate the value of each state, which we do with a table we refer to with a capital V. What happens in constant-alpha MC is we sample an episode to completion, then we apply this update rule for each time step of the episode. Let's see that for a few more episodes. Now let's get into the temporal difference method. Let's say we just sampled this episode and we'd like to apply the update for the state at t = 2. As mentioned, the issue with MC is that we need to observe until the end of the episode.
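To make the constant-alpha MC update concrete, here's a minimal Python sketch. This isn't the video's code: the function name and the episode representation (parallel lists, where `rewards[t]` is the reward received on the transition out of `states[t]`) are my own assumptions.

```python
def mc_update(V, states, rewards, alpha=0.1, gamma=1.0):
    """Constant-alpha MC: after a completed episode, apply
    V(s_t) <- V(s_t) + alpha * (G_t - V(s_t)) for every time step t."""
    G = 0.0
    # Work backwards so each return G_t is built incrementally.
    for t in reversed(range(len(states))):
        G = rewards[t] + gamma * G
        V[states[t]] += alpha * (G - V[states[t]])
    return V
```

Note the whole episode must be in hand before the loop runs; that's exactly the limitation TD will relax.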

Segment 2 (05:00 - 10:00)

So let's compromise that a bit. Let's say we just had to make do with only observing up until four steps after t = 2, meaning we don't see the rewards beyond that point. The number of steps is an algorithmic input, so we call it, more generally, n. This means we need to update this term in the update rule: since it's only known at the end of the episode, we need to replace it with something. Well, we've observed some portion of the return, so we should certainly use that. The question is what we use in place of the unobserved part. Hmm. Well, these things add up to G6, so we need something to approximate that. Wait a minute: that's exactly what our value function does. We can use the value of the last observed state as an approximation to G6. Here that value is zero, but during the algorithm it'll do the trick. And now we can see the TD method in action. You'll have to forgive me, it's tricky to show the adjusted return like I did for MC a second ago, so I'll leave that out. But outside of that, this is what it looks like to process a single episode. Notice how the value updates are applied within the episode and are lagged by four steps. And obviously this process repeats. Very fortunately, just like in MC, we ultimately converge to the right answer.

Okay, let's now review this more generally. It's called n-step temporal difference learning, and it's easiest to understand if we recall the Monte Carlo approach, where we used this update rule for evaluation. To be clear, this refers to the state at time t of the nth episode. Now, in this rule, we call this the target: it's what we move our estimate towards on each step. In n-step TD, we replace the target with this, which is the sum of discounted rewards between t and t + n, plus our existing value estimate, discounted by n steps, to approximate the remaining rewards. The reason we do this is that it's what's observable at time t + n. Also, things are appropriately set to zero when t + n is beyond the end of the episode.
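As a sketch of the n-step TD target just described, using the same episode representation as before (parallel `states`/`rewards` lists; the names are mine, not the video's):

```python
def n_step_td_update(V, states, rewards, t, n, alpha=0.1, gamma=1.0):
    """Update V for the state visited at time t, once time t+n (or the
    end of the episode) has been observed."""
    T = len(states)  # episode length; rewards[k] follows states[k]
    # Observed part of the return: rewards from t up to t+n-1 (or the end).
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, min(t + n, T)))
    if t + n < T:
        # Bootstrap: approximate the remaining return with V(s_{t+n}).
        G += gamma ** n * V[states[t + n]]
    V[states[t]] += alpha * (G - V[states[t]])
```

When t + n runs past the episode's end, the bootstrap term is simply dropped, which matches the "set to zero" behavior above; with n large enough, this reduces to the MC update.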
Now let's talk about n. As mentioned, it sets the lag between the time of the state whose value is being updated and the time of the most recent observation. It's worth emphasizing that, just like alpha, n must be specified up front, before you run the algorithm. If you set this integer to 1, the algorithm is very eager to make updates: values are updated after each individual reward. If n is infinite, then TD is identical to MC, since the delay is always greater than the episode length. Also, some terminology: using a value estimate in the target is called bootstrapping. As we'll see, bootstrapping has major implications, so let's see that by comparing TD's performance to MC on a pure evaluation task: the random walk example from the previous video. To refresh, it's a Markov reward process with seven states, two of which are terminal. All rewards are zero except when transitioning to the upper terminal state, in which case the reward is one. Also, there's no discounting. The initial state will always be the center, and each transition is a 50% chance of going up or down, so one episode looks like this. As a reminder, when doing evaluation, the goal is to estimate the expected return from each state. Like last time, we can show the true values like this. Our algorithms begin by initializing estimates at one half, and we can measure performance using the absolute difference between our estimates and the true values. To keep track over time, we'll plot the average absolute difference against the number of episodes processed. To demonstrate, I'll run Monte Carlo with alpha equal to 0.03, which looks like this. Hopefully this is nothing new. Okay, now we can see TD. I'm going to use an alpha of 0.1 and an n of 1, which gives us this. It looks like TD does a better job, certainly in the short term, but this isn't a complete comparison. The randomness of a single run and the specific choices of alpha and n make this view a bit too specific. So let's try something different. I'll use the same plot of error versus episodes, but this time I'll include sliders for MC's alpha, TD's alpha, and TD's n. But keep in mind, n is an integer.
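The random walk experiment above can be reconstructed in a few lines. This is my own minimal version, not the video's code; the state numbering, seed, and parameter values are assumptions.

```python
import random

def random_walk_td(n_episodes=2000, alpha=0.05, seed=0):
    """One-step TD evaluation on the 7-state random walk:
    states 0 and 6 are terminal, episodes start in the center (state 3),
    reward is +1 only on the transition into state 6, gamma = 1."""
    rng = random.Random(seed)
    V = [0.0] + [0.5] * 5 + [0.0]            # terminal values fixed at 0
    for _ in range(n_episodes):
        s = 3
        while s not in (0, 6):
            s2 = s + rng.choice((-1, 1))     # 50% up, 50% down
            r = 1.0 if s2 == 6 else 0.0
            V[s] += alpha * (r + V[s2] - V[s])  # one-step TD update
            s = s2
    return V[1:6]  # estimates for the five nonterminal states
```

The true values are 1/6 through 5/6, and with a small alpha the estimates hover near them.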

Segment 3 (10:00 - 15:00)

Here I'm starting it at n = 4. A few other changes: instead of using seven states, I'll use 11. This will allow us to investigate a larger range of n's. Also, instead of running the algorithms just once, I'll run them 200 times, which gives a whole distribution of performance. For instance, this is the performance of MC, where the line gives the performance averaged over all 200 runs and the band is plus or minus one standard deviation. And now for the punch line: this is what TD's performance looks like. Not only does it do better faster, but it has a tighter, more certain distribution of performance. At least for these choices of alpha and n, TD performs better and is less risky. Now let's move around MC's alpha. As we can see, and as we already expected, over a certain range a higher alpha makes for better early performance but makes performance quite noisy in the end. Let's freeze this at an alpha that performs well. Now for TD. The first thing to recognize is that if we set n to a very high value and change alpha to be the same as MC's, TD is identical to MC. So TD includes MC as an option. But if we set n to 1, then we can do a lot better by increasing alpha. This is because a small n reduces the variance in the updates, so you can afford to take larger steps. Okay, now just to get a feel for the space of behaviors, I'll cycle through a few n's and for each vary the alphas. Hopefully this gives you a sense of how performance can change as we vary these knobs. If you'd like to explore this more, I've linked code in the description you can use to play with this. There you can find a more fleshed-out discussion of these two algorithms from this angle.

But now let's try a more direct perspective for viewing TD's trade-off between n and alpha. What we'll look at is the average error after 10 episodes. What I mean is this: here we have alpha along the horizontal axis and n along the vertical axis, on a log scale. Using colors, I'll show the average error after processing the 10th episode. For example, this is what errors look like for n = 8. I've normalized the errors to have a value of 1 when they achieve their best value for a particular n, and that's colored as blue. That means around here we achieve the best performance for n = 8. Now, if we do that for the remaining values of n, we get the full picture. Clearly, for lower values of n, best performance is achieved at higher values of alpha. So we see there's an interdependence between n and alpha for achieving good performance.

Still, observing performance like this doesn't explain why they perform differently. Sutton and Barto's book has a very interesting way of explaining it. The argument goes like this. We'll compare MC to one-step TD, and we'll adjust their algorithms slightly by using batch training. I won't go into the details, but the important thing is that a fixed batch of episodes is processed repeatedly by each algorithm until their estimates stop changing. One reason this makes for a better comparison is that the final estimates arrived at do not depend on alpha, as long as alpha is sufficiently small. And here's the interesting thing: MC and TD arrive at different estimates, and their criteria for those estimates are different. It turns out MC's criterion is to minimize the mean squared error over the data, which is a known property of taking an average, while TD's criterion is to maximize the likelihood of the Markov reward process, the thing that we are assuming generated our data. I'll spell that out a little more. The Markov reward process is defined by a distribution P; it gives the probability of the next state and reward from the current state. Now, a very natural model-based approach would be to estimate P directly by counting transitions. That is, we form our estimate for the probability as the count of transitions from one state to the next state and reward, divided by the count of visits to the starting state. This is the simplest version of maximum likelihood estimation for a discrete distribution. From here, we can use this model to estimate the expected return from any given state. Now here's the cool thing: this is equivalent to one-step batch TD. Without actually doing this approach, batch TD arrives at the same answer. Not sure about you, but I thought this was pretty unexpected, and it's a nice way to conceptualize TD. So with that explained, I can state the punch line: TD's criterion is better for predicting returns. My impression is that's because TD more directly uses the assumptions of the MRP. When we use TD, it's as though we are directly modeling the data-generating process. MC, on the other hand, is just an averaging of the returns from each state, which doesn't reference the MRP. And that in fact has its advantages: MC is more robust to violations of the Markov property.
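The maximum-likelihood model idea above can be sketched directly: tally observed transitions to fit the empirical MRP, then compute expected returns under that model. This is my own illustration of the argument (names and data format are assumptions); the claim from Sutton and Barto is that one-step batch TD converges to these same values.

```python
from collections import defaultdict

def mle_state_values(transitions, gamma=1.0, sweeps=500):
    """Fit the maximum-likelihood MRP from (s, r, s2) triples pooled over
    a batch of episodes, then compute expected returns under it.
    Any s2 never seen as a source state acts as terminal (value 0)."""
    outcomes = defaultdict(list)
    for s, r, s2 in transitions:
        outcomes[s].append((r, s2))
    V = defaultdict(float)
    for _ in range(sweeps):
        for s, outs in outcomes.items():
            # Empirical expectation of r + gamma * V(s') over observed outcomes.
            V[s] = sum(r + gamma * V[s2] for r, s2 in outs) / len(outs)
    return dict(V)
```

On the classic two-state batch from Sutton and Barto's "You are the Predictor" example (one A-to-B-to-terminal episode with rewards 0, plus six B-to-terminal episodes with reward 1 and one with reward 0), this gives V(A) = V(B) = 0.75, whereas batch MC gives V(A) = 0.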

Segment 4 (15:00 - 20:00)

But that robustness comes at the cost of doing less well when those assumptions do in fact hold. And one last detail: we can understand the non-batch versions of TD and MC as roughly moving in the direction of what their respective batch versions would do. So that's the argument. Okay, that's a lot. If you've watched from the first video and you've made it this far, this has required a lot of focus, so pat yourself on the back and give yourself a two-to-three-second break. Great, let's keep going.

Now we're getting into the full-fledged algorithms that do the whole of generalized policy iteration, that is, policy evaluation and policy improvement. To start, we'll discuss on-policy TD control, what is called n-step SARSA. As a reminder, because this is a form of model-free control, we need to apply our algorithms to action values, not state values. If you don't understand why, you've clicked on the wrong video. Now, the first step is to redefine the return like this. The change here is that the rewards beyond n steps in the future are approximated with an action value, not a state value like we saw earlier. This means our update rule is slightly adjusted to this. It's just the update we covered for MC control in the last video, except we've changed the return. In fact, it's most efficient to understand TD control as just some modifications of the MC control algorithm we covered, which, to remind you, looks like this. And no, I'm not walking through this beast again. The main differences from MC control are: we change the return definition, as mentioned, and the update rule is applied during each episode, with a delay of n steps, as we covered at the start of the video. There's a bit more subtlety than that, since we're interweaving value updates and taking actions, but that's a level of detail I'll ignore. So, the burning question: why do we call this SARSA? Well, it comes from one-step SARSA. In that case, every update operates on some tuple of state, action, reward, state, action: S-A-R-S-A. SARSA.

Okay, but naming aside, an important question to ask is: what's the advantage of having n greater than one? After all, my comparison of TD to MC seems to suggest TD is better, and increasing n just makes the algorithm more like MC. Well, there certainly is a reason, and it's that the best n is problem-dependent. To see this, let's imagine a grid where we start here. This is a deterministic environment where we have four available actions: up, down, left, or right, as long as the action wouldn't move the state beyond a wall. Our goal is to get here, where the episode ends. All transitions into the goal state yield a reward of one, all other transitions zero, and gamma will be 0.95, to encourage the agent to get to the goal state quickly. So that's the game. Now let's use eight-step TD and observe the first episode: we just blindly bounce around until we hit the goal state. But here's the important thing. Let's ask which action values get updated following this first episode. It would be the eight transitions that immediately preceded hitting the goal state. Now, this becomes an advantage on our second run, because we are a lot more likely to stumble upon action values we've previously updated than the actual goal state itself. Once we're there, we're very likely, though not guaranteed, to follow the path to the goal state. This creates another set of updates, which helps learning in the next episode. As you can imagine, if we run this many times, we'll eventually uncover the optimal strategy. So what's the advantage of a larger n? It means we learn more from a single episode, and that quickly changes how we search the space, which often makes learning more efficient. But again, as we saw earlier, a larger n means noisier updates, so it's a trade-off.

Okay, at this point we're in a really convenient position to learn two other algorithms. The first is Q-learning. To understand this, I'll pose it as an adjustment to one-step TD control; it's easy to extend to n steps, the notation is just heavier. Okay, so the primary adjustment is to the target. In one-step TD, the target is this. Remember, the target is what estimates are moved towards on each update. Now, in Q-learning, the target gets changed to this. What's happening here is, instead of using the action value of the upcoming state-action pair to approximate the future rewards, we are using the max action value over all actions in the next state. And that in fact makes this an off-policy method. That's because the max operator is the presumption of a different policy: one in which we select actions by taking the max action value. That policy could be different from the policy generating the data, which could be, for example, an epsilon-greedy policy. And this has an important and subtle implication.
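Stepping back, the two one-step targets we now have, SARSA's and Q-learning's, can be put side by side in code. A minimal sketch in my own notation (not the video's code); `Q` is assumed to be a dict keyed by `(state, action)` that defaults to 0, which also covers terminal states.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=1.0):
    # On-policy: the target uses the action a2 the agent actually takes next.
    target = r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.5, gamma=1.0):
    # Off-policy: the target maxes over all actions in the next state,
    # regardless of which action the behavior policy will take.
    target = r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

The only difference is the last term of the target: a sampled next action value versus a max over next action values, and that one change is what makes Q-learning off-policy.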

Segment 5 (20:00 - 25:00)

It means that while we are taking actions according to whatever our behavior policy is, we are directly targeting the problem's optimal action value function. That's different from on-policy MC control, where we approximate the optimal policy with the best behavior policy we end with. Also, relative to one-step TD, there's a change to the update's timing. To see that, let's write out a sample trajectory for some episode m. In one-step TD, we update the Q table here, which follows the completion of each SARSA tuple. For one-step Q-learning, we make the updates here. The updates happen before the t + 1 action, because that action isn't part of the update rule; the maxing action takes its place. And that does it for Q-learning. We'll see how it performs in a minute. For now, we're on to Expected SARSA. Confusingly, this is an adjustment to the one-step Q-learning algorithm, not SARSA. The adjustment is to change the target to this. What we have here is our next-step action values weighted by our policy's action probabilities. As we'll see, using an average instead of relying on sampled values, as we would in SARSA, has the advantage of being less noisy, at the cost of a more expensive computation. And one last detail: as presented, this is an on-policy method, but it's very easy to make it off-policy. If the policy we use in the target is not equal to the policy generating the data, we'd have an off-policy method. Easy enough.

And finally, we can do some examples. For SARSA, Q-learning, and Expected SARSA, we'll look at their performance on two examples. In these comparisons, we'll only cover the one-step case, because that's the version I just showed, and otherwise it would be a very heavy lift visually. The first example is called cliff walking; for animation purposes, my version is a little different from Sutton and Barto's. Okay, as is typical, it involves a grid where we start here and want to get here, which ends the episode. We have actions like these available to us, which deterministically shift the state to the neighboring state. And this is the cliff: if the agent moves into it, they get a reward of -100 and the episode ends. Otherwise, all transitions yield a reward of -1, and there's no discounting. So we'd like to get to the goal state as fast as possible while avoiding the cliff; in fact, this is the optimal path. Now, for all episodes we'll be using an epsilon of 0.1 and an alpha of 0.5. We'll be keeping score here, where the horizontal axis is m, the number of episodes processed, and the vertical axis is the sum of rewards, which is G0. So when we run SARSA for the first time, we get a pretty terrible sum of rewards, but we'll get better. Let's try again, and again. In fact, let's just do it many more times. After we've done so, this is what a typical episode looks like. Now, to evaluate the algorithm, we should rerun it, say, 400 times. In that case, the average sum of rewards looks like this. Naturally, we can also plot Q-learning, which is quite a bit worse. So what's happening? It comes down to this: in the typical case, SARSA is executing this path, whereas Q-learning is executing this one, which, due to the epsilon chance of exploration, is rather costly to travel along. To understand this, let's recall their targets. SARSA's update uses the next-step action it actually takes, which is sometimes an exploratory action. Therefore, it's estimating returns for the policy that includes epsilon exploration, and because of that, it knows that going near the edge is costly, and so avoids it. However, Q-learning is doing something different. The target has this max operator, which means the Q table isn't estimating the policy used to generate the data; rather, it's estimating the policy defined by taking the max action value, which is very similar to the policy generating the data, except it doesn't include epsilon exploration. So it doesn't in fact think the edge is costly; that cost is due only to epsilon exploration. In fact, if we set epsilon to zero at the end, Q-learning would execute the optimal path. Now, what about Expected SARSA, which has this performance, and whose most typical path at the end is this? Basically, Expected SARSA, on a per-episode basis, is a more efficient version of SARSA. Again, it can be understood by the difference in targets. For SARSA, we know that this action comes from the policy distribution. In Expected SARSA, we are just using that distribution directly to calculate the expected next action value. In other words, what SARSA does with sampling, Expected SARSA does with an expected value, and that eliminates one source of sampling error and makes the updates more accurate. Though it should be noted, the calculation of the target is more expensive for Expected SARSA, since it involves a weighted sum.
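Expected SARSA's target, with epsilon-greedy action probabilities, can be sketched like this (again my own minimal version, with `Q` a dict defaulting to 0; names and tie-breaking are assumptions):

```python
from collections import defaultdict

def expected_sarsa_update(Q, s, a, r, s2, actions,
                          alpha=0.5, gamma=1.0, epsilon=0.1):
    # Epsilon-greedy probabilities over actions in the next state.
    greedy = max(actions, key=lambda b: Q[(s2, b)])
    probs = {b: epsilon / len(actions) for b in actions}
    probs[greedy] += 1.0 - epsilon
    # Target uses the expectation of Q(s', .) under the policy,
    # instead of a single sampled next action value as in SARSA.
    expected_q = sum(probs[b] * Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * expected_q - Q[(s, a)])
```

Swapping in a different policy's probabilities for `probs` is exactly the off-policy variant mentioned above.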

Segment 6 (25:00 - 28:00)

Let's do one more example, this time one that isn't deterministic. This one is called windy gridworld, and again my version is slightly different from the book's. Again, it's based on a grid where we start here and want to get here, with the typical actions. However, now we have some wind blowing up. These numbers tell us how much our next move is shifted up. For example, if this is the state and the action is to go right, then this number says the next step will be one position above the state pointed to, which is here. Let's also add some randomness: if the wind is not zero, then we will shock its value by minus one, zero, or one, each with equal probability. That means if the agent selects go-right repeatedly, it would look something like this. Okay, now I'll show you the algorithms' average performance over a thousand episodes. This is SARSA, this is Q-learning, and this is Expected SARSA. One thing that's clear is that Q-learning learns the fastest. In the long run, it looks like Expected SARSA and Q-learning are about the same, and SARSA is slightly worse than both. Alright, I'm not sure what explains this difference. Maybe it'll help to consider one run of each algorithm and inspect the paths they took after a certain number of episodes. So this is the first path taken by SARSA, one taken by Q-learning, and Expected SARSA. Alright, we obviously need to see more than that. Over here I'll show how many episodes have been processed. Now, here goes. Okay, that was cool to see, and it shows you how these algorithms bumble around to find the right answer, but it didn't explain their difference in performance. For that, I don't have much of an answer. It makes sense that Expected SARSA is better than SARSA in the long run, since it reduces a source of noise. Q-learning maybe is more aggressive in pursuing strategies that might be optimal, leading it to more quickly go rightward. I'm really not sure, but I guess I shouldn't be surprised: there's no obvious theory as to why one algorithm is always better than another. If that were true, the losing algorithm wouldn't be used, but we know they're all used; the best one really must be problem-dependent. And I think this is a typical experience in RL: you see some performance differences across multiple algorithms, and it's very hard to attach a story as to what explains the difference. It's just a really hard problem. Speaking of problems, you have one: you understand reinforcement learning only up to the tabular case, the case where all possible states can be listed discretely in a table. But the real world is much more complicated than that. For that extreme challenge, you'll need to understand function approximation. If only there were a video explaining that topic.
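For reference, the windy gridworld transition described above can be sketched as follows. The wind column values are the ones from Sutton and Barto's version (0, 0, 0, 1, 1, 1, 2, 2, 1, 0); the grid size, coordinate convention, and function name are my assumptions, and the plus-or-minus-one shock matches the stochastic variant in the video.

```python
import random

def windy_step(state, action, rng,
               width=10, height=7,
               wind=(0, 0, 0, 1, 1, 1, 2, 2, 1, 0)):
    """One transition of the stochastic windy gridworld.
    state = (row, col), row 0 at the bottom; wind pushes the agent up
    by wind[col] of the column it moves from, shocked by -1/0/+1
    (equal odds) whenever the wind is nonzero."""
    moves = {'up': (1, 0), 'down': (-1, 0), 'left': (0, -1), 'right': (0, 1)}
    dr, dc = moves[action]
    row, col = state
    w = wind[col]
    if w != 0:
        w += rng.choice((-1, 0, 1))          # stochastic wind shock
    new_row = min(height - 1, max(0, row + dr + w))   # clamp to the grid
    new_col = min(width - 1, max(0, col + dc))
    return (new_row, new_col)
```

Plugging this step function into any of the three update rules above gives a small tabular agent for this environment.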

More videos from this author: Mutual Information
