EfficientZero: Mastering Atari Games with Limited Data (Machine Learning Research Paper Explained)

Yannic Kilcher · 03.11.2021 · 25,738 views · 886 likes


Video description
#efficientzero #muzero #atari

Reinforcement learning methods are notoriously data-hungry. Notably, MuZero learns a latent world model just from scalar feedback of reward and policy predictions, and therefore relies on scale to perform well. However, most RL algorithms fail when presented with very little data. EfficientZero makes several improvements over MuZero that allow it to learn from astonishingly small amounts of data and outperform other methods by a large margin in the low-sample setting. This could be a staple algorithm for future RL research.

OUTLINE:
0:00 - Intro & Outline
2:30 - MuZero Recap
10:50 - EfficientZero improvements
14:15 - Self-Supervised consistency loss
17:50 - End-to-end prediction of the value prefix
20:40 - Model-based off-policy correction
25:45 - Experimental Results & Conclusion

Paper: https://arxiv.org/abs/2111.00210
Code: https://github.com/YeWR/EfficientZero (note: code not there yet as of release of this video)

Abstract: Reinforcement learning has achieved great success in many applications. However, sample efficiency remains a key challenge, with prominent methods requiring millions (or even billions) of environment steps to train. Recently, there has been significant progress in sample efficient image-based RL algorithms; however, consistent human-level performance on the Atari game benchmark remains an elusive goal. We propose a sample efficient model-based visual RL algorithm built on MuZero, which we name EfficientZero. Our method achieves 190.4% mean human performance and 116.0% median performance on the Atari 100k benchmark with only two hours of real-time game experience and outperforms the state SAC in some tasks on the DMControl 100k benchmark. This is the first time an algorithm achieves super-human performance on Atari games with such little data. EfficientZero's performance is also close to DQN's performance at 200 million frames while we consume 500 times less data. EfficientZero's low sample complexity and high performance can bring RL closer to real-world applicability. We implement our algorithm in an easy-to-understand manner and it is available at this https URL. We hope it will accelerate the research of MCTS-based RL algorithms in the wider community.

Authors: Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, Yang Gao

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Table of contents (7 segments)

Intro & Outline

Hi there. Today we're going to look at "Mastering Atari Games with Limited Data" by Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, and Yang Gao. This paper presents EfficientZero, a model that can do reinforcement learning with severely limited data. The paper tackles the Atari 100k benchmark, which means learning the Atari benchmark as a reinforcement learning task, as for example Deep Q-Networks did, but you only get 100k transitions: that's about two hours of real-time data to work with, and after that the model is supposed to be able to play Atari. This is a variant of MuZero, which is an insanely data-intensive reinforcement learning algorithm, and it introduces various tricks and amendments to MuZero to make it more sample efficient. When we look at this paper, you can see the gist of it right here: on the Atari 100k benchmark, a lot of the other reinforcement learning algorithms fail to even reach human-level performance, whereas this new algorithm out-competes not only the other RL algorithms in this low-data regime but also the humans. They say EfficientZero's performance is close to DQN's performance at 200 million frames while consuming 500 times less data, and that EfficientZero's low sample complexity and high performance can bring RL closer to real-world applicability. They even say they implement their algorithm in an easy-to-understand manner and that it is available at this GitHub address. So the code is out there; especially if you want to do reinforcement learning but you don't have much compute, time, or money, this might be for you. We'll go through the paper and see what the improvements are. There isn't a single improvement; there are many, three big ones to be exact. If you like content like this, don't hesitate to subscribe and tell your friends and family and professors, I guess. All right, so we'll first take a small look at what

MuZero Recap

MuZero does, just as a recap. I have done a video on MuZero, but if you haven't seen it, here is a very short introduction to the algorithm. In a classic reinforcement learning setting you have the basic setup: there is the environment and there is the actor. The environment gives the actor some observation at time step t, the actor uses that observation to come up with an action at time step t, and then the environment gives the actor back a reward for that time step and the next observation at t+1, and so on. The question is how the actor is supposed to come up with this action, given the past observations it has seen from the environment, in order to maximize all of the reward it gets.

In the simpler reinforcement learning algorithms, people do model-free reinforcement learning: they take the series of observations seen so far, stick it into a big neural network, and train the network to output an action, optimizing it to maximize the reward, usually with some sort of policy gradient. This is a rather direct way; we call it model-free because you directly predict the action without an explicit model of the world.

Now, when the environment is well described, for example a chess board, you know the rules, you know everything that's going to happen, so you can use a model of it. Tic-tac-toe is maybe a better example: from the observation I can actually construct the board I'm in, and then I can search. I can try things out: what if I play here? Then my opponent is certainly going to respond there; and what if I play somewhere else, then my opponent responds and wins. Usually you visualize this as a tree: you are at a root node, that's your state, and you have several options; for each of them your opponent has several options (or, in a one-player game, you have several options again), and so on. You want to search this tree for the best possible path, and this is what things like AlphaGo and AlphaZero did: they have this explicit model and they search through it. The neural networks no longer predict actions directly; they guide the search, essentially voting on which paths of the tree to explore, because the tree quickly becomes too large to explore as a whole. More than a few moves ahead, the number of possibilities gets giant, especially in a game like Go. So the neural networks guide the tree search, and these techniques center around Monte Carlo tree search, because at some point you abort the search and simply play one game to the end as an approximation of what happens. I'm not going to go into that in depth here.

What MuZero says is: this whole tree-search business only works if I have an explicit model of the world, such as the tic-tac-toe board, where it is clearly defined how things work, where I can have a simulator, rewind, and try again. This doesn't hold when you're interacting with any real-world thing, or even with the Atari benchmark. In Atari there are hacks where you can save the emulator state and so on, but essentially you're not supposed to go back or forward in time; you're not supposed to try something out and then say, well, that didn't work, let me search a different path in the tree instead. So in the absence of a model, people try to learn one, and there are many ways of doing this. What MuZero does is learn a latent model of the environment. How does that look? You have the current observation at time t, and MuZero uses a neural network (I think they call it h) to map that observation into a hidden state, and then it plans using the hidden state. It says: I'm not going to predict what the next observation is going to be, like the tic-tac-toe board; I'm only going to predict what the next hidden state will be at t+1, t+2, t+3, depending on which action I take. And from each hidden state I predict the reward for transitioning there, my own policy (which is a bit weird that you have to do, but you have to), and the value, where the value is my future reward from that point on. These are the things MuZero predicts, and with that it is able to search this latent tree. We might label the progression something like REINFORCE, then AlphaZero, then MuZero: the difference to AlphaZero is that we no longer have an explicit model, so in order to do tree search we have to learn one, and the model MuZero learns lives purely in latent space. It doesn't predict future observations, and it learns everything from the signals it gets: it predicts the reward, its own policy, and the future value, and those are the only learning signals for the world model. That is good because it focuses the algorithm on what's essential, getting the maximum reward; the more the learning signals center around that, the better. But it also means learning an entire world model from signals as sparse as the reward, so it uses a lot of data, and that is essentially the catch. We're not going to go into exactly how MuZero does Monte Carlo tree search; it balances exploration and exploitation by essentially using an upper-confidence-bound formula. So EfficientZero comes along and says MuZero has three main weaknesses.
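The latent rollout at the heart of MuZero can be sketched in a few lines of numpy. Everything here (the dimensions, the random linear maps, the names h, g, f) is a hypothetical toy stand-in for illustration, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for MuZero's three networks:
#   h: representation  (observation -> hidden state)
#   g: dynamics        (hidden state, action -> next hidden state, reward)
#   f: prediction      (hidden state -> policy logits, value)
W_h = rng.normal(size=(8, 4))   # observation dim 8 -> hidden dim 4
W_g = rng.normal(size=(6, 5))   # hidden + one-hot action -> hidden + reward
W_f = rng.normal(size=(4, 3))   # hidden -> 2 policy logits + value

def h(obs):
    return np.tanh(obs @ W_h)

def g(s, a):
    """Latent dynamics: plans entirely in hidden space and never
    predicts the next observation itself."""
    x = np.concatenate([s, np.eye(2)[a]]) @ W_g
    return np.tanh(x[:4]), float(x[4])   # next hidden state, predicted reward

def f(s):
    x = s @ W_f
    return x[:2], float(x[2])            # policy logits, value

# Imagine a 3-step trajectory purely in latent space and score it the way
# the tree search does: summed predicted rewards plus a bootstrap value.
s = h(rng.normal(size=8))
total_reward = 0.0
for a in [0, 1, 0]:
    s, r = g(s, a)
    total_reward += r
_, v = f(s)
q_estimate = total_reward + v
```

In the real algorithm these three functions are deep networks trained jointly, and the search expands many such imagined paths, guided by the policy logits and an upper-confidence-bound rule.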

EfficientZero improvements

First of all, they say there is a lack of supervision on the environment model. That's what I just described: the latent model of the environment is learned purely from the reward and value signals, which are single numbers, and asking the model to learn a transition function for the environment from that alone is a big ask and of course needs a lot of data. The second weakness is the hardness of dealing with aleatoric uncertainty (I've given up on remembering which one is aleatoric and which is the other one, epistemic). Let's just read the paragraph: the predicted rewards have large prediction errors, so if there is uncertainty in the environment, for example if the environment is hard to model, the reward prediction errors will accumulate when expanding the Monte Carlo tree search to a large depth, resulting in suboptimal performance in exploration and evaluation. What they mean is this: if the reward I predict here has a bit of an error, and as I search these branches the reward at each next step also has a bit of an error, those errors matter. In the end I have a path; I don't go all the way to a terminal state, I stop after a while, and the value of the path is the sum of the rewards that led me here plus the value I predict from that point onward. If all of those little rewards carry little errors, that quickly adds up to a big error. That's their second criticism, something we'll have to solve. And thirdly, off-policy issues with the multi-step value. That is a general thing in these reinforcement learning algorithms: the more distributed you make them, the worse it gets. What people usually do is have a learner box in the middle (there's a neural network there) and a lot of actor machines, so training and interacting with the environment are distributed, with the actors sending data back to a replay buffer somewhere. That means the neural network at the learner is not the same one that generated the data, because the data is somewhat old, and by the time you use it for training, the network has already learned from other data. So you get an off-policy issue even though it's an on-policy algorithm. MuZero does a little bit to correct this, but they say it has to be done more thoroughly. So how do they tackle these?

Self-Supervised consistency loss

Now we tackle these three things. The first thing they address is the lack of supervision on the environment model, and what they do is add a self-supervised consistency loss. Remember that we map the observation at time t to a hidden state at time t, and then use the latent model to predict, for a given action, the state at time t+1; that's an estimate. What this paper says is: wait a minute, if we simply look at what happens in the real world, the observation at t+1, and send it through the same encoding function, that gives us the hidden state at time t+1. Technically these two things should be equal: the hidden state at t+1 and the estimated hidden state at t+1 should be more or less the same. So they use a self-supervised consistency loss adapted from SimSiam. SimSiam is a self-supervised (contrastive-style) learning framework usually applied to two differently augmented versions of the same image, making their representations equal so the model learns to ignore the augmentation; that's how you train self-supervised image models. Here we don't augment differently. Instead, we take the observation at time t and map it through the function that gives us the estimate of the next state, take the observation at time t+1 and map it through the encoder, and use a similarity loss to pull those two together. The transition function and the representation function are thereby trained to make the next hidden state and its estimate similar; in fact, only the left branch is trained, but that branch includes both the representation function and the next-state function. You might ask, and this is kind of the first question everyone has about MuZero: why was this not done already? If you look at MuZero's loss, you can pretty easily see it is possible, and I think the MuZero authors deliberately did not introduce a loss like this, because they reasoned that learning just from the reward signals yields a better algorithm; it might use more data, but in the end it trains for what is important, the end goal. Introducing a loss like this clearly trades off the actual target, namely optimizing the reward (we don't actually care whether anything is consistent, we simply want a higher reward), for sample efficiency, because the supervision signal becomes much larger: we now work with entire hidden-state vectors rather than single numbers, so that's going to be a much richer training signal.
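The consistency idea boils down to a SimSiam-style similarity loss between two vectors: the dynamics model's predicted next hidden state and the encoder's embedding of the real next observation. A minimal sketch follows (the paper additionally uses projection and prediction heads, omitted here):

```python
import numpy as np

def consistency_loss(predicted_next_state, encoded_next_state):
    """Negative cosine similarity, as in SimSiam: minimal at -1 when the
    predicted and the actually-encoded next hidden state align. During
    training, the encoder branch receives a stop-gradient, so only the
    prediction branch (dynamics + representation) is pulled toward it."""
    p = predicted_next_state / np.linalg.norm(predicted_next_state)
    z = encoded_next_state / np.linalg.norm(encoded_next_state)
    return -float(p @ z)
```

Because the loss compares whole hidden-state vectors rather than a single scalar reward, every dimension of the state supplies a training signal.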

End-to-end prediction of the value prefix

So that's the first improvement. The second improvement is what they call end-to-end prediction of the value prefix. They make an example here: you have to predict the future value, but can you really? Say the ball flies in some direction: is the green player going to catch it or not? That makes a huge difference. As a human, at a late point you know the green player is not going to catch that ball, and slightly earlier you're fairly sure, but it's quite hard to predict much earlier, and it's even harder to predict at exactly which step in time the player is going to miss the ball. That's the argument they make for saying that adding up the rewards of our own predictions can introduce a lot of mistakes. But that's exactly what we do with the Q-value used in the tree search: we add up the rewards along the path so far and add the value at that particular node, and that is very error-prone, because the sum accumulates all the little errors made in each per-step prediction, and, as the example shows, we're often not sure exactly when things happen. So what they do is pretty simple: instead of adding up all the rewards k steps into the future, they take the hidden states predicted k steps into the future and feed them into a neural network, and that network outputs the sum of the rewards. So instead of summing the rewards directly, a neural network outputs the total sum, much like the network that outputs the value function looking ahead; this one looks back, from the current state to the end state rolled out in imagination, and predicts the entire value prefix. They use an LSTM for this, because it can take in an arbitrary number of states, and the LSTM has rich per-step supervision, because there is a reward at each step, and they say this works quite well.
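Schematically, the change is from directly summing k noisy per-step reward predictions to letting one recurrent network read the imagined hidden states and emit the cumulative reward. The tiny untrained recurrence below is only meant to show the interface; it is not the paper's LSTM head:

```python
import numpy as np

rng = np.random.default_rng(1)

# (a) Direct summation: each per-step reward prediction carries noise,
#     and over a k-step rollout those errors add up.
noisy_reward_preds = np.ones(5) + rng.normal(scale=0.3, size=5)
summed_return = float(noisy_reward_preds.sum())

# (b) Value-prefix interface: one recurrent aggregator maps the sequence
#     of imagined hidden states straight to the total reward so far.
def value_prefix(hidden_states, W_in, W_rec, w_out):
    h = np.zeros(W_rec.shape[0])
    for s in hidden_states:
        h = np.tanh(s @ W_in + h @ W_rec)  # simple RNN cell; LSTM in the paper
    return float(h @ w_out)                # predicted sum of rewards so far

states = [rng.normal(size=4) for _ in range(5)]
W_in = rng.normal(size=(4, 8))
W_rec = rng.normal(size=(8, 8))
w_out = rng.normal(size=8)
prefix = value_prefix(states, W_in, W_rec, w_out)
```

Because there is a true reward at every step, the recurrent head can be supervised at every step of the rollout, a much denser signal than a single end-of-rollout target.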

Model-based off-policy correction

So that's the second thing. The third thing is the model-based off-policy correction. This one is a little more tricky, but we can read through it to see what it does. It is an off-policy correction mechanism, and they have two different mechanisms for it. As I already said, you have to do off-policy correction because the data you learn from comes from a replay buffer, with a delay from the network, and is a little older than the network you are currently training, and that turns out to be quite a big problem. What we usually do is sample a trajectory from the replay buffer and compute a target value z for the value function. The value target suffers from off-policy issues, since the trajectory was rolled out using an older policy, and thus the value target is no longer accurate. Now, MuZero Reanalyze (a particular version of MuZero) already handles this a little, in that it recomputes the scalar values with the current network before learning from them, but the policy used to generate the data is still an old policy. They say: when data is limited, we have to reuse data sampled from a much older policy, thus exaggerating the inaccurate-value-target issue. So here is what they do. Here is the state, and here is what actually happened: we took some actions. We would like to take this trajectory and learn from it, but the policy that generated it is old; the current network might have done something entirely different, taken a different action, reached a different point. That is a problem, because in an on-policy method we'd largely like to learn from actions generated with the current policy. So they say: we're simply not going to use the entire trajectory for learning. We cut it off at some point, because the further out we go, the more uncertain we get, and the cut-off point moves closer the older the trajectory is: for a very recent trajectory we might cut off towards the end, but for a very old one we cut off early. The part before the cut-off is still fine; its uncertainty is not large enough to worry about. Then, because they have a latent model of the world, they use that model to imagine a rollout from the cut-off point, much like Dreamer does. So the trajectories in the replay buffer serve more as seeds, continued by imagined rollouts under the current policy. Also, at the last node they redo an MCTS search with the current policy and compute the empirical mean value, in order to get a really good target value there. Okay, so these are the three improvements again: first, a consistency loss on the hidden states to make the transition model better; second, directly predicting what they call the value prefix instead of summing up the rewards along the tree search; and third, using the collected trajectories as seeds and then training on essentially half-real, half-imagined rollouts with the current policy.
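The age-dependent cut-off can be sketched as a simple schedule. The function name, the linear decay, and the constants below are my own illustration; the paper's exact rule may differ:

```python
def trusted_horizon(steps_since_sampled, max_horizon=5, staleness_scale=200_000):
    """How many real steps of a replayed trajectory to trust before
    switching to imagined rollouts with the current latent model.
    Fresh data: use the full unroll. Stale data: cut off early, then let
    the model (plus a fresh MCTS at the cut-off state) supply the rest
    of the value target."""
    staleness = min(steps_since_sampled / staleness_scale, 1.0)
    return max(1, round(max_horizon * (1.0 - staleness)))
```

Under this toy schedule, a trajectory just written to the buffer keeps its full 5-step unroll, while one sampled 200k training steps later contributes only its first real transition and is continued in imagination.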

Experimental Results & Conclusion

It gives them very good performance on this Atari 100k benchmark. They also run additional ablation studies. For example, they try to reconstruct the observation from the hidden state, and they see that without the consistency loss this quickly fails (that would be the original MuZero), whereas with the consistency loss there is visibly something there that looks like the observation. Now, I don't know if that is after the 100k steps, because of course MuZero after 100k steps also doesn't perform particularly well, so you wouldn't be surprised that its reconstructions look bad; or it could be that their reconstruction method is just kind of poor. But the difference between the two models, with and without the consistency loss, is noticeable. They also analyze the validation loss when directly predicting the rewards versus using the value-prefix prediction: during training the two are approximately the same, but at validation time the loss is much lower with the value prefix. And lastly, they do a lot of ablations. What I noticed, pretty much across all of them, is that there is no consistent ranking. They have three improvements, and sometimes one of them is the most valuable (for example, without the value prefix, Alien drops quite a bit), other times another one is, and yet other times the third. There is no consistent ordering, which means there isn't a single recipe that makes this better; it's a conglomeration, and for different Atari games different things matter. That leads you to think this isn't a method derived from first principles: they looked at what fails and fixed, essentially one by one, the major mistakes they found. That is a legitimate way to go about it, but there is also a danger of over-engineering to the benchmarks we have, because clearly if you add just one of these improvements, some Atari games will improve by a lot and others won't. That, to me, is a little bit of the danger here, and it's why I can't tell you whether this algorithm will become a staple algorithm for sample-efficient RL or whether it just works particularly well on this benchmark. They do evaluate on another benchmark, the DeepMind Control benchmark, but I think more evaluation is needed. Still, I'm excited; it really has the potential to be something cool. All right, that was it from me. Thank you so much for listening and watching, let me know what you think in the comments, and bye.
