# Long term credit assignment with temporal reward transport | Cathy Yeh | OpenAI Scholars Demo Day 2020

## Metadata

- **Channel:** OpenAI
- **YouTube:** https://www.youtube.com/watch?v=jjmTmYMsET0
- **Date:** 09.07.2020
- **Duration:** 17:47
- **Views:** 4,970
- **Source:** https://ekstraktznaniy.ru/video/11594

## Description

Learn more: https://openai.com/blog/openai-scholars-2020-final-projects#cathy

## Transcript

### Introduction [0:00]

The first graphic I wanted to share with you was of a reinforcement learning agent playing the Atari Breakout video game. That's an example of a case where standard RL can learn very well in a certain environment, but the problem I'll be focusing on is specifically the problem of delayed rewards. This is very relevant to real life, because as you interact with your environment you don't typically get points every time you move; there are actions that are separated by very long timescales from their effects.

(Cathy, would you like me to help you move your slides?) Yeah, I can't seem to change my screen; my keyboard's not responding to that. (I'll show them.) Oh, okay, thanks, Francis. I think I might have it. Apologies for the technical difficulties. Okay, can you see my slides? (Yes.) Okay.

So the plan is going to be three parts. First, I will describe why standard RL struggles with tasks with long-delayed rewards. Next, I will describe the temporal reward transport (TRT) algorithm that I've been working on to address this problem. And finally, I will share some results from experiments using TRT. Okay, so the model in

### The discount factor introduces a timescale for the exponential suppression of future rewards [1:59]

reinforcement learning is: you have an agent interacting with the world, and as it interacts it transitions from state to state and can pick up rewards along that trajectory. Here I have an equation for something called the discounted return, which is just the sum of all the rewards the agent picks up along its trajectory, but with an extra factor attached: the discount factor gamma. This gets at the crux of why standard RL algorithms struggle with tasks with delayed rewards. Gamma introduces a timescale, and it's basically a heuristic that says you care about rewards now more than rewards later in the future, so you're discounting your future rewards. In this plot, if you're standing at time zero and you look forward a hundred time steps, a reward a hundred time steps in the future is discounted to about 37 percent (roughly 1/e) of its original value. This is totally fine if you're in an environment where your immediate actions really only affect the most immediate rewards, but in cases with long delays it's not going to work as well.

So let's take a look at that. Here's an example: we have a little agent walking through the environment, and at some point it can choose to pick up a key or not; it doesn't get a reward for picking up that key. The agent continues interacting with the environment until, at the very end, it reaches a green goal, where, if it did pick up the key in that first state, it's rewarded an extra bonus of 20 points. The problem with standard RL, though, is that it wants to reinforce actions based on the rewards acquired after taking that action, and if we look at the rewards that would weight this particular state-action pair, where the agent is next to the key, we see that the future rewards are highly attenuated. So there's very low signal, and learning is very slow as a consequence.

So what can we do about that? That brings me to the next part of my talk: the TRT algorithm. This algorithm is based on work by Hung et al. from DeepMind on optimizing agent behavior over long timescales by transporting value, and the idea is that if you've identified the significant state-action pairs that should receive credit for some long-term reward, then you can splice those distant rewards onto those state-action pairs to amplify the signal that reinforces those actions. That's what we see here: in the original situation on the slide, the agent receives zero immediate points for picking up the key, but we splice in the future rewards, in this case a distal reward of 20 points, and suddenly we have a lot more signal with which to increase the probability of taking that action in that state, which is what we want the agent to learn: to pick up the key.
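To make those two ideas concrete, here is a minimal Python sketch of (a) the discounted return and (b) the reward splicing that TRT performs. The function names, the choice of gamma = 0.99 (implied by the ~37% figure at 100 steps), and the `distal_window` parameter are my own illustration, not the talk's actual implementation.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Discounted return G_t = sum_k gamma**k * r[t+k], computed for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def splice_distal_rewards(rewards, significant_steps, distal_window=1, weight=1.0):
    """Temporal reward transport (sketch): copy the late, distal reward onto the
    time steps flagged as significant, so the update at those steps sees a much
    stronger, undiscounted signal."""
    rewards = np.asarray(rewards, dtype=float).copy()
    distal_reward = rewards[-distal_window:].sum()  # e.g. the bonus earned at the goal
    for t in significant_steps:
        rewards[t] += weight * distal_reward
    return rewards

# Toy trajectory: picking up the key at t=0 gives no immediate reward; the
# +20 bonus only arrives 100 steps later, so it is heavily discounted.
rewards = np.zeros(101)
rewards[-1] = 20.0
print(discounted_returns(rewards)[0])             # ~7.3 (20 * 0.99**100): weak signal
spliced = splice_distal_rewards(rewards, significant_steps=[0])
print(discounted_returns(spliced)[0])             # ~27.3: the key step now sees the bonus directly
```

After splicing, the key-pickup step carries roughly the same magnitude of signal as an immediate 20-point reward would, which is the amplification TRT relies on.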
That brings us to the next question: how do we know which state-action pairs are significant and should receive these spliced rewards in TRT? That is the problem of credit assignment in reinforcement learning, and the way we do it, the way Hung et al. did it, is with an attention mechanism. The idea is that you do a full rollout of an episode, so the agent interacts with the environment, and at the end of that rollout you pass the entire sequence of states and actions to a model, in my case a binary classifier, and you look at which state-action pairs were paid the most attention by the other frames. Here I have a heat map of the attention scores, and you can see two really bright stripes; by the way, the axes denote the frames of the trajectory on both the x and y axes. Those two bright stripes correspond to highly attended state-action pairs, and if we do a sanity check and bring up what those particular observations were, we see that they match what we would expect: we have an agent, the little red triangle, right next to a key, so it might be moving toward it or trying to pick it up. This is a good confirmation that we are attending to the important states and actions. The next step is to test whether this TRT algorithm actually works.
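As a rough sketch of how highly attended frames could be pulled out of such an attention matrix, one option is to sum the attention each frame receives from all other frames and flag the outliers. The summing-and-thresholding rule here is my own assumption, not necessarily what the project used.

```python
import numpy as np

def significant_steps_from_attention(attn, num_std=2.0):
    """attn: (T, T) matrix of attention weights from a trained model, where
    attn[i, j] is how much frame i attends to frame j. Returns the indices of
    frames that receive unusually high total attention, i.e. the bright
    vertical stripes in the heat map."""
    received = attn.sum(axis=0)                       # total attention paid TO each frame
    threshold = received.mean() + num_std * received.std()
    return np.where(received > threshold)[0]

# Example with a random attention matrix plus two artificially bright frames.
rng = np.random.default_rng(0)
attn = rng.random((50, 50)) * 0.1
attn[:, 7] += 1.0    # frame 7: agent standing next to the key
attn[:, 8] += 1.0    # frame 8: agent picking the key up
print(significant_steps_from_attention(attn))         # -> [7 8]
```

In the project this selection would come from the trained classifier's attention weights rather than a hand-made matrix.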

### Create a 3 phase environment to test long term credit assignment [6:40]

So I created an environment specifically constructed to be challenging to learn in if you don't do credit assignment over long timescales. This environment has three phases. In the first phase, the agent encounters an empty grid with just a single key; the agent can choose to pick up that key or not, but it doesn't receive any immediate reward if it picks it up. The second phase is a distractor phase: we fill it with gifts, and when the agent opens a gift it gets an immediate reward. The final phase, which is the focus of our evaluation, is the phase in which the agent can earn a distal reward: when the agent navigates to the green goal, if it learned to pick up the key in phase one it gets 20 points, and if it never learned to pick up the key it gets just 5 points. That's going to be the focus of the rest of the experimental results: whether the agent learns to pick up the key and correspondingly gets the 20 points in phase three.

I have three separate experimental slides to show you. They all involve varying the parameters of the distractor phase, essentially making it more and more challenging for the agent to learn. The three parameters I vary are the time delay, which is the time the agent is forced to spend in the distractor phase; the gift reward size for the distractors; and the variance of the distractor rewards.

These plots show the total rewards earned by the agent in phase three, i.e. whether it picked up the key or not, and as you move from left to right the task becomes increasingly difficult. Each plot corresponds to a certain delay, expressed relative to τ_γ, the discount-factor timescale, and the delay increases from left to right. You can see that initially, when the time delay is not that long, the agent does learn to pick up the key, though not quite as well as with the TRT algorithm on top of advantage actor-critic (the baseline is A2C, advantage actor-critic). But by the time you get to the rightmost plot, A2C has basically plateaued at 5, and those 5 points, if you recall, correspond to only moving to the goal and never learning to pick up the key, whereas A2C with TRT shows consistent progress toward learning to pick up the key.

The next slide shows experimental results for varying the distractor reward size. Again, left to right it gets harder as we increase the size of the distractor rewards, and again we see the same pattern: A2C with the TRT algorithm does better than A2C alone.

The final slide is an experiment showing the phase-three returns when we vary the variance of the distractor rewards. In this case we have four gifts, and they all have a mean reward of 5, but for each gift we sample from a uniform distribution around 5, and we increase the range of that uniform distribution in order to increase the variance. So they all have the same mean reward, but there's greater variance. Again you can see that A2C plus TRT does better than A2C alone.
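As a concrete picture of the environment these experiments ran in, here is a toy sketch of the three-phase reward structure. The 20-vs-5 goal reward, the 5-point mean gift reward, and the three varied distractor parameters come from the talk; the interface, action names, and episode mechanics are my own simplification, not the actual grid-world code.

```python
import random

class ThreePhaseEnv:
    """Sketch of the evaluation environment: a key phase with no immediate
    reward, a distractor phase with immediate gift rewards, and a final phase
    whose distal reward depends on whether the key was picked up."""

    def __init__(self, distractor_steps=100, gift_mean=5.0, gift_spread=0.0):
        self.distractor_steps = distractor_steps   # forced time delay (phase 2)
        self.gift_mean = gift_mean                 # distractor reward size
        self.gift_spread = gift_spread             # distractor reward variance
        self.reset()

    def reset(self):
        self.t = 0
        self.has_key = False
        return "phase1"

    def step(self, action):
        self.t += 1
        if self.t == 1:                            # phase 1: pick up the key or not
            self.has_key = (action == "pickup_key")
            return "phase2", 0.0, False            # no immediate reward either way
        if self.t <= 1 + self.distractor_steps:    # phase 2: gifts give immediate reward
            reward = 0.0
            if action == "open_gift":
                reward = random.uniform(self.gift_mean - self.gift_spread,
                                        self.gift_mean + self.gift_spread)
            return "phase2", reward, False
        reward = 20.0 if self.has_key else 5.0     # phase 3: distal reward at the green goal
        return "done", reward, True
```

Picking up the key changes nothing about the rewards the agent sees for roughly the next hundred steps, which is exactly what makes the credit assignment hard.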

### Temporal reward transport helps long term credit assignment [10:30]

So, to summarize, we've seen that adding temporal reward transport on top of standard reinforcement learning algorithms does seem to show some benefit for long-term credit assignment. This work builds directly on the ideas from the Hung et al. 2019 paper, and the two core concepts to take away from it are the idea of using some sort of temporal value or reward transport to splice rewards onto significant state-action pairs, and the use of attention to identify those important state-action pairs. Our contribution here is a completely different architecture implementing these two core concepts. It's much simpler than the original paper's implementation, and it's also a much simpler environment, but I think there's definitely merit in showing that the concept carries beyond the original implementation in the paper. The implementation is also very modular: I completely separated out the attention part of the algorithm into a separate classifier used to identify the significant state-action pairs, so if you want to try adding TRT to some other model, you can; it's very easy to add on because of the modular implementation.

For the future, well, there's tons of work, there's always more. This is just a heuristic, so it would be interesting to move beyond a heuristic, but it is a useful heuristic. Also, I've only shown you results for a very simple grid-world environment, so it would be interesting to see how well the algorithm holds up in more complex situations.

With that said, I had a lot of fun working on this project, and I want to move on to the Q&A stage, but first I want to give my thanks. I'd like to thank my mentor Jerry at OpenAI for being with me through this whole process. I want to thank OpenAI itself for this wonderful opportunity and all the different people I've talked to; in informal, casual conversations I've picked up so much. Thank you to the program organizers, Kristina and [inaudible], who have been really wonderful and so supportive of the scholars. I also have my lovely scholars cohort to thank; they were also very supportive, and there was a lot of knowledge sharing as we all ramped up on deep learning at the same time. Finally, I'd like to thank Square, my employer, for giving me the chance to take some time off to do this program. If you're interested in more project details, my write-up is available at my blog, efavdb.com.

Okay, now I'm going to take a look at the questions. The first question is: can you explain why the distractor phase makes the task more difficult? In other words, in your opinion, why does the agent not learn the more general behavior of simply interacting with all objects, picking up the key and opening the gifts? I think this gets at the following: with standard RL algorithms, for policy gradients, if you have a policy, which is how you choose an action in a particular state, you can amplify or reduce the likelihood of taking an action based on the rewards that follow it, and because of discounting you won't see rewards in the far future. So if you are in the key state, the state in phase one where you're next to the key, it's not going to receive that amplification, because the rewards in the distractor phase are specific to being in the states next to those distracting gifts. The agent very quickly learns to open the gifts because it sees an immediate reward from the gift, so the weighting of that reward is very high, and it reinforces the idea that we want to take the toggle action to open the gifts. But that doesn't simply transfer: just because you've learned to take that particular action in that state, the way the algorithm is set up, it doesn't translate to learning to do the same thing when you're next to the key.
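To put rough numbers on that answer, here is a small Monte-Carlo illustration. It uses the 100-step delay regime, a gamma of 0.99 implied by the ~37% figure, and the 5-point gifts and 20-vs-5 goal rewards from the talk; the noise range and the timing of the gifts are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, delay = 0.99, 100   # discount factor implied by the ~37% figure; 100-step delay

def return_after_key(picked_key, gift_mean=5.0, gift_spread=4.0, n_gifts=4):
    """Return following the key decision: noisy distractor rewards plus the
    heavily discounted goal bonus (20 with the key, 5 without)."""
    gifts = rng.uniform(gift_mean - gift_spread, gift_mean + gift_spread, size=n_gifts)
    discounted_gifts = sum(gamma ** (t + 1) * g for t, g in enumerate(gifts))
    bonus = 20.0 if picked_key else 5.0
    return discounted_gifts + gamma ** delay * bonus

with_key = np.array([return_after_key(True) for _ in range(1000)])
without_key = np.array([return_after_key(False) for _ in range(1000)])
print(round(with_key.mean() - without_key.mean(), 2))  # ~5.5: attenuated signal from the key
print(round(with_key.std(), 2))                        # ~4.5: noise from the distractor rewards alone
```

The extra return from having picked up the key is heavily attenuated and is comparable in size to the noise contributed by the distractor rewards, which is why the key-pickup action is slow to be reinforced.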
Okay, the next question is: I'm curious to know whether using more advanced deep RL algorithms like PPO would weigh more than TRT's influence in the results. I'm not 100% sure about the wording; it sounds like the question is what would happen if I had tried it on PPO instead of A2C. That's a straightforward test we could do because of the modular implementation. PPO is more sample efficient than A2C: it's able to do several smaller updates, compared to one single update by A2C, before discarding your batch of experience. Given that, I would expect it to have a better learning curve. I'm not sure whether the interaction with TRT would be any different, but I would expect PPO to fundamentally look a little better than the baseline I showed here; I just haven't tested it yet.

What was the most challenging part of your project? Oh, there's a lot. I think at some point, when I finally decided on this particular path, I had a bunch of ideas for how to get it working, and every time I tried something new and committed to GitHub, I thought maybe this would be it. So the challenging part was seeing the deadlines looming and starting to realize that each of my fixes wasn't necessarily becoming the last fix. But in the end things worked out. I realized there were some artifacts being introduced due to the way this algorithm is set up; it's very sensitive to the full context of the episode, and I had to think about how to handle that, because it would have required quite a bit of re-engineering of my code, with very little time, in order to do parallel training, which was really important since I needed to run this on many workers. So I had an idea to rejigger something... well, I'm running out of time, but basically it was just pushing through, and it kind of worked out in the end, so I'm really glad of that. And with that, okay, thanks.
