# Reinforcement Learning with Unsupervised Auxiliary Tasks

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=-YiMVR3HEuY
- **Date:** 28.08.2017
- **Duration:** 11:00
- **Views:** 4,495
- **Source:** https://ekstraktznaniy.ru/video/14026

## Description

https://arxiv.org/abs/1611.05397

Abstract:
Deep reinforcement learning agents have achieved state-of-the-art results by directly maximising cumulative reward. However, environments contain a much wider variety of possible training signals. In this paper, we introduce an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation upon extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task. Our agent significantly outperforms the previous state-of-the-art on Atari, averaging 880% expert human performance, and a challenging suite of first-person, three-dimensional *Labyrinth* tasks leading to a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth.


## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

Hi there. Today we're looking at "Reinforcement Learning with Unsupervised Auxiliary Tasks" by Google. In this paper the authors consider a reinforcement learning task, and I can show you what it looks like. It's a kind of maze; in the example they give, you have to navigate a 3D maze from pixel inputs, collect apples, and reach the goal, which gives you rewards. On the left you can see what the agent actually sees; on the right you can see it from a top-down view. The problem, of course, is that the reward is very sparse, meaning you have to navigate a lot of the maze before you even get a single point. Reinforcement learning has big trouble with this, because it relies on frequent reward to notice which actions are good and which are bad. So in addition to the regular loss you would have, namely your reward, the authors propose an additional set of auxiliary tasks. Here c ranges over the auxiliary control tasks that you specify; each of those has its own reward, and you also try to maximize these, each with some weight. The key point is that the parameters you maximize over are partly shared between all the different tasks, so the hope is that by learning to do one thing, you also learn to do another. How does this differ from earlier work? We've seen similar ideas before in more of an autoencoder setting: for example, the agent sees the input on the left and tries to predict what the next frame will be, the idea being that if it can accurately predict the next frame, maybe it has learned something useful about the environment. This work is different because a reward is now coupled to these tasks, and I can show you here what the authors propose as additional rewards.
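Written out, the combined objective just described looks roughly as follows. This is my reconstruction in notation close to the paper's, not a verbatim quote: θ are the shared parameters, C the set of auxiliary control tasks, and λ_c the per-task weights mentioned above.

```latex
\arg\max_{\theta} \; \mathbb{E}_{\pi}\!\left[ R \right]
\;+\; \sum_{c \in \mathcal{C}} \lambda_c \, \mathbb{E}_{\pi^{(c)}}\!\left[ R^{(c)} \right]
```

Because θ is shared across the extrinsic task and all auxiliary tasks, progress on any one term can shape the representation used by the others.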
Sorry, they are further up; let me go there. Specifically, they consider these two auxiliary control tasks. The first is pixel changes, which means the agent actively tries to change pixels: it gets a reward for changing the pixels in its input, and it tries to maximize this. It needs to learn what to do to maximize its pixel changes, and that will probably be moving around. So it will learn to move around rather than walk into a wall, because if it moves against a wall the pixels won't change. It will learn to move roughly the way a regular human agent would move, not into a wall or into a dead end, such that the pixels keep changing. Of course it's not perfect: you can also change your pixels quite a bit by simply spinning in a circle, but this is one of the auxiliary tasks they augment the agent with. The other one is network features, which is a kind of meta-learning: you actually reward the agent for changing its own internal activations. The hope is that it learns something about itself, about how it can activate its own internal neural network units, and it gets rewarded for that; it might want to activate a lot of them and learn how they are activated. You also hope that this kind of self-introspection leads to a network that does more sophisticated things, or that by trying to get the most pixel changes and the most network features activated, the agent also learns something useful for the actual task. So these are the two tasks they propose. In addition, as they show in the diagram over here, they also do a lot of other things. Namely, on the top left you can kind of

### Segment 2 (05:00 - 10:00) [5:00]

see here that this is the base agent. It is an A3C agent, meaning an actor-critic: you learn a policy and you learn a value network. We might go over this in a future video; for now, just consider it a standard reinforcement learning agent. You feed its experience into a replay buffer, and out of the replay buffer you do many things. For one, you try to learn the auxiliary tasks; note that the parameters are shared between all these networks, which is why the auxiliary tasks actually help. You also try to further learn your value function, and they call this off-policy learning because you pause the regular training for a while and train the value function some more, simply because that helps. You also do reward prediction in here, and the way they do it, as I explained, is via skewed sampling. Out of all the situations the agent can be in, it will receive a reward very few times. So what they do is sample from the replay buffer, from all the experiences they have had so far, and sample more frequently the experiences where they actually got a reward. The hope, of course, is that the agent learns a lot faster from the experiences where it actually gets an apple, as you can see if you zoom in here: there is an apple, I move towards it, I get a reward. The hope is that it instantly recognizes high-reward situations and is less interested in non-reward situations. Of course this introduces bias into your sampling, and you may decide for yourself whether that is good or bad; here it seems to work. There are a lot of experiments on the Atari and Labyrinth tasks, and of course, as usual with such research, they reach state of the art; they are much better than anything else. To be fair, they don't boast too much, so the comparisons are actually fair ones.

The criticisms I have are twofold. First of all, the choice of auxiliary tasks is completely up to the implementer, which means that as an implementer of this algorithm I have to decide what my auxiliary tasks will be. Here, pixel changes and network features seem like fairly general tasks that you could apply to a lot of these kinds of problems, but it always comes down to how much knowledge about the task you want to build into the agent. It makes sense to use at least pixel changes as an auxiliary task, but it is questionable how much domain knowledge this already encodes. The choice of these tasks is certainly something you have to make as a human, and I think these are good choices: they are not too domain specific, but they do correspond to visual, move-around game tasks. The other criticism, which is not really a criticism, just a remark, is that they do a lot of things. The paper is about the auxiliary tasks, but they also do the skewed sampling, the off-policy value learning, and so on. Of course you can argue that this is all standard in reinforcement learning, so it is a fair comparison. I guess it is a philosophical question: if you want to reach state of the art, you first have to come up with a better method, which here would be the auxiliary tasks, the new idea, and then implement all the tricks that other people have discovered. That is good because you reach the highest performance you can get, but the problem is that you make it harder to see where the improvement is coming from. Have you simply chosen better hyperparameters for

### Segment 3 (10:00 - 11:00) [10:00]

the reward prediction? Are there maybe interactions between the auxiliary tasks and the skewed sampling part? All these kinds of things wash out, and it is not really clear where the improvement is coming from. On the other hand, if you take a basic algorithm, just A3C, here on the top left, and augment it with nothing but these auxiliary tasks, on the bottom left, and you then see an improvement, you can be relatively sure it is due to your new idea. But of course you won't reach any state-of-the-art numbers, because everyone who does A3C also does these tricks. It's a philosophical question; here I stand more on the side of not doing the tricks, or maybe doing both. Decide for yourself, and have a nice day.
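To make two of the mechanisms discussed in the video concrete, here is a minimal sketch in plain Python/NumPy. This is an illustrative reconstruction, not the paper's code: the 4×4 cell size of the pixel-control grid, the `(observation, reward)` buffer layout, and the function names are all my assumptions. Only the two ideas come from the video: the per-cell average pixel change as a pseudo-reward, and sampling rewarding transitions about half the time for reward prediction.

```python
import random
import numpy as np

def pixel_change_reward(frame_prev, frame_next, cell=4):
    """Pixel-control pseudo-reward: average absolute pixel change per cell.

    The frame difference is averaged over non-overlapping cell x cell
    regions; each cell's mean change is the pseudo-reward for the task of
    "controlling" that region. The cell size is an illustrative choice,
    not the paper's exact preprocessing.
    """
    diff = np.abs(frame_next.astype(np.float32) - frame_prev.astype(np.float32))
    if diff.ndim == 3:                    # average over colour channels
        diff = diff.mean(axis=-1)
    h = diff.shape[0] - diff.shape[0] % cell   # crop to a multiple of the
    w = diff.shape[1] - diff.shape[1] % cell   # cell size
    diff = diff[:h, :w].reshape(h // cell, cell, w // cell, cell)
    return diff.mean(axis=(1, 3))         # one pseudo-reward per spatial cell

def skewed_sample(replay, p_reward=0.5, rng=random):
    """Skewed replay sampling for reward prediction.

    Rewarding transitions are rare, so with probability p_reward we draw
    from them and otherwise from the zero-reward ones, giving the
    reward-prediction head a roughly balanced stream of examples.
    """
    rewarding = [t for t in replay if t[1] != 0]
    zero = [t for t in replay if t[1] == 0]
    if rewarding and (not zero or rng.random() < p_reward):
        return rng.choice(rewarding)
    return rng.choice(zero)

# Pixel control: an object appearing in the top-left cell of an 8x8 frame.
a = np.zeros((8, 8), dtype=np.uint8)
b = a.copy()
b[:4, :4] = 255
r = pixel_change_reward(a, b, cell=4)     # 2x2 grid of per-cell pseudo-rewards

# Skewed sampling: one apple among 99 empty steps.
random.seed(0)
buffer = [("empty", 0)] * 99 + [("apple", 1)]
draws = [skewed_sample(buffer) for _ in range(1000)]
frac_rewarding = sum(1 for _, rew in draws if rew != 0) / len(draws)
```

Even though only 1% of the buffer is rewarding, roughly half of the sampled transitions are, which is exactly the biased-but-useful balance the reward-prediction training relies on.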
