Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery


Yannic Kilcher · 12.04.2020 · 1,967 views · 66 likes

Video description
DDL is an auxiliary task for an agent to learn distances between states in episodes. This can then be used further to improve the agent's policy learning procedure.

Paper: https://arxiv.org/abs/1907.08225
Blog: https://sites.google.com/view/dynamical-distance-learning/home

Abstract: Reinforcement learning requires manual specification of a reward function to learn a task. While in principle this reward function only needs to specify the task goal, in practice reinforcement learning can be very time-consuming or even infeasible unless the reward function is shaped so as to provide a smooth gradient towards a successful outcome. This shaping is difficult to specify by hand, particularly when the task is learned from raw observations, such as images. In this paper, we study how we can automatically learn dynamical distances: a measure of the expected number of time steps to reach a given goal state from any other state. These dynamical distances can be used to provide well-shaped reward functions for reaching new goals, making it possible to learn complex tasks efficiently. We show that dynamical distances can be used in a semi-supervised regime, where unsupervised interaction with the environment is used to learn the dynamical distances, while a small amount of preference supervision is used to determine the task goal, without any manually engineered reward function or goal examples. We evaluate our method both on a real-world robot and in simulation. We show that our method can learn to turn a valve with a real-world 9-DoF hand, using raw image observations and just ten preference labels, without any other supervision. Videos of the learned skills can be found on the project website: this https URL.

Authors: Kristian Hartikainen, Xinyang Geng, Tuomas Haarnoja, Sergey Levine

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

Table of contents (4 segments)

<Untitled Chapter 1>

Hi there. If you look at this robot, it has learned to turn this valve by itself. Now, "by itself" isn't really correct, but it has learned it in a semi-supervised way with only 10 human inputs along the entire learning trajectory. So only 10 times was there a true reward for this reinforcement learning procedure, and the rest is unsupervised discovery of this skill. The paper we're going to look at today, describing the technique by which this was achieved, is "Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery" by Kristian Hartikainen, Xinyang Geng, Tuomas Haarnoja, and Sergey Levine.

So this is a technique for reinforcement learning. They claim that reinforcement learning requires manual specification of a reward function to learn a task, and they say that while in principle this reward function only needs to specify the task goal, in practice reinforcement learning can be very time-consuming or even infeasible unless the reward function is shaped so as to provide a smooth gradient towards a successful outcome. What does this mean? Let's look at it. If you want the robot here to turn the valve to the right, ideally you simply want to say: the robot is here, this is the start state, and I want the valve to end up turned to the right. That state is good, and all of the rest — I don't want any of that. For reinforcement learning, in principle this is enough; this is a reward function: all of this is 0, and this is 1. In theory, if you apply any sort of reinforcement learning algorithm with any sort of guarantee, this should get you there. But of course we all know it's not that easy. There is basically an exploration bottleneck: the robot has these three digits and lots of joints to move around, and the probability that by itself it discovers that it needs to do this motion here and get this reward is very slim.

So what you would want to do in the reward function that you're providing to the robot is to say: okay, in this state here I see the blue thing is a bit more turned, so I'm maybe going to give this a 0.1. Here it's a bit more turned, so maybe this is 0.2, and this one I really like, 0.3. Here, maybe 0.6, because it's even more turned, and then 1 at the end. So this is what they would call a smooth gradient in the reward function.
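To make the sparse-versus-shaped distinction concrete, here is a toy sketch. This is my own illustration, not code from the paper: `sparse_reward`, `shaped_reward`, and the target angle are all hypothetical.

```python
# Hypothetical illustration (not from the paper): a sparse reward gives no
# learning signal until the valve reaches the target angle, while a hand-shaped
# reward ramps up smoothly as the valve turns further.

TARGET_ANGLE = 180.0  # assumed target: valve turned half a revolution

def sparse_reward(angle: float) -> float:
    """1 only when the goal is reached, 0 everywhere else."""
    return 1.0 if angle >= TARGET_ANGLE else 0.0

def shaped_reward(angle: float) -> float:
    """Ramps up linearly with progress toward the target angle."""
    return min(angle, TARGET_ANGLE) / TARGET_ANGLE

for angle in (0.0, 45.0, 90.0, 135.0, 180.0):
    print(angle, sparse_reward(angle), shaped_reward(angle))
```

The sparse version is 0 almost everywhere, which is exactly the exploration bottleneck described above; the shaped version hands the agent a gradient, but only because we already knew what progress looks like.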

Smooth Gradient

A smooth gradient in the reward function means the reward ramps up until the goal is reached. But oftentimes this isn't really possible, because you can only truly shape the reward function if you already know exactly how to perform the task in the first place — and then why exactly would you do reinforcement learning, except as an academic exercise? So the issue this paper tackles is clear. What they want to say is: let's assume your reward function is actually pretty bad. Can we artificially provide a way that the discovery of these new skills, as they call them, is facilitated as if the reward function had some sort of gradient? That's the outset.

Let's go back to this for a second. They have these mazes as a kind of example. Let's say you have one of these mazes, and there is always a start state — you're here — and a goal state, let's say over here. You can move up, down, left, right, and the task is to reach the goal. But if the reward function is simply that you get a reward of 1 if you reach the goal and 0 otherwise, then all the agent can do is explore around until it reaches the goal — which usually means random exploration.
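A tiny simulation can show how rarely pure random exploration earns the sparse reward. This is my own construction, not from the paper: the 10×10 grid, step budget, and corner-to-corner layout are all assumptions.

```python
import random

# Toy illustration: with a sparse 0/1 reward, a random-walk agent in an open
# grid only rarely stumbles onto the goal, so the policy almost never sees a
# nonzero learning signal.

def random_rollout(size=10, max_steps=50, rng=None):
    rng = rng or random.Random()
    x, y = 0, 0                      # start in one corner
    goal = (size - 1, size - 1)      # goal in the opposite corner
    for _ in range(max_steps):
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        x = min(max(x + dx, 0), size - 1)
        y = min(max(y + dy, 0), size - 1)
        if (x, y) == goal:
            return 1.0               # sparse reward: goal reached
    return 0.0                       # otherwise the rollout earns nothing

rng = random.Random(0)
hits = sum(random_rollout(rng=rng) for _ in range(1000))
print(f"goal reached in {hits:.0f} / 1000 random rollouts")
```

Almost every rollout returns 0, so a policy-gradient or Q-learning update has essentially nothing to learn from — which is the motivation for learning distances instead.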

Random Exploration

A lot of reinforcement learning algorithms — for example Q-learning or policy gradient methods — have some sort of random exploration element: absent any knowledge of what to do, they just wiggle around. Up, up, right, left, down — that didn't work — okay, down, left, down, and up again. They just kind of bumble around. This method takes issue with that, and it says: while the agent is doing its thing trying to reach the goal, we can learn a distance function between states. We'll reduce the problem for now and just say the task is always that the goal state is reached in the smallest number of steps. So let's say the agent does something: it goes here, here, and then here — that's one rollout of the policy — and then it crashed into a wall. That's bad, so it gets a negative reward. But in addition to that, it has visited all of these intermediate states, and this paper wants us to learn a distance function between those states.

This distance function — let's call it D — learns how far two states are apart. You can ask it: this state here, call it state A, and this state here, state B — how far are those away? This isn't super well-defined yet; what you mean is how far they are away for this agent, under the policy pi that it used to create this trajectory. And "how far away" is simply the number of steps it takes the agent to go from A to B — in this case that would be 2. You can do this between any two states on the trajectory, starting from any of them. So this distance function D actually has a pretty tight reward signal — a wealth of information to learn from. The policy pi in this case can't learn much, because it just got a reward of 0 since it didn't reach the goal, but the distance function has a very dense reward signal: distances between any two visited states.

Now let's say we've explored a bunch — many trajectories, some here, two here, then here — and sometimes we even reached the goal, and we've learned the distances between all of the states. If we had a perfect distance function, our task becomes very simple. Say I'm here where the green agent is, and I can either go up or down — call up X and down Y. Which one should I choose? Without even asking my policy per se, I can ask the distance function: what do you think of the distance from X to the goal, and what do you think of the distance from Y to the goal? If it's learned correctly, it will tell you: from X to the goal you need maybe eight steps, from Y ten steps. So you would definitely go with X. With a good distance function you could solve the task fairly easily.

Now, this by itself isn't super interesting. You'll quickly notice that if you are able to learn such a good distance function, especially to this goal state, then you might as well learn a good policy, because it means you've already reached the goal a fair number of times. The information-theoretic signal of D versus the signal on pi, if you just want to reach the same fixed goal, seems the same to me — the paper tries to talk this up, I feel, but if you're in the situation where you have one fixed goal and that's it, this doesn't seem much more beneficial than just learning a value function, as you would in A3C or something. If the (negative) number of steps is your reward — you want to reach the goal in the shortest time — then learning a value function is the same thing. The difference is this: a value function takes a state s, and for the policy pi the goal is implicit, because you assume the goal is always the same. The distance function takes a state s and a goal state, so you can technically change your goal — and this is where it becomes interesting.

Say you've explored but haven't quite reached the goal yet. We said most of these algorithms have some notion of random exploration. Suppose you went from here to here to here, and you've learned the distances fairly well for the trajectories you've done, but you just haven't been able to get any further. What you can do is go to your replay buffer — your memory of everything you've done — and ask: which of these states has the farthest distance from my starting state? The answer will be: this state here. So now you make that your goal, you just try to reach that state, and once you reach it, you explore from there. Because it is the farthest state you know, it's kind of the frontier of what you know, so if you explore from there you can probably go even further. It might turn out that from there you can only go back — that's a possibility — but probably you can go further. Then you might reach this state here, and again your replay buffer tells you this one is the farthest, so you take that as your new goal, try to reach it, and explore from there, and so forth. This is extremely similar to an algorithm like Go-Explore, which I already made a video about, which remembers what it did and always travels to the farthest states it has seen so far, then tries to go further from there. So if you can learn a good distance function, it will help you explore the space, and eventually you might actually stumble over the goal state by yourself — you might explore the maze enough that you find it.

So that's the core idea, and it can be used in a number of different ways. Instead of always going for the farthest state, what they did with the robot is let the algorithm explore, and then — if this is like a tree of visited states — at some point they ask the human: which of these is closest to what you want? The human says "this one," and then that becomes the new goal: try to reach this as much as possible, and explore from there. So in the case of the robot, it just does some things, exploring in an unsupervised manner, and at some point you ask the human which of the things the robot has done you like the most, and that becomes the new intermediate goal state, and the algorithm explores from there. That's the main gist and how you can use it. The entire learning procedure is actually pretty simple: what they propose is simply to learn the distance function.
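The frontier-goal idea above can be sketched as follows. The names here are my own, and the toy `distance_fn` is a stand-in for the learned dynamical distance network; the real replay buffer would hold full states or observations, not integers.

```python
# Sketch of the frontier idea: pick, from everything visited so far, the state
# that the learned distance function says is farthest from the start state,
# and make it the next exploration goal (Go-Explore-style).

def pick_exploration_goal(start, replay_buffer, distance_fn):
    """Return the visited state judged farthest from the start."""
    return max(replay_buffer, key=lambda s: distance_fn(start, s))

# Toy stand-in: states are points on a line, "distance" is the step gap.
replay_buffer = [0, 1, 2, 5, 3]          # states visited so far
distance_fn = lambda a, b: abs(b - a)    # pretend learned dynamical distance

goal = pick_exploration_goal(0, replay_buffer, distance_fn)
print(goal)  # the frontier state: 5
```

Once the agent reaches this goal, it explores from there, new states enter the buffer, and the next call returns an even more distant frontier.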

Learn the Distance Function

They're pretty formal here. They say: if two states were visited in the same episode at time steps i and j respectively, you can define the distance function as a discounted sum of a cost function from i to j. But ultimately they consider shortest-path problems, so the cost function simplifies: the discount becomes 1 and the per-step cost becomes 1, and the distance is simply j minus i — how many steps it takes to get from the state visited at time step i to the state visited at time step j. Then they train a parameterized function — a neural network — that learns to map a pair of states to the number of steps it took to get from one to the other, simply by regression with a mean squared error loss. As simple as that; that's how you learn the distance function.

Then you can use the distance function in the ways we discussed: to improve your shortest-path policy by providing the (negative) distance as the reward; in an unsupervised fashion where you always propose the farthest-away goals; or in the semi-supervised fashion. They have a bunch of videos of things they trained. This one is from the semi-supervised setting, where the humans were simply selecting the hoppers that went farthest to the right, and you can see that over time it learns to hop to the right with only very sparse input.
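The regression targets described above — the step gap j − i between any two states of the same episode — can be sketched like this. `distance_targets` and `mse_loss` are my own illustrative names, and `predict` stands in for the parameterized distance network.

```python
# For every ordered pair of states (s_i, s_j) with i < j in one episode, the
# supervised target for the distance network is simply the step gap j - i.
# The network would then be trained to minimize mean squared error on these.

def distance_targets(trajectory):
    """Yield (s_i, s_j, j - i) training tuples from one episode."""
    for i, s_i in enumerate(trajectory):
        for j in range(i + 1, len(trajectory)):
            yield s_i, trajectory[j], float(j - i)

def mse_loss(predict, trajectory):
    """Mean squared error of a distance predictor against the step-gap targets."""
    pairs = list(distance_targets(trajectory))
    return sum((predict(a, b) - d) ** 2 for a, b, d in pairs) / len(pairs)

trajectory = ["s0", "s1", "s2", "s3"]
targets = list(distance_targets(trajectory))
print(len(targets))   # 6 ordered pairs from a 4-state episode
print(targets[0])     # ('s0', 's1', 1.0)
```

Note how dense this signal is: a single failed episode of length T still yields on the order of T² training pairs for the distance function, even though the policy itself saw zero reward.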
They also have an unsupervised video where you simply let it run, and in unsupervised fashion it tries to discover states that are as far away as possible from its initial state. You can see it actually learns to move to the right and to the left, because those movements reach states that are very far from the original state. It's pretty cool that the unsupervised method discovers such states.

So, what to make of this? If this feels familiar, that's plausible, because I've seen some form of this idea in many papers before, and they make some of these connections in their related work. For example, universal value functions (universal value function approximators) and so on, where — basically also in an unsupervised way — you select two states and tell the agent: now try to go from here to here; and then you select two new states. You basically teach your agent to go between two randomly chosen states, and it's supposed to learn something about the environment in an unsupervised fashion — very similar to what we have here. A bunch of other things, like plain value functions, are also pretty similar, I think, and there is a big connection to Go-Explore. So this has been around in one way or another, but possibly not in this specific formulation, and — which I think is cool — not applied to this specific semi-supervised task.

If I had to formulate a criticism of this method, I would guess that it probably doesn't work when the branching factor of the task is super high. Here you can only really turn the valve one way or the other — of course the digits and the joints have their degrees of freedom — but imagine the branching factor is very high, so that from a given state you can go in many, many different ways, and from each of those you can again go in many different ways. Then the notion of something being "far away" — going to the frontier and asking what's farthest — is almost meaningless, because so much remains unexplored. If you are three steps deep here, it will always tell you: well, this state here is the farthest away — but you haven't explored these, you know, fifteen other directions. So it might be that you miss things: here's the goal and here's the start, and you go a long way around but miss an obvious shortcut, because you always want to go along the longest known path. It seems like there are probably environments where this works well, but if the branching factor is very high, or if there are loops between states or non-obvious combinatorial structures, it might sometimes even be counterproductive. I'm not sure about that, but it seems this would work in fairly specific environments.

All right, this was my commentary. I invite you to read the paper, check it out, and bye.
