Learning World Graphs to Accelerate Hierarchical Reinforcement Learning

Yannic Kilcher · 08.08.2019 · 4,260 views · 124 likes


Video description
The goal of hierarchical reinforcement learning is to divide a task into different levels of coarseness, with the top-level agent planning only over a high-level view of the world and each subsequent layer having a more detailed view. This paper proposes to learn a set of important states, as well as their connections to each other, as a high-level abstraction.

Paper: https://arxiv.org/abs/1907.00664

Abstract: In many real-world scenarios, an autonomous agent often encounters various tasks within a single complex environment. We propose to build a graph abstraction over the environment structure to accelerate the learning of these tasks. Here, nodes are important points of interest (pivotal states) and edges represent feasible traversals between them. Our approach has two stages. First, we jointly train a latent pivotal state model and a curiosity-driven goal-conditioned policy in a task-agnostic manner. Second, provided with the information from the world graph, a high-level Manager quickly finds solution to new tasks and expresses subgoals in reference to pivotal states to a low-level Worker. The Worker can then also leverage the graph to easily traverse to the pivotal states of interest, even across long distance, and explore non-locally. We perform a thorough ablation study to evaluate our approach on a suite of challenging maze tasks, demonstrating significant advantages from the proposed framework over baselines that lack world graph knowledge in terms of performance and efficiency.

Authors: Wenling Shang, Alex Trott, Stephan Zheng, Caiming Xiong, Richard Socher

Table of contents (6 segments)

Introduction

Hi there! Today we're looking at "Learning World Graphs to Accelerate Hierarchical Reinforcement Learning" by Wenling Shang et al. from Salesforce Research. This work is based in the world of reinforcement learning, and especially hierarchical reinforcement learning. In hierarchical reinforcement learning, the idea is that in order to perform a task, like in this case they perform all

Hierarchical reinforcement learning

of their experiments on mazes like this. So imagine you have this maze, and this red thing here is the agent, and the goal is the green square. The gray things obviously are walls, and the black things are everywhere the agent can move. The agent can always move one step in any direction it wants that isn't blocked by a wall. So in order to fulfill such a task, the agent needs to take many steps: go here, here, here, and each one of those is a step. In addition, this specific maze has an extra property, namely that there's a locked door here, and first you need to pick up the key in order to open the locked door. So to reach the goal, the agent first needs to pick up the key, then open the door, then go to the goal, and for each of these it has to traverse many steps.

The idea in hierarchical reinforcement learning is that your agent, which is this entire box here, is divided into what's called a Manager and a Worker. The split works like this: the Manager (let me do an example here; they do it differently) sees the world only in these large chunks. It cares about what is in the chunks, but it doesn't distinguish points within a chunk; it just knows about these chunks. So the Manager will say: first I need to go to this chunk here, because the key is in this chunk, and then I need to go to this chunk here, because there's the door and there's the goal. In the view of the Manager, which has a very high-level view of the world, the action sequence is: go down here, then over here, then over here. Those are like three actions, so that's pretty simple. The Manager then passes this information to the Worker and says: hey Worker, please go to this first state. The Worker is then tasked with taking the individual steps to go not to the final goal but only to that chunk, and within that chunk the Worker goes to the key. Once it has the key, the Manager says: good job, now please perform the second action, which is go to this chunk here, and so on. So you get the idea that the Worker and the Manager work together: the Manager has a high-level view of the world, and the Worker executes the actual actions the Manager has decided on in a fine-grained way. This gives you several advantages: the Manager can plan over high-level, far-away things, while the Worker only has to care about its close neighborhood, because each step the Manager proposes is fairly short-range, so the Worker can implement it.

They do this in a somewhat different way, so let's actually start from the back of this paper, which I find is a bit more explanatory and makes it easier to see why the approach makes sense. What they propose is to learn a world graph.
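The Manager/Worker split described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the maze has no walls, the Manager's subgoals are hand-coded waypoints standing in for learned high-level actions, and the Worker moves greedily. All class and variable names are invented for this example.

```python
class Manager:
    """Plans over coarse subgoals; here, simply a fixed list of waypoints."""
    def __init__(self, waypoints):
        self.waypoints = list(waypoints)

    def next_subgoal(self, position):
        # Advance to the next waypoint once the Worker has reached the current one.
        while self.waypoints and position == self.waypoints[0]:
            self.waypoints.pop(0)
        return self.waypoints[0] if self.waypoints else None

class Worker:
    """Takes one primitive step toward the subgoal (open grid, no walls)."""
    def step(self, position, subgoal):
        x, y = position
        gx, gy = subgoal
        if x != gx:
            return (x + (1 if gx > x else -1), y)
        return (x, y + (1 if gy > y else -1))

# Usage: pick up the "key" at (2, 0), then reach the "goal" at (2, 3).
manager, worker = Manager([(2, 0), (2, 3)]), Worker()
pos, trace = (0, 0), [(0, 0)]
while (sub := manager.next_subgoal(pos)) is not None:
    pos = worker.step(pos, sub)
    trace.append(pos)
```

The point of the sketch is the interface: the Manager never emits primitive moves, and the Worker never sees anything beyond its current subgoal.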

World Graphs

What is a world graph? A world graph consists of two things. First, a set of states, the blue states here, which are so-called pivotal states, or important states. These are states in the world that are very important, as determined by some measure. If you look at where they are, they're often at narrow passages: you see, they sit at these narrow passes. Basically, if you reach one of those states as an intermediate goal, you can then go to a lot of places from there, so these are, let's say, powerful states. Second, these states are connected by a neighborhood graph, which records which of these states are close to each other. For example, here you would of course connect those two because they're neighbors, and you might connect those as well; I'm attempting to draw the world graph here. It doesn't need to be a tree; it can look like this. So you see the graph take shape; these states are fairly reachable from one another. Whenever one of these important states is fairly easily reachable from some other one, the two are designated as neighbors. With this world graph, you get an abstraction: a set of states with connections between them that say how easy or hard it is to get from one state to the other.

If you have these things, you can very easily imagine a hierarchical reinforcement learning algorithm that incorporates this information, namely one where the Manager only uses the important states to plan. For example, the goal isn't drawn in here, but let's say the goal is here, the locked door is here, and the key, let's say, is over here.
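As a sketch of what this abstraction buys you: with pivotal states as nodes and reachability as edges, high-level planning reduces to ordinary graph search. The graph below is hand-built for illustration (in the paper it is learned), and the node names are invented.

```python
from collections import deque

def bfs_path(graph, start, goal):
    """Shortest hop sequence between pivotal states (unweighted BFS)."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None  # no feasible traversal between the two pivotal states

# A toy world graph: adjacency lists over named pivotal states.
world_graph = {
    "origin": ["hall"],
    "hall": ["origin", "key_room", "door"],
    "key_room": ["hall"],
    "door": ["hall", "goal_room"],
    "goal_room": ["door"],
}
plan = bfs_path(world_graph, "origin", "goal_room")
```

Each hop in `plan` is short-range by construction, which is exactly what makes it executable by the Worker.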
All right, so what would the Manager do? The Manager, if it's only allowed to go to important states, would say: ah, the key is here, so this would be a good one of my important states to reach. Because it has the graph, it says: aha, this state is easily reachable from, let's say, this state and this state. So it plans: go here, then go here, then get the key (that last part is a kind of micro-action, not an important state). Then: I need to go over here; this is reachable from this state, that's reachable from this state, and that's reachable from my origin. So from the key: next go here, go here, then open the door, and then of course go here and solve the task. The Worker then only ever needs to implement the following: it starts here and says, aha, I need to go here, so for example down and over; and once I've done that, I need to go here, so right, down, right. You see, the Worker only ever has to care about going from one hop to the next, which makes things really easy for the Worker, while the Manager only has these blue states available, which makes its search space much more condensed and much easier to survey, especially with the edges between the nodes of the world graph.

So that's what you get if you have the world graph. If you have this set of states and know how easily reachable they are from each other, you can very easily build a hierarchical reinforcement learning approach that has the Manager plan on the world graph and the Worker implement the fine-grained actions. There is already a method that does this; the paper uses Feudal Networks, so we won't go into that here, just saying it's pretty easy once you have those things. So the real question is: how do they learn the world graph? They describe it in kind of this way. What they ultimately want to learn is a prior that tells them, for a given state, how important it is,
and that's a Beta prior; a Beta distribution is a continuous approximation of a kind of binary 0/1 variable. So how do they do it? They use an LSTM to encode trajectories; these are trajectories from rollouts of a policy. The LSTM encodes each trajectory, and for each step it outputs a posterior over what are called latent variables here, which say how important a state is. So these are the posteriors, whereas this over here is the prior. The posterior of course only makes sense in the context of a trajectory; that's why the ultimate decision happens in the prior, because a state needs to be important or not important independently of any particular trajectory. So what they do is roll out policies, and they have certain methods of doing this: they have random exploration and curiosity goals, but they also train this continuously via what's called a goal-conditioned policy.
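A hedged toy illustration of why a Beta distribution is a natural relaxation of a binary important/not-important variable: per-trajectory importance decisions can be accumulated as Beta pseudo-counts per state. This is not the paper's actual (learned, amortized) update; the functions and the threshold below are assumptions made for the sketch.

```python
def update_beta_priors(priors, trajectory_posteriors, threshold=0.5):
    """priors: state -> [alpha, beta] pseudo-counts, starting at Beta(1, 1).
    trajectory_posteriors: state -> importance probability from one rollout."""
    for state, p in trajectory_posteriors.items():
        a, b = priors.setdefault(state, [1.0, 1.0])  # Beta(1, 1) = uniform
        if p >= threshold:
            priors[state] = [a + 1.0, b]   # one more "important" vote
        else:
            priors[state] = [a, b + 1.0]   # one more "unimportant" vote
    return priors

def importance_mean(priors, state):
    """Mean importance under the accumulated Beta distribution."""
    a, b = priors[state]
    return a / (a + b)

priors = {}
update_beta_priors(priors, {"s1": 0.9, "s2": 0.1})
update_beta_priors(priors, {"s1": 0.8, "s2": 0.2})
```

States that are repeatedly flagged as important across trajectories drift toward a mean near 1, which is the trajectory-independent notion of importance the prior is meant to capture.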

Goal Condition

What a goal-conditioned policy basically means is that you put the agent somewhere in the maze; actually, let's use this maze over here. You put the agent somewhere in the maze, let's say here, and you, for example, do a random exploration that ends up here, so you know these two points are reachable from each other. Then you train the agent and say: go from here to here; this is your goal. Now the agent tries to kind of reconstruct that random walk to get there. So this is how you train an agent to go from any state to any other well-reachable state, from here to here and so on. Now, you won't train it to go directly from here to way over there, because it would be very hard for a random walk to find its way that far. But what you end up with is an agent that is able to reach close-by states, and that's exactly what the Worker is supposed to do. All of these trajectories you can then unroll, and use them to decide on the pivotal states.
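The goal-conditioned training scheme just described can be sketched as follows: random walks produce pairs of states known to be mutually reachable, and nearby pairs become (start, goal) supervision for the policy. The unbounded grid and the `max_gap` cutoff are illustrative choices, not the paper's.

```python
import random

def random_walk(start, steps, rng):
    """Random walk on an unbounded grid (no walls, for illustration)."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    pos, path = start, [start]
    for _ in range(steps):
        dx, dy = rng.choice(moves)
        pos = (pos[0] + dx, pos[1] + dy)
        path.append(pos)
    return path

def goal_pairs(path, max_gap=5):
    """Turn one walk into (start, goal) training pairs for nearby states."""
    pairs = []
    for i in range(len(path)):
        for j in range(i + 1, min(i + 1 + max_gap, len(path))):
            pairs.append((path[i], path[j]))
    return pairs

rng = random.Random(0)
walk = random_walk((0, 0), steps=10, rng=rng)
pairs = goal_pairs(walk)
```

The `max_gap` cutoff mirrors the point made above: only close-by goals are used, since a random walk is unlikely to demonstrate long-range traversals.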

Important States

So how do you pick the pivotal states? This is where this top part here comes in. Down here you input the trajectory, and the model outputs how important each state is. In this example, the light color means the LSTM decides a state isn't important, and the darker orange color means the LSTM decides a state is important. What you do next is take the states it decides are important (notice that the beginning and the end are always important) and feed them as input to a second LSTM; you see, here and here. So in this case, of the six states in the trajectory, three are important, namely the start, the end, and this one here where the LSTM decides, hey, that's important. Those go into a second LSTM, which is a generator: this here is an encoder and this here is a decoder. The decoder outputs a sequence of actions, given nothing but these important states, and at the end what you want is for the actions output here to reconstruct the actions that were input. This might sound a little confusing, but the core idea is: you want to reconstruct the actions of the trajectory given only the important states.

What does this mean in our example? It means: suppose I have to go from here to here, and for example I took the following path: right, down, right (this is the marked sequence). Now, if I only have the start, the end, and one state in between, let's say this one, can I reconstruct which actions were taken? If I erase the blue thing and tell you I went from here, via here, to here, then you could very much reconstruct the actions. So this state here is a good candidate for being an important state. Whereas if it were a different state, for example if I told you I went from over here to here and then to here, you'd say: well, this could be something like this, or it could be a path like this; there could be many paths through it. So that state is probably not very important. That's how they learn which ones are the important states.
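The reconstruction argument above can be made concrete with a toy scorer: a candidate state is a good pivot if the action sequence can be recovered from (start, candidate, end) alone. Here a fixed greedy decoding stands in for the paper's decoder LSTM; the actions and coordinates are invented for the example.

```python
def greedy_actions(a, b):
    """One fixed decoding: horizontal moves first, then vertical."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    acts = (["R"] * dx if dx > 0 else ["L"] * -dx)
    acts += (["D"] * dy if dy > 0 else ["U"] * -dy)
    return acts

def reconstruction_score(start, mid, end, true_actions):
    """Fraction of the true action sequence recovered from just 3 waypoints."""
    guess = greedy_actions(start, mid) + greedy_actions(mid, end)
    matches = sum(g == t for g, t in zip(guess, true_actions))
    return matches / max(len(true_actions), 1)

# True path: right, down, right, i.e. (0,0) -> (1,0) -> (1,1) -> (2,1).
true = ["R", "D", "R"]
good = reconstruction_score((0, 0), (1, 1), (2, 1), true)  # informative midpoint
bad = reconstruction_score((0, 0), (1, 0), (2, 1), true)   # ambiguous midpoint
```

The informative midpoint pins down the whole path, while the ambiguous one leaves several paths consistent with the waypoints, so its reconstruction score is lower.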

Conclusion

They encode trajectories in an LSTM and try to reconstruct the actions taken in each trajectory given only the states the LSTM deemed important; that's how you train the LSTM to recognize important states. Once you've recognized the important states in individual trajectories, you can use them to learn the prior: basically, you ask, over all possible trajectories, which states are generally important, and that's how you end up with these blue states.

The last part is to connect the blue states, and that is done fairly easily in their approach. They say: all right, we have blue states; we pick one and do a random walk from it. Random walk, random walk; if in the random walk we hit another blue state, like this one here, without hitting a different blue state first, we simply say: well, these two are probably neighbors, and we connect them in the graph. We do this a bunch of times. So these would be connected, and these would probably be connected, and we end up with what we had at the beginning: you have this graph, maybe these two are connected, and so on. This gives you the world graph, and now you have a set of important states and connections between them that tell you which ones are easily reachable from each other. You can then train the Manager on that, and the Worker, as we said before: simply select two close-by states and train it to go from one to the other; that's what the Worker will learn.

In essence, that's how they do it. You can look at the experiments themselves; they show that this transfers: if you pre-train like this, you can then give more specific and more complicated tasks, and this rapidly accelerates the learning of those. Yeah, look at the experiments if you have time. That was it for me; thank you for listening!
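The edge-construction heuristic from the conclusion can be sketched directly: from each pivotal state, run short random walks, and the first other pivotal state a walk hits is declared a neighbor. The grid layout, pivot set, walk count, and walk length below are all illustrative choices.

```python
import random

def connect_pivots(pivots, n_walks=50, walk_len=6, rng=None):
    """Link two pivotal states whenever a short random walk from one
    reaches the other before any different pivotal state."""
    rng = rng or random.Random(0)
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    pivot_set = set(pivots)
    edges = set()
    for p in pivots:
        for _ in range(n_walks):
            pos = p
            for _ in range(walk_len):
                dx, dy = rng.choice(moves)
                pos = (pos[0] + dx, pos[1] + dy)
                if pos in pivot_set and pos != p:
                    edges.add(tuple(sorted((p, pos))))
                    break  # stop this walk at the first pivotal state hit
    return edges

# (0, 0) and (0, 1) are adjacent; (5, 5) is too far for a 6-step walk.
edges = connect_pivots([(0, 0), (0, 1), (5, 5)])
```

The walk length acts as the reachability threshold: pivotal states farther apart than a short walk can cover simply never get an edge, which is what makes every edge a feasible short-range traversal for the Worker.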
