Divide-and-Conquer Monte Carlo Tree Search For Goal-Directed Planning (Paper Explained)


Yannic Kilcher · 08.05.2020 · 5,600 views · 206 likes


Video description
When AI makes a plan, it usually does so step by step, forward in time. But often it is beneficial to define intermediate goals that divide a large problem into easier sub-problems. This paper proposes a generalization of MCTS that searches not for the best next actions to take, but for the best way to sub-divide the problem recursively into problems so tiny that they can each be solved in a single step.

Paper: https://arxiv.org/abs/2004.11410
Site: https://sites.google.com/view/dc-mcts/home

Abstract: Standard planners for sequential decision making (including Monte Carlo planning, tree search, dynamic programming, etc.) are constrained by an implicit sequential planning assumption: the order in which a plan is constructed is the same in which it is executed. We consider alternatives to this assumption for the class of goal-directed Reinforcement Learning (RL) problems. Instead of an environment transition model, we assume an imperfect, goal-directed policy. This low-level policy can be improved by a plan, consisting of an appropriate sequence of sub-goals that guide it from the start to the goal state. We propose a planning algorithm, Divide-and-Conquer Monte Carlo Tree Search (DC-MCTS), for approximating the optimal plan by means of proposing intermediate sub-goals which hierarchically partition the initial tasks into simpler ones that are then solved independently and recursively. The algorithm critically makes use of a learned sub-goal proposal for finding appropriate partition trees of new tasks based on prior experience. Different strategies for learning sub-goal proposals give rise to different planning strategies that strictly generalize sequential planning. We show that this algorithmic flexibility over planning order leads to improved results in navigation tasks in grid-worlds as well as in challenging continuous control environments.

Authors: Giambattista Parascandolo, Lars Buesing, Josh Merel, Leonard Hasenclever, John Aslanides, Jessica B. Hamrick, Nicolas Heess, Alexander Neitz, Theophane Weber

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

Table of contents (10 segments)

Intro

Hi there. What you're seeing here is Divide-and-Conquer Monte Carlo Tree Search in action. This is a planning algorithm that plans in a kind of unconventional fashion, and we're going to explore it today through this paper, "Divide-and-Conquer Monte Carlo Tree Search for Goal-Directed Planning", by Giambattista Parascandolo, Lars Buesing, and a list of other authors. I believe this is from DeepMind and the Max Planck Institute, among others. All right.

What is planning

Right, so what does this thing do? It is a planning algorithm, and planning might not be familiar to you, so let's say you are in this set of rooms right here. There's a bunch of walls, you are up here, and you want to reach the goal down here. First of all, this is a goal-directed problem. Goal-directed means that you give the algorithm a goal to reach, and this could be a different goal each time you run the algorithm.

The second thing we see here is planning. What does planning mean? In traditional reinforcement learning, you would think: I'm just going to go ahead and run my agent, which can maybe move in the four different directions; it does some things, maybe hits a wall, gets a negative reward, and tries again. In planning, you don't have to move initially. What you can do is think about moving: you can think ahead of what's going to happen, usually because you have some sort of model of what happens. This is famously applied in, for example, AlphaGo or AlphaZero, where you know: if I move to the right here, I'm going to be here; if I'm here and I go down, I reach the goal. You can think through all of this without actually moving in the environment. That in turn means you can think ahead along multiple different paths: what would happen if I moved right, or down? And in the next layer: if I moved right, what would happen if I moved right again, or down instead? You can easily see that the planning problem becomes a tree search problem. In this case we've done a breadth-first search, and eventually this will get you to the goal; a breadth-first search, or a depth-first search if you prefer, will ultimately get there.

We can represent this as a search tree. You're in a particular state and you have a bunch of actions, in this case four: move left, up, down, or right. You can choose any of them and get into a new state, from each of which you can choose again, and so on. You can construct this entire tree, and one of these branches will lead you to the goal. But here is the problem: if the path to the goal is, let's say, D steps long, then this tree is going to be D layers deep, and in our case that means we'll have 4^D nodes in the tree before we even reach the goal. That is just a big tree, and we can't construct all of it.

So algorithms have come along, for example the A* algorithm, where you can incorporate a heuristic. The heuristic would be: it's generally good to be close to the goal in L2 distance. So you would not build the entire tree, but you would prefer nodes that score well on this heuristic. This node down here is closer to the green node in terms of L2 distance, and this one down here is even closer, but then you're kind of stuck, so A* will explore a bit, probably along this wall here, and once you're past it you have a clear path again and can simply take the actions that minimize the L2 distance. This already gets you to a really good point. Monte Carlo tree search, as it was employed in AlphaGo or AlphaZero, has a similar structure: after a certain while it stops and evaluates a heuristic to estimate the value at that point, and so on. For some problems it is an even better way of constructing the search tree without getting overwhelmed by the numbers. The "Monte Carlo tree search" in this algorithm's name refers to the fact that we are generalizing the Monte Carlo tree search from, let's say, the AlphaGo paper. So far, everything we've seen is standard; so what's the idea?
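As a concrete reference point for the forward planners mentioned above, here is a minimal A* sketch on a toy grid with an L2 (Euclidean) heuristic. This illustrates the classic algorithm, not code from the paper; the grid, names, and layout are all made up.

```python
import heapq

def astar(grid, start, goal):
    """A* over a grid of 0 (free) / 1 (wall) cells with an L2 heuristic."""
    h = lambda p: ((p[0] - goal[0]) ** 2 + (p[1] - goal[1]) ** 2) ** 0.5
    frontier = [(h(start), 0, start, [start])]   # (f = g + h, g, node, path)
    seen = set()
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        r, c = node
        # expand the four neighbours: up, down, left, right
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                heapq.heappush(frontier,
                               (g + 1 + h((nr, nc)), g + 1, (nr, nc), path + [(nr, nc)]))
    return None   # goal unreachable

grid = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
path = astar(grid, (0, 0), (2, 2))
```

Because the Euclidean heuristic never overestimates the remaining step count, the path returned is a shortest one; the search simply prefers cells that look closer to the goal, as described above.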
The idea is the following. If I had an oracle, if I were the master here, I could tell the agent: look, I guarantee you that you will pass through this state right here in the middle if you want to reach the goal; you will pass it for sure. If I tell this to the agent, what can the agent do? The agent can say: okay, if I know that, I don't have to search for a way to the goal directly. I'd much rather search for a way from my start point to that point where I'm guaranteed to be, and then separately search for a way from there to the goal. Remember, our path was D steps long for the original problem; each of these two paths is now, say, D/2 long, and that means we construct two trees, each of which has 4^(D/2) nodes. Even if we add them together, that is much smaller than the original 4^D tree we would have built. So right there we have subdivided our problem into two sub-problems, and each of them is much easier than the original problem.

This paper basically does this recursively. It subdivides the problem by proposing some middle state where you are going to be at some point for sure (and that "for sure" we're going to take a look at), and then for each of those sub-problems it again tries to subdivide into sub-problems, recursively, until they're small enough that they can basically be solved in one step. That's the big idea, and it is illustrated right here: you are in s0, the start state, and you want to go to s∞, the goal state. What this paper does is propose to split the problem in the middle and then simply solve the two halves recursively.

Now, what is a bit confusing is that a plan is itself already a tree, but we are searching over plans. We're searching for the best plan, which means we are searching over trees, and that search itself is a tree search: we go down one route, then maybe another route, and so on. So we're now tree-searching over trees; that's the tricky thing to remember. Each of these plans, even a half-finished one like this, where we don't know yet what's going to happen in here, is a node in the tree that we're searching over. When it splits, as you can see here, into two sub-problems, those two sub-problems are also nodes in that search tree. So this top thing here corresponds to this node, even though in itself it is a plan, a tree; and each of the two sub-problems becomes one of these nodes in the search tree. Keep that in mind as we go through this paper; it is very easy to get confused in this respect.
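A quick back-of-the-envelope check of the size argument above, with an assumed branching factor of 4 and a plan length of D = 16 (these particular numbers are my own, for illustration):

```python
b, D = 4, 16                 # branching factor, plan length (assumed for the example)
full = b ** D                # nodes in one forward search tree of depth D
halves = 2 * b ** (D // 2)   # two trees of depth D/2 after a single perfect split
print(full, halves)          # 4294967296 vs 131072
```

One guaranteed midpoint already shrinks the node count by four orders of magnitude here, and recursing on each half shrinks it further still.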

The algorithm

The algorithm is pretty simple. It rests on this TRAVERSE procedure: we're going to traverse what they call OR nodes. They divide the search into AND nodes and OR nodes, which I don't believe is particularly necessary for us to dwell on, but here's how it works. They traverse the OR node (s, s'). Again, this is a node, but the node represents a task: get from s to s', where we don't know yet what happens in between. So we run this SELECT procedure, and SELECT outputs a state that is going to be a node somewhere in the middle, where the model says: this is your subdivision point. Then the procedure recursively traverses the left and the right branch of this tree. It subdivides the problem into two problems and then recursively calls this same TRAVERSE function (that's the function we're defining) on each of them. For each of the sub-problems it again proposes a middle node and subdivides further, and so on, until you have a full plan.

Here again is the important thing to remember: this is just one branch of the search, one possible plan, and we are going to do a tree search over these plans. This SELECT function returned one particular state, but it could have returned any point between s and s''. This is just one branch; I don't have space to draw here, so I'm going to draw it down here again. It could also have returned this other node, a different middle state, and then subdivided the problem into different sub-problems, and of course those would in turn be subdivided differently again, and so on. If you consider this thing here your root node, the place you search from, then this top part is just one branch in the tree; we could also subdivide like this, and that would be another branch, and this tree here is the thing that you're searching over. It's important to keep this in mind: we're searching over these different possibilities.

The rest of the algorithm is basically carried over from Monte Carlo tree search, and I don't want to go into that in this video, but if you're interested in how to actually implement this, you'll have to look at MCTS: you keep estimates of values and visit counts in this tree and so on, and you also have some sort of value estimator. All of that carries over; I'm mainly concerned with how the tree is constructed. So here is the difference between Monte Carlo tree search and Divide-and-Conquer Monte Carlo tree search.
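The TRAVERSE/SELECT recursion can be sketched roughly like this. This is a toy reading of the procedure, not the authors' implementation: the real DC-MCTS searches over many candidate splits with MCTS-style visit counts and value estimates, while this sketch follows a single deterministic branch. The function names and the 1-D example are illustrative only.

```python
def traverse(s, s_goal, select, solvable, depth=0, max_depth=10):
    """Recursively split the task (s -> s_goal) into sub-tasks.

    select(s, s_goal)   -> one proposed intermediate sub-goal
    solvable(s, s_goal) -> True if the low-level policy can do it in one go
    Returns the ordered list of sub-goals between s and s_goal, or None.
    """
    if solvable(s, s_goal):
        return []                       # leaf: no intermediate sub-goal needed
    if depth >= max_depth:
        return None                     # give up on this branch
    s_mid = select(s, s_goal)           # propose a subdivision point
    left = traverse(s, s_mid, select, solvable, depth + 1, max_depth)
    right = traverse(s_mid, s_goal, select, solvable, depth + 1, max_depth)
    if left is None or right is None:
        return None
    return left + [s_mid] + right       # in-order traversal = execution order

# Toy 1-D example: states are integers, the policy can only move one step,
# and SELECT always proposes the exact midpoint.
plan = traverse(0, 8,
                select=lambda a, b: (a + b) // 2,
                solvable=lambda a, b: abs(b - a) <= 1)
print([0] + plan + [8])   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Note how reading the finished partition tree in order recovers a plain sequential plan, which is exactly the sense in which this generalizes sequential planning.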

Finding the next action

In Monte Carlo tree search (ignore the yellow square for now), you're in the green position and you want to go to the blue position. What you're searching over is the next action to take; in this case you have four possible actions.

Building your search tree

That's what you're searching over, and that's what you build your search tree from: your search tree is over which action to take, up, left, down, or right. That's why each node has four children.

Search over subproblems

In Divide-and-Conquer Monte Carlo tree search, you're not searching over actions; you are searching over the best way to subdivide the problem. You're searching over which of these black squares to use to split the problem into sub-problems, and that's what you build your search tree from. So naturally, what possibilities do we have here? I drew one white square, but any of the black squares is a candidate subdivision point, and this is what we search over. In Monte Carlo tree search we search over the actions, which gives us the 4^D tree, but in divide-and-conquer we're searching over all the ways to subdivide the problem, and as you can see there are many more possibilities: from this first starting node we have maybe a hundred ways to split the problem into two sub-problems. And the same holds again for each of those sub-problems.

Subdivide

Once you've decided on a subdivision, say this one right here, you're saying: I want to pass through that point on my way to the goal. Now you have to subdivide each resulting sub-problem into two problems again, and every black square is again a possible candidate. I'm not saying which one is a good place to subdivide; I'm just asking which are possible candidates: every single black square, for a path from here to here. And for each sub-problem you repeat the same thing. So even though the search tree, as we said before, is only about log2 D deep instead of D deep, its width is going to be enormous.

The Catch

And that is the catch: this method is not a magic pill. Even though your tree is not as deep, it is much wider, and that is intuitive, because you still need the ability to construct any possible plan. Your tree is going to have as many nodes as the original Monte Carlo search tree if you were to fully expand it; you're trading depth for width here. I hope that's a bit clearer. So the entire promise of this method is this: out of all these possibilities, you don't even want to search one layer deep through all of them, you don't even want to consider all of them. You want to limit your search to very particular ways of subdividing. If you can pick candidate subdivisions efficiently, this can be a successful approach, because your tree is not as deep as the original search tree would have been and you can effectively limit the width to only very few candidates. For example, we could make a heuristic that only ever picks squares lying roughly on the straight path to the goal. Everything rests on this SELECT action right here: the entire algorithm relies on being able to effectively select in-between states that the agent will almost surely have to pass through, and the worse you make these predictions, the worse your search is going to work.

Deep Learning

What they do, of course, as you might have guessed, is use deep learning for this. They have a model that, for a particular start and goal, gives them a probability distribution over candidates. Everything that's black here also has probability mass, but it's just so small you can't see it; for the blue squares, the lighter the blue, the more probable the model thinks it is that this is an in-between state. The tree search can now effectively limit itself to the candidates with the highest probability: we select those, and we only search plans that have these as the first subdivision. Again, we're searching over plans, over ways to subdivide the problem into smaller problems; that is our search space. Once we've decided on one of them, say the yellow one, we have to evaluate the model again for each of the two sub-problems (this step is kind of missing in the figure): the model again tells us which in-between states are probable subdivision candidates, and we select one of those in that particular search branch. In a different search branch we would select a different one and see whether that is possibly a better way to subdivide the problem, and so on. The question, of course, is how to train this model.
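The width-limiting step can be sketched as follows. Note this is a stand-in: the scoring "model" here is just a softmax over negative detour length, not the paper's learned proposal network, and all names are made up. Only the top-k pruning idea is the point.

```python
import math

def proposal_probs(s, goal, candidates, temperature=1.0):
    """Assign each candidate sub-goal a probability (dummy heuristic model)."""
    def detour(c):
        # length of the straight-line route through candidate c
        return math.dist(s, c) + math.dist(c, goal)
    logits = [-detour(c) / temperature for c in candidates]
    m = max(logits)                          # shift for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def select_top_k(s, goal, candidates, k=3):
    """Keep only the k most probable sub-goal candidates for the tree search."""
    probs = proposal_probs(s, goal, candidates)
    ranked = sorted(zip(candidates, probs), key=lambda cp: -cp[1])
    return [c for c, _ in ranked[:k]]

# Candidate sub-goals: every cell of a 5x5 grid except start and goal.
cands = [(x, y) for x in range(5) for y in range(5) if (x, y) not in ((0, 0), (4, 4))]
top = select_top_k((0, 0), (4, 4), cands, k=3)
```

With this dummy scorer, the surviving candidates are the cells on the straight line between start and goal, which is exactly the narrowing of the search width described above: only a handful of plans, those starting from these subdivisions, ever get expanded.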

Training

How do you train a model that gives you good candidates for subdivision? The answer comes from the idea of hindsight experience replay. Let's say again you were here and you want to go here, and you're not very good at it initially. As I understand it, they train this model while letting the agent act in the environment: the agent uses the model to plan, but initially the model is not very good, so the agent will fail a lot of the time. Instead of reaching the blue square, it will reach this white square right here. Now, instead of saying "I failed", the idea of hindsight experience replay is to say: well, I did fail, but I did reach something, and that thing could itself have been the goal. In this particular episode it wasn't, but remember, the goal changes every time; it's a goal-directed policy. So if I just pretend that what I reached was the goal, then I have a training example for a successful run. Hindsight experience replay pretends that whatever you achieved, even if you failed, was your actual goal, and that gives you a positive training example for an episode with that as the goal. And it genuinely could have been the goal, because the goal is chosen at random, so this gives you a legitimate training example.

This paper generalizes hindsight experience replay, or rather applies it to their particular framework. They say: if I reached this state, then any point on the path I took is a good candidate for subdividing the path, because I actually passed through it. Remember, the aim is to propose a point you will pass through for sure; since I took this path, I passed through every square in between, so those are my possible sub-goal candidates, and all the other black squares are not. Now I have a classifier I can train: any square on my path is a good square to subdivide at, and any square not on my path is a bad one. They go a step further, I believe: if the path was M steps long, they take the particular square reached after M/2 steps, so the exact middle point of the path becomes the training target for subdivision, giving the classifier exactly one target per example. You train this along with acting in the environment, and your model for proposing subdivisions gets better and better, which makes your planner better, so you collect better episodes, and you can bootstrap yourself up.

This is the basic experiment of the paper. They also do this in a 3D setting, where they move this little spider around: the spider was trained just to move from one block to the next, and the planner basically has to tell it where to go. They show that they outperform traditional Monte Carlo tree search. Now, I have to say this is cool, but you have to remember it is only advantageous in very specific types of problems. First, the problem has to have this goal-directed nature, otherwise you probably couldn't train this predictor well. Second, given such a predictor, the problem needs to be such that from a start state there are many ways to reach the end, and to an end state there are many ways to arrive, but there is some bottleneck state in the middle that you're pretty sure you'll have to pass through. If your problem has these bottleneck states, where you can predict with reasonable accuracy that you'll pass through them, then this is a good algorithm to consider, and it intuitively outperforms the original Monte Carlo tree search, because your search tree is much less deep.
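My reading of this relabelling scheme, as a sketch (the trajectory, helper name, and return format are made up, not the authors' code):

```python
def hindsight_example(trajectory):
    """trajectory: list of visited states, trajectory[0] is the start.

    Returns (start, pretended_goal, subgoal_target): the state the agent
    actually reached is relabelled as if it had been the goal, and the
    state at the middle of the realized path becomes the positive training
    target for the sub-goal proposal model.
    """
    start, reached = trajectory[0], trajectory[-1]
    midpoint = trajectory[len(trajectory) // 2]
    return start, reached, midpoint

# Toy failed episode: the agent aimed for some goal but ended up at (3, 4).
traj = [(0, 0), (1, 0), (1, 1), (2, 1), (3, 1), (3, 2), (3, 3), (3, 4)]
start, goal, subgoal = hindsight_example(traj)
print(start, goal, subgoal)   # (0, 0) (3, 4) (3, 1)
```

Every episode, successful or not, thus yields one supervised (start, goal, sub-goal) triple, which is what lets the proposal model improve from the agent's own imperfect behaviour.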
And you can effectively limit its width by using that model. They have also made this website where they show videos of their spider; you can see this is a continuous control problem that also requires planning, and they have these GIFs showing the order in which their plans are constructed. I invite you to check it out and read the paper. If you liked this, subscribe, leave a like, leave a comment, and thank you for listening. Bye.
