Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping (Searchformer)
44:04


Yannic Kilcher · 06.04.2024 · 36,964 views · 1,042 likes


Video description
Paper: https://arxiv.org/abs/2402.14083 Abstract: While Transformers have enabled tremendous progress in various application settings, such architectures still lag behind traditional symbolic planners for solving complex decision making tasks. In this work, we demonstrate how to train Transformers to solve complex planning tasks and present Searchformer, a Transformer model that optimally solves previously unseen Sokoban puzzles 93.7% of the time, while using up to 26.8% fewer search steps than standard A∗ search. Searchformer is an encoder-decoder Transformer model trained to predict the search dynamics of A∗. This model is then fine-tuned via expert iterations to perform fewer search steps than A∗ search while still generating an optimal plan. In our training method, A∗'s search dynamics are expressed as a token sequence outlining when task states are added and removed into the search tree during symbolic planning. In our ablation studies on maze navigation, we find that Searchformer significantly outperforms baselines that predict the optimal plan directly with a 5-10× smaller model size and a 10× smaller training dataset. We also demonstrate how Searchformer scales to larger and more complex decision making tasks like Sokoban with improved percentage of solved tasks and shortened search dynamics. 
Authors: Lucas Lehnert, Sainbayar Sukhbaatar, Paul Mcvay, Michael Rabbat, Yuandong Tian Links: Homepage: https://ykilcher.com Merch: https://ykilcher.com/merch YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://ykilcher.com/discord LinkedIn: https://www.linkedin.com/in/ykilcher If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannickilcher Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Table of contents (9 segments)

Segment 1 (00:00 - 05:00)

Hello there. Today we're going to look at Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping, by FAIR at Meta. This paper teaches a language model how to perform planning, in the sense that it teaches it how to think about a planning problem and what steps you have to do to plan ahead. The applications in this paper are game applications, for example this Sokoban puzzle, where this person has to push all the crates into their respective places: every crate needs to go to a marked space on the floor, but you can only push, you cannot pull, and thus if you push, for example, this crate down here, the game is over, because you have no way of getting it back. So planning ahead is really important in these situations, and in many real-world situations. This paper tackles planning using Transformers, or planning using language models in general. We'll go over how they do it — they do it via bootstrapping from a planning algorithm called A* — and what kind of results they have. At the end they're going to be at a place where they have this model called Searchformer. That model has not only learned to plan and output optimal plans, but it has learned to do so in fewer search steps than standard A* would take to output optimal plans, and that's pretty cool, so we'll see how they do that. To be said: we've discussed this paper in our Saturday paper discussions on Discord. Almost every Saturday we have paper discussions on Discord where we talk about different papers and look at them in depth, so a lot of the things I'm going to say right here are actually sourced from the community. If you want to attend paper discussions, or even present something to the group, feel very free to come; it's not recorded, so any stupid questions can be asked freely without any consequences whatsoever. There are, by the way, even almost daily paper discussions going on on Discord at various times, so it's a very lively place. All right, let's dive in. Planning — what is planning? It's very related to something like reinforcement learning, but it's also a little bit different; it's kind of the same and kind of different. What you do in planning is you have a start and you have a goal. Let's say this is a maze task — maze tasks are very often used in planning just to illustrate. In a maze task you can move laterally or up and down, and you will have to find a path from the start to the goal with these steps, but the challenge is that there are some walls where you cannot go. So there might be a wall here and a wall here, and you'll have to somehow find a way around those walls. Now, we could just go and train a reinforcement learning agent, and every time it crashes into a wall we say, no, that's not good, start over, and so on. That's workable in a game scenario where we can restart the game, but it's not workable in a scenario where we're actually in the real world and can't restart: once we crash into a wall, we actually get hurt, and thus we don't want to do that. What we want to do is plan ahead. So planning is like doing this, but in your mind. In your mind you walk ahead: okay, what happens if I do this step? Okay, I would be here. What happens if I go there? I can't do that. Okay, so maybe I consider going here — oh, that might work out — and I'm going here, this looks good, going here — a bit too far, so I can backtrack, go here. Now, the backtracking isn't actually doing steps back; it just says: okay, if I reach here, that might be a bit too far, so let me go back in my thinking process to this point here and, in my mind, choose a different path to go along. So you can see that planning is a lot like acting in the world, except it's in your mind, and the only difference is you can sort of at any point choose a different path

Segment 2 (05:00 - 10:00)

in the past. So if you go down one path and you don't like it, then you can just say: well, okay, how about we don't do that, how about we go down here — and you cannot do that in the real world, because you would have to actually do the steps back until you're in this state again. But in a plan, in a fictitious world, you can, and that's the entire difference. So you can immediately see why that is super useful, especially in scenarios where going down one path could lead you into a situation where you can't go back anymore, because you've pushed a crate against the wall and you can't pull it anymore. So I hope you can see the difference a little bit — it's not huge, but there is a difference. The other difference is that the output of a reinforcement learning process is, let's say, a policy, and the goal there is to maximize reward using the policy in the world. The output of a planning process is a plan. The most basic plans are just the actions that you are going to take: first go down, then go to the right, then left, down, right — the output is the plan itself. Now, planning can be infinitely complex. For example, you can have conditional plans: you can say, well, go down, go right in case there is nothing here — which maybe you don't know ahead of time — in case there is nothing here go there, in case there is something there go around it. That could be a plan, so the plan could be the entirety of if-this-then-that and so on; the plan can be conditional, the plan can be probabilistic, and so on. In this work we only consider very simple scenarios where the plan is, in fact, the steps you're going to take in order to reach the goal. The precondition in our setting is to have a model of the world. That's also convenient in games, because if I give you the situation — you're here, here's the goal, and here are the walls — you know exactly what's going to happen if, for example, you make a step down: you will be in this situation right here. You know exactly what happens if you make another step down: you can't do that. So having an accurate world model gives rise to a subfield of planning that is quite a bit easier than the settings where you don't even have an accurate world model, where you have to plan ahead in an abstract or learned world model. So I just wanted to set the framing for planning a little bit, because people tend to confuse it with reinforcement learning, which is fair, because they're very similar — except reinforcement learning actually acts in the world, and planning doesn't; planning just mentally acts in a fictitious world that could be an accurate or non-accurate representation of the real world. In this case it's an accurate representation. All right, going back. What makes it even more difficult is that what we're going to do is actually some sort of learning, but in the planning space: we'll consider the planning dynamics as the real world, and there we'll actually do a learning procedure. Sorry — you might be more confused now than you were at the beginning, but suffice to say, LLMs usually struggle when it comes to solving planning and reasoning tasks. People know that LLMs are really good at creative writing and so on, but once you give them a task where they need to do multiple steps of reasoning, or look ahead at how to reach a goal and plan that out, they're usually not that good. People have tried Chain of Thought and Tree of Thought and whatnot, but they're more crutches than anything else, and the authors say that in many cases these techniques lead to worse performance, for example due to self-reinforcing. On the other hand, existing traditional planning and search techniques usually work really well. We have planning algorithms from simple to sophisticated: if you give them a planning problem — and
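The "accurate world model" the video describes can be made concrete with a tiny sketch. The grid size, wall positions, and function names below are illustrative assumptions, not from the paper: a deterministic step function that tells you exactly what state each action leads to, which is the precondition for planning in your head.

```python
# Minimal deterministic "world model" for a grid maze (illustrative sketch).
WALLS = {(1, 1), (2, 1)}                  # blocked cells (made up for this example)
WIDTH, HEIGHT = 4, 4

MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Return the next state, or None if the move hits a wall or leaves the maze."""
    dx, dy = MOVES[action]
    nx, ny = state[0] + dx, state[1] + dy
    if (nx, ny) in WALLS or not (0 <= nx < WIDTH and 0 <= ny < HEIGHT):
        return None                       # invalid move: a planner would backtrack here
    return (nx, ny)
```

Because `step` is exact, a planner can "mentally" roll moves forward and undo them at zero cost, which is precisely the difference from acting in the real world.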

Segment 3 (10:00 - 15:00)

the planning problem you define in the form of a graph. So what we saw up here could be excellently displayed as a graph — I'll draw it again. Every place that you could possibly be is a graph node, and if I can reach one node from another with a given action, I have a directed, possibly even weighted, edge between the graph nodes. Planning in this case just reduces to a search for the shortest — or, respectively, lowest-cost — path through the graph, and in a simple case I can implement that with something like a breadth-first search. So planning is a well-established discipline, and there is a bit of a discrepancy between the power of large language models and the simplicity of planning algorithms that still do much better than LLMs at planning stuff. All right, so what they do is they say they demonstrate how to train Transformers to solve complex planning tasks and present Searchformer, a Transformer model that computes an optimal plan in fewer search steps than symbolic planning algorithms such as A* search. Okay, I have to preface all of this by saying that what they're going to do, in the simplest case, is take this problem and let a classic planning algorithm run. The classic planning algorithm goes: okay, let's go here; okay, let's consider this — I can't do this; okay, let's go here; let's consider this — can't do this; let's go here; let's go here — not sure if that's still right; let's maybe try up here — no, that seems wrong; okay, let's maybe go further down here — ah, okay, now we've reached the goal. This whole sequence of what I just did — oh, maybe here; okay, that seems good, that seems not good — that's a sequence of steps. Every step, including the ones where I say, oh no, let's not go here, let's go somewhere else, or where I say, not sure if this is right, let's consider something up here — all of this can be expressed as a string of language, like I just did. I just sequentially did a few actions, which in turn can be made into a sequence of tokens — for argument's sake: let's go down, let's go down, oh no, we can't go here, let's go right, and so on. Those are tokens, and I can build a language model across this — this is a token, and it's actually the same token each time it appears — so I can define a vocabulary over this language, and then I can train a language model on it to learn exactly this language. And then the only point is: if I do this for many different maze tasks in different variations, and I always encode the situation in language, then the planning algorithm's steps in language, and then at the end the plan itself — what I actually want to do — in language, that defines a sequence of tokens, and I can teach a language model to take in this language and output the same. So if next time I only feed in the description of the problem, it can itself output the search dynamics and the final plan, and that's it. So when they say "we demonstrate how to train Transformers to solve complex planning tasks" and "computes an optimal plan in fewer search steps than symbolic planning algorithms such as A* search", that's what they mean: they train a Transformer to mimic a planning algorithm, and then they have a technique to actually reduce the number of steps. That being said, I have a problem with them saying "computes an optimal plan", because these are language models, and they're subject to all the shortcomings of language models: they're stochastic, they are prone to hallucinations, and so on. So whether or not what comes out down here is an
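The graph view of a maze, and the breadth-first search the video mentions, can be sketched in a few lines. The maze layout and function names are illustrative assumptions: each cell is a node, legal moves are edges, and BFS finds a shortest path.

```python
from collections import deque

WALLS = {(1, 1), (1, 2)}                  # illustrative maze, not from the paper
SIZE = 4

def neighbors(cell):
    """4-connected grid neighbors that are inside the maze and not walls."""
    x, y = cell
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= nx < SIZE and 0 <= ny < SIZE and (nx, ny) not in WALLS:
            yield (nx, ny)

def bfs_plan(start, goal):
    """Breadth-first search over the implicit graph; returns a shortest path of cells."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:                  # reconstruct the path by walking parents back
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in neighbors(node):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None                           # no path exists
```

Every pop, push, and dead end inside this loop is exactly the kind of internal step that, serialized as tokens, becomes the "language" of search the video describes.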

Segment 4 (15:00 - 20:00)

optimal plan is completely non-guaranteed. They measure it experimentally, and they find: yes, actually, in many cases what comes out at the end is an optimal plan, and that's fine, but there is no guarantee that what comes out is even a valid plan, let alone an optimal plan — so keep that in mind as we go through here. They build a data set of verifiably both valid and optimal plans, and then they train a language model, but what comes out just mimics the style of planning; there's no guarantee that it actually results in true plans. And you may think, oh, well, that's kind of disheartening — but the paper's idea isn't to build the best planner in the world. The paper's idea is: what happens if we teach a language model how to think about planning problems? And that's the part in the middle. The part at the beginning is the problem, and the end is the actual plan that the model outputs — the solution, if you will. Now, what we have previously done is always just input the problem and then train the model to output the solution. That's kind of the status quo, and there are some crutches in between — people have discovered, well, what if we append "let's think step by step" or something like this — and this paper is, not necessarily a formalization, but a more rigorous implementation of something like this. It's the question: what if we teach a model how to think about planning problems, meaning the part in the middle — we explicitly teach it, look, here are the steps you need to do in order to solve a planning problem. And the main question is: will it help? Will it then more often output an optimal plan down here than if we just train it to go from problem statement directly to output? Okay, so this is all prefacing what the paper does. At its core, it's not trying to solve planning, or build the best planning algorithm, or even claim this is the way to do planning with LLMs. No, this is simply answering the question: what happens if we explicitly teach a language model how to think about planning problems — do planning problems become more accessible to it? Because if yes, then we can say something about, or against, the people who say: oh no, planning requires truly symbolic thinking, this is totally out of the reach of language models, they will never be able to do it, this is totally different from next-token prediction. So this is where the paper sits. Okay, I already mentioned most of this in my rant just now, so let's look at some actual things. As I said, this here is one of these maze tasks: you have a start cell and you have a goal cell, and a plan — this here, the output of the algorithm, is what you're going to do, and the entire in-between is how you're going to reach the plan itself. The prompt is the situation: you can see there is a start tile, there's a goal tile, and there are wall tiles. That being said, this example here, I'm pretty sure, is kind of wrong. The paper, for one, is inconsistent with itself: this notation does not match the notation in the appendix — they kind of swap things around — and even the example here doesn't actually match the situation as it's described. So if you read this paper yourself, don't be confused; that's one of the things I think we discovered in our discussion on Saturday. So this is the input of the planning algorithm and this is the output: a plan, what I will do in order to reach the goal. And as I said, we must somehow go from input to output. In the classic sense we do this with a planning algorithm — that would be the A* algorithm — and in the new age we would do this with deep learning: we would feed the input and
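Since the model's output is "completely non-guaranteed", plans have to be checked externally. A minimal sketch of such a validity check, with an illustrative maze and made-up function names (the paper measures validity and optimality, but this is not its code):

```python
# Replay a (possibly model-generated) plan in the exact world model and verify it.
WALLS = {(2, 0), (2, 1)}                  # illustrative maze
SIZE = 4
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def execute_plan(start, goal, plan):
    """True only if every step is legal and the final state is the goal."""
    x, y = start
    for action in plan:
        dx, dy = MOVES[action]
        x, y = x + dx, y + dy
        if (x, y) in WALLS or not (0 <= x < SIZE and 0 <= y < SIZE):
            return False                  # walked into a wall or off the maze
    return (x, y) == goal
```

Optimality can then be checked by comparing the valid plan's length against the cost of a reference A* solution for the same maze.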

Segment 5 (20:00 - 25:00)

train it to output the output — end-to-end learning, right? And this paper asks, as I said: what happens if we also explicitly teach it what the intermediate process should look like? So what's the intermediate process? It's something like this: after we provide the input, and here is the output, we also want to produce this inner thing here, and this inner thing is what they call the execution trace of A*. A* is a planning algorithm, and it pretty much does what I did before: can I go here? Ah, this looks good. No, this doesn't look good, let's go back, and so on. It does this by manipulating sets of closed search nodes and frontier nodes, and we'll go over in a bit exactly how A* does what it does. It's a very classic algorithm, so if you already know it you'll follow easily, and if you don't: it's just some algorithm that internally does some steps, and it's important to consider the steps it does internally. The trace then leads to the plan. It's different from the plan: the trace is just how the planning algorithm goes about computing the plan. Now, what's interesting is that the plan can be optimal or not optimal, and, for example, A* under certain assumptions is guaranteed to produce an optimal plan, meaning a plan that reaches the goal with the lowest cost or shortest distance. Maybe there are multiple optimal plans for the same problem, but A* will certainly find one of them. How it finds it — the trace — is where planning algorithms differ: even if all of them are guaranteed to find you an optimal plan, how long and how complicated the trace is differs from planning algorithm to planning algorithm, and even within the same planning algorithm it can differ, due to tie-breaking and so on — the order in which you do things decides whether the trace is long or short. A breadth-first search and A* are both guaranteed to give you the shortest path in a maze task — at least I think so — but they will have really different execution traces. And the reason people use different planning algorithms, even though all of them give an optimal plan, is exactly that some planning algorithms manage to reach an optimal plan way sooner, with way fewer internal steps, than others. So our goal is going to be: can we train a Transformer to go from input to correct output — and by correct output we mean an optimal plan, or a nearly optimal plan for big problems — with the shortest possible execution trace in between? So now would be a good time to see what A* actually does. If we go to the A* article on Wikipedia: the A* algorithm is essentially Dijkstra's algorithm with a heuristic. You can see right here that the blue is always the current best candidate for the optimal plan, red is everything that's been explored already, and green is what's called the frontier. In every step the algorithm chooses one of the green points — it could even be one of the green points at the top — then considers all the steps it could take from there and explores them, and then it considers the next green point. And you can see: if you were to do a breadth-first search here, you can probably imagine that, for example, the left-hand side and the top of this picture would also be explored, yet this A* algorithm somehow knows how to go towards the goal, because the goal is at the bottom right. And that is the advantage A* has over really classic graph searches: it can go in the direction of the goal, and it does that by having what's called a heuristic. So this is the last thing we'll discuss about A*: what A
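The closed set, frontier, and execution trace the video describes can be sketched as a compact A* over a grid. The maze, the event names ("create"/"close"), and the trace format are illustrative assumptions in the spirit of the paper's description, not its actual token vocabulary:

```python
import heapq

WALLS = {(1, 1), (2, 1), (3, 1)}          # illustrative maze
SIZE = 5

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def astar(start, goal):
    """A* over a 4-connected grid, recording an execution trace of
    'create' (node pushed to the frontier) and 'close' (node expanded) events."""
    frontier = [(manhattan(start, goal), 0, start)]
    g = {start: 0}
    parent = {start: None}
    closed = set()
    trace = [("create", start, 0)]
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if node in closed:
            continue
        closed.add(node)
        trace.append(("close", node, cost))
        if node == goal:                   # reconstruct plan by walking parents back
            plan = []
            while node is not None:
                plan.append(node)
                node = parent[node]
            return plan[::-1], trace
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if not (0 <= nx < SIZE and 0 <= ny < SIZE) or nxt in WALLS:
                continue
            if nxt not in g or cost + 1 < g[nxt]:
                g[nxt] = cost + 1
                parent[nxt] = node
                trace.append(("create", nxt, cost + 1))
                heapq.heappush(frontier, (cost + 1 + manhattan(nxt, goal), cost + 1, nxt))
    return None, trace
```

The returned `trace` list is exactly the object the paper serializes into tokens: the same start/goal can yield traces of different lengths depending on expansion order, while the returned `plan` stays optimal.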

Segment 6 (25:00 - 30:00)

star will do is: it will explore, and when it chooses what to explore, it always considers two things. One: what's the distance from the start? For example, this node right here only has a distance of one, whereas this node right here has a distance of three. In that sense I would rather explore the first one, because if I find a path to the goal, that node costs me less than the other one. If for some reason these nodes turn out to be the goal, or next to the goal — I don't know how to get there yet — I would rather reach the goal at a cost of one rather than three. Now, on the other hand, what A* does, it says: yes, but what about this node up here? Clearly that also has a cost of one. And you, as a reasonable person, would say: no, I should probably consider this other node instead. Why? Because it just seems closer to the goal. And that's the second thing A* considers, what's called a heuristic. The heuristic, very often in these spatial planning tasks, is just the distance to the goal — as the bird flies, or the Manhattan distance, or the L2 distance — simply: what's the distance to the goal? Because if that becomes smaller, that's a good sign. Now, this could be misleading, because upon reaching the middle you could notice that there is a big fat wall here and you were actually misled, and then you have to somehow go around it, but in general it's a good idea. So we call these heuristics, and one special property is what's called an admissible heuristic: one that never overestimates the distance to the goal. Something like the L2 distance or the Manhattan distance in these maze problems has this property — there's literally no shorter way to get from one point to the other than the L2 distance, if these are spatial search problems. As long as the heuristic never overestimates the distance to the goal, the way A* mixes these two numbers — distance from the start plus heuristic distance to the goal — guarantees that the result is an optimal plan. Okay, so you can see an execution trace right here, but as I said, this one is actually wrong — I don't think it considers this piece of wall right here, if I recall correctly — so we will not look at it. The last thing is: if we are at any given place in A*, and let's say the goal is here, and we could expand either of these nodes — both are exactly equivalent, both have the same cost from the start and the same heuristic distance to the goal — then we have to break ties. In order to explore the children of a node, we sometimes have to do tie-breaks; if we consider the order in which we explore things, we have to order them in some way. You can do this deterministically — you always get the same result — or you can do it nondeterministically. Now, in this case it doesn't make any difference, but if you think there might actually be a wall here which you don't yet know about, then the order matters. The random choice of which of these you expand — they're exactly the same to the A* algorithm — will lead either to a longer execution trace, because you have to backtrack and do something else, or a shorter execution trace. So that's one crucial piece of this paper. They will execute A* and then train two different models. One they call solution-only sequences, and that is literally just the prompt — the input, how the world looks, encoded in some tokens — and then the output, which is the plan. They produce that data set synthetically by creating different instances of problems — different mazes, different Sokoban puzzles — and then running A* in order to come up with
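The two quantities A* mixes, and the tie-breaking choice just described, fit in a few lines. This is an illustrative sketch with made-up function names, not the paper's implementation:

```python
import random

def manhattan(a, b):
    """Admissible for 4-connected grids: it never overestimates the true path length."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def priority(g_cost, node, goal, deterministic=True):
    """A* expands the frontier node with the lowest f(n) = g(n) + h(n), where
    g(n) is the exact cost from the start and h(n) the heuristic estimate to
    the goal. Ties in f can be broken deterministically or at random; the
    tie-break only changes the execution trace, never the plan's optimality."""
    f = g_cost + manhattan(node, goal)
    tiebreak = node if deterministic else random.random()
    return (f, tiebreak)
```

The nondeterministic tie-break is the source of variance the paper later exploits: same `f` values, different expansion orders, shorter or longer traces.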

Segment 7 (30:00 - 35:00)

an optimal plan, and then they just put in the plan. In the other case they do the same, but, as I said before, in between prompt and plan they put this trace. It's important to consider: this here will be the input to the language model, and the language model will be trained to produce both of these right here. So the goal is to teach it how to think about the planning problem and only then output the plan at the end. The measurement is going to be twofold. First: is the plan even correct, and is it optimal? Correct meaning: does it reach the goal without running into a wall? And only then are we interested in how long the trace is — did we plan in an efficient manner? So we can train that, and what they do is train an encoder-decoder T5 architecture. These execution traces can get really long, so they train a model from scratch — a T5. This is not a super big model; it's a relatively modest model size, but because they have such long sequence lengths, it still costs them a lot of GPU time. And you can see they use rotary position embeddings, which are good at dynamic-length and really long sequences. What that does — we can maybe jump ahead to the results a little bit — is, as you can see right here: the search-augmented models, even the small ones, outperform the solution-only models. There's a big solution-only model, and you can see it needs a lot of training examples to barely reach the same performance as the search-augmented models, and this quite clearly shows — so here we have "correct reason"? No, that doesn't make any sense — "correctly solved test tasks", that makes a lot of sense. "Deterministic" and "nondeterministic" refers to whether we shuffle the order in which things are done, without loss of generality. In both cases you can see that the search-augmented ones — where we teach the language model how to think about the planning problem — need a lot fewer training examples, even though the sequences are longer. The plan being further down the sequence could make you think, well, there's more noise, so maybe it even gets worse — no, it gets better. And the solution-only models, the ones that go directly from input to plan, need a lot more training examples to reach the same performance. So that means teaching language models how to think about planning problems makes them better at doing planning, whereas if you don't explicitly teach them how to do that, they need a lot of data to figure it out themselves — and in other experiments it seems they don't actually reach the same performance at all. That's really interesting, I think. And you can see that even the small models really outperform the solution-only models with few training examples. Now, there's a lot of detail here — their optimality criterion is one out of 64, so they sample 64 times and then take the best of that, and so on — I don't want to go much into that, other than saying: okay, they can now train something that mimics the execution trace of A*. Now, what do they do from there? They say, let's move beyond that: we implement a method to shift the distribution with which the decoder generates execution traces. So first they train a model, like they just did, on the nondeterministic A* implementation — nondeterministic is important, because that introduces some
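The two training-sequence formats just contrasted — solution-only (prompt, plan) versus search-augmented (prompt, trace, plan) — can be sketched as one serialization function. The token names here are made up for illustration; the paper's appendix defines its own vocabulary:

```python
def tokenize(wall_cells, trace, plan, with_trace=True):
    """Flatten a problem description, an optional search trace, and a plan
    into one flat token sequence (illustrative vocabulary, not the paper's)."""
    tokens = ["bos"]
    for x, y in wall_cells:                    # the prompt: how the world looks
        tokens += ["wall", str(x), str(y)]
    if with_trace:                             # search-augmented: include the trace
        for event, (x, y), cost in trace:
            tokens += [event, str(x), str(y), f"c{cost}"]
    tokens.append("plan")                      # then the plan itself
    for x, y in plan:
        tokens += [str(x), str(y)]
    tokens.append("eos")
    return tokens
```

With `with_trace=False` you get the solution-only format; with `with_trace=True` the (much longer) search-augmented format that the video says nevertheless trains far more sample-efficiently.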

Segment 8 (35:00 - 40:00)

variance into the mix. The same input problem can actually have different execution traces that reach the same, or an equivalently cheap, plan — there can be many optimal plans — but even if it reaches a different plan, the execution trace in the middle is going to be different, because we break ties randomly, we order children for exploration randomly, and so on. They say this merely changes the order in which the different nodes are searched, while still respecting A*'s heuristic and cost calculations, so it induces additional variance into the length of each execution trace. The resulting search-augmented model will then approximate the probability distribution with which the training sequences were generated. Once they have that, they use that model to generate execution traces and plans. So imagine this: you have the original data set — input, trace, plan — and that trace has a certain length. Now what they do is take the same inputs and generate a trace and a plan. They check if the plan is optimal — they can do that by comparing it to the cost of the A* plan; if it has the same cost, it is optimal, because A* is guaranteed to find an optimal plan — and if the trace is shorter than the original one, they replace that training sample with the new one in the training set. Once they've done all of that, they train a new model on the new training set, which by definition has shorter execution traces for the same input problems, with all plans optimal by construction. Now the question is: does this new model actually also reach optimal plans with shorter execution traces? And the answer is yes, and that's pretty cool. Now, here is where I have a bit of trouble with this paper, namely how to interpret this. What they say at various points in the paper is something like: this Searchformer model no longer imitates A* search and has instead discovered a new way of solving planning problems using fewer search steps. And there is where I have a bit of a problem. There are two things that can happen here. One: obviously, if you do this and you actually discover a shorter trace for an optimal plan, it could totally be that the model has somehow found a way to produce a shorter trace, because it's unconstrained — it doesn't need to follow any algorithm, it just outputs tokens. You sample some tokens, and it can output whatever, as long as it produces the correct plan. And if it produces the correct plan, and does so better than if you left out the trace altogether — which we did in the earlier experiment — that means the trace is actually helpful, but the trace is not the A* trace from up here; the trace is just something it output. That's where the authors base their claim: oh, it has found a better way, it deviates from what A* would do. I don't know if that's correct. What is entirely possible too — and again, on Saturday we talked at length about this, and I find it much more plausible — is this: didn't you already tell us that these execution traces have different lengths depending on the order of execution, and that we trained with this variance included? So isn't it just possible that all of these shorter execution traces are still completely valid A* traces — just the shorter ones? Because what we do is sample, and if it's shorter, we include it in the data set, so we have a selection bias towards the shorter traces. It's still extremely possible that these are all valid A* traces, but just, in a sort of nondeterministic-Turing-machine style, the traces that with rand— with the

Segment 9 (40:00 - 44:00)

random tie-breaking, would have led to a shorter total execution length. I find that much more probable than the model deviating from A* altogether. What it would mean is that the model has learned to look at the problem globally and break these ties in a non-random way, so that it knows how to break ties in order to get shorter execution traces, which is nothing other than learning a better heuristic for A*. Honestly: if you are deciding which of two nodes to expand and there is a wall right here, then sure, with a heuristic of L2 distance to the goal you don't know which node to expand; but if you consider the wall, you do. That's just a better heuristic. So I think what happened here is that the model has learned to break these ties in a more optimal fashion, still producing totally valid A* execution traces, except that A* doesn't have that global information available and the Transformer does, and that is equivalent to just teaching it a better heuristic. But this is never explored in the paper: whether the resulting traces are actually valid A* traces, how much they deviate from A*, whether the traces really look like the model has learned something genuinely new. That is not explored, and that is my criticism of the paper.

Other than that, it's pretty cool. They show that if you repeat this bootstrapping step, where training samples are replaced by shorter ones, you can keep decreasing the length of these execution traces, and all of that is investigated well. Also interesting: when they train models with search augmentation, they do in fact get better solutions, shorter solutions, and more often an optimal solution. If you train a search-augmented model, or one of these reduced-length models, you more often get a valid solution at all, which is also interesting and leaves room for interpretation.

That's essentially what I wanted to say about this paper. I don't want to go too much into the experiments, because I think they're well encapsulated by what I said, and you're super welcome to read the paper itself; it's definitely interesting and really cool work. It definitely shows that if you teach a Transformer model how to think about a planning problem, it will be much more capable of performing planning than if you tell it nothing, or just say "think step by step" or something like this. And given that big LLMs are trained on internet data, the fact that "think step by step" even works means there was some kind of step-by-step thinking in the training data, which is further evidence for this: if we actually train these thinking steps, the models get better at them, and they also generalize, in this case within the same problem domain, but they do generalize. I think that's the major contribution of this paper, and that is pretty cool. The part about going beyond A* and doing fewer search steps, that's a bit sus. That's it, thank you for listening, stay hydrated, and bye-bye.
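The expert-iteration filtering step described above (sample a trace and plan, keep the candidate only if the plan is still optimal and the trace is shorter) can be sketched roughly like this. `sample_fn`, `plan_cost`, and the data layout are hypothetical stand-ins for the paper's Transformer sampler and tokenized sequences, not the actual implementation:

```python
def plan_cost(plan):
    # Toy cost model: one unit per action. The paper compares against the
    # cost of the A*-derived plan; unit step costs are an assumption here.
    return len(plan)

def bootstrap_step(dataset, sample_fn):
    """One round of search-dynamics bootstrapping (a rough sketch).

    `dataset` maps a task id to its current (trace, plan) pair, and
    `sample_fn(task)` returns a candidate (trace, plan) sampled from the
    current model; both are hypothetical stand-ins for the paper's
    token sequences and Transformer sampler.
    """
    new_dataset = {}
    for task, (trace, plan) in dataset.items():
        cand_trace, cand_plan = sample_fn(task)
        # Accept the candidate only if its plan is still optimal (same
        # cost as the original A*-derived plan) and its trace is shorter.
        if (plan_cost(cand_plan) == plan_cost(plan)
                and len(cand_trace) < len(trace)):
            new_dataset[task] = (cand_trace, cand_plan)
        else:
            new_dataset[task] = (trace, plan)
    return new_dataset
```

Note the selection bias the critique points at: the filter only ever keeps shorter traces, so it would drift toward short traces regardless of whether they are novel search strategies or just lucky A* tie-break orderings.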
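The tie-breaking variance at the heart of the critique is easy to reproduce: a toy A* on an open grid with a Manhattan-distance heuristic, where ties in f are broken by a random key, returns an optimal-cost path every time but can expand a different number of nodes from run to run. This is an illustrative sketch under those assumptions, not the paper's maze or Sokoban setup:

```python
import heapq
import random

def astar_expansions(grid, start, goal, seed):
    """A* on a 4-connected grid with a Manhattan heuristic.

    Ties in f are broken by a random key drawn from `seed`, so different
    seeds can expand different numbers of nodes while every run still
    finds an optimal-cost path. Returns (optimal cost, nodes expanded).
    """
    rng = random.Random(seed)
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(h(start), rng.random(), 0, start)]  # (f, tie_key, g, node)
    best_g = {start: 0}
    expanded = 0
    while frontier:
        _, _, g, node = heapq.heappop(frontier)
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry, already reached more cheaply
        expanded += 1
        if node == goal:
            return g, expanded
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0):
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    heapq.heappush(
                        frontier, (ng + h(nxt), rng.random(), ng, nxt))
    return None, expanded
```

On an open 4x4 grid from corner to corner, every node ties at f = 6, so the random key alone decides the expansion order; the cost is always 6, but the expansion count varies between 7 (a straight shot) and 16 (the whole grid), which is exactly the "shorter but still valid A* trace" effect discussed above.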
