Stanford Robotics Seminar ENGR319 | Spring 2026 | Ingredients for Long-Horizon Robot Autonomy

1:05:45

Stanford Online · 30.04.2026 · 1,195 views · 44 likes

Video description

For more information about Stanford’s graduate programs, visit: https://online.stanford.edu/graduate-education April 17, 2026 This seminar covers: • Developing the ingredients for long-horizon robot autonomy, which are giving robot policies a sense of memory, and training generalist behaviors that are both broadly capable and high-performing • Recent progress on both fronts through two works: π0.6-MEM and π0.7 • What's still missing on the path to long-horizon physical autonomy To follow along with the seminar schedule, visit: https://stanfordasl.github.io/robotics_seminar/ Karl Pertsch is a member of technical staff at Physical Intelligence, where he works on training robot foundation models.

Contents (14 segments)

Segment 1 (00:00 - 05:00)

Yeah, excited to be here. Uh I wasn't Stanford. I did live in Berkeley at the time, so it doesn't maybe quite count, but excited to be back on campus. Um yeah, so I work at Physical Intelligence. Uh it's a startup in San Francisco. Um the goal of the startup is to build robots that can do a lot of everyday tasks uh in the home and industry environments. Um and so in today's talk, I want to give a little bit of an update on some very recent work that we have done um on trying to give our robots the right ingredients to allow them to do long horizon tasks. Okay, but so kind of to just get everybody on the same page where we are in the field of robotics. Um at this point in time, we're actually pretty good at teaching robots to do pretty complicated tasks. So here are a few examples from the last few months uh of tasks that we have taught our robots to do. And you can see that these are like pretty dexterous things, right? Like unlocking that lock is not a very easy task, maybe even for a human. Um and you need to be very precise. You need to do reorientation of objects. Uh and you need to um really control precisely how you manipulate these objects. Um the one thing you will notice that a lot of these tasks that we can do today are really you know, they're not really jobs, they're tasks, right? So they're very short horizon. Um they they don't really constitute things that a human would like, you know, ask their coworker to do, right? You wouldn't ask somebody to just unlock a lock, you would ask them to, you know, I don't know, build a certain workpiece. And one of the individual steps in that very long horizon goal of yours may be to unlock, you know, this toolbox so that you can get a tool out of it. Okay, so kind of the point I want to make is that we have gotten pretty good at doing kind of very short horizon, but very dexterous things. But we haven't maybe gotten as good at doing the long horizon things. 
But, arguably, to make these systems really helpful, you want them to do long horizon things. One thing that we've gotten quite used to over the last few months is agents that can do long horizon things. And what makes them feel kind of really magical is the long horizon autonomy aspect, right? Like I can give Claude or any of the other systems a task and it can go on for hours without me touching it or looking at it, and I have some faith at least that it will have done something by the time I come back. Okay, and so the vision that we're trying to work towards is to have something like that, but for the physical world, where the job you give to the robot is not to pick up an object, but something real, like clean my apartment or do my groceries. And it's not just in home environments. It can be things like assemble this server rack, right? Which is an industrial task, but it's also very long horizon. Or, you know, I'm missing this tool in my workbench and I want you to order the tool, receive it, check it for quality, and then put it into the right place. Right? These are real tasks. And so what the talk today is about is how we can enable robots to do long horizon physical autonomy, and not just in one very specific lane: we want these systems to be flexible and adaptable so that if I change the layout of my workbench or if I change something in my apartment, I don't need to retrain my policy from scratch to handle that diversity. Okay, this is the goal. Now, what ingredients do we need? There are a few very fundamental things that today's robot systems are lacking. A very basic thing is to be able to keep track of what things you have already done. So basically very primitive memory, right?
These robot systems, if they are supposed to do very long horizon tasks that go beyond like a minute or so, they need to be able to keep track of what they have already achieved. The other thing that's really important is that these skills, like the individual primitive things that these robots are doing, are high enough performance and robustness that you can actually string them together into very, very long tasks. Right? So if you imagine you have an individual skill like picking up an object that has a 50% success rate, um if you want to do this for a few hours and you still want to have some chance of actually succeeding, you need a much, much higher success rate and performance for these individual skills. And also you need them to generalize, right? This is maybe a little bit more of a nuanced point, but as you make the horizon of your tasks longer, the burden on generalization also becomes higher. And because the chance that you have actually seen this exact 20 or 30 minute episode in your data decreases the longer and longer you make your task, right? And so naturally, as the system interacts with the world, kind of the entropy increases, and so your system needs to generalize more in order to be able to do really long tasks. Okay, and historically, all of these things have been really, really hard uh for robots to achieve. And so in today's talk, I kind of want to cover two works that we have done over the last few months to try to address these points. It's one work about memory, and then one work that was actually released

Segment 2 (05:00 - 10:00)

yesterday, so these slides are very hot off the press, and please be kind if they don't look perfectly polished. That paper is called π0.7, and it tries to target generalization and performance. All right. So let's get started. In the first part of the talk, I want to talk about memory. There's a paper called MEM, or Multi-Scale Embodied Memory, and I do want to give a big shout-out to [name inaudible], who has been extremely instrumental in getting this paper published during his internship at PI. But it was a big team effort, as you can see. And so the very basic thing that we wanted to solve here is to give robots memory. Okay, and there are many, many reasons why a robot may want to have memory. There are very obvious ones: if you want a robot to do a long task like taking out the ingredients for a recipe, it needs to remember which ingredients it has already taken out. But there are also more subtle ways in which a robot needs memory. For example, in this bottom right video here, it's a task of unpacking groceries from a grocery bag, right? So I'm loading the grocery bag and asking the robot to unpack it. Now, it's not an extremely long horizon task, right? It takes maybe a minute to get these things out of the grocery bag. But the subtlety here is that this is a partial observability problem. Basically, as soon as the robot takes its gripper out of the grocery bag and its wrist camera cannot see inside anymore, the robot doesn't know what's inside the bag, right? If it doesn't have memory, it does not remember whether or not there is another object inside. Okay, and so that means if you have a policy without memory, what it would do is go inside, see there's no object in there, remove its gripper, but at that point it has already forgotten whether there was an object inside or not, right? So it would go back in and try again, and so you get into these endless loops.
Okay, so there are many reasons why we want our robots to have memory. Now, if you look at what models today look like, they actually don't have memory. These are maybe the very default architectures you would go to if you want to build a robotics model today. I don't want to go into all the details of this architecture, but what you see is that they are typically conditioned on a single step in time, the current time step. Right? And they're not conditioned on any form of what happened previously. Again, as I mentioned, because of that, we do see some very funny failure modes in these models that are a little bit unintuitive. For example, one classic case is you're trying to wash a plate. If your robot doesn't have memory, it will endlessly wash a plate, right? Because it does not understand how long it has been washing the plate. The state of the plate doesn't really change visually, right? So the robot will just keep washing it, and you can watch it for hours just washing plates. Similarly, if you want to cook something, like here we cook a grilled cheese sandwich, if you don't understand the notion of time and you don't have memory, you may let it cook for way, way too long, and you're not very happy if that happens to you. Okay. So memory is fundamental. Yeah, not a great sandwich. Memory is fundamental, but none of the current robot systems have it, right? This is a bit weird, right? Naturally, you would expect robot systems to have memory because clearly it's useful. So what is the problem? So if you take one of those standard architectures, so there's the π0.5 model, one of our earlier models, and you say, "Okay, I want to do the most naive thing. I just want to add more observations, right? Add in what happened in the history." It's not actually very hard, right? This is a sequence model under the hood, it's a language model. So you can essentially just feed more images into the language model.
It's a sequence model, right? So like, you know, putting in a longer paragraph of text, you can just add in more images. Technically not hard, but there are two problems. On the one hand, if you add more images, your latency goes up and your computational cost of training this model, right? Like anybody working on LLMs is very familiar. If you increase the context length of your model, kind of everything gets a bit harder and more expensive. Okay, in language models, this is kind of okay, like if you wait for a few seconds for your Claude instance to answer, that's fine. In robotics, this is not okay, right? We're trying to do closed-loop reactive control, and so what we find is as we add more and more observations into our robot policies, the latency of those policies goes up very, very quickly into regions where we cannot really afford that kind of latency. Right? Like we cannot wait for a few seconds for the robot to decide how to move its arm. Um so that is one problem. Second problem is a little more subtle. It's about distribution shifts. So basically, as you add more observations, more information into your policy, it kind of becomes more aware of its own shortcomings. Right? So if you have a policy without history, it's blissfully ignorant. It just looks at a state and tries to solve it. If you have a policy with history, it sees all the different ways in which it has recently messed up. Right? Which is not typically in its training data, because your training data for these policies is human demonstrations, right? They're perfect. They look great. And humans don't mess up that often, right? But if you have a robot policy with history, it will see its own mistakes, and it will get confused. And so as a result, what we often see is that

Segment 3 (10:00 - 15:00)

when we add memory to robot policies, they become slow and they work a lot worse. Okay? And this is why basically, for all of these reasons, people typically don't do memory. Okay, but we need memory, so we need to figure out how to solve these problems. Okay, and the major idea that we had in this paper is to try and apply compression to solve both of these problems. And I will try to explain a little bit how. So, if you look at like very roughly, to inspire ourselves at how humans handle memory, you could argue that there are different types of memory, right? We have kind of a more short horizon memory, um and that is usually quite a bit of detail, um because we want to use it to kind of very, in some sense, even subconsciously adapt how we do certain manipulation tasks. Right? Like when we wash a window, we don't think very actively about where we have already washed the window, but subconsciously, we do know to move on eventually and not keep washing the same plate uh the same place on the window over and over. Or similarly, when you try to like insert a portafilter into a coffee machine and you're struggling a little bit the first time, the second time you do it, you kind of use your memory of how you just struggled with it to adapt your strategy. Okay, so we have this short-term memory that is very dense, very detail-rich, right? Like we remember the exact ways in which we have just manipulated objects. We also have long-term memory. That's quite a bit different, right? Like if you do your grocery shopping, you remember all the objects you have already put into your shopping cart. Uh and so you kind of tick them off in your mind and you have this much more semantic understanding of memory, right? You kind of remember the steps at a semantic level that you have already taken. Similar if you kind of try to clean up your apartment. Okay, so we have these two different types of memory and they're a little bit different in what they try to remember. 
The short horizon memory is very dense; it really remembers a lot of the details of what happened. The long horizon memory is much more sparse and at a much higher semantic level. Okay, and so the idea that we had in this first work is that we can leverage this fact, that we need to remember very different things on very different time horizons, to use different modalities to represent those memory structures. So for short horizon memory, where we do need a lot of detail, a lot of density in the information, we propose to use rich information like actual observations, images that go into our robot, but we just compress them and keep them around for a short horizon of time. And then for the long horizon memory that's supposed to span minutes or tens of minutes, we propose to use language, still with compression, but in a much, much more abstracted space, right? You can imagine that in language you can keep track over much, much longer time horizons than you can in image space. Okay, and so I will briefly go into the details of how we actually implement those two parts before I show some results of what they enable us to do. Okay, image space memory. So, we have discussed that just throwing all of your images into your backbone is not a good idea, right? You will have a lot of trouble actually doing that calculation. Now, we can do a bit of back-of-the-envelope calculation of how bad that is. Imagine we want to do maybe 10 seconds of history, right? We typically run our robots at something like 50 hertz control frequency, and a bunch of camera streams go in, maybe four camera streams as an upper bound. And in a modern vision language model, your image encoder will produce something like 256 tokens per image. That's a very standard number. So, if you do that very simple math, 10 seconds times 50 hertz times 4 cameras times 256 tokens, you get something like 512,000 tokens that you need to put into the backbone of your policy.
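The back-of-the-envelope count can be reproduced in a few lines. This is only a sketch of the arithmetic quoted in the talk (10 seconds, 50 Hz or 1 Hz, four cameras, 256 tokens per image); the function and variable names are made up for illustration.

```python
def history_tokens(frames: int, cameras: int, tokens_per_image: int) -> int:
    """Image tokens fed to the policy backbone for a window of observation frames."""
    return frames * cameras * tokens_per_image


CAMERAS = 4             # upper bound on camera streams mentioned in the talk
TOKENS_PER_IMAGE = 256  # typical image-encoder output size mentioned in the talk
SECONDS = 10            # length of the history window

# Every 50 Hz control step kept as a frame.
full_rate = history_tokens(SECONDS * 50, CAMERAS, TOKENS_PER_IMAGE)
# History subsampled to one frame per second.
subsampled = history_tokens(SECONDS * 1, CAMERAS, TOKENS_PER_IMAGE)
# A no-memory policy only sees the current frame from each camera.
no_memory = history_tokens(1, CAMERAS, TOKENS_PER_IMAGE)

print(full_rate)                # 512000
print(subsampled)               # 10240
print(subsampled // no_memory)  # 10
```

Even the subsampled history costs ten times the tokens of the no-memory input, which is the gap the compression scheme in the talk is meant to close.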
And so that doesn't make anybody happy. This is the face of Brian, who pays for compute, and the face of Quan, who worries about inference speed. So, both of them are very unhappy because this is way too many tokens. Now, you could say, "Well, that's a little bit of a straw man, right? You don't actually need to feed 50 hertz video into your policy, you can just do 1 hertz." So, you'll see one frame every second for 10 seconds. You still end up with about 10,000 tokens, and for reference, that's still about 10 times more than what you would usually do in a no-memory policy. Okay? So, we still have increased our training cost by 10-fold at least, and our inference latency is still significantly higher than what it used to be. So, still no good. So, the solution that we propose here is that somewhere in your model, very early on, you force the model to really compress the visual observation. Okay, so we designed this video encoder architecture that still looks quite similar to a standard vision tokenizer, a ViT. But okay, let me back up slightly. In a standard ViT, the way it works is that you take your image, you cut it into patches, essentially rasterize your image, and then each of those patches gets embedded into an embedding, and then you do some self-attention across all of those patches. Okay? Now, the problem is that if you were to do this over tens of thousands of tokens, that would still be a very expensive operation. And so what we propose is that we sparsify that attention operation. As in, for most of the images and most of the layers in your ViT, you only

Segment 4 (15:00 - 20:00)

do attention within your current time step. And then every once in a while, every couple of layers, you allow a token to attend to all the other tokens in that same position in the image over the course of the history of your observation memory. Okay? So, that's basically mostly spatial attention and then a little bit of temporal attention. And so in this way, the model is forced to aggregate that historical observation over time. And then in the very last layer of our ViT, that's the yellow part here, we essentially just drop all the other tokens. We only keep around the tokens for the current time step. Okay? And with this architecture, we force the model to compress all of the relevant information from the history into the current time step's output. And there are two benefits to it. One, you end up with exactly as many tokens as you usually get in a no-memory policy. And second, because we haven't introduced any new weights, these are all just modifications to the attention structure, we can still initialize this model from standard pre-trained ViT weights. Okay, so this makes it very easy. Basically, we don't need to change anything about our pre-trained model, we just fiddle with the attention weights, and we get an architecture that gives us a lot of compression. Okay, so the core elements here are that we have sparse temporal attention, and then we have token reduction at the very end of our ViT. All right, what that gives us in practice is much better scaling over memory length. So here again, in green, we have our standard naive approach of feeding everything into the backbone. It really blows up the inference latency. In yellow, we have what we get if we apply this more compressed architecture. And you can see that over 12, 16 time steps of observation-based memory, we can still get reasonable inference speeds.
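The attention pattern described here can be illustrated with a toy NumPy sketch. It is not the MEM implementation, and several things are assumptions made only for illustration: there are no learned projections, the temporal layers attend only along the time axis and are not causal, and every name and default (`temporal_every`, `num_layers`, the toy sizes) is invented. It shows the two ideas from the talk: most layers attend within a time step, occasional layers attend across time at the same patch position, and only the current time step's tokens are kept at the end.

```python
import numpy as np


def spatial_mask(T: int, P: int) -> np.ndarray:
    """Token i may attend to token j only if both belong to the same time step."""
    t = np.repeat(np.arange(T), P)  # time index of each flattened token
    return t[:, None] == t[None, :]


def temporal_mask(T: int, P: int) -> np.ndarray:
    """Token i may attend to token j only if both sit at the same patch position."""
    p = np.tile(np.arange(P), T)    # patch index of each flattened token
    return p[:, None] == p[None, :]


def masked_self_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Single-head attention without learned projections (toy stand-in)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


def encode(x: np.ndarray, T: int, P: int, num_layers: int = 4, temporal_every: int = 2) -> np.ndarray:
    """Mostly spatial layers, a temporal layer every `temporal_every` layers, then token dropping."""
    for layer in range(num_layers):
        is_temporal = (layer + 1) % temporal_every == 0
        mask = temporal_mask(T, P) if is_temporal else spatial_mask(T, P)
        x = x + masked_self_attention(x, mask)  # residual connection
    return x[-P:]  # keep only the tokens of the current (last) time step


rng = np.random.default_rng(0)
T, P, D = 4, 6, 8  # toy sizes: time steps, patches per image, embedding width
tokens = rng.normal(size=(T * P, D))
out = encode(tokens, T, P)
print(out.shape)  # (6, 8): same token count as a single image
```

The output has as many tokens as one image regardless of `T`, which matches the first benefit named in the talk. Because the sketch only changes masks, it also mirrors the second benefit: no new parameters are introduced.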
So, this red line here may seem a little arbitrary. We just say, "Oh, 300 milliseconds is the boundary that we don't want to go over." In practice, this is something that we found empirically: if your policy takes more than 300 milliseconds, it does start to deteriorate the performance of your robot. Okay, so it may look a little arbitrary, but as long as we're below that red line, we're kind of okay. Okay, so what this gives us is that with this dense visual memory, we can now span maybe 15 time steps. It's still no solution for doing tens of minutes of memory, but it is a solution for short horizon memory. Now, for long horizon memory, as mentioned earlier, we want to use language because language allows us to represent things in a much, much more abstracted way. Okay, so we can keep around little bits of information, but for much, much longer time horizons. And luckily, when we look at architectures like this one for π0.5, which is one of our older models, we can see that there's already something in the model that deals with language, which we call the high-level policy. It is essentially a part of the model that takes the user's instruction and breaks it into a shorter horizon subtask, right? So, the instruction may be "clean the bedroom", and then the short horizon subtask is "pick up the pillow" in this case. Okay, so we already have a part of our model that deals with language. And so now the idea is that you can take that bit of our model and actually train it to keep track of memory in language space. The naive way to do it would be to just keep chaining past language commands into the input of your policy. Right? So, this is the naive version of memory if you applied it in the language space.
And this is no problem from a latency perspective because there are not very many tokens in these language instructions, right? It's just like a few tokens per instruction. The problem here is distribution shift. So, basically, if you just naively kind of append your old instructions to your language-based memory, um over time, your policy will see shifts in the distribution between your training data and its inference time instructions. And the reason here is that typically, when we deploy these robots, they tend to fail a lot more than what they saw in the training data. Uh and so in language space, they will essentially see the same command over and over for much longer periods of time. Uh because the robot keeps failing at picking up this object, which in the training data never happened, but at inference time, the high-level policy will just keep saying to pick up this object, pick up this object. And so then the policy will essentially get confused. Like, "Why do you keep telling me to pick up this object for such a long time? " Okay? So, doesn't work well for performance, and what we have found, and also prior works, including one work from Chelsea's lab, have found is that when you do this naive type of language memory, you do exactly go out of distribution quickly, and then you get much worse performance. So, what we proposed instead is to actually teach our high-level policy to do compression. Right? So, instead of naively chaining all the language instructions into the input of that policy, we can actually teach the policy to output a compressed representation of

Segment 5 (20:00 - 25:00)

what it has already done in this episode. Okay, so this is literally the policy summarizing in natural language text what has so far happened in the in the rollout that the robot is in. Okay, and then we feed the last output prediction back into the input and ask the policy to update. Okay, and so over time basically the policy will keep appending to its language based memory, but importantly because it compresses, right, it doesn't always update, uh it only updates if something actually happened. Uh this input representation has much less distribution shift. Uh because even if the robot keeps failing on the same object over and over, it will simply not update the memory. It will simply say, "Okay, I haven't still picked it up, so nothing happened. " Okay, so to summarize, a core part of getting the language based memory to work is that you need to teach the model to compress this language memory and keep its own keep updating its own internal representation of this memory. Okay, and in doing that you get less distribution shift and so the policy works better. All right, I want to show you guys some videos of what this actually enables us to do. Um so here you see a robot that tries to set up the ingredients for a recipe. And at the bottom you can actually see that robot's memory in language space. So you can actually see the robot keeping track uh of the different things that it has already done and its plan um to prepare the ingredients for I think mashed potatoes in this case. And you can see that the robot is running around the kitchen and it's kind of uh trying to assemble all the different ingredients. Uh this is an unseen kitchen I should mention. Um Marcel has spent a lot of time um getting these policies to run in unseen kitchens. Um but now the robot can actually go around and can find the ingredients. Uh okay, we tell the robot where the ingredients are, but it can go and get those ingredients. 
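The difference between naively chaining subtask commands and the compressed memory update described in this segment can be made concrete with a small sketch. In the talk the summary is produced by the learned high-level policy itself; the rule-based `compressed_memory` function below is only a hypothetical stand-in for that behavior, and the rollout (including the "put the pillow on the bed" command and the `succeeded` flags) is invented for illustration.

```python
from typing import List, Tuple


def naive_memory(history: List[str], command: str) -> List[str]:
    """Naive variant: append every issued subtask command, repeats included."""
    return history + [command]


def compressed_memory(summary: List[str], command: str, succeeded: bool) -> List[str]:
    """Stand-in for the learned summarizer: only update when something actually happened."""
    if not succeeded:
        return summary  # a failed attempt leaves the memory unchanged
    return summary + [f"done: {command}"]


# Hypothetical rollout: the robot fails three times before picking up the pillow.
rollout: List[Tuple[str, bool]] = [
    ("pick up the pillow", False),
    ("pick up the pillow", False),
    ("pick up the pillow", False),
    ("pick up the pillow", True),
    ("put the pillow on the bed", True),
]

naive: List[str] = []
summary: List[str] = []
for command, succeeded in rollout:
    naive = naive_memory(naive, command)
    summary = compressed_memory(summary, command, succeeded)

print(len(naive))  # 5 entries, four of them the same repeated command
print(summary)     # ['done: pick up the pillow', 'done: put the pillow on the bed']
```

The naive history grows with every retry, so at inference time it looks unlike the mostly failure-free training demonstrations. The compressed summary looks the same whether the pickup took one attempt or four, which is the reduction in distribution shift the talk describes.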
And importantly it can keep track uh of all the different steps that it has already completed. Okay? This doesn't just work for one recipe, it kind of works for a variety of different recipes. All right, just tell the robot what kind of recipe you want to set up and it can go out and can get all those ingredients. Okay? One cool thing here is that these prompts are actually very detailed. Uh so these are the kind of instructions that uh Marcel in this case gave to the robot. And you can see that there's a lot of detail here. And if you contrast that to the typical instructions that we give to robots, they're much shorter. Right, they're typically just like pick up this object. Uh but now you can see there's a lot of detail, there are a lot of steps in these instructions, and it's kind of the first time that we could see policies kind of follow through on such a very complicated and long instructions. Okay, this doesn't just work for uh you know, picking ingredients in the kitchen. Uh we can also get it to clean uh kitchen environments. I will try to speed this up just uh oh, I don't actually know if I can speed it up here. Um this is one of the issues with showing really long horizon tasks in the talk is that you don't really have time for all the long horizon. Um okay. Yeah, I don't think I can actually uh Oh. Okay, this is challenging. All right. Okay, so we can see it can kind of like clean the counter. Uh it can throw away the wipe. It can put objects into a fridge. And then it puts away some of these plates uh and washes some of the plates that are in the sink. All right, this is a little bit of a very fast forward version of this, but you can see the whole video in normal speed uh if you look at our website. Yeah, I did want to show you a few ablations uh to kind of drive home the point of what is important here. Um basically we tested a few different versions of this model. So we tested the version without memory. 
Unsurprisingly, this model doesn't work very well. Essentially, if you don't have memory, you're kind of doomed when trying to solve those tasks. We also tested versions where we only have one of these memory types, so either text-based memory or video-based memory. But we found that if you only have one of those, it doesn't quite work either. The reason is that if you only have short horizon video memory, you lose track of the long horizon task. If you only have long horizon text memory, you struggle with some of these intermediate bits, like trying to wash the plates, for example, and then getting stuck on those plates indefinitely. All right, so both of these types of memory are really important. We also tried a version that we call naive text, which is essentially doing text-based memory but without the compression that I had mentioned. So this is the version where you just literally append language over language. And that we also find not to work very well, for the distribution shift reasons that I had mentioned. And this is the same as was found in those prior works from Chelsea's lab that I had mentioned earlier. Okay, so we can get policies to do these long horizon tasks. We had fun with different tasks. This was one of our crowd favorites: we did this grilled cheese task. So we made a lot of grilled cheese sandwiches over the course of maybe 2 weeks. It always smelled great in the office. We had a lot of grilled cheese to eat. Some of them were burned, but others looked good. And yeah, it's a good task because it really comes down to keeping track of

Segment 6 (25:00 - 30:00)

time. Right, like the robot kind of needs to understand how long has the thing been cooking in the pan to understand when it can turn around uh and then when it can serve it uh onto the plate. Okay, and so you can kind of see the robot here uh waiting its time for the thing to cook, but not waiting too long, otherwise you get uh burnt grilled cheese. All right, there are a few other tasks. I encourage you to look at the website. Um a lot of these tasks are well, half of them are maybe designed to really stress test memory. I would say the other half is kind of the ones where you wouldn't really expect you need memory, but you actually do. Right, so this is the grocery bag example we discussed earlier. Um the uh glass wiping example where the robot basically gets lost if it doesn't have memory and just like keeps wiping the same part of the glass over and over. Okay, so there are many tasks. I think this is one of the key points I want you to take away. There are tasks where you obviously need memory, but there are a lot more tasks where you very subtly need memory. And where if you don't have memory, you fail in very weird ways. Okay, and the reason you typically don't see this is because roboticists are good at picking tasks that are not like this. Right, there's a lot of things you can do to kind of try to craft your strategies in a way that you don't end up in states where you can't observe and where you need memory. Okay, but if we really want to solve open world tasks, we cannot make these kind of shortcuts. Right, we do need to have memory in our systems to solve these problems. Okay, one other cool thing that I want to show is that um beyond partial observability problems, memory can give you even more um ways in which it can improve your policy. Um so this is a very common failure mode that we see in our policies, which is that they fail the exact same way over and over. Uh maybe for those of you who work on robot learning, this is a familiar sight. 
Um so on the left side is one of our favorite tasks, which is to pick up a chopstick. Uh it's actually quite challenging because the chopstick is tiny. Uh and so the robot needs to kind of very precisely determine at what height to pick. Um if it gets it wrong, then it will just like, you know, go wrong. But then what we see often for memoryless policies is that they keep failing over and over the exact same way, which is not very intuitive. Um same for this fridge. This fridge is a little bit challenging because it's a symmetric fridge, so you don't actually know which way it opens. Right, it could open right side or left side. For humans this is no problem. You try one side, you realize that's not the right one, so you go to the other side and open. But for a policy without memory, this is not a thing it can do. Right, like it will try, it will fail, and it will forget, and try again, and fail. Uh and so we very often see robots getting stuck in these infinite loops of failing over and over. Now, if you do have a policy with memory, however, it can actually remember how it failed before. Right, like if it picked up the chopstick a little bit too high and it didn't quite succeed, the next time it can remember this phenomenon and then grasp lower, actually pick up the right height. Right, same for the fridge. Right, like if the robot policy failed a few times on one side of the fridge, it can remember and go over to the other side and open it. Okay, so kind of the key here is that memory doesn't just help you with memory tasks, memory helps you in many ways because it can allow your policies to actually learn algorithms, so to say, from data that allow them to be more robust. Right, like things like in context adaptation can emerge. All right, final result for this paper before we move on here is um we also tested these policies on tasks that don't require memory. Um so this is maybe a little bit less exciting in some way, but it's actually very exciting to me. 
Um so for the longest time we tried to train policies with memory, and the problem we always saw is that they kind of struggle to match even the performance of non-memory policies on tasks that don't require memory. Uh but what we see here is that when we tested this model on a lot of the different tasks that we had previously developed for our no-memory policies, we could actually get this policy to approximately match the performance. Okay, and kind of the key ingredients here, well, there's not really a lot of magic. It's basically as long as your data is diverse enough and you're careful about mitigating certain distribution shifts, you can actually get this policy to work well even in tasks that don't require memory. Okay, and so careful about distribution shifts can mean things like being careful that your inference latency isn't too high, for example. Right, because during data collection, there was no latency. The human just moves at perfect speed. Uh and so if your inference latency is too high, this will be confusing to the model, so performance will be worse. Uh so you need to kind of take measures to make sure that your policies can run fast enough, that you have enough diverse data to really cover all of these different distributions. But then we can actually get them to work well even on tasks where you don't need memory in the first place. Okay, and the reason that this is exciting is because it allows us to essentially make memory a default feature. Right, there is no trade-off. Uh there are a few small trade-offs. Um basically it will cost you a little bit more compute to train, but you know, that's fine. 1.5x is not too bad. You will still be able to run inference on them fast enough. Uh so you don't have issues on the inference side. And then you do not need to make trade-offs in terms of their performance, uh but you can now learn memory tasks, and in-context adaptation. Okay. And so because of all of those

Segment 7 (30:00 - 35:00)

reasons, we can now essentially mainline memory and make it one of the default features of all of our future models, uh which is one of the exciting outcomes, I think, of this project. Okay. I will stop for a short moment to see if there are questions uh before I move on to talk about non-memory topics. Yeah. Yeah, um this is awesome. One thing I was curious about for the grilled cheese, uh is it the memory that lets it like think about how much time has passed? Where in this process is the timeline? Is there timestamps on the memories? Or like all the text and all the videos that I saw, I didn't see any mention of Yeah. So basically, if you have video memory, you can keep track of time in video memory, right? You can see um how time progresses because it kind of fills into your video buffer. Uh so for this example, we actually made sure that the video memory of the model is long enough to actually be able to see the process of grilling cheese. Yeah. Uh the way you can make it longer, right? I said like we can fit 10, maybe 15 frames. The way to extend the video memory horizon is to make the stride larger. So you just like add fewer frames like every couple seconds only. And is this something that the model can do on the fly or is Um so you have to kind of design it ahead of time. Uh I don't think it would be terribly hard like if you trained with some randomization, you could probably get the model to like, you know, use the new stride at inference time. We haven't really prioritized it, but yeah, for now it's hand-tuned. Yeah. And um Do you think about other types of memory representation? For example, find in the floor point. Uh if you do navigation, you can store it as a map this Yeah. Yeah. So I think um in some sense, we prescribed memory representations here, right? And we said, "Okay, short horizon is video, long horizon is language only. " These are not perfect representations, right? 
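The stride idea from the grilled cheese answer above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the class `StridedFrameBuffer`, its method names, and the stride values are hypothetical, and only the budget of roughly 10 frames comes from the talk.

```python
from collections import deque
from typing import Any, Deque, List, Optional, Tuple


class StridedFrameBuffer:
    """Keeps at most `max_frames` frames, spaced at least `stride_s` seconds apart."""

    def __init__(self, max_frames: int = 10, stride_s: float = 1.0) -> None:
        self.max_frames = max_frames
        self.stride_s = stride_s
        # A bounded deque silently drops the oldest frame once it is full.
        self._frames: Deque[Tuple[float, Any]] = deque(maxlen=max_frames)
        self._last_t: Optional[float] = None

    def maybe_add(self, t: float, frame: Any) -> bool:
        """Store the frame only if a full stride has passed since the last stored one."""
        if self._last_t is None or t - self._last_t >= self.stride_s:
            self._frames.append((t, frame))
            self._last_t = t
            return True
        return False

    def horizon_s(self) -> float:
        """Time span covered by a full buffer."""
        return (self.max_frames - 1) * self.stride_s

    def timestamps(self) -> List[float]:
        return [t for t, _ in self._frames]
```

With the frame budget fixed, the covered horizon grows linearly with the stride: 10 frames at a 1-second stride span 9 seconds, while the same 10 frames at a hypothetical 30-second stride span 270 seconds, at the cost of missing anything that happens between stored frames.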
Like as you point out, if you want to do navigation, you want to like remember the geometric layout of the environment, probably language is extremely inefficient for that, right? You would need to like in language describe paragraph long how the room layout looks like. Um so, you know, I think there are trade-offs in how you choose these representations. Um I don't think it would be terribly hard to condition the model on a map, right? Like after all, if you wanted to, you could just take a literal screenshot of your map and put it into the model. It's not a problem. Um We haven't prioritized it because for the kind of tasks we're doing, this is not yet relevant, I would say. Like all of these are still within a fairly constrained, you know, single-room kind of environment. Uh but I think as you expand the scope of the kind of tasks you want to do, this kind of question will become a lot more important. Um You know, it goes beyond maps. Like there are certain things like if you maybe want to remember precisely with how much force you picked up a certain object, and you want to remember that for a long time, um maybe text is again not the best representation for that, right? Because you It's very hard to describe in text how strong you pressed onto an object. Um so I think there's a lot more research to be done in like how can we design more um complete memory representations? And then I don't want to claim that this is the final memory representation, right? Um so yeah, I think it's a very good question. Yeah. Great quality question. I think the in-context adaptation examples are great. Uh I just couldn't really speak to the data that went into that. Um which I think would be nice to appreciate more than Yeah. So basically, for these examples, what we wanted to do is kind of show that in principle, it's possible. And so we very dedicatedly collected data that showed the robot strategies for how to adapt in context. Okay. 
So we basically showed it, "Okay, if you pick up too high, try to go lower and pick it up. " Um Now, you know, this means that it will really only work in the cases that we tested here. But I actually want to kind of emphasize that it should be a property that emerges from large enough data collection, right? Because even humans aren't quite perfect when they collect data. Um right? Like even teleoperators, if you've done it yourself, you know that as you make the tasks harder and harder, people will make mistakes. Uh actually, the whole second part of the talk is about how our data isn't perfect. We need to deal with that. Um but so you will naturally in complex enough data find cases where people recover, where people like regrasp, uh something slips and they try again. And so this kind of behavior of being able to learn from your very short-horizon mistakes should very naturally happen as you scale data collection. Um You know, here we try to kind of overemphasize this effect by explicitly collecting data for it, uh but I would totally expect that over time, this will be natural. But it only happens if your policy has memory, right? If your policy doesn't have memory, it doesn't even have the capacity to learn this algorithm to actually rego and pick up uh lower. Okay. Um Let's do one more question. From a neuroscience perspective, if you set a prior where if it fails, it explores the alternative from like motor learning

Segment 8 (35:00 - 40:00)

is that principle relevant here? So for example, you're trying to open the door from one side, it fails, and it keeps going to the same side. But is it not possible to give it a teleop model of the task which predefines that if you fail on one side, explore the other side? So you know, if you have a policy without memory, it doesn't have a notion of I have failed here, right? It will just not remember that it has failed on this side. Um So my whole point here is that in order for these kind of algorithms to emerge or even for you to be able to put them in, you need to give your system memory, right? Otherwise, it doesn't even remember that it has failed. Um Now, as in whether to try and handcraft this or learn it from data, I think there is like a lot of diverse strategies you may need to follow to recover, right? Like trying the other side may work in this case, but it may not always be the winning strategy. Right? Another kind of favorite example that we see a lot in our data collection is when you try to open a drawer, you don't know how heavy the objects in that drawer are, and so there may be different strategies you need to use to open a drawer based on how heavy the objects inside are, which you cannot see. Right? And so the robot may try to open it from the front, uh slips because the objects are kind of very heavy, and so then it can try to adjust the strategy and kind of reach in from the top if it has already opened it a little bit, because then you have much more mechanical strength when you can kind of pull the drawer from inside. But all of those kind of behaviors are very hard to learn if you don't have memory, because you won't remember that you actually slipped this way. Okay. Great question. Okay, I will move on. I think we have a bit more time for questions afterwards. Um So I hope that I could kind of drive home the point of how we can get these policies to do very long-horizon tasks. 
Um one thing that you can't really see here is how painful it is to get these policies to actually succeed at these very long-horizon tasks. I think Marcelo is the person in this room that knows the best um how much effort goes into this. And I think kind of the yeah, the key ingredients and the key pain points in some sense to get these policies to do very long-horizon tasks are, as I mentioned earlier, their ability to generalize, because if they don't generalize, they can't handle new situations, and the raw performance of these policies, essentially. If they're not good at each individual skill, it will be very, very painful to get them to do very long-horizon tasks. Okay. Now, we do want generalization. We also want performance. In the past, these two things have a little bit been at odds with each other. Uh so we actually had two different stages of training for our policies. We would first do what we call pretraining, where you essentially train them on a very broad data set, uh very diverse data, and you get generalization out of that stage. But then when you wanted these policies to do really, really well on a very hard task, you would do a post-training stage, where you essentially take your very broad data set, and you really narrow it down to only your most high-quality data, and then only fine-tune on that little bit of data, and that would give you the best policy that you can get. Okay. And now, the problem here is that these stages are a little bit exclusive, right? So you could either get broad generalization, or you could get really high performance. But my point is that if you want these models to do long-horizon tasks, you inevitably need both, right? You need them to generalize, and work really well. So why are these two things at odds with each other? Um So this is what I tried to express earlier, that when you collect data at interestingly enough tasks and large enough scale, you will necessarily get a lot of variation in your data. Okay. 
And so this plot is kind of um one way to show this. So this is a It's from a different paper we published, and the method here doesn't matter too much. What I want to emphasize is that if you look at our data, which is a teleop green curve here, you can see that there are very large variations in the speed in which people are able to teleop a certain task, right? This is all the same task. I actually don't know what exactly task this is, but it's very typical for our data to see a lot of spread in the speed in which people can teleoperate these tasks. And now if you do pretraining, you would essentially get what this uh red curve here is, which is um you know, you have a base policy, it has tried to match the full distribution of your data, it has a bit of error in that matching process, so it will accumulate over time. So very naturally, you will kind of get a distribution that's a little bit worse than your teleoperation distribution. Right? So it's a little bit slower than your average teleoperator. But then what we typically do in post-training is that we kind of try to narrow down the data to just our most high-quality data, kind of the fastest of our data, for example, and then fine-tune this red model on this bit of data, and then we typically get something like the yellow curve here. Like basically, a policy that is actually in some sense better than some of our teleoperators at doing the task because it is only fine tuned on the most high quality of our data. Okay, and I want to emphasize this is not only true for speed. Um, there is many factors of variation that you have in your data that introduce this kind of multimodality. Right, it can be about the quality of the motions, how many mistakes were made over the course of the episode, even what kind of subtask

Segment 9 (40:00 - 45:00)

sequences were used, right? Like maybe somebody when they fold a shirt they first pick up with the right hand and then with the left hand and then fold. And maybe the next person does it the exact other way around. Okay, and so when you do pre-training naively, you kind of force your model to learn the full distribution. Right, to model all the different modes of your data. And then as a result, you kind of get this like middling performance after pre-training and then you do post-training to really narrow it in on a certain kind of behavior that you want. But then the problem is you lose a lot of generalization in that post-training process. Okay? And so in the second work that I want to talk about today, we tried to resolve this inherent conflict. We wanted to get a policy that is both generalizable but also has really high performance. Okay, and so this is a paper that is called π0.7. As I mentioned, it was released yesterday. Um, it was a very big team effort, so this is like a full PI kind of paper where everybody tried to help. Uh, but I want to explicitly point out the contributions of some of my colleagues, Ashwin, Allen, Lucy who is a PhD student in Chelsea's lab, uh, Laura, Kevin, and Kyle um, who helped a lot in getting this paper over the finish line. Um, and so as I said, the core thing that we wanted to achieve in this paper is to train one model that is both generalizable and really high performance. All right, how can we do this? So, to remember, we have this distribution and we kind of want to train a model, um, to model this distribution really well. And the core idea that we followed in this paper is that we can actually make this model's task a lot easier if we provide more context. Right, so if you don't have context, you kind of force the model to model the full distribution of all the different behaviors in your data because it doesn't really know which exact one you're trying to have it predict now. 
But what we can do is that we can provide the model with more context about this particular training sample that we try to have it predict. This context can come in many different forms. We can tell it what task it is currently trying to do, maybe more precisely what subtask it's trying to solve. We can give it various forms of metadata, for example, like how fast the current behavior is expected to be or like how high quality that behavior is expected to be. And then because we're in robotics, it doesn't all have to be text. We can actually give this model even more conditioning in the form of for example subgoals. Right, there are certain things that are just very hard to express in text. So, for example, whether I pick up the bottle like this or whether I pick it up like this. Kind of a bit cumbersome to describe in text, but if I just show the model, like, your gripper is here and you're trying to produce actions that lead to this outcome, it now really collapses the distribution of things that it needs to model, right? Because it knows it needs to model this behavior and doesn't need to model that other behavior. Okay, so subgoals can really kind of narrow the distribution during training time that the model has to predict. Why is this beneficial? Well, on the one hand, it makes the training job much easier. Right, like now the model doesn't need to actually model the full distribution. It needs to model a much more conditional slice of that distribution, which means it can typically fit the data much better. It also helps us at inference time because now this is what we call a steerable policy. So, we can now provide at inference time the kind of conditioning to pull out the behaviors that we want. Right, so for example, typically, we want our policies to be fast and we want them to be high quality, right? The behaviors that they produce. 
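The metadata conditioning described here can be sketched as a small labeling step. This is a hedged illustration only: the talk does not specify the prompt format or how the labels are computed, so the `Episode` fields, the thresholds, and the text layout below are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    task: str
    duration_s: float  # how long the teleoperated episode took
    mistakes: int      # hypothetical count of annotated mistakes


def speed_label(ep: Episode, median_duration_s: float) -> str:
    # Hypothetical rule: faster than the median for this task counts as "fast".
    return "fast" if ep.duration_s <= median_duration_s else "slow"


def quality_label(ep: Episode) -> str:
    # Hypothetical rule: any annotated mistake marks the episode as "low" quality.
    return "high" if ep.mistakes == 0 else "low"


def training_prompt(ep: Episode, median_duration_s: float) -> str:
    """At training time, describe each episode as it actually was, good or bad."""
    return (f"task: {ep.task}; speed: {speed_label(ep, median_duration_s)}; "
            f"quality: {quality_label(ep)}")


def inference_prompt(task: str) -> str:
    """At inference time, always ask for the mode we want."""
    return f"task: {task}; speed: fast; quality: high"
```

The design point is the asymmetry: slow or sloppy episodes stay in the training set with honest labels, and the desired mode is selected only when the policy is run.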
So, instead of doing the whole post-training business that we did before, what we can do now is we can train a single policy and we can then condition it on high quality and fast speed. Right, because now the policy takes that conditioning as input in this metadata field. Okay, and this can go as far as the literal thing that you put into the text: "make no mistakes." Right, it's a little bit of a meme that you can tell the robot to not make mistakes, but it actually makes the robot better uh, at doing the task that it's supposed to do when you train it like this. Okay? So, kind of to get the core point across, when we train our policies not in an unconditional way but with very rich conditioning like this, we can make their training time task easier, so they fit the data better. And we can at inference time choose the behavior mode that we want and typically we choose high performance behaviors. Okay, there's a little bit more detail here because some of these conditionings we cannot just set for a whole episode, right? Like while we can say for the whole episode we want you to behave at high quality, um, we cannot put a single subgoal and hold it constant for the whole episode or a single subtask and hold that constant for the whole episode. Uh, so certain of these conditionings we can either provide as humans or we can train models to provide these conditionings. Um, so in the case of this work, we have trained a high-level policy to produce subtask conditioning, um, we have trained an image-editing model to produce subgoals. Essentially, you know, predict what a few seconds from now the world may look like so as to elicit from the policy this behavior to get to that subgoal. Okay, I also want to point out that we are not the first people to think about this idea. Uh, there has been quite a bit of work in the kind of steerability and conditioning world uh, outside of robotics a lot about prompt expansion

Segment 10 (45:00 - 50:00)

and so on in language and in vision models. But also in robotics, I want to point out this work by Will, uh, who is a PhD student in Sergey's lab, uh, who has already looked at these kind of steerability questions. And I would think of π0.7 as kind of a scaled up and industrialized version of a very similar idea. Okay, what does it give us? Um, essentially, it gives us policies that can perform at really high performance with a single policy checkpoint, okay? So, what I want to emphasize is that all of these videos come from a single checkpoint that we trained and we just conditioned it on, you know, folding a shirt, screwing in a screw, building a box, and so on. Okay, and we have shown tasks of this complexity before in previous works, but typically we had to do a lot of post-training to get policies to do this task well. And then they wouldn't do any other task once we had fine tuned them for a specific one. And you can see that some of these tasks are really complex like the screwing one I really like, uh, where the robot needs to be really careful right here where it essentially tries to get the screwdriver to exactly fit into that screw before it can screw it in. Okay? So, if we look at, um, numbers, we can see that these policies perform at the level of our previously best policies on each of these tasks. And so all of the gray bars here typically had some elaborate post-training procedure, but in π0.7, we can kind of absorb all of that data into a single model and perform at the same level. Okay, my favorite plot from the paper is this one. Um, so what we did here is an experiment where we took a lot of our laundry folding data. Looks a bit like the one on the right here. Um, and we ranked it by quality. So, we kind of took our whole data set, we sorted it by quality. And then we trained policies with different levels or different amounts of our data, like 30, 50, 80, and 100%. 
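The data setup for this experiment, ranking by quality and keeping the top 30, 50, 80, or 100%, can be sketched in a few lines. This is only an illustration of the subsetting step; the function name `top_fraction` and the per-episode quality scores are hypothetical, and the talk does not say how quality was scored.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def top_fraction(episodes: Sequence[T], scores: Sequence[float],
                 fraction: float) -> List[T]:
    """Return the best-scoring `fraction` of episodes, best first."""
    if len(episodes) != len(scores):
        raise ValueError("episodes and scores must have the same length")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Sort by score only, highest first; ties keep their original order.
    ranked = sorted(zip(scores, episodes), key=lambda pair: pair[0], reverse=True)
    keep = max(1, round(fraction * len(ranked)))
    return [episode for _, episode in ranked[:keep]]


# Subsets for the four data fractions mentioned in the talk.
episodes = list("abcdefghij")
scores = [5, 9, 1, 7, 3, 8, 2, 10, 4, 6]
subsets = {f: top_fraction(episodes, scores, f) for f in (0.3, 0.5, 0.8, 1.0)}
```

Each larger subset contains the smaller ones plus progressively lower-quality episodes, which is what lets the plot show whether adding the worse data helps or hurts.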
Uh, and so the bottom curve is our previously, you know, best training recipe with the π0.7 architecture, so everything is controlled. And you can see that initially more data helps, but eventually, when you go to like the really bad data in some sense, adding that data in originally seemed to hurt a lot. I mean, this is quite intuitive. If you force your model to model the whole distribution, adding bad stuff into that distribution will make your model worse. Okay, but this is a little bit sad because we had collected this data. It was a lot of effort collecting this data, so we would want to benefit from it. And so what we see here in the yellow line is when we actually add this metadata conditioning, explicitly telling the model, "Look, this data sample, we know it's bad, but you need to predict the action anyways." And then condition at inference time on good behavior, we can actually get this model to perform significantly better. Okay, this is a very positive sign because it basically means that we have found a training recipe to take bad data alongside good data and help our models generalize better and get to higher performance. Okay, um, a few other highlight results. I think I'm running towards the end of time here. Um, one really exciting one, um, that we found is that when you train models with diverse conditioning including image conditioning, you can actually get them to transfer skills across robots. Um, so this is an example. We have a lot of this kind of data. This is called an ARX 5 arm. It is a Trossen robot, I think, um, folding shirts. So, we have that data in our training mix. But then what we tested is we went to a completely different robot. So, this is a UR5 bi-arm station. These are much bigger robot arms, much more heavy, expensive, and industrial. Um, and we asked it to also fold a shirt. And importantly, we had never collected data with laundry for UR5, okay? 
So, this robot doesn't know what it means to fold a shirt technically. But what we find is that when we use subgoal conditioning, and you can see predicted subgoals at the top, we can actually get this robot to fold shirts. And this was like very surprising to us. Okay, and so the intuition here is that actually predicting a subgoal of a folded shirt is a relatively easy task because folded shirts, you know, they kind of look quite similar. And then what we found, which was a surprising bit, is that the UR5 station, which had been trained on many other tasks, had learned how to move the world into the state that we wanted to move it in. Okay, so by predicting the right subgoals, we can kind of guide a UR5 robot for a completely new task, which it hadn't seen before. Um, which is a really exciting result because it allows us to transfer skills across robots. Okay, final experiment that I want to show before I wrap up is about coaching. Uh, so I think this kind of points to where I think the whole thing is headed in the future. Um, so one thing that steering enables is that you can actually teach robots new tasks without using teleoperation, right? Because now this robot listens to language really well. And so you can get it to do new tasks simply by talking to it. Um, so here in this example, we wanted to teach the robot how to use the air fryer. Um, we hadn't collected data on air fryers, right? So, this is an unseen object. The robot doesn't really know what to do with an air fryer. Uh, it sees the handle, so it can kind of pull it open. It has a little bit of this kind of manipulation intelligence. Sees a potato, so it picks it up. It sees something that looks like a bowl, so it puts it inside. But, it's not very, you know, coherent. Like, it doesn't know that it needs to push it back in, for example, for the

Segment 11 (50:00 - 55:00)

task to make sense. But, then what we can do is that uh our master coach, Lucy, um can actually teach the robot how to do this task without teleoperation simply by talking, right? So, she can give it very detailed instructions of opening the air fryer, putting the potato inside, and then closing it again. So, essentially walking it through the full task purely in language. Let's see whether the thing actually closes. Uh, it's struggling a little. Okay. And then we can take this data that we have now collected of the robot doing the full task with Lucy's help. And we can distill it back into a fully learned policy that can now actually do this task. Right? So, this is now an end-to-end policy, no more Lucy um that actually goes through all of the steps of this task and is kind of successful uh at actually, you know, using the air fryer the way we would expect it to be used. And so, the part that is kind of exciting for future work here is that all of this was possible without any more teleoperation, right? The robot can kind of learn to string together a few of the basic manipulation primitives that we have learned from a lot of data into a new task. And it can do this here with a human's help, right? Lucy had to be really detailed about how to do these different parts of the task. But, in the future, you could imagine that, you know, a reasonably powerful vision language model, for example, could be used to help the robot go through new tasks simply by providing language instructions of how to do different steps of this task. And so, I think kind of if we think of, okay, where are we today and how do we get to these really long horizon autonomy futures that I have envisioned? I think what is missing is basically tying in high-level intelligence with this low-level manipulation intelligence that we have built. Right? 
So, now we have robot policies that when you tell them to open an air fryer, even though you haven't really ever collected data on opening air fryers, you can kind of get the thing to do it. And so, now we just need a high-level intelligence that can walk robots through new tasks to solve all these kind of complicated long horizon problems that we discussed in the beginning. Okay. And so, I think it's a really exciting direction to try and work on how to integrate high-level reasoning and intelligence into uh the primitive robot manipulation behaviors uh that we can now do. Okay. Uh, with that I want to close. Uh, I think we're about at time. Um, to summarize what we discussed today, I think there were two key ingredients that I covered. Uh, basically, we talked about memory and how to give our policies memory. And then we talked about how to, on the training and algorithmic side, train policies that both have high performance and generalize. Final slide. I uh have to plug some of the resources that we have put out to help people do research. If you're, you know, if you got excited about this kind of stuff, um we have open sourced quite a few of the models that we trained at PI, for example, the π0.5 model. Uh, so you can use those. Um, we have also worked on benchmarking. Um, so there is a uh real-world robot benchmark called RoboArena, which we have developed with academic collaborators. Um, you can submit your policy and get it evaluated uh in the real world, which is pretty cool. Uh, and then also very recently uh with students from the University of Washington, uh we have built these simulation environments, which essentially allow you to build uh digital replicas of real-world environments, so you can do your evaluation much, much more easily before you actually put your robots uh onto or your policies onto a real robot. So, if you want to do research in this direction, please check those out. I think they should be helpful. 
Uh, and I'm happy to take more questions now. Thank you. Yes. Um, really cool talk. Uh, um I was wondering if the grilled cheese example um like it's a deformable object that where you can fold or you can drop it into different shapes and different ways. I was wondering like how does memory help in that case and have you done Yeah. Have you guys done more work on the on the Yeah. Um, so, you know, in principle, memory should help for any task that requires you to reason about dynamics. But, if you don't have memory, it's very hard to understand dynamics in your system. And so, if you have a floppy object, it actually becomes a somewhat dynamical task to manipulate that object. You know, I don't want to overstate this grilled cheese example. Like, I don't think the grilled depending on the cheese, it's not actually that floppy, so I think even a memoryless policy could scoop it up and move it. Um, you know, there are other tasks that are a lot more dynamic, like picking up objects from moving conveyor belts, for example, where you would totally expect that memory should help. Um, I'm sure you can come up with other more floppy uh deformable objects where memory should also help. Um, but yeah, I think for this grilled cheese example in particular, I don't want to like overstate the floppiness. Uh, and so, I think this one you could probably solve this part without memory. Uh, I think the part that is really impossible to solve uh is the waiting for it to cook part uh because For example, we see like memory the need for memory grows.

Segment 12 (55:00 - 60:00)

I guess AI in general like do you see embedded memory needs to keep growing like Is there Can we get to a point where we need to offload some of these to the cloud or do you think it's fine off the memory on the robot? Okay, so this is an interesting point. Actually, all of our models run in the cloud all the time. Uh, there is nothing on the robot. Uh, the robot is basically a dumb camera with a bunch of actuators. And then all the intelligence lives somewhere in the cloud. Um, this is currently fine. Uh, basically, the round trip communication time to a cloud-hosted policy is like maybe less than 20 milliseconds. Uh, so our policy inference time is like 150 milliseconds, so the network latency is not too big of a fraction here. You would expect that as you kind of try to improve the um dynamics of like try to make your policies faster, essentially, you need to have them be more reactive. And so, you would need to push down on the latency, and then at some point probably you reach a point where the network latency is actually a problem. And at that point you probably want to put some stuff on the robot uh like with some local compute. Uh, we just haven't quite reached that point yet, I would say. Yeah. I just I want to talk a little bit more about the waiting examples. Like, for example, with the grilled cheese again, like the kind of difficulties of like having a like a long time of you guys not like doing anything. Is there a way Is it convenient to have the robot switch between like the kind of waiting mode and like long mode? Because I don't think you guys can do it on itself. Like, do you envision like a robot that could say know when to switch from kind of like that kind of faster, high-resolution data to like the long term like waiting data? Hm, that's an interesting question. Like, I think as soon as you start being concerned about like energy management and stuff like this, maybe this is like a really top-of-mind concern. 
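To put rough numbers on the cloud-latency answer earlier in this segment, here is a small worked calculation. The 20 ms round trip (quoted as an upper bound) and the 150 ms inference time come from the talk; the faster inference times are hypothetical values chosen to show when the network starts to dominate, and `network_share` is a name made up for this sketch.

```python
def network_share(round_trip_ms: float, inference_ms: float) -> float:
    """Fraction of one control step spent on the network round trip."""
    return round_trip_ms / (round_trip_ms + inference_ms)


ROUND_TRIP_MS = 20.0  # upper bound quoted in the talk

# 150 ms is the inference time from the talk; 50 ms and 20 ms are hypothetical.
for inference_ms in (150.0, 50.0, 20.0):
    share = network_share(ROUND_TRIP_MS, inference_ms)
    print(f"inference {inference_ms:5.0f} ms -> network share {share:.1%}")
```

At today's numbers the network accounts for about 12% of each step. If inference were ever pushed down to 20 ms, the same round trip would be half of the step, which is the point at which putting some compute on the robot starts to look attractive.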
Uh, we are not currently very concerned with that. Like, we have these like huge mobile bases. We can put a lot of batteries there if we wanted to or we can tether. Um, I think yeah. Like, I think if you look at how current, for example, coding agents work, right? They have this ability to like set timers and just like wait. Right? And And they will deterministically get a callback after 3 or 5 minutes, and then they will wake up and do something again. I think the benefit here is less about kind of saving energy or something like this. It's more about the deterministic part. Essentially, as soon as you put the grilled cheese down, you can set, you know, a deterministic timer that 5 minutes later you will come back and like make a decision. Um, and so, right now this is all kind of latent and learned somewhere in the memory. Uh, I think there could be benefits to having it be a lot more explicit. Um, but we haven't really, you know, tried this, but I think there's nothing in principle that would prevent you from doing that. Yeah. Okay. You're asking basically whether we stopped teleoperating and just like talk to the robot now? Okay. Yeah. Good question. Okay, there are a lot of edge cases. Like, you know, don't get me wrong, these policies are not perfect by any means. Their low-level dexterity is not human level at all. Right? So, we definitely will do a lot of teleoperation. Um, so Okay, this is good to say. So, basically, you know, I talked about this coaching example and like how do you get policies to do new things? I think this is really useful for stringing together primitive manipulation behaviors. Like, you know, if it already knows how to pick up a potato, pull a handle, you can string those together in language. If it has never seen how to do an in-hand rotation of a pen, you will not be able to describe in language and get the model to do that, right? Uh, okay. 
Right now it has parallel jaws, so it can't do that anyways, uh but, you know, like you get the point. There are certain manipulations if you have never seen them in your teleop data, you cannot get the robot in language to do those, right? Like, language is fundamentally a little bit more of an abstract medium. And so, there are certainly many, many behaviors that we haven't yet taught the robots to do as a low-level manipulation behavior. And so, we're nowhere near stopping to teleoperate, for example. So, we do need to collect a lot more data to get robots to be very generally capable in manipulating objects. Um, I think what I was trying to point to is that I think there will be a second mode to teach these robots which goes more towards long horizon manipulation, where we can get them to do things without having to teleoperate for hours uh to get really long horizon behaviors. If that makes sense. Cool. Yeah. Do you see yourselves changing to more complicated hands like human hands or are you trying to improve the quality first before getting into more hands? Yeah. Okay, everybody loves a complicated hand, right? Like if I could get a five-finger hand that works perfectly and never breaks, I would love it. Um there are some very practical constraints. Um so basically a very complex hand is going to be really expensive. Um typically if you look today at the market, there are a lot of hands. Um some of them are very simple. They can basically just do this
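The explicit timer idea raised earlier in this segment (put the grilled cheese down, get a callback a fixed time later, then decide) could look roughly like the sketch below. The speaker says this has not been tried, so this is not how the actual system works; `cook_then_check` and `check_food` are hypothetical names, and the wait is shortened from minutes to a few milliseconds so the example runs quickly.

```python
import asyncio
from typing import Callable


async def cook_then_check(wait_s: float, check: Callable[[], str]) -> str:
    """Wait for a fixed time, then make exactly one decision."""
    await asyncio.sleep(wait_s)  # explicit timer instead of learned waiting
    return check()


def check_food() -> str:
    # Placeholder decision; a real system would query the policy here.
    return "flip"


# The talk mentions waits of minutes; 0.01 s is used here only for speed.
decision = asyncio.run(cook_then_check(0.01, check_food))
print(decision)  # prints "flip"
```

The design point is the one made in the answer: the wake-up time is fixed when the timer is set, instead of being something the policy has to infer from its memory.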

Segment 13 (60:00 - 65:00)

so that's not very helpful. There are hands that have very high degrees of freedom, but they're typically extremely expensive. Like they cost maybe $20,000 or more. And our whole robot arm costs like less than $5,000. Right? So if you add a hand for $20,000, that like makes a robot a lot more expensive. Um and also they break a lot, right? Like because this is a very, you know, fragile piece of mechanical engineering. And so if your robot policy isn't perfect, and our policies are not perfect, they will oftentimes like bang their hand or their gripper into the table. And if this is an expensive $20,000 hand, and it takes like a month to repair somewhere in China, um then you don't want to break it. And so, you know, just for practical reasons, basically all of our robots right now have these parallel jaw grippers. And you can see there's a little bit of variation, right? We have like the pointy ones, and we have the more kind of hard metal ones. Um so we try to adapt them a little bit, but it's mainly for reliability reasons and cost reasons. So there's a follow-up to you changing the grippers. Like do you train specifically for the different variations of the grippers, or is it all pretty much the same? So it's all the same. Like it all goes into the big training mix. Like I want to emphasize there's a single checkpoint here, right? Like we it was one training run. Um but we do try to adapt the grippers to the task. So for example, when we use these kind of like more tool use kind of examples, we do find that having a little bit more pointiness and kind of friction on the gripper helps. Um and then on the other hand, if you want to do a more precise task like this screwing task, uh if you used one of those bulky grippers, there would be no hope. Uh but if you use like this more precise pinchy gripper, uh you can actually get it to do these very complicated kind of fine motion tasks. 
Um, so we do adapt the grippers, but they're all currently parallel jaw for cost and reliability. I hope that at some point somebody will make me a really nice hand. I would be very excited. Yeah. Um, in terms of fine motion, more than horizon: do you have any interest in fine motion manipulation? We do. For example, actually, specifically that bottom left corner with the knife. Um, like a much more fine cutting type of task, like a Lego sculpture, for example. Yeah. So I think, you know, we're interested in solving all the tasks, but I think there are many tasks that require very precise motion, specifically in industrial contexts. I mean, Michelin star chefs maybe also, but typically industrial contexts require very high precision, like these screwing tasks. Uh, and so we want to be able to solve those. And there's nothing fundamentally different about those tasks. What makes them harder is that you need to have the hardware to actually support data collection for those tasks. Right? That goes down to the hardware of the robot, but also your teleop interface, which needs to be precise enough and needs to make it intuitive enough for humans to make very fine motor adjustments. Um, but if you put those pieces in place, I think there's nothing in principle that would require a different type of model. You just need to be able to collect data that has precise enough motions to be able to get that behavior. So one question from a philosophy point of view: do you think coming at it like this, with a teleoperator, down towards fine motion tasks, is the way to solve the problem of fine motion dexterity in robotics? Okay. Okay, let me try to answer this way. So I think we have work where we show that, as long as you can get close enough with a human teleoperator, you can use other learning techniques like reinforcement learning to actually close that loop and get really high performance on very precise tasks.
Um, so actually the plot I showed earlier was from this work called RLT, um, which is RL fine-tuning, essentially. And what we did there is tasks like tying a zip tie, for example, or this, like, screwing task, where when you only do teleoperation, you get a, you know, medium-performance policy that can occasionally close the zip tie, but oftentimes will kind of miss and not quite get it. Um, but then with RL, as soon as you have a solution that's close enough, you can do a little bit of RL fine-tuning, like maybe an hour's worth of fine-tuning, and you can get a policy to go really high performance, like high reliability, on, you know, very precise motion tasks like closing a zip tie. Um, so, you know, maybe to answer your question, I think teleoperation will get us very close. Like, it will get us in the right region, but then we can use more outcome-driven ways of teaching the policies to actually get us to very high performance. And then what π0.7 shows us is essentially that once we have an existence proof for a policy that can do this task with high performance, we can use our learning algorithm to distill it into a single checkpoint. Right? Like, once we're able to collect a bunch of examples of the policy doing this with really high performance, we can take that data, put it into our training mix, and then through conditioning kind of pull out this high performance at the end. Yeah. Question. So, um, are there other metadata being applied in this besides just, um, only training? Um, so we try to condition on diverse

Segment 14 (65:00 - 65:00)

types of metadata. Um, I don't think I should go into too much detail here on exactly what these metadata pieces are, but, um, they're generally, I would say, related to things like speed, quality, and so on. Um, there is a bit of work that goes into deciding exactly what things you want to condition on and how to, for example, annotate those things. Um, but it's generally in that direction, if that's enough information. All right. Are there any more questions? Thanks a lot, guys. All right.
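To make the conditioning idea from the last two answers more concrete, here is a minimal sketch of keeping mixed-quality data in one training mix and asking for the high-performance behavior at inference time. The speaker deliberately does not describe the real metadata, so everything here is an assumption: the `Episode` fields, the label values, and the text prefix format are all hypothetical and only illustrate the general pattern.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    task: str
    source: str   # hypothetical label, e.g. "teleop" or "rl_finetuned"
    quality: str  # hypothetical annotation: "medium" or "high"
    speed: str    # hypothetical annotation: "slow" or "fast"


def conditioning_prefix(task: str, quality: str, speed: str) -> str:
    """Build the text a policy would be conditioned on (assumed format)."""
    return f"task: {task} | quality: {quality} | speed: {speed}"


mix = [
    Episode("close zip tie", "teleop", "medium", "slow"),
    Episode("close zip tie", "rl_finetuned", "high", "fast"),
]

# Training: every episode stays in the mix, labeled with its own metadata.
train_prefixes = [conditioning_prefix(e.task, e.quality, e.speed) for e in mix]

# Inference: request the best behavior that appears in the data.
query = conditioning_prefix("close zip tie", "high", "fast")
print(query in train_prefixes)  # prints "True"
```

The sketch only shows the labeling side. How such labels are annotated and fed into the model is exactly the part the speaker leaves unspecified.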
