# The World's First FULLY AUTONOMOUS Robotics System Is Here (Physical A.I)

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=1N0vwUHPJKY
- **Date:** 05.11.2024
- **Duration:** 14:01
- **Views:** 12,631
- **Source:** https://ekstraktznaniy.ru/video/13820

## Description

Prepare for AGI with me - https://www.skool.com/postagiprepardness 
🐤 Follow Me on Twitter https://twitter.com/TheAiGrid
🌐 Checkout My website - https://theaigrid.com/


Links From Today's Video:
https://www.physicalintelligence.company/blog/pi0

Welcome to my channel where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos.

Was there anything I missed?

(For Business Enquiries)  contact@theaigrid.com

#LLM #Largelanguagemodel #chatgpt
#AI
#ArtificialIntelligence
#MachineLearning
#DeepLearning
#NeuralNetworks
#Robotics
#DataScience

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

So there is a new company called Physical Intelligence, and they are really innovating the game of robotics. In this video I'll dive into exactly what they're doing and what makes them so special.

Their blog post starts by saying that we are living through an AI revolution: the past decade witnessed practically useful AI assistants, AI systems that can generate photorealistic images and videos, and even models that can predict the structure of proteins. But in spite of all these advances, human intelligence dramatically outpaces AI when it comes to the physical world. To paraphrase Moravec's paradox, winning a game of chess or discovering a new drug are easy problems for AI to solve, but folding a shirt or cleaning up a table requires solving some of the most difficult engineering problems ever conceived. To build AI systems with the kind of physically situated versatility that humans possess, we need a new approach: we need to make systems embodied so that they can acquire physical intelligence.

The post continues: over the past eight months we've developed a general-purpose robot foundation model that we call π0 (pi-zero), and we believe this is a first step toward our long-term goal of developing artificial physical intelligence, so that users can simply ask robots to perform any task they want, just as they can ask an LLM or a chatbot assistant. Like LLMs, our model is trained on broad and diverse data and can follow various text instructions. Unlike LLMs, it spans images, text, and actions, and acquires physical intelligence by training on embodied experience from robots, learning to directly output low-level motor commands via a novel architecture. It can control a variety of different robots and can either be prompted to carry out a desired task or fine-tuned to specialize it for challenging application scenarios.

This is where they talk about the promise of generalist robot policies. It says today's robots are narrow specialists: industrial robots are programmed for
repetitive motions in choreographed settings, repeatedly making the same weld in the same spot on an assembly line or dropping the same item into the same box. Even such simple behaviors require extensive manual engineering, and more complex behaviors in messy real-world environments, such as homes, are simply infeasible.

They state that if we could train a single generalist robot policy that can perform a wide range of different skills and control different robots, we could overcome this challenge. Such a model would need only a little bit of data from each robot and each application: just as a person learns a new skill quickly by drawing on a lifetime of experience, a generalist robot policy could be specialized to new tasks with only modest amounts of data. And this would not be the first time that a generalist model beat a specialist at the specialist's own tasks. Language models have superseded more specialized language processing systems because they can better solve these downstream specialist tasks by drawing on their diverse, general-purpose pre-training. In the same way that LLMs provide a foundation model for language, these generalist robot policies will provide a robot foundation model for physical intelligence.

To get there, they're going to need to solve major technical challenges. They say that their first step is π0, a prototype model that combines large-scale multitask and multi-robot data collection with a new network architecture that enables the most capable and dexterous generalist robot policy to date. While they believe this is only a small, early step toward truly general-purpose robot models, they think it represents an exciting step that provides a glimpse of what is to come. I personally believe this company is doing something absolutely incredible: looking at the majority of these demos in real time, I can see that the tasks these robots are faced with are genuinely really hard.
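The multitask, multi-robot data collection they describe amounts to training on a weighted mixture of heterogeneous datasets. A minimal sketch of such mixture sampling is below; the dataset names, contents, and weights are made-up illustrations, not Physical Intelligence's actual data composition.

```python
import random

# Hypothetical training mixture: names, contents, and weights are
# illustrative only, not the actual π0 data composition.
datasets = {
    "open_source_manipulation": [f"episode_{i}" for i in range(30)],
    "in_house_dexterous":       [f"episode_{i}" for i in range(20)],
}
weights = {"open_source_manipulation": 0.6, "in_house_dexterous": 0.4}

def sample_batch(batch_size=8, rng=None):
    """Draw a training batch: pick a dataset by weight, then an episode from it."""
    rng = rng or random.Random(0)
    names = list(datasets)
    w = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=w, k=1)[0]
        batch.append((name, rng.choice(datasets[name])))
    return batch

batch = sample_batch()  # 8 (dataset, episode) pairs, weighted toward the larger source
```

Weighting the sources, rather than concatenating them, is what lets a small in-house dexterous dataset still contribute meaningfully next to much larger open-source corpora.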
How do you manage to train a robot to fold a shirt when there are a million different ways of picking it up? It could crumple, it could fall this way, it could fall that way; there are a million different scenarios these robots have to face. Watching these robots perform these actions and complete them successfully time and time again shows me that Physical Intelligence's first generalist policy, π0, is steps ahead of the other robot demos I've seen, especially some of the ones that have broken the internet. What you need to understand is that this is fully autonomous: none of the demos you're currently seeing are teleoperated; it's all being done by their one single generalist policy. Here you can see their cross-embodiment training mixture, where π0, the brain, uses internet-scale vision-language pre-training, open-source robot manipulation datasets, and their own datasets consisting of dexterous tasks from eight distinct robots. The model can then perform a variety of tasks via

### Segment 2 (05:00 - 10:00) [5:00]

either zero-shot prompting or fine-tuning. The entire dataset contains diverse tasks, with each task exhibiting a wide variety of motion primitives, many different objects, and various scenes. The tasks in this dataset exercise different dimensions of robot dexterity while covering the range of real tasks these robots might be asked to perform: bussing dishes, packaging items into envelopes, folding clothing, routing cables, assembling boxes, plugging in power plugs, packaging food into to-go boxes, and picking up and throwing out the trash. Their goal in selecting these tasks is not to solve any particular application but to provide their model with a general understanding of physical interactions, which is basically the initial foundation for physical intelligence. They're basically stating: look, we're not training it for anything specific; we're training it so that it understands a huge variety of different tasks, and that way it becomes smarter overall.

On inheriting internet-scale semantic understanding, the post says that beyond training on many different robots, π0 inherits semantic knowledge and visual understanding from internet-scale pre-training by starting from a pre-trained vision-language model (VLM). VLMs are trained to model text and images on the web; widely used VLMs include GPT-4 Vision and Gemini, and they use a smaller 3-billion-parameter VLM as a starting point and adapt it for real-time dexterous robot control. Now, VLMs effectively transfer semantic knowledge from the web, but they are trained to output only discrete language tokens, and dexterous robot manipulation requires π0 to output motor commands at a high frequency, up to 50 times per second. To provide this level of dexterity, they state that they've developed a novel method to augment pre-trained VLMs with continuous action outputs via flow matching, a variant of diffusion models. Starting from diverse robot data and a VLM pre-trained on internet-scale data, they train their vision-language-action flow
matching model, which can then be post-trained on high-quality robot data to solve a range of downstream tasks.

Next we have post-training for dexterous manipulation, which is basically where they train the model in a very specific way for certain tasks that are really difficult. They're stating that fine-tuning the model with high-quality data for a challenging task like folding laundry is quite similar to how post-training processes are employed by LLM designers: pre-training teaches the model about the physical world, while fine-tuning makes it perform a particular task really well. One task was, of course, laundry: they fine-tuned π0 to fold laundry using either a mobile robot or a fixed pair of arms, with the goal of getting the clothing into a neat stack. This task is exceptionally difficult for robots (and, of course, for some humans). While a single t-shirt laid flat on a table can sometimes be folded by repeating a pre-scripted set of motions, a pile of tangled laundry can be crumpled in many different ways, so it's not enough to simply, you know, move the arms through the same kind of motion. And to their knowledge, no prior robot system has been demonstrated to perform this task at this level of complexity. What they're saying here is: look, this robot is currently state of the art; clothes can be crumpled in a variety of different ways, and achieving repeatable success, which is what they're demonstrating here, is something we simply haven't seen before, and it's particularly hard to do.

They also talk about table bussing. They fine-tuned the model to bus a table, which requires the robot to pick up the dishes and trash on the table, putting any dishes, cutlery, or cups into a bussing bin and putting trash into the trash bin. This task requires the robot to handle a dizzying variety of items. One of the exciting consequences
of training π0 on a large and diverse dataset was the range of emergent strategies the robot employed: instead of simply grasping each item in turn, the model could stack multiple dishes to put them into the bin together, or shake trash off a plate into the garbage before placing the plate into the bussing bin. This is rather fascinating. We often talk about how robots and models have these emergent capabilities, but having this robot stack multiple dishes to put them in the bin together, or shake trash off a plate before placing the plate into the bussing bin, is actually pretty cool; it's quite similar to what humans would do in that scenario. Now, of course, this is where they have something that seems absolutely incredible: they got this robot to assemble a box. It says here the robot has to take a flattened cardboard box and build it, folding in the sides and then tucking in

### Segment 3 (10:00 - 14:00) [10:00]

the flaps. This is very difficult because each fold and tuck might fail in unexpected ways; the robot needs to watch its progress and adjust as it goes. It also needs to brace the box with both arms, even using the table, so that the partially folded box doesn't come apart, which is pretty incredible when you see exactly how this robot does it. There's a lot of subtle information that goes into training these robots, so seeing them do this firsthand is truly fascinating. I know a few people who struggle to fold certain boxes.

Now, of course, they evaluate this model. They compared π0 to other robot foundation models that have been proposed in the academic literature on their tasks: OpenVLA, a 7-billion-parameter VLA model that uses discretized actions, and Octo, a 93-million-parameter model that uses diffusion outputs. These tasks are very difficult compared to those used in academic experiments. For example, the tasks used in the OpenVLA evaluation typically consist of single-stage behaviors, like putting an eggplant into a pot, whereas their simplest bussing task is already pretty difficult because it involves multiple objects that must be sorted into either a garbage bin or a bussing bin, and their more complex tasks might require multiple stages, manipulation of deformable objects, and the ability to deploy one of many possible strategies given the current configuration of the environment. These tasks are evaluated according to a scoring rubric that assigns a score of one for fully successful completion, with partial credit for partially correct execution; for example, bussing half the objects leads to a score of 0.5.

The average scores across five zero-shot evaluation tasks are shown in the graph on screen, comparing the full pre-trained π0 model to π0-small, a 470-million-parameter model that does not use VLM pre-training. While OpenVLA and Octo can attain nonzero performance on the easiest of the tasks, π0 is by far the best-performing model across all of them, and the small version of π0 attains the second-best performance, but there is more than a two-times improvement in performance from using the full-size architecture with VLM pre-training.

They also state where they go from here. The mission of Physical Intelligence is to develop foundation models that can control any robot to perform any task, and their experiments so far show that such models can control a variety of robots and perform tasks that no prior robot learning system has done successfully, such as folding laundry from a hamper or assembling a cardboard box. But generalist robot policies are still in their infancy, and there is a long way to go. The frontiers of robot foundation model research include long-horizon reasoning and planning, autonomous self-improvement, robustness, and safety, and they expect that in the coming year we'll see major advances along all of these directions. Still, the initial results paint a promising picture for the future of robot foundation models: highly capable generalist policies that inherit semantic understanding from internet-scale pre-training, incorporate data from many tasks and robot platforms, and enable unprecedented dexterity and physical capability. They also think that succeeding at this will require not only new technologies but a ton more data and, of course, a collective effort involving the entire robotics community. They've already got collaborations underway with a number of different companies and robotics labs, both to refine hardware designs for teleoperation and autonomy and to incorporate data from their partners into their pre-trained models, so that they can provide access to models adapted to their partners' specific platforms.

Overall, I think Physical Intelligence is absolutely incredible. They've managed to demonstrate that their models can successfully perform a variety of different tasks that we previously would have assumed were teleoperated, but here we can see entire operations from start to finish, completely autonomous, in a zero-shot setting. It will be interesting to see how this company develops over time.
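The flow-matching action head described in the video can be sketched in miniature: starting from Gaussian noise, a learned velocity field is integrated over a few steps to produce a continuous chunk of motor commands, rather than discrete tokens. Everything below is a toy illustration under stated assumptions: the network (here a closed-form stand-in), its conditioning on images and language (omitted), and the shapes (a 50-step chunk of 7-dof actions, loosely echoing the 50 Hz figure) are not π0's actual implementation.

```python
import numpy as np

def sample_action_chunk(velocity_field, horizon=50, action_dim=7, steps=10, seed=0):
    """Integrate a learned flow from Gaussian noise to a continuous action chunk.

    `velocity_field(a, t)` stands in for the trained network, which in the
    real system would also be conditioned on camera images and a language
    instruction (omitted here for brevity).
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((horizon, action_dim))  # a_0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        a = a + dt * velocity_field(a, t)  # Euler step along the learned flow
    return a

# Toy "trained" field: transports any sample onto a fixed target chunk,
# mimicking the straight-line probability paths used in flow matching.
target = np.zeros((50, 7))
toy_field = lambda a, t: (target - a) / max(1.0 - t, 1e-6)
chunk = sample_action_chunk(toy_field)  # shape (50, 7), converges to target
```

The design point this illustrates is why flow matching suits high-rate control: a handful of integration steps yields a whole continuous action chunk at once, instead of autoregressively decoding discrete tokens for every one of the 50 commands per second.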
