❤️ Check out Lambda here and sign up for their GPU Cloud: https://lambda.ai/papers
Guide for using DeepSeek on Lambda:
https://docs.lambdalabs.com/education/large-language-models/deepseek-r1-ollama/?utm_source=two-minute-papers&utm_campaign=relevant-videos&utm_medium=video
📝 The paper "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots" is available here:
https://github.com/NVIDIA/Isaac-GR00T
https://arxiv.org/abs/2503.14734
📝 My paper on simulations that look almost like reality is available for free here:
https://rdcu.be/cWPfD
Or this is the orig. Nature Physics link with clickable citations:
https://www.nature.com/articles/s41567-022-01788-5
🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible:
Benji Rabhan, B Shang, Christian Ahlin, Gordon Child, John Le, Juan Benet, Kyle Davis, Loyal Alchemist, Lukas Biewald, Michael Tedder, Owen Skarpness, Richard Sundvall, Steef, Taras Bobrovytsky, Thomas Krcmar, Tybie Fitzhugh, Ueli Gallizzi
If you wish to appear here or pick up other perks, click here: https://www.patreon.com/TwoMinutePapers
My research: https://cg.tuwien.ac.at/~zsolnai/
X/Twitter: https://twitter.com/twominutepapers
Thumbnail design: Felícia Zsolnai-Fehér - http://felicia.hu
#nvidia
Table of contents (2 segments)
Segment 1 (00:00 - 05:00)
This is a historic paper, GR00T-N1, which is going to set off a robotics revolution by giving us…yup. An open foundation model for humanoid robotics. All open, all free for all of us. Wow. And, everyone’s favorite, the super cute robots are hopefully coming soon. Okay, so what is going on here? Is this important? Well, when it comes to robotics, not so long ago, even a company like OpenAI said nope, I am outta here. Why did they do that? Note that you may see some eye candy here, but we are Fellow Scholars here, so rest assured that what you hear in this video is grounded in a proper research paper. Multiple, as you will see in a moment. And we are not shy to talk about limitations either.

So when it comes to robotics, even OpenAI said I’m outta here, and they did that because it was expensive, and more importantly, they had a huge data problem. You see, training a chatbot today is very easy because we have the whole internet at our disposal. It’s just text. Textbooks, courses, everything. Just read and learn it. But for robots moving around in the real world? Not so much. But wait, I hear you asking, Károly, we have millions of videos on YouTube, and you know, the real world is already there. Just look at what humans do! Well…not so fast. This data also needs to be labeled, and we can do that by demonstrating a set of movements to a robot to learn from. Yes, but unfortunately, we would have to do this for every task out there, so we would need millions and millions of labeled demonstrations. Labeled. We need to know who is doing what exactly in each of these examples. Labeling all this data is tiring, and there’s just not enough time.

So here comes the crazy secret sauce. They use a system called Omniverse to create a video game version of the world. Super accurate, you can have a whole factory in it, everything is digital, everything is labeled. But it does not always look realistic enough. So now, let’s plug it into Cosmos, their system that can take the video game footage as a baseline and create tons and tons of new realistic videos, and hence, tons of training data. As much as you want. Practically infinite…and most importantly, all labeled. And don’t forget, this also ensures that all of these videos are grounded in reality, where physics works the way it should. And note that with human demonstrations, we are limited by the time in our lives. But with a video game world plus the video generator, we can simulate as quickly as our hardware allows. And some of the Omniverse works can simulate more than 25 years’ worth of data in just one day of real time, which is kind of mind-blowing.

But, surprisingly, that is still not enough. That is just secret sauce number one. We need secret sauce number two. This is going to be insane. I mean, this is not a 36-page research paper for nothing. So, there are still tons and tons of videos out there on the internet, unlabeled. And then the researchers say, well, if it is unlabeled, just teach the AI to do the labeling. Yes, really. I couldn’t believe it when I saw it. It looks at the videos and extracts all the useful information out of them, for instance, did the camera move? Where did it move? But not just that! Also, which joint does what, what action is happening on the screen, and so on. With that, every frame is labeled with annotations like actions, joints, goals, and more. Bam. Get it now? This is video game data from reality. My mind is blown…again.
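For the more technical Fellow Scholars, here is a minimal, hypothetical sketch of what that auto-labeling idea can look like in code. This is not the paper's actual architecture, and every name and size below is made up: the idea is just that a small network looks at two consecutive video frames, guesses a compact pseudo action, and is trained so that this guess is enough to predict the next frame.

# Toy sketch of the auto-labeling idea (assumed names and sizes, not NVIDIA's code).
# An "inverse dynamics" network infers a pseudo action from two consecutive frames;
# a forward model checks that the pseudo action explains the change between them,
# so the inferred actions become usable labels for unlabeled video.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, frame):
        return self.net(frame)

class LatentActionLabeler(nn.Module):
    def __init__(self, dim=128, action_dim=16):
        super().__init__()
        self.encoder = FrameEncoder(dim)
        self.inverse = nn.Linear(2 * dim, action_dim)       # two frames -> pseudo action
        self.predictor = nn.Linear(dim + action_dim, dim)   # frame + action -> next frame feature

    def forward(self, frame_t, frame_next):
        z_t, z_next = self.encoder(frame_t), self.encoder(frame_next)
        pseudo_action = self.inverse(torch.cat([z_t, z_next], dim=-1))
        z_pred = self.predictor(torch.cat([z_t, pseudo_action], dim=-1))
        # Training signal: the pseudo action must carry enough information
        # to explain how frame_t turned into frame_next.
        loss = nn.functional.mse_loss(z_pred, z_next.detach())
        return pseudo_action, loss

# Run this over unlabeled internet video and keep the inferred pseudo actions:
# labels that no human had to annotate by hand.
frames_t = torch.randn(8, 3, 64, 64)      # batch of current frames
frames_next = torch.randn(8, 3, 64, 64)   # batch of next frames
pseudo_actions, loss = LatentActionLabeler()(frames_t, frames_next)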
And thus, reality can now be used as proper training data too, as if it were a video game. That is insanity. So, in short, this is super smart and can learn from everything. Teleoperation data, simulation data, internet data, doesn’t matter! It can eat all of it and make sense of it too. Loving it. Now, surprisingly, even this is not good enough to make humanoid robotics work. This problem is super challenging, borderline impossible. But don’t despair, because here comes secret sauce number three. You can tell this is going to be super yummy. What is that? Well, for robots to understand the world around them, they are building on a previous paper called Eagle-2, which is a vision-language model. That is incredible, you see here the second law of papers at work. Everything is connected. Scientists are working all around
Segment 2 (05:00 - 09:00)
the globe to build on each other’s work to create a better world for us. So why is this super cool? Well, it helps a robot think on two different levels. System 2 is slow thinking, with reasoning, to understand the world and make a plan. And we also need the faster System 1 thinking that generates motor actions in real time. Real time is the key here. You see, just one of the two is not enough: System 2 can make a plan, but that’s just a plan. It is also too slow, so it cannot act in real time. System 1 can move in real time, but it does not know what the effect of its movements will be. And, would you look at that…the fast System 1 neural network is a diffusion model. That’s really weird. You see, diffusion models are often used to create images from a bunch of noise, so what does this have to do with motor actions? Well, get this, it really starts out from noise and denoises it until we get not a smooth image, but a smooth motor action. It thinks about motion the way others think about images. That is absolutely incredible stuff.

So combine the two, and you get…what exactly? Well, hold on to your papers, Fellow Scholars, because when comparing it to a previous method, we go from a 46% success rate to 76%. That’s absolutely amazing, a result that, just a few years ago, would have taken us up to a decade to achieve. And now, here it is! And it is so much better than anything else we’ve had before. A complete game changer. And this is why I think GR00T-N1 is going to kick off a robotics revolution. Useful robots that do helpful things for us are now finally within reach. Bravo. And for some reason, I barely see anyone talking about this. Crazy! But this is why Two Minute Papers exists. And now, you also know part of how these robots can learn to be super cute as well. Subscribe and hit the bell icon if you appreciate this.

Now, yes, we are Fellow Scholars here, so we also look at the limitations of these works. One of them: this is not a turnkey solution yet that you can just deploy and have it fold your laundry at home. Perhaps two more papers down the line. But then, it might save your marriage by doing all the folding! Also, this is still about short tasks, which are mostly about manipulating things on a table. But the good news is that the model is free and fully open for all of us, and people can fine-tune it to be better on their own tasks. In fact, this is not just a promise, you Fellow Scholars are already using it for some smaller projects. Just listen to the joy. Amazing. And as you can see, it also works for different embodiments, so you can train it on your particular robot. What an incredible paper, what a time to be alive! And I waited a bit with this video to see how you Fellow Scholars are using it; that is not the best for views, but it makes for a better video for you. So now, let the experiments begin!
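And for those who want to see the two-system idea in code, here is a tiny, hypothetical sketch. The names, sizes, and the single-line denoising update are made up for illustration, not NVIDIA's implementation: a slow System 2 vision-language model would produce a plan embedding, and a fast System 1 diffusion head starts from pure noise and denoises it, step by step, into a short chunk of motor actions.

# Toy sketch of the two-system idea (assumed names and sizes, not NVIDIA's code).
import torch
import torch.nn as nn

class DiffusionActionHead(nn.Module):
    def __init__(self, plan_dim=512, action_dim=24, horizon=16, steps=10):
        super().__init__()
        self.horizon, self.action_dim, self.steps = horizon, action_dim, steps
        self.denoiser = nn.Sequential(
            nn.Linear(plan_dim + horizon * action_dim + 1, 1024), nn.ReLU(),
            nn.Linear(1024, horizon * action_dim),
        )

    @torch.no_grad()
    def sample(self, plan_embedding):
        # System 1: start from noise and denoise it into a smooth action chunk.
        batch = plan_embedding.shape[0]
        actions = torch.randn(batch, self.horizon * self.action_dim)   # pure noise
        for step in reversed(range(self.steps)):
            t = torch.full((batch, 1), step / self.steps)
            noise_estimate = self.denoiser(torch.cat([plan_embedding, actions, t], dim=-1))
            actions = actions - noise_estimate / self.steps            # crude toy update
        return actions.view(batch, self.horizon, self.action_dim)

# System 2 (Eagle-2 in the paper) would turn camera images and an instruction like
# "pick up the red cup" into a plan embedding; here we stand in a random vector.
plan_embedding = torch.randn(1, 512)
action_chunk = DiffusionActionHead().sample(plan_embedding)   # 16 timesteps of 24 joint targets
print(action_chunk.shape)   # torch.Size([1, 16, 24])

The important part is the division of labor: the big, slow model reasons rarely, while the small, fast head runs many cheap denoising steps per second, so the robot can actually move in real time.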