# NVIDIA’s New AI: Insanely Good!

## Metadata

- **Channel:** Two Minute Papers
- **YouTube:** https://www.youtube.com/watch?v=rQJmDWB9Zwk
- **Date:** 10.04.2025
- **Duration:** 9:08
- **Views:** 90,071
- **Source:** https://ekstraktznaniy.ru/video/12465

## Description

❤️ Check out Lambda here and sign up for their GPU Cloud: https://lambda.ai/papers

Guide for using DeepSeek on Lambda:
https://docs.lambdalabs.com/education/large-language-models/deepseek-r1-ollama/?utm_source=two-minute-papers&utm_campaign=relevant-videos&utm_medium=video

📝 The paper "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots" is available here:
https://github.com/NVIDIA/Isaac-GR00T
https://arxiv.org/abs/2503.14734

📝 My paper on simulations that look almost like reality is available for free here:
https://rdcu.be/cWPfD 

Or here is the original Nature Physics link with clickable citations:
https://www.nature.com/articles/s41567-022-01788-5

🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible:
Benji Rabhan, B Shang, Christian Ahlin, Gordon Child, John Le, Juan Benet, Kyle Davis, Loyal Alchemist, Lukas Biewald, Michael Tedder, Owen Skarpness, Richard Sundvall, Steef, Taras Bobrovytsky, Thomas Krcmar, Tybie Fitzhugh, Ueli G

## Transcript

### Segment 1 (00:00 - 05:00)

This is a historic paper, GR00T-N1, which is going to set off a robotics revolution by giving us…yup. An open foundation model for humanoid robotics. All open, all free for all of us. Wow. And, everyone’s favorite, the super cute robots are coming hopefully soon. Okay, so what is going on here? Is this important? Well, when it comes to robotics, not so long ago, even a company like OpenAI said nope, I am outta here.

Why did they do that? Note that you may see some eye candy here, but we are Fellow Scholars here, so rest assured that what you hear in this video is grounded in a proper research paper. Multiple, as you will see in a moment. And we are not shy to talk about limitations either.

So when it comes to robotics, even OpenAI said I’m outta here — and they did that because it was expensive, and more importantly, they had a huge data problem. You see, training a chatbot today is very easy because we have the whole internet at our disposal. It’s just text. Textbooks, courses, everything. Just read and learn it.

But for robots moving around in the real world? Not so much. But wait, I hear you asking, Károly, we have millions of videos on YouTube, and you know, the real world is already there. Just look at what humans do! Well…not so fast.

This data also needs to be labeled, and we can do that by demonstrating a set of movements to a robot to learn from. Yes, but unfortunately, we would have to do this for every task out there, so we would need millions and millions of labeled demonstrations. Labeled. We need to know who is doing what exactly for each of these examples. Labeling all this data is tiring; there’s just not enough time. So here comes the crazy secret sauce.

They use a system called Omniverse to create a video game version of the world. Super accurate, you can have a whole factory in it, everything is digital, everything is labeled. But it does not always look realistic enough. So now, let’s plug it into Cosmos, their system that can take the video game footage as a baseline and create tons and tons of new realistic videos, and hence, tons of training data. As much as you want. Infinite, practically…and most importantly, all labeled. And don’t forget, this also ensures that all of these videos are grounded in reality, where physics works the way physics should really work.

And note that with human demonstrations, we are limited by the time we have in our lives. But with a video game world plus the video generator, we can simulate as quickly as our hardware can do it. And some of the Omniverse works can simulate more than 25 years’ worth of data in just one day of real life, which is kind of mind blowing.

But, surprisingly, that is still not enough. That is just secret sauce number one. We need secret sauce number two. This is going to be insane. I mean, this is not a 36-page research paper for nothing. So, there are still tons and tons of videos out there on the internet, unlabeled. And then the researchers say, well, if it is unlabeled, just teach the AI to do the labeling. Yes, really. I couldn’t believe it when I saw it. It looks at the videos and extracts all the useful information out of them, for instance, did the camera move? Where did it move? But not just that! Also, which joint does what, what action is happening on the screen, and so on. With that, every frame is labeled with annotations like actions, joints, goals, and more. Bam. Get it now? This is video game data from reality. My mind is blown…again. And thus, reality can now be used as proper training data too, as if it were a video game. That is insanity.

So, in short, this is super smart and can learn from everything. Teleoperation data, simulation data, internet data, doesn’t matter! It can eat all of it and make sense of it too. Loving it.

Now, surprisingly, even this is not good enough to make humanoid robotics work. This problem is super challenging, borderline impossible. But don’t despair, because here comes secret sauce number three. You’ll see that this is going to be super yummy.

What is that? Well, for robots to understand the world around them, they are building on a previous paper called Eagle-2, which is a vision-language model. That is incredible; you see here the second law of papers at work. Everything is connected. Scientists are working all around
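To make the “teach the AI to do the labeling” idea above a bit more concrete, here is a minimal sketch of one common way such pseudo-labeling can be set up: train a small model on data that already comes with action labels (for example, simulation), then let it guess the actions for unlabeled video. This is only an illustration, not NVIDIA’s actual pipeline; the model, the feature dimensions, and the random tensors standing in for data are all made up, and the paper’s own labeling scheme is more sophisticated.

```python
# Toy sketch of the "teach the AI to do the labeling" idea.
# NOT NVIDIA's pipeline: every name, shape, and dataset here is invented
# purely for illustration. The general pattern: train a small model on data
# that already has action labels (e.g. simulation), then use it to produce
# pseudo-labels for unlabeled video frames.

import torch
import torch.nn as nn

FRAME_DIM = 64    # stand-in for an encoded video frame (a real system would use a vision encoder)
ACTION_DIM = 7    # e.g. 7 joint targets for a robot arm

# An "inverse dynamics"-style labeler: given two consecutive frames,
# guess which action was taken in between.
labeler = nn.Sequential(
    nn.Linear(2 * FRAME_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, ACTION_DIM),
)
opt = torch.optim.Adam(labeler.parameters(), lr=1e-3)

# Phase 1: train on labeled data (random tensors standing in for
# simulation frames that come with ground-truth actions).
for step in range(200):
    frames_t = torch.randn(32, FRAME_DIM)        # frame at time t
    frames_t1 = torch.randn(32, FRAME_DIM)       # frame at time t+1
    true_action = torch.randn(32, ACTION_DIM)    # known action from the simulator

    pred_action = labeler(torch.cat([frames_t, frames_t1], dim=-1))
    loss = nn.functional.mse_loss(pred_action, true_action)

    opt.zero_grad()
    loss.backward()
    opt.step()

# Phase 2: run the trained labeler over unlabeled internet-style video
# to generate pseudo action labels, turning it into usable training data.
with torch.no_grad():
    web_frames_t = torch.randn(32, FRAME_DIM)
    web_frames_t1 = torch.randn(32, FRAME_DIM)
    pseudo_actions = labeler(torch.cat([web_frames_t, web_frames_t1], dim=-1))

print("pseudo-labels for unlabeled video:", pseudo_actions.shape)
```

Once pseudo-labels like these exist, web video can sit in the same training set as the labeled simulation and teleoperation data, which is exactly the “it can eat all of it” property described above.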

### Segment 2 (05:00 - 09:00)

the globe to build on each other’s work to create a better world for us. So why is this super cool? Well, it helps a robot to think on two different levels. System 2, slow thinking with reasoning to understand the world and make a plan. And we also need the faster System 1 thinking that generates motor actions in real time. Real time is the key here. You see, just one of the two is not enough — System 2 alone can make a plan, but that’s just a plan. It is also too slow; it cannot act in real time. System 1, on the other hand, can move in real time, but does not know what the effect of its movements will be.

And, would you look at that…the fast System 1 neural network is a diffusion model. That’s really weird. You see, diffusion models are often used to create images from a bunch of noise, so what does this have to do with motor actions? Well, get this, it really starts out from noise and denoises it until we get not a smooth image, but a smooth motor action. It thinks about motion the way others think about images. That is absolutely incredible stuff.

So combine the two, and you get…what exactly? Well, hold on to your papers, Fellow Scholars, because when comparing it to a previous method, we go from a 46% success rate to 76%. That’s absolutely amazing — a result that, just a few years ago, would have taken us up to a decade to achieve. And now, here it is!

And it is so much better than anything else we’ve had before. A complete game changer. And this is why I think GR00T-N1 is going to kick off a robotics revolution. Useful robots that do helpful things for us are now finally within reach. Bravo. And for some reason, I barely see anyone talking about this. Crazy! But this is why Two Minute Papers exists.

And now, you also know part of the reason why these robots can learn to be super cute as well. Subscribe and hit the bell icon if you appreciate this.

Now, yes, we are Fellow Scholars here, so we also look at the limitations of these works, and here is one of them. This is not a turnkey solution yet that you can just deploy and have it fold your laundry at home. Perhaps two more papers down the line. But then, it might save your marriage by doing all the folding!

Also, this is still about short tasks, which are mostly about mingling with things on a table. But the good news is that the model is free and fully open for all of us, and people can fine-tune it to be better on their own tasks. In fact, this is not just a promise, you Fellow Scholars are already using it for some smaller projects. Just listen to the joy. Amazing. And as you see, it also works for different embodiments, so you can train it on your particular robot. What an incredible paper, what a time to be alive!

And I waited a bit with this video to see how you Fellow Scholars are using it; it is not the best for views, but you get a better video this way. So now, let the experiments begin!
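For the curious, here is a tiny, self-contained sketch of the “denoise your way to a motor action” idea behind the diffusion-based System 1 described above. It trains a miniature DDPM-style denoiser on made-up smooth trajectories and then samples a new trajectory from pure noise. Everything here (shapes, data, network size) is invented for illustration; the real System 1 in GR00T-N1 is a far larger model conditioned on the vision-language System 2 and the robot’s state.

```python
# Minimal DDPM-style toy: denoise random noise into a smooth "action trajectory".
# Purely illustrative; not GR00T-N1's actual action head.

import torch
import torch.nn as nn

T_ACT = 16    # length of an action trajectory (e.g. 16 future joint targets)
STEPS = 50    # number of diffusion steps

# Standard linear noise schedule.
betas = torch.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

# A tiny denoiser: given a noisy trajectory and the step index, predict the noise.
denoiser = nn.Sequential(
    nn.Linear(T_ACT + 1, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, T_ACT),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

def sample_clean_trajectories(n):
    # Made-up "expert" data: smooth sinusoidal action curves.
    t = torch.linspace(0, 1, T_ACT)
    phase = torch.rand(n, 1) * 6.28
    return torch.sin(6.28 * t + phase)

# Training: learn to predict the noise that was added to clean trajectories.
for it in range(2000):
    x0 = sample_clean_trajectories(64)
    t = torch.randint(0, STEPS, (64,))
    eps = torch.randn_like(x0)
    ab = alpha_bars[t].unsqueeze(-1)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps        # forward (noising) process
    t_feat = (t.float() / STEPS).unsqueeze(-1)         # crude timestep embedding
    eps_pred = denoiser(torch.cat([xt, t_feat], dim=-1))
    loss = nn.functional.mse_loss(eps_pred, eps)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Sampling: start from pure noise and denoise step by step into an action trajectory.
x = torch.randn(1, T_ACT)
with torch.no_grad():
    for t in reversed(range(STEPS)):
        t_feat = torch.full((1, 1), t / STEPS)
        eps_pred = denoiser(torch.cat([x, t_feat], dim=-1))
        coef = betas[t] / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps_pred) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise

print("denoised action trajectory:", x.squeeze())
```

The key point is the sampling loop at the end: just like an image diffusion model, it starts from random noise and refines it step by step, except the thing being refined is a short sequence of future actions rather than pixels.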
