I Watched an AI Drive a Real Car Through San Francisco Using Arrow Keys

MattVidPro · 25.02.2026 · 8,479 views · 410 likes

Video Description
A small lab just shipped something that might be the most important AI agent demo I've seen this year. Standard Intelligence built FDM-1, a computer use model trained on 11 million hours of video data that can actually generalize to any software interface. It used Blender to model a gear. It drove a real car through San Francisco using arrow keys. With less than one hour of fine-tuning data. This is what true computer use looks like. Also in today's video: Mercury 2's 1,000 tokens/sec diffusion AI, handcrafted 343-parameter models, Nano Banana 2 impressions, and Nvidia's Sonic robot model.

▼ Links From Today's Video:
SI's Computer Use FDM-1: https://x.com/si_pbc/status/2025978959947784290
Inception Labs Mercury 2: https://www.inceptionlabs.ai/blog/introducing-mercury-2
Codex Hand Crafts Model Weights: https://x.com/N8Programs/status/2026318679525163189
Nano Banana 2: https://x.com/chetaslua/status/2026355434152685582
Nvidia Robotics Model: https://x.com/DrJimFan/status/2026350142652383587

► MattVidPro Discord: https://discord.gg/mattvidpro
► Follow Me on Twitter: https://twitter.com/MattVidPro
► Buy me a Coffee! https://buymeacoffee.com/mattvidpro

▼ Extra Links of Interest:
General AI Playlist: https://www.youtube.com/playlist?list=PLrfI66qWYbW3acrBQ4qltDBsjxaoGSl3I
AI I use to edit videos: https://www.descript.com/?lmref=nA4fDg
Instagram: instagram.com/mattvidpro
TikTok: tiktok.com/@mattvidpro
Gaming & Extras Channel: https://www.youtube.com/@MattVidProGaming

Let's work together!
- For brand & sponsorship inquiries: https://tally.so/r/3xdz4E
- For all other business inquiries: mattvidpro@smoothmedia.co

Thanks for watching Matt Video Productions! I make all sorts of videos here on YouTube! Technology, tutorials, and reviews! Enjoy your stay here, and subscribe! All suggestions, thoughts, and comments are greatly appreciated, because I actually read them.

0:00 — What's In Today's Video (Overview)
0:45 — Standard Intelligence FDM-1: Computer Use AI
4:15 — Why This Is True Generalization
5:10 — Mercury 2: Diffusion AI With Reasoning
5:45 — How Diffusion LLMs Actually Work
7:15 — Testing It Live
9:40 — Handcrafted 343-Parameter AI Models
10:20 — Nano Banana 2 Early Access Impressions
11:00 — Nvidia Sonic: AI Native to a Robot Body
13:00 — Music, VR, Text Prompts — It Does All of It
13:45 — Open Source + Isaac Lab Training Explained
14:20 — Outro

Table of Contents (12 segments)

What's In Today's Video (Overview)

I just can't get this out of my head. Standard Intelligence has built a new kind of computer use model, and it's shattering barriers. This AI model is actually able to use a full CAD program like a human. It was trained on video data, not just screenshots. But they're not the only ones out there with new AI breakthroughs. Inception Labs just released Mercury 2, a diffusion large language model. To put it simply, this type of language model renders the entire block of text at once; it diffuses the text into existence, where your typical autoregressive language model goes token by token. Users on X are also asking AI language models to handcraft weights for much smaller purpose-built AI models, and it's actually working, which is really cool. Google's Nano Banana 2 has been spotted in early testing. And Nvidia built a custom transformer model that controls a robot body; it's a little uncanny to see as a human, but when you dig into this one, it's so cool. Now that you've got an idea of today's scope, let's take it back to the beginning, to Standard Intelligence's computer use model.

Standard Intelligence FDM-1: Computer Use AI

Just as it sounds, computer use models are designed to operate desktop-style UIs. A lot of major labs have their own versions of this type of model. OpenAI initially kicked this off with their agent mode, Google's got one too as an experimental Labs feature, and purpose-built agentic companies like Manus have been building their own in-house agent models. Manus was promising enough for Meta to actually acquire them for a massive amount of money ($2 billion, I think). This new model, FDM-1, can construct a gear in Blender from scratch, find software bugs, and even drive a real car through San Francisco using arrow keys. I'm sorry, what did you guys do? Whoa. Can we take a step back? Oh yeah, and there it is in the corner. If you haven't seen it, as crazy as it is, there is a very good point to be made about it.

With this model come two main advances. First, they've trained it on 11 million hours of computer action, and they've designed it from the ground up to understand long-context video. Long context is an important distinguisher: they want to give this model its best possible shot at carrying out tasks for an extended period of time, which is currently something we struggle with in AI agents. You might remember GLM-5 worked for 24 hours recently to build a whole Game Boy emulator; that is really pushing the maximum limit of how long an agent can work, and that was with quite close human supervision. This thing's got a 1 million token context window, and their video encoder can fit two hours of 30fps high-res video directly inside of it. That is impressive stuff. I could easily just record my desktop and show it exactly the kinds of tasks I need it to get done, and by that video reference alone (essentially me recording a tutorial, kind of like this one), the idea is that it would pick up what I'm putting down and then replicate it. This ultra-efficient video encoder makes a world of difference, and I have a feeling the same type of thing could become very useful for AI video generation models in the future.

How does it take all this data and figure out what to do? Well, they train an inverse dynamics model to predict frame-by-frame computer actions. That's all it really is: predicting the next token. That token just happens to be a computer action. And the results, like I touted in the opening: it really can navigate interfaces well enough to use a CAD app or Blender. It's incredible to watch it make this gear in Blender. A lot of people might say, "Hey, there are Claude extensions and stuff you can get for Blender, traditional language models inside of it." And yeah, those work great. I'm not saying those solutions shouldn't exist, but they are purpose-built for one application. This is generalization: it can just use Blender because it knows how to read and use icons like a person would, based on the data. The model has simply built parameters that can handle all this stuff. And that's why the car demo is so significant. Like they said, true computer use is fully general.
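To make that "actions as tokens" idea concrete, here's a minimal sketch of an inverse dynamics model in PyTorch. Everything here is hypothetical (FDM-1's real architecture, action vocabulary, and encoder aren't public); it just shows the shape of the training objective: given two consecutive screen frames, predict the action that happened between them, with plain cross-entropy.

```python
# Hypothetical sketch of the inverse-dynamics objective described above.
# FDM-1's real architecture and action vocabulary are not public; this is
# a toy stand-in showing the training signal, not their implementation.
import torch
import torch.nn as nn

NUM_ACTIONS = 512  # assumed action vocabulary: key presses, clicks, scrolls...

class InverseDynamicsModel(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Tiny conv encoder standing in for a real long-context video encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Classify the action from the pair of frame embeddings.
        self.head = nn.Linear(2 * embed_dim, NUM_ACTIONS)

    def forward(self, frame_t, frame_t1):
        z = torch.cat([self.encoder(frame_t), self.encoder(frame_t1)], dim=-1)
        return self.head(z)  # logits over the action vocabulary

model = InverseDynamicsModel()
frames = torch.randn(8, 3, 128, 128)           # screen frames at time t
next_frames = torch.randn(8, 3, 128, 128)      # screen frames at time t+1
actions = torch.randint(0, NUM_ACTIONS, (8,))  # ground-truth actions between them
loss = nn.functional.cross_entropy(model(frames, next_frames), actions)
loss.backward()  # standard next-"token" prediction, where the token is an action
```

Scaled up to 11 million hours of video, that same objective is plausibly what lets one model treat a Blender session and a driving UI as the same kind of problem.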

Why This Is True Generalization

This thing uses arrow keys on a computer to steer a real car in San Francisco, and that's with less than one hour of fine-tuning data. So they did fine-tune it to drive, but it achieves high accuracy; to the untrained eye it looks like Tesla FSD, full self-driving. I don't know how they rigged it up to the Toyota, but it feels very illegal. Of course, people in the replies are asking the right questions, like "Can it play Doom?" I'm also wondering if it can play video games. Could we fine-tune it on Super Mario 64 or something like that? It would be very interesting to watch. The thing is, it's not available yet. These guys aren't a massive lab, but they say they're working hard to eventually make this available through an API. Personally, I cannot wait. It seems like it can be very fast and reactive. Right now, this feels to me like the future of AI agents.

Mercury 2: Diffusion AI With Reasoning

All right, let's talk about Mercury 2. They're calling this the fastest reasoning LLM. That's probably true if you're talking about things people generally have access to, though it's kind of cheating: diffusion still can't beat autoregressive models on the top benchmarks and the most difficult real-world tasks. That's definitely my experience personally; I've tried it. Is it promising, impressive, and fun to use if you're an AI nerd like me? Oh yeah. This thing runs at over a thousand tokens a second right out of the gate because it's diffusion, and that is super cool. In that sense, it definitely could be a brighter future for language models that need to produce a lot of tokens as fast as possible to get work done.

How Diffusion LLMs Actually Work

This animation right here shows you the difference between a diffusion LM and the more traditional autoregressive models. With Mercury, some tokens near the end settle before the middle or even the beginning, because the whole block is fading in at once, whereas autoregressive generation is strictly linear. What's very cool about this diffusion model, and what people haven't done before, is integrating real-time reasoning into it. A modern LM needs test-time compute and long thinking chains, which is costly and time-consuming. Mercury 2 is just operating at a different level: a thousand tokens per second on Nvidia Blackwell GPUs, at a very cheap price of 25 cents per million input tokens and 75 cents per million output tokens. They list the quality as competitive, and I would say it is pretty competitive, especially at that price point. The reasoning is tunable, and it has a 128k context window, native tool use, and schema-aligned JSON output. The benchmarks comparing this to traditional autoregressive models really intrigue me. You'll see this thing fall behind Gemini 2.5 Flash in GPQA Diamond, but Gemini 3 Flash is getting beaten by Mercury 2 on AIME (although that one is pretty close to saturation, honestly). In most of these it is losing a little bit to the Gemini series, but it's definitely beating out Claude 4.5 Haiku and GPT-5 Nano.
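For intuition on why diffusion decoding is so fast, here's a minimal sketch of the masked-diffusion generation loop that diffusion LMs are commonly described as using. This is the generic recipe, not Inception Labs' actual code, and the confidence-based unmasking schedule is an assumption on my part.

```python
# Generic masked-diffusion text generation sketch (NOT Mercury 2's real
# implementation). All positions start as [MASK]; each step, the model
# predicts every position in parallel and the most confident predictions
# get locked in, so the whole block "fades in" over a few passes.
import torch

def diffusion_decode(model, seq_len=32, steps=8, mask_id=0):
    tokens = torch.full((1, seq_len), mask_id)          # fully masked start
    frozen = torch.zeros(1, seq_len, dtype=torch.bool)  # nothing decided yet
    per_step = seq_len // steps                         # tokens fixed per pass
    for _ in range(steps):
        logits = model(tokens)                   # (1, seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)  # confidence + best token
        conf = conf.masked_fill(frozen, -1.0)    # skip already-settled spots
        top = conf.topk(per_step, dim=-1).indices
        frozen.scatter_(1, top, True)            # lock the most confident
        tokens = torch.where(frozen, pred, tokens)  # reveal locked tokens
    return tokens

# Toy stand-in "model" so the sketch runs end to end (random logits over a
# 100-token vocabulary):
dummy = lambda toks: torch.randn(toks.shape[0], toks.shape[1], 100)
print(diffusion_decode(dummy))
```

The speedup falls straight out of the loop structure: 32 tokens appear after 8 parallel forward passes instead of 32 sequential ones.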

Testing It Live

And yeah, this is a cheap model. You can actually try it completely free, link down below; you don't even need to make an account. Let's have it write a poem about its love for diffusion models. You can see it all streams in. That's a diffusion visual effect, not the actual diffusion; you can turn it off up here at the top. "In the quiet of a latent space, a whisper spreads a gentle trace. From noise, a shape begins to bloom. Each step a pulse, a quiet loom. Layers whisper, slow then fast, a dance of pixels, shadows cast. What once was random, now refined, a masterpiece, the model mind." Wow. Wonderful. So you can see it's actually quite a traditional-feeling AI model to interact with, unless you turn the reasoning down to instant; then it feels quite a bit more flat. Let's ask for some text art of a lemon tree. I put the reasoning on high this time. You can see it's doing some thinking, and it does actually delay the response a little. I assume it's running multiple diffusion passes of thinking before it comes to a final answer, essentially adding to the context window, right? Here's what Gemini 3.1 did. Oh, it thought for 52 seconds and then nothing happened. Okay. Oh, there it is. Well, that's more like a Christmas tree. That is definitely worse than Gemini's. This is pretty interesting. Here is GPT-5.2 Instant. All right, I think Mercury's got that beat. What is up with these models? Claude Sonnet 4.6 turned it into a spaceship. Mercury 2 might have actually won on simplicity. I'd say this is definitely the most serious diffusion LM ever put out by an organization or a lab. I think coding is where this could really take off in the future. As you can see, it just built a reverse Tetris game, just like that. Instant vibe coding. OpenAI is also working on some models that are very, very fast like this, but with diffusion it's just a native benefit. And you can see, yeah, the pieces are actually falling up instead of down. It's a very simple game, but for a model like this, pretty darn awesome. So yeah, I've got really high hopes for this. I would really be intrigued to understand how exactly the reasoning works under the hood. And really, what I want to see is other big labs like Google or OpenAI try diffusion language models. Pretty cool stuff. All right, before we log off here, the final tidbits.

Handcrafted 343-Parameter AI Models

I love this. People are building extremely small AI models with handcrafted weights, like 350 parameters total. This particular one does 10-digit addition with a transformer. It started at 491 parameters; Codex handcrafted the weights and got better accuracy with just 343 parameters. Codex could handcraft these weights and get them all correct, through a lot of testing and running experiments back and forth; that is how this was actually able to happen. It's really insane that it worked. Codex itself is just model weights too, just much, much more of them. Pretty crazy AI brainception stuff.
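To get a feel for how little room 343 parameters is, here's some back-of-the-envelope parameter counting for a toy one-layer, one-head transformer. The layout is hypothetical (the post doesn't spell out the exact architecture here), but it shows why a digits-only vocabulary and a very narrow width land you in the low hundreds of weights.

```python
# Rough parameter arithmetic for a hypothetical single-layer, single-head
# transformer with no biases. The actual architecture in N8Programs' post
# may differ; this just shows the scale.
def transformer_params(vocab: int, d: int, d_ff: int) -> int:
    embed = vocab * d          # token embedding table
    attn = 4 * d * d           # W_q, W_k, W_v, W_o projections
    ffn = d * d_ff + d_ff * d  # two feed-forward projections
    unembed = d * vocab        # output head back to the vocabulary
    return embed + attn + ffn + unembed

# Digits 0-9 plus a couple of special tokens, and a very narrow width:
print(transformer_params(vocab=12, d=4, d_ff=8))   # -> 224
print(transformer_params(vocab=12, d=5, d_ff=10))  # -> 320
# Even at these widths you're in the low hundreds of weights, the regime
# the 343-parameter model lives in: every single weight has to earn its keep.
```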

Nano Banana 2 Early Access Impressions

And Nano Banana 2 has hit early testing. I do have access to it, although it isn't publicly released. Personally, in my own taste testing, it feels closer to Nano Banana Pro, which is my main AI image model that I use for everything, but I still don't think it's as good. Although you can tell it is more efficient under the hood: it generates faster, and it's definitely better than the old Nano Banana Flash. This is looking like an upgrade that's going to roll out to everyone pretty soon. Finally, Nvidia built the Sonic AI model. It supports all kinds of different things, but essentially it's a transformer that is natively trained on the robot itself and what it can actually do.

Nvidia Sonic: AI Native to a Robot Body

It controls all the different actions: the arms, the legs. Essentially, the model acts as a medium between us and the robot. By doing this, the model can generalize to VR whole-body teleoperation and to human video; you can literally point a webcam, live-stream motions, and it understands. You can give text prompts: walk sideways, dance like a monkey, kick your left foot, and so on. Even musical audio works: the robot can dance to a beat and adapt to tempo and rhythm. Kind of scary. It also works with VLA foundation models: they plugged in GR00T N1.5 and achieved 95% success on mobile tasks. Super cool. In this video right here, you're watching it in whole-body teleoperation mode. You can see this guy showing off picking up a cabinet and pulling it out a little bit, the robot struggling a little with its low body weight, and he's throwing a whole jar of good peanut butter away. He shouldn't be doing that. But there he goes, closing it up.

But a generalized robot-controlling model, what actually is that? How did they pull it off? Dr. Jim Fan from Nvidia reveals that this model is only half the size of GPT-1: it's 14 million parameters, and it's a transformer. The model is named Sonic. It's not that big, and yet it's this capable right out of the box. It takes a remarkable amount of subconscious processing for us humans to squat, turn, crawl, and sprint. Absolutely true; our brains are little miracles, 30-watt hyper-specialized neural networks fine-tuned by evolution for running the human system. The cherry on top of all this is that they released everything open source and accessible. Nvidia's key takeaway from training this model was that motion tracking is the one true scalable task for whole-body control. Instead of hand-engineering rewards for every new skill they come across, they use dense frame-by-frame supervision: human mocap data.

Music, VR, Text Prompts — It Does All of It

The data itself encodes the reward function: configure your limbs in any human-like position while maintaining balance. That reward function is very much encoded in the way we move through our lives, so if you transplant it into data, it sure comes through. They were able to accelerate training by taking the data they already had and scaling it to an unprecedented degree. With NVIDIA Isaac Lab, you can simulate the robots running their tasks at realistic fidelity but 10,000 times faster than real time, which gives the robots essentially many years of virtual experience in only hours of wall-clock time. Three days of training later, the neural net transfers zero-shot to a real G1 robot with no fine-tuning.
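Here's a minimal sketch of what "the data encodes the reward" can look like for motion tracking. Nvidia's exact reward terms for Sonic aren't given in the video, so treat this as the generic version: every mocap frame becomes a dense per-step reward for matching that pose.

```python
# Generic motion-tracking reward sketch (the actual reward terms used for
# Sonic are not spelled out in the video). Instead of hand-designing a
# reward per skill, each simulation step is scored by how closely the
# robot's pose matches the reference mocap frame.
import numpy as np

def tracking_reward(robot_joints: np.ndarray,
                    ref_joints: np.ndarray,
                    sigma: float = 0.5) -> float:
    """Dense per-frame reward: 1.0 when the robot exactly matches the
    reference mocap pose, decaying smoothly as joint error grows."""
    err = np.sum((robot_joints - ref_joints) ** 2)
    return float(np.exp(-err / sigma**2))

# One mocap clip = one reward signal, frame by frame (sizes are made up):
ref_clip = np.random.randn(300, 29)                    # 300 frames, 29 joints
robot_pose = ref_clip[0] + 0.05 * np.random.randn(29)  # slightly off target
print(tracking_reward(robot_pose, ref_clip[0]))        # close to 1.0
```

The 10,000x figure also explains the "years in hours" framing: at 10,000 times real time, a single day of wall-clock simulation is roughly 27 years of virtual experience.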

Open Source + Isaac Lab Training Explained

They had a 100% success rate across 50 diverse real-world motion sequences. Man, that's crazy. Yeah, Nvidia is working toward solving robotics here because they're like, "We want you to buy the GPUs." They know that if AI works with robotics, that's a huge driving factor for them, so it makes total sense why they would invest in developing technology like this, and it seems to be paying off. Thanks so much for stopping in with me today. AI models are being pushed in some really unique ways, and there's a lot more going on. If you guys want to stay the most up to date, I recommend you follow me on X or join my Discord server.

Outro

Enjoy the rest of your day, and I'll see you guys in the next video.
