AI roundup: Thrilling Agent releases, Nano Banana 2 & Robotics!
Duration: 18:43


MattVidPro · 27.02.2026 · 6,058 views · 240 likes


Video description
AI NEWS ROUNDUP! Thanks to our sponsor - Take this as your sign to download Gauth: https://gauthmath.onelink.me/SUq5/u95t78hb Midterm Season will be no match!

Google dropped Nano Banana 2 this week and I immediately tested it on the one thing I always said needed Pro — thumbnails with my actual face. That alone could have been the whole video, but AI had a packed week: NVIDIA built a robot that learned to fold laundry from 20,000 hours of humans doing it, a two-billion parameter open source video model dropped (and it actually works), Perplexity launched a computer-use agent routing 19 models, and a small lab published agent research that most companies would lock away.

▼ Link(s) From Today's Video:
EgoScale: https://research.nvidia.com/labs/gear/egoscale/
Other NVIDIA robotics: https://x.com/DrJimFan/status/2026350142652383587
NB2 Launch: https://x.com/googledeepmind/status/2027051577899380991?s=61
SVG Model: https://x.com/QuiverAI/status/2026792057893708072
Jim Apple: https://x.com/apples_jimmy/status/2026801444922528107
Linum v2: https://www.linum.ai/field-notes/launch-linum-v2
https://github.com/Linum-AI/linum-v2
Perplexity Computer: https://x.com/perplexity_ai/status/2026695550771540489?s=46&t=AZs45ckJ7UUM_kJZcxnR_w
Tzafon Agent Breakthrough: https://x.com/tzafon_company/status/2027101372597072351
Hermes Agent: https://x.com/NousResearch/status/2026758999488528639

MattVidPro Discord: https://discord.gg/mattvidpro
Follow Me on Twitter: https://twitter.com/MattVidPro
Buy me a Coffee! https://buymeacoffee.com/mattvidpro

▼ Extra Links of Interest:
General AI Playlist: https://www.youtube.com/playlist?list=PLrfI66qWYbW3acrBQ4qltDBsjxaoGSl3I
Instagram: instagram.com/mattvidpro
Tiktok: tiktok.com/@mattvidpro
Gaming & Extras Channel: https://www.youtube.com/@MattVidProGaming

Let's work together!
- For brand & sponsorship inquiries: https://tally.so/r/3xdz4E
- For all other business inquiries: mattvidpro@smoothmedia.co

Thanks for watching MattVideoProductions! I make all sorts of videos here on YouTube! Technology, Tutorials, and Reviews! Enjoy your stay here. All suggestions, thoughts and comments are greatly appreciated.

Table of contents (4 segments)

Segment 1 (00:00 - 05:00)

Welcome back everyone to our AI voyage. What you're looking at right now is a pair of robotic hands and arms fully assembling a toy car, dexterously, and it's an AI model controlling these robotic hands. It's learned all of this from watching a ton of egocentric video, essentially first-person-view video of tasks being done on a table. It's watched so much of this footage that it's able to generalize. It's not just assembling the toy car with a Phillips head; it's also picking up test tubes and moving liquid from one test tube to another. It's grabbing the next card off the deck and carefully placing it down. Not to mention something almost none of us actually want to do: folding laundry and rolling it into a little tube. And from one shot of watching another robot fold a shirt, it's able to go ahead and pick up that task. Moving soft objects with tongs. I find it so incredible that this model can simply learn all of this by watching 20,000 hours of humans performing similar tasks. The training on that recorded footage works by predicting wrist position and hand joint actions in the human videos. Then, in the mid-training phase, the human and the robot become fully aligned; this mid-training data acts as a translator for all of those human hours of real work.

If you look at the model architecture, it reveals something very cool. How do you get it to transfer liquid in a test tube or iron a t-shirt? A simple text prompt, or just show it visually, because it has both a text encoder and a visual encoder. Remember, it's all trained on first-person view, so it relies on its own first-person view to carry out the tasks. The bonus benefit is that it can watch other robots do a task and then one-shot copy it. Man, AI models are just so darn cool, right? The best architectural setups for multimodal models give you the coolest inherited capabilities. I love what NVIDIA is doing here. They're really laying down the groundwork and making all of this open and accessible for AI companies that want to specialize in robotics. Man, to never have to do dishes again. Huh, that is a dream. It will very likely be released completely open; the GitHub repo is coming soon since this is brand new.

In my last video, we talked about a similar model from NVIDIA. That one can control and operate the whole body of a robot: walking around and throwing stuff away, opening drawers, etc. What you're watching it do right now is actually teleoperation. The AI model is fully controlling the robot, taking its input from some controllers behind the camera. It can also follow text prompts to do a little dance or something and, you know, walk around and maneuver a space, but more complex operations are out of scope for that particular one. That project is already out, open source, and the model that controls the robot is the size of GPT-1.

But as much as watching those videos makes me feel like I'm straight up living in the future, there are a plethora of other things to talk about, and especially things that you can use today, like Nano Banana 2, just released by Google. But before that, I've got a quick word from today's sponsor. For me, it was not that long ago I was in university, and my brain felt like it had 37 different tabs open whenever it was midterms, and one of those browser tabs is playing panic music. So, here's what you can actually do to lock in.
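(A quick technical aside before the sponsor read: if you want a concrete mental model of the vision-language-action setup described above, here is a minimal, purely illustrative sketch. The class name, layer sizes, and joint count are my own assumptions for illustration, not NVIDIA's actual architecture or code.)

```python
# Hypothetical sketch of a vision-language-action policy like the one described:
# a text encoder and a visual encoder feed a shared backbone that predicts a
# wrist pose plus hand-joint actions from a first-person (egocentric) frame.
# All names, sizes, and structure here are illustrative assumptions.
import torch
import torch.nn as nn

class EgoPolicySketch(nn.Module):
    def __init__(self, vocab_size=32_000, d_model=512, n_hand_joints=22):
        super().__init__()
        self.text_encoder = nn.Embedding(vocab_size, d_model)       # stand-in for a real language encoder
        self.visual_encoder = nn.Sequential(                         # stand-in for a real vision backbone
            nn.Conv2d(3, 64, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_model),
        )
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Action head: a 6-DoF wrist pose plus one angle per hand joint
        self.action_head = nn.Linear(d_model, 6 + n_hand_joints)

    def forward(self, prompt_tokens, egocentric_frame):
        text = self.text_encoder(prompt_tokens)                      # (B, T, d)
        vision = self.visual_encoder(egocentric_frame).unsqueeze(1)  # (B, 1, d)
        fused = self.fuse(torch.cat([text, vision], dim=1))
        return self.action_head(fused[:, -1])                        # next wrist + joint action

# Usage sketch:
# actions = EgoPolicySketch()(torch.randint(0, 32_000, (1, 8)), torch.rand(1, 3, 224, 224))
```

The point is just the shape of the thing: a text prompt plus an egocentric camera frame go in, and a low-level wrist-plus-hand-joint action comes out.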
Take actual, direct lecture material and instantly convert it into clean notes, flashcards, and quizzes, made nice and easy, so you aren't trying to reread the same slide 12 times like it's going to osmosis into your skull. This part of the video is sponsored by Gauth, and their study converter feature is built for midterm season. Drop in a page from your notes and Gauth converts it into AI notes that are inviting and readable, flashcards so you can drill the key definitions and formulas, and a quiz to make sure you can replicate the knowledge. You can basically convert all your notes into a study guide. No need to spend an hour formatting when you should be learning outright. And here's the part that'll save your hide: personally, when I'm learning and I miss something, I don't just want an answer, I want to know why. For this, you can tap the AI live tutor and it walks you through step by step. It's like a real tutor, but one that can explain the same question five different ways and will never get sick of it. No matter the subject, take this as your sign: download Gauth for free today, try the study converter on your latest notes, and I'll leave the link down in the description below. Start your academic comeback season today. Huge thanks to Gauth for sponsoring today's video. Now, back to your regularly scheduled content.

And welcome back. So yeah, DeepMind launched Nano Banana 2 yesterday. The easiest way you're going to be able to access this is through the Gemini app or web interface: simply select the banana icon, create an image, and it should be using the latest Nano Banana 2. This model is pretty incredible because it is very cheap and almost just as good as Nano Banana Pro, and even in some cases a little bit better. Since the latter half of last year, Google has been working some real magic with image gen. And part of the Nano Banana series was the fact that it could actually sit there and reason

Segment 2 (05:00 - 10:00)

about the image as it creates it, or even before it creates it. I can only imagine this model is smaller than Nano Banana Pro, yet they've enhanced the reasoning and made it more efficient, so that it scores almost just as well as Nano Banana Pro. With this also comes an increase in the ability to maintain the likeness of people, faces, and objects. This is something that we saw with Nano Banana Pro, and it was a huge leap. But Nano Banana Pro is expensive. In fact, up until now, I would only use Nano Banana Pro to do thumbnails with my face, because it's the only one that felt like it could really replicate it. But this absolutely gets there with ease. The pricing here is about 6 cents per 1K image; that moves up to about 10 cents per 2K image and 15 cents for a 4K image on average. Of course, with thinking and with larger prompts, that might climb a little bit, but not by very much. This is like five times cheaper than the old Nano Banana Pro. And again, like I said, it's like 80 to 100% as good.

But I'll be frank with you guys, I think sometimes this model does suffer a little bit of small boy syndrome: efficient architecture, small brain, big think. You'll maybe sometimes get an image like this. Look at this divine glove with all of these really cool gold markings and glowing, ethereal blue ancient rune text. Count them: one, two, three, four, five for the hand. That's good. But what's going on with his pointer finger? It's like it became his pinky finger, or something strange like that. So it's stuff like this, where the whole image is almost perfect but there's one weird thing it decided to malform. I don't know what happened; I can only imagine this is a reasoning error. Here's a cool prompt I tried, asking for a junkyard from the future. The real-world accuracy here is pretty astonishing. You can go through and look at these vehicles and actually say, "Hey, I know what type of car that is. That's a Ford F-150 Lightning, or that's a Tesla Model 3, but it's, you know, rusty," which actually wouldn't happen in real life. I think they're all made of aluminum. Maybe this one isn't, but I specifically prompted for the car in the center. The real-world knowledge in this is huge. And like the previous Nano Banana models, and apparently it's a little bit upgraded for this one as well, it can go do research and get real-world knowledge.

Pushing the model to its absolute max: this is me asking for a 20-plus-panel comic page. Invader Zim accidentally teleports the president to the moon and he's got to, like, fill in and take his place. So it's supposed to be a funny comic. It is pretty coherent up until a certain point, and then around this part it all starts to devolve. So it gets pretty far and then it has some issues. I would stick to like 8 to 16 panels if you're going to do comics, but it can absolutely do a perfectly coherent comic, just not this big and this detailed. You can kind of see where it was going, though, especially with the images. If you want to produce a ton of images with this thing in one go, you know, it's going to be fine; it's just when it starts to do that with the text that it slips. It's really cool to see how far they've pushed this. Here you can see I generated a quote-unquote perfect variant with no mistakes. You could actually follow this and read through it like a normal comic, and it would all make perfect sense. It captures Invader Zim's likeness very well, and honestly, even the humor is really good. Putting the orange paint on his face, telling me he looks tangy.
I like how the staff barely even notice. Great stuff. Next up, let's talk about specialized AI models. Using a direct image generation AI isn't the only way to get the job done. Now, check this out from Quiver AI: a first-of-its-kind SVG AI model. It's only doing SVGs; that's what it's designed to do. But we have never seen an AI model do SVGs this well. A lot of AI companies get wrapped up in trying to generalize, but sometimes building specific AI models for certain tasks can lead to really impressive and cool results, and there is absolutely a very real use for a model like this that just generates SVGs. If you don't know what an SVG is, it's a file of code that essentially produces an image. I mean, a JPEG or PNG is that too, but here there are no pixels; it's vector graphics. Check it out: Jimmy Apples uploaded a video of it working in real time. Really crazy to watch it essentially draw all of the lines out. It's very interesting how it happens temporally, how we slowly see it get created, but you know, this makes sense because under the hood it is a text model. The thing is, though, you could watch Nano Banana do something akin to this, but it wouldn't be drawing lines like that, and it certainly wouldn't be diffusion either. SVGs scale infinitely, and they can oftentimes be a better option, especially if you would like to animate them and make them move around. The reason we want specialized models like that SVG one is so that we can completely, 100% saturate a certain type of task, because ultimately those tasks are just always going to be useful. Take OpenAI's Whisper models, for

Segment 3 (10:00 - 15:00)

example, designed specifically to take audio in and then convert it to text, nothing more. Anyways, what we're looking at right now is not that type of model. This is a generalized model. And you might be watching this footage and saying, "Okay, so what?" This is a typical AI video model: we see the bear swimming, there's a horse running around in a field, we're cooking up some stew. So what? Well, while these generations do look incredibly average, and honestly even a little bit crappy, this is actually incredible, because this model, Linum V2, is only 2 billion parameters. Billion is a big number, but two billion in the AI world is, like, little baby pea-sized. Apache 2.0 licensed, fully open source. There are two sets of weights, a 720p one and a 360p one. This is a very small lab that put this out, but I respect it because it's completely open source. This is free; we're going to be able to hack away, fine-tune, and utilize what they've built. They gave it away. And look at this: they're showing the failed generations up front. You don't see a lot of big labs actually do this these days. They used to do it way more often, but now the competition is so fierce, they feel like they can't. Here you can see, you know, cutting into a lemon. It was really not that long ago that this would have been cutting-edge AI video. We would have been like, "Oh my god, it looks like a real lemon." But now we're like, "Oh, that's like crappy AI video." And then you're like, "Whoa, the model's 2 billion parameters?" Question mark. You can see a misunderstanding of how weights might go; this is the type of stuff you'll come across with smaller models. Yeah, you can see this is really, really pushing the limits. Fun stuff that doesn't know what to do with the background. Two billion parameters. To give you guys an idea of a size comparison, Wan 2.5 is an estimated 40 billion parameters, and Veo 3.1 is undisclosed, but probably higher. Right now, the model weights are open and available to download, but running it at full precision still uses over 20 gigs of VRAM. However, the community is absolutely going to get to work on distilling this to run on much, much less. This has great potential to be smooshed down to run on smaller GPUs.

Perplexity AI has also launched Perplexity Computer. Their claim here is to unify every current AI capability into one system: research, design, code, deploy, manage any project, end to end. I love that idea, but it seems pretty difficult to pull off, not going to lie. You know, something that's really intriguing about this is that Perplexity came from the position of starting on the research side of things, whereas with, say, Google's Antigravity IDE or Claude Code, there's a focus on making programs over completing research. Okay, this is pretty cool, though. It's going to go download the latest podcast, but we're actually going to ask it to find a specific section, extract the clip from it, then make it vertical for TikTok and add captions. That's much more advanced. Making a nice spreadsheet, doing research, that's one thing. But this? That's on another level. So it would download the video first, trim the clip, and then give you the result. They kind of just show it to you all at once; you don't see it actually working in real time. Okay, if that's one shot, I'm impressed. "I am thinking about buying Walter White's house and turning it into a rental. Build me a financial model." Oh man. Okay, that's pretty funny.
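(A quick aside on the Linum V2 sizing from a moment ago: the 20+ GB figure is for full-precision inference, and a weights-only back-of-envelope, my own rough arithmetic rather than the lab's numbers, shows why lower-precision or distilled variants could fit on much smaller GPUs.)

```python
# Weights-only memory for a 2-billion-parameter model at different precisions.
# Rough, illustrative arithmetic; the 20+ GB full-precision figure quoted above
# presumably also covers activations and other overhead, not just the weights.
PARAMS = 2e9  # 2 billion parameters

for precision, bytes_per_param in (("fp32", 4), ("fp16/bf16", 2), ("int8", 1)):
    gigabytes = PARAMS * bytes_per_param / 1e9
    print(f"{precision:>9}: ~{gigabytes:.0f} GB just for the weights")

# fp32 ~8 GB, fp16/bf16 ~4 GB, int8 ~2 GB -- hence the hope that quantized or
# distilled variants will fit comfortably on consumer GPUs.
```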
So yes, of course, it kicks off some research agents. Perplexity, with that focus on research, it's a real benefit. Have it monitor competitive bids. Perplexity Computer is massively multimodal; agents have to be, so they can navigate and use a computer interface. Just like in my last video with the FDM1 agent, that thing generalized to driving a car with arrow keys. It orchestrates models to run agents in parallel, which is something that I personally do in my own workflows, leveraging Opus to match each task to the model best suited for it. Interesting. It sounds kind of expensive. In total, Computer can route work across 19 different models. You know, they're trying something that's a little bit more out there. Google and OpenAI, they're not using anyone else's models, so their models are going to be more limited. And that's why with those computer-use agents it's like, why bother using them? They're stuck in ChatGPT or Gemini. And you know, something I'm thinking about too is how I'm going to connect this to my local system, let's say, where I've got local files that it needs. I think they still fully want you to use this thing through the Perplexity website, although I'm wondering about their Comet browser and whether that's going to have any integration with it. From a single task to hundreds of active projects. I think it's a little bit out of the scope of today's video to mess around and test out the Perplexity computer-use agents, but it's definitely something I would be open to.

And they are not the only ones making strides in agentic areas. Like I said, in my last video I showed something off that was totally insane, and I've got another one to talk about today. Tzafon, an AI company I've never heard of before, trained a model on colored squares. The general idea is: here is a colored square somewhere on a screen, and you have to go click it. Also, those aren't squares, those are circles. The tweet says squares, brother. These are circles. So, what's the actual point of all of this? Well, their research is to explore why current

Segment 4 (15:00 - 18:00)

AI agents eventually lose the plot and can't click on the right areas of the screen anymore. Well, at an architectural level, the model's sense of where things are on a screen decays exponentially through its layers. By the time it needs to output coordinates, if too much has happened in between, the positional signal has completely faded. So, their hypothesis to bolster this: they build synthetic, simplified environments. A white square, a colored shape, a plain background. The training is 100 environments that require 3 to 15 click interactions to succeed. There is nothing resembling a real app or UI design. And apparently they're claiming the model they produced with this method generalized to real benchmarks better than models trained on actual UI screenshots, which is totally crazy to think about, because you'd think training on the real task at hand would give you the best result. But no, they are sort of hacking the system here, and it's a very intelligent way to do research into AI models. You really would not expect this. But I think what's going on is that these larger models are trained on a bunch of non-synthetic, pre-recorded work-task data, with human errors, slight misclicks, whatever it might be, inside of that data, and there are additional weights that are actually holding the model back. At least that's what I'm picking up from a macro view.

They took it to the next level, though, because they moved to multi-turn, training the model to interact with an app across multiple steps, learning from entire trajectories as a whole. And they noted an emergent effect: the model stopped repeating failed actions. It started reasoning about what went wrong with its actions and trying new and different approaches. How awesome is that? That behavior is not designed in; the reinforcement learning made it the optimal strategy. And this small lab claims it all comes down to a simple math problem the whole field is ignoring. Improving per-step accuracy doesn't solve all of your problems. A 32-step task is still going to fail 15% of the time, and at over 100 steps, you're failing more times than you actually succeed, meaning the actual reliability is not 95%. What you need isn't just improved accuracy; you need models that can recover from failure. And that's what they discovered with reinforcement learning. The numbers here certainly don't lie: they went from 23% to 37%. That's a pretty big boost. And they're kind of giving all this info away for free. When you think about it, this is some secret sauce that a company would normally lock away.

And if you can believe it, we're not done talking about agents. Nous Research has launched Hermes Agent, the first open-source agent that grows with you. This thing can spawn off multiple sub-agents and actually carry over sessions and conversation from one app to another. Like, if you were in Telegram and then switched to WhatsApp, it would be able to go find the previous session inside of the new one. And I really hope that retrieval is automatic. Sub-agents, tool calling, file system and terminal control, agent-managed skills, browser use, scheduled tasks: this is all the type of stuff we want to see in an agent. Reminds you a little bit of OpenClaw, right? Well, just like OpenClaw, Hermes Agent is open source, so developers can extend and modify it. They claim it sits between a Claude Code-style CLI and an OpenClaw-style messaging-platform agent, but I'm wondering if it would get stuck on certain tasks that maybe the Perplexity Computer agent would be able to complete.
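(To put rough numbers behind that compounding-reliability point from the Tzafon research: if each step succeeds independently with probability p, an n-step task succeeds with probability p^n, which collapses quickly as n grows. The per-step accuracies below are my own illustrative picks, not Tzafon's published figures.)

```python
# Compounding per-step accuracy: P(n-step task succeeds) = p ** n,
# assuming each step succeeds independently with probability p.
# Illustrative values only, not Tzafon's published numbers.
for p in (0.95, 0.99, 0.995):
    for n in (32, 100):
        print(f"per-step {p:.1%}, {n:>3} steps -> task success {p ** n:.1%}")

# Prints roughly:
#   per-step 95.0%,  32 steps -> task success 19.4%
#   per-step 95.0%, 100 steps -> task success  0.6%
#   per-step 99.0%,  32 steps -> task success 72.5%
#   per-step 99.0%, 100 steps -> task success 36.6%
#   per-step 99.5%,  32 steps -> task success 85.2%  (fails ~15% of the time)
#   per-step 99.5%, 100 steps -> task success 60.6%
```

The quoted "a 32-step task still fails 15% of the time" lines up with something like 99.5% per-step accuracy, and even there long tasks degrade fast, which is exactly the argument for training recovery behavior rather than only chasing per-step accuracy.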
Before I log off here, I want to say I can't cover everything in these videos. There are some things that happened this week with Anthropic, for example, that I just didn't get to. I'm sure you've heard other AI YouTubers talk about them, but if you want to hear active opinions from me, or if you just want to see what I'm seeing in the AI space, I recommend you join my Discord server as well as follow my X account; I'm always reposting stuff. At any rate, thanks so much for watching. I'll see you guys in the next video. Have a great weekend, and goodbye.
