NVIDIA's NEW All-in-One: Nemotron 3 Nano Omni for Multimodal Agents
13:58

NVIDIA's NEW All-in-One: Nemotron 3 Nano Omni for Multimodal Agents

Sam Witteveen 29.04.2026 5 981 просмотров 195 лайков

Machine-readable: Markdown · JSON API · Site index

Поделиться Telegram VK Бот
Транскрипт Скачать .md
Анализ с AI
Описание видео
In this video, we look at the latest Nemotron model from Nvidia, Nemotron 3 Nano Omni, which is a multi-modal model which is built to be small, fast, and fully multi-modal for agents supporting text, images, videos and audio. Blog: https://developer.nvidia.com/blog/nvidia-nemotron-3-nano-omni-powers-multimodal-agent-reasoning-in-a-single-efficient-open-model/ HF Blog: https://huggingface.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence HF Model: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 Twitter: https://x.com/Sam_Witteveen 🕵️ Interested in building LLM Agents? Fill out the form below Building LLM Agents Form: https://drp.li/dIMes 👨‍💻Github: https://github.com/samwit/llm-tutorials ⏱️Time Stamps: 00:00 Intro 00:12 NVIDIA models released in the past 00:59 Nemotron 3 Nano Omni 02:26 PinchBench 03:31 Nemotron 3 Nano Omni Paper 04:16 Nemotron 3 Nano Paper 05:28 Nemotron 3 Nano Omni Hugging Face 05:50 OpenRouter and NVIDIA Cloud 06:25 Demo

Оглавление (9 сегментов)

Intro

Okay, so today Nvidia dropped a model which really is a gamecher and the model that I'm talking about is the Neotron 3 Nano Omni model. And the key thing with

NVIDIA models released in the past

this model is this is literally taking some of the best models that they've released over the last 6 months and combining them into one model. So, we've got a base model which was the original Neotron 3 Nano that was pre-trained on 25 trillion tokens before it even got to postraining. We've also got in there their latest vision encoder and the vision adapter for not only handling images, but also for being able to handle video at the same time. And then on top of that, we've got the Parakeet audio encoder that they used for a lot of their really good ASR and voicetoext streaming models. And what they've actually done here is they've taken all of those and put them together in one model. And that's what this Neotron 3

Nemotron 3 Nano Omni

Nano Omni, and it is a mouthful there, but it's got quite a lot actually in there. And this is built for long context multimodal intelligence for documents, audio, and visual agents. So let's roll back a little bit and see how we actually got here. So to understand why this matters, you sort of need to know where it actually came from. Nvidia's been building out the Neotron 3 family for a while now. There's been the Nano, which was the 30B with 3B active text model. There was a super that's 120B model with a million context window which is really kind of aimed at things like software cyber security and then there's the ultra which is still on the way but they talked about it at GTC this year. Now what they've done with this Omni model is they've taken that nano backbone which was the member transformer mixture of experts model. They've dropped in this C radio vision encoder in here. And they've also dropped in the Parakeet audio encoder in there. And remember, Parakeet is what's been powering their open ASR systems. And I actually made a video about that last year. And as they've been rolling out this Neotron series of models, they've been talking about getting these models to actually power things like Open Core and other agentic systems out

PinchBench

there. If you remember back the actual super model I think it was the top open rating model for pinchbench the benchmark that actually measured doing things with open claw and we saw GTC them then roll out a bunch of different things around Nemo claw and around the concept of building agentic apps in general and this particular model is built for giving agents these multimodal abilities in here so we've got real world document analysis is multiple image reasoning. So that's where we can get looking at different images and sort of talking about them together. Automatic speech recognition, long audio and video understanding. And then with their post training, they've made it so that even things like agentic computer use are being supported in here. And remember, while these components come from different places, this is not a suite of models that we're looking at here. It's one model that does the text, the images, the video, the audio, all in one model. Now, while it's true that

Nemotron 3 Nano Omni Paper

we've had proprietary models that can do these multimodal kind of tasks for a while, having this in open models is really recent and in open models where we've actually got a paper so we actually know how this whole thing was put together and the training recipes with the different components in here. Well, this is pretty much it. There is no paper with this kind of detail for any of the other open models. And I think this is the key thing that Nvidia understands is that a lot of organizations want to use open models, but what they need actually goes beyond just open weights. They need to know what went into the model. how the model's going to respond. And this is where Nvidia's done a really

Nemotron 3 Nano Paper

good job. If we go back to the Neotron 3 Nano, you can actually see in the tech report that they released, they've got a full breakdown of what kind of data actually went into the training mix. They tell us what languages has been trained on. They tell us how many tokens were actually used for pre-training and then they had a full breakdown of not only how many examples for supervised fine-tuning, but what data went into that supervised fine-tuning. And the supervised fine-tuning recipes are really where a lot of this magic is happening nowadays. So if we jump forward to the Neotron 3 Omni report, we can actually see in here that they've got what they did for the vision supervised fine-tuning doing the audio encoder fine-tuning joint omni SFT with both the vision and audio and then going into the RL training for text and reasoning etc. And we can see here looking at this breakdown, we just don't get papers that make all of this open like this. And on top of this, a lot of the data sets are actually up on hugging face. Now, why is this important? This is important because if you're going to

Nemotron 3 Nano Omni Hugging Face

use the model in some kind of use, whether that's aentic or just for some kind of LLM applications, especially if you want to do some kind of fine-tuning, whether that's perhaps getting the model to be better at particular kinds of OCR and stuff like that, you want to read these recipes to be able to take advantage of that so that you can get the best out of the model. So, the model

OpenRouter and NVIDIA Cloud

itself is both up on the Nvidia cloud where you can actually try it out, but currently it's actually free on Open Router. So, what I thought is I would do a demo of this and then Nvidia has actually been kind enough to sponsor me with a DGX Spark and I'll actually show you how I've been setting up the DGX Spark to basically just be an LLM server that's running in my office that allows me to basically ping it anytime I want from my main computer without having to use any of the resources of my main computer. So, let's jump in and have a look at some of the demos here. Okay, so

Demo

I've put together a collab of where you can use either the Nvidia version or the open router version. The open router version is free, but I don't think it's fully supporting the audio and the video in here. So, just come in here and you can basically pick which version you want. You will need an API key, of course. And then you can see we've got some basic sort of settings here of where we're going to enable thinking or not enable thinking. So if we enable thinking in this case, you will get the thinking in sort of green here and then the standard output of what's going on. We can also actually set a reasoning budget. So if we want to just determine the number of tokens that we're going to set, that's something that we can do here. And then of course you can actually have thinking off totally. So that you just in this case we just get a standard answer out. You'll notice that that's a lot quicker. Now if we do give it something to actually sort of reason over and we give it a good budget, you will find that the model actually thinks for quite a lot. So this is a classic sort of coin flip thing here where it's basically getting it to evaluate a bunch of different things. And you can see that it will actually do a lot of thinking as it goes through. And then finally it'll come through to the end here where it basically puts this together. Now the same thing you can do obviously for no thinking and it will still put a long answer in there but you see sure enough we're getting to the same sort of mapping out. And you will realize that for certain questions you're just not going to get as good quality answers out when you've got the reasoning either turned off or the budget too low. All right. If we want to take an image, just making sure we've got that image loaded. We can actually enable thinking and do reasoning over the actual image. So you can see here it's basically reasoning over what it's got in the image tokens and then it gives us the answer there. The exact same prompt with no reasoning on it will just get us to this answer. Now you can play around with the system prompt. I'll show you that with the local version in here. Another thing that you can actually do is do sort of tool calls based on an image. So here we're setting up this tool of capture observation tool. We're going to pass in a prompt that we want to basically tell it call the capture observation tool exactly once with this modality. If we see this goes off and sure enough it's got the image there. It's then able to use that tool and we can see that got run. And if we do that with the thinking on, we can actually sort of see what's going on before it comes back with the structured output from the tool there. So looking at the same thing running locally. So this is running locally on the DGX Spark. You can see that I'm basically got it on set up so it's on my local network. It's running the model using VLM in there and that can handle things like images, like text in here. Now, the UI is just a simple gradio app in here. There's nothing sort of fancy about that, but it means that we can give it some nice little sort of settings of where we can turn the reasoning on or off. We can show the reasoning traces if we want to. We can set the reasoning budget. So, if I set a bigger reasoning budget, I set something like this. Let's say I want to do a typical sort of system prompt. Okay, so we've got our system prompt there. If I come in here and I say, tell me about the best places to live in San Francisco. We've got pirate mode on with thinking. So you can see here we're getting the reasoning out. You can see the reasoning is basically looking at our system prompt. And if you come and look at here, you can see that the answer we got out has got the reasoning where it actually talks about replying like a pirate, right? some uses pirate language in there. In this case, I can just hide the reasoning if I want to. And see, sure enough, I can see the actual response out there. Now, if we take something like an audio file, I've basically put up an audio file here of just a very short little script. And just to show you that we can see, you know, here, this is the actual audio. — For me, this podcast is an extension of the loving community of my YouTube subscribers. Okay. And you can see that in the reasoning it actually transcribed the audio there. So this is exactly what we had there. And then it uses that to basically give us our response out in this case. So the same is true for images. videos in here. So the cool thing with this is you see that we're getting a pretty quick response back from the model. It's not sluggish at all. And we're not using any of the resources that are actually on my main computer because this is basically just pinging over a local LAN network. I'm doing the inference on the DGX Spark and basically just having it send the tokens back to my computer here. that's also running with VLLM. So, I don't need to worry about any issues of things like Olama not supporting audio or not supporting the file formats and stuff like that. Because I'm using the VLM on the DJX Spark, I'm actually getting a much better response out here, which in this particular case, I'm using through a Gradio interface, but I could also just be pinging the raw VLM directly with an agent or a particular app. So just to wrap up, this model is very good at anything where you want a general workhorse that can take in a whole bunch of different multimodal content, have the model process that and give it back to you. It can be really useful if you're doing something with agents where you're getting it to scrape pages, take screenshots of pages, do that kind of thing, or process videos that you've downloaded and stuff like that. Now, if you've got a specific task where you just want to transcribe a whole bunch of files or something like that, you probably would still be better to go for the parakeet model by itself without this because you just want the transcripts in that. Here we can actually get the transcripts, we can get that text and we can actually reason over it to sort of extract different pieces of information out of it. So overall, the Neotron 3 Nano Omni is definitely a step forward for local models and for being able to have a model that you can then use with agents to do multimodal tasks. So check it out on the API versions. And if you do need something that's fully local, this model is small enough for you to be able to run it. All the versions that I've been showing you here are the full 16bit versions. Of course, Nvidia has also made available an FP8 version and an FP4 version as well as a GGUF version in here. So, it's great to see that Nvidia is using obviously a lot of the compute that they have to make these general models that then they can make available for people to use out of the box like this or to basically fine-tune their own versions, etc. Anyway, as always, if you've got questions or comments, please put them in the comments below. And I will talk to you in the next video.

Другие видео автора — Sam Witteveen

Ctrl+V

Экстракт Знаний в Telegram

Экстракты и дистилляты из лучших YouTube-каналов — сразу после публикации.

Подписаться

Дайджест Экстрактов

Лучшие методички за неделю — каждый понедельник