I've put together a collab of where you can use either the Nvidia version or the open router version. The open router version is free, but I don't think it's fully supporting the audio and the video in here. So, just come in here and you can basically pick which version you want. You will need an API key, of course. And then you can see we've got some basic sort of settings here of where we're going to enable thinking or not enable thinking. So if we enable thinking in this case, you will get the thinking in sort of green here and then the standard output of what's going on. We can also actually set a reasoning budget. So if we want to just determine the number of tokens that we're going to set, that's something that we can do here. And then of course you can actually have thinking off totally. So that you just in this case we just get a standard answer out. You'll notice that that's a lot quicker. Now if we do give it something to actually sort of reason over and we give it a good budget, you will find that the model actually thinks for quite a lot. So this is a classic sort of coin flip thing here where it's basically getting it to evaluate a bunch of different things. And you can see that it will actually do a lot of thinking as it goes through. And then finally it'll come through to the end here where it basically puts this together. Now the same thing you can do obviously for no thinking and it will still put a long answer in there but you see sure enough we're getting to the same sort of mapping out. And you will realize that for certain questions you're just not going to get as good quality answers out when you've got the reasoning either turned off or the budget too low. All right. If we want to take an image, just making sure we've got that image loaded. We can actually enable thinking and do reasoning over the actual image. So you can see here it's basically reasoning over what it's got in the image tokens and then it gives us the answer there. The exact same prompt with no reasoning on it will just get us to this answer. Now you can play around with the system prompt. I'll show you that with the local version in here. Another thing that you can actually do is do sort of tool calls based on an image. So here we're setting up this tool of capture observation tool. We're going to pass in a prompt that we want to basically tell it call the capture observation tool exactly once with this modality. If we see this goes off and sure enough it's got the image there. It's then able to use that tool and we can see that got run. And if we do that with the thinking on, we can actually sort of see what's going on before it comes back with the structured output from the tool there. So looking at the same thing running locally. So this is running locally on the DGX Spark. You can see that I'm basically got it on set up so it's on my local network. It's running the model using VLM in there and that can handle things like images, like text in here. Now, the UI is just a simple gradio app in here. There's nothing sort of fancy about that, but it means that we can give it some nice little sort of settings of where we can turn the reasoning on or off. We can show the reasoning traces if we want to. We can set the reasoning budget. So, if I set a bigger reasoning budget, I set something like this. Let's say I want to do a typical sort of system prompt. Okay, so we've got our system prompt there. If I come in here and I say, tell me about the best places to live in San Francisco. We've got pirate mode on with thinking. So you can see here we're getting the reasoning out. You can see the reasoning is basically looking at our system prompt. And if you come and look at here, you can see that the answer we got out has got the reasoning where it actually talks about replying like a pirate, right? some uses pirate language in there. In this case, I can just hide the reasoning if I want to. And see, sure enough, I can see the actual response out there. Now, if we take something like an audio file, I've basically put up an audio file here of just a very short little script. And just to show you that we can see, you know, here, this is the actual audio. — For me, this podcast is an extension of the loving community of my YouTube subscribers. Okay. And you can see that in the reasoning it actually transcribed the audio there. So this is exactly what we had there. And then it uses that to basically give us our response out in this case. So the same is true for images. videos in here. So the cool thing with this is you see that we're getting a pretty quick response back from the model. It's not sluggish at all. And we're not using any of the resources that are actually on my main computer because this is basically just pinging over a local LAN network. I'm doing the inference on the DGX Spark and basically just having it send the tokens back to my computer here. that's also running with VLLM. So, I don't need to worry about any issues of things like Olama not supporting audio or not supporting the file formats and stuff like that. Because I'm using the VLM on the DJX Spark, I'm actually getting a much better response out here, which in this particular case, I'm using through a Gradio interface, but I could also just be pinging the raw VLM directly with an agent or a particular app. So just to wrap up, this model is very good at anything where you want a general workhorse that can take in a whole bunch of different multimodal content, have the model process that and give it back to you. It can be really useful if you're doing something with agents where you're getting it to scrape pages, take screenshots of pages, do that kind of thing, or process videos that you've downloaded and stuff like that. Now, if you've got a specific task where you just want to transcribe a whole bunch of files or something like that, you probably would still be better to go for the parakeet model by itself without this because you just want the transcripts in that. Here we can actually get the transcripts, we can get that text and we can actually reason over it to sort of extract different pieces of information out of it. So overall, the Neotron 3 Nano Omni is definitely a step forward for local models and for being able to have a model that you can then use with agents to do multimodal tasks. So check it out on the API versions. And if you do need something that's fully local, this model is small enough for you to be able to run it. All the versions that I've been showing you here are the full 16bit versions. Of course, Nvidia has also made available an FP8 version and an FP4 version as well as a GGUF version in here. So, it's great to see that Nvidia is using obviously a lot of the compute that they have to make these general models that then they can make available for people to use out of the box like this or to basically fine-tune their own versions, etc. Anyway, as always, if you've got questions or comments, please put them in the comments below. And I will talk to you in the next video.