# Nvidia's NEW Nemotron 3 Nano - Reasoning LLM for the Edge! 

## Metadata

- **Channel:** 1littlecoder
- **YouTube:** https://www.youtube.com/watch?v=LpBc2X4BaKE

## Contents

### [0:00](https://www.youtube.com/watch?v=LpBc2X4BaKE) Segment 1 (00:00 - 05:00)

This NVIDIA Nemotron 3 Nano 4-billion-parameter model is a really good model for on-device use cases. It's a model hardly anyone is talking about, and it is heavily optimized for on-device usage, to the point that you can run it on WebGPU. That means you can load the model in a quantized state inside your browser and chat with it without even having an internet connection. We're going to learn about this model, and then I'll show you a demo of how it works on WebGPU.

First of all, this model is just 4 billion parameters. Even at that size, NVIDIA has released three different checkpoints: BF16 (bfloat16), FP8, and GGUF. And this is a hybrid model: here, hybrid means a mix of two architectures, Mamba (an SSM) and the transformer architecture we have seen in all the classic LLMs. The idea is that the model is designed for both efficiency and accuracy.

The model does particularly well on a few benchmarks. First, instruction following, evaluated on IFBench and IFEval. It also has a good level of general intelligence, and it's a reasoning model, so you can have it produce an internal chain of thought. It requires probably the lowest VRAM footprint for this model size, and it also has the lowest TTFT (time to first token) in its class. These benchmarks were measured on an RTX 4070 using the Q4-quantized version with llama.cpp.

As for the benchmark numbers: this is my personal opinion from my own testing, but it feels like this model has been slightly benchmaxed, which seems to be the theme of releases these days. You can see Nemotron 3 Nano 4B here next to Qwen 3.5 4B.
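As a rough back-of-envelope sketch (my own arithmetic, not NVIDIA's published numbers), here is the weight-only memory for a 4B-parameter model at each of those precisions. This also lines up with the roughly 2 GB browser download for the Q4 build:

```python
# Weight-only size estimates for a 4B-parameter model at different
# precisions. KV cache, activations, and runtime overhead come on top,
# so treat these as lower bounds on the actual VRAM footprint.

PARAMS = 4e9  # 4 billion parameters

BYTES_PER_PARAM = {
    "BF16": 2.0,       # 16-bit brain float
    "FP8": 1.0,        # 8-bit float
    "Q4 (GGUF)": 0.5,  # ~4 bits per weight, ignoring quantization metadata
}

def weight_size_gb(params: float, bytes_per_param: float) -> float:
    """Approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

for name, bpp in BYTES_PER_PARAM.items():
    print(f"{name:>10}: ~{weight_size_gb(PARAMS, bpp):.1f} GB")
```

At Q4 that works out to about 2 GB of weights, which is why the in-browser demo downloads roughly that much.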
I've used the Qwen 3.5 model heavily, and I believe Qwen 3.5 is a much better model than what we're seeing here on the benchmarks. So Nemotron 3 is not a bad model, but I wouldn't say for a moment that it's a better model than Qwen 3.5, because I've had pretty good success with Qwen 3.5. Here you can see this model did better on IFBench, in fact by 10 percentage points, and on IFEval it did 1 percentage point better than Qwen 3.5, and there are other benchmarks where it still comes out ahead of Qwen 3.5.

One good thing about this particular model is that NVIDIA has gone ahead and released the training recipe as well, which you don't get to see a lot these days. NVIDIA has shared the post-training dataset and also the pre-training dataset, so every dataset used to train this model is publicly available for us to use. They have also released the recipe for how they trained it.

First of all, this is a distilled model. They took the 9-billion-parameter model and distilled it, so it's a compressed version of the 9B model. The 9B model was first compressed at short context: that's distillation checkpoint 1. Then they extended it to long context, going from 8K to 49K: that's the second distillation checkpoint. From there they did supervised fine-tuning: 80% with reasoning on and 20% with reasoning off, plus a bit of safety-focused fine-tuning, and then they also did reinforcement learning.
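The 9B-to-4B distillation step can be sketched with the standard soft-label objective. This is a minimal pure-Python illustration assuming the classic Hinton-style KL loss; NVIDIA's actual recipe may use a different objective, and all names and values here are illustrative:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax over a list of logits."""
    m = max(x / T for x in logits)
    exps = [math.exp(x / T - m) for x in logits]  # shift for stability
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 -- the classic soft-label distillation objective."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return kl * T * T

teacher = [2.0, 0.5, -1.0]                      # logits from the "teacher"
print(distill_loss(teacher, teacher))           # 0.0 -- student matches exactly
print(distill_loss(teacher, [-1.0, 0.5, 2.0]))  # > 0 -- student disagrees
```

The student minimizes this loss token by token, which is how the 4B model inherits the 9B model's behavior without seeing a reward signal.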
You can see here that RLVR (reinforcement learning with verifiable rewards) was used, then there is a second RLVR checkpoint, and finally you get the checkpoint we are using, which has gone through single- and multi-turn instruction following, structured-output fine-tuning, and multi-turn conversational tool calling. It has gone through all of these stages, and that's how we got this model.

I think this is probably the most interesting part, more than the model itself. The model is good for a lot of use cases, but more than the model, I believe the recipe NVIDIA has shared here is the most important and most interesting thing. There are further details on how the model was trained with something called Nemotron Elastic, which has an end-to-end-trained router, so you can go and see how the router makes its decisions, how the pruning happens, how the distillation happens, and what the stages look like.

Purely in terms of the model's performance itself, I think you can use it for a variety of use cases, including basic chat. If you want a customer chatbot where you give it limited context and have it chat, then I would say go ahead and use it. I wouldn't prefer this model primarily for agentic or coding use cases.

The way I'm going to show you this model is within WebGPU. There's a link, hosted on Hugging Face: you can go there, click the link, and start chatting. The first time, it will download about 2 GB of model weights, but the next time you use it, it will load from cache, so do this only if you have 2 GB of storage to spare. This runs entirely within the browser instance.
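In RLVR, the reward comes from a programmatic checker rather than a learned reward model. Here is a hypothetical minimal verifier for math-style prompts; the extraction logic is my own simplification, not NVIDIA's pipeline:

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward in the RLVR style: 1.0 if the last number in the
    model's completion equals the reference answer, else 0.0.
    Real pipelines use much more robust extraction and normalization."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

print(verifiable_reward("24 x 2 = 48, so the answer is 48", "48"))  # 1.0
print(verifiable_reward("I think the answer is 50", "48"))          # 0.0
```

Because the reward is computed by code, it can't be gamed the way a learned reward model can, which is why this style of RL is popular for math and instruction-following post-training.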

### [5:00](https://www.youtube.com/watch?v=LpBc2X4BaKE&t=300s) Segment 2 (05:00 - 10:00)

It doesn't even have to be connected to the internet. As you can see here, I just gave it the plot of the movie Tenet and asked it to summarize it. If you watched the screen, you would notice a couple of things. One, the model thought for 47 seconds, and like I said, this happened completely in my browser without any internet connection, which is the coolest thing here. And it generated about 47 tokens per second.

The problem, though, is this: it did the summarization pretty well, I gave it a longer text, the entire plot, and it did a good job summarizing it, but it started hallucinating in the middle. For example, while it was going through its thinking process, it decided to call the protagonist Tom Cruise, which is completely incorrect, because Tom Cruise did not act in this movie. And instead of Tenet, it started calling the movie Tenset. I don't know, maybe it's loyal to its founder Jensen. So there are a couple of these nuances, but I can also understand that this is a model running within WebGPU, heavily quantized, so I can see why this happens.

But for general chat, I think the model does a decent enough job. For example, I can go ahead and ask, "What is the largest animal in the world?", switch off reasoning, and send it. The largest animal in the world is the blue whale.
So for things you would normally use basic Wikipedia for, say you want to design a kids' toy and put a model inside it, this is an easy candidate; you can easily embed this model, even within a browser. You can also see how fast it is. I'm going to ask one more question, with reasoning off: can you multiply 24 × 2? I didn't think it would do well, but you can see 44 tokens per second, and it gave me the right answer, 48.

So as you can see, it generates at a good speed, the TTFT (time to first token) is really good, and I believe this model would be fine-tunable. That is another place where this model has a heavy advantage over any other model that is available. Once again, I might be biased, I'm really sorry, but I wouldn't say this is a big advantage over a Chinese model like Qwen 3.5. But if you do not want to use a Chinese model, if you want to use only a US model, especially one heavily optimized for NVIDIA GPUs and NVIDIA machines, then I think this is an easy choice: you can go here, play with the model, and get a feel for it.

I'm going to run the summary once again, but this time with thinking off, to see if that reduces hallucination; I believe a lot of these models hallucinate more when the thinking process is on, because they go into this deep thinking. "Summarize this plot in just three lines." Send. We have switched off the thinking process, the model starts generating, and you can see 42 tokens per second for a WebGPU model, which is really good. And yes, just as I guessed, it did not hallucinate with the thinking process off. This is something you will notice with a lot of different models: when reasoning is on, the model can do a pretty terrible job, because it has to reason unnecessarily.
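TTFT and tokens-per-second, the two numbers quoted throughout the demo, can be measured around any streaming generator. A small sketch; `fake_stream` is a stand-in for a real model's token stream:

```python
import time

def measure_stream(token_iter):
    """Measure time-to-first-token (TTFT) and overall decode throughput
    for any iterator that yields tokens as they are generated."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_iter:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # prompt processing + first decode step
        count += 1
    total = time.perf_counter() - start
    tps = count / total if total > 0 else 0.0  # tokens per second overall
    return ttft, tps

def fake_stream(n=50, delay=0.001):
    """Stand-in generator simulating a model emitting tokens over time."""
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tps = measure_stream(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {tps:.0f} tok/s")
```

With reasoning off, the first visible answer token arrives sooner because no chain-of-thought tokens are generated first, which is exactly the TTFT effect described above.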
When you switch off reasoning, the model's TTFT is faster, because it isn't reasoning first, and in many cases like this one, a classical NLP task, the model still does a pretty good job. Let me show you one more demo of how good the model is at classical NLP tasks. I'll pick a review here, "visually dazzling...", and set the instruction: "You are a sentiment classifier. Just respond in JSON with positive/negative/neutral and a score," followed by the text. Ideally this is a positive review, and you can see it gave a positive label with a score of 85. I'll do the same thing again, but this time I'll add a couple of negative keywords, like calling it a disappointment, and you can see it returned negative, even though the text still contains a positive sentence.

So for classical NLP tasks like this, where you want to do a lot of batching, you can definitely use this model. It has good capability; even at this quantized level, it does a good job. But if you are thinking of the cutting-edge LLMs of the current 2026 era and want to replace your frontier model with this, this is not that model. Like I said, the fine-tuning part is the most interesting part, and the fact that you can run it on WebGPU is another very interesting part. Overall, I think this is a great model, and like I said, there are three different checkpoints.
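The sentiment demo above, prompting for JSON and batching reviews, can be sketched like this; `call_model` is a keyword-matching stub standing in for the real inference call (llama.cpp, the WebGPU demo, an API, etc.):

```python
import json

SYSTEM = ("You are a sentiment classifier. Respond only in JSON: "
          '{"label": "positive"|"negative"|"neutral", "score": 0-100}')

def call_model(prompt: str) -> str:
    """Stub for the real inference call. It keyword-matches so this
    sketch is runnable end to end without downloading a model."""
    text = prompt.lower()
    if any(w in text for w in ("disappointment", "terrible", "awful")):
        return '{"label": "negative", "score": 80}'
    return '{"label": "positive", "score": 85}'

def classify_batch(reviews):
    """Run each review through the model and parse the JSON reply,
    falling back to 'neutral' on malformed output."""
    results = []
    for review in reviews:
        raw = call_model(f"{SYSTEM}\nText: {review}")
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            results.append({"label": "neutral", "score": 0})
    return results

print(classify_batch(["Visually dazzling!", "A dazzling disappointment."]))
```

The defensive `json.loads` with a fallback matters in practice: small quantized models occasionally emit malformed JSON, and a batch job shouldn't crash on one bad reply.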

### [10:00](https://www.youtube.com/watch?v=LpBc2X4BaKE&t=600s) Segment 3 (10:00 - 10:00)

You can use the BF16 (bfloat16) checkpoint if you're running it on any NVIDIA machine, or the FP8 checkpoint, and if you're running it on a CPU or in any quantized environment, you can use the GGUF checkpoint. NVIDIA is pushing you to use this on its Jetson hardware. Honestly, I don't have that hardware; NVIDIA has never given me one to test. But if you have a DGX Spark or a Jetson device, then you can use it there. I think this is a great release, especially given that it's a hybrid architecture and they've shared the entire recipe. I'm really looking forward to testing this model further. Let me know in the comment section what you think about this model. See you in another video, and happy prompting!

---
*Source: https://ekstraktznaniy.ru/video/44662*