# I Benchmarked 6 LLMs on Jetson Thor — Here’s What Surprised Me

## Metadata

- **Channel:** JetsonHacks
- **YouTube:** https://www.youtube.com/watch?v=LV2k40nNpCA

## Contents

### [0:00](https://www.youtube.com/watch?v=LV2k40nNpCA) Segment 1 (00:00 - 05:00)

These are Jetson Thor benchmarks straight out of the box. No optimization, just the first run. I've just set up Jetson Thor and installed llama.cpp. I ran benchmarks on six interesting LLMs, my own hands-on tests, and I'll serve two of them so we can interact through a web browser. That should give us a feel for how they perform for real. This is our baseline for future development. But one huge model does something you won't expect. We start now.

People obsess over benchmarks. They hope that benchmarks give them a handle on how fast a machine is compared to others. There are enough numbers and bright shiny things to make you think something good is happening. Here we're looking at Qwen 3 Coder. First, the model is loaded into memory. This model takes up 17 GB, so it takes a moment to load. Then the benchmark measures prompt processing speed, reported in tokens per second. Finally, it measures token generation speed, also in tokens per second. This exercises three major subsystems: the drive controller, the memory controller, and the CPU/GPU complex. Qwen 3 Coder looks pretty good here. Prompt processing is 880 tokens per second and token generation is 49 tokens per second. These early results are pretty promising.

When you send a prompt, the CPU first loads the model weights from disk if they aren't already in memory, then turns your words into tokens and looks up embeddings. It manages memory and hands everything to the GPU. The GPU does the heavy lifting: huge matrix multiplies across the whole context window. This stage is GPU bound and pushes memory bandwidth, while the CPU mainly coordinates. After the prompt is processed, we move into token generation. Now the CPU handles the control flow and sends each step to the GPU, deciding things like sampling and when to stop. The GPU computes one new token at a time, reusing a compact memory of what it's already seen. That's the key-value cache. This reuse keeps the loop fast and drives your tokens per second.

Watching the benchmarks run is about as exciting as watching paint dry. I'll speed it up so it only takes a few seconds for you. For me, it was very many dreadful minutes. We'll do OpenAI's GPT-OSS 20B next. Both of these are mixture of experts models. I'll talk about that in a moment. Now, let's look at a couple of dense models. First up is NVIDIA Llama 3.3 Nemotron 49B. Finally, for this group, we'll look at straight-up Qwen 3 32B. You can see the token generation rates are much lower than for the first two mixture of experts models. The mixture of experts models just crush the dense models. But why is that?

Dense models are the straightforward kind. Every parameter is active for every token. That gives them predictable scaling, but it also means resource use grows directly with model size: more compute, more memory, and slower throughput. Mixture of experts models, sometimes called sparse models, work differently. Think of them like a team of specialists: only two or three experts step in for each decision. For an embedded system, that efficiency means you can tap into large-model quality while keeping the per-token workload small enough for good performance. Mixture of experts gives you big-model quality without crushing the hardware. But for smaller models, especially vision-language or multimodal ones, dense networks can be simpler to deploy and just as performant. Now, let's take a look at a larger mixture of experts model, GPT-OSS 120B.
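Before moving on, here's a minimal sketch of the two phases described above: prompt processing scores the whole prompt at once and fills the cache, while token generation reuses cached key/value tensors so each new token only attends over what has already been stored. This is an illustrative numpy toy, not llama.cpp's implementation; the head size, cache shapes, and random values are made up for clarity.

```python
import numpy as np

# Toy single-head attention with a key/value cache (illustration only,
# not llama.cpp's actual implementation).
d = 8                       # head dimension (made-up size)
rng = np.random.default_rng(0)

def attend(q, K, V):
    # Scaled dot-product attention over everything cached so far.
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# "Prompt processing": all prompt tokens are projected once and cached.
prompt_len = 5
K_cache = rng.normal(size=(prompt_len, d))
V_cache = rng.normal(size=(prompt_len, d))

# "Token generation": each step adds exactly one new row to the cache
# and attends over the whole cache, so per-token work stays small.
for step in range(3):
    q = rng.normal(size=d)                 # query for the newest token
    out = attend(q, K_cache, V_cache)      # reuse everything cached
    K_cache = np.vstack([K_cache, rng.normal(size=(1, d))])
    V_cache = np.vstack([V_cache, rng.normal(size=(1, d))])
    print(f"step {step}: cache length = {len(K_cache)}")
```

The point of the cache is that nothing from earlier tokens is recomputed; each generation step only pays for one new query against stored keys and values, which is why the decode loop is measured separately from prompt processing.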
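And here is a similarly hand-wavy sketch of the mixture-of-experts idea: a small router scores all experts for each token, but only the top few (two here) actually run, so most of the model's weights are never touched for that token. The expert count, sizes, and gating below are invented for illustration and are not the architecture of Qwen 3 Coder or GPT-OSS.

```python
import numpy as np

rng = np.random.default_rng(1)
n_experts, d, top_k = 8, 16, 2       # made-up sizes for illustration

# Each "expert" here is just a small dense layer.
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts))

def moe_forward(x):
    # The router scores every expert, but only top_k experts do any work.
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]
    gates = np.exp(logits[chosen])
    gates /= gates.sum()
    # A dense layer would compute all n_experts matmuls; MoE does top_k.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

x = rng.normal(size=d)               # one token's hidden state
y = moe_forward(x)
print("active experts per token:", top_k, "of", n_experts)
```

That is the whole trade: the full parameter set still has to fit in memory, but per-token compute scales with the experts that fire, not with the total model size.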
In addition to whether a model is dense or sparse, another big factor for performance and accuracy is quantization, the precision used for the numbers inside the model. Most models today are trained in 16-bit floating point, known as FP16. But by converting them to lower precisions like INT8, FP4, or mixed FP4, you can cut memory use and speed up token generation while still keeping accuracy in a range that works for most tasks. I find these results pretty interesting. I was expecting the speed gap to be much larger as we move down to FP4. I'm not entirely sure why the difference isn't bigger, but I suspect that FP16 and INT8 are using the GPU's tensor cores while the lower precisions are not. Lest you think that this level of performance must come with bad results, recent testing shows it's only about a point behind Claude 4.1, at 58, one of the strongest commercial models out there.

All the horrible stuff is behind us. Let's start up a server and see how these models perform. We'll take a look at Qwen 3 Coder 30B first. Startup takes around 11 seconds. I have not shortened this sequence. Let's switch to the browser. Please write a fast Fourier transform in Python. Just so you know, this clip is at normal speed.
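As an aside on the quantization point above, here is a toy sketch of what dropping a weight tensor from FP16 to INT8 looks like and why it halves memory. Real GGUF quantization uses more elaborate block-wise schemes with per-block scales, so treat this single-scale example purely as an illustration of the idea.

```python
import numpy as np

# Toy symmetric int8 quantization of one weight tensor, to show why
# lower precision cuts memory. GGUF uses block-wise schemes with
# per-block scales; this is only a sketch of the general idea.
rng = np.random.default_rng(2)
w = rng.normal(size=(4096, 4096)).astype(np.float16)   # "FP16" weights

scale = np.abs(w).max() / 127.0                        # one scale per tensor
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_back = w_q.astype(np.float16) * scale                # dequantized copy

print("fp16 size:", w.nbytes / 2**20, "MiB")           # ~32 MiB
print("int8 size:", w_q.nbytes / 2**20, "MiB")         # ~16 MiB
print("mean abs error:", float(np.abs(w - w_back).mean()))
```

Half the bytes per weight means half the memory traffic during token generation, which is why quantization tends to show up directly in tokens per second on a bandwidth-bound device.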

### [5:00](https://www.youtube.com/watch?v=LV2k40nNpCA&t=300s) Segment 2 (05:00 - 10:00)

You're seeing the model's real performance. No tricks. Let me scoot this out of the way. I'll let these generations run for a while so that you can get a feel for how fast they are in practice. I'll put a chapter marker for each prompt and the next model so you can skip ahead. Now let's ask for the same in CUDA. Oh, good. Here are the results on the server. The first prompt generated at the same speed as the benchmark, but it looks like the second prompt was about 10% lower. Worth investigating. Next up is GPT-OSS 120B. We are going to start it up with the large context, 128K tokens. It loads about 60 GB into RAM, so it takes about a minute and 50 seconds to load and start up. Let's have it explain mixture of experts to us. We're back to normal speed on the video. I tried to read through the answers to see if they were correct, but it put me to sleep. At first glance, it feels correct. Let's ask it to tell us about agents.
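If you'd rather script these prompts than type them into the browser, llama.cpp's server exposes an OpenAI-style chat endpoint, so something along these lines should work. The URL, port, and prompt here are placeholder assumptions; adjust them to however you launched the server.

```python
import json
import urllib.request

# Minimal client for llama-server's OpenAI-compatible chat endpoint.
# Assumes the server is running locally, e.g. on port 8080; change the
# URL to match your own setup.
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "messages": [
        {"role": "user", "content": "Explain mixture of experts in three sentences."}
    ],
    "max_tokens": 256,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    answer = json.load(resp)

print(answer["choices"][0]["message"]["content"])
```

Scripting the same prompt repeatedly is also a convenient way to check the kind of run-to-run variation mentioned above, where the second prompt came in about 10% slower than the first.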

### [10:00](https://www.youtube.com/watch?v=LV2k40nNpCA&t=600s) Segment 3 (10:00 - 12:00)

Again, the first prompt generates at the benchmark rate, but the second lags behind. Here are the takeaways. We've got our baseline benchmarks on Jetson Thor. We've seen how dense and mixture of experts models behave. And we've explored how different quantization levels affect speed and accuracy. These demos were in real time, and the results show that even massive models can run interactively with quality not far from state-of-the-art. This is just the starting point. As the software improves, we should see even better performance ahead. Thanks for watching.

---
*Source: https://ekstraktznaniy.ru/video/39295*