5 Signals I Use to Estimate DGX Spark vs Jetson Thor LLM Speed


Contents (2 segments)

Segment 1 (00:00 - 05:00)

Hello, it's Jim from jetsonhacks.com. Folks have asked me to compare the NVIDIA Jetson Thor with the DGX Spark. I am going to share with you how I go about comparing two different systems. There may be surprising results, things you didn't think about, or things I may have missed. Part of this game is to leave your guesses in the comments below. You can say where you think I might be wrong, agree with the results, or come up with your own estimate. This is like real work. We start now.

When starting this type of comparison, there are several things to keep in mind. The first is the easiest: the usual who, what, when, where, and why. In this case, we want to understand the goals of the analysis and under what circumstances we are doing the comparison. This particular comparison is straightforward. We have two machines, both with 128 GB of unified memory, Arm CPUs, and a Blackwell GPU. We want to make sure we're comparing the same characteristics. By setting a baseline, we are then able to factor in the architectural differences which determine overall performance.

We look specifically for enablers. These are features which cause one of the devices being compared to behave differently from the other. We want to know the distinguishing characteristics of each device. What makes them different? Finally, we know everything has a cost in engineering. What are the trade-offs the device made for each feature, and how does it use those to its advantage?

When I look at chips, I want to know the number of transistors and how much power the chip draws. Here we have an obvious difference. The Jetson Thor uses a 4-nanometer TSMC process; the GB10 a 3-nanometer process. This allows the GB10 to have 25 to 30% more transistors in the same die area. That also means the GB10 will have better performance per watt. Another thing I noticed right off the bat is how much power the chips consume. The Jetson Thor SoC is rated for 120 watts. The NVIDIA GB10 SoC is 140 watts.
Remember that the newer manufacturing process allows you to run the same amount of compute in a smaller power envelope. Or, if you use more power, you get more compute in the same space times the efficiency gain of the new process.

Once I know how many transistors there are, I want to know how they were spent. What architectural blocks did they implement? Assume that a good portion of the chip is cache memory for the CPU and the GPU. On Thor, they spent a lot of the transistor budget on a safety-first, power-conserving CPU architecture. Automotive-grade guaranteed performance is a high priority. The GPU has about 25% more CUDA cores than its predecessor. There's also a 50% increase in the number of tensor cores.

Taking a look at the GB10, we can see that they spent a good portion of their budget on high-performance CPU cores. The cores provide better performance, but can be rather bursty at times. That's one of the first design trade-offs we see in our comparison: Thor provides calculations within a known time, while the GB10 provides them as fast as possible with no guarantees. The GPU has more than twice the compute components of Thor. This should provide quite an uplift in performance.

Now that we have the high-level overview, we can start estimating the performance differences on the CPU. The GB10's change in process results in about a 1.3 times increase in clock speeds. There are 20 cores: 10 performance and 10 efficiency. We'll be very conservative here and say that this results in a 1.7 times performance uplift over Thor.

Before we go on, let's set expectations. I'm treating this as if you were, by some miracle, able to stay awake in computer architecture class. Also, if this were one of my professional reports, I would have added many more acronyms, Greek letters, and math symbols. Most people don't know what they mean, but are happy to pay money to look like they do. Next, we will calculate the clock speed of the GPU.
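One plausible way to land near that conservative CPU figure is to combine the clock gain with the core-count ratio. This is my reconstruction, not the video's derivation, and the 14-core figure for Thor's CPU is my assumption:

```python
# Rough CPU uplift sketch: GB10 vs Jetson Thor.
# clock_ratio comes from the video; the core counts are my assumptions.
clock_ratio = 1.3        # ~1.3x clock speed from the newer process
gb10_cores = 20          # 10 performance + 10 efficiency cores
thor_cores = 14          # assumed: Thor's 14 Neoverse V3AE cores

raw = clock_ratio * (gb10_cores / thor_cores)   # ~1.86x if everything scaled
conservative = 1.7                              # the video's deliberately low call
print(f"Raw scaling: {raw:.2f}x, conservative estimate: {conservative}x")
```

The gap between the raw 1.86x and the quoted 1.7x is the "very conservative" discount: mixed performance/efficiency cores never scale like a uniform cluster.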
One of the major assumptions is that the GB10 produces 32 TFLOPS FP32. NVIDIA published this number, and it sounds correct given the configuration. From here we can figure out the GPU clock speed. The answer is 2.6 GHz. Here are the Thor numbers. We can expect the CUDA GPU performance to increase by a factor of four.

Let's take a look at the tensor cores. Remember that these are special fixed-function blocks for multiply-accumulate operations on matrices. Because of that, we expect faster clocks to have less of an uplift here than in the CUDA cores. The GB10 has at least twice as many tensor cores. Multiply that by the clock increase and you get about 3.2 times the performance.

When we talk about memory bandwidth, one thing we need to keep in mind is how the memory controller delivers quality of service. Quality of service is the policy that prioritizes and schedules access to shared resources. In the memory controller, it decides which client gets DRAM next, how long they can burst, and with what priority. Note that QoS is mixed letter case. You can charge more for those in your reports.
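The arithmetic behind those two numbers can be sketched as follows. The 6144 CUDA-core count and the 1.6x GPU clock ratio over Thor are my assumptions, not figures from the video:

```python
# Working backwards from NVIDIA's published FP32 figure to a GPU clock.
fp32_flops = 32e12          # 32 TFLOPS FP32, NVIDIA's published number
cuda_cores = 6144           # assumed GB10 CUDA core count
flops_per_core_cycle = 2    # one fused multiply-add = 2 FLOPs per cycle

clock_hz = fp32_flops / (cuda_cores * flops_per_core_cycle)
print(f"Implied GPU clock: {clock_hz / 1e9:.1f} GHz")   # ~2.6 GHz

# Tensor-core uplift: at least twice the cores, times the clock gain.
core_ratio = 2.0            # GB10 has at least 2x Thor's tensor cores
clock_ratio = 1.6           # assumed GB10-over-Thor GPU clock advantage
print(f"Tensor uplift: ~{core_ratio * clock_ratio:.1f}x")  # ~3.2x
```

If the core count assumption is wrong, the implied clock shifts proportionally; the method, dividing published FLOPS by (cores × FLOPs per cycle), is the point.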

Segment 2 (05:00 - 08:00)

Let me show you an example of how quality of service works by comparing Thor and Orin. We will use STREAM to determine the CPU bandwidth. This is an industry-standard test that's been around since the 1990s. This is a typical example. For testing, you generally run many trials and throw out the highest and lowest before averaging the remaining results together. There are four different tests: Copy simply moves memory, and the other three perform arithmetic operations between the fetch and the return. Notice that these numbers are well below the quoted maximum memory bandwidth of 273 GB per second.

Next, I'll run some tests on the CPU, on the GPU, and on both of them together to measure contention. Here we see that the GPU can grab a little more bandwidth. The CPU will always get some bandwidth just to keep the system running. The percentage use is typical; there's overhead that needs to be taken into account. Now, when we run both at the same time, the usage on each falls in half, as expected. The CPU appears to be slightly favored.

Dynamic voltage and frequency scaling (DVFS) dynamically adjusts a chip's clock speed and voltage to match workload and thermals. In embedded systems, it's crucial for staying within tight power and thermal budgets, extending battery life, and avoiding throttling. For the benchmarks, I've turned off this feature and set the clocks to their maximum frequencies.

Now, let's run STREAM on the Jetson AGX Orin. Here we see something that's a little puzzling. The Orin is rated at 204 GB per second of memory bandwidth, but we aren't even seeing half of that. When we run the GPU and contention benchmarks, we can see what happens. The GPU appears to run at the maximum rate, while the CPU is limited to around 35% of bandwidth. This corresponds to the quality of service constraints that the Orin uses. These can be changed by the developer somewhat. That's why tuning your applications is so important.
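A minimal STREAM-style measurement can be sketched in Python with NumPy standing in for the real C benchmark. Absolute numbers will understate the hardware (interpreter overhead, and Triad here takes two passes instead of one), but the trim-then-average methodology is the same:

```python
import time
import numpy as np

N = 10_000_000   # ~80 MB per array, far larger than any cache level
a = np.random.rand(N)
b = np.random.rand(N)
c = np.empty(N)

def bandwidth(kernel, bytes_counted, trials=7):
    """Time a kernel several times, drop the fastest and slowest runs,
    and average the rest -- the trimming described above."""
    times = []
    for _ in range(trials):
        t0 = time.perf_counter()
        kernel()
        times.append(time.perf_counter() - t0)
    kept = sorted(times)[1:-1]                  # throw out highest and lowest
    return bytes_counted / (sum(kept) / len(kept)) / 1e9   # GB/s

def triad():                         # STREAM Triad: c = b + 2.0 * a
    np.multiply(a, 2.0, out=c)       # two passes instead of STREAM's one,
    np.add(c, b, out=c)              # so this understates true bandwidth

copy_bw = bandwidth(lambda: np.copyto(c, a), 16 * N)   # 8 B read + 8 B write
triad_bw = bandwidth(triad, 24 * N)                    # STREAM's byte count
print(f"Copy:  {copy_bw:.1f} GB/s")
print(f"Triad: {triad_bw:.1f} GB/s")
```

For real comparisons, use the actual STREAM C benchmark compiled with OpenMP so every core participates; a single-threaded run measures one core's load/store path, not the memory controller.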
Thor's controller has documented quality of service that favors predictable latency and efficiency under contention. With coherent cache and a desktop power budget, the GB10 tends to keep more bandwidth available to the GPU when the CPU is busy. Also, it looks like the GB10 memory will be about 10% faster in the DGX Spark.

One major advantage that the GB10 has over Thor is its coherent cache. Now, Thor also has a fully coherent cache. However, on the GB10 the CPU can directly access the GPU's L2 cache, which acts as an L4 cache. This is quite an advantage when passing small amounts of data and control signals.

Adding it all up: CUDA is going to be about four times faster, and tensor cores a little over three times faster. Figure the CPU is 1.7 times faster, memory bandwidth 10%, and a little bonus multiplier for the coherent cache we'll call 10%. The final numbers, please. The short answer is, of course, it depends. It depends on the AI workload, but my guess is that Spark is two or three times faster in general. TC is tensor cores. The more you can keep the compute fed, the better the performance gains over Thor. If a lot of your overhead is in CPU-to-GPU communication, the GB10 has a fast pathway. If you are serving an LLM to multiple users, that's a big win for Spark. It's not likely that you would turn DVFS off on Thor, so the startup cost to get Thor to full throttle is rather high. Anyway, those are my thoughts. Let me know in the comments what you think. Thanks for watching.
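One way to roll those factors into a single number is an Amdahl-style weighted combination: each resource gates some fraction of runtime, and each fraction shrinks by its uplift. The uplifts below are the video's; the workload-mix weights are mine and purely illustrative:

```python
# Per-resource uplift estimates from the video.
uplift = {
    "cuda": 4.0,        # CUDA core throughput
    "tensor": 3.2,      # tensor cores
    "cpu": 1.7,         # CPU
    "memory": 1.1,      # ~10% faster memory
    "coherency": 1.1,   # ~10% bonus for the coherent-cache pathway
}

# Hypothetical fraction of runtime each resource gates (my guess,
# loosely shaped like an LLM-serving workload; must sum to 1.0).
mix = {"cuda": 0.2, "tensor": 0.4, "cpu": 0.1, "memory": 0.25, "coherency": 0.05}

# Amdahl-style: new runtime is the sum of each shrunken fraction.
new_time = sum(weight / uplift[k] for k, weight in mix.items())
print(f"Blended speedup: ~{1 / new_time:.1f}x")   # low end of the 2-3x guess
```

Shift the mix toward tensor-bound work and the blended number climbs toward 3x; shift it toward memory-bound work and it sinks toward 1.1x, which is exactly the "it depends" in the conclusion.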
