# Luce Megakernel — 25x Faster Than PyTorch on a Single GPU - Test Locally

## Metadata

- **Channel:** Fahd Mirza
- **YouTube:** https://www.youtube.com/watch?v=e6jY4goVIu0
- **Date:** 15.05.2026
- **Duration:** 9:52
- **Views:** 3,357

## Description

Luce Megakernel hits 340 tok/s on a single GPU — 25x faster than PyTorch, matching Apple M5 Max efficiency with a 2020 NVIDIA card.

🔥 Get 50% Discount on any A6000 or A5000 GPU rental, use following link and coupon:

https://bit.ly/fahd-mirza
Coupon code: FahdMirza

🔥 Buy Me a Coffee to support the channel: https://ko-fi.com/fahdmirza

#megakernel #lucebox 

PLEASE FOLLOW ME: 
▶ LinkedIn:  https://www.linkedin.com/in/fahdmirza/
▶ YouTube: https://www.youtube.com/@fahdmirza
▶ Blog: https://www.fahdmirza.com

RESOURCES:

▶ https://github.com/Luce-Org/lucebox-hub/tree/main/megakernel

All rights reserved © Fahd Mirza

## Contents

### [0:00](https://www.youtube.com/watch?v=e6jY4goVIu0) Segment 1 (00:00 - 05:00)

There is a belief in the local AI community that Nvidia GPUs are fast but power hungry, and Apple silicon is slower but efficient. You pick your trade-off and live with it. This belief has shaped how people buy hardware, how they run models locally, and what they expect from a consumer GPU. Today, we are going to challenge it with a piece of work from the same team that built Luce DFlash, which we covered extensively on this channel.

This is Fahd Mirza and I welcome you. You can see that we have already covered not only Lucebox, but also DFlash, PFlash, and various other work from the same team. That is why I'm quite excited to share this Luce Megakernel with you.

So, what they have done here is they took the Qwen 3.5 0.8 billion parameter model and, instead of running it through llama.cpp or vLLM or any standard framework, they rewrote the entire inference engine as a single CUDA kernel. All 24 layers of the model, one dispatch. No CPU involvement between layers. The results are quite surprising, and I will show you everything in words as simple as possible. We will also do some hands-on work where I will install this and then check out some of the benchmarking. But before that, it makes sense to understand what exactly we are dealing with here.

So, before we get into the numbers, let me explain what this actually means. On the left is how every inference framework works today. The CPU dispatches a kernel to the GPU. The GPU runs one layer, and control returns to the CPU. The CPU dispatches the next kernel. The GPU runs the next layer. Control returns again. For a 24-layer model, that is roughly 100 kernel launches per token. Every boundary between layers costs you CPU round-trip time, weight re-fetching from memory, and thread synchronization. The GPU sits idle between layers, doing nothing useful.

On the right-hand side is a megakernel. The CPU dispatches once.
The GPU runs all 24 layers internally, using cooperative grid synchronization to pass data between layers without ever returning to the CPU. Data stays in registers and shared memory as it flows through the entire network. No idle time, no redundant fetches, no wasted microseconds between layers.

So, I will be talking more about its architecture and how exactly it gets this speed, but for now let's get it installed on my Ubuntu system, where I have an NVIDIA RTX A6000 GPU with 48 GB of VRAM. If you're looking to rent a GPU at a very good price, you can find the link to Massed Compute in the video's description, with a 50% discount coupon code for a range of GPUs.

Let's go back and clone the repo of this Luce Megakernel; I will drop the link to it in the video's description. Now, let's install all the prerequisites, which are simply PyTorch and Transformers. This is going to take a few minutes. While that runs, let's talk about this model and what exactly is happening here.

So, this is the headline number in front of you: the megakernel on an RTX 3090 at 413 tokens per second. llama.cpp on the same 3090 gets 267. The Apple M5 Max gets 229. PyTorch Hugging Face gets 108. And that "1.8 times versus M5 Max" in the top right corner is the number that matters most for this story, as you can see here.

Here is the problem the megakernel is solving. With llama.cpp on an RTX 3090, you get 267 tokens per second at 350 watts, which works out to 0.76 tokens per joule. The M5 Max gets 229 tokens per second at around 130 watts, which is about 1.76 tokens per joule. Nvidia is faster, but 2.3 times worse on efficiency. This is the assumption everyone accepts: Nvidia is the power-hungry choice, Apple is the efficient choice. The megakernel asks whether that is actually true or whether it is just a software problem.

Now, here is why this model specifically was interesting to target. Qwen 3.5 0.8 billion is not a standard transformer. It uses a hybrid architecture: 18 DeltaNet layers, which are linear attention with learned recurrence, and six full standard attention layers. So, in a

### [5:00](https://www.youtube.com/watch?v=e6jY4goVIu0&t=300s) Segment 2 (05:00 - 09:00)

3:1 ratio, it is working. You can see that in the layer diagram. Most layers are DeltaNet. Every fourth is full attention. This hybrid pattern is where LLMs are heading. Qwen3-Next uses it, Kimi Linear uses it, but no framework had built a fused kernel for this pattern. MLX has no DeltaNet kernels. llama.cpp supports it, but generically. This is the first megakernel built specifically for hybrid DeltaNet and attention models.

Now, if you look here, this is the same DeltaNet pattern I was talking about, shown in this diagram. And these are, I would say, the full numbers. Prefill runs at 37,800 tokens per second for the megakernel, whereas for llama.cpp it is just 11,000. That is 3.4 times faster on prefill. Decode is the same story: 1.55 times faster on decode.

And this one is my favorite side-by-side comparison. This one is challenging the conventional wisdom. RTX 3090 with llama.cpp: 267 tokens per second, 350 watts, and the GPU costs around $700 used. M5 Max: 229 tokens per second, and around $2,500 minimum for that one. Whereas this megakernel is not only cheap, it also consumes a lot less power. So, the efficiency gap between Nvidia and Apple is not inherent to the silicon. It is an artifact of running generic software on capable hardware. So, I hope that I was able to explain all the differences.

Let's go back to our terminal, and meanwhile, if you're looking for AI updates like these, please follow me on X and consider becoming a member to support the channel. And the installation is successful. The megakernel CUDA extension is built and ready. Let's now run the benchmark script which they have provided in the repo. So, this is the official benchmark, which tests two things: prefill speed, which is how fast the model reads and processes the input prompt, and decode speed, which is how fast it generates tokens. So, let's see how it goes. It should run with some warm-up passes. And it is also downloading the model.
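To make the prefill versus decode distinction concrete, here is a minimal sketch of a timing harness with warm-up passes. It is not the repo's benchmark script: `measure_tokens_per_second`, `fake_prefill`, and `fake_decode` are hypothetical names, and `time.sleep` only stands in for real model work.

```python
import time


def measure_tokens_per_second(run, n_tokens, warmup=2, repeats=5):
    """Time `run()` and return throughput in tokens per second.

    `run` is any zero-argument callable that processes `n_tokens` tokens.
    """
    # Warm-up passes: the first calls often pay one-off costs (compilation,
    # caches, memory allocation), so they are not timed.
    for _ in range(warmup):
        run()

    # Timed passes: keep the fastest one to reduce noise from other processes.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return n_tokens / best


# Hypothetical stand-ins for a real model; sleep() only simulates work.
def fake_prefill():
    time.sleep(0.002)  # the whole prompt is processed in one pass


def fake_decode():
    for _ in range(8):  # tokens are generated one at a time
        time.sleep(0.001)


prefill_tps = measure_tokens_per_second(fake_prefill, n_tokens=520)
decode_tps = measure_tokens_per_second(fake_decode, n_tokens=8)
print(f"prefill: {prefill_tps:,.0f} tok/s, decode: {decode_tps:,.0f} tok/s")
```

When timing real GPU work, the timed region also needs a device synchronization, such as `torch.cuda.synchronize()`, before reading the clock, because kernel launches return to the CPU before the GPU has finished.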
Okay, so I just need to install accelerate. Even though I installed all the prerequisites, I think this one was missing. Let's quickly do that, and we will rerun it.

So, let's walk through what has just happened. The benchmark, as you can see here, ran two things side by side: the megakernel and plain PyTorch Hugging Face. The top one is the megakernel and this one is PyTorch Hugging Face. Same model, same hardware, same prompt of 520 tokens, and this is the prompt it used for its benchmarking.

Now, prefill first. The megakernel processed 520 tokens in 24 milliseconds, at 21,282 tokens per second, whereas PyTorch took 428 milliseconds, at 1,215 tokens per second. That is 17.5 times faster prefill from the megakernel. This is where the single kernel launch makes the biggest difference. No CPU round trips between 24 layers means the prompt gets processed dramatically faster. That is the main thing here, and you can see it in the final result.

Same story for decode. The megakernel generated tokens at 340 per second, as you can see here, and PyTorch managed 14 per second. That is 25 times faster generation. So, on a 0.8 billion parameter model, PyTorch Hugging Face is genuinely that inefficient, because it was never optimized for this hybrid DeltaNet architecture. And then we have these completions at the bottom. They show that both produced identical output: same text, correct generation, no quality difference.

So, I think this one has got real promise. I will drop the link to its repo in the video's description. Check it out and let me know what you think. A single CUDA kernel doing what 100 separate kernel launches used to do. Same hardware, same model, same output, just software written the way the GPU actually wants to be used. Things are moving pretty fast. I hope that you are enjoying the channel. If you do, please consider becoming a member. Thank you for all the support.
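The ratios quoted in the video can be rechecked with a few lines of arithmetic. This is only a sketch: the inputs are the rounded figures as spoken, and the helper names `speedup` and `tokens_per_joule` are made up for illustration. With the rounded decode rates of 340 and 14 tokens per second the ratio comes out near 24; the 25x headline presumably uses unrounded values, which is an assumption.

```python
def speedup(fast, slow):
    """How many times larger `fast` is than `slow`."""
    return fast / slow


def tokens_per_joule(tokens_per_second, watts):
    """Energy efficiency: (tokens/s) / (joules/s) = tokens per joule."""
    return tokens_per_second / watts


# Benchmark run shown in the video (520-token prompt), in tokens per second.
prefill_speedup = speedup(21_282, 1_215)  # megakernel vs PyTorch
decode_speedup = speedup(340, 14)         # megakernel vs PyTorch

# RTX 3090 figures from the slides, in tokens per second.
slide_prefill = speedup(37_800, 11_000)   # megakernel vs llama.cpp
slide_decode = speedup(413, 267)          # megakernel vs llama.cpp
vs_m5_max = speedup(413, 229)             # megakernel vs Apple M5 Max

# Efficiency comparison from the slides.
rtx3090_llamacpp = tokens_per_joule(267, 350)
m5_max = tokens_per_joule(229, 130)       # "around 130 watts" in the video
efficiency_gap = m5_max / rtx3090_llamacpp

print(f"prefill speedup vs PyTorch:   {prefill_speedup:.1f}x")
print(f"decode speedup vs PyTorch:    {decode_speedup:.1f}x")
print(f"prefill speedup vs llama.cpp: {slide_prefill:.1f}x")
print(f"decode speedup vs llama.cpp:  {slide_decode:.2f}x")
print(f"decode speedup vs M5 Max:     {vs_m5_max:.1f}x")
print(f"RTX 3090 + llama.cpp: {rtx3090_llamacpp:.2f} tok/J")
print(f"M5 Max:               {m5_max:.2f} tok/J")
print(f"M5 Max efficiency advantage: {efficiency_gap:.1f}x")
```

The slide figures and the benchmark run come from different setups, so the two groups of ratios are best read separately rather than compared with each other.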

---
*Source: https://ekstraktznaniy.ru/video/50838*