I Tested the First Diffusion Reasoning LLM… It’s Insanely Fast

Skill Leap AI · 25.02.2026 · 8,180 views · 210 likes


Video description
You can try Mercury 2 here: M2 Playground: https://chat.inceptionlabs.ai/ M2 API: http://platform.inceptionlabs.ai/ Inception gave me early access and I made this video in partnership with them. I walk through how I test a new reasoning LLM called Mercury 2 and show why it’s so fast. I ask it to build a full game of checkers, and it writes working code almost instantly. Then I push it to create a full game of chess, and it generates hundreds of lines of code in seconds. I also show how it handles follow-up prompts and rewrites the code just as fast. Mercury 2 is different from other large language models like ChatGPT and Claude Haiku. Most LLMs generate tokens one at a time. Mercury 2 uses a diffusion model, which creates and refines tokens in parallel. That’s why this diffusion LLM can reach around 1,000 tokens per second. I compare Mercury 2 to Claude Haiku in a speed test, and Mercury 2 finishes much faster while still keeping strong reasoning. This makes it a great fit for AI agents, coding, voice apps, search tools, and customer service apps where you need both speed and reasoning. If you’re building with an API and need a fast reasoning LLM, Mercury 2 is worth testing in the playground or API.

Table of contents (2 segments)

Segment 1 (00:00 - 05:00)

I'm going to show you a brand new reasoning large language model called Mercury 2, and it works completely differently than any other LLM I've ever tried. Let me show you the speed with a simple prompt: create a game of checkers for me. If you look at it, it's completely different, right? Look at all this code it wrote almost instantly, and it's a working game of checkers I could play in my browser right now.

The reason it's so fast is that it uses a diffusion model to handle multiple tokens at the same time. If you're not familiar with how large language models work: when you talk to ChatGPT, for example, it generates tokens (maybe two or three tokens make up a word), and it does that like a typewriter. It has to generate one token, then the next token, and so on, and that creates a bottleneck effect. Diffusion models, which I first started using when image generation models came about, create the tokens in parallel. They basically generate and then refine over time, and this graphic shows how it works: they first generate noise, then refine it, and give you the final output after the refinement is done. So it works like an editor rather than a typewriter.

Mercury 2 is the first reasoning LLM that is diffusion-based, and they compared it in benchmarks against other LLMs designed for speed. Claude has one called Haiku that's optimized for speed, and you can see Mercury 2 is clearly five times faster in the tokens it generates: about 1,000 tokens per second. Same thing compared to OpenAI's model that's designed for speed. Let me show you some more examples in the playground. This time, let's create a more complex game of chess.
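The typewriter-versus-editor contrast described above can be sketched in a few lines of Python. This is a toy illustration only (random characters stand in for model forward passes), not Mercury 2's actual algorithm: the point is that autoregressive decoding needs one pass per token, while a diffusion-style decoder starts from a fully "noisy" draft and refines every position in parallel over a fixed number of passes.

```python
import random

def autoregressive_decode(n_tokens, vocab="abcd"):
    """Typewriter style: one sequential pass per token."""
    out = []
    for _ in range(n_tokens):          # n_tokens sequential "model calls"
        out.append(random.choice(vocab))
    return out

def diffusion_decode(n_tokens, steps=4, vocab="abcd"):
    """Editor style: start from a masked draft, refine all positions in parallel."""
    seq = ["?"] * n_tokens             # the "noise" / fully masked draft
    for _ in range(steps):             # a fixed number of refinement passes
        for i in range(n_tokens):      # conceptually parallel across positions
            if seq[i] == "?" or random.random() < 0.25:
                seq[i] = random.choice(vocab)
    return seq

print("".join(autoregressive_decode(8)))  # cost grows with sequence length
print("".join(diffusion_decode(8)))       # cost grows with refinement steps
```

The key difference is in the loop structure: the sequential cost of the diffusion sketch depends on the number of refinement passes, not on the length of the output, which is where the parallel speedup comes from.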
The playground also has an option for changing the reasoning effort, which some reasoning models let you adjust while you're chatting with them. In this case, I'm going to set it to high just to show you what the speed looks like at the highest level. Again, you can see how it's generating the code; it works totally differently than normal LLMs, right? Let me see how many lines of code it just wrote: almost 600 lines to create this game of chess. And just like any chatbot, you can send any type of follow-up prompt and it will make adjustments. Let me show you the speed of a follow-up prompt in real time. It rewrote the entire code and made my adjustment, and yeah, it's working just fine. I could add AI to this, and I could do any type of vibe coding I usually do to create little apps like this; I've made a ton of vibe coding videos on the channel lately.

Okay, let me show you a quick speed test. I'm going to compare Haiku 4.5 with extended thinking against Mercury 2 set to high, which is obviously going to be the slowest setting but puts the most effort into its reasoning, so it should give us the best result. I'll turn off the diffusion visual effect so it doesn't affect the speed in this chatbot. I'll send this one out first and the other out second, so about half a second of difference, but I'll let them both go to work now, and I won't edit this part just so you can see. This one was almost instant. Let me go ahead and see if reset works, and what happens if I create a black hole. Okay, yeah, this is working as intended. Pretty simple user interface, but it definitely followed my prompt. And Haiku is finally done, and yeah, I guess this one works the same way, but you can see it took far longer.
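To put the speed figures from this test in context: the time the generation phase takes is simply output length divided by throughput. A rough back-of-the-envelope sketch, where the ~10 tokens per line of code is my own assumption rather than a figure from the video:

```python
def generation_time_s(output_tokens, tokens_per_second):
    """Wall-clock time for the generation phase alone.

    Ignores network latency and prompt processing; a rough estimate only.
    """
    return output_tokens / tokens_per_second

# ~250 lines of code at an assumed ~10 tokens per line = 2,500 output tokens
tokens = 250 * 10
print(generation_time_s(tokens, 1000))  # Mercury 2 at ~1,000 tok/s -> 2.5 s
print(generation_time_s(tokens, 200))   # a 5x-slower speed-tier model -> 12.5 s
```

Under those assumptions, a five-fold throughput difference turns a ten-plus-second wait into a near-instant response, which matches what the side-by-side test shows.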
This one was almost instant. Let me see how much code it wrote: they both created roughly 250 lines of code to get that result. So if you're looking for speed, I think that shows you a clear example of the difference. Now, when it comes to their other models like Opus and Sonnet, those flagship models aren't really comparable to this one right now; Mercury 2 is designed to compete with the models built for speed at a lower tier, like the Haiku model I tested it against. A place where I think this is going to come in really handy is the API, if you're using AI to build any type of AI-powered app, because when you're building with Mercury 2 it's again going to be about five times faster. A lot of apps, like customer service apps or voice apps or anything that needs a near-instant response, need speed, but speed alone isn't good enough if the reasoning isn't there. Mercury 2 is the first reasoning diffusion LLM, which makes it very interesting for API use, and the pricing is actually really affordable: 25 cents per million input tokens and 75

Segment 2 (05:00 - 06:00)

cents per million output tokens. Now, a few places where this is obviously going to shine: any type of agent build, where a lot of the time you need speed but you also need reasoning. Diffusion LLMs have actually been around, but a reasoning one hasn't; Mercury 2 is the first reasoning plus diffusion LLM, so you get the speed but you also get the reasoning, and they go hand in hand really well. It's also going to come in handy for any type of search or voice work, and obviously for the coding I showed you. You always want speed, but you also need reasoning, because code generated quickly sometimes isn't good enough if the reasoning isn't set to a proper level to give you great outputs. Now, a couple of places you can try Mercury 2 right now: in the links below, one will take you to the playground, where you can use Mercury 2, change the reasoning effort, enable web access, and try all kinds of different prompts to see the speed and the reasoning level. They also have the API, and I'll link that below as well. So if you're building any type of app, Mercury 2 is something you probably want to test out, especially if you're doing any type of agentic work, search work, or any type of customer service or voice app where speed is going to be one of the key factors but you also need reasoning to make sure the answers are really good. Thanks so much for watching this one. I'll see you on the next one.
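The quoted pricing is easy to turn into a per-request cost estimate. A minimal sketch using only the two rates mentioned in the video ($0.25 per million input tokens, $0.75 per million output tokens); the function name and the example token counts are mine, not from the source:

```python
def mercury2_cost_usd(input_tokens, output_tokens,
                      input_per_million=0.25, output_per_million=0.75):
    """Cost in USD at the per-million-token rates quoted in the video."""
    return (input_tokens / 1_000_000 * input_per_million
            + output_tokens / 1_000_000 * output_per_million)

# e.g. one chat turn: 2,000 input tokens, 600 output tokens
print(f"${mercury2_cost_usd(2_000, 600):.5f}")  # -> $0.00095
```

At these rates, even a high-volume use case like customer service stays cheap: a million such turns would cost under a thousand dollars, which is why the API angle matters for the speed-sensitive apps discussed above.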
