New Qwen3 Coder vs Kimi K2: When Benchmarks Lie

Ray Amjad · 23 Jul 2025 · 20,098 views · 459 likes · updated 18 Feb 2026
Video description
Join AI Startup School & learn to vibe code and get paying customers for your apps ⤵️ https://www.skool.com/ai-startup-school

📲 Stay up to date on AI with my app Tensor AI
- on iOS: https://apps.apple.com/us/app/ai-news-tensor-ai/id6746403746
- on Android: https://play.google.com/store/apps/details?id=app.tensorai.tensorai

CONNECT WITH ME
- 📸 Instagram: https://www.instagram.com/theramjad/
- 👨‍💻 LinkedIn: https://www.linkedin.com/in/rayamjad/
- 🌍 My website/blog: https://www.rayamjad.com/

Links Mentioned:
- Qwen-3-Coder Announcement: https://x.com/Alibaba_Qwen/status/1947766835023335516
- GitHub for Qwen3-Coder: https://github.com/QwenLM/qwen-code
- OpenRouter for Qwen3-Coder: https://openrouter.ai/qwen/qwen3-coder

Timestamps:
- 00:00 - Intro
- 00:47 - Setup
- 03:20 - What We'll Be Doing
- 04:06 - Task #1
- 10:51 - Task #2 Brief
- 11:30 - Token Usage & Cost
- 12:23 - Task #2 Execution
- 14:25 - Conclusion

Table of contents (8 segments)

  1. 0:00 Intro (190 words)
  2. 0:47 Setup (501 words)
  3. 3:20 What We'll Be Doing (191 words)
  4. 4:06 Task #1 (1407 words)
  5. 10:51 Task #2 Brief (141 words)
  6. 11:30 Token Usage & Cost (166 words)
  7. 12:23 Task #2 Execution (363 words)
  8. 14:25 Conclusion (245 words)
0:00

Intro

Okay, so about four hours ago, Alibaba came out with a brand-new model, Qwen3-Coder, and we'll be comparing it against Kimi K2, which came out last week, to see which performs better on a real-world production codebase. On the benchmarks, it already seems to outperform Kimi K2, which is pretty surprising given the two releases are barely a week apart. It outperforms Claude Sonnet 4 as well in many cases, and we'll see whether it actually holds up on a real-world production codebase. I find that a lot of models do well on these benchmarks, but each model has its own feel once you try it on a production codebase: some that score well on benchmarks do poorly in real codebases, and vice versa.

But basically, you can use it in a similar way to Claude Code and the Kimi CLI. If you go to GitHub, which is linked down below as well,
0:47

Setup

you can download the Qwen3-Coder CLI, a Kimi CLI equivalent. It's actually forked from, or adapted from, the Gemini CLI, but optimized for Qwen3-Coder. You install it with npm (the package comes from the qwen-code repo linked above):

npm install -g @qwen-code/qwen-code

Then you can run qwen anywhere on your computer. But you have to set a few environment variables before running the qwen command, so I'm going to quickly show you how to do that. I'll be doing it in the terminal over here, and I'm going to close the window where I have Kimi K2 running. After installing, you want to run the export commands below, replacing the API key and base URL accordingly.

Now, and this is important: even though the variables are called OPENAI_API_KEY and OPENAI_BASE_URL, it is not actually using an OpenAI model. You want to use OpenRouter here. The model isn't hosted on OpenRouter directly; OpenRouter forwards the request to the relevant backend, which is usually Alibaba's servers.

You can see the Qwen3-Coder model over here. Most of the requests are being forwarded to China, where Alibaba is hosting the model, and it has good uptime. The instance hosted in the US seems to be struggling right now.

Basically, we want to set the model name to the OpenRouter slug, so copy that over, go back to the terminal, and paste it. We also need to set the base URL, which is the OpenRouter base URL; I believe it's near the bottom over here, and it will also be in the description down below. And if I go back to my terminal…

Now for the OpenAI API key (again, it's not actually an OpenAI key): you want to use an OpenRouter key. Go to OpenRouter, go to Keys, and make a new key. I'm going to call it YouTube for this YouTube video. And don't use my key, because it'll be expired by the time you're watching this.
Then copy the key, go back to your terminal, and paste it. You should now have your environment variables set like this:

export OPENAI_API_KEY=<your-openrouter-key>
export OPENAI_BASE_URL=https://openrouter.ai/api/v1
export OPENAI_MODEL=qwen/qwen3-coder

Then run qwen. We can make sure it's actually running Qwen and pointing at the right model. So we can ask, "Who are you?", wait a second, and you can see it says, "I'm Qwen," and so on. So this is the correct model. And if you really want to make sure, you can ask a politically sensitive question, and it will be refused, because it is a Chinese model.

So now I have Qwen running on the left-hand side over here in a Qwen-specific folder. And
3:20

What We’ll Be Doing

I have Kimi K2 running on the right-hand side over here in a Kimi K2 folder. And this is the exact same codebase: a production codebase from an application I previously made called Tensor AI.

Tensor AI basically helps you stay up to date with the latest AI news. If you feel like there's so much AI news that it's hard to know what's relevant for you, this application only notifies you about the AI news that is relevant to you, and you can listen to a summary of the news from the last 24 hours. You can download it using a link in the description down below.

If you're also interested in making an application just like this and making money from mobile apps and web apps, I have an AI Startup School where I teach everything from building mobile apps to monetizing and selling them. A bunch of people have already joined and gotten pretty good results from what I teach in the community.
4:06

Task #1

Getting back to the application I have running on my phone over here, you can see that Qwen3-Coder is already in the feed, so the app is very up to date. I want to change this "five-minute, updated hourly" label to the actual length of the episode. Currently, "five-minute" is just static, but the actual length varies; it's four minutes right now. I want the application to show how many minutes and seconds each of the hourly summaries is.

So where it says "five minutes," that's hard-coded into the codebase, because I knew it would be roughly five minutes. Changing it requires a few things: a new database migration, a way of actually calculating the length of the audio, and so on. We'll see which model does a better job here.

I'm going to use SuperWhisper to describe the change that I want, so I'll press start recording:

"Hey, so basically right now I have an audio summary on the homepage where it's hard-coded as five minutes, which is static. When the aggregator that makes this audio summary runs, it should actually calculate the length of the audio summary. That should use something like Mux's API, or it should calculate the length by downloading the audio from Cloudflare, where it's stored, or something like that. Basically, it should calculate the length and store it in the database; there should be a new database column for the length of the audio in seconds. Then it should update the front-end UI to show, in minutes and seconds, how long each audio is, and it should also show the last updated time. That's on the homepage of the application. Ask me any clarifying questions if need be."

So basically I have this prompt over here.
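The front-end half of this task boils down to turning a stored seconds value into a "minutes and seconds" label plus a "last updated" stamp. Here's a minimal TypeScript sketch of what such helpers could look like; the function names are my own illustration, not code from the Tensor AI codebase:

```typescript
// Format a duration stored in seconds as "m:ss", e.g. 272 -> "4:32".
export function formatDuration(totalSeconds: number): string {
  const seconds = Math.max(0, Math.round(totalSeconds));
  const m = Math.floor(seconds / 60);
  const s = seconds % 60;
  return `${m}:${s.toString().padStart(2, "0")}`;
}

// Render the "last updated" time relative to now, e.g. "updated 12 min ago".
export function formatLastUpdated(updatedAt: Date, now: Date = new Date()): string {
  const minutes = Math.floor((now.getTime() - updatedAt.getTime()) / 60_000);
  if (minutes < 1) return "updated just now";
  if (minutes < 60) return `updated ${minutes} min ago`;
  const hours = Math.floor(minutes / 60);
  return `updated ${hours} h ago`;
}
```

Storing raw seconds in the database and formatting only at render time keeps the migration trivial (one integer column) and lets the UI change its display format later without touching data.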
I'm going to copy this over and paste it into both the left-hand side and the right-hand side, then press enter on both, and we'll see how they perform.

You can see that Qwen3-Coder is already faster, and it asks me clarifying questions, while Kimi K2 just did not ask any clarifying questions, which is pretty interesting. But I really wish Qwen3-Coder had looked at my codebase before asking these questions, which it seems not to have done. So I'll just answer the questions quickly and we'll see how it gets along.

The interesting thing is that the Qwen model is doing a Google search over here and asking me for permission to fetch webpages, so I'll allow that. I should have given the Kimi model a chance to do some searching as well, to make sure it's implementing things correctly. It does seem to use the Mux API, and I think this is correct, but we'll give it a chance to use Google to correct itself.

Okay, it seems like Kimi K2 is done. I'll give it a chance to search online, just to be fair, because I did that with Qwen. So I'll say, "search online to check your implementation of the audio duration calculation," and press enter, and we'll see how that performs. And you can see it found an issue with its own implementation, because it was using the wrong endpoint.

So maybe I should put something like "use Google search whenever you're unsure" in a rules file, rather than having to say it explicitly every time. I think that's just generally good practice.

It seems that despite Kimi K2 starting off slowly, it actually got the job done faster than Qwen, which is quite interesting. But Qwen is proposing changes like pushing the Supabase migration up and so forth, which Kimi K2 did not do, so I think it has a bit more agentic behavior in a way. So I will allow the migration to happen.
But the migration files between the two models are actually slightly different, which I think may cause issues.

I already like Qwen so far, because it's suggesting all these terminal commands as well and making sure the migrations are applied properly. Although one thing I do like about Kimi K2 is that instead of editing the existing migration, it made a brand-new migration to rename the column. I find that some models, for some reason, just decide to edit an existing migration, which can be a problem if the migration has already been applied. So despite me not setting any rules about this, it did pretty well here.

Anyway, it seems Qwen is now going around in circles with some Supabase commands, so I think it's done coding for now. Let's see how the code differs in the two cases. I'm going to open the Qwen version in Cursor and look at what changes are staged, and then, in Cursor again, open the Kimi K2 version. So I have the Kimi K2 version and the Qwen version open, and we'll compare the changes.

Qwen added a migration file, which looks good. In the aggregator step, it used the Mux API to send the file to Mux, and then it immediately tries to get the duration of the asset from Mux, which is possible with the API. But I wish it had added a wait of about 10 or 30 seconds, because it can take some time for Mux to process a file on their end before the duration of the audio is available.

As for the homepage, it added a refetch of the audio summary, it gets the creator and duration, and it ultimately seems to do a pretty good job. This type error is just because I need to update the Supabase types. But let's see how Kimi K2 compares.

Firstly, Kimi K2 seems to have edited more files. It edited the database.ts file, but those files are auto-generated.
So it doesn't matter. It then made this file over here and renamed it to match, as we saw earlier.

As for the aggregator step, what did it do over here? It made a brand-new step that gets the audio duration just after the audio has been uploaded to Mux, rather than combining it into the same step. It retries getting the duration from Mux up to 10 times, and it actually has a fallback where it calculates the duration from the file size and file type instead, which I find quite interesting.

I think I actually prefer this solution, because it's more reliable: it checks whether the file is ready yet, and if not, it waits about three more seconds before checking again, and then gets the audio duration.

As for this listener story here, it made another change that Qwen did not make, and I think that change is quite helpful even though it's not necessary. As for this section over here, it made the exact same change, plus a separate function for formatting the audio duration nicely, and it formatted the last-updated time nicely too. I wish it had used an external library like Luxon to do the formatting, but it should be fine.

Ultimately, I actually prefer Kimi K2's solution here, because it waits until the audio duration is ready before fetching it. In Qwen's case, even though it searched the internet, it immediately tries to get the audio duration, which may not be ready yet; it would then just throw an error and not continue, which can be problematic. So I think Kimi K2 wins this round.

So anyway, I'm going to commit most of both sets of changes, to different branches, and then move on to the next task.
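The behavior I preferred in Kimi K2's version (poll Mux until the asset is ready, then fall back to an estimate) can be sketched like this. I'm abstracting the Mux call behind a callback so the retry logic stands alone; the `status` and `duration` fields mirror Mux's asset object, but the helper names and the size-based fallback formula are illustrative, not the actual code either model wrote:

```typescript
export interface MuxAssetLike {
  status: "preparing" | "ready" | "errored";
  duration?: number; // seconds, populated once the asset is ready
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll for the asset's duration, waiting between attempts because Mux
// needs a moment to process the upload before `duration` is available.
export async function getAudioDuration(
  fetchAsset: () => Promise<MuxAssetLike>,
  fileSizeBytes: number,
  maxAttempts = 10,
  delayMs = 3_000,
): Promise<number> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const asset = await fetchAsset();
    if (asset.status === "ready" && typeof asset.duration === "number") {
      return Math.round(asset.duration);
    }
    if (asset.status === "errored") break; // no point retrying
    await sleep(delayMs);
  }
  // Fallback: estimate duration from file size, assuming ~128 kbit/s audio.
  return Math.round(fileSizeBytes / (128_000 / 8));
}
```

Separating the polling loop from the Mux client also makes it easy to unit-test with a fake `fetchAsset`, which is exactly the kind of "is the file ready yet?" behavior that immediate one-shot fetches get wrong.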
10:51

Task #2 Brief

Now, the next task is going to be a bit more complicated. Right now in the mobile application, if I go to one of the articles that has an image at the bottom, you can see the images over here, and each one just says "alt text," because currently, when the application is collecting and aggregating the AI news, it's not able to actually look at and understand images. So I'm going to implement an AI model called Moondream 2. It's a small vision language model: you can give it any image along with a prompt like "describe this image" or "how many chairs are in this image," and it will give you an answer.
11:30

Token Usage & Cost

So I'm going to restart both sessions over here, but you can see that for the previous task, Qwen used 830,000 input tokens and about 4,500 output tokens, whereas Kimi K2 on the right-hand side used 5.2 million input tokens and 22,000 output tokens. So it used more tokens overall. Note that the total cost shown here is not the cost of Kimi K2; it's what the cost would have been with Claude Sonnet instead, and Kimi K2 is much cheaper than that. The total cost for the previous task was $1.56 for Qwen3-Coder, and for Kimi K2, because I used the YouTube API key I made earlier in the video, it was $0.45. So Kimi K2 was roughly three times cheaper than Qwen3-Coder, and it actually did a better job.

So Kimi K2 looks pretty promising here, despite Qwen3-Coder performing better on the benchmarks. But anyway, we'll see how it goes for the other tasks. I'm going to rerun
12:23

Task #2 Execution

Claude again over here, which is running Kimi K2 as you can see, and I'm going to rerun Qwen over here. Then I'm going to give them the task, which is the Moondream task: I copy the prompt with the links to the Moondream model, paste it on each side, and we'll see which one does a better job again.

It seems like Kimi K2 is done, whilst Qwen is still going around in circles and keeps hitting a token limit, when it really should know how many tokens it has. I think what it's doing is searching the entire codebase for some pretty generic text, merging it all together, and then hitting the token limit.

So far, Qwen has already burnt through $8.50 or $8.58, whereas Kimi K2, after completing the task, has burned through maybe an extra $0.20 or $0.30, if I remember the balance correctly; the counter down here hasn't updated properly yet. But so far it's been less than $1 for Kimi K2, and almost $10 for Qwen. So I think Kimi K2 clearly wins this one.

As for the solution itself, Kimi K2 added a test endpoint, which I think is quite helpful. It's a separate test command; I don't know if it's hooked up to Inngest, though. Maybe it should be a separate API endpoint. It implemented an extract-images-from-markdown function, it generated the alt text, it used the correct model, and it used the Replicate client that I already had in the project locally. It also added a nice feature flag where I can enable and disable alt-text generation, which I think is very neat. I would have to run it myself to see how it performs and feed any errors back to Kimi K2, but ultimately, the solution looks pretty promising, and I will have this update live in the real application shortly.
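The two pure pieces described above (pulling image references out of a markdown article and gating alt-text generation behind a feature flag) are straightforward to sketch in TypeScript. These names and the environment-variable flag are my own invention for illustration, not the actual diff Kimi K2 produced:

```typescript
export interface MarkdownImage {
  alt: string;
  url: string;
}

// Pull ![alt](url) image references out of a markdown article body,
// tolerating an optional "title" after the URL.
export function extractImagesFromMarkdown(markdown: string): MarkdownImage[] {
  const pattern = /!\[([^\]]*)\]\(([^)\s]+)(?:\s+"[^"]*")?\)/g;
  return [...markdown.matchAll(pattern)].map((m) => ({ alt: m[1], url: m[2] }));
}

// Feature flag so alt-text generation can be switched off without a deploy,
// e.g. if the vision-model calls get too slow or expensive.
export function altTextEnabled(
  env: Record<string, string | undefined> = process.env,
): boolean {
  return env.ENABLE_ALT_TEXT_GENERATION === "true";
}
```

Keeping the extraction step as a pure function like this means the vision-model call (the expensive, flaky part) only ever receives a clean list of URLs, and the whole pipeline can be tested without touching the network.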
14:25

Conclusion

So yeah, I ultimately think that Kimi K2 is still better than Qwen3-Coder. I'm sure they'll make improvements to Qwen3-Coder, but right now it's too expensive for the kind of performance it gives, and I don't like some of the solutions it produces. It doesn't take into consideration that the asset may not be ready when it comes to calculating the audio duration, for example, and it's not aware of its own context limit when used via OpenRouter. Maybe there's a different endpoint that is better suited to Qwen, such as the official Alibaba Cloud endpoint, but in my experience, signing up for Alibaba Cloud is quite a hassle. So unless they make Qwen much easier to use and fix some of these issues, I don't think I'd be using Qwen again. I'll stick to Kimi K2, together with Claude Sonnet. My current workflow has now become: since Kimi K2 makes decisions similar to Claude Sonnet's, when Kimi K2 can't solve a problem, I tell Claude Sonnet to continue via Claude Code; otherwise, I stick to Kimi K2 for simpler tasks.

Anyway, if you have watched this far into the video, then do like and subscribe, because I will be posting more content like this.
