New Qwen3 Coder vs Kimi K2: When Benchmarks Lie

Ray Amjad · 23 Jul 2025 · 20,098 views · 459 likes · updated 18 Feb 2026
Video description
Join AI Startup School & learn to vibe code and get paying customers for your apps ⤵️ https://www.skool.com/ai-startup-school

📲 Stay up to date on AI with my app Tensor AI
- on iOS: https://apps.apple.com/us/app/ai-news-tensor-ai/id6746403746
- on Android: https://play.google.com/store/apps/details?id=app.tensorai.tensorai

CONNECT WITH ME
- 📸 Instagram: https://www.instagram.com/theramjad/
- 👨‍💻 LinkedIn: https://www.linkedin.com/in/rayamjad/
- 🌍 My website/blog: https://www.rayamjad.com/

Links Mentioned:
- Qwen-3-Coder Announcement: https://x.com/Alibaba_Qwen/status/1947766835023335516
- GitHub for Qwen3-Coder: https://github.com/QwenLM/qwen-code
- OpenRouter for Qwen3-Coder: https://openrouter.ai/qwen/qwen3-coder

Timestamps:
- 00:00 - Intro
- 00:47 - Setup
- 03:20 - What We'll Be Doing
- 04:06 - Task #1
- 10:51 - Task #2 Brief
- 11:30 - Token Usage & Cost
- 12:23 - Task #2 Execution
- 14:25 - Conclusion

Table of contents (8 segments)

  1. 0:00 Intro (190 words)
  2. 0:47 Setup (501 words)
  3. 3:20 What We'll Be Doing (191 words)
  4. 4:06 Task #1 (1407 words)
  5. 10:51 Task #2 Brief (141 words)
  6. 11:30 Token Usage & Cost (166 words)
  7. 12:23 Task #2 Execution (363 words)
  8. 14:25 Conclusion (245 words)
0:00

Intro

Okay, so about four hours ago, Alibaba came out with a brand-new model, Qwen3-Coder, and we'll be comparing it against Kimi K2, which came out last week, to see which performs better on a real-world production codebase. On the benchmarks, it already seems to outperform Kimi K2, which is pretty surprising given the two releases are barely a week apart. It outperforms Claude Sonnet 4 as well in many cases, and we'll see whether it actually holds up on a real-world production codebase. I find that a lot of models do well on these benchmarks, but each model has its own feel once you try it on a production codebase: some that score well on benchmarks do poorly in real codebases, and vice versa.

But basically, you can use it in a similar way to Claude Code and the Kimi CLI. If you go to GitHub, which is linked down below as well,
0:47

Setup

you can download the Qwen3-Coder CLI, a Kimi CLI equivalent. It's actually forked from, or adapted from, the Gemini CLI, but optimized for Qwen3-Coder. You install it with npm (the package comes from the qwen-code repo linked above):

npm install -g @qwen-code/qwen-code

Then you can run qwen anywhere on your computer. But you have to set a few environment variables before running the qwen command, so I'm going to quickly show you how to do that. I'll be doing it in the terminal over here, and I'm going to close the window where I have Kimi K2 running. After installing, you want to run the export commands below, replacing the API key and base URL accordingly.

Now, and this is important: even though the variables are called OPENAI_API_KEY and OPENAI_BASE_URL, it is not actually using an OpenAI model. You want to use OpenRouter here. The model isn't hosted on OpenRouter directly; OpenRouter forwards the request to the relevant backend, which is usually Alibaba's servers.

You can see the Qwen3-Coder model over here. Most of the requests are being forwarded to China, where Alibaba is hosting the model, and it has good uptime. The instance hosted in the US seems to be struggling right now.

Basically, we want to set the model name to the OpenRouter slug, so copy that over, go back to the terminal, and paste it. We also need to set the base URL, which is the OpenRouter base URL; I believe it's near the bottom over here, and it will also be in the description down below. And if I go back to my terminal…

Now for the OpenAI API key (again, it's not actually an OpenAI key): you want to use an OpenRouter key. Go to OpenRouter, go to Keys, and make a new key. I'm going to call it YouTube for this YouTube video. And don't use my key, because it'll be expired by the time you're watching this.
Then copy the key, go back to your terminal, and paste it. You should now have your environment variables set like this:

export OPENAI_API_KEY=<your-openrouter-key>
export OPENAI_BASE_URL=https://openrouter.ai/api/v1
export OPENAI_MODEL=qwen/qwen3-coder

Then run qwen. We can make sure it's actually running Qwen and pointing at the right model. So we can ask, "Who are you?", wait a second, and you can see it says, "I'm Qwen," and so on. So this is the correct model. And if you really want to make sure, you can ask a politically sensitive question, and it will be refused, because it is a Chinese model.

So now I have Qwen running on the left-hand side over here in a Qwen-specific folder. And
3:20

What We’ll Be Doing

I have Kimi K2 running on the right-hand side over here in a Kimi K2 folder. And this is the exact same codebase: a production codebase from an application I previously made called Tensor AI.

Tensor AI basically helps you stay up to date with the latest AI news. If you feel like there's so much AI news that it's hard to know what's relevant for you, this application only notifies you about the AI news that is relevant to you, and you can listen to a summary of the news from the last 24 hours. You can download it using a link in the description down below.

If you're also interested in making an application just like this and making money from mobile apps and web apps, I have an AI Startup School where I teach everything from building mobile apps to monetizing and selling them. A bunch of people have already joined and gotten pretty good results from what I teach in the community.
4:06

Task #1

Getting back to the application I have running on my phone over here, you can see that Qwen3-Coder is already in the feed, so the app is very up to date. I want to change this "five-minute, updated hourly" label to the actual length of the episode. Currently, "five-minute" is just static, but the actual length varies; it's four minutes right now. I want the application to show how many minutes and seconds each of the hourly summaries is.

So where it says "five minutes," that's hard-coded into the codebase, because I knew it would be roughly five minutes. Changing it requires a few things: a new database migration, a way of actually calculating the length of the audio, and so on. We'll see which model does a better job here.

I'm going to use SuperWhisper to describe the change that I want, so I'll press start recording:

"Hey, so basically right now I have an audio summary on the homepage where it's hard-coded as five minutes, which is static. When the aggregator that makes this audio summary runs, it should actually calculate the length of the audio summary. That should use something like Mux's API, or it should calculate the length by downloading the audio from Cloudflare, where it's stored, or something like that. Basically, it should calculate the length and store it in the database; there should be a new database column for the length of the audio in seconds. Then it should update the front-end UI to show, in minutes and seconds, how long each audio is, and it should also show the last updated time. That's on the homepage of the application. Ask me any clarifying questions if need be."

So basically I have this prompt over here.
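The front-end half of this task boils down to turning a stored seconds value into a "minutes and seconds" label plus a "last updated" stamp. Here's a minimal TypeScript sketch of what such helpers could look like; the function names are my own illustration, not code from the Tensor AI codebase:

```typescript
// Format a duration stored in seconds as "m:ss", e.g. 272 -> "4:32".
export function formatDuration(totalSeconds: number): string {
  const seconds = Math.max(0, Math.round(totalSeconds));
  const m = Math.floor(seconds / 60);
  const s = seconds % 60;
  return `${m}:${s.toString().padStart(2, "0")}`;
}

// Render the "last updated" time relative to now, e.g. "updated 12 min ago".
export function formatLastUpdated(updatedAt: Date, now: Date = new Date()): string {
  const minutes = Math.floor((now.getTime() - updatedAt.getTime()) / 60_000);
  if (minutes < 1) return "updated just now";
  if (minutes < 60) return `updated ${minutes} min ago`;
  const hours = Math.floor(minutes / 60);
  return `updated ${hours} h ago`;
}
```

Storing raw seconds in the database and formatting only at render time keeps the migration trivial (one integer column) and lets the UI change its display format later without touching data.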
I'm going to copy this over and paste it into both the left-hand side and the right-hand side, then press enter on both, and we'll see how they perform.

You can see that Qwen3-Coder is already faster, and it asks me clarifying questions, while Kimi K2 just did not ask any clarifying questions, which is pretty interesting. But I really wish Qwen3-Coder had looked at my codebase before asking these questions, which it seems not to have done. So I'll just answer the questions quickly and we'll see how it gets along.

The interesting thing is that the Qwen model is doing a Google search over here and asking me for permission to fetch webpages, so I'll allow that. I should have given the Kimi model a chance to do some searching as well, to make sure it's implementing things correctly. It does seem to use the Mux API, and I think this is correct, but we'll give it a chance to use Google to correct itself.

Okay, it seems like Kimi K2 is done. I'll give it a chance to search online, just to be fair, because I did that with Qwen. So I'll say, "search online to check your implementation of the audio duration calculation," and press enter, and we'll see how that performs. And you can see it found an issue with its own implementation, because it was using the wrong endpoint.

So maybe I should put something like "use Google search whenever you're unsure" in a rules file, rather than having to say it explicitly every time. I think that's just generally good practice.

It seems that despite Kimi K2 starting off slowly, it actually got the job done faster than Qwen, which is quite interesting. But Qwen is proposing changes like pushing the Supabase migration up and so forth, which Kimi K2 did not do, so I think it has a bit more agentic behavior in a way. So I will allow the migration to happen.
But the migration files between the two models are actually slightly different, which I think may cause issues.

I already like Qwen so far, because it's suggesting all these terminal commands as well and making sure the migrations are applied properly. Although one thing I do like about Kimi K2 is that instead of editing the existing migration, it made a brand-new migration to rename the column. I find that some models, for some reason, just decide to edit an existing migration, which can be a problem if the migration has already been applied. So despite me not setting any rules about this, it did pretty well here.

Anyway, it seems Qwen is now going around in circles with some Supabase commands, so I think it's done coding for now. Let's see how the code differs in the two cases. I'm going to open the Qwen version in Cursor and look at what changes are staged, and then, in Cursor again, open the Kimi K2 version. So I have the Kimi K2 version and the Qwen version open, and we'll compare the changes.

Qwen added a migration file, which looks good. In the aggregator step, it used the Mux API to send the file to Mux, and then it immediately tries to get the duration of the asset from Mux, which is possible with the API. But I wish it had added a wait of about 10 or 30 seconds, because it can take some time for Mux to process a file on their end before the duration of the audio is available.

As for the homepage, it added a refetch of the audio summary, it gets the creator and duration, and it ultimately seems to do a pretty good job. This type error is just because I need to update the Supabase types. But let's see how Kimi K2 compares.

Firstly, Kimi K2 seems to have edited more files. It edited the database.ts file, but those files are auto-generated.
So it doesn't matter. It then made this file over here and renamed it to match, as we saw earlier.

As for the aggregator step, what did it do over here? It made a brand-new step that gets the audio duration just after the audio has been uploaded to Mux, rather than combining it into the same step. It retries getting the duration from Mux up to 10 times, and it actually has a fallback where it calculates the duration from the file size and file type instead, which I find quite interesting.

I think I actually prefer this solution, because it's more reliable: it checks whether the file is ready yet, and if not, it waits about three more seconds before checking again, and then gets the audio duration.

As for this listener story here, it made another change that Qwen did not make, and I think that change is quite helpful even though it's not necessary. As for this section over here, it made the exact same change, plus a separate function for formatting the audio duration nicely, and it formatted the last-updated time nicely too. I wish it had used an external library like Luxon to do the formatting, but it should be fine.

Ultimately, I actually prefer Kimi K2's solution here, because it waits until the audio duration is ready before fetching it. In Qwen's case, even though it searched the internet, it immediately tries to get the audio duration, which may not be ready yet; it would then just throw an error and not continue, which can be problematic. So I think Kimi K2 wins this round.

So anyway, I'm going to commit most of both sets of changes, to different branches, and then move on to the next task.
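The behavior I preferred in Kimi K2's version (poll Mux until the asset is ready, then fall back to an estimate) can be sketched like this. I'm abstracting the Mux call behind a callback so the retry logic stands alone; the `status` and `duration` fields mirror Mux's asset object, but the helper names and the size-based fallback formula are illustrative, not the actual code either model wrote:

```typescript
export interface MuxAssetLike {
  status: "preparing" | "ready" | "errored";
  duration?: number; // seconds, populated once the asset is ready
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll for the asset's duration, waiting between attempts because Mux
// needs a moment to process the upload before `duration` is available.
export async function getAudioDuration(
  fetchAsset: () => Promise<MuxAssetLike>,
  fileSizeBytes: number,
  maxAttempts = 10,
  delayMs = 3_000,
): Promise<number> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const asset = await fetchAsset();
    if (asset.status === "ready" && typeof asset.duration === "number") {
      return Math.round(asset.duration);
    }
    if (asset.status === "errored") break; // no point retrying
    await sleep(delayMs);
  }
  // Fallback: estimate duration from file size, assuming ~128 kbit/s audio.
  return Math.round(fileSizeBytes / (128_000 / 8));
}
```

Separating the polling loop from the Mux client also makes it easy to unit-test with a fake `fetchAsset`, which is exactly the kind of "is the file ready yet?" behavior that immediate one-shot fetches get wrong.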
10:51

Task #2 Brief

Now, the next task is going to be a bit more complicated. Right now in the mobile application, if I go to one of the articles that has an image at the bottom, you can see the images over here, and each one just says "alt text," because currently, when the application is collecting and aggregating the AI news, it's not able to actually look at and understand images. So I'm going to implement an AI model called Moondream 2. It's a small vision language model: you can give it any image along with a prompt like "describe this image" or "how many chairs are in this image," and it will give you an answer.
11:30

Token Usage & Cost

So I'm going to restart both sessions over here, but you can see that for the previous task, Qwen used 830,000 input tokens and about 4,500 output tokens, whereas Kimi K2 on the right-hand side used 5.2 million input tokens and 22,000 output tokens. So it used more tokens overall. Note that the total cost shown here is not the cost of Kimi K2; it's what the cost would have been with Claude Sonnet instead, and Kimi K2 is much cheaper than that. The total cost for the previous task was $1.56 for Qwen3-Coder, and for Kimi K2, because I used the YouTube API key I made earlier in the video, it was $0.45. So Kimi K2 was roughly three times cheaper than Qwen3-Coder, and it actually did a better job.

So Kimi K2 looks pretty promising here, despite Qwen3-Coder performing better on the benchmarks. But anyway, we'll see how it goes for the other tasks. I'm going to rerun
12:23

Task #2 Execution

Claude again over here, which is running Kimi K2 as you can see, and I'm going to rerun Qwen over here. Then I'm going to give them the task, which is the Moondream task: I copy the prompt with the links to the Moondream model, paste it on each side, and we'll see which one does a better job again.

It seems like Kimi K2 is done, whilst Qwen is still going around in circles and keeps hitting a token limit, when it really should know how many tokens it has. I think what it's doing is searching the entire codebase for some pretty generic text, merging it all together, and then hitting the token limit.

So far, Qwen has already burnt through $8.50 or $8.58, whereas Kimi K2, after completing the task, has burned through maybe an extra $0.20 or $0.30, if I remember the balance correctly; the counter down here hasn't updated properly yet. But so far it's been less than $1 for Kimi K2, and almost $10 for Qwen. So I think Kimi K2 clearly wins this one.

As for the solution itself, Kimi K2 added a test endpoint, which I think is quite helpful. It's a separate test command; I don't know if it's hooked up to Inngest, though. Maybe it should be a separate API endpoint. It implemented an extract-images-from-markdown function, it generated the alt text, it used the correct model, and it used the Replicate client that I already had in the project locally. It also added a nice feature flag where I can enable and disable alt-text generation, which I think is very neat. I would have to run it myself to see how it performs and feed any errors back to Kimi K2, but ultimately, the solution looks pretty promising, and I will have this update live in the real application shortly.
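The two pure pieces described above (pulling image references out of a markdown article and gating alt-text generation behind a feature flag) are straightforward to sketch in TypeScript. These names and the environment-variable flag are my own invention for illustration, not the actual diff Kimi K2 produced:

```typescript
export interface MarkdownImage {
  alt: string;
  url: string;
}

// Pull ![alt](url) image references out of a markdown article body,
// tolerating an optional "title" after the URL.
export function extractImagesFromMarkdown(markdown: string): MarkdownImage[] {
  const pattern = /!\[([^\]]*)\]\(([^)\s]+)(?:\s+"[^"]*")?\)/g;
  return [...markdown.matchAll(pattern)].map((m) => ({ alt: m[1], url: m[2] }));
}

// Feature flag so alt-text generation can be switched off without a deploy,
// e.g. if the vision-model calls get too slow or expensive.
export function altTextEnabled(
  env: Record<string, string | undefined> = process.env,
): boolean {
  return env.ENABLE_ALT_TEXT_GENERATION === "true";
}
```

Keeping the extraction step as a pure function like this means the vision-model call (the expensive, flaky part) only ever receives a clean list of URLs, and the whole pipeline can be tested without touching the network.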
14:25

Conclusion

So yeah, I ultimately think that Kimi K2 is still better than Qwen3-Coder. I'm sure they'll make improvements to Qwen3-Coder, but right now it's too expensive for the kind of performance it gives, and I don't like some of the solutions it produces. It doesn't take into consideration that the asset may not be ready when it comes to calculating the audio duration, for example, and it's not aware of its own context limit when used via OpenRouter. Maybe there's a different endpoint that is better suited to Qwen, such as the official Alibaba Cloud endpoint, but in my experience, signing up for Alibaba Cloud is quite a hassle. So unless they make Qwen much easier to use and fix some of these issues, I don't think I'd be using Qwen again. I'll stick to Kimi K2, together with Claude Sonnet. My current workflow has now become: since Kimi K2 makes decisions similar to Claude Sonnet's, when Kimi K2 can't solve a problem, I tell Claude Sonnet to continue via Claude Code; otherwise, I stick to Kimi K2 for simpler tasks.

Anyway, if you have watched this far into the video, then do like and subscribe, because I will be posting more content like this.
