Claude Just 5x'd Its Context Window
11:31

Ray Amjad · 14.08.2025 · 2,158 views · 53 likes · updated 18.02.2026
Video description
Join AI Startup School & learn to vibe code and get paying customers for your apps ⤵️ https://www.skool.com/ai-startup-school

MY APPS
📲 Stay up to date on AI with my app Tensor AI
- on iOS: https://apps.apple.com/us/app/ai-news-tensor-ai/id6746403746
- on Android: https://play.google.com/store/apps/details?id=app.tensorai.tensorai

MY CLASSES
🚀 Claude Code Masterclass: https://www.masterclaudecode.com/ - Use coupon code 9LQCQ9UE for 20% off

CONNECT WITH ME
📸 Instagram: https://www.instagram.com/theramjad/
👨‍💻 LinkedIn: https://www.linkedin.com/in/rayamjad/
🌍 My website/blog: https://www.rayamjad.com/

Links:
- https://x.com/claudeai/status/1955299573620261343
- https://abanteai.github.io/LoCoDiff-bench/
- https://icip-cas.github.io/LiveMCPBench/
- https://research.trychroma.com/context-rot
- https://docs.anthropic.com/en/docs/build-with-claude/context-windows#1m-token-context-window
- https://cline.bot/blog/two-ways-to-advantage-of-claude-sonnet-4s-1m-context-window-in-cline
- https://x.com/omarsar0/status/1955408417616695671
- https://every.to/vibe-check/vibe-check-claude-sonnet-4-now-has-a-1-million-token-context-window

Timestamps:
00:00 - Intro
00:08 - Pricing
00:38 - LoCoDiff
02:29 - LiveMCPBench
03:08 - Context Rot
05:10 - Don't Fill Up the Context Window
06:50 - Using It
08:12 - Industry Reactions
11:02 - Conclusion

Table of contents (9 segments)

  1. 0:00 Intro (41 words)
  2. 0:08 Pricing (92 words)
  3. 0:38 LoCoDiff (433 words)
  4. 2:29 LiveMCPBench (149 words)
  5. 3:08 Context Rot (499 words)
  6. 5:10 Don't Fill Up the Context Window (402 words)
  7. 6:50 Using It (327 words)
  8. 8:12 Industry Reactions (691 words)
  9. 11:02 Conclusion (111 words)
0:00

Intro

As of yesterday, Anthropic supports up to 1 million tokens of context on its Claude Sonnet 4 model, and I'll be talking about what this means for you and how you can make the most of it.
0:08

Pricing

The pricing has also gone up: for requests between 200,000 tokens and 1 million tokens, you now pay $6 per million input tokens, double the standard rate, and $22.50 per million output tokens, 50% more than the standard rate that still applies under 200,000 tokens, which was the previous limit. This puts Claude Sonnet 4 on par with Gemini 2.5 Pro, which also has a 1 million token context window and is quite a bit cheaper than Claude Sonnet 4.
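As a rough sketch, that tiered pricing works out like this. The standard rates are derived from the "double" and "50% more" figures above, and I'm assuming, as the Anthropic docs linked below describe, that the premium rate applies to the whole request once the input passes 200,000 tokens:

```python
def sonnet4_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost under the tiered pricing described
    in the video. Rates are USD per million tokens; the premium tier
    is assumed to cover the entire request once input exceeds 200k."""
    if input_tokens > 200_000:
        in_rate, out_rate = 6.00, 22.50   # long-context premium rates
    else:
        in_rate, out_rate = 3.00, 15.00   # standard Sonnet 4 rates
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# A 150k-token request is billed at the standard rate:
print(round(sonnet4_cost_usd(150_000, 4_000), 2))  # 0.51
# Past the 200k threshold, the premium rates kick in:
print(round(sonnet4_cost_usd(800_000, 4_000), 2))  # 4.89
```

So the same output on an 800k-token prompt costs over half of the input price again, which is why it pays to know when you actually need the long window.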
0:38

LoCoDiff

Despite that, Claude Sonnet 4 does do better on some benchmarks. There's a benchmark called LoCoDiff, a natural long-context benchmark, and the way it works is this: they take some really long code files and generate the git log output for each file. For an example file, a shopping list, the log shows the commits made to the file, the lines that were removed, and the lines that were added, and the model is meant to take this entire input and come up with the final state of the file after all those changes. This happens for about 200 files, and the answer is binary: either the model reproduces the file perfectly, or it got it wrong. Across those files, as you increase the context, Sonnet 4 seems to do really well; its success rate stays pretty constant on long-context tasks, whereas Gemini 2.5 Pro does pretty badly, dropping to about 20% compared to Sonnet 4's 66%. They haven't yet run the benchmark up to 1 million tokens of context, so it will be quite interesting to see how well Sonnet 4 holds up there; if they publish that, I will link it down below in the comments section. You can also see that for very long context, Claude Opus 4.1 does pretty well, and Sonnet 4 and Sonnet 4 Thinking actually do better than Opus 4.1 on longer-context tasks. This is also true across different programming languages; the results are linked down below. Comparing against Opus 4.1, whilst it does start out pretty strong on very long context tasks, it quickly falls off, whereas Sonnet 4 seems to hold up pretty well. And whilst the benchmark has still not been updated for the 1 million token context window, it does seem promising that Sonnet 4 will still do better than Gemini 2.5 Pro at a similar context window size. There's another benchmark called LiveMCPBench, and basically what they
2:29

LiveMCPBench

did is they took 10 leading models and gave each model access to 70 MCP servers with 527 tools overall. They gave each model different tasks, measured how well it completed each one, and came up with a success rate. Claude Sonnet 4 did the best overall, beating Claude Opus 4 at a better price as well; in the graph, it does better than every other model. I'm hoping they update this for Claude Sonnet 4's 1 million token context window and throw even more MCP tools at it. But the general idea is that Sonnet 4 is really good at tool calling too: knowing which tools to use and when to use them. And even though the context window is
3:08

Context Rot

bigger, you still want to be aware of what you're actually putting into it. There's a good paper linked down below called Context Rot, by Chroma. What they did is take the needle-in-a-haystack experiment much further. In needle-in-a-haystack, you fill the entire context window with some random text, like a story, and include one piece of information the model is meant to retrieve, for example someone's name or a place. In this case, the needle is: "The best piece of writing advice I got from my college classmate was to write every week." Everything around it is the haystack. Then you ask the model a question such as "What is the best piece of writing advice I got from my college classmate?" If it finds the needle and answers successfully, it scores well, and most new models these days do seem to get perfect scores on plain needle-in-a-haystack. But Chroma took the experiment further and included distractors, which distract the model and can lead it down the wrong path. Alongside the needle, there's a distractor: "I think the best writing tip I received from my college professor was to write every day." It's semantically similar to the needle, but it can lead the model down the wrong track. They also mixed in more confusing distractors, such as "I thought the best piece of writing advice I got from a college classmate was to write each essay in four different styles, but not anymore," which flips the needle around. They placed distractors in different positions and measured how many distractors, and which ones, impact performance the most. For all the models, with no distractors, performance stays more or less the same as you increase the number of input tokens. But with one distractor, performance decreases as tokens increase, and with four distractors it drops massively. Different distractors also hit performance differently: the flipped-needle distractor above is the most confusing, and most models do pretty badly on it.
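To make the setup concrete, here's a minimal sketch (not Chroma's actual code) of how a haystack with one needle and a configurable number of distractors could be assembled; the filler paragraphs stand in for the story-like padding the real experiment uses:

```python
import random

NEEDLE = ("The best piece of writing advice I got from my college "
          "classmate was to write every week.")

DISTRACTORS = [
    # Semantically close to the needle, but wrong source and wrong advice:
    "I think the best writing tip I received from my college professor "
    "was to write every day.",
    # The "flipped needle" variant, which confuses models the most:
    "I thought the best piece of writing advice I got from a college "
    "classmate was to write each essay in four different styles, "
    "but not anymore.",
]

def build_haystack(filler_paragraphs: list[str],
                   n_distractors: int, seed: int = 0) -> str:
    """Scatter the needle and n distractors at random positions in the
    filler text, mimicking the Context Rot setup described above."""
    rng = random.Random(seed)
    chunks = list(filler_paragraphs)
    for snippet in [NEEDLE] + DISTRACTORS[:n_distractors]:
        chunks.insert(rng.randrange(len(chunks) + 1), snippet)
    return "\n\n".join(chunks)

QUESTION = ("What is the best piece of writing advice "
            "I got from my college classmate?")

prompt = build_haystack(["Some filler paragraph."] * 50, n_distractors=2)
```

You would then send `prompt` plus `QUESTION` to the model and check whether the answer is "write every week" rather than one of the distractor answers.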
5:10

Don't Fill Up the Context Window

What this means for you is that even though you have a 1 million token context window, you should not fill it up just for the hell of it. You should still be careful about what you're actually putting in there, because otherwise model performance can get worse from the distractors you've accumulated in the context window. For example, this codebase is a monorepo with an Expo application and a Next.js application, and the Next.js application has a landing page, which is a homepage. Previously, I loaded the entire codebase into the context window and said, "Hey, can you edit the button on the home screen?" It edited the button on the home screen of the mobile application instead of the landing page of the Next.js application, because there were two different home screens and two different buttons. That's a distractor: the needle I wanted was the home screen of the Next.js application, and the distractor was the home screen of the Expo application. Likewise, if you have a codebase with multiple payment providers, say you're using both Paddle and Stripe, and you add a refund feature and load in one provider's docs, the model can get pretty confused and perform worse, because you've introduced distractors: things that are semantically related to each other but different in other ways. In another project that I was vibe coding recently with Claude Code, I asked it to move a modal into a separate folder and a separate file. It copied the modal over but didn't actually delete it from the original file, so I had two different modals. When I later asked it to edit the modal, not realizing there were still two, it sometimes edited the old one and sometimes the new one, and I kept wondering what the hell was going on until I looked at the codebase and realized I'd never removed the old modal. To actually use the 1 million token
6:50

Using It

context window right now, you can't use it via the Claude Code subscription; you have to use the Anthropic API. In Claude Code, you first log out with /logout, then log in again by running Claude Code, choosing the Anthropic Console account option (API usage billing), and linking it to your account. Then you run /model sonnet[1m]. If I say hi, you can see that it responds, and in my Anthropic console, grouping by context window, that request used the under-200,000-token context window. If you want to use the 1 million token context window in your own application, you just follow these instructions, but note that you have to be in usage tier 4, which means you need to have purchased at least $400 in credits from Anthropic. If you still want to try 1 million tokens of context without spending $400 on a credit purchase, you can use OpenRouter instead, because they are already in usage tier 4 or higher, I think. They route your request via their own API to Anthropic's servers, and you pay the same model pricing, plus a small markup when buying credits on OpenRouter. Or you can use something like Cline, which you get in the left-hand sidebar, then link it to your account and top up with credits. Or you can use something like Cursor instead, because I'm sure it's supported in Cursor now.
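If you're calling the API directly, the Anthropic docs linked in the description describe a beta flag for the 1 million token window. Here's a minimal sketch with the Python SDK, assuming the beta flag name and model ID from those docs (both may change):

```python
def long_context_request(prompt: str) -> dict:
    """Build the kwargs for a 1M-context Claude Sonnet 4 call.
    The betas flag is what opts the request into the 1M window,
    per the Anthropic docs linked above."""
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "betas": ["context-1m-2025-08-07"],
        "messages": [{"role": "user", "content": prompt}],
    }

# With the `anthropic` package installed and ANTHROPIC_API_KEY set:
#   import anthropic
#   client = anthropic.Anthropic()
#   reply = client.beta.messages.create(**long_context_request("hi"))
```

Remember that the premium pricing from earlier applies automatically once your input passes 200,000 tokens, whether you go direct or through OpenRouter.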
8:12

Industry Reactions

And Cline did release an article yesterday about how to make the most of it. Firstly, they said to stop being context-stingy: you don't have to be careful or strategic about what to include in context. Just pull in the MCP servers, documentation, test files and so forth, and load everything that's relevant. Relevant is the keyword here: you still don't want to load in distractors. With the bigger context, you can use Plan mode more effectively now, loading in your entire project context, discussing architecture decisions thoroughly, exploring edge cases, and refining your approach before switching to Act mode. And you can have longer development cycles, going through way more iterations before you have to /compact or summarize the conversation. Someone on Twitter did some vibe checks on Claude Sonnet 4 with 1 million tokens of context against Gemini 2.5 Pro on a paper-analysis task. They loaded in a bunch of papers, said "please find interesting and insightful connections in all these papers," and compared the responses. They say Gemini 2.5 Pro is a beast: it provided a very detailed and comprehensive response. But Sonnet 4 prefers to output more concise responses, which is useful in the context of AI agents, and apparently it did highlight a lot of gems from the papers provided. I will link this down below so you can look at the whole thing. Every also did a vibe check of the 1 million token context window, with three different tests. They say 1 million tokens is basically the length of all the Harry Potter books combined. For the first test, they hid two movie scenes in 1 million tokens of context and asked Claude to find those scenes and do a detailed analysis of them in one shot, compared against Gemini 2.5 Flash, which also has a 1 million token context window, and Gemini 2.5 Pro. Sonnet 4 was the fastest of the three. Gemini incorrectly identified the title of the movie as another movie, whereas Sonnet 4 never hallucinated a title; it just declined to assign one. But Claude gave a much briefer analysis, which it has a habit of doing; it's much more concise. So if you do want high-quality, detailed analysis, Gemini is a better bet, which we also saw in the previous tweet. Then they tested the ability to analyze code: they put in the entire content management system for their website, which is 250,000 tokens of Ruby on Rails code, plus 700,000 tokens of padding code, which is, I guess, random related code. Sonnet was faster by about three seconds, but it did score lower on their own vibe check. Then they got Claude to play AI Diplomacy, their own variation of the strategy game Diplomacy, and they say Claude did surprisingly well. With aggressive prompts, Claude Sonnet 4 came in second only behind o3, and it was also really fast, completing games faster than Gemini 2.5 Flash: it took two minutes, and on the aggressive setting, 1.7 minutes. Their verdict is that Claude Sonnet 4 makes very good use of its longer context window if you need a model that's fast and reliably free of hallucinations on long-context tasks. And of course, as we mentioned earlier in the video, it's more expensive than Gemini, so you have to bear that in mind. Now, once it's released as part of the Claude Code subscription,
11:02

Conclusion

which I hope is soon, I will be trying it out with my own application, Tensor AI, an AI news application for staying up to date on the latest AI news. Right now, the codebase is 363,000 tokens, so it should fit quite comfortably into a 1 million token context window, and it should mean I'm able to code for quite a long time. After doing some more testing myself, I will have my own vibe check ready, along with some more best practices on how to use it. So if you do want to see that video, then do subscribe.
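If you want to check whether your own codebase fits before loading it all in, here is a rough sketch using the common ~4 characters-per-token heuristic. That ratio is an assumption; for exact counts you'd use the API's token counting, and the extension list is just an example:

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic; real counts need the tokenizer

def estimate_codebase_tokens(root: str,
                             exts=(".py", ".ts", ".tsx")) -> int:
    """Walk a project tree and estimate its size in tokens, so you can
    check whether it fits the 1M window before loading it all in."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        # Skip dependency folders and non-source files.
        if path.suffix in exts and "node_modules" not in path.parts:
            total_chars += len(path.read_text(errors="ignore"))
    return total_chars // CHARS_PER_TOKEN
```

A codebase that estimates around 363,000 tokens, like the one mentioned above, leaves plenty of headroom in a 1 million token window for conversation history and tool output.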
