Claude Just 5x'd Its Context Window
11:31

Ray Amjad · 14.08.2025 · 2,158 views · 53 likes · updated 18.02.2026
Video description
Join AI Startup School & learn to vibe code and get paying customers for your apps ⤵️ https://www.skool.com/ai-startup-school

MY APPS
📲 Stay up to date on AI with my app Tensor AI
- on iOS: https://apps.apple.com/us/app/ai-news-tensor-ai/id6746403746
- on Android: https://play.google.com/store/apps/details?id=app.tensorai.tensorai

MY CLASSES
🚀 Claude Code Masterclass: https://www.masterclaudecode.com/ - Use coupon code 9LQCQ9UE for 20% off

CONNECT WITH ME
📸 Instagram: https://www.instagram.com/theramjad/
👨‍💻 LinkedIn: https://www.linkedin.com/in/rayamjad/
🌍 My website/blog: https://www.rayamjad.com/

Links:
- https://x.com/claudeai/status/1955299573620261343
- https://abanteai.github.io/LoCoDiff-bench/
- https://icip-cas.github.io/LiveMCPBench/
- https://research.trychroma.com/context-rot
- https://docs.anthropic.com/en/docs/build-with-claude/context-windows#1m-token-context-window
- https://cline.bot/blog/two-ways-to-advantage-of-claude-sonnet-4s-1m-context-window-in-cline
- https://x.com/omarsar0/status/1955408417616695671
- https://every.to/vibe-check/vibe-check-claude-sonnet-4-now-has-a-1-million-token-context-window

Timestamps:
00:00 - Intro
00:08 - Pricing
00:38 - LoCoDiff
02:29 - LiveMCPBench
03:08 - Context Rot
05:10 - Don't Fill Up the Context Window
06:50 - Using It
08:12 - Industry Reactions
11:02 - Conclusion

Table of contents (9 segments)

  1. 0:00 Intro (41 words)
  2. 0:08 Pricing (92 words)
  3. 0:38 LoCoDiff (433 words)
  4. 2:29 LiveMCPBench (149 words)
  5. 3:08 Context Rot (499 words)
  6. 5:10 Don't Fill Up the Context Window (402 words)
  7. 6:50 Using It (327 words)
  8. 8:12 Industry Reactions (691 words)
  9. 11:02 Conclusion (111 words)
0:00

Intro

As of yesterday, Anthropic supports up to 1 million tokens of context on its Claude Sonnet 4 model, and I'll be talking about what this means for you and how you can make the most of it.
0:08

Pricing

The pricing has also gone up: for requests between 200,000 tokens and 1 million tokens, you now pay $6 per million input tokens, double the standard rate, and $22.50 per million output tokens, 50% more than the standard rate that still applies under 200,000 tokens, which was the previous limit. This puts Claude Sonnet 4 on par with Gemini 2.5 Pro, which also has a 1 million token context window and is quite a bit cheaper than Claude Sonnet 4.
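As a rough sketch, that tiered pricing works out like this. The standard rates are derived from the "double" and "50% more" figures above, and I'm assuming, as the Anthropic docs linked below describe, that the premium rate applies to the whole request once the input passes 200,000 tokens:

```python
def sonnet4_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost under the tiered pricing described
    in the video. Rates are USD per million tokens; the premium tier
    is assumed to cover the entire request once input exceeds 200k."""
    if input_tokens > 200_000:
        in_rate, out_rate = 6.00, 22.50   # long-context premium rates
    else:
        in_rate, out_rate = 3.00, 15.00   # standard Sonnet 4 rates
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# A 150k-token request is billed at the standard rate:
print(round(sonnet4_cost_usd(150_000, 4_000), 2))  # 0.51
# Past the 200k threshold, the premium rates kick in:
print(round(sonnet4_cost_usd(800_000, 4_000), 2))  # 4.89
```

So the same output on an 800k-token prompt costs over half of the input price again, which is why it pays to know when you actually need the long window.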
0:38

LoCoDiff

Despite that, Claude Sonnet 4 does do better on some benchmarks. There's a benchmark called LoCoDiff, a natural long-context benchmark, and the way it works is this: they take some really long code files and generate the git log output for each file. For an example file, a shopping list, the log shows the commits made to the file, the lines that were removed, and the lines that were added, and the model is meant to take this entire input and come up with the final state of the file after all those changes. This happens for about 200 files, and the answer is binary: either the model reproduces the file perfectly, or it got it wrong. Across those files, as you increase the context, Sonnet 4 seems to do really well; its success rate stays pretty constant on long-context tasks, whereas Gemini 2.5 Pro does pretty badly, dropping to about 20% compared to Sonnet 4's 66%. They haven't yet run the benchmark up to 1 million tokens of context, so it will be quite interesting to see how well Sonnet 4 holds up there; if they publish that, I will link it down below in the comments section. You can also see that for very long context, Claude Opus 4.1 does pretty well, and Sonnet 4 and Sonnet 4 Thinking actually do better than Opus 4.1 on longer-context tasks. This is also true across different programming languages; the results are linked down below. Comparing against Opus 4.1, whilst it does start out pretty strong on very long context tasks, it quickly falls off, whereas Sonnet 4 seems to hold up pretty well. And whilst the benchmark has still not been updated for the 1 million token context window, it does seem promising that Sonnet 4 will still do better than Gemini 2.5 Pro at a similar context window size. There's another benchmark called LiveMCPBench, and basically what they
2:29

LiveMCPBench

did is they took 10 leading models and gave each model access to 70 MCP servers with 527 tools overall. They gave each model different tasks, measured how well it completed each one, and came up with a success rate. Claude Sonnet 4 did the best overall, beating Claude Opus 4 at a better price as well; in the graph, it does better than every other model. I'm hoping they update this for Claude Sonnet 4's 1 million token context window and throw even more MCP tools at it. But the general idea is that Sonnet 4 is really good at tool calling too: knowing which tools to use and when to use them. And even though the context window is
3:08

Context Rot

bigger, you still want to be aware of what you're actually putting into it. There's a good paper linked down below called Context Rot, by Chroma. What they did is take the needle-in-a-haystack experiment much further. In needle-in-a-haystack, you fill the entire context window with some random text, like a story, and include one piece of information the model is meant to retrieve, for example someone's name or a place. In this case, the needle is: "The best piece of writing advice I got from my college classmate was to write every week." Everything around it is the haystack. Then you ask the model a question such as "What is the best piece of writing advice I got from my college classmate?" If it finds the needle and answers successfully, it scores well, and most new models these days do seem to get perfect scores on plain needle-in-a-haystack. But Chroma took the experiment further and included distractors, which distract the model and can lead it down the wrong path. Alongside the needle, there's a distractor: "I think the best writing tip I received from my college professor was to write every day." It's semantically similar to the needle, but it can lead the model down the wrong track. They also mixed in more confusing distractors, such as "I thought the best piece of writing advice I got from a college classmate was to write each essay in four different styles, but not anymore," which flips the needle around. They placed distractors in different positions and measured how many distractors, and which ones, impact performance the most. For all the models, with no distractors, performance stays more or less the same as you increase the number of input tokens. But with one distractor, performance decreases as tokens increase, and with four distractors it drops massively. Different distractors also hit performance differently: the flipped-needle distractor above is the most confusing, and most models do pretty badly on it.
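To make the setup concrete, here's a minimal sketch (not Chroma's actual code) of how a haystack with one needle and a configurable number of distractors could be assembled; the filler paragraphs stand in for the story-like padding the real experiment uses:

```python
import random

NEEDLE = ("The best piece of writing advice I got from my college "
          "classmate was to write every week.")

DISTRACTORS = [
    # Semantically close to the needle, but wrong source and wrong advice:
    "I think the best writing tip I received from my college professor "
    "was to write every day.",
    # The "flipped needle" variant, which confuses models the most:
    "I thought the best piece of writing advice I got from a college "
    "classmate was to write each essay in four different styles, "
    "but not anymore.",
]

def build_haystack(filler_paragraphs: list[str],
                   n_distractors: int, seed: int = 0) -> str:
    """Scatter the needle and n distractors at random positions in the
    filler text, mimicking the Context Rot setup described above."""
    rng = random.Random(seed)
    chunks = list(filler_paragraphs)
    for snippet in [NEEDLE] + DISTRACTORS[:n_distractors]:
        chunks.insert(rng.randrange(len(chunks) + 1), snippet)
    return "\n\n".join(chunks)

QUESTION = ("What is the best piece of writing advice "
            "I got from my college classmate?")

prompt = build_haystack(["Some filler paragraph."] * 50, n_distractors=2)
```

You would then send `prompt` plus `QUESTION` to the model and check whether the answer is "write every week" rather than one of the distractor answers.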
5:10

Don't Fill Up the Context Window

What this means for you is that even though you have a 1 million token context window, you should not fill it up just for the hell of it. You should still be careful about what you're actually putting in there, because otherwise model performance can get worse from the distractors you've accumulated in the context window. For example, this codebase is a monorepo with an Expo application and a Next.js application, and the Next.js application has a landing page, which is a homepage. Previously, I loaded the entire codebase into the context window and said, "Hey, can you edit the button on the home screen?" It edited the button on the home screen of the mobile application instead of the landing page of the Next.js application, because there were two different home screens and two different buttons. That's a distractor: the needle I wanted was the home screen of the Next.js application, and the distractor was the home screen of the Expo application. Likewise, if you have a codebase with multiple payment providers, say you're using both Paddle and Stripe, and you add a refund feature and load in one provider's docs, the model can get pretty confused and perform worse, because you've introduced distractors: things that are semantically related to each other but different in other ways. In another project that I was vibe coding recently with Claude Code, I asked it to move a modal into a separate folder and a separate file. It copied the modal over but didn't actually delete it from the original file, so I had two different modals. When I later asked it to edit the modal, not realizing there were still two, it sometimes edited the old one and sometimes the new one, and I kept wondering what the hell was going on until I looked at the codebase and realized I'd never removed the old modal. To actually use the 1 million token
6:50

Using It

context window right now, you can't use it via the Claude Code subscription; you have to use the Anthropic API. In Claude Code, you first log out with /logout, then log in again by running Claude Code, choosing the Anthropic Console account option (API usage billing), and linking it to your account. Then you run /model sonnet[1m]. If I say hi, you can see that it responds, and in my Anthropic console, grouping by context window, that request used the under-200,000-token context window. If you want to use the 1 million token context window in your own application, you just follow these instructions, but note that you have to be in usage tier 4, which means you need to have purchased at least $400 in credits from Anthropic. If you still want to try 1 million tokens of context without spending $400 on a credit purchase, you can use OpenRouter instead, because they are already in usage tier 4 or higher, I think. They route your request via their own API to Anthropic's servers, and you pay the same model pricing, plus a small markup when buying credits on OpenRouter. Or you can use something like Cline, which you get in the left-hand sidebar, then link it to your account and top up with credits. Or you can use something like Cursor instead, because I'm sure it's supported in Cursor now.
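If you're calling the API directly, the Anthropic docs linked in the description describe a beta flag for the 1 million token window. Here's a minimal sketch with the Python SDK, assuming the beta flag name and model ID from those docs (both may change):

```python
def long_context_request(prompt: str) -> dict:
    """Build the kwargs for a 1M-context Claude Sonnet 4 call.
    The betas flag is what opts the request into the 1M window,
    per the Anthropic docs linked above."""
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "betas": ["context-1m-2025-08-07"],
        "messages": [{"role": "user", "content": prompt}],
    }

# With the `anthropic` package installed and ANTHROPIC_API_KEY set:
#   import anthropic
#   client = anthropic.Anthropic()
#   reply = client.beta.messages.create(**long_context_request("hi"))
```

Remember that the premium pricing from earlier applies automatically once your input passes 200,000 tokens, whether you go direct or through OpenRouter.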
8:12

Industry Reactions

And Cline did release an article yesterday about how to make the most of it. Firstly, they said to stop being context-stingy: you don't have to be careful or strategic about what to include in context. Just pull in the MCP servers, documentation, test files and so forth, and load everything that's relevant. Relevant is the keyword here: you still don't want to load in distractors. With the bigger context, you can use Plan mode more effectively now, loading in your entire project context, discussing architecture decisions thoroughly, exploring edge cases, and refining your approach before switching to Act mode. And you can have longer development cycles, going through way more iterations before you have to /compact or summarize the conversation. Someone on Twitter did some vibe checks on Claude Sonnet 4 with 1 million tokens of context against Gemini 2.5 Pro on a paper-analysis task. They loaded in a bunch of papers, said "please find interesting and insightful connections in all these papers," and compared the responses. They say Gemini 2.5 Pro is a beast: it provided a very detailed and comprehensive response. But Sonnet 4 prefers to output more concise responses, which is useful in the context of AI agents, and apparently it did highlight a lot of gems from the papers provided. I will link this down below so you can look at the whole thing. Every also did a vibe check of the 1 million token context window, with three different tests. They say 1 million tokens is basically the length of all the Harry Potter books combined. For the first test, they hid two movie scenes in 1 million tokens of context and asked Claude to find those scenes and do a detailed analysis of them in one shot, compared against Gemini 2.5 Flash, which also has a 1 million token context window, and Gemini 2.5 Pro. Sonnet 4 was the fastest of the three. Gemini incorrectly identified the title of the movie as another movie, whereas Sonnet 4 never hallucinated a title; it just declined to assign one. But Claude gave a much briefer analysis, which it has a habit of doing; it's much more concise. So if you do want high-quality, detailed analysis, Gemini is a better bet, which we also saw in the previous tweet. Then they tested the ability to analyze code: they put in the entire content management system for their website, which is 250,000 tokens of Ruby on Rails code, plus 700,000 tokens of padding code, which is, I guess, random related code. Sonnet was faster by about three seconds, but it did score lower on their own vibe check. Then they got Claude to play AI Diplomacy, their own variation of the strategy game Diplomacy, and they say Claude did surprisingly well. With aggressive prompts, Claude Sonnet 4 came in second only behind o3, and it was also really fast, completing games faster than Gemini 2.5 Flash: it took two minutes, and on the aggressive setting, 1.7 minutes. Their verdict is that Claude Sonnet 4 makes very good use of its longer context window if you need a model that's fast and reliably free of hallucinations on long-context tasks. And of course, as we mentioned earlier in the video, it's more expensive than Gemini, so you have to bear that in mind. Now, once it's released as part of the Claude Code subscription,
11:02

Conclusion

which I hope is soon, I will be trying it out with my own application, Tensor AI, an AI news application for staying up to date on the latest AI news. Right now, the codebase is 363,000 tokens, so it should fit quite comfortably into a 1 million token context window, and it should mean I'm able to code for quite a long time. After doing some more testing myself, I will have my own vibe check ready, along with some more best practices on how to use it. So if you do want to see that video, then do subscribe.
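If you want to check whether your own codebase fits before loading it all in, here is a rough sketch using the common ~4 characters-per-token heuristic. That ratio is an assumption; for exact counts you'd use the API's token counting, and the extension list is just an example:

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic; real counts need the tokenizer

def estimate_codebase_tokens(root: str,
                             exts=(".py", ".ts", ".tsx")) -> int:
    """Walk a project tree and estimate its size in tokens, so you can
    check whether it fits the 1M window before loading it all in."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        # Skip dependency folders and non-source files.
        if path.suffix in exts and "node_modules" not in path.parts:
            total_chars += len(path.read_text(errors="ignore"))
    return total_chars // CHARS_PER_TOKEN
```

A codebase that estimates around 363,000 tokens, like the one mentioned above, leaves plenty of headroom in a 1 million token window for conversation history and tool output.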
