Meta's Llama 4 is a beast (includes 10 million token context)

Skill Leap AI · 25.04.2025 · 6,228 views · 146 likes


Video description
You can learn more about the Llama 4 release here: https://ai.meta.com/blog/llama-4-multimodal-intelligence/ Try it yourself here: https://meta.ai/ Meta recently released its most powerful open-source AI models yet: Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth (currently in preview). Llama 4 Scout is a 17B parameter model with a 10 million token context window, outperforming models like Gemma 3 and Gemini 2.0 Flash-Lite—all while running on a single NVIDIA H100 GPU. Llama 4 Maverick, also with 17B active parameters but using 128 experts, beats GPT-4o and Gemini 2.0 Flash across key benchmarks. It delivers state-of-the-art performance in reasoning, coding, and vision, while being more efficient than larger competing models. Both Scout and Maverick were trained using Meta’s 288B parameter teacher model, Llama 4 Behemoth, which already outperforms GPT-4.5 and Claude Sonnet 3.7 on several STEM benchmarks. These models use a “mixture of experts” architecture—only the necessary parts of the model are activated for each task, making them faster and more efficient. You can try Scout and Maverick directly on Meta’s site or download them from Hugging Face if you're a developer.

Table of contents (2 segments)

Segment 1 (00:00 - 05:00)

Meta recently released Llama 4, which is their new open-source large language model. It's a family of three, it's available right now, and I partnered with Meta for this video to give you the breakdown of all the different versions of Llama 4 and what they have to offer. Later in the video, I'll show you a couple of different websites where you can try it for free yourself, and I'll show you how to download it, too, if you're a developer and want to play around with Llama 4 for your applications. Okay, so Llama 4. I've covered all the different Llama models every time they release a new one, and this version of Llama 4 comes in three different sizes: Llama 4 Behemoth, Llama 4 Maverick, and Llama 4 Scout. I like to think of them as the large one, the medium-sized one, and the smallest one they make. Now, all these models are multimodal, which is great, but the biggest thing that stood out to me is the context window. It completely changes how we're going to think about and use context windows. Right here, I'll just show you Scout. It has an industry-leading 10 million token context window, which is about five million words. Just to put that in perspective, ChatGPT with GPT-4o still has a context window of 128,000 tokens. That's basically your input, all the context it can remember in a conversation. 128,000 versus 10 million. Even Gemini, which was the best model for context windows before this came out, is at 2 million. Now, let me walk you through each model. I want to emphasize the key points and simplify things a bit, because this is a little technical, and I usually make videos that are a bit more accessible. So I'll walk you through just the key points in a simpler way. The first one in the batch is called Llama 4 Scout.
This is the smallest model they have, and it's a general-purpose model, but it also has multimodal capabilities, so it can understand text as well as images. It has 17 billion active parameters, 16 experts, and 109 billion total parameters. I'm going to explain exactly what that means right now. That number, 109 billion total parameters: parameters are basically like settings that get tuned while the model goes through training, and the more parameters a model has, typically the more capable it's going to be. You'll see that we're starting with the smallest one at 109 billion, and it goes up much higher with the biggest model they make. Now, the 17 billion active parameters part makes the model a whole lot more efficient: instead of using all 109 billion parameters all the time, it uses 17 billion at any given time. The way it does that is the 16 experts. Let me explain what that is. These are mixture-of-experts models. Instead of the full model working all the time, a mixture of experts uses only the parts of the model that are needed for that very specific task. This graph explains it really well: every time a request comes in, so you ask it to do something, it activates two, three, or four different experts depending on what's most relevant, instead of engaging all of them. This process keeps the high performance but makes the whole thing faster and more efficient. And those 17 billion active parameters mean Scout can actually run entirely on one single NVIDIA H100 GPU, so it's a lot more resource-friendly for its size and performance. It's still a big model, so it's going to require a pretty beefy GPU, but only a single one. Now, the 10 million token context window that I mentioned is by far the largest in the industry.
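The expert-routing idea described above can be sketched in a few lines of code. This is a toy illustration of top-k mixture-of-experts gating under my own assumptions, not Meta's actual implementation; the function names and the random scores are made up for illustration.

```python
import math
import random

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_to_experts(router_scores, k=2):
    """Pick the top-k experts for one token based on router scores.

    router_scores: one relevance score per expert (higher = more relevant).
    Returns (expert_indices, weights): only those experts' parameters
    are activated for this token, keeping active parameters small.
    """
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    top = ranked[:k]
    weights = softmax([router_scores[i] for i in top])
    return top, weights

# Toy example: 16 experts (as in Scout), route one token to 2 of them.
random.seed(0)
scores = [random.uniform(-1, 1) for _ in range(16)]
experts, weights = route_to_experts(scores, k=2)
print(experts)       # indices of the 2 experts that get activated
print(sum(weights))  # their mixing weights sum to 1.0
```

The point of the sketch is just the efficiency argument from the transcript: per token, only the selected experts do any work, so the active parameter count stays far below the total parameter count.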
For open source, for closed source, for any large language model, this beats everything else that's available right now, and it's going to completely change how we use large language models, because the context window is one of their big limitations. This can handle a huge amount of text as input and output. I've read a lot about what Meta is trying to do: they're clearly pushing towards an infinite context window, and Scout, which can handle a 10 million token context window, is well on its way there. Meta is basically trying to solve one of the biggest problems when it comes to using large language models. When I first started using ChatGPT back in 2022, just to put this in context, the context window was 8,000 tokens. Now it's 10 million, just a couple of years later. The process I'm using right now is kind of a workaround: I take large documents, which could be data documents or text documents, trim them down or chunk the data into different files, organize them, and give each large language model a size it can handle within the context of that conversation, so it remembers what we're talking about within that chat. But with a 10 million token context window, and moving towards an infinite one, that's not going to be a problem much longer. Even with a 10 million token window, I don't think I have any documentation that would hit that limit, though I'm sure some people have much bigger data sets they're trying to work with or analyze. So that is soon going to be pretty much solved with the way Meta is
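The chunking workaround described here boils down to splitting a large document into pieces that each fit the model's context window. A minimal sketch, assuming a rough rule of thumb of about 4 characters per token for English text (a real tokenizer would give exact counts):

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4

def chunk_document(text, max_tokens=128_000):
    """Split text into chunks that each fit under max_tokens.

    With a 128K-token window a big document needs many chunks;
    with a 10M-token window the same document often fits in one.
    """
    max_chars = max_tokens * 4
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

doc = "word " * 1_000_000  # ~5 MB of text, roughly 1.25M tokens
print(len(chunk_document(doc, max_tokens=128_000)))     # 10 chunks
print(len(chunk_document(doc, max_tokens=10_000_000)))  # 1 chunk
```

This is exactly the difference the speaker is pointing at: the larger the window, the less of this manual splitting and re-organizing you have to do before the model can see your data at once.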

Segment 2 (05:00 - 09:00)

moving. Now, the benchmarks for Llama 4 Scout: they compared it to the older Llama 3.3 and Llama 3.1 models (the biggest 3.1 model was 405 billion parameters), other open-source models, and also Gemini 2.0 Flash-Lite. And again, this is multimodal, while some of those didn't have multimodal capabilities. Scout wins pretty much every benchmark they tested here. And obviously the context window: all of those were around 128K, which was pretty standard, and now we're at 10 million on this side. Okay, the next one on the list is the medium model, Llama 4 Maverick. This one has 128 experts and 400 billion total parameters, and it still manages to use 17 billion active parameters to stay efficient enough to run on that single GPU I mentioned, which is very useful. Now, there's one thing that's really interesting about Maverick. Even though it has 17 billion active parameters, which is still relatively small if you compare it to GPT-4o, for example, or Gemini 2.0 Flash, if you look at this benchmark, where they compare it to those other models and also DeepSeek V3.1, which is not multimodal, you can see that across the board Maverick is beating every single other model. And when it comes to cost, it's very efficient: starting at 19 cents per 1 million input and output tokens, so on par with Gemini 2.0 Flash and a bit cheaper than DeepSeek, which is another open-source model. GPT-4o has a long way to go to catch up with this kind of pricing. When it comes to coding and reasoning, it's also performing on par with that DeepSeek model, but DeepSeek is using twice as many active parameters. So Maverick does this at half the active parameters, and again, using fewer active parameters is more efficient because you can run on a smaller hardware setup. Okay, let's get to the biggest one. This is called Llama 4 Behemoth, at 2 trillion parameters.
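As a quick sanity check on the pricing discussion above, here is the arithmetic behind a per-million-token price. The $0.19 figure is the approximate blended input/output cost quoted in the video; the function name is mine.

```python
def inference_cost(total_tokens, price_per_million_usd):
    """Dollar cost for a token count at a per-1M-token price."""
    return total_tokens / 1_000_000 * price_per_million_usd

# Example: processing 10 million tokens (Scout's entire context
# window, say) at the quoted ~$0.19 per 1M tokens:
print(round(inference_cost(10_000_000, 0.19), 2))  # 1.9
```

In other words, even a maximally long request at this price point stays under two dollars, which is why the speaker calls the pricing efficient compared to the closed-source competitors shown on the chart.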
It has 288 billion active parameters and 16 experts, and Behemoth right now is outperforming Gemini 2.0 Pro and beating Claude Sonnet 3.7 in this benchmark, which is a STEM-related benchmark. Now, here's the wild part: this one is still in preview, which means it's still in training. So it's getting these kinds of benchmark results while it's still in training, which is really impressive. Now, the fact that this family of open-source models can compete with or beat almost every single closed-source model from the top AI companies right now is a pretty big deal. When a developer or a company is considering building an app on a large language model, they have a lot more control with an open-source model like this. If you're using other models, like GPT or Claude, to build an app, those offer API access; that's how you pay those companies to use them, but that's going to be limited. Open source is flexible and customizable: you can self-host it, you can fine-tune it. There's a lot more you can do with an open-source model than with a model that requires API access. Now, let me share a couple of links with you so you can try this for yourself, either on the web or, if you're a developer, by downloading it. To download, you request access to the Llama models: you just fill this out and then choose which model, if you have the hardware to run these models. As I mentioned, Scout is the smallest one, Maverick is the medium-sized one, and Behemoth is still in preview, so you won't be able to download that one, but the other two are available right now. You can also get them from the Hugging Face website, which is linked in their blog post. And if you want to try it for yourself, I have a couple of resources. I've been using it here on the Meta AI website, meta.ai. If you ask it what model it's using, it will tell you it's using Llama 4 right now.
Every time I try out open-source models, I really like this website too: it's called Groq, with a Q, at chat.groq.com. If you look right here, you can actually choose different open-source models, so if you search for Llama, you'll see you can use Scout here, and you can also use Maverick. If you choose one of these, it's going to use that model to answer any prompt you send. The coolest part about Groq is that even though Meta AI is really fast, this might be a bit faster; it answers you almost instantly when you type in a text prompt. So if you want to test it out, this is a good place, and obviously Meta AI is going to be a good place too. They also rolled it out to all kinds of apps that Meta owns, so it's on WhatsApp, Messenger, Instagram, and this is the web version here on meta.ai. I hope you found this useful. Thanks for watching, and I will see you in the next one.
