# Meta Finally Revealed The Truth About LLAMA 4

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=YgJ338Phb9M
- **Date:** 08.04.2025
- **Duration:** 15:54
- **Views:** 14,591

## Description

Join my AI Academy - https://www.skool.com/postagiprepardness 
🐤 Follow Me on Twitter https://twitter.com/TheAiGrid
🌐 Checkout My website - https://theaigrid.com/


Links From Today's Video:
https://ai.meta.com/blog/llama-4-multimodal-intelligence/
https://www.theinformation.com/articles/llama-4s-rocky-debut?rc=0g0zvw

Welcome to my channel, where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos.

Was there anything I missed?

(For Business Enquiries)  contact@theaigrid.com

Music Used

LEMMiNO - Cipher
https://www.youtube.com/watch?v=b0q5PR1xpA0
CC BY-SA 4.0
LEMMiNO - Encounters
https://www.youtube.com/watch?v=xdwWCl_5x2s

#LLM #Largelanguagemodel #chatgpt
#AI
#ArtificialIntelligence
#MachineLearning
#DeepLearning
#NeuralNetworks
#Robotics
#DataScience

## Contents

### [0:00](https://www.youtube.com/watch?v=YgJ338Phb9M) Segment 1 (00:00 - 05:00)

So, the AI industry is never without drama, and Llama 4's release has been a rocky one, to say the least. Llama 4 is out, and apparently the benchmarks aren't living up to the hype. So, in this video, I'm going to dive into the recent news that may reveal some of the actual truths going on behind the scenes at Meta.

One of the first things that struck me was that Llama 4 was released without a technical paper. Now, you may think this isn't a big deal, considering that many companies now opt to keep their breakthroughs private, since that is technically a competitive advantage. In this case, however, it may be indicative of a deeper issue, because we don't have access to the model's internal wiring: we don't really know how they built the model, how it was trained, or what techniques they used. And some people are arguing that this is further evidence that Meta tampered with the benchmarks, possibly overfitting to them to get better results. Of course, I'll dive into all the key details, but this release really is intriguing, because there are two camps here. On one side you have people saying that Meta completely faked the benchmarks, and on the other side you have people, including myself, saying that this model is actually pretty decent.

Now, one of the things I want to show you is something that went semi-viral on Twitter. It was a post I saw on Reddit, originally from another website, and it was interesting because it appeared around the time DeepSeek V3 was released. It spoke about how Meta's GenAI organization was in panic mode, and how it started with DeepSeek V3, which rendered Llama 4 already behind in benchmarks; adding insult to injury, it came from a then little-known Chinese company with a $5.5 million training budget. It says engineers are moving frantically to dissect DeepSeek and copy anything and everything they can from it, "and I'm not even exaggerating." This came from a website where people in tech can post anonymously about their jobs, their experiences, and the industry. And you can see here it says that management is worried about justifying the massive cost of the GenAI org: how would they face leadership when every single leader of the GenAI org is making more than what it cost to train DeepSeek V3 entirely, and there are dozens of such leaders? Essentially what they're saying is this: if DeepSeek's training run cost $5.5 million, consider that these GenAI leaders are being paid millions and millions of dollars, because competition for AI talent is so fierce that companies fight hard over who they can retain, and one way to retain employees is higher compensation. So now leadership is wondering: if I'm paying someone multiple millions of dollars a year and some little-known Chinese company can match the output of the entire organization, maybe we need to change things. And you can see here that it says DeepSeek R1 made things even scarier; the poster can't reveal the info, but says it will be public soon anyway. It should have been a small, engineering-focused organization, but since a bunch of people wanted to join for the impact grab and artificially inflated hiring in the organization, everyone loses.
It seems like what happened was that a lot of people wanted to be on the pioneering team, considering GenAI is getting so much attention, not only in the media but across many different industries. At the time, this post was quickly dismissed by many, because of course we knew DeepSeek V3 was good, and anyone can post anonymously. But now that the recent news has come out, a lot of people feel the post has more credibility, because not only was it written earlier, it effectively predicted that Llama 4 would still be behind in the benchmarks. So I'm going to show you some evidence that there were discrepancies going on. Ethan Mollick, the AI professor, spoke about how the Llama 4 model that won in the LM Arena, which is basically the benchmark arena, was different from the one released to the public. He has been comparing the answers from the arena with those from the released model, and they aren't close at all, and he says the data is worth a look, as it shows how LM Arena results can be manipulated to be more pleasing to humans. Essentially what he's saying is that we've got a situation on our hands: there was one model being tested by humans, whose benchmark results were released publicly for Llama 4, and then they released a separate, arguably less capable, model. So when we take a look, we can see that when you ask the released Maverick, via OpenRouter, to make a riddle where the answer is 3.145, it responds in a

### [5:00](https://www.youtube.com/watch?v=YgJ338Phb9M&t=300s) Segment 2 (05:00 - 10:00)

very minimal way, with an answer that is pretty basic. And then we can see that the Llama 4 Maverick Experimental version is pretty different from the one we see here; this one is very comprehensive in its reasoning and, of course, in its response. I'm not sure exactly what this Llama 4 Maverick Experimental version is, but one of the key problems we have in the AI space is that companies are pretty terrible at naming things. You've got o4, o3-mini, o3-mini-high, o3-high, o1, o1-mini, GPT-4o, GPT-4o mini, and o4 coming; it's all very, very confusing. And sometimes, even with new AI releases, you'll often have several different names for the same model. So I don't want to say the professor was confused, but we can clearly see that if they did use the Llama 4 Maverick 03-26 Experimental model to benchmark and produce the results, it isn't the same model as the one being served on websites like OpenRouter, so clearly there would be a difference. I would even suspect this model might be related to the Behemoth model, considering Meta said Behemoth is still in training; maybe it's a distilled version or something like that. But clearly, if there are these differences at release, it's not a good look. I would just hope this is a small mistake and that the Llama 4 Maverick Experimental version is simply a different variant from the one that was released, in the sense that they were just testing out different versions to see which ones are best.

You can see right here that someone says this is the clearest evidence that no one should take these rankings seriously. And LMArena said that, to ensure full transparency, they're releasing the head-to-head battles for public review, including the prompts, the model responses, and the user preferences. On LMArena, I'm pretty sure you're familiar with how it works: someone puts a question in, two models respond blindly, so you don't know which model responded just yet; you just see the outputs. After the outputs are revealed, you pick the one you like best, and if you pick, say, the left one, it then reveals which model you sided with. And interestingly, we can see here that the Llama 4 Maverick Experimental apparently isn't actually better; it just produces more text, and that's why users are voting for Llama 4. So this is super interesting, because we have a situation where, in some instances, users are voting for models that are incorrect based on the data they see. So while LM Arena is of course good in some instances, in some benchmark areas it probably isn't the best.
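
As an aside on that head-to-head format: here is a minimal sketch, in Python, of how this kind of anonymous pairwise preference collection works in principle; two contenders answer the same prompt, the rater sees only the unlabeled answers, and the identities are revealed only after the vote. The model names and stand-in answer functions below are my own illustration, not LMArena's actual code.

```python
import random

# Hypothetical stand-ins for real model endpoints; in a real arena these
# would be live API calls to the two models being compared.
def ask_experimental(prompt: str) -> str:
    return f"[long, chatty answer to: {prompt}]"

def ask_release(prompt: str) -> str:
    return f"[short answer to: {prompt}]"

MODELS = {
    "llama-4-maverick-experimental": ask_experimental,
    "llama-4-maverick-release": ask_release,
}

def blind_battle(prompt: str) -> dict:
    """One anonymous head-to-head: shuffle the contenders, show only their
    answers as 'Assistant 1' / 'Assistant 2', and reveal names after the vote."""
    names = list(MODELS)
    random.shuffle(names)  # hide which model produced which answer
    for i, name in enumerate(names, start=1):
        print(f"Assistant {i}:\n{MODELS[name](prompt)}\n")
    choice = int(input("Which answer do you prefer? (1 or 2): ")) - 1
    return {"prompt": prompt, "winner": names[choice], "loser": names[1 - choice]}

if __name__ == "__main__":
    vote = blind_battle("Make a riddle where the answer is 3.145.")
    print("You sided with:", vote["winner"])
```

Aggregated votes like these are what drive the leaderboard, which is why a model tuned to produce longer or more pleasing answers can climb the rankings without necessarily being more capable.
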
Now, like I said in the title, Meta has responded to this data floating around the internet. They said they're glad to start getting Llama 4 into everyone's hands and that they're already hearing lots of great results people are getting with these models. That said, they're also hearing some reports of mixed quality across different services; since they dropped the models as soon as they were ready, they expect it's going to take several days for the public implementations to get dialed in, and they'll keep working through bug fixes and onboarding partners. They've also heard claims that they trained on test sets, which they say is simply not true and something they would never do; their understanding is that the variable quality people are seeing is due to needing to stabilize implementations, and they believe the Llama 4 models are a significant advancement and look forward to working with the community to unlock their value. So overall this is a super interesting statement, because they clearly acknowledge that there are reports of mixed quality across these different services, and it really is interesting to see the different responses people are getting. Anecdotally, I'll say that I've personally used Llama 4, and while it isn't that crazy, I actually published a video on the top Llama 4 use cases. And I've got to be honest with you: when I was using the model through OpenRouter on my AIGRID Academy, where I teach people how to use AI effectively, it actually performed really well on a bunch of different problems and questions I had, compared to other models. So I think, like I said before, it really depends on where you're using the model and how effectively you're using it. There are definitely some differences depending on the kinds of things you're using it for, but personally, when I used it on my second channel, like I said, talking about all the amazing ways you can use the model, I found it rather effective. Of course, you're going to have to test out the model yourself; you can use something like OpenRouter or Poe.com. But I do think it's super interesting that there's all of this drama surrounding the model due to the benchmarks. Now, they clearly state that they simply did not train on the test sets and would never do that, but there's something quite interesting

### [10:00](https://www.youtube.com/watch?v=YgJ338Phb9M&t=600s) Segment 3 (10:00 - 15:00)

that I did find in another benchmark as well. And I'm not discrediting Meta here; I do think the models are actually pretty decent. Now, there was also something posted on Reddit not too long ago. You can see it talks about how there are serious issues with the Llama 4 training and that certain people have resigned. There was basically a post in Chinese claiming that, despite repeated training efforts, the internal model's performance still falls short of open-source state-of-the-art benchmarks, lagging significantly behind, and that company leadership suggested blending test sets from various benchmarks during the post-training process, aiming to meet the targets across various metrics and produce a presentable result; failure to achieve this goal by the end-of-April deadline would lead to dire consequences. And of course, following the release of Llama 4 the day before, many users on X and Reddit had already reported extremely poor real-world test results. Now, like I said before, I was just using it to automate a few things on social media and for certain businesses, which is pretty qualitative, not really quantitative, in the results you get, so perhaps that's why my results are a lot better. I basically just talk about the fact that Meta trained on Instagram, WhatsApp, and Facebook data, and those kinds of data sources allow the model to understand the nuances between the platforms in a way other models simply can't. And like I said before, if you're using this model in other areas like coding or building apps, it's going to be a lot different, since the people deploying the model will really have to optimize to ensure you get the best performance out of it. Now, the resignation claim was true: the VP of AI research at Meta did resign. But there's a caveat to note: while this person was the VP of FAIR, that is actually an entirely separate organization within Meta from Generative AI, which is the org that works on Llama; the VP of GenAI is Ahmad Al-Dahle, and I already showed you a post from him. Now, if we take a look at other benchmarks, because this is what I wanted to do: Artificial Analysis just posted that they've now replicated Meta's claimed values for MMLU Pro and GPQA Diamond, pushing their Intelligence Index scores for Scout and Maverick even higher. They noted that in their first post, 48 hours earlier, they had found discrepancies between their measured results and Meta's claimed scores on multiple-choice evaluation data sets. After further experiments and close review, they decided that, in accordance with their published principle against unfairly penalizing models that get the content of questions correct but format answers differently, they will allow Llama 4's answer style of "The best answer is A" as a legitimate answer for multiple-choice evals. This leads to a jump in the score for both Scout and Maverick, the largest for Scout, in two of the seven evaluations that make up the Artificial Analysis index. So Scout's score has moved from 36 to 43, and Maverick's from 49 to 50. If you're wondering where that places these models, it's just above Gemini 2.0 Flash and just above the GPT-4o March update. So I don't think this model is that bad. Like I've said before, when it comes to the benchmarks, maybe Llama 4's way of answering questions could be misconstrued.
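
To make the answer-formatting issue concrete, here is a minimal sketch of the kind of answer-extraction step a multiple-choice eval harness might apply: a strict grader accepts only a bare option letter, while a more lenient one also accepts wrapped phrasings such as "The best answer is A." The function names and regular expressions are my own illustration of that idea, not Artificial Analysis's actual grading code.

```python
import re

def extract_choice_strict(response: str) -> str | None:
    """Accept only a bare option letter, e.g. 'A' or 'C.'."""
    match = re.fullmatch(r"\s*([A-D])\.?\s*", response)
    return match.group(1) if match else None

def extract_choice_lenient(response: str) -> str | None:
    """Also accept wrapped phrasings like 'The best answer is A.'."""
    match = re.search(r"(?:best answer is|answer is|answer:)\s*\(?([A-D])\)?",
                      response, flags=re.IGNORECASE)
    if match:
        return match.group(1).upper()
    return extract_choice_strict(response)

# Toy illustration: the same outputs score very differently under the two graders.
responses = ["B", "The best answer is A.", "Answer: (C)"]
gold = ["B", "A", "C"]

for grader in (extract_choice_strict, extract_choice_lenient):
    score = sum(grader(r) == g for r, g in zip(responses, gold)) / len(gold)
    print(f"{grader.__name__}: {score:.2f}")
```

That scoring gap, driven purely by answer style rather than answer content, is the kind of jump described above for Scout and Maverick.
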
And of course, with benchmarks you have to be pretty precise in how you record those answers. Now, one benchmark I really wanted to look at, because I know this company actually has a private data set that's held out, with no way to really see it, is the SEAL LLM leaderboards. These were developed by Scale AI's Safety, Evaluations and Alignment Lab; they're expert-driven rankings of LLMs designed to provide accurate and reliable performance comparisons, and they evaluate frontier LLMs across multiple domains, including coding, instruction following, math, and multilinguality. Like I said before, the SEAL LLM leaderboards are different because the evaluations are curated from private data sets that cannot be exploited or incorporated into model training data, ensuring unbiased results. So when we look at Humanity's Last Exam, I was looking at this benchmark and thinking, okay, this looks pretty decent. However, when I scrolled down and noticed Llama 4 Maverick's score, there was a potential contamination warning: this model was evaluated after the public release of Humanity's Last Exam, allowing the model builder access to the prompts and solutions. So there are a few caveats there, but nonetheless, this website is what I'm going to use when I try to see where the models really stand. And interestingly enough, when I look at every single benchmark, they note the same potential contamination warning, even on the EnigmaEval and the MultiChallenge evaluation. So overall, I'm not really sure. Is it crazy that this happened? I do think DeepSeek definitely shocked the West in terms of its performance.

### [15:00](https://www.youtube.com/watch?v=YgJ338Phb9M&t=900s) Segment 4 (15:00 - 15:54)

And seeing this firsthand is pretty crazy to me. Like I said before, I've used Llama 4; I spoke about it in a different video on my second channel, where I use AI to educate, and I still think the model is pretty good. But I will say that, overall, I do hope there's more clarity surrounding the model, because honesty is the best policy. Of course, Meta's incentives are aligned with having the best benchmarks, so they can get people to actually use their products and services and attract the best talent, but time will tell. And considering that this model is honestly just an open-source, non-reasoning model, I don't think the results are that bad. I do think we've become almost accustomed to rapid AI changes, and when there isn't a complete shift in the tides of how models perform, a lot of the time we're pretty confused. Let me know in the comment section below how you're finding Llama 4 and other models, because I'd love to know your thoughts.

---
*Source: https://ekstraktznaniy.ru/video/13083*