Check out my newsletter: https://aigrid.beehiiv.com/subscribe
🐤 Follow Me on Twitter https://twitter.com/TheAiGrid
🌐 Learn AI With Me: https://www.skool.com/postagiprepardness/about
Links From Today's Video:
https://x.com/Enscion25/status/1999700260491251981
Welcome to my channel where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos.
Was there anything I missed?
(For Business Enquiries) contact@theaigrid.com
Music Used
LEMMiNO - Cipher
https://www.youtube.com/watch?v=b0q5PR1xpA0
CC BY-SA 4.0
LEMMiNO - Encounters
https://www.youtube.com/watch?v=xdwWCl_5x2s
#LLM #Largelanguagemodel #chatgpt
#AI
#ArtificialIntelligence
#MachineLearning
#DeepLearning
#NeuralNetworks
#Robotics
#DataScience
Table of Contents (3 segments)
Segment 1 (00:00 - 05:00)
So apparently everyone hates the new ChatGPT, and we need to talk about it. So GPT-5.2 was released, and when it was released, I'm quoting Greg Brockman here, it was "the most advanced frontier model for professional work and long-running agents." This is essentially the model OpenAI released as their comeback to Google's Gemini 3. If you weren't familiar with how the models got released, Gemini 3 came out a couple of weeks, or days, earlier, and people were so impressed that many of them cancelled their ChatGPT subscriptions and hopped on over to Google, because Google basically has the full AI stack. Now, the thing is, when this model was released, it was supposedly the most advanced model. When you look at the benchmarks, and I do apologize for the resolution, it's not your phone or laptop, it's just the image I have, you can see that GPT-5.2 Thinking was supposed to be a step above every other model. So I asked the question: what exactly happened? Now, you might be using ChatGPT on a day-to-day basis and not notice any user complaints or anything weird with the model. But a large group of power users, or should I say the vocal minority, have realized that 5.2 is not the model you think it is. We're going to get into exactly why that is, and later in the video I'll share my suspicion about where OpenAI is moving. So, of course, I wanted to see if other people had reached the same conclusion, and there were so many tweets about how bad 5.2 is. Now, I do have to state they are exaggerating quite a bit, because one said: "Holy crap, GPT-5.2 is possibly the worst model they've ever released. I have no idea what they've done. There's no way this was the alpha model my cohort tested, nor is it even close to 5.1. This is the biggest piece of nonsense I've ever seen. Worse than 5.1, and it's failed every single eval. It's so over for OpenAI." Let's be honest, that's a bit of an exaggeration. They've released tons of good models. I don't think it's the worst model they've ever released, but it might not be the best. Now, like I said, this wasn't just one person, and it wasn't just power users. I saw a broad consensus among these vocal power users that 5.2 simply wasn't the best model. And it wasn't just that it wasn't the best; there was a real sort of disgust for the model, in a weird way. You'll see from the tweets: "GPT-5.2 Instant is a broken model. It ignores custom instructions and memory, and poisons context before thinking. The model triggers safeguards every second message. Thinking mode is an utterly different animal, and at least it's good." Some people are saying: "GPT-5.2 sucks so bad. GPT-5 was so disliked that 5.1 felt like an upgrade, and 5.2 is so bad it actually makes 5 look good." Now, trust me when I tell you guys, there were a lot more tweets. This isn't just cherry-picking tweets to make the video. There were a ton of different tweets with a range of like counts; I just wanted to show you a variety, covering all the different reasons people thought the model looked bad. So, you might be thinking: okay, well, the model was released, you could say, in response to Gemini 3. So how on earth did OpenAI, quote unquote, mess this up, and what actually happened?
Well, one of the things I wanted to show you guys as well was SimpleBench, because there is one thing that is really strange about this model that I have to be honest with you guys about, and that is the benchmark inconsistency. GPT-5.2 Pro is extraordinarily inconsistent across a range of different benchmarks. What I mean by this is that GPT-5.2 succeeds and excels on certain benchmarks but completely fails on others. For example, SimpleBench is supposed to be a benchmark that smart models, models with good real-world understanding, can pass. And I'll actually come back to this in a minute. But the point I'm trying to make is that SimpleBench is designed to trip the model up with trick questions and see if the model is actually paying attention to the entirety of the question and not just regurgitating something it memorized. And GPT-5.2 placed ninth on the SimpleBench leaderboard. To be honest with you guys, that isn't good, because it's underneath Gemini 3 Flash, underneath the Claude Opus models, and underneath GPT-5 Pro. I mean, it's also underneath the Gemini 2.5 Pro preview that was released all the way back in June. So you might be thinking: how on earth is a model released months after Google, months after Grok, and months after Claude failing at one of the tests that basically measures the true reasoning of the model? When I say true reasoning, I'm not talking about maths, coding, and all of those quantitative benchmarks. I'm just talking about understanding the world in the way that humans do. And now, remember I said there were just tons of tweets, trust me, guys, this one person said: "This is why we don't like 5.2. It talks about role-playing like a human and it just sounds awful." Now, remember what I said. Okay, 5.2 is an incredibly smart model.
Segment 2 (05:00 - 10:00)
However, on some benchmarks, it completely seems to collapse. And we're going to dive into exactly what the reason for that is. But one last example that I want to show you, before we dive into the entire reason that 5.2 is so weird and the direction I think OpenAI is taking things, is how it performed on EQ-Bench. A lot of people were saying that 5.2 was just completely terrible on the emotional benchmark, in terms of how good it is at talking to humans, simulating personalities, and that kind of stuff. However, when I did check EQ-Bench 3, GPT-5.2 is number three in terms of that ability, which is completely surprising. So I honestly don't know what's going on with the benchmarks, but I think there is something going on. One of the things that has been discussed quite a lot recently, and I think this makes the most sense for a number of reasons, is potentially overfitting. This person tweeted that the O in OpenAI has to be for overfitting: "As someone who genuinely loved 5.1, this feels like a huge step backwards. Answers feel rushed. The nuance, the depth, and the creativity aren't there." Now, if you aren't familiar with what overfitting is, I'm going to explain it to you as simply as possible. I'm going to bring up this tweet, and trust me guys, this is a lot of text, but I'm going to summarize it for you super quickly. This tweet got around, I think, 2,000 retweets, and it was basically quoting Ilya Sutskever's recent podcast, where he talks about the fact that when an AI studies and practices the examples so hard, it basically memorizes them instead of learning the general pattern. That makes it look super smart on the benchmarks it saw during training, but it performs super badly on new, unseen data. Quoting Sutskever: "What happens inadvertently is that people take inspiration from the evals. You say, hey, I would love our model to do really well when we release it. I want the evals to look great. What would be RL training that would help on this task, right? I think that is something that happens, and I think it could explain a lot of what's going on. If you combine this with generalization of the models actually being inadequate, that has the potential to explain a lot of what we are seeing: this disconnect between eval performance and actual real-world performance, which is something that today we don't even exactly understand." Think about it using this example, okay? And if you want to pause to read this, you can. But imagine a student who memorizes the exact answers from last year's exam instead of understanding the actual subject matter. If the teacher changes those questions even a small bit, suddenly the student is going to do poorly on the test, because they never learned the underlying ideas and concepts. In AI, this is basically the same thing. The overfit model is just matching the training data too closely, including random noise or quirks that never repeat in real life. Because it has learned those, it performs really well when the test looks like the training data, but in the real world it just fails. And this is what this tweet is getting at, and I think it summarizes the biggest problem.
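To make the overfitting idea concrete, here's a minimal sketch in Python, my own illustration and not anything from the video or from OpenAI. It fits two polynomials to the same noisy samples of a sine curve: a low-degree one that learns the general shape, and a high-degree one that memorizes the noise, just like the student memorizing last year's exam.

# Minimal overfitting demo; assumes only numpy is installed.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 12))
# Noisy "exam questions": a sine curve plus random noise.
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, 12)
x_test = rng.uniform(0, 1, 200)          # unseen questions
y_test = np.sin(2 * np.pi * x_test)

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # "study" the training set
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

# The degree-9 fit drives training error toward zero by threading through the
# noise, but its error on unseen points blows up: it memorized the answers
# instead of learning the underlying curve.

Run it and the high-degree "student A" beats the low-degree "student B" on the training points while losing badly on the held-out ones, which is exactly the benchmark-versus-real-world gap being described here.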
And I think in the future this is going to change quite a lot, because the tweet talks about how research teams have entire divisions whose only job is to create new reinforcement learning training environments designed to boost those benchmark scores. They treat SWE-bench and MMLU like standardized tests. The model studies hard for those tests, and then it fails to fix a simple bug in production without introducing two new ones. And Sutskever's analogy is perfect: student A grinds 10,000 hours of competitive programming, memorizing every algorithm and every edge case, and becomes the number-one-ranked competitive programmer in the world. But student B practices 100 hours and has intuition, taste, and the ability to learn new things quickly. Who is going to have the better career? Student B. Because current AI models are all student A. And it talks about how studies have shown that data contamination inflates model scores by 20 to 80% on popular benchmarks. This is true; I've seen papers that do show this, and I'll sketch what a contamination check looks like below. And data contamination, I honestly don't know how they're going to solve that problem, because certain benchmarks are just out there and these models basically train on everything. And you have to understand, there is a quote, I can't remember who it's from, but it says: show me the incentive and I will show you the outcome. For every single AI company, there is the incentive for their new model to perform better than the next best model on the benchmarks. And if they don't, then the perception becomes: oh well, this model company is just going out of business, they're terrible, they didn't release something better than the last one, there's a war, yada yada. And I do think that is a bad way for a lot of people to think, because a model that doesn't top the benchmarks but performs better at specialized tasks, a model that actually has real understanding of the world, would probably be better. And this is the best quote I saw from this tweet.
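On that contamination point, here's a deliberately simplified sketch of how a check might work: flag a benchmark question if any of its 8-word sequences appears verbatim in the training corpus. This is my own toy illustration; real decontamination pipelines are far more involved, and the threshold of 8 words is an arbitrary choice for the example.

# Hypothetical, simplified contamination check via n-gram overlap.
def ngrams(text, n=8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item, training_corpus, n=8):
    # Any shared n-gram means the exact benchmark wording may have leaked
    # into training, so a high score on this item measures memory, not reasoning.
    return bool(ngrams(benchmark_item, n) & ngrams(training_corpus, n))

corpus = "students practice problems like the quick brown fox jumps over the lazy dog every day"
question = "the quick brown fox jumps over the lazy dog"
print(is_contaminated(question, corpus))  # True -> this item's score is suspect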
Segment 3 (10:00 - 15:00)
That's why I just had to include it in this video. It says: "This explains the economic puzzle that I pointed to. Models can score 100% on AIME 2025. They hit 70% on GDPval, beating human professionals, which is a benchmark of economically valuable work, yet businesses still struggle to extract value from AI models. The benchmark performance says genius, and the profit and loss statement says otherwise. That sample efficiency gap tells you everything. A human teenager learns to drive any car after 10 hours, and an AI model might need millions of examples and still fail on slight variations. Humans learn the concept once and apply it everywhere, and models need to learn the exact pattern thousands of times and still choke when the formatting changes slightly." And this is absolutely true. On paper, these models seem to be super geniuses. You know, if you go back to that earlier chart, you can see GPT-5.2 Thinking is just super smart. But in the real world, these models just don't perform that well on things that are random. Like, if you hand one a task from whatever strange work you do, it just doesn't understand; it really struggles sometimes. And I'm not saying that these models are completely useless for economically valuable tasks. I'm just saying that if you had a person who was actually as smart as the benchmarks claimed, they would be able to do a lot more than these AI systems. So it's not really a one-to-one mapping of: okay, the model's got a 140 IQ, so it's equivalent to me having a 140-IQ person inside my computer. Now, of course, once again, a lot of people are saying that GPT-5.2 Thinking scored lower than 5.1 on long-context reasoning despite OpenAI's blog post, and others are saying 5.2 is a lot worse across the board. And remember guys, this is one of the biggest things. Remember how I said, show me the incentive and I'll show you the outcome? This is the thing: OpenAI declared code red as Google threatened its AI lead. I think OpenAI did feel a lot of pressure to release the model early, because Google was pretty much taking the limelight, and a lot of people were starting to get bearish on OpenAI. I'm not actually bearish on OpenAI in the long run. I think they have one of the craziest distribution networks, but I do think that other AI players are entering the game, and they have to continue to prove themselves so that they don't end up like BlackBerry. One thing a lot of people don't realize is that companies come and go. There are shifts that happen. Remember Nokia and BlackBerry from the early 2000s? Those companies seemed like they had it in the bag. But Apple came out with the iPhone. They waited, they were patient, and look at where Apple is now versus where BlackBerry is now. Two different outcomes. Apple has shown that companies that get complacent can simply be overtaken, and OpenAI doesn't want to end up like Yahoo. So, of course, now they're pushing and pushing. And the thing here, okay, is that when they declared this code red, I think the sense was: okay, we need to show the world OpenAI still has it.
And the reason I think this is most likely the case is because, I think it was yesterday, in fact I'm pretty sure it was yesterday, The Information, which is a very reputable source when it comes to AI news, came out with an article stating that OpenAI basically told them GPT-5.2 isn't actually based on the full-blown model they were supposed to release. They actually decided to release an early checkpoint of the model, according to a person with knowledge of the model. So there's still some improvement there. It seems like, one, this model was geared towards performing on economically valuable tasks, because that's what they said it was designed for. Two, the model's EQ wasn't good because it just wasn't designed to talk to people; it was designed to get work done. And three, this was a response model, because they were scared of what Google is doing. And we did actually get some confirmation that, number one, it's an early checkpoint of the model, so it's clear they probably did rush this out, and number two, that 5.2 is going to be upgraded in early 2026. So in Q1 of 2026, we can expect some interesting upgrades. This was actually confirmed in this interview, so you might want to take a listen. "When's GPT-6 coming?" "Um, I don't know when we'll call a model GPT-6, but I would expect new models that are significant gains from 5.2 in the first quarter of next year." "What does significant gains mean?" "I don't have an eval score in mind for you yet." "More on the enterprise side of things, or..." "Definitely both. There will be a lot of improvements to the model for consumers. The main thing consumers want right now is not more IQ. Enterprises still do want more IQ. So we'll improve the model in different ways for different uses. But our goal is a model that everybody likes much