OpenAI’s New GPT-5.4 Pro Is Now The Smartest AI In The World.


TheAIGRID · 06.03.2026


Video description
🎓 Learn AI In 10 Minutes A Day - https://www.skool.com/theaigridacademy
Get your Free AGI Preparedness Guide - https://theaigrid.kit.com/agi
🐤 Follow Me on Twitter - https://twitter.com/TheAiGrid
🌐 Learn AI Business For Free - https://www.youtube.com/@TheAIGRIDAcademy

Links From Today's Video:
https://openai.com/index/gpt-5-4-thinking-system-card/

Welcome to my channel, where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos. Was there anything I missed?

(For Business Enquiries) contact@theaigrid.com

Music Used:
LEMMiNO - Cipher - https://www.youtube.com/watch?v=b0q5PR1xpA0 (CC BY-SA 4.0)
LEMMiNO - Encounters - https://www.youtube.com/watch?v=xdwWCl_5x2s

#LLM #Largelanguagemodel #chatgpt #AI #ArtificialIntelligence #MachineLearning #DeepLearning #NeuralNetworks #Robotics #DataScience

Table of contents (5 segments)

Segment 1 (00:00 - 05:00)

So GPT-5.4 is here, and this is genuinely a surprising model. So let's get into it. Now, I'm not going to bore you guys with all of the benchmark stuff, because oftentimes it is pretty boring. But what I will say is that, surprisingly, GPT-5.4 Thinking beats Claude Opus 4.6 in many different areas, and even beats Gemini 3.1 in areas where Gemini 3.1 is supposed to be state-of-the-art. Now, what do I mean by that? Well, taking a look across the board here, the benchmarks you're currently looking at aren't the standard boring ones that we consistently look at, like GPQA. Even though GPQA is here, those benchmarks are basically saturated. The benchmarks I'm talking about are the ones for the next evolution of AI, such as FrontierMath, OSWorld, and BrowseComp.

So, one of the ones I want to show you guys is BrowseComp. Now, this might not look like anything crazy, but here's the thing, okay? This is a benchmark designed to test whether an AI model can pull in real-time data from across the web, and do it agentically. Now, you would probably think that Google's Gemini 3.1 Pro model, considering that it is powered by Google, the search engine, would be state-of-the-art. But nope, that isn't the case. You can see that GPT-5.4 Pro actually edges it out, getting around 89.3%. Now, I know once again that if we look at this across the board it isn't that big of a deal in terms of just raw capabilities, but when we look at the fact that certain companies are supposed to have their own areas where they dominate, it does become maybe a little bit more concerning. And when I say concerning, I actually mean for Google and the other companies, because OpenAI wasn't really supposed to make that advancement in this area just yet. So I'm not sure what they were able to do to get ahead of the other labs.

Now, whilst yes, it might be better than other models, the problem with GPT-5.4 Pro, which is currently the best model in the world, I guess you could say, is that this model is pretty darn expensive. I mean, just look at that: $30 per million input tokens versus $180 per million output tokens. That is pretty incredible. GPT-5.4, if you're using the Pro version, which of course most people would want to use, is pretty expensive. But if we compare this to the leading AI models, you know, if we're just looking at Claude Opus 4.6 here, it does look a little bit more reasonable if you're looking at just GPT-5.4 without reasoning. Now, of course, people are going to have their different reasons for picking different models, but I do think that price is going to be a key thing here. And the reason I find this so funny is that if you've been in this space for a while, you will remember many individuals describing this entire situation as "intelligence too cheap to meter." So whilst yes, maybe in some cases the price of AI is going down, I would argue that the price of AI reasoning is actually going up. And in some cases GPT-5.4 at that input price actually beats Claude Opus 4.6, which means that in many cases it might be more cost effective to use the standard version of GPT-5.4. So pricing, I think, is once again going to be a real issue going forward.
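To put those prices in perspective, here's a minimal back-of-envelope sketch of the per-call arithmetic, using the $30/$180 per-million-token figures quoted above. The token counts are made-up placeholders, and the assumption that reasoning tokens are billed as output tokens is mine, not something stated in the video.

```python
# Back-of-envelope cost of one GPT-5.4 Pro API call at the prices
# quoted in the video: $30 per million input tokens, $180 per million
# output tokens. Token counts below are invented placeholders.

INPUT_PRICE_PER_M = 30.0    # USD per 1M input tokens (quoted above)
OUTPUT_PRICE_PER_M = 180.0  # USD per 1M output tokens (quoted above)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at the quoted per-token prices."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# A long agentic task: a big context in, heavy reasoning out.
# Assumption: reasoning tokens are billed as output tokens.
print(f"${call_cost(50_000, 20_000):.2f}")        # $5.10 for one call
print(f"${call_cost(50_000, 20_000) * 100:.0f}")  # ~$510 for a 100-call agent run
```

At those placeholder numbers, a single long agentic call is already in the dollars, which is where the "price of reasoning is going up" point comes from.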
Whichever model ends up being the best in terms of price to performance, I think it is probably going to take a lot of market share, because you guys know what it's like to run out of credits using Claude, and that is something that happens pretty consistently for most people. So I do wonder how they're going to manage that in the future. It will be interesting, to say the least.

Now, I do want to show you guys the most interesting benchmarks, and I really do want to get into this, because these aren't just "okay, this model is 10% better than that one." These benchmarks were ones that were kind of breaking novel ground. So GPT-5.4 Pro did something really cool: it set a new record on FrontierMath. Now, if you aren't sure why that's impressive, FrontierMath is a problem, well, I should say a set of problems, designed by professional mathematicians, that requires research-level thinking. Not textbook stuff, genuinely novel, hard problems. And when it launched, top models were scoring around 2%. Now, you can see on the tiers that GPT-5.4 Pro is completely dominating. GPT-5.4 with extremely high reasoning is completely dominating. And one thing most people are going to miss, which I quickly want to add: even though the other models do seem to be close behind, I think it's a little bit different, because OpenAI do seem to have something innovative going on with math.

Segment 2 (05:00 - 10:00)

I'm not sure what they're doing in terms of how they're training their models, but they have consistently had the lead on FrontierMath questions for quite some time. And I'm consistently hearing mathematicians saying that they're actually using versions of ChatGPT to solve equations in math. So this is going to be really interesting, and I think in the future, who knows, maybe we're going to get some actual crazy breakthroughs a lot sooner than you think. And you can see that on FrontierMath Tier 4, which is the hardest problems, once again GPT-5.4 Pro is dominating on this benchmark, much better than Opus 4.6. Although I don't see Gemini 3.1 Pro there, so of course we will have to wait for that. And remember guys, the FrontierMath benchmark was essentially designed to resist being solved for pretty much years. It was a privately held-out set of questions, so there was no real way to game it. So the fact that they're making these advancements so consistently is rather surprising, if you ask me.

So we take a look at this, and I mean, this is genuinely, truly surprising. But do you want to know the craziest thing about all of this with GPT-5.4 Pro? And that's why I said I didn't want to make this too much of a benchmark video, but this is pretty crazy. So, when they were testing the benchmark, you can see this guy said: "It finally happened. My personal move 37 moment. I am deeply impressed. The solution is very nice, clean, and feels almost human. While testing new models in the last week, I felt this was coming." And if we're actually talking about why this benchmark matters rather than just standard numbers: this mathematician, Bartos, had a problem he had worked on personally. And this wasn't a benchmark problem. It was a problem that nobody had solved for 20 years, which is pretty crazy, and which means that you couldn't have scraped this from the internet.

Now, he's referring to "move 37". That's a direct callback to AlphaGo's famous move 37 in 2016, a move that essentially shocked every Go player on the planet, because no human would ever play it, yet it turned out to be genius. And that moment was widely considered the first time AI did something genuinely superhuman in a complex domain. And the researcher here is essentially saying that GPT-5.4 just did his personal version of that. So this mathematician had this problem for 20 years, two decades of work, and GPT-5.4 solved it with a solution that he describes as very nice and almost human-feeling.

Now, what's interesting is that they ran GPT-5.4 ten times on the hardest tier of problems and got 38%. And one of those runs solved a problem that no model had ever solved before. That's not just an incremental improvement, that's a qualitative threshold being crossed. So, I mean, maybe this isn't the singularity, but I think we're starting to see AI cross that individual frontier, and now it's working at, you know, a completely new level. And I wonder how long that will last. So that's why I said that I didn't want to talk too much about benchmarks, but it's important to understand that it's a different kind of benchmark we're now looking at when the models are this good.
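A quick aside on that "ran it ten times and got 38%" figure: results like this are often reported with a pass@k-style estimator, so here's a minimal sketch of the standard unbiased pass@k computation from Chen et al.'s Codex paper. Whether Epoch AI reports FrontierMath this way is an assumption on my part, and the per-problem success counts below are invented purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. 2021, the Codex paper):
    chance that at least one of k samples is correct, given c correct
    answers observed out of n attempts on a problem."""
    if n - c < k:          # not enough failures left to fill k samples
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Invented per-problem success counts out of n=10 runs each;
# NOT real FrontierMath data, purely for illustration.
successes = [0, 0, 1, 0, 3, 0, 10, 0, 2, 0]
score = sum(pass_at_k(10, c, 10) for c in successes) / len(successes)
print(f"pass@10 ≈ {score:.0%}")  # 40%: share of problems solved at least once
```

With n = k = 10 this reduces to "fraction of problems solved at least once across the ten runs", which matches the best-of-several flavor of the claim.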
I mean, if you want to take a quick look at just how much improvement there has been from GPT-5 to GPT-5.2 to GPT-5.3 Codex, then all the way up to GPT-5.4, you can see that the improvements are still continuing, and they're rapidly approaching human capabilities on many of the benchmarks that are extremely difficult. One of those is, of course, the ARC-AGI benchmark. Now, I guess you could argue that this has already passed the human baseline, because the human baseline is around 85%, and GPT-5.4 Pro High is up there at around, I think, 83 to 84%, and it's costing around $30 to $50 per task. Now, that is pretty expensive when you think about it. But what is crazy is that this model is pretty much right next to Gemini 3.1 Deep Think, or Gemini 3 Deep Think. And the reason that is so surprising is because the Gemini Deep Think models were just really large, and of course Google have been working on math for quite a long time. So it's pretty interesting to see OpenAI and Google pretty much going head-to-head on the ARC-AGI-2 benchmark. Remember guys, the ARC-AGI-2 benchmark is specifically designed to test genuine reasoning and fluid intelligence, not memorized knowledge. It's one of the hardest and most respected benchmarks, and the fact that it only hits 83% while costing $30 per task is pretty crazy. Now, I will say that some other models do look a little bit more cost effective, but then again, at the end of the day, intelligence is intelligence, and the high score is the high score.

Segment 3 (10:00 - 15:00)

Now, when I was looking at benchmarks, this was one that I came across that was pretty crazy, okay? And you might think, why does this benchmark even matter? I'm about to show you. So, you know, I saw this quote that said: "GPT-5.4 is the best model we've ever tried. It's now at the top of the leaderboard on our APEX agents benchmark, which measures model performance on professional services work. It excels at creating long-horizon deliverables such as slide decks, financial models, and legal analysis while delivering top performance."

Now, why does this matter? Well, the reason that I personally believe these are the benchmarks that matter is because the other benchmarks have been pretty much saturated; you can't really tell the difference. But what matters here is that GPT-5.4 just hit 52% on the hardest real-world professional benchmark we've seen. And that's the first time a model has crossed 50% on tasks that were designed by investment bankers, consultants, and lawyers to represent their actual daily work. So this was built by Mercor, and it launched in January 2026: 480 tasks across 33 simulated work worlds. And you had real professionals, bankers, consultants, and lawyers, and they spent 5 to 10 days building realistic scenarios with actual files, tools, and software. And models have to complete tasks the same way a human junior employee would. No asking for help, pretty much single shot. And those include financial modeling, slide decks, legal memos, and market analysis. And the crazy context is just how fast this moved. When this benchmark launched in January 2026, the best score was 24%. And now GPT-5.4 has basically doubled that in roughly 6 to 8 weeks. And this is the thing that most people don't realize: that's the benchmark progress curve that everyone keeps, you know, warning people about, okay? And that's a crazy result on real-world tasks. So this isn't some math benchmark. These are real qualitative differences in the kind of work that people, you know, actually do in the office, and it's pretty crazy.

Okay, so now, of course, you could argue that 52% still means failing nearly half of all professional tasks, and these tasks have no internet access and no clarification allowed; real work is a lot messier, and models could probably do better with those tools. And the benchmark is open source, so labs are probably going to optimize for it. But remember guys, this was specifically designed to measure how close AI is to replacing junior white-collar workers: investment banking analysts, junior consultants, paralegals. And these are exactly the roles being targeted. Going from 24% to 52% in 6 weeks means that this is accelerating really fast. Remember guys, this is something to think about if we're talking about job disruption.

Now, of course, once again, the benchmarks that matter. I remember looking at this before and thinking, okay, this is the section I will pay attention to the next time OpenAI release a model. And this is GDPval. It's basically OpenAI's own internal benchmark, and it's worth noting that they made it themselves. It essentially tests the AI against real knowledge workers across 44 occupations in the nine industries that contribute most to US GDP. Things like writing sales presentations, building spreadsheets, scheduling, legal work, financial analysis. And what the chart shows is that the dotted line is a human professional doing the same task. Every model on that chart is beating it.
GPT-5.4 is matching or beating a human 83% of the time, and GPT-5.4 Pro is at 82%. The most important stat is that these models complete tasks roughly 100 times faster and 100 times cheaper than a human expert. And the honest caveat is that it's one-shot: the model gets one attempt at a well-defined task, and real jobs involve back-and-forth iteration and context building over time. So 83% is of course very impressive, and it doesn't mean that, you know, it's going to be replacing professionals tomorrow, but it should give you guys the understanding that model capabilities have moved on from "okay, we're just going to saturate high-school benchmarks" to actual work that people are doing. And that's why, okay, it's pretty exciting that the models can do this, because of course you could use it to do pretty much whatever you want. But at the same time, it is a little bit concerning that they're increasing the capabilities that quickly. I mean, if we look at what OpenAI have said here, they describe an AI model optimized for finance workflows: GPT-5.4 Thinking is the most advanced model, ideal for financial reasoning and Excel-based modeling, and they're working with industry practitioners to improve GPT-5.4 on real-world finance workflows that often take analysts hours or days to complete. And you can see right there the graph is just going up in terms of improvement, and that's actually surprisingly better than Opus 4.6.

Segment 4 (15:00 - 20:00)

I mean, if you want to look at some side-by-sides, I know this probably isn't the highest quality if you're on mobile, but essentially what you're seeing here is a lot more detail and a lot more, I guess you could say, accuracy in how the model breaks down the question and how it delivers that data. Not only that, but if you have it in a Word document, and remember guys, you can actually now install these extensions. So if you just go on the app store, instead of the Copilot button you will have a ChatGPT button if you install it. I currently have the Claude one, but of course you can, you know, install whichever one you want, and of course it's there in the presentation software too. I did actually use this before, and it wasn't that good, but it seems that the GPT-5.4 model has once again improved, which is why I'm saying these are the kinds of, you know, tasks that a lot of individuals are doing. And of course, this is just showing you how the creative suite is going to be affected.

Now, if you actually want to take a look at the creative writing, I think this one is really important, because I think a model that can't write creatively or actually engage with humans on a meaningful basis is going to be pretty annoying to speak to, which is what we had with the prior models. Hence the reason for it being so low on creative writing. But right now, you can see GPT-5.4 High is actually ranked second in creative writing. So previously, I mean, Sam Altman literally got on stage and said, "Look, we actually did mess up GPT-5.2. I'm pretty sure we messed that one up." And they basically said, "Look, we were trying to focus on math and coding so much that we kind of messed up how the model actually responds to you." It was kind of like, you know, one giant calculator rather than a model that you could actually speak to. And so right now, it seems that they've fixed that. They've ironed out those creases and made this a lot more rounded. Now, of course, this is, you know, human votes, and it only has 390 votes. So as more votes come in, the model could actually be number one. So I definitely would test this in the creative writing area, because it probably fares a little bit better than you think. I tested it myself, and it is pretty decent. I will say it is a lot better than it was before, and it was certainly frustrating a few weeks ago. So, of course, it really does depend on how you use the model.

Now, of course, there is coding. So there was this theme park simulation game made with GPT-5.4 from a single, lightly specified prompt, and they actually used a Playwright interactive browser for play testing, plus image generation. So this was pretty cool, and the reason I like this so much is because they're essentially able to use a framework where they can get the model to test itself. So it was really cool, because I think, you know, what most people aren't realizing about this is that now we have the full thing. We have this full loop where the models are able to not only take in images, but use those images to code something, then test it, then play it back. And of course, that's probably going to consume a large amount of tokens.
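That generate, render, screenshot, feed-it-back loop is easier to see as code. Here's a minimal sketch of what such a self-testing setup could look like: the Playwright calls are the library's real Python API, but generate_game_code and critique_screenshot are hypothetical stand-ins for model calls, not anything OpenAI has documented.

```python
# Minimal sketch of the generate -> render -> screenshot -> feedback
# loop described above. The Playwright calls are real; the two helper
# functions are hypothetical placeholders for model API calls.
from pathlib import Path
from playwright.sync_api import sync_playwright

def generate_game_code(feedback: str) -> str:
    # Placeholder for a model call that writes a self-contained HTML/JS
    # game, folding in last round's feedback. Returns a trivial page
    # here so the sketch runs end to end.
    return f"<html><body><h1>theme park, notes: {feedback}</h1></body></html>"

def critique_screenshot(png_path: str) -> str:
    # Placeholder for a vision-model call that inspects the rendered
    # frame and returns concrete fixes ("UI overlaps", "no sprites", ...).
    return f"placeholder critique of {png_path}"

feedback = "first draft, no feedback yet"
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    for i in range(3):  # a few improvement rounds
        Path("game.html").write_text(generate_game_code(feedback))
        page.goto(f"file://{Path('game.html').resolve()}")
        page.screenshot(path=f"frame_{i}.png")  # what the model "sees"
        feedback = critique_screenshot(f"frame_{i}.png")
    browser.close()
```

The design point is simply that the screenshot closes the loop: the model gets to look at what its own code actually renders and revise accordingly.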
But this kind of thing is just surprising, because when you have all of these benchmarks improving 15%, you know, every one or two months, I think the jumps that you're going to see in a year or two are just going to dwarf anything you could have imagined. And there was another one here, a tactical RPG created with GPT-5.4 over multiple turns, using the interactive browser for play testing and image generation for the game's visual style.

Another thing we can add is that this model is essentially the first general-purpose model with native computer use capabilities. And apparently this marks a major step forward for developers and agents alike: it's currently the best model available for developers building agents that complete real tasks across websites and software systems. And this model works really well in terms of, you know, visually seeing things and using a computer. So you can see right here, this is, you know, where the model is essentially reading information and entering it into an invoice intake system. And the crazy thing is that this is actually real time. So this part of the video is not sped up, which is rather surprising to me, because currently, when you do have an AI do tasks, it usually is relatively slow. But you can see here that the model's performance and flexibility look really, really good. And on the OSWorld benchmark, which essentially measures a model's ability to navigate a desktop environment through screenshots and keyboard and mouse actions, GPT-5.4 achieves state-of-the-art at 75%, exceeding GPT-5.2. And so, whilst this may not seem crazy, I think this is going to be one of those models that a lot of people probably start to use again. I mean, from what I've heard, a lot of people are finding that this model is good in terms of the raw capabilities, what it's able to do with agents, and the coding capabilities too.
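To make "screenshots in, keyboard and mouse actions out" concrete, here's a minimal sketch of that kind of computer-use loop. The pyautogui calls are that library's real API; choose_action and its action schema are hypothetical stand-ins for the model side, invented for illustration. OSWorld's actual harness and OpenAI's computer-use interface are their own things and will differ.

```python
# Minimal sketch of an OSWorld-style computer-use loop: screenshot in,
# one mouse/keyboard action out, repeat until the model says "done".
# pyautogui calls are real; choose_action() is a hypothetical stub.
import pyautogui

def choose_action(task: str, screenshot) -> dict:
    # Placeholder for the model: map (task, current screenshot) to one
    # action, e.g. {"type": "click", "x": 400, "y": 300},
    # {"type": "type", "text": "Invoice #1042"}, or {"type": "done"}.
    return {"type": "done"}  # stub so the sketch terminates

task = "Read the invoice PDF and enter it into the intake form"
for step in range(50):  # hard step budget, as eval harnesses impose
    frame = pyautogui.screenshot()       # what the model "sees"
    action = choose_action(task, frame)
    if action["type"] == "click":
        pyautogui.click(action["x"], action["y"])
    elif action["type"] == "type":
        pyautogui.typewrite(action["text"], interval=0.02)
    elif action["type"] == "done":
        break
```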

Segment 5 (20:00 - 23:00)

But of course, now I'm going to talk about some really strange things that I found from reading the research paper. There were only three things that I found interesting, and we're going to get into them now.

So, one of them was super interesting, because this might be one of the most interesting slides. OPQA is essentially a set of real internal problems that took, you know, an actual team at OpenAI over a day to solve: things like unexpected performance regressions, training bugs, anomalous metrics. Genuinely hard, novel engineering problems, not really textbook stuff. And the chart actually shows that GPT-5.4 Thinking scores 4%, which is worse than every other model on the chart. So GPT-5.2 Codex scored 8%, and GPT-5.4 Thinking was at 4%, which was dead last. Now, OpenAI actually put this in their own technical report; they didn't hide it. And what it shows is that on genuinely novel, hard problems that require real reasoning, the model in this case scores barely above zero and is currently regressing. So this is kind of weird. I'm not sure why this is the case. Maybe it's just noise. But I did find this interesting.

But the next thing here, I think you should all pay attention to, because I think this is going to really change the landscape in 2027, 2028, 2029. And I genuinely mean this, because they said that they are now treating this model as High risk, and essentially they're doing that for the cybersecurity threat. So GPT-5.4 Thinking is the first general-purpose model that required them to implement specific safety mitigations for high capability in cybersecurity. And when they put this model up against professional-level capture-the-flag challenges, it achieved an 88% success rate. In a simulated network environment, the model successfully executed complex multi-step attacks, such as exploiting a vulnerable Azure web app to steal credentials and modifying controls to move laterally through the network. Now, I think you have to understand here that when they, you know, classify a model as High in their preparedness framework, that means they believe it could potentially automate end-to-end cyber attacks against hardened targets.

Now, the problem here is that I think this changes the future fundamentally, because if this model can do what they say it can do, think about what that means for GPT-6. Every single cybersecurity benchmark has gone up with every model generation: GPT-5.2 was 47%, GPT-5.3 80%, GPT-5.4 was at 73%. The trend is up. The problem is that with GPT-6, GPT-7, it's going to hit the Critical level, okay? And that Critical level means that it could potentially cause catastrophic, large-scale damage autonomously. Meaning: would they actually release those models? Under OpenAI's current framework, Critical cybersecurity capability means that a model could meaningfully enable attacks on critical national infrastructure, power grids, water systems. And if GPT-6 is the first model where they're going to have to make that call, that's going to be super interesting, because they're probably going to have to introduce ID verification. Right now, anyone with an API key can access GPT-5.4. And when we think about how the rest of society works, it doesn't really make sense. The logic is kind of straightforward: you need an ID to get a gun, a driving license to drive, an ID to open a bank account. A model that can autonomously execute sophisticated cyber attacks at scale is, at a minimum, in that risk category. Now, I'm not advocating for this; I'm not saying it should or shouldn't happen. I'm just saying: think about the future, because it's probably going to happen.
We've already seen what the government has done with Anthropic and their powerful models.
