# OpenAI Basically Dropped AGI?... (o3 and o4 mini)

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=cropdnyqov0
- **Date:** 17.04.2025
- **Duration:** 23:42
- **Views:** 34,085

## Description

Join my AI Academy - https://www.skool.com/postagiprepardness 
🐤 Follow Me on Twitter https://twitter.com/TheAiGrid
🌐 Checkout My website - https://theaigrid.com/


Links From Today's Video:
https://openai.com/index/introducing-o3-and-o4-mini/

Welcome to my channel, where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos.

Was there anything I missed?

(For Business Enquiries)  contact@theaigrid.com

Music Used

LEMMiNO - Cipher
https://www.youtube.com/watch?v=b0q5PR1xpA0
CC BY-SA 4.0
LEMMiNO - Encounters
https://www.youtube.com/watch?v=xdwWCl_5x2s

#LLM #Largelanguagemodel #chatgpt
#AI
#ArtificialIntelligence
#MachineLearning
#DeepLearning
#NeuralNetworks
#Robotics
#DataScience

## Contents

### [0:00](https://www.youtube.com/watch?v=cropdnyqov0) Segment 1 (00:00 - 05:00)

Sam Altman quoted a tweet that basically says this is at or near genius level. We have someone else who worked on model training at OpenAI saying they were tempted to call this model AGI. Another person, Tyler Cowen, essentially said, "I think it is AGI, honestly." So did OpenAI just drop the AGI model with o3 and o4-mini? Let's dive into absolutely everything. So, OpenAI just released two state-of-the-art models, o3 and o4-mini. Let's firstly get into o3, because it's absolutely crazy. o3 is their most powerful reasoning model, pushing the frontier in coding, math, science, visual perception, and a range of different benchmarks. It's exceptionally good at coding and set new records on Codeforces and SWE-bench, the benchmarks that test coding ability in real-world scenarios. They also introduced another model called o4-mini, a smaller model optimized for fast, cost-efficient reasoning. It achieves remarkable performance for its size and cost, particularly in math, coding, and visual tasks, and it is the best-performing model on the AIME math benchmarks for 2024 and 2025. So overall, OpenAI released two frontier models that are exceptional at reasoning across some of the hardest subjects. Now, that wasn't the only thing they did. They made a few smaller updates that most people won't notice until they really start to use the models. And one of those updates, which I think is absolutely incredible, is something called thinking with images. Thinking with images is crazy because, for the first time, these models can integrate images directly into their chain of thought. They don't just see an image; they actually think with that image as part of their reasoning. And this is absolutely incredible when it comes to reasoning about certain problems.
I mean, text-level reasoning is of course good, but reasoning about a problem you can visually see takes things to the next level. You can upload a picture of something, and ChatGPT will zoom in, analyze that image, pick certain things out of it, and reason across the internet with that data. It's absolutely incredible at figuring out exactly what is inside an image, and not just that, but at reasoning with the context of the wider web and wider sources, which is game-changing in terms of getting an accurate representation of what you see. You can upload a picture of a whiteboard, a textbook, a diagram, or a hand-drawn sketch, and the model can interpret it even if the image is blurry, reversed, or low quality. And with tool use, one of the craziest things is that it can manipulate images on the fly, rotating, zooming, or transforming them as part of the reasoning process. I do think this is quite underrated, because in a few days you're probably going to see a bunch of tweets or social media posts about how crazy ChatGPT is at reasoning with images. I think this is going to completely change the game, because it's moving toward a system that is more agentic and actually feels like, not AGI specifically (we'll get into more of that later), but more like someone who's actually thinking about the problem, looking at the images, and then reasoning around them. So, let's take a look at some examples. One of the examples OpenAI had on their website involved putting an image in, in this case what seems to be some kind of schedule, and they show us the previous behavior with OpenAI's o1 model.
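As a side note on how you'd feed an image into one of these models programmatically: the sketch below shows the content-parts shape the OpenAI chat API expects for vision input. This is my own minimal illustration (the surrounding client call, model choice, and error handling are omitted), not code from the video or from OpenAI's announcement:

```python
import base64


def build_image_prompt(question: str, image_bytes: bytes) -> list:
    """Pair a text question with a base64-encoded image in the
    content-parts shape the OpenAI chat API expects for vision
    input. The resulting list is what you would pass as `messages`
    to the client; the actual API call is omitted here."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }]
```

The model then decides on its own, as part of its chain of thought, whether to crop, zoom, or rotate the image with its tools; none of that is requested in the prompt.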
We can see that previously, when you were reasoning with an image, the model would just take that image at face value and reason for a short amount of time about what it sees. For example, if it sees something blurry that isn't legible, it's not going to zoom in and change its reasoning process; it's just going to digest that image as one whole. However, with thinking with images, once you use o3's capabilities, you're able to reason with that image on a completely different level. Like I said before, this is completely game-changing, because it now allows the model to pick out certain things from the image and work out what is important and what isn't. And the crazy thing is that it can take certain parts of the image, look around the web for things it can reference and cite, which leads to a much more accurate representation of what is in your image and, overall, a much better response. You can see right here it's

### [5:00](https://www.youtube.com/watch?v=cropdnyqov0&t=300s) Segment 2 (05:00 - 10:00)

able to zoom in. It's able to crop certain things, and it's able to really get down into the details when analyzing these images. And the reason people are saying this hints at AGI, or may actually be at least some kind of web AGI, is that having a system where you ask a question and it goes to the image and the internet, zooming in, rotating the image in some cases, deblurring it, doing all these kinds of things to come to a final solution: a lot of people would have called that AGI if you had asked them maybe six to eight months ago. You can see right here one example that went pretty viral, Dan Shipper stating that o3 can repeatedly zoom and crop images in order to read small handwritten text. And this wasn't the only demonstration of just how powerful this new reasoning capability can be when you give it a prompt that demands it. For example, take a look at this prompt, where someone had a hand-drawn diagram on a sticky note, oriented upside down, in a mess of spilled toys, and they simply gave it the prompt "solve". It thought for 1 minute and 50 seconds and then actually managed to solve the problem right there. Now, I don't know about you guys, but having an AI system that can not only rotate an image and get it right, zoom in, and extract the data from the image, but then use its mathematical capabilities to solve said problem: I think that is a remarkable feat we really shouldn't take for granted. Now, like I said before, the vision is absolutely incredible. I would say, though, that it is probably a little bit too incredible. And you might be thinking, what do you mean by that? Well, as you know, people test these models quite a fair bit, and one thing people have tested is the location-finding ability of this model. Now, I do think that by the time you watch this, they may have changed this behavior.
As we know, model deployments do vary over time due to user feedback, and one thing I don't think they will keep in the model for long is what many people are calling location AGI. What they mean by that is that you can put almost any picture into ChatGPT, maybe just a photo out of the window or a picture of a fairly featureless location, and it's able to accurately work out exactly where you are. So this is clearly something where, like I said before, combining thinking with images with advanced reasoning capabilities and tool use really does change the game. I've seen several examples on Twitter where people are, not quite doxing themselves, but putting their location out there and proving that ChatGPT, with very limited information, can accurately pinpoint where you are on the globe. You should probably try this yourself; I've tested it with a few images, and it was eerily accurate about where the picture was taken. I've seen some users just take pictures of their restaurant food, and it's able to locate that specific restaurant and area, which is really incredible. So it's pretty crazy, to say the least, when we think about how advanced this model is in terms of what it's able to do with access to a suite of tools. And that's one of the things I really want to drive home in this video: o3 is not just a text-reasoning model. It's essentially an agent with a bunch of built-in tools that it can use to actively go out and do things. Now, if we're going to talk about vision, it makes sense to mention the multimodal benchmarks. I'm not going to bore you with statistics, but we do need to take a look at what is being tested here. Oftentimes we see MMLU, MMMU, and all of this stuff.
MMMU basically tests whether the AI can solve problems that college students would face when those problems include images. Think of questions like "what's happening in this biology diagram?" or "what does this physics illustration show?" It checks whether the AI can understand pictures and use that understanding to solve actual college-level problems. MathVista tests whether the AI can solve math problems that are presented visually. Imagine geometry problems with shapes, graphs that need interpretation, or visual puzzles that require mathematical thinking. It's checking whether the AI can see math problems and figure them out, not just process text-based questions. Now, the last one, the scientific reasoning benchmark, basically checks whether the AI can make sense of graphs, charts, and figures found in scientific papers. Can it really understand what a complex chart is showing? Can it draw conclusions from experimental data visualizations? This is particularly challenging because scientific figures often present dense, specialized information that requires domain knowledge to interpret correctly. In all three cases, the benchmarks here

### [10:00](https://www.youtube.com/watch?v=cropdnyqov0&t=600s) Segment 3 (10:00 - 15:00)

are testing the AI's ability to combine visual understanding and reasoning, that is, drawing correct conclusions or solving problems based on visual information. And for all of these, we can see that o3 shows a significant leap in capability over o1. So there is a huge jump here in vision capability. If you have any vision-related tasks in your day-to-day, I would definitely test this model, because like I said, it doesn't seem like just an image classifier; it's something that reasons with the image and uses its tools to really get down to the details of exactly what you're seeing. Now, I do want to say that thinking with images isn't perfect. As always, there are limitations to every AI system, and with the o3 model we're currently seeing one in these vision systems. This demonstration reminds me of a paper I saw around nine months ago, when I was testing this on a private benchmark for an individual use case of mine and kept running into an issue I didn't think anyone else had. But first, let me show you this example. Basically, this is a picture of a kid's drawing with five names and five different characters, and lines running to every single character. In this example, the model didn't actually manage to get it right; it links the colors to the wrong characters. For example, Bob here is actually light green, but the model states that Bob is pink, mixing him up with Margarita. Now, maybe it's a prompting issue; maybe we could elicit further capability from the model if we prompted it a little differently. But one of the things I saw on previous benchmarks, like I said, was this paper right here.
It was called, well, "Vision Language Models Are Blind." It had seven different tasks as examples, asking questions that for us are really easy to see. Take, for example, task one: counting line intersections. They had 1,800 images of 2D line plots drawn on a white canvas, each plot consisting of two line segments, and they simply wanted to see whether these AI systems could figure out whether the lines intersected, basically whether the lines were touching or not. And when we look at the examples, we can see that in many instances the models really did just get a lot of this wrong. Now, I have to say this was nine months ago, and AI has changed dramatically, so there has probably been a large improvement here. But the point I'm trying to make is that AIs do have inherent drawbacks in how they process images, and of course this will get solved in the future. And whilst we can easily look at a line, follow it around with our eyes, and see which character it connects to, for AI this is something that is just a little bit on the harder side. No doubt future changes to these systems will iron this out. Now, here's the big question. A lot of people are saying this is potentially AGI, and honestly, for the first time, I really don't blame them. You can see here from John Hallman, a model trainer at OpenAI: "When o3 finished training and we got to try it out, I felt for the first time that I was tempted to call a model AGI. Still not perfect, but this model will beat me, you, and 99% of humans on 99% of intelligence assessments," and one can start to see the light at the end of the tunnel.
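What makes the paper's result striking is that the intersection question the models fail at has an exact, tiny answer in code. Here's a sketch of the classic orientation test that decides whether two 2D segments touch or cross (my own illustration of the underlying geometry, not code from the paper):

```python
def orientation(p, q, r):
    """Sign of the cross product (q-p) x (r-p):
    +1 counter-clockwise, -1 clockwise, 0 collinear."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def on_segment(p, q, r):
    """True if collinear point r lies within the bounding box of pq."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0]) and
            min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(a, b, c, d):
    """True if segment ab touches or crosses segment cd."""
    o1, o2 = orientation(a, b, c), orientation(a, b, d)
    o3, o4 = orientation(c, d, a), orientation(c, d, b)
    if o1 != o2 and o3 != o4:
        return True  # proper crossing
    # collinear edge cases: an endpoint lies on the other segment
    return ((o1 == 0 and on_segment(a, b, c)) or
            (o2 == 0 and on_segment(a, b, d)) or
            (o3 == 0 and on_segment(c, d, a)) or
            (o4 == 0 and on_segment(c, d, b)))
```

So for a program the task is a few comparisons, while a vision model has to recover the endpoints from pixels first, which is exactly where it stumbles.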
Now, I agree with this statement a lot, because people have said AGI is this, AGI is that, but if we're talking about a basic definition of AGI, a system that can beat the average person on a range of different intelligence tests, I think the only thing really stopping this kind of system from being called AGI is its ability to use tools with a very low hallucination rate. Essentially, most people right now trust AI to write an email or produce a decent report, but you're not going to trust an AI to book your doctor's appointment or use your credit card, because on the 1% or 3% chance it messes up, the consequences are just too severe at the moment. But I can see why people are starting to call this model AGI, because if I can upload documents, give it images, and it will zoom in, go around the web, and think with its brain for a long time, it's definitely a really smart system that a lot of people haven't factored in yet, in the sense of understanding that what we have here is an absolutely incredible piece of technology. Another piece of information circulating the Twittersphere was that with o3's release, April 16th is AGI day. Tyler Cowen says: "I think it is AGI, seriously. Try asking it lots of questions, and then ask yourself: just how much smarter was I expecting AGI to be? As I've argued in the past, AGI, however you define it, is not much of a social event per

### [15:00](https://www.youtube.com/watch?v=cropdnyqov0&t=900s) Segment 4 (15:00 - 20:00)

se. It still will take us a long time to use it properly, and I do not expect securities prices to move significantly. Benchmarks, maybe. AGI is like... I know it when I see it, and I've seen it." Essentially, right here he's stating that this is basically AGI, and honestly, in some regards, I kind of agree. Now, there was one benchmark that had people really on the edge of their seats about what's to come in the future, and that is the math competition benchmark. The AIME 2024 and 2025 math competition benchmarks are some really difficult math benchmarks. When we look at them, what we can see is that o3 and o4-mini have essentially saturated them: we can see a score of 99.5, which is only 0.5 away from 100. Now, I don't know about you guys, but that's almost a perfect score. Like I said, this is the tweet that had been doing the rounds on Twitter. You can see here that David Shapiro, a popular figure in the Twittersphere, or should I say AI sphere, basically said: "AI has solved math. OpenAI did it with o4. Not 'it is close to solving math.' Not 'it is competitive at math.' It is solved. This is far bigger than anyone realizes. Let me explain why." Now, before everyone gets ahead of themselves, Noam Brown, who actually works on these reasoning models at OpenAI, says: "We did not solve math. For example, our models are still not great at writing mathematical proofs. o3 and o4-mini are nowhere near close to getting International Mathematical Olympiad gold medals."
And so basically, whilst yes, these models have done really well on the benchmarks, there's still a long way to go in terms of actually solving mathematics, because that is a completely different game. Even right now, as we speak, there are many problems in mathematics that are currently unsolved, and solving them would lead to fundamental changes in our understanding of different problems. Now, one of the things people talk about in AI, especially with models like o3 and the other reasoning models, is that there are severe implications if we do solve math, because math basically underpins many other subjects. He goes on to talk about biochemistry, robotics, spaceflight, cryptography, nuclear physics, and the blockchain, and how once you manage to solve math completely, you're going to be able to impact many other areas, which is quite true. So I do think that if AI ever, quote unquote, solves math, it's going to be a really impressive day, though I don't even know if the statement "completely solve math" even makes sense; I guess it would mean we have a complete understanding of what goes on in our world. Now, of course, you might be wondering: all right, these models are super smart, but where are they placed? How do they stack up against other models? Well, surprisingly, o4-mini (high) manages to just edge out Gemini 2.5 Pro on the Artificial Analysis index, which incorporates seven evaluations such as MMLU-Pro, GPQA Diamond, Humanity's Last Exam, and other benchmarks used across these AI models. And we can see here that on Humanity's Last Exam, a benchmark of expert-written questions spanning mathematics, the humanities, and the natural sciences, built on what is basically a private dataset, o3 also manages to take the cake with a score of 19.20, just above Gemini 2.5 Pro.
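To make "an index that incorporates seven evaluations" concrete: a composite index is, at its simplest, a (possibly weighted) average of per-benchmark scores. The sketch below is a toy illustration with made-up numbers; it is not Artificial Analysis's actual methodology or weighting:

```python
def composite_index(scores: dict, weights: dict = None) -> float:
    """Combine several benchmark scores (each on a 0-100 scale)
    into one number. With no weights given this is a plain mean;
    a real index may weight or normalize each evaluation
    differently."""
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_w = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_w


# hypothetical per-benchmark numbers, for illustration only
model_scores = {"MMLU-Pro": 85.0, "GPQA Diamond": 78.0, "HLE": 19.2}
print(round(composite_index(model_scores), 2))  # → 60.73
```

The takeaway is that a one-point lead on such an index can come from very different per-benchmark profiles, which is why it's worth looking at the individual evaluations too.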
So once again, the model seems to just edge out Gemini 2.5 Pro. However, there is a slight caveat: o3 is quite expensive in terms of cost-effectiveness for its intelligence compared to Gemini 2.5 Pro. Of course, none of these models are free, but if we look at pure cost-effectiveness, Gemini 2.5 Pro does win. I will say, though, in OpenAI's defense, the cost-performance data shows that o3 is actually much cheaper than its predecessor. So when we look at what the model is doing here, relative to the amount of intelligence you're getting on various benchmarks and with various tools, o3 isn't as expensive as the previous models for the raw amount of intelligence you're getting. So o3 is probably going to be the super agent everyone really wanted. Now, another section I have to talk about briefly is coding. Coding is quite vast in the different benchmarks you can have: real-world benchmarks such as SWE-bench, or benchmarks like LiveBench, which are a little different. On LiveBench, we can see that o3 (high) and o4-mini (high) have surpassed Gemini 2.5 Pro Experimental, not by much, but by enough that they take the number one spot currently, which is rather fascinating as well. And then we have a company called Charlie Labs AI, which is trying to create autonomous software engineers. Their evaluation gives Charlie, their autonomous agent, real GitHub bug reports, from optimizing database queries and updating CSS to enforcing security policies. Then they ask an LLM to judge its PRs against

### [20:00](https://www.youtube.com/watch?v=cropdnyqov0&t=1200s) Segment 5 (20:00 - 23:00)

human solutions. And basically, o3 is setting some serious benchmarks compared to Sonnet. Now, once again, if we take a look at coding, some of the most realistic benchmarks are SWE-Lancer and SWE-bench Verified. The reason I like SWE-Lancer a lot is that with this benchmark you can actually quantify how much money the AI system is able to earn. Basically, they take real freelance tasks from Upwork and simulate how much of the available payout for those jobs an AI system would be able to satisfy. And we can see here, once again, there's a clear jump in the amount of money these systems would be able to earn. I'm not going to say they can earn a full salary; not that $65,000 isn't a lot of money, but on some benchmarks the overall prize pool was around $1 million. So I think this is still an interesting benchmark in terms of seeing how much money these models could theoretically earn if they were purely agentic themselves, which is pretty crazy when you think about it in terms of real-world use cases. And once again, between o3-mini and o1 there's a significant jump on SWE-bench Verified, the software engineering benchmark. So you're seeing significant jumps across all areas, like I stated before. Now, there's also the safety area. I did see a ton of posts about safety, because of course when you have models like o3 and o4-mini, people wonder about the safety capabilities, and OpenAI recently updated their safety approach. They talk about the fact that they completely rebuilt their safety training data, adding new refusal prompts in areas such as biological threats, malware generation, and jailbreaks.
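The dollar-weighted scoring idea behind SWE-Lancer, mentioned above, can be sketched in a few lines. The task list and payouts below are invented for illustration; the real benchmark uses actual Upwork task values:

```python
def dollars_earned(tasks):
    """Sum the payouts of the tasks the model solved and report it
    alongside the fraction of the total available payout. Each task
    is a dict: {"payout": dollars, "solved": bool}."""
    earned = sum(t["payout"] for t in tasks if t["solved"])
    total = sum(t["payout"] for t in tasks)
    return earned, earned / total


# invented example tasks, not real benchmark data
tasks = [
    {"payout": 250.0,  "solved": True},   # small CSS fix
    {"payout": 1000.0, "solved": False},  # database query optimization
    {"payout": 750.0,  "solved": True},   # security policy enforcement
]
earned, fraction = dollars_earned(tasks)
print(earned, round(fraction, 2))  # → 1000.0 0.5
```

Weighting by payout is what makes this benchmark feel "real world": solving one big, valuable task moves the score more than many trivial ones.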
And the reason I wanted to include this here is, number one, I'm going to make a separate video on AI safety, because this o3 model is a completely different beast when it comes to safety, really pushing the boundaries of what is even acceptable by today's standards. But I think it's pretty funny that every single time a model is released, this guy called Pliny on Twitter literally manages to jailbreak the model without fail. And you can see right here, he manages to make o4-mini (high) compile methods and proof-of-concept strategies that could cause significant disruption to macOS systems, which is pretty crazy. I think this is rather fascinating, because I don't know how he's able to get past the guardrails every single time, but it just goes to show that currently these LLM systems aren't completely secure. And there was also something really fascinating that I don't think most people saw from the model card. I saw a tweet that showcases something from the full paper: o3 seems to hallucinate about twice as much as o1, according to the system card. So hallucinations could scale inversely with increased reasoning, unlike with increased model size, because outcome-based optimization incentivizes confident guessing. This is pretty crazy, because if these models tend to hallucinate more as they get smarter, due to how their reasoning is trained, it could be quite the problem when it comes to verifying how these models achieve outstanding results, or even knowing whether they are telling the truth. Like I said before, there's this entire safety discussion about o3 and how this highly capable model tends to trick, lie, or deceive individuals, and I guess in some cases hallucinates a lot more than you'd think.
So, with that being said, let me know what you guys think about o3. I personally do think this is teetering on the edge of at least a web AGI or a computer AGI, and definitely the kind of super agent that is going to be absolutely incredible. And hopefully you guys enjoyed today's video.

---
*Source: https://ekstraktznaniy.ru/video/13021*