o3 breaks (some) records, but AI becomes pay-to-win


AI Explained · 25 Apr 2025 · 60,832 views · 2,549 likes


Video description
A green card, o3 vs Gemini 2.5, 6 benchmarks and a whole bunch of my thoughts on what on earth is happening in AI, from here to 2030. Plus, how AI is becoming pay-to-win, and why. Crazy times, 14 mins probably wasn't enough.

https://app.grayswan.ai/ai-explained
AI Insiders ($9!): https://www.patreon.com/AIExplained

Chapters:
00:00 - Introduction
00:33 - FictionLiveBench
01:37 - PHYBench
02:14 - SimpleBench
02:54 - Virology Capabilities Test
03:13 - Mathematics Performance
04:29 - Vision Benchmarks
05:43 - V* and how o3 works
06:44 - Revenue and costs for you
08:54 - Expensive RL and trade-offs
09:40 - How to spend the OOMs
13:27 - Gray Swan Arena

Green Card: https://techcrunch.com/2025/04/25/an-openai-researcher-who-worked-on-gpt-4-5-had-their-green-card-denied/
PHYBench: https://arxiv.org/pdf/2504.16074
Virology test: https://www.virologytest.ai/
How o3 Vision Works: https://arxiv.org/pdf/2312.14135 https://x.com/sainingxie/status/1912570624523829573
Visual puzzles: https://neulab.github.io/VisualPuzzles/
Fiction Bench: https://x.com/ficlive/status/1912863028141244850
GeoBench: https://geobench.org/
SimpleBench: https://simple-bench.com/
AIME 2025: https://openai.com/index/introducing-o3-and-o4-mini/
USAMO: https://x.com/mbalunovic/status/1914398518896193747
NaturalBench: https://linzhiqiu.github.io/papers/naturalbench/
Where's Waldo: https://uk.pinterest.com/pin/492792384225896298/
IMO and AlphaProof: https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/
Crazy Revenue: https://www.theinformation.com/articles/openai-forecasts-revenue-topping-125-billion-2029-agents-new-products-gain?rc=sy0ihq
Number of Users: https://www.theinformation.com/briefings/googles-gemini-user-numbers-revealed-court?rc=sy0ihq
Subscriptions pay to win: https://www.forbes.com/sites/paulmonckton/2025/04/23/google-leak-reveals-new-gemini-ai-subscription-levels/
GPU Trade-offs: https://x.com/sama/status/1915098951067554030
RL Scale-up Amodei: https://www.darioamodei.com/post/on-deepseek-and-export-controls
Log-linear Returns: https://x.com/bobmcgrewai/status/1895228291981943265
2030 Scaling: https://epoch.ai/blog/can-ai-scaling-continue-through-2030
Model Size: https://x.com/slow_developer/status/1874554473256997201
Adam on AGI: https://x.com/TheRealAdamG/status/1913998366632968381
Papers on Patreon: https://arxiv.org/pdf/2502.01839 https://arxiv.org/pdf/2504.13837
Chollet Quote: https://x.com/fchollet/status/1912934762580447447
OpenSim: https://opensim.stanford.edu/
Non-hype Newsletter: https://signaltonoise.beehiiv.com/
Podcast: https://aiexplainedopodcast.buzzsprout.com/

Contents (12 segments)

Introduction

This video is about rapid progress in AI. Progress that might soon be a little less US-centric, with news of a veteran OpenAI researcher being denied a green card. But it's been just a single-digit number of days since the release of o3, the latest model from OpenAI. And it has broken some records and in turn raised yet more questions. So, in no particular order, and drawing on a half-dozen papers, here are four updates on the state of play at the bleeding edge of AI. Now,

FictionLiveBench

just before we get to how much money these models will make for companies like OpenAI and Google, and how much money they will cost you, which model is actually the best at the moment? Well, that's actually really hard to say, because it depends heavily on your use case and the benchmark that you look at. At the moment, the two clear contenders for me would be o3 and Gemini 2.5 Pro. And I covered how they were neck and neck in some of the most famous benchmarks in the video I released on the night of o3's release. But since then, we've arguably got some more interesting benchmark results. Take the piecing together of puzzles within long works of fiction, up to, say, around 100,000 words. I honestly expected Gemini 2.5 Pro to keep its lead, in that it could piece together those puzzles at various lengths through to the longest texts. After all, long context is Gemini's specialty. But no, o3 takes the lead at almost every length of text. If you know that there's a clue in chapter 3 that pertains to chapter 16, then o3 is the model for you. Who cares about that?

PHYBench

Some of you will say, "What about physics and spatial reasoning?" Well, here is a brand new benchmark from less than 72 hours ago, and we can compare those top two contenders. We have Gemini 2.5 Pro in the lead, followed by o3 high. And bear in mind that Gemini 2.5 Pro is four times cheaper than o3. Notice, though, for reference, that human expert accuracy on this benchmark still far exceeds the best model. Imagine you had to learn about all sorts of realistic physical interactions predominantly through reading text, not experiencing the world. You would probably have the same problems. And,

SimpleBench

honestly, this explains much of the discrepancy between the top two models and the human baseline on my own benchmark, SimpleBench. Those two models are starting to see through all the tricks on my benchmark, but they're still failing quite badly on spatial reasoning. This isn't a question from SimpleBench or the physics benchmark, but it illustrates the point: if, for example, you put your right palm on your left shoulder and then loop your left arm through the gap between your right arm and your chest, well, you're probably following, but models have no idea what's going on. It's not in their training data, and they can't really visualize what's happening. I will come back to this example later, though, because soon, with tools, I could see them getting this question right. Speaking of getting questions right, we

Virology Capabilities Test

learned that o3 beats out Gemini 2.5 Pro on a test of troubleshooting complex virology lab protocols. o3, you will be glad to know, gets a 94th-percentile score. This is, of course, a text-based exam and isn't the same as actually conducting those protocols in the lab. You might notice I'm balancing things

Mathematics Performance

out, because now for a benchmark in which Gemini 2.5 Pro exceeds the performance of o3: competition mathematics. Now, you may have heard on the grapevine that o3 and o4-mini actually got state-of-the-art scores on AIME 2025. That is a high school maths competition. Without tools, both models got around 90%, but with tools, they got over 99%. What you may not know is that the AIME is just one of the tests used to qualify for the USAMO. That is a significantly harder proof-based maths test. Notice all of these are high school tests, though, which is very different from professional mathematics. Anyway, on the USAMO, you can see here that we have o3 on high settings getting around 22% right, compared to 24% for Gemini 2.5. Again, four times cheaper for Gemini. What's perhaps more interesting is that the USAMO is only a qualifier for the hardest high school maths competition: the International Mathematical Olympiad. And Google has a system, AlphaProof, that got a silver medal in that competition. Now, I've done other videos on AlphaProof, but I would predict that in this year's competition in July, Google might get gold. Back to some more down-to-earth domains, though.

Vision Benchmarks

What about simple visual challenges like this one? Given an image, can the model answer, "Is the squirrel climbing up the fence or is the squirrel climbing down the fence?" with these two images? Or, as another question, "Are these two dogs significantly different in size?" This benchmark is called NaturalBench. And, as you probably guessed, because I'm alternating in performance, o3 actually scores better than Gemini 2.5. Both, of course, still well behind human performance. Despite that first impression, it's actually Gemini 2.5 Pro that scores better at geoguessing: being given a random street view and knowing which country, and location within that country, you're looking at. In fact, the difference is quite stark, with 2.5 Pro way exceeding o3 high. Now I think of it, that's probably not too surprising, given Google's ownership of Google Maps, Google Earth, and of course YouTube and Waymo. Last benchmark, I promise. But how about visual puzzles? Which kite has the longest string here? The answer is C. And overall on the VisualPuzzles benchmark, we have Gemini 2.5 Pro even underperforming o1, let alone o3. Both, of course, still well behind the average human, let alone an expert human. Now allow me, if you will, 30 more seconds

V* and how o3 works

before we get to the question of money, because OpenAI basically gave away the V* method they use to improve so much in vision. You may have noticed how o3 seems to zoom in to answer a question. But what's the executive summary of V*? Essentially, the model gets overwhelmed by a high-resolution image. So what the method does is use a multimodal LLM to guess at what part of the image is going to be most relevant to the question. That part of the image is then cropped, added to the visual working memory (the context of the model) along with the original image, and submitted with the question. You can see that in action when I gave o3 this Where's Wally, or as Americans say, Where's Waldo, image. The language model speculates that Waldo tends to show up in places like a top vantage point or a walkway. So, it decides to crop that area. Now, I will say, in keeping with the other benchmarks we saw, it wasn't actually able to find Waldo, and I was, although it took me about 3 minutes, I'll be honest. Okay, those are the state-of-the-art models in
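That crop-and-zoom loop can be sketched as a toy visual search. Everything here is illustrative: a nested list stands in for the image, and the `score` callback stands in for the multimodal LLM's guess at which region is relevant to the question (the real V* system uses a learned guidance model, not a hand-written scorer).

```python
def propose_crop(image, score, tile=2):
    """Split the image into tiles and return the bounds of the tile
    the (stand-in) relevance scorer rates highest."""
    h, w = len(image), len(image[0])
    th, tw = h // tile, w // tile
    best, best_bounds = float("-inf"), None
    for r in range(tile):
        for c in range(tile):
            bounds = (r * th, c * tw, (r + 1) * th, (c + 1) * tw)
            region = [row[bounds[1]:bounds[3]] for row in image[bounds[0]:bounds[2]]]
            s = score(region)
            if s > best:
                best, best_bounds = s, bounds
    return best_bounds

def visual_search(image, score, steps=2):
    """Repeatedly zoom into the highest-scoring tile, collecting crops
    into a 'visual working memory' the answering model would see
    alongside the original image."""
    memory = [image]
    current = image
    for _ in range(steps):
        r0, c0, r1, c1 = propose_crop(current, score)
        current = [row[c0:c1] for row in current[r0:r1]]
        memory.append(current)
    return memory

# Toy example: 'Waldo' is the cell with value 9 in a mostly-empty image.
image = [[0] * 4 for _ in range(4)]
image[3][3] = 9
memory = visual_search(image, score=lambda region: max(max(row) for row in region))
print(memory[-1])  # -> [[9]], the search has homed in on the target
```

The point of the pattern is that the answering model never has to ingest the full-resolution image at once; it reasons over the original plus a handful of targeted crops.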

Revenue and costs for you

AI. But where is this all heading? Well, to $174 billion of revenue for OpenAI in 2030, according to themselves. In a moment, I'll touch on what that means for you in terms of price, but actually, that prediction seems pretty reasonable to me. Even though in 2024 they made just $4 billion, I could see that growing extremely rapidly. I would note, though, that with even the biggest figures being far less than 1% of the value of white-collar labor globally, someone would have to be spectacularly wrong. Either, as I suspect, we won't get a country of geniuses in a data center in 2026 or 2027, or these figures are spectacular underestimates. Here then are some of my very summarized thoughts about why I think AI is becoming, or maybe has already become, pay-to-win; or, another way of putting that, why me or you might have to pay more and more to stay at the cutting edge of AI. We got news just the other day that Google is planning their own Premium Plus and Premium Pro tiers, probably on the order of $100 to $200 a month, just like OpenAI and, very recently, Anthropic as well. Now think about it. If AGI or superintelligence was, quote, one simple trick away, one algorithmic tweak or a quick little scale-up of RL, well then these companies' incentive would be to get that AGI out as soon as possible to everyone, safety permitting; capture market share, as they all tend to want to do; gain monopolies; and then, further down the road, charge for access to that AGI. If, on the other hand, performance can be bought through sheer scaling up of compute, then someone is going to have to pay for that compute, namely you. Yes, we've had some quick gains going from o1 to o3 and even o4-mini, but as the CEO of Anthropic said, post-training, or reasoning through reinforcement learning, is soon going to come at the cost of billions and billions of dollars. And nor is post-training magic. It can't actually create reasoning paths not found in the original base model.
That's according to a very new paper out of Tsinghua University. If you're interested in my deep dive on that paper and the previous one you just saw, I've just put up a 20-minute video on my Patreon. Thank you as ever to everyone who supports the channel via Patreon. Now,

Expensive RL and trade-offs

as the former chief research officer at OpenAI said, that doesn't mean that there isn't lots of low-hanging fruit in reasoning or post-training. But he nevertheless predicts that soon reasoning will, quote, catch up to pre-training in the sense of providing log-linear returns, as in, you have to put in 10 times the investment to get one increment more of progress. Also bear in mind that Sam Altman recently called OpenAI a product company as much as a model company. It's a little bit like they're taking their eye off the AGI ball and focusing more on dollars returned per compute spend. These companies only have so many GPUs and TPUs to go around. Every time researchers are tempted toward a bigger base model or more post-training, Sam Altman has to judge that against rate limits for new users, new feature launches, and
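That "log-linear returns" claim can be made concrete in a couple of lines. Assuming, as the quote does, that capability grows with the logarithm of compute, each equal step up in capability costs ten times the compute of the previous one:

```python
import math

def capability(compute):
    """Log-linear returns: capability scales with log10 of compute,
    so each equal increment of capability needs 10x more compute."""
    return math.log10(compute)

# Going from 10^3 to 10^4 units of compute buys the same one-increment
# gain as going from 10^5 to 10^6 -- a 100x bigger absolute spend.
for compute in (1e3, 1e4, 1e5, 1e6):
    print(f"compute 10^{int(math.log10(compute))}: capability {capability(compute):.1f}")
```

This is why "just scale up RL" and "keep prices flat" are in tension: flat capability growth on this curve implies exponentially growing compute bills.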

How to spend the OOMs

latency. I know this research from Epoch AI was mainly focused on scaling up training runs, or pre-training the base models, but very broadly speaking, it predicted by 2030 having, say, 100,000 times the effective compute as was used in 2022 for the training of GPT-4. But even if, hypothetically, by 2030 we had five orders of magnitude more compute than we have, say, today, think of all the competing demands on that compute OpenAI would have if they're to achieve $174 billion of revenue. Their models, by parameter count, might be a thousand times bigger on average by then as compared to now. Most free users until very recently were using around an 8-billion-parameter model, GPT-4o mini. But even if free users are now getting used to models the size of GPT-4, GPT-4.5 is around 20 trillion parameters. Some say 12 trillion, but either way, roughly two orders of magnitude more than GPT-4o. Of course, by then power users like me won't be using GPT-4.5, but probably GPT-5 or 6, 10 or 100 times bigger. Then there's the user base. And even though OpenAI are serving 600 million monthly active users, by 5 years from now there might be 6 billion smartphone users. Google with Gemini recently quadrupled its user base in just a few months, up to 350 million monthly active users. But that could easily 2x, 3x, 4x. That takes compute, and this is all before we get to models thinking for longer. Then there's latency. Deep Research is amazing, but it takes an average of, say, 5 to 10 minutes. You can imagine spending an order of magnitude more compute to bring that down to, say, 5 seconds. Also, don't forget there's usage per user. In this 2027 or 2030 scenario of AGI, everyone is of course going to be using these chatbots way more than they are now. That's another 10x, and that's all before we get to things like text-to-image, and text-to-video with Sora. All of which is a long way of saying that I could imagine 12 orders of magnitude of effective compute being utilized by companies like OpenAI.
That includes not just more chips, but more efficient chips and better algorithms. Five orders of magnitude by 2030 wouldn't be nearly enough. If you notice, none of that actually precludes there being a proto-AGI in the coming few years, albeit a very expensive one. Here's what a senior staff member at OpenAI said just a few days ago. OpenAI, he said, has defined AGI as a highly autonomous system that can outperform humans at most economically valuable work. We definitely aren't there yet. Far from it. You might have deduced the same from some of the benchmarks earlier in this video. But he goes on: "The AGI vibes are very real to me, especially the way that o3 dynamically uses tools as part of its chain of thought." Again, he says that does not mean we've achieved AGI now. In fact, it's a hill that he would die on that we have in fact not. He ends, though, and I agree with this, that things will go slow until they go fast. Really fast. Things feel fast today, but I think we're actually still accelerating, and we will start to go even faster. If you're willing to spend the money, François Chollet, a famous AI researcher, said, going from cents per query up to tens of thousands of dollars per query, you can go from zero fluid intelligence to near-human-level fluid intelligence. After all, we're getting things like Anthropic's Model Context Protocol, where models now have a shared language to call tools of all types. And we know that tool calling was part of the reinforcement learning training of o3. So, how long is it before o3, which arguably fails on anatomy questions like this, can call on open-source software like OpenSim, enter the relevant parameters, and run the code like they do with Code Interpreter, watching the resultant simulation? Soon, almost any software could be sucked into the orbit of these models' training regimes. Now, I will grant you that
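The 12-orders-of-magnitude figure comes from the way those competing demands compound multiplicatively. A rough back-of-the-envelope tally, where every multiplier is an illustrative ballpark inferred from the numbers above rather than a measurement:

```python
import math

# Illustrative multipliers over today's compute demand, one per
# competing demand named above. These are ballpark figures only.
demands = {
    "bigger models (params, ~1000x)": 1_000,   # ~3 OOMs
    "more users (600M -> 6B)":        10,      # ~1 OOM
    "more usage per user":            10,      # ~1 OOM
    "lower latency (minutes -> secs)": 100,    # ~2 OOMs
    "longer thinking / tool use":     100,     # ~2 OOMs
    "image and video generation":     1_000,   # ~3 OOMs
}

total = 1
for multiplier in demands.values():
    total *= multiplier  # demands multiply, they don't add

print(f"total demand: ~10^{round(math.log10(total))} x today")  # ~12 OOMs
```

Set against Epoch AI's projected ~5 orders of magnitude of extra effective compute by 2030, a demand stack anywhere near this size is the argument for why access gets rationed by price.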

Gray Swan Arena

presents all sorts of security problems that will have to be solved first. Which is why I'm going to introduce you to the sponsors of this video, Gray Swan. And you may be able to see, out of the corner of your eye, a $60,000 competition that's in progress, wherein you (and you don't even have to be a professional researcher) can try to use image inputs to jailbreak leading vision-enabled AI models. I think it's pretty insane that you can be paid to exploit these vulnerabilities and yet at the same time be boosting AI safety and security. These are incredibly legit competitions, with public leaderboards monitored by OpenAI, Anthropic, and Google DeepMind. So, wouldn't it be pretty epic if the winners of this competition turned out to have used my unique link, which you can find in the description? I will completely take full credit for your win and bask in the resultant glory. Of course, feel free to weigh in in the comments on what you think about the news story that's currently going viral online. No doubt it's crazy times we live in, but thank you guys so much for watching to the end. I will never not be grateful for your viewership. So have an absolutely wonderful
