GPT-5.2 Just Hit 75% on ARC-AGI! How Is This Possible?

Universe of AI, 24.12.2025. 2,820 views, 59 likes. Updated 18.02.2026.
Video description
GPT-5.2 just reached 75% accuracy on ARC-AGI, one of the hardest reasoning benchmarks in AI. In this video, I break down:

• What ARC-AGI actually measures
• How this result was achieved without retraining
• Why the system matters more than the model
• What this means for AI reasoning going forward

This isn't hype; it's a structural shift worth understanding.

Sources:
https://x.com/poetiq_ai/status/2003546910427361402/photo/1
https://poetiq.ai/posts/arcagi_announcement/

For hands-on demos, tools, workflows, and dev-focused content, check out World of AI, our channel dedicated to building with these models: @intheworldofai

🔗 My Links:
📩 Sponsor a Video or Feature Your Product: intheuniverseofaiz@gmail.com
🔥 Become a Patron (Private Discord): /worldofai
🧠 Follow me on Twitter: /intheworldofai
🌐 Website: https://www.worldzofai.com
🚨 Subscribe To The FREE AI Newsletter For Regular AI Updates: https://intheworldofai.com/

GPT-5.2, ARC-AGI, ARC-AGI-2, AI reasoning, artificial general intelligence, AGI benchmarks, OpenAI GPT-5.2, AI systems, AI agents, AI research, machine reasoning, recursive self improvement, AI benchmarks explained, Poetiq AI, AI system architecture, AI progress, reasoning models, AI evaluation

#GPT52 #ARCAGI #AIReasoning #ArtificialIntelligence

Table of contents (2 segments)

  1. 0:00 Segment 1 (00:00 - 05:00) 764 words
  2. 5:00 Segment 2 (05:00 - 08:00) 491 words

Segment 1 (00:00 - 05:00)

Poetiq recently posted this update. They ran a system using GPT-5.2 Extreme High on ARC-AGI-2, using the same harness as before, with no training or model-specific optimization. The result: 75% accuracy on the public ARC-AGI-2 evaluation at under $8 per problem. They also note that this beats the previous state of the art by roughly 15 percentage points. That is a very strong claim. So in this video I want to do something simple: explain exactly what this means, how we got here, and why this result matters. So let's get into it.

This result comes from Poetiq, a small team of six researchers and engineers, most with backgrounds at Google DeepMind. They're not building a new foundation model. They're not competing on parameter count. Instead, they're building what they describe as a meta-system: an intelligence layer that sits on top of existing models and determines how reasoning is carried out. Their system decides how to structure a solution, which model to use, when to generate code, how to revise an approach, and when to stop. That last point, deciding when to stop, turns out to be central to everything you're about to see.

To understand why this result is interesting, we need to understand what ARC-AGI actually measures. ARC-AGI is designed to test general reasoning rather than memory or training-data recall. Its problems require abstraction, hypothesis testing, and the ability to realize that your first idea is wrong and adapt. ARC-AGI-2 increases the difficulty further by adding more complex transformations and fewer shortcuts. For reference, the average human test taker scores around 60%. Most large models struggle to reach that level reliably, especially when cost is constrained.

Let's start with ARC-AGI-1. On the horizontal axis, you can see cost per task, and on the vertical axis, accuracy. Each dot represents a model configuration across GPT, Gemini, Claude, and Grok, often with different reasoning budgets.
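The meta-system idea described above can be sketched as a small dispatcher that picks a strategy per task rather than answering the task itself. Everything here is hypothetical: the model names, the difficulty signal, and the `Plan` fields are illustrative stand-ins, not Poetiq's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    model: str         # which underlying model to call (hypothetical names)
    use_code: bool     # generate executable code vs. natural language
    max_revisions: int # how many refine steps before giving up

def plan_for(task_difficulty: float) -> Plan:
    """Cheap model and few revisions for easy tasks; escalate as needed."""
    if task_difficulty < 0.3:
        return Plan(model="small-model", use_code=False, max_revisions=1)
    if task_difficulty < 0.7:
        return Plan(model="mid-model", use_code=True, max_revisions=3)
    return Plan(model="frontier-model", use_code=True, max_revisions=8)
```

The point of the sketch is the separation of concerns: the dispatcher encodes the "how to reason" decisions, so swapping in a stronger underlying model changes nothing about the control logic.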
Normally, you expect a trade-off where higher accuracy requires higher cost. What Poetiq shows is different. Their system redraws the Pareto frontier, meaning that at every cost level, their accuracy is higher than anything else reported.

ARC-AGI-1 shows us the pattern, but ARC-AGI-2 is where that pattern really gets stress-tested. ARC-AGI-2 is designed to be harder, more compositional, and much less forgiving of shortcuts. If a system is relying on surface-level tricks, this is where it usually breaks. The axes here are the same as before, with cost per task on the horizontal axis and accuracy on the vertical axis. The dashed horizontal line represents average human performance at around 60%. Many strong models approach this line but struggle to cross it consistently without very high cost.

What Poetiq shows us is different. Their systems cross the human baseline and continue improving as computation increases. What's especially important is that the gap widens rather than shrinks as tasks get harder. If these gains were coming from benchmark-specific optimization, we would expect performance to collapse here. Instead, the system appears to handle increased complexity more effectively. This is the point where it becomes clear that something structural is happening.

ARC-AGI-2 removes many of the shortcuts that models rely on, which is why most systems see diminishing returns. Poetiq's system does not flatten out in that same way. Instead, it continues to trade additional computation for more meaningful reasoning gains. That behavior is much closer to how human problem solving scales than how typical model inference scales.

These per-model comparisons help isolate what's actually changing. The underlying models are the same. There's no retraining, no fine-tuning, and no model-specific optimization involved. The only difference is that Poetiq's reasoning system is applied on top.
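The Pareto frontier these charts refer to can be computed mechanically: keep only the configurations that no other configuration beats on both cost and accuracy at once. The numbers below are made-up (cost-per-task, accuracy) points for illustration, not data read off the charts.

```python
def pareto_frontier(points):
    """Return the (cost, accuracy) points not dominated by any other:
    no other point is both cheaper and at least as accurate."""
    frontier = []
    for cost, acc in sorted(points):  # ascending cost
        # A point survives only if it improves on the best accuracy seen
        # among all cheaper configurations.
        if not frontier or acc > frontier[-1][1]:
            frontier.append((cost, acc))
    return frontier

# Hypothetical model configurations: (dollars per task, accuracy)
points = [(2.0, 0.30), (4.0, 0.25), (5.0, 0.45), (8.0, 0.75), (10.0, 0.60)]
print(pareto_frontier(points))  # [(2.0, 0.3), (5.0, 0.45), (8.0, 0.75)]
```

"Redrawing the frontier" in the video's sense means adding points that dominate every existing one: at each cost, strictly higher accuracy, so the old frontier points all get filtered out.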
Accuracy increases while cost decreases, which runs counter to what we normally expect. This pattern holds across different model families and reasoning budgets. That consistency strongly suggests that the system is discovering general reasoning strategies rather than exploiting quirks of individual models.

At this point, it's worth slowing down and explaining what Poetiq is actually doing under the hood, because this is where most of the misunderstanding happens. Poetiq describes their approach very simply: it's LLMs all the way down. They use large language models to build, improve, and power the reasoning system itself. They're not embedding intelligence into a single prompt, and they're not training a new reasoning model. Instead, they treat the model as a component inside a larger reasoning process. One of their key ideas is that

Segment 2 (05:00 - 08:00)

the prompt is not the intelligence. The prompt is just an interface. In a typical setup, we ask a model a single question and hope the answer is correct. Poetiq's system does something very different. It engages in an iterative problem-solving loop. The system asks the model to generate a candidate solution, sometimes in natural language and sometimes as code. It then evaluates the output, analyzes what worked and what didn't, and uses the model again to refine the approach. This loop repeats across multiple steps, allowing the system to incrementally build toward a correct solution rather than betting everything on a single response.

The second key idea is self-auditing. Poetiq's system does not blindly keep generating tokens. It actively monitors its own progress and decides when it has enough information to stop. That self-monitoring step is critical because it avoids wasted computation while still improving accuracy. And because the intelligence lives in the system rather than the model, the same framework can be used across different models. This is why Poetiq can plug in new models quickly and see immediate gains.

With that in mind, we can now return to the update that triggered this discussion. Poetiq reported that they ran GPT-5.2 Extreme High on ARC-AGI-2 using the exact same Poetiq harness as before. They explicitly state that there was no training and no model-specific optimization performed for GPT-5.2. With that unchanged system, they observed results as high as 75% accuracy on the public ARC-AGI-2 evaluation at under $8 per problem. That represents roughly a 15-percentage-point improvement over the previous state of the art. The key detail here is not the number itself. It's the fact that the system did not change. This is the most important insight in the entire result. The intelligence gain did not come from retraining, reinforcement learning, or benchmark-specific tuning.
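The generate-evaluate-refine loop with a self-auditing stop rule can be sketched roughly like this. `ask_model` and `evaluate` are placeholders for an LLM call and a task-specific checker; the names, prompts, and threshold are assumptions for illustration, not Poetiq's code.

```python
def solve(task, ask_model, evaluate, max_steps=8, good_enough=1.0):
    """Iteratively propose, score, critique, and revise a solution,
    stopping early once the score clears the confidence threshold."""
    candidate = ask_model(f"Propose a solution to: {task}")
    best, best_score = candidate, evaluate(task, candidate)
    for _ in range(max_steps):
        if best_score >= good_enough:  # self-audit: stop, don't burn tokens
            break
        critique = ask_model(
            f"Solution {best!r} scored {best_score:.2f} on {task}. "
            "What is wrong, and how should it change?")
        candidate = ask_model(f"Revise using this critique: {critique}")
        score = evaluate(task, candidate)
        if score > best_score:         # keep only strict improvements
            best, best_score = candidate, score
    return best, best_score
```

Note that the model appears only as a parameter: swapping GPT-5.2 in for an older model changes `ask_model` and nothing else, which is the "plug in a stronger model, keep the harness" property the video describes.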
It came from plugging a stronger model into an existing reasoning framework and watching the system scale immediately. That suggests reasoning improvements can compound as models improve, without rebuilding intelligence from scratch every single time. This does not mean AGI is here, but it does demonstrate a credible path where intelligence improves systematically rather than statistically. If this pattern holds, future breakthroughs may look less like singular model releases and more like quiet shifts in how reasoning is organized.

If you enjoyed this video, this is what we do here: fast, clear updates on the biggest moves in AI. If you want to stay ahead of everything happening in this space, make sure you're subscribed. And if you want the hands-on side, with demos, tools, workflows, and everything developers can actually build, check out World of AI. We also run a simple, no-noise newsletter that gives you the most important AI tools and updates in just a couple of minutes. Subscribe here. Follow World of AI. Join the newsletter.
