This Test Was Built to Block AI — GPT-5 Finally Passed It
11:31

TheAIGRID · 01.01.2026 · 20,595 views · 597 likes

Video description
Check out my newsletter: https://aigrid.beehiiv.com/subscribe 🐤 Follow me on Twitter: https://twitter.com/TheAiGrid 🌐 Learn AI with me: https://www.skool.com/postagiprepardness/about Links From Todays Video: Welcome to my channel, where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos. Was there anything I missed? (For business enquiries) contact@theaigrid.com Music used: LEMMiNO - Cipher https://www.youtube.com/watch?v=b0q5PR1xpA0 CC BY-SA 4.0 LEMMiNO - Encounters https://www.youtube.com/watch?v=xdwWCl_5x2s #LLM #Largelanguagemodel #chatgpt #AI #ArtificialIntelligence #MachineLearning #DeepLearning #NeuralNetworks #Robotics #DataScience

Contents (3 segments)

Segment 1 (00:00 - 05:00)

So GPT-5 just reached human level on one of the most difficult benchmarks, but it's not how you think. We need to talk about it. What you're looking at is a chart for ARC-AGI-2. This is a benchmark designed to test abstract reasoning, which is why it's called the ARC-AGI benchmark. When you hear the name ARC-AGI, you think, okay, this is clearly some kind of AGI benchmark, and it is. It's quite different from the standard kind of test that LLMs are usually faced with. The tasks are more curated and more complex, and the benchmark explicitly tracks different metrics such as efficiency per task, meaning how much it actually cost per task to get that specific result. I'm not going to go into all the details, but this benchmark is positioned as a fluid intelligence test that targets generalization, pattern discovery, and compositional reasoning rather than memorized knowledge or dataset familiarity.

Now, what we can see is that the average human test taker scores about 60%. For the longest time, most people would have thought that surpassing this threshold on ARC-AGI-2 with current large language models would mean waiting for the next generation of models. Clearly this isn't the case, because there is now a version of GPT-5 that surpasses the average human on this benchmark. The reason I'm talking about this is how it was achieved. It was achieved in a way that most people simply haven't realized yet. I think this video is going to show you why AI acceleration is actually faster than we think, and why most people aren't going to realize just how quickly the gains in some areas of AI development are going to come. What this video is actually about is what I would call unhobbling.
Now we can see that Poetiq, the company behind this version of GPT-5, managed to achieve, I think, around 75-76%, compared to the average human test taker. Most people may not remember this, but in mid-2024 an ex-OpenAI researcher, Leopold Aschenbrenner, published a paper called "Situational Awareness: The Decade Ahead." Essentially, what Aschenbrenner wanted to do with this paper was give everyone a clear-eyed awareness of the strategic situation and of how fast AI progress is actually going to be. In this paper he talks about something called unhobbling. On this chart we can see all of the different things that are going to increase an AI model's intelligence, and highlighted in red we can see algorithmic progress and unhobbling.

Now, let's take a look at what this unhobbling actually is, so you can understand it. He says, "Firstly, the hardest to quantify but no less important category of improvements is what I'll call unhobbling. Imagine if, when asked to solve a hard math problem, you had to instantly answer with the very first thing that came to mind. It seems obvious that you would have a hard time, except for the simplest problems. But until recently, that's how we had LLMs solve math problems. Instead, most of us work through the problem step by step on a scratch pad, and are able to solve much more difficult problems that way. Chain-of-thought prompting unlocked that for LLMs. Despite excellent raw capabilities, they were much worse than they could be because they were hobbled in an obvious way, and it took that small algorithmic tweak to unlock greater capabilities." He's essentially saying that we went from chatbots to chatbots that can think, and that unlocked a huge range of capabilities.
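To make the "scratch pad" idea in the quote concrete, here is a minimal Python sketch of the difference between asking for an instant answer and asking for step-by-step reasoning. The function names, the prompt wording, and the `Answer:` convention are all assumptions made for illustration; no model is called, and this is not taken from the paper or from any vendor's API.

```python
from typing import Optional


def direct_prompt(question: str) -> str:
    """Ask for an immediate answer, leaving no room for intermediate reasoning."""
    return f"Question: {question}\nAnswer with only the final result."


def chain_of_thought_prompt(question: str) -> str:
    """Ask the model to reason on a 'scratch pad' before committing to an answer."""
    return (
        f"Question: {question}\n"
        "Work through the problem step by step, then give the result "
        "on a final line starting with 'Answer:'."
    )


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last line starting with 'Answer:', or None."""
    for line in reversed(response.splitlines()):
        stripped = line.strip()
        if stripped.startswith("Answer:"):
            return stripped[len("Answer:"):].strip()
    return None


# A hand-written response standing in for model output (hypothetical).
SAMPLE_RESPONSE = "17 * 3 = 51\n51 + 4 = 55\nAnswer: 55"
```

The only change between the two prompts is a few words of instruction, which is the sense in which the quote calls chain-of-thought a small tweak: the underlying model is the same, but it is given room to produce intermediate steps before the final line is parsed.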
And when we decompose the drivers of progress, we can see that unhobbling is going to be a large percentage of what actually drives AI progress. So now, coming back to Poetiq's new ARC-AGI result, where the average human test taker, and the standard model, sit at around 60%, and Poetiq pushed that all the way up to 75%: you must understand that this is incredible, because it is a clear demonstration of what we would call unhobbling. The reason I'm making a video on this is that this is just one method of unhobbling. I will talk in a moment about Poetiq's meta-system and what they actually did, but I think it's super interesting, because you can see here that he says unhobbling is what enabled these models to become useful: "And I'd argue much of what is holding back many commercial applications today is the need for further unhobbling of this sort. Indeed, models today are still incredibly hobbled. For example, they don't have long-term memory. They can't use a computer. They mostly still don't think before they speak, and they mostly only engage in short back-and-forth dialogues." And he's saying the possibilities here are enormous, and we're picking at low-hanging fruit. It's completely wrong to just imagine GPT-6; with continued unhobbling progress, the

Segment 2 (05:00 - 10:00)

improvements will be step changes compared to GPT-6 plus RLHF. By 2027, rather than just a chatbot, you're going to have something that looks more like an agent and a coworker. And take a look at what they were able to do with Grok 4 Fast: they took its reasoning score from around 56-57% all the way up to around 72% by using their meta-system. I think this is crazy. So unhobbling is simply a way of scaffolding an LLM, whether with tools, frameworks, agentic systems, organization, or prompts, so that you can get more out of the raw base model. And this is once again what they were able to do with Gemini 3. You can see right here that Gemini 3 Pro scored, I believe, just under 30%, but then they had Gemini 3A. They did some tweaks and managed to get 38%. Then they did some more tweaks and managed to get around 44%, and then even up to the level of a human test taker. And once again, with even more unhobbling, they managed to get above that level.

So what exactly are they doing? This essentially shows why Poetiq beats other AIs without using bigger, more expensive models. Remember, all they're doing is building a system around the LLM that gets to the better answer. On the left side we have how a normal AI would answer the question: you ask one big model, you make one big guess, and you pay the full cost even if it's wrong. That does work sometimes, but it's expensive, unreliable, and wasteful for hard reasoning tasks. It's essentially one shot, one answer, with no safety net. The middle is where you have the manager AI. This is Poetiq, and this is the key idea: Poetiq adds a manager layer on top of the models. This manager decides which model to use, decides how to break the problem into steps, and decides when to write code. Then it checks its own progress, and it stops early when the solution is good enough.
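The manager idea described above can be sketched as a small control loop. This is a toy illustration under stated assumptions, not Poetiq's system, whose internals are not described in the video: the `Solver` and `Checker` callables, the `manage` function, and the cheapest-first escalation policy are all hypothetical stand-ins for real model calls and real verification.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: a solver takes (task, feedback) and returns a candidate;
# a checker takes (task, candidate) and returns (score in [0, 1], feedback text).
Solver = Callable[[str, Optional[str]], str]
Checker = Callable[[str, str], Tuple[float, str]]


@dataclass
class Result:
    answer: Optional[str]
    score: float
    calls_used: int


def manage(task: str, solvers: List[Solver], check: Checker,
           good_enough: float = 1.0, tries_per_solver: int = 2) -> Result:
    """Try solvers in order (assumed cheapest first), feeding back the checker's
    comments, and stop as soon as a candidate scores at least `good_enough`."""
    best_answer, best_score = None, float("-inf")
    feedback: Optional[str] = None
    calls = 0
    for solver in solvers:
        for _ in range(tries_per_solver):
            candidate = solver(task, feedback)
            calls += 1
            score, feedback = check(task, candidate)
            if score > best_score:
                best_answer, best_score = candidate, score
            if best_score >= good_enough:
                # Early stop: no further model calls are paid for.
                return Result(best_answer, best_score, calls)
    return Result(best_answer, best_score, calls)


# Toy stand-ins for a cheap model, a stronger model, and a verifier.
def cheap_model(task: str, feedback: Optional[str]) -> str:
    return "50"


def strong_model(task: str, feedback: Optional[str]) -> str:
    return "55"


def verify_sum(task: str, candidate: str) -> Tuple[float, str]:
    ok = candidate == str(sum(range(1, 11)))
    return (1.0, "") if ok else (0.0, "the sum does not check out")
```

With these toy functions, `manage("sum 1..10", [cheap_model, strong_model], verify_sum)` spends two calls on the cheap solver, escalates, and stops after the third call. The point of the sketch is only that the control logic lives outside the model: which solver to call, how to use feedback, and when to stop are decisions made by the surrounding system.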
And the intelligence isn't just in the model. It's in how the entire system thinks. When you have a system like Poetiq's that's able to self-check and stop early, you're able to avoid wasting computation, and you're able to turn messy reasoning into a controlled process. I think most people are going to miss this, because most people just see the benchmarks and take them at face value. But I think it's interesting that other companies are going to be able to take those base models and get even more out of them for whatever purpose they may have.

You can see right here that someone asked how long this took on average per task, given that on the human test the average was about 5 minutes per person. They said that they don't explicitly gather those stats right now, but that they recall seeing the easiest problems finish, start to finish, after perhaps 8 or 10 minutes, while the hardest problems have to be terminated before 12 hours to stay under the limit. So there's still room for improvement on certain tasks, because these AIs don't reason as efficiently. But it's still very interesting that ARC-AGI-2 has essentially been beaten to the point where it's at human level, because I remember that when this thing first came out, people were saying that AI systems were going to struggle with this benchmark for quite some time.

Now, this is François Chollet. This is essentially the guy who created the benchmark with his team, and he says that if you're wondering whether saturating ARC-AGI-1 or 2 means that we have AGI now, "I'm going to refer you to what I said when we launched ARC-AGI-2 last year, which is also the same thing I said when we announced ARC-AGI-2 was coming."
ARC-AGI-1 was a minimal test of fluid intelligence, and to pass that test you had to show non-zero fluid intelligence. This required AI to move past the classic deep learning/LLM paradigm of pre-training, scaling, and static inference, towards test-time adaptation. Essentially, what he means by non-zero fluid intelligence is that instead of just pattern matching from training data, you actually need to prove that you can reason through something you've literally never seen before. The old paradigm failed because when you pre-train on massive datasets, scale the model bigger, and deploy it as a static system, it often just retrieves the patterns that it memorized, and that actually got you zero on ARC-AGI-1. So that's why ARC-AGI-2 existed. He says that ARC-AGI-2 is basically the same, but there are also tasks that probe deeper levels of reasoning complexity, particularly with regard to concept composition. Still, these are tasks that are solvable in minutes by regular people with no external tools. Apparently, they just hire test takers off the street. So it does not represent the upper bound of what human fluid intelligence can do, for example, solving a Millennium Prize Problem. The Millennium Prize Problems are mathematical challenges so difficult that only a handful of humans in history could potentially crack them, after years of deep reasoning. Essentially, he's saying here that even if models are achieving high scores on these, it's still not representing the

Segment 3 (10:00 - 11:00)

upper bound of what human intelligence could really do. Now, of course, he does state that they're going to be working on ARC-AGI-3, which is launching in March 2026 and is going to probe interactive reasoning. They're going to be evaluating how systems explore unknown environments, model them, set their own goals, and plan and execute towards these autonomously, without instructions. So I'm guessing that this is going to be some game-like environments where they must explore unknown spaces to gather information, build mental models, set their own goals, and plan multi-step strategies toward those goals. Of course, current AI systems, even the ones that beat ARC-AGI-2, rely on being given the problem structure, and if you remove that scaffolding, unfortunately, they collapse. The static problems are: here are these three examples, now solve this fourth one. But the new tests will be: you're in a new world, figure out the rules, decide what to do, and then go ahead and do that. That is going to be a completely new test of intelligence: how efficiently can you learn in a completely novel environment, compared to a human playing the same game for the first time? That's true agency, the missing piece that separates pattern matching from true intelligence. And I think it's going to be super interesting to see what ARC-AGI-4 and ARC-AGI-5 are going to be like, because he's also working on those. So whilst, yes, we have reached human level, I guess you could say it's surprising that we managed to get here this early. It's also very interesting that the paper said we would have unhobbling, and we are getting it. Those unhobbling gains are really true.
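For readers who have not seen the static format described above ("here are these three examples, now solve this fourth one"), here is a deliberately tiny Python sketch of it. It is an illustration only: real ARC-AGI tasks are far richer, and the hand-written candidate list, the function names, and the example grids are all assumptions, not part of the benchmark or of any system mentioned in the video.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]


def identity(g: Grid) -> Grid:
    return [row[:] for row in g]


def flip_horizontal(g: Grid) -> Grid:
    """Mirror each row left to right."""
    return [row[::-1] for row in g]


def flip_vertical(g: Grid) -> Grid:
    """Reverse the order of the rows."""
    return [row[:] for row in g[::-1]]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


# Hypothetical, hand-picked rule space; a real solver cannot enumerate rules like this.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}


def infer_rule(examples: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Return the first candidate rule consistent with every example pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(inp) == out for inp, out in examples):
            return name
    return None


def solve(examples: List[Tuple[Grid, Grid]], test_input: Grid) -> Optional[Grid]:
    """Infer a rule from the examples at test time and apply it to the new input."""
    name = infer_rule(examples)
    return None if name is None else CANDIDATES[name](test_input)


# Three made-up demonstration pairs that all follow the same hidden rule.
EXAMPLES = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[1, 2, 3]], [[3, 2, 1]]),
    ([[0, 5], [5, 0]], [[5, 0], [0, 5]]),
]
```

Here `solve(EXAMPLES, [[7, 8, 9], [0, 0, 1]])` infers the mirror rule from the three pairs and applies it to the fourth grid. Notice that the task hands the solver its structure (paired inputs and outputs, one rule to find); the interactive tests described above would remove exactly that scaffolding.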
