Measuring Agents With Interactive Evaluations

OpenAI · 08.10.2025


Video description
Agents explore, plan, and reliably execute across diverse, long-horizon tasks—challenges that static benchmarks can't measure. Hear from Greg Kamradt, President of the ARC Prize Foundation, on how evaluating agentic performance requires interactive evaluations.

Table of contents (5 segments)

Segment 1 (00:00 - 05:00)

Hi, my name is Greg Kamradt, president of the ARC Prize Foundation, and today we are going to learn how we measure frontier AI. In the next 20 minutes, I'm going to step you through why interactive benchmarks are the key to doing this. We're going to take a look at a new frontier AI benchmark. And then finally, we're going to wrap up with understanding why interactive benchmarks don't just measure intelligence for us, but also tell us how efficient that intelligence is. Now, there's no doubt that AI has made incredible progress recently. But the question I'm asking myself is not whether AI is making progress; it's what AI is making progress towards. Because if you measure your AI on a narrow domain, maybe a verticalized benchmark, then you're going to make progress in that vertical domain. However, if your intention is to measure models that generalize, then you need your benchmark to measure and target generalization. To do this, the very first starting point is to define what intelligence is. In 2019, François Chollet came out with a paper, "On the Measure of Intelligence," and he did just that. He defined intelligence as skill-acquisition efficiency. That's a verbose way to look at it, but to put it another way: what is your ability to learn new things? We already know that AI can learn any one new thing. It can learn to play chess, to self-drive a car, to play Go. But getting those same systems to learn something else still remains out of reach. Using this opinionated definition of intelligence, François started the ARC Prize Foundation with Mike Knoop in 2024. We are a nonprofit whose mission is to act as a North Star towards open progress of AGI. We define AGI as a machine's ability to learn as efficiently as humans.
Now, as an organization, we build benchmarks that test the intelligence of machines by specifically measuring their ability to generalize. Last year, we were invited by OpenAI to join them on their livestream to co-announce the results of their o3-preview model on our first benchmark, ARC-AGI-1. In my view, as human-like intelligence starts to arrive in machines, it's going to show up as interactive agents that learn and adapt on the fly. The reason is that intelligence is inherently interactive. The world isn't just giving you one-shot problems; intelligence unfolds step by step through perception, feedback, and ultimately action. So, if intelligence is interactive, then we need a new way to evaluate this behavior. And we're already starting to see early signs of it. Here's a view of GPT-5 playing Pokémon on a Twitch stream. It may look like a toy example, but there's actually quite a lot going on under the hood, because Pokémon requires that you make long-term plans, that you explore your environment, and that you tackle short-term meta-goals on your way to solving your long-horizon goals. What this shows is that in order to measure interactive intelligence, you need interactive benchmarks. Static benchmarks, where you just ask a question and get an answer back, aren't going to cut it, because interactive benchmarks give you a whole lot of new capability. You can test an agent's ability to explore a new environment and to run the perceive, plan, and act loop. What's really interesting is that you can also test an agent's ability to memorize, because there's going to be way more information in the environment than it can hold on to, so it also needs to pick what to memorize. You can look at goal acquisition and meta-goal acquisition.
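The perceive, plan, and act loop described above can be sketched as a minimal agent harness. The `Environment` class here is a hypothetical stand-in for an interactive game (not the real ARC-AGI-3 API), and the action counter foreshadows the efficiency metric discussed later in the talk:

```python
# Minimal perceive-plan-act loop for an interactive benchmark.
# Environment is a toy stand-in: pressing action 1 raises a bar to a goal.

class Environment:
    def __init__(self, goal=3):
        self.bar, self.goal = 0, goal

    def observe(self):
        """Perceive: return the current state of the world."""
        return {"bar": self.bar, "goal": self.goal}

    def step(self, action):
        """Act: apply an action, return True once the goal is reached."""
        if action == 1:
            self.bar += 1
        return self.bar >= self.goal

def run_agent(env, policy, max_actions=50):
    """Drive the loop, counting actions spent (the action-efficiency signal)."""
    actions, done = 0, False
    while not done and actions < max_actions:
        obs = env.observe()      # perceive
        action = policy(obs)     # plan
        done = env.step(action)  # act
        actions += 1
    return actions, done

actions, solved = run_agent(Environment(), policy=lambda obs: 1)
```

A smarter policy would use the observation to decide between actions; counting actions per solved level is what makes the loop a measurement rather than just a game driver.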
And then what's very interesting is you can also test an agent's alignment and cooperation capabilities as well. These are things that you're not going to be able to get from a static benchmark. Combining all these different ideas, the Arc Prize Foundation, we're coming out with ARC AGI 3, and this is going to be a series of 150 open-sourced video game environments. Each one of these are novel, and we're creating these ourselves. We actually have built a mini game studio just to make this happen. And if you would have asked me early in my career if I ever thought I'd be running a game studio, the answer is no. Um, the whole goal with ARKGI 3 is we

Segment 2 (05:00 - 10:00)

want to test the test taker's ability, whether human or AI, to adapt to novel situations. We want to see: can they figure out what the goal of their environment is, can they figure out how to get to that goal, and really, what are the goals in the first place? Each game in ARC-AGI-3 is built on entirely new game mechanics. These are not games you've seen out in public, and each one will be very different from the others. For illustrative purposes, each game will be as different from the others as Connect Four is from solitaire is from Pac-Man. We won't literally make those games, but that's how different they will be from each other. Each game is also designed extremely intentionally. Back to François's definition of intelligence: we want to test whether the human or the agent can adapt to a new, novel situation. You should not be able to overfit to any one game type. It would be pretty lame if we made just one game type and then procedurally generated a bunch of levels; you would just be repeating a skill you've already learned once. The games will be split across a public test set and a private evaluation set. On the public side, AI and researchers will get used to the game format and the interface of the games. But all performance metrics, when we actually evaluate how new models are doing, will be based on the private evaluation set. These are games that neither the developer nor the AI has ever seen beforehand. What we can assert, if there's success on the private test set, is that they've actually generalized to unseen examples rather than just repeating what they've already seen in the public data. However, these GIFs don't go as far as I would like, so let's look at a live demo of what ARC-AGI-3 looks like. Here we have our first game. We call this VC33.
You'll notice that while it's just static like this, it doesn't make a ton of sense. That is actually on purpose, because what we want the user to do, as they start to click around, is to see how their actions impact the environment. I should mention here that these games don't have instructions either; there are no natural-language instructions that we give. The whole point is that you figure out how to complete the game. So, as I click around here, nothing's happening. Now, if I click this blue one right here: oh, interesting, the left side goes up. All right, let me click the red. The red side goes up. Let me click the red once more. And I want you to pay attention to where this yellow bar is in relation to the smaller one here. Oh, interesting. Okay, so I just did one level of an ARC-AGI-3 game. Let me try this again. Now I have the hypothesis that if I click the red button, the right side goes up, and if I click the blue button, the left side goes up. Yep, that hypothesis looks confirmed. Now, once I get it to the same level as before, I beat the level. Yep, that also looks good. So, as you can see here, even without instructions (and yes, I admit I played this beforehand), you can start to click around and see what to do. Let me get the green bar all the way up to the top here. Oh, I ran out of energy, or material, over there; I need to get some more. You see how we just threw a puzzle in the way of completing level three, and you needed to take an alternative route and move some material over. That's a new mechanic that we introduced here. And it turns out that humans are very good at looking at this type of environment and understanding that that's what you need to do. Let me close out, just for the satisfaction, and get to that one. We have another level here. Now, let's take a look at a second game.
Beautiful. Now, the second game is actually completely different. We have more blocks over here. Let's go through this. This game, by the way, is called LP85. Let me click the green over here, and you can start to see that it rotates around. Let me keep on clicking: green, green. Oh, dang, game over. I must not have done something right there. As I replay this game, I'm going to have to try something new. Now, the term that I love for what I observe when humans play these games is curiosity-driven exploration. As humans, we're curious about something, so we explore and then see how it affects us. Small spoiler alert: I clicked the red and went to the other side. Notice how this yellow box is about to get within the yellow barrier. Click. Yep. Okay. So now the hypothesis is that you must get the yellow boxes within the yellow barriers. Let me get this top one. Oh no, there are two of them. I wonder how we're going to solve this one. I won't actually go through it, but as I keep continuing here, you can take a guess at how you'd do it. I want you to notice how this is a completely different game mechanic from the first one. We also have other agentic games, where you're controlling a small little character,

Segment 3 (10:00 - 15:00)

but these are all going to be very different from each other. The last thing I want to call out while the demo is here: notice how every time I clicked, on the back end we were actually recording the number of actions it took to solve that particular game. This is very important, and we're going to come back to it in just a second. One of our main design philosophies as we design these games is that we want to target problems; in fact, let me use stronger language: we love problems that are easy for humans but hard for AI. The reason we love this is that humans are our only proof point of general intelligence. As we're out here trying to make models that are more generally intelligent, it makes sense to us that this would be an anchor point, and this is where we'd start. So if we can find problems that humans can do, which again are our only proof point of general intelligence, but current AI cannot do, then that tells us there's a gap and something missing. We will only include a game in ARC-AGI-3 if we find that a panel of humans can solve it. If they cannot solve it, it will not be included. So every game is very doable by humans. And these aren't PhD experts; these are members of the general public that we recruit. However, we can't just say the games are easy for humans without first-party data to back that up. So what we do is actually test members of the general public. We pull them into a conference room, we rent computers, and we put the puzzles in front of them. We have a long candidate list of puzzles, and if one ends up being too hard, we throw it out. The important part here is that the humans have never seen these games beforehand. We call this a first run. So, how well does a human do on the first run of playing one of these games?
Because we're eventually going to hold AI to the same criteria on a first run, we want to see how it does. As we test these humans, it would be easy just to count how many levels and how many games they complete. But we go one step further and actually count how many actions it takes each human to complete each game. If we can do that, we now have a benchmark of our only proof point of general intelligence. We can see how quickly that proof point can get through each ARC-AGI-3 game, and that is a new standard we can hold our AI to. With this new metric, what's really awesome is that we get a new way to evaluate performance. This isn't just accuracy in the traditional sense of how many you got right. We call it action efficiency. Yes, you completed a level, but the question isn't just whether you completed the goal; it's how directly you completed that goal or that game. Very simply put: how many turns did it take you to complete the game? The quicker you can do that, the better we can assert you are able to learn from your environment. Going back to the Pokémon example briefly, they've already adopted this as a pseudo-scoring technique. This view shows the same thing: GPT-5 playing Pokémon, where the x-axis is the number of actions GPT-5 needed and the y-axis is the milestones within the game that GPT-5 was able to hit. For this one specifically, I think it says 5,000 actions to get to Liberty Road. Now, there are two lines here. The first line, in green and gray, you'll notice has a lower slope; it's a bit more horizontal. What this tells us is that those models (I'm not sure which one it was, but it was older) needed more actions and more turns to complete that environment.
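That actions-vs-milestones chart can be read as a slope: milestones gained per action spent. A minimal sketch of the idea, with invented numbers (these are not the real GPT-5 Pokémon data):

```python
# Action efficiency as a slope: milestones reached per action spent.
# All run data below is invented for illustration.

def slope(actions, milestones):
    """Milestones per action; a steeper (larger) slope means fewer wasted moves."""
    return milestones / actions

# Two hypothetical runs that reach the same 6 milestones.
older_run = slope(actions=9000, milestones=6)  # flatter line: more actions needed
newer_run = slope(actions=5000, milestones=6)  # steeper line: more efficient
```

Under this reading, two models can hit the same milestones (equal "accuracy") while one has a much steeper slope, which is exactly the signal a static benchmark would miss.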
In this case, Pokémon. GPT-5, however, was much more efficient, because its slope is higher; it was able to do it quicker. What's fascinating about this view is that it isn't just a fun way to measure progress. It's actually a view into how efficient the model is at converting information from the environment into the value it's looking for. Put another way, it's a proxy for how efficient your intelligence is. We can do the same thing for the human data we collect during testing, all the actions I was talking about. I want to look at one human example here. At the very beginning, at the bottom left, they start at level one and spend 10 actions to get to level two. I say spend 10 actions because some of those will be used for exploration, figuring out what to do, and some will be used for execution, actually carrying out the strategy you've thought up. Then they spend another five to get to level three. And then we can fill in the rest of the view with how they did on this particular game. What's great is we're going to do the work to test a whole

Segment 4 (15:00 - 20:00)

lot of people. It won't just be one person for one game; we'll get a lot of different data points. The last view that we have, with all this data filled out, is a quantitative view of how our only proof point of general intelligence is able to complete a game it has never seen beforehand. And that is quite an interesting data point. We are going to return to this. So, design philosophy: easy for humans, hard for AI. That was the easy-for-humans part; let's also look at the hard-for-AI side. Here we have a view of how GPT-5 plays one of our other games. We call this LS20. This is actually an agent-based game, and, spoiler alert, the goal is to get the blue block at the bottom up to the black part at the top. What we see GPT-5 doing is just going up and down; it's not making progress towards the goal, and it's not really exploring much towards that progress either. What we can see is that it's spending a lot of actions but not making much progress. We cut the animation off at 50 frames because, well, it's expensive, it takes a long time, and we didn't need that long of a GIF. But when we look at how humans do on this game specifically, it only takes them 20 actions to complete this level. So, with GPT-5 at 50 actions and not making any progress, and humans only needing 20, that tells us there is a clear gap. Now, I know what you may be saying: does the concept of action efficiency only apply to games? No, not really. I saw a relevant tweet from Will Brown the other day, where he showed that the model thought for 43 seconds and made no changes, then thought for 30 seconds and made no changes, and he quips with a "wow thanks."
This tells us that even in our agentic workflows, as we are coding up a storm, the model can make a lot of moves but won't make the progress we want. And so now we can see how turns apply not just to video games but to all other applications. One of my main CTAs for you today: as you're building out and evaluating your agent programs, take turn efficiency and action efficiency into account as well. You know, it's funny: running a benchmark, we used to have only the concept of performance divided by cost as our measure of efficiency, which is a very useful tool for understanding how much money it costs to get the performance we want. But with interactive benchmarks, we now have a new measure of efficiency: action efficiency. With all of that setup, I want to round off with how we're going to score AI models on ARC-AGI-3. Of course, we're going to measure how many levels and how many games the next-generation models complete. But with action efficiency, we get a new way to measure this performance. We will draw out what the average human needs to do, and then we will plot where frontier AI is right now on the same graph. Its near-horizontal performance shows us that, yes, AI is making a ton of moves, but it's far less efficient at turning those actions into performance. So even if it completed a level or a game, if that happens in a brute-force manner, that doesn't tell me very much, because it got to use a lot of information from its environment. This delta between the two is what we call the human-AI gap right now. And going back to the start of the presentation, I said that our definition of AGI was a system that could match the learning efficiency of humans.
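The human-AI gap can be sketched as a simple ratio per game. The LS20 numbers (humans needing about 20 actions, GPT-5 stalled at the 50-action cap) come from the talk; the helper function and the second game's numbers are illustrative, not the official ARC-AGI-3 scoring function:

```python
# Human-AI gap in action efficiency. An illustrative sketch, not the
# official ARC-AGI-3 scoring function.

def human_ai_gap(human_actions, agent_actions):
    """How many times more actions the agent spent than the median human.
    1.0 means human-level action efficiency; larger means less efficient."""
    return agent_actions / human_actions

games = {
    "ls20":  {"human": 20, "agent": 50},   # numbers mentioned in the talk
    "gameX": {"human": 25, "agent": 140},  # hypothetical second game
}

gaps = {name: human_ai_gap(d["human"], d["agent"]) for name, d in games.items()}
mean_gap = sum(gaps.values()) / len(gaps)
```

A mean gap above 1.0 across the private evaluation set would be one concrete way to state "AI is less action-efficient than the human baseline"; matching or dropping below 1.0 is the bar the talk sets for human-level learning efficiency.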
Well, since we observe this gap, our claim is that we do not yet have proof of AGI. Now, getting towards the end here, I want to wrap up with a question I get all the time: say something beats ARC-AGI-3, what does that mean? What can you claim? The way I like to answer that is to enumerate the assertions and claims we're able to make. Number one, we can assert that, yes, this AI has navigated novel, unseen environments. It has learned the rules of the environment, it was able to orient itself towards a goal, and it was able to execute a plan to reach that goal. But the last one is the one that blows my mind every time; I get a little bit of goosebumps thinking about when this eventually ends up being true. It will have done those first three things, but by either matching or surpassing human-level action efficiency. And again, humans are our only proof point of general intelligence. So the natural next question from all this is: do we claim that this is AGI?

Segment 5 (20:00 - 21:00)

Well, as with previous versions of ARC-AGI, the answer is no. We do not claim that this is AGI. But on the flip side, we do claim that this is the most authoritative evidence of generalization that a model has demonstrated to date. My only CTA for you today is to go play the games. We have six preview games out right now; you can go to three.arcprize.org and check them out. Our target is to come out with 175 games by Q1 of next year; January is actually a little too soon. But also, if you're feeling agentic, we have an API that you can use to play these games with your own agents. So you can go spin these up and go for it. Thank you for your time today.
