Remember, for context, that solutions and their scores for these tasks are kept in an evolutionary database. And remember, Gemini models have been confirmed to have up to a 10 million token context window. Those models aren't released yet; the public ones only go up to 2 million tokens. But clearly, that evolutionary database could one day grow incredibly large, giving any future model a veritable Library of Alexandria to draw upon. For those who've been watching a while, it might remind you of my coverage of Voyager, an agent for Minecraft with an ever-growing skill library of executable code. So, first obvious future improvement: a much bigger evolutionary database.

Second, as we hinted at, Alpha Evolve is model agnostic. So as hardware improves, training time shrinks, and knowledge is distilled to help make a better Gemini 3, that Gemini 3 will in turn make a much better LLM within Alpha Evolve.

And that brings us to the ablations. This was a really cool part of the paper, because it showed that every part of the coding agent we've described so far was actually crucial. For example, if you used only a small base LLM, Gemini Flash, not Gemini Pro, performance capped out at a lower point. If you didn't have that massive context window and couldn't do full-file evolution, performance again capped out at a much lower point. If you're listening to this, by the way: all of the ablations show lower performance if you don't employ the full method. Even dropping the meta-prompting, where you evolve which prompts to use, impeded performance.

And for those over on my Patreon, you may remember that from the beginning of AI Insiders, I did an interview with Tim Rocktäschel, a key figure at Google DeepMind. He gave us what turned out to be an early preview of this prompt evolution approach with his paper Promptbreeder. What Promptbreeder does is this: if you evaluate the fitness of prompts against some domain-specific held-out validation set, then over time it will evolve more and more domain-specific prompts. That's exactly what we saw in this paper.

And there's actually one more paper that I think gives a pretty great analogy for what is happening here with Alpha Evolve, and that's DrEureka from Nvidia. For this, imagine trying to handcraft instructions for a robotic hand to teach it how to flip a pen. Super boring, it would take ages, and it isn't particularly effective. But now imagine you can give the language model feedback about how each iteration is doing: which reward functions perform well, and which don't. That's like the evaluation metrics that humans provide for Alpha Evolve. With that feedback, DrEureka and Alpha Evolve can iterate on their suggestions. Both approaches, as you now know, produce state-of-the-art results, and hopefully that gives you an intuition, as it did for me, for why humans couldn't always have reached these kinds of levels; why Alpha Evolve points to novel solutions that humans wouldn't necessarily have found even if they'd kept trying. Humans often get stuck in local optima according to their inherent biases, and they don't have time to iterate on tens of thousands of potential solutions.
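To make that shared loop concrete, here's a minimal Python sketch of the pattern behind both Alpha Evolve and DrEureka, as I understand it. To be clear, this is illustrative, not the actual implementation: `llm_propose` and `evaluate` are hypothetical stand-ins for a model API call and a task-specific benchmark, and the real systems use far richer databases and sampling strategies than this.

```python
import random

def llm_propose(parent_code: str, feedback: str) -> str:
    """Hypothetical stand-in for a language model call (e.g. Gemini):
    returns a mutated candidate program, guided by the parent's code
    and feedback on how it scored."""
    raise NotImplementedError("wire up a model API here")

def evaluate(code: str) -> float:
    """Automated fitness metric, e.g. the speedup of a kernel or the
    success rate of a robot reward function. Higher is better."""
    raise NotImplementedError("wire up a benchmark here")

def evolve(seed_code: str, generations: int = 10_000) -> str:
    # The evolutionary database: (score, program) pairs that every
    # future generation can draw upon.
    database = [(evaluate(seed_code), seed_code)]
    for _ in range(generations):
        # Tournament selection: sample a few entries and keep the fittest,
        # biasing toward high scorers while preserving some diversity.
        score, parent = max(random.sample(database, k=min(3, len(database))))
        child = llm_propose(parent, feedback=f"parent scored {score:.3f}")
        try:
            database.append((evaluate(child), child))
        except Exception:
            pass  # a broken candidate simply earns no place in the database
    return max(database)[1]  # the best program found
```

The crucial design choice is that fitness comes from an automated, machine-gradable evaluation function, which is what lets the loop churn through thousands of candidates with no human in it.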
Here's Guanzhi Wang, who worked on both the original Eureka and Voyager papers: "It has a lot of prior knowledge, and therefore it can just propose different kinds of mutations and variations of the reward function based on the environment context. It just generates those reward functions based on its prior knowledge, not like a human. A human has to manually tune the reward functions, and it's very easy for a human to get stuck in a local optimum. But GPT-4 can generate tons of reward functions at the same time, and then, based on the performance of each reward function, it can continuously improve them. In Eureka, it's more like an evolutionary search."

Third room for future improvement, and this is a big one: the code snippet that Alpha Evolve improves doesn't have to be the final function that directly generates the solution. It can be a search algorithm that is later used to find an optimal final function. So Alpha Evolve can essentially keep improving how we search for optimal programs.

Fourth future improvement, and this is subtle and might be missed by many, but the authors foresee something that, for me, is quite important. They say: "However, with these improvements, we envision that the value of setting up more environments (problems) with robust evaluation functions will become more widely recognized, which in turn will result in more high-value practical discoveries going forward." You guys will get bored, or probably already are getting bored, of me talking about how benchmarks are all we need. But honestly, this paper screams of the need for robust evaluation functions, and the incentives to create them are now much clearer, knowing that you will have a system on hand to optimize against them.

Okay, but I did promise you guys some quirks. So