Remember, for context, that solutions and their scores for these tasks are kept in an evolutionary database. And remember, Gemini models have been confirmed to have up to a 10 million token context window. Those models aren't released yet; the public ones only go up to 2 million tokens. But clearly, that evolutionary database could one day grow incredibly large, giving any future model a veritable Library of Alexandria to draw upon. For those who've been watching a while, it might remind you of my coverage of Voyager, an agent for Minecraft with an ever-growing skill library of executable code. So, first obvious future improvement: a much bigger evolutionary database.

Second, as we hinted at, Alpha Evolve is model agnostic. So as hardware improves, training time shrinks, and knowledge is distilled to help make a better Gemini 3, that Gemini 3 will in turn make a much better LLM within Alpha Evolve.

And that brings us to the ablations. This was a really cool part of the paper, because it showed that every part of the coding agent we've described so far was actually crucial. For example, if you used only a small base LLM, Gemini Flash, not Gemini Pro, performance capped out at a lower point. If you didn't have that massive context window and couldn't do full-file evolution, performance again capped out at a much lower point. If you're listening to this, by the way: all of the ablations show lower performance if you don't employ the full method. Even dropping the meta-prompting, where you evolve which prompts to use, impeded performance.

And for those over on my Patreon, you may remember that from the beginning of AI Insiders, I did an interview with Tim Rocktäschel, a key figure at Google DeepMind. He gave us what turned out to be an early preview of this prompt evolution approach with his paper Promptbreeder. What Promptbreeder does is this: if you evaluate the fitness of prompts against some domain-specific held-out validation set, then over time it will evolve more and more domain-specific prompts. That's exactly what we saw in this paper.

And there's actually one more paper that I think gives a pretty great analogy for what is happening here with Alpha Evolve, and that's DrEureka from Nvidia. For this, imagine trying to handcraft instructions for a robotic hand to teach it how to flip a pen. Super boring, it would take ages, and it isn't particularly effective. But now imagine you can give the language model feedback about how each iteration is doing: which reward functions perform well, and which don't. That's like the evaluation metrics that humans provide for Alpha Evolve. With that feedback, DrEureka and Alpha Evolve can iterate on their suggestions. Both approaches, as you now know, produce state-of-the-art results, and hopefully that gives you an intuition, as it did for me, for why humans couldn't always have reached these kinds of levels; why Alpha Evolve points to novel solutions that humans wouldn't necessarily have found even if they'd kept trying. Humans often get stuck in local optima according to their inherent biases, and they don't have time to iterate on tens of thousands of potential solutions.
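To make that shared loop concrete, here's a minimal Python sketch of the pattern behind both Alpha Evolve and DrEureka, as I understand it. To be clear, this is illustrative, not the actual implementation: `llm_propose` and `evaluate` are hypothetical stand-ins for a model API call and a task-specific benchmark, and the real systems use far richer databases and sampling strategies than this.

```python
import random

def llm_propose(parent_code: str, feedback: str) -> str:
    """Hypothetical stand-in for a language model call (e.g. Gemini):
    returns a mutated candidate program, guided by the parent's code
    and feedback on how it scored."""
    raise NotImplementedError("wire up a model API here")

def evaluate(code: str) -> float:
    """Automated fitness metric, e.g. the speedup of a kernel or the
    success rate of a robot reward function. Higher is better."""
    raise NotImplementedError("wire up a benchmark here")

def evolve(seed_code: str, generations: int = 10_000) -> str:
    # The evolutionary database: (score, program) pairs that every
    # future generation can draw upon.
    database = [(evaluate(seed_code), seed_code)]
    for _ in range(generations):
        # Tournament selection: sample a few entries and keep the fittest,
        # biasing toward high scorers while preserving some diversity.
        score, parent = max(random.sample(database, k=min(3, len(database))))
        child = llm_propose(parent, feedback=f"parent scored {score:.3f}")
        try:
            database.append((evaluate(child), child))
        except Exception:
            pass  # a broken candidate simply earns no place in the database
    return max(database)[1]  # the best program found
```

The crucial design choice is that fitness comes from an automated, machine-gradable evaluation function, which is what lets the loop churn through thousands of candidates with no human in it.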
Here's Guanzhi Wang, who worked on both the original Eureka and Voyager papers: "It has a lot of prior knowledge, and therefore it can just propose different kinds of mutations and variations of the reward function based on the environment context. It just generates those reward functions based on its prior knowledge, not like a human. A human has to manually tune the reward functions, and it's very easy for a human to get stuck in a local optimum. But GPT-4 can generate tons of reward functions at the same time, and then, based on the performance of each reward function, it can continuously improve them. In Eureka, it's more like an evolutionary search."

Third room for future improvement, and this is a big one: the code snippet that Alpha Evolve improves doesn't have to be the final function that directly generates the solution. It can be a search algorithm that is later used to find an optimal final function. So Alpha Evolve can essentially keep improving how we search for optimal programs.

Fourth future improvement, and this is subtle and might be missed by many, but the authors foresee something that, for me, is quite important. They say: "However, with these improvements, we envision that the value of setting up more environments (problems) with robust evaluation functions will become more widely recognized, which in turn will result in more high-value practical discoveries going forward." You guys will get bored, or probably already are getting bored, of me talking about how benchmarks are all we need. But honestly, this paper screams of the need for robust evaluation functions, and the incentives to create them are now much clearer, knowing that you will have a system on hand to optimize against them.

Okay, but I did promise you guys some quirks. So