# I Used Karpathy’s Autoresearch to Train an LLM!

## Metadata

- **Channel:** Thu Vu
- **YouTube:** https://www.youtube.com/watch?v=XXR0zZ0_16M
- **Date:** 24.04.2026
- **Duration:** 15:38
- **Views:** 39,156

## Description

💻 Get started with Mistral Vibe 👉 https://mistr.al/vibe-thuvu-yt

🔗 Git repo for this tutorial 👉  https://github.com/thu-vu92/autoresearch_folktales

📩 Get FREE weekly AI & data insights 👉 https://thu-vu.ck.page/49c5ee08f6
🌟 Master Python and Build Awesome AI Projects 👉 https://python-course-earlybird.framer.website/

🔑 TIMESTAMPS
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
0:00 - Intro
1:11 - AI coding agent Mistral Vibe (sponsor)
2:03 - How autoresearch works
6:55 - Use cases
7:52 - Walkthrough training an LLM
12:30 - Training results
14:25 - Conclusions

#autoresearch #ai #ThuVu

## Contents

### [0:00](https://www.youtube.com/watch?v=XXR0zZ0_16M) Intro

Over 41% of global code is now written by AI, and most developers rely on AI coding agents daily. But that's only half the story. Recently, Andrej Karpathy open-sourced a project called autoresearch, and it does something that feels a little bit like science fiction: it lets an AI coding agent improve a program by itself, autonomously, in a loop. This project is an attempt to redefine how humans work with AI. We've gone past the vibe coding stage, where the human prompts, the AI writes code, and the human reviews, to agentic engineering, where the human orchestrates agents in real time, so the human is the director. Autoresearch takes the next step, where the human doesn't even orchestrate. They just describe what good research should look like in a markdown file and walk away. So the human role here is more like a research advisor. In today's video, we're going to experiment with this autoresearch tool. I'll show you how autoresearch works, then I'll walk you through setting it up with a real project, step by step. We're going to train a tiny language model on folklore and mythology datasets. For this project, I'm going to use Mistral Vibe, which is a command-line coding assistant. Big thanks to Mistral AI for

### [1:11](https://www.youtube.com/watch?v=XXR0zZ0_16M&t=71s) AI coding agent Mistral Vibe (sponsor)

sponsoring this video. Vibe is included in Le Chat Pro and Team plans if you're already using Mistral. It's fully open source, so you don't need any subscription to start using it. It offers natural language prompts, multi-file editing, and project-aware context across your whole codebase. Very importantly, it has built-in agent modes, including auto-approve, where it can execute file edits and commands without stopping to ask for your permission, which is what we need for this autoresearch project. You can use the command-line interface, but if you want a slightly better developer experience, you can also plug it into an IDE like Zed, VS Code, or JetBrains. By default, Vibe is powered by the Devstral 2 model, which is one of the most advanced coding models available via paid subscription. But you can connect Vibe to any open-source model, too. If this sounds interesting, link in the

### [2:03](https://www.youtube.com/watch?v=XXR0zZ0_16M&t=123s) How autoresearch works

description. Okay, let's first dive into how autoresearch actually works. Here's a little backstory. Karpathy had a training script for a GPT-style language model, about 630 lines of Python code, that he'd been manually optimizing for months: tweaking hyperparameters, trying different architectures, adjusting learning rates, the usual machine learning research grind. At some point, he thought, "Why am I doing this myself? Why don't I just let an AI coding agent do this loop for me?" So he built autoresearch and open-sourced it. It picked up tens of thousands of GitHub stars within days, and once you see the design, you'll understand why.

The autoresearch design comes down to a contract between three files: prepare.py, train.py, and program.md. The prepare.py file is used for data prep, so downloading the training data. It also defines the validation metric for this particular project, which is val_bpb, validation bits per byte. Note that this whole data preparation pipeline and evaluation metric are particular to Andrej Karpathy's original project, which is about training an LLM. For your own research, it might look different, and I'll explain a little bit more about other use cases for autoresearch in a bit.

Secondly, we have the train.py file, which is the agent's sandbox. It contains the 600-ish lines of code that train a GPT model. Again, for other use cases, this might look different. The important thing to know is that this is the only file that the AI agent can edit. It can modify the details of this training code, including trying out different architectures, hyperparameters, optimizers, batch sizes, and so on. This file is basically edited and iterated on by the agent.

Finally, we have program.md. This file contains the baseline instructions for the AI agent, so you point your agent here and let it go. This file is written, edited, and iterated on by the human.
Here you can tell the agent what research directions to pursue, what to avoid, and how to approach different experiments. It sets the general guidelines for what the agent should do. In the original version, Karpathy briefly described the setup and the experimentation, so what the agent can and cannot do in each experiment. For example, it cannot modify the prepare.py file, install new packages or add dependencies, or modify the evaluation metric. It specifies the goal of the research, which is to get the lowest val_bpb. It also sets some other constraints, for example the time budget: the training time for each experiment should always be 5 minutes. Every experiment gets the same time, so they are directly comparable. The agent can't cheat just by training for longer, and only the idea matters. It also tells the agent to follow some key principles, such as "all else being equal, simpler is better": a small improvement that adds ugly complexity is not worth it. It describes what the output format should look like for each experiment, how to log the results, and the whole experiment loop.

At the end, it even says, "Never stop. Once the experiment loop has begun after the initial setup, do not pause or ask the human if you should continue. Do not ask, 'Should I keep going?' or 'Is this a good stopping point?' The human might be asleep or gone from the computer and expects you to continue working indefinitely until you are manually stopped. The loop runs until the human interrupts you, period."

This diagram visualizes the experiment cycle at the core of autoresearch. First, the agent reads the program.md file to understand the current research priorities and constraints. Then it examines the current train.py file, which is the baseline version, and proposes a hypothesis. For example, it can propose an architecture change for the model, a different optimizer, or any other training modification. Then it commits the change to a Git branch.
Then it runs the training for exactly 5 minutes, automatically. You can adjust this 5-minute time budget as well, to whatever fits your own use case better. After it finishes training, it evaluates the result using the given scoring metric, which in this case is val_bpb. If the metric improved, the commit stays. If not, it discards the change and reverts to the previous version with git reset. This is called a ratchet loop because, like a mechanical ratchet, the codebase can only move forward. Each successful experiment adds a commit and each failure gets reverted, so improvements accumulate one at a time and you can never slide backward. You can actually

### [6:55](https://www.youtube.com/watch?v=XXR0zZ0_16M&t=415s) Use cases

apply this pattern to any other domain or use case where you can define an automatic scoring function or metric. For example, think of optimizing website designs for loading speed, or optimizing a trading strategy, where the agent tweaks your buy/sell rules, backtests them against years of market data, and scores each one by a certain metric. In the marketing domain, you can think of optimizing emails, landing page copy, and so on. Any use case can work as long as it meets these three conditions. First, we want a clear metric that the AI agent will optimize. Ideally, it should be one number that can be measured automatically. The second condition is that we have one file to edit, one file only. The third condition is a time-boxed loop, which means the agent can run and finish an experiment within a limited time frame. So that's the formula. Okay, now

### [7:52](https://www.youtube.com/watch?v=XXR0zZ0_16M&t=472s) Walkthrough training an LLM

let me walk you through, step by step, how we can run autoresearch for our own project. The idea is that we're going to train an LLM on an open-source dataset about folklore and mythology, which can be downloaded directly from Hugging Face. Karpathy himself ran this autoresearch project on some really specialized GPUs. But I want to run this project on my own M1 MacBook Pro, and thankfully, someone in the community has created a macOS (Apple Silicon) version of the autoresearch project. For the autonomous agent, we're going to use Mistral Vibe.

All right, let me pull up my terminal. First, we need to install uv, the Python package manager that autoresearch uses. To quickly verify that uv is indeed installed, I'll run `uv --version`. And here's the version number. That's good. Let's move on. Next, let's install Mistral Vibe with this command. For a quick check, you can open Vibe in the project folder with the `vibe` command. On first launch, it will ask for your Mistral API key, so feel free to head over to console.mistral.ai, generate a free API key, and paste it in. All right, this works, so our agent is ready. And that's the entire setup. Let's quit it for now.

Next, let me `cd` to this research folder on my computer. Then we're going to clone the autoresearch macOS Git repo into this autoresearch_folktales folder. Once that's done, we have basically downloaded all the files from this GitHub repo into our project folder, autoresearch_folktales. Now let's go into this folder and install all the dependencies for autoresearch with the `uv sync` command. This command pulls in PyTorch and the other libraries we need.

All right, the next step of the setup is where our project diverges a little bit from the standard autoresearch. By default, it trains an LLM on a dataset called ClimbMix 400B shuffle. In our project, I'm going to use a different dataset, so we need to adjust the prepare.py file a little bit to download and pre-process this data instead. Now, instead of doing it myself, let's have Vibe do it: "Okay, I want to use autoresearch to train an LLM on this dataset. Please review the codebase, download the dataset, and configure what you must in the prepare.py file. Make sure the data pipeline works." We can see that the agent just goes ahead and figures out how to do that. I basically accept all the changes for the sake of the experiment, and I'm not going to meddle too much with the coding agent, because we're really trying to play the role of a research advisor rather than an engineer here.

All right, the data pipeline is now fully configured and ready for training an LLM on folklore and mythology tales. So, that's great. Let's now manually run the prepare.py file, because we only need to run this file once. Let's do `uv run prepare.py`. It's good to know that you can run bash commands inside Mistral Vibe by adding an exclamation mark at the beginning. As you can see, the data pipeline works, so now let's manually run a single training experiment to make sure that the training actually works. After it finishes running, the agent says, "Your setup is working, and you can go to autonomous research mode now." So, that's exactly what we're going to do.

Let's now enable the dangerous mode, which is the auto-approve mode in Mistral Vibe. We can simply hit Shift+Tab to toggle between the different modes. Here we are in the default mode, and with Shift+Tab we go to plan mode, then accept edits, then auto-approve, which is the dangerous mode. Honestly, under normal circumstances you should never enable the auto-approve mode, but in this autoresearch experiment we're going to use it, because that's the whole point. Now I'll just steal this prompt from the GitHub repo: "Hi, have a look at the program.md file, and let's kick off a new experiment. Let's do the setup first."
So, the agent goes through the setup in the program.md file, and then we just let it go ahead with the experiments, training the model and trying different things.
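
The keep-or-revert cycle that the agent is now running can be sketched in a few lines of Python. This is only an illustration of the ratchet idea described earlier, not code from the autoresearch repo: the `Experiment` class, the `ratchet` function, the experiment descriptions, and all the numbers are made up for this example, and the real loop uses Git commits and `git reset` where this sketch just records "keep" or "discard".

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    description: str  # hypothetical summary of what the agent changed in train.py
    val_bpb: float    # validation bits per byte after the fixed time budget


def ratchet(baseline: Experiment, proposals: list[Experiment]) -> list[tuple[str, float, str]]:
    """Keep a proposal only if it beats the best val_bpb so far (lower is better)."""
    best = baseline.val_bpb
    log = [(baseline.description, baseline.val_bpb, "baseline")]
    for exp in proposals:
        if exp.val_bpb < best:
            # Improvement: in the real loop, the commit stays on the branch.
            best = exp.val_bpb
            log.append((exp.description, exp.val_bpb, "keep"))
        else:
            # No improvement: in the real loop, the change is reverted.
            log.append((exp.description, exp.val_bpb, "discard"))
    return log


# Made-up numbers purely for illustration, not results from the video.
log = ratchet(
    Experiment("baseline", 2.00),
    [
        Experiment("higher learning rate", 1.95),
        Experiment("deeper model", 1.98),
        Experiment("smaller batch size", 1.90),
    ],
)
for description, val_bpb, status in log:
    print(f"{val_bpb:.2f}  {status:8}  {description}")
```

Note that the second proposal is discarded even though it beats the baseline, because it does not beat the best result kept so far. That is what makes the loop a ratchet: every new idea is compared against the current best version, never against an older one.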

### [12:30](https://www.youtube.com/watch?v=XXR0zZ0_16M&t=750s) Training results

So, good morning. I'm super curious what the agent has done. I noticed that I eventually ran out of tokens, as you can see, but if we look at the results, our agent was able to run 11 experiments in total. Here are all the commits, and here are the val_bpb values, together with whether each experiment was successful, meaning that it was able to reduce this metric, or whether it failed, which means that the agent just discarded that commit. Here you can see the description of what happened. The first one is just the baseline model, and the subsequent experiments are the tweaks or modifications that the coding agent made in an attempt to improve the metric. If you don't know much about deep learning, don't worry. The idea is that the agent just tries out different combinations of parameters and model configurations, trains the model with these modifications, and then evaluates the final metric. Visualizing this result, you can see that the scoring metric nicely improves across experiments.

Here is a story sample from the baseline model: "Once upon a time in a land far away, and now it pour all the kitchen." So, there's a lot of broken grammar, and the story doesn't make any sense. And here is another story sample from the final model, the last iteration after the autoresearch self-improvement loop: "Once upon a time in a land far away, and then the old man as a wind sank over as it would be told her to go with him again." The story still doesn't make much sense, but I think it's a little bit better than the sample from the baseline model. The sentences are a little bit more complete, so I think this illustrates the improvement in the training process that we've seen

### [14:25](https://www.youtube.com/watch?v=XXR0zZ0_16M&t=865s) Conclusions

here. So, congratulations, you've just done your first autoresearch experiment: building an autonomous improvement loop with an AI agent that runs experiments, evaluates results, and makes a system better without a human touching anything. You can potentially come up with use cases for this in many different domains, such as marketing, finance, education, or engineering: anything that has a clear evaluation metric you can optimize and where you can run an experiment in a reasonable time frame. But here's the catch. The agent can handle execution, but the judgment behind the research agenda is still the role of humans. There's a line from the DataCamp guide on autoresearch that I think really nails it: writing a good program.md requires having done the research yourself. You need to know which directions are worth trying, what "better" means for your problem, and when incremental gains have run their course. And honestly, that might be the most valuable skill for the next decade. I've shared the full guide for this walkthrough in the description, so you can follow along and build this yourself. The Mistral Vibe link is there, too, if you want to try it out. Thanks for watching, and I'm going to go read some actual fairy tales now.

---
*Source: https://ekstraktznaniy.ru/video/49819*