# Next Stage of AI Scientist: NanoResearch (Skills, Mem, RL)

## Metadata

- **Channel:** Discover AI
- **YouTube:** https://www.youtube.com/watch?v=pJMBwb3nL3k
- **Date:** 14.05.2026
- **Duration:** 28:19
- **Views:** 1,837
- **Source:** https://ekstraktznaniy.ru/video/50833

## Description

This video shifts the perspective from Research Automation to Collaborative Agent-Human Co-evolution. An AI scientist shouldn't be a generic paper-generating factory; it must be a dynamic system whose parameter space continuously warps to internalize the unique research procedures or workflows and resource constraints of the specific lab it is deployed in.

All rights with authors:
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
Jinhang Xu†1, Qiyuan Zhu†1,2, Yujun Wu†1,3, Zirui Wang†1,4, Dongxu Zhang†1,5, Jianxin Tang, Marcia Tian6, Yiling Duan1, Siyuan Li4, Jingxuan Wei1, Sirui Han∗2, Yike Guo∗2, Odin Zhang∗7, Conghui He∗1, Cheng Tan∗1
from
1 Shanghai Artificial Intelligence Laboratory,
2 The Hong Kong University of Science and Technology,
3 Peking University,
4 Zhejiang University,
5 Xi’an Jiaotong University,
6 East China University of Science and Technology,
7 The Chinese University of Hong Kong
arXiv:2605.10813


#airesearch 
#aiscie

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

Hello community. So great that you are back. Today we talk about the AI scientist, and we go absolutely crazy today. It is about AI and autonomous scientific discovery: AI will discover new things, which is what we also call an AI scientist. But you know that the existing AI scientists are trapped in a Markovian, one-size-fits-all paradigm, no? They treat every new research project as an isolated event starting at t = 0 and produce some uniformly generic paper. We are going to change this today. Recently we talked about proactive AI by Google, and today we are going to look at how we can implement this, how we can make an AI companion for an individual human researcher that really understands the highly specific methodological constraints that you as a researcher want. Maybe you prefer a certain mathematical formalism or a certain way of doing your experiments. So today we go one step further and build an individual AI scientist that works closely with the researcher. And this is it: NanoResearch, co-evolving, and now hold on to your socks, skills, memory, and policy for personalized research automation. And this is by, my goodness, Shanghai Artificial Intelligence Laboratory, Hong Kong University of Science and Technology, Peking University, Zhejiang University, Xi'an Jiaotong University, East China University of Science and Technology, and the Chinese University of Hong Kong. Beautiful, May 11th. So they finally do the triple. We go for the skill MD files, then a complete memory update, and we do not stop there in the harness of the AI: we go all the way down to the LLM and optimize the policy, how the LLM itself is learning. We now have reinforcement learning of the LLM itself. So all the elements we know are now modified. A lot of questions will be, "Hey, where is the code?" Well, yes, of course, here is the code. Enjoy it. MIT license.
Now, you know, current automated researchers like AI Scientist v2, or whatever you like, behave like an ergodic system at high temperature. Every time you give it a prompt, it explores the entire vast phase space of possible methodologies and ultimately produces an average macroscopic output that looks exactly the same for everyone. There is no individualization. The AI is looking for one generic solution, and it ignores whether you are a researcher who likes to experiment or one who goes just for the theoretical proof. Or maybe you are a researcher who says, "Hey, I want to do all the ablation studies. I want to play around. I want triggers that activate some components and deactivate others." No? So current AI researchers are, if you want, memoryless and Markovian. Now, let's change this. NanoResearch introduces a path dependency. It learns from what the LLM has already learned, from what is already there as either a positive or a negative experience of this AI scientist, and there will be some symmetry breaking. Yes, of course: the moment we have an autonomously learning, self-learning LLM, we break the symmetry. So if you are an experimental physicist who demands rigorous ablations, or a theorist who prefers wild mathematical architectures, it will stop producing the average paper and really adapt to your personal style. This means we couple our multi-agent system, plus harness, plus everything, to the style of the specific human user, like a localized heat bath, if you want a physics picture. Now, the companies want to sell us individualized LLMs with their particular memory and their particular skills, and of course we should save everything on their cloud platform. So I will try to do everything on my local machine. Is this possible? The beauty is that we have a three-level co-evolution in this AI scientist.
To my knowledge, this is the first system that tries this: it accumulates hard constraints into the skill bank and into the memory modules, plural, and crucially it uses your natural-language feedback to physically alter its own parameters, the LLM's tensor weights, via reinforcement learning. So this really becomes your individual machine. We now have the human in the loop, and this is you, and the machine learns from your feedback. I work daily with those machines. So if I use it for, I don't know, three, four, five months, maybe a year, the machine should really know exactly how I like jobs to be done and how I prefer to

### Segment 2 (05:00 - 10:00) [5:00]

think. So, let's do the learning. The skill bank is easy: we have a skill MD file. Memory can be easy too. But how the hell do you do it if you have no ICL, no in-context learning, if you really go for reinforcement learning and a tensor-weight learning methodology? Now it gets interesting. Here we have a screenshot from the original study. In the first row you see that everything is the same, and they say, "And now comes our NanoResearch." Now the skills, the policy, and the memory are optimized. Whether you are a rigorous experimenter, an exploratory researcher, or a pragmatic submitter, you can have your personal style, and this AI system will learn with you. And you might say, "How is this possible? What is the mathematics? What is the optimization engine? How can we code this?" First the mathematics; once we have worked through it, we know how to code and what to code. So, let's start. The first part is easy: skill and memory distillation. You see "distillation", so yes, we have an evolution, a kind of teacher-student learning. At the end of each single research trajectory, let's call it tau, which includes all the actions the AI scientist performed, the self-critique of the system, and the outcome together with the feedback given by the human operator, we have an agent, our orchestrator, the AI boss, who does not just wipe the context window clean. Instead we say, "Hey, important information came up in this learning cycle." So this orchestrator agent distills the macroscopic invariant rules, let's call them the skills, and the project-specific facts into permanent storage: the skill bank and our memory. You see, it's simple, no? We have an update mechanism for the skills, as you see here, and one for the memory. Given your complexity and your domain knowledge, it builds on your preferences.
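The end-of-trajectory update described above can be sketched roughly as follows. This is a minimal illustration, not the NanoResearch implementation: the `Trajectory` and `Stores` classes and the rule-based `distill` stub are hypothetical stand-ins, and in the real system the distillation would be done by the orchestrator LLM rather than by string matching.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One research run (tau): what the agent did and how it was judged."""
    actions: list
    self_critique: str
    outcome: str
    user_feedback: str


@dataclass
class Stores:
    skills: list = field(default_factory=list)  # reusable, project-independent rules
    memory: list = field(default_factory=list)  # project-specific facts


def distill(tau: Trajectory) -> tuple:
    """Hypothetical stub for the orchestrator's distillation step.

    Actions prefixed with "fix:" are treated as reusable skills; the outcome
    and the user's feedback are treated as project facts.
    """
    new_skills = [a[len("fix:"):].strip() for a in tau.actions if a.startswith("fix:")]
    new_facts = [f"outcome: {tau.outcome}", f"feedback: {tau.user_feedback}"]
    return new_skills, new_facts


def update(stores: Stores, tau: Trajectory) -> Stores:
    """Append distilled items to both stores, skipping exact duplicates."""
    new_skills, new_facts = distill(tau)
    for skill in new_skills:
        if skill not in stores.skills:
            stores.skills.append(skill)
    for fact in new_facts:
        if fact not in stores.memory:
            stores.memory.append(fact)
    return stores
```

The point of the sketch is only the shape of the loop: nothing is thrown away when a run ends, and both stores grow monotonically across runs.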
Now, before each task in the next round, since this is an iterative AI scientist, the orchestrator retrieves the top-k skills and the top memories relevant to the current context via a heuristic scoring function. This is a very simple function: it combines keyword matching, tag alignment, and maybe even recency, with weights adapted to the target. So in the next step we do not go with the same basic knowledge, or only the parametric knowledge of the LLM; we go with skill retrieval and memory retrieval. If you have seen my last video, you know exactly what I mean by skill retrieval. We talked about it in detail. Then we have the adaptive planning phase, and this is where it gets interesting. While our skills and our memory capture broad procedural knowledge and project facts, we further internalize fine-grained, user-specific preferences. At the end of each stage, and I am going to show you the stages in a moment, the user provides immediate natural-language feedback. I just type my comment, or speak to the AI in natural language, and this is encoded directly into the orchestrator's planner model π_θ. This is the classical policy. The orchestrator is simply the LLM that we fine-tune, that we train on this comment. So rather than being compressed or retrieved, which was the old case where an update was just a concatenation to your skill MD files or your memory MD file, the feedback now goes into the LLM itself, into the boss of our multi-agent system, the orchestrator agent. And it learns exactly the mistakes, the positive steps, the negative steps, and how we prefer science, or whatever job you want it to learn, to be done.
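A heuristic retrieval score of the kind described above (keyword matching, tag alignment, recency, weighted) might look like the sketch below. The exact formula, the weights, and the `Entry` fields are assumptions made for illustration; the video only names the ingredients.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """A skill or memory item (hypothetical fields)."""
    name: str
    text: str
    tags: frozenset
    last_used_round: int  # assumed to be <= the current round


def score(entry, query_words, query_tags, current_round,
          w_keyword=1.0, w_tag=1.0, w_recency=0.5):
    """Weighted sum of keyword overlap, tag overlap, and a recency bonus."""
    entry_words = set(entry.text.lower().split())
    keyword = len(entry_words & query_words) / max(len(query_words), 1)
    tag = len(entry.tags & query_tags) / max(len(query_tags), 1)
    recency = 1.0 / (1 + current_round - entry.last_used_round)
    return w_keyword * keyword + w_tag * tag + w_recency * recency


def retrieve_top_k(entries, query, query_tags, current_round, k=2):
    """Return the k highest-scoring entries for a free-text query."""
    query_words = set(query.lower().split())
    query_tags = frozenset(query_tags)
    ranked = sorted(
        entries,
        key=lambda e: score(e, query_words, query_tags, current_round),
        reverse=True,
    )
    return ranked[:k]
```

The three weights are the knobs that "adapt to the target": a debugging stage could weight tags more heavily, for example, while a writing stage could favor recent entries.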
So, you see, the feedback is free-form natural language, my human user feedback, rather than scalar rewards or preference pairs. So how do you do this? They adopt self-distillation policy optimization. This is the first time I have seen it used in this massive way. SDPO converts a single feedback instance into a dense token-level learning signal without any reward model. We might say, "Yippee! Finally! Now we can go straight to the learning." And you might say, "Oh, wow, 'straightforward' is interesting," but don't you worry. Yes, of course, they give you the SDPO gradient: nabla theta of the SDPO objective, a logit-level policy

### Segment 3 (10:00 - 15:00) [10:00]

gradient, and you'll say, "What the hell is this?" So either you have a look at my video where I explain the AI mathematics, or you go to this particular paper from February 2026 from ETH Zurich, MIT, and the University of Zurich, Switzerland, on aligning language models from user interaction. There they show a new method for learning directly from user interaction through self-distillation. In this paper they go step by step and explain exactly how they do it and how they build the methodology. What is possible via self-distillation? Think about this. Self-distillation is such a genius idea, no? They start with the classical formula, show you the gradient, and you say, "Now I understand how this happens." The hindsight policy π_θ acts as the teacher and is treated as a fixed target during each update, for which we define a detached hindsight model π_θ̄ (pi theta bar). This is defined in a particular way, and the paper has an appendix that goes into the mathematical proof of this formula. So if you want to learn more, this is the paper I would recommend. But let's come back to our topic. As you see, following six, they simply take this formula from the other publication and use it for the self-distillation. Here is our advantage function, and here is our feedback f, of course. So you see, we stand on the shoulders of giants and continue to build on the knowledge of the geniuses before us. The beauty, again: no reward model. The user provides natural-language feedback. You tell the machine, "Hey, I don't think this is the way to go. I would prefer another mathematical method. I would prefer that you compare these two methods, or that you switch completely to something else." You just give it a feedback f at the end of each stage, which our orchestrator, the boss agent, then internalizes into its planner policy. With this formula, and we know how to code this formula, the explicit feedback turns into persistent preferences. So we really let the LLM learn this. Not in-context learning. Not writing it into a skill MD file. Not storing it as a memory somewhere in the harness of the LLM. No. We really change the tensor weights of your transformer layers, finally. Beautiful.
So, again, let's take a step back and look at this beautiful figure from the authors. We have a user. She is setting up an AI scientist. She has certain preferences. She has a certain budget, a compute budget. She has target venue styles. She has certain topics she would like to address in this scientific endeavor. The idea is simple. First, the AI retrieves all the skills and memories that may be needed for this particular job. If you have seen my last video, that was RAG for skills: retrieval-augmented execution of mathematical operators, which are now the skill operators, and it is called SkillRAE, for retrieval-augmented execution. This is exactly what you could substitute for step one: the AI finds, via RAG, the best skills available, say on the internet, in a skill database, whatever you have. Then we have the strategic element: the AI starts to come up with a plan. Ideation, experiment, writing, review. Beautiful. Then we have the coordinate stage, where we dispatch the stage agents to do whatever has been decided, and the orchestrator is, if you want, the mastermind. After this has happened, you come back to the skills. Maybe there is some new methodology you found.
You could have a code verification, say. So you have a new skill: you distill it into your system and add it to your skill database. Or the extraction for memory shows a new path forward, a new complexity, a new solution; then you write it down into the memory, and you have a new memory MD file or whatever you prefer. But you see what we have now: a constant update and retrieve across our three elements, which learn continuously. First, the memory, with the past hypotheses, the failure notes, the result logs, and the locked-down constraints.
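The self-distillation idea discussed earlier in this segment can be illustrated with a toy calculation. This is one plausible reading of the description, not the paper's exact gradient: the same model is evaluated twice on the tokens it generated, once with the user's feedback in context (the hindsight "teacher", treated as a fixed target) and once without (the "student"). The per-token difference in log-probability is then a dense learning signal, with no reward model involved. The function names and the hand-written logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_level_signal(student_logits, teacher_logits, sampled_tokens):
    """Per-token signal: log p_teacher(token) - log p_student(token).

    student_logits: logits of the policy without the feedback in context.
    teacher_logits: logits of the same policy with the feedback in context,
                    treated as constants (no gradient flows through them).
    A positive value means the feedback makes the sampled token more likely.
    """
    signal = []
    for s_row, t_row, token in zip(student_logits, teacher_logits, sampled_tokens):
        student_lp = log_softmax(s_row)[token]
        teacher_lp = log_softmax(t_row)[token]
        signal.append(teacher_lp - student_lp)
    return signal


# Toy example: vocabulary of 3 tokens, 2 generated positions.
student = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
signal = token_level_signal(student, teacher, sampled_tokens=[0, 0])
```

In this toy case the first token is reinforced (the feedback-conditioned pass rates it higher) and the second is discouraged. A real implementation would compute this on model outputs inside an autodiff framework and use it to weight the policy-gradient update.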

### Segment 4 (15:00 - 20:00) [15:00]

Second, all the skill MD files, or whatever you have: for literature search, for debugging certain patterns, for writing templates. They are evaluated and compared, and the best one is selected for the tool-use strategy, APIs, whatever you have. And third, and this is the really interesting part, the update of the policy of the core LLM of your orchestrator agent. Here the real intelligence, the learning, is happening. The planner behavior is adapted, the schedule preference is updated, and the user feedback is now really integrated into the future behavior of our orchestrator. So these two elements, the beautiful user and the beautiful orchestrator, become a beautiful couple working together in this AI scientist. And as a user you can be quite sure that the orchestrator agent behaves in a way you like, that this is what you prefer. An individualization. Congratulations. Oh, wait a minute. Here we have the stages. Officially there are three stages; never mind, you can insert some, you can play around with this. Normally you start with stage one, the ideation. Then you do the coding, the verification, the real-world experiments. Then you have the writing, the summarization, the understanding, whatever you want to publish, and then maybe you have a paper. But let's have a closer look. Stage one is simple: idea generation and planning. This is what we know; there is nothing specific, except that the system queries academic databases. It uses quantitative evidence extraction from existing papers to keep the LLM from hallucinating baselines, and it generates a pool of hypotheses based on existing papers. In the next step it uses an automated novelty verifier to filter out ideas that already exist, that have already been published.
Once we have a pool of hypotheses on how to proceed, we have a planning phase. The surviving hypothesis is translated into a strict JSON-formatted experiment blueprint: which dataset to use, which architectures to build, which ablation studies to run for verification, how to code it, and so on. Second, validation and optimization. Now we are in the lab, if you want. You have the coding phase, guided by the blueprint. The system clones the necessary repos, stages the dataset, and generates a self-contained codebase: all the models, all the training loops, all the evaluation metrics. You as a human have already given feedback that you like metric one and do not like metric three, and it strictly adheres to the user's preferred coding style. Or if you have already interacted with, let's say, Claude Code for one, two, three months, I think Claude Code has understood exactly how you like your coding style. Then you have the execution and debugging phase. You have an autonomous debugger, another LLM, that looks at the generated code and debugs it, verifies it, extends it, whatever, until you can successfully execute it. Stage three is simply that it writes everything down: it summarizes and drafts, if you want, a manuscript. It does this section by section, a deliberate architectural choice to prevent catastrophic forgetting. You do not want a context window overflow, and you want to ensure that the introduction aligns perfectly with the conclusion. So we have a section-by-section writing phase by our LLM. And then, of course, we have another standalone LLM that acts as an external reviewer. We have the review phase, and you got it. The interesting thing is that you now have user profiles. You have the same study, whatever it is, never mind, on UCI HAR. And then you decide on a user profile: who is this AI researcher?
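To make the "strict JSON-formatted experiment blueprint" from the planning phase concrete, here is a small sketch of how such a blueprint could be parsed and checked before the coding stage starts. The key names and the example values are hypothetical; the video only says the blueprint covers the dataset, the architectures, and the ablation studies.

```python
import json

# Hypothetical schema: the keys a downstream coding stage would rely on.
REQUIRED_KEYS = {"hypothesis", "dataset", "architecture", "ablations", "metrics"}


def parse_blueprint(raw: str) -> dict:
    """Parse a JSON blueprint and reject it if required fields are missing."""
    blueprint = json.loads(raw)
    if not isinstance(blueprint, dict):
        raise ValueError("blueprint must be a JSON object")
    missing = REQUIRED_KEYS - set(blueprint)
    if missing:
        raise ValueError(f"blueprint is missing keys: {sorted(missing)}")
    if not isinstance(blueprint["ablations"], list):
        raise ValueError("ablations must be a list")
    return blueprint


EXAMPLE = """
{
  "hypothesis": "placeholder hypothesis text",
  "dataset": "UCI HAR",
  "architecture": "multi-scale CNN",
  "ablations": ["remove one scale branch"],
  "metrics": ["accuracy"]
}
"""

blueprint = parse_blueprint(EXAMPLE)
```

Validating the blueprint up front is one way to keep a planning error from surfacing only after hours of compute in the coding and execution phases.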
Maybe the main characteristic of the first researcher is evidence first. The second researcher is maybe ablation focused: practical methods, clean ablations, direct implementation, review friendly. Or maybe you say, "Hey, I want my AI researcher to be purely dataset driven." So you can steer this. You can define your own preferences. Or, and I would like to try this out, go exactly with the complement of my own behavior, because I know how I do things, but I would like a counterpart that does exactly the opposite: experiments much more aggressively, tries some crazy ideas, and does not stay within the confinement of classical scientific behavior. So you see the blueprint, and it is really different. The AI scientist really comes up with different solutions. For evidence first

### Segment 5 (20:00 - 25:00) [20:00]

for example, we have a fixed multi-scale CNN, a convolutional neural network, and for the second one a temporal feature gating, or a temporal routing. So it really depends on how you define the preference profile of the AI researcher. Then the code: either it goes with a fixed encoder, or with a pluggable gate structure, some adaptive routing. Whatever methodology the AI implements in this scientific experiment really depends on the user profile. And of course the paper that gets written will also reflect your particular profile. Okay, let's have a look at the results. How good is it? Let's compare. If we look at the ablation results, what is most important? You see NanoResearch in the full implementation, for alignment, or for novelty, or whatever parameter you choose. Let's go with alignment: 8.1. Without the skill bank you would only reach 7.9. Without memory you would reach almost the same, 8.07. Without the planner, 7.8, and without the preference alignment, 8. I mean, yeah, you really cannot leave any of this out. You can choose what to take, but the beauty is really the interaction above all. Of course, what we are interested in is the comparison with two other AI scientists. In this line you have AI Scientist v2 and EvoScientist, and here you have all the performance parameters: average API calls, tokens, runtime, GPU hours, and, interestingly, cost. And they show that if we compare on a token level, an API-call level, or a runtime level, the runtime is shorter. GPU hours are almost half. The cost, therefore, is of course also massively reduced. Okay. So this is what they show us on the "performance sheet".
They list API calls, tokens, runtime, cost, and GPU hours. Okay, this is it, no? So, let's reflect on this. Because the agent crystallizes the successful debugging paths and procedural code into permanent, reusable skills and memory that this AI scientist builds up over time, it spends drastically fewer search-tree iterations guessing how to compile the codebase in later rounds. It reuses the skills it found helpful. It reuses the memories it found helpful in the past. So it behaves more or less like a student researcher who understands, "I build upon my knowledge and my successful paths," and just goes on. Makes sense. You also see that this preprint shifts the perspective on the AI scientist from pure research automation, a machine that does some automation, to collaborative agent-human co-evolution. And you might say, yes, of course it needs the human now, because otherwise the machine would not be able to develop itself, to self-learn, since for the reinforcement learning we have chosen exactly the SDPO algorithm, which learns from the user's feedback. This brings us back to the video from yesterday. Yesterday I showed you SkillRAE. SkillRAE addressed the immediate, static problem of how to assemble a perfect executable context out of a vast internet of tools; it is a RAG compiler. And I have chosen today's video for a very particular reason: you can say that SkillRAE defines the spatial solution at time t = 0. SkillRAE showed how to provide the optimal topological projection, to select the right macros and graft on the right micro dependencies, the subunits of our skills. And NanoResearch, in my interpretation, defines the temporal solution, the development over time.
So, you see: SkillRAE for t = 0, and NanoResearch for the time evolution. They go perfectly hand in hand. Read both papers in parallel if you really want to see it, and pay attention to this particular fact. And if you say, "No, I see it the other way," please write a comment. Either way, NanoResearch addresses this evolutionary problem of an AI scientist. You don't want to start from scratch every time you start up the machine again, eh?

### Segment 6 (25:00 - 28:00) [25:00]

So, as the agent writes research papers, it discovers novel procedural steps and successful debugging patches; it learns new skills by trial and error, why not? And then, yes, it distills these into new discrete MD files, new skills and new memory files, and appends them to its permanent skill bank and memory, which it uses at a later time step. But the most important fact is that the feedback is also integrated, via reinforcement learning, into the parametric knowledge of the LLM itself. A simple example: NanoResearch conducts a research experiment, runs for six hours, gets feedback from the physicist, and writes a very specific PyTorch training loop. Let's say you need this. Maybe you say, "Okay, great. This is all I needed for today." So you save this loop as a new skill in your private skill bank. The very next day the scientist comes back and says, "Okay, AI scientist, switch on." When the agent needs to write a new paper or perform new research, the orchestrator agent again starts looking around for the best tools available in its environment. And now this new skill, let's say it is a skill MD file, is available and taken into account. SkillRAE takes over: it parses the NanoResearch skill bank into a multi-scale graph, see yesterday's video, retrieves that specialized PyTorch loop we built, grafts on the specific subunit constraints for the physics lab's GPU cluster, and compiles everything into the agent's context window. You really build upon what you did yesterday, the day before yesterday, and so on. So you see, this is, if you want, an iterative self-learning process for a multi-agent system that performs research tasks, however you define that research, as an AI scientist. What I really love is this three-level evolution.
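The save-today, reuse-tomorrow cycle from the PyTorch example above can be sketched with plain Markdown files on disk. The directory layout, the file format, and the function names here are assumptions for illustration only; the video does not specify how the NanoResearch skill bank is stored.

```python
from pathlib import Path

FENCE = "`" * 3  # Markdown code fence, built here to keep this example readable


def save_skill(bank_dir: Path, name: str, description: str, code: str) -> Path:
    """Write one skill as a Markdown file: title, description, fenced code."""
    bank_dir.mkdir(parents=True, exist_ok=True)
    path = bank_dir / f"{name}.md"
    body = f"# {name}\n\n{description}\n\n{FENCE}python\n{code}\n{FENCE}\n"
    path.write_text(body, encoding="utf-8")
    return path


def load_skills(bank_dir: Path) -> dict:
    """Load every skill file in the bank, keyed by skill name."""
    return {
        p.stem: p.read_text(encoding="utf-8")
        for p in sorted(bank_dir.glob("*.md"))
    }
```

On day one the agent would call `save_skill` with the training loop it converged on; on day two `load_skills` makes that file visible to whatever retrieval step selects skills for the new task.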
Whenever the paper is finished, or the first day is finished, or at the end of any stage, once the human has provided feedback and critique, the system triggers its evolution in the following way. If you want to remember one slide, this is it. Skills: the AI abstracts the coding fixes it invented in stage two into reusable skill MD files and saves them in the skill bank. Memory: it analyzes what went wrong and logs all the failed hypotheses from stage one into the project history, so that it never repeats those dead ends. It learns from its mistakes. And policy: here the human feedback is so important. It uses the human's free-form feedback to mathematically adjust its neural weight tensors via SDPO, ensuring that in the next paper its innate intuition better matches the user's scientific taste, performance patterns, and research patterns. Isn't this beautiful? Hope to see you in my next video.
