“Self-Improving AI Agents Are Almost Here…” – DeepSeek Insider
59:43


David Ondrej · 04.03.2026 · 11,254 views · 357 likes


Video description
Wanna scale your AI business? Go here: https://www.scalesoftware.ai/start
Wanna learn how to code with AI? Go here: https://www.skool.com/new-society
Learn about the best AI business models here: https://www.youtube.com/watch?v=Ta5g-OxjPO4
Follow me on Instagram: https://www.instagram.com/davidondrej1/
Follow me on Twitter: https://x.com/DavidOndrej1
Zihan: https://zihanwang314.github.io/
AgentZero: https://github.com/agent0ai/agent-zero
Host your n8n agents on Hostinger: https://www.hostinger.com/david
Subscribe if you're serious about AI. Podcast with Zihan Wang

Contents (12 segments)

Segment 1 (00:00 - 05:00)

This is Zihan Wang, an AI researcher who's one of the authors of the famous DeepSeek-V2 paper. He also open-sourced the training methods behind DeepSeek-R1 and has over 300 citations as a first-year PhD student. In this podcast, we talk about how the top AI labs compare, the AI race between China and the USA, and what it's like to be an AI researcher. If you want to understand what the future of the world will look like, watch until the end. This is the David Ondrej podcast. Enjoy. All right, Zihan. So tell me about your time at DeepSeek. — So I joined DeepSeek at the very beginning of 2024. At that time they were basically working on MoE models, and when I joined they were working on DeepSeek-V2, which uses MLA to reduce the KV cache and make the model more efficient. During my time there I was mainly working on expert specialization, which means: we have so many experts in a very large sparse model, but how can we make them really take on their own roles? While I was there I discovered a method that helps make the model more specialized and adapted to downstream tasks, by training specialized experts specifically for one downstream task. This not only reduces the memory and compute needed when we want to adapt a model to a new domain, but also keeps the other, irrelevant experts from overfitting to a single downstream task. So the model becomes more specialized while maintaining general capabilities (see the sketch just after this exchange). — Curious about what you just said, because it's an architectural change in the model, right? — Yeah. — Do you think this is where the biggest gains are, or do you think it's cleaner data, more compute? — Under strictly the same data, a better algorithm will perform better than the baseline algorithm. But I think both are pretty important. If we have good data, the model will improve. If we have better algorithms, the model will also improve. So yeah, I think both of them are necessary. — But at this point I would say it's fair to say that each of the main labs has scraped the entire internet, and maybe Google has more proprietary data from YouTube and Google Search and so on. So do you think the biggest differences between the labs will be better algorithms, better architectures, or what do you think will distinguish these labs? Could it be something else? — Yeah, there are actually many points that can distinguish those labs. Some labs have more public data from online that they can use. Some labs have better infra, so they can iterate very fast. Some labs have more compute, so they can train more models, or larger models. I think DeepSeek really had great infra at the time I was there. I could feel that whenever I had an idea in the morning, I could implement it in the afternoon.
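A minimal PyTorch sketch of the expert-specialization idea described above. The toy layer, the top-k router, and the `specialize` helper are all illustrative assumptions, not DeepSeek's implementation; the point is just that you can freeze everything except the few experts the router already sends task data to, so adaptation is cheap and the other experts can't overfit to the new domain.

```python
# Illustrative sketch only (hypothetical names; not DeepSeek's actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """A toy top-k MoE feed-forward layer."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                         # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        topv, topi = gates.topk(self.k, dim=-1)   # k best experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot, None] * self.experts[e](x[mask])
        return out

def specialize(moe, task_tokens, n_keep=2):
    """Freeze all parameters except the n_keep experts most used on task data."""
    with torch.no_grad():
        usage = F.softmax(moe.router(task_tokens), dim=-1).mean(0)  # per-expert load
    keep = usage.topk(n_keep).indices.tolist()
    for p in moe.parameters():
        p.requires_grad = False
    for e in keep:                                 # only task-relevant experts stay
        for p in moe.experts[e].parameters():      # trainable; the rest can't overfit
            p.requires_grad = True
    return keep

moe = ToyMoE()
task_tokens = torch.randn(256, 64)   # stand-in for downstream-task activations
print("specialized experts:", specialize(moe, task_tokens))
```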
Even at that time we didn't have Claude Code, we didn't have Codex, we didn't have these coding agents, so their infra was pretty clean: if you want to add something to it or remove something from it, you just do it, and you can do it pretty fast. I think that's pretty much their advantage. — So it's basically a very open culture where elite people can do whatever they think they should do. — Yeah, it's kind of a bottom-up structure. You have an idea, you propose it to your teammates, and then you start working on it, and the boss won't interrupt too much. They just make sure that you have enough resources and the idea can be pushed out. And teams do collaborate a lot, because some people work in the same field and some work across fields. If an idea has to be pushed out — for example, I'm working on improving algorithms — I also need people from the infra side to improve efficiency once the algorithm is implemented, for example by writing more kernels. That's how team members collaborate if they really want to push something out. — What's one of your most memorable stories from working there? Real quick: if you have an AI business and want to scale it, make sure to apply to my accelerator. If you qualify, we'll work with you for 6 months straight to help you scale, so that you can either build a massive AI company or go towards an acquisition, just like I did with Vectal, which I managed to sell for $1.8 million just 14 months after founding it. So if you want to build or scale your AI business, click the link below the video and apply now. — It might be a little bit cliché, but when I was there, I really enjoyed chatting with the people there about Chinese dramas, because they come from very diverse backgrounds, and some of them are actually from Chinese literature backgrounds. Yeah. So I couldn't imagine that I could just chat with all those technical people about

Segment 2 (05:00 - 10:00)

Chinese literature and Chinese dramas that much during lunch. So it was really fun for me. — Interesting. Maybe I'll go through all the main AI companies, and you give me a few sentences on each: your thoughts, what you think their advantage is, what they're doing right. So let's start with Anthropic. — I think their coding agent is pretty good. And I think their product choices are also pretty good, for example this kind of co-worker positioning; their product-level strategic choices are what I love a lot. — Okay, OpenAI. — I've known OpenAI from pretty early on; I knew them from my freshman year. I knew Gym, the OpenAI Gym. It was a pretty early framework, and I think it motivated a lot of research at the time; I appreciated that because it not only motivated other people's research but also motivated my homework. After I got to Berkeley as an exchange student in the third year of my undergrad, I started to heavily read their papers, especially the VPT paper, video pretraining, where they started to pretrain models not only on language, as BERT and GPT-2 do, but also to use them for understanding the world through video. I was pretty excited by that, because I have always wanted to see a game agent that can really play games in the world, and this is what they did. Later comes the story we all know: ChatGPT and all this stuff. — Let's move to Google DeepMind. — We have a serendipity with DeepMind's AlphaGo. I was in middle school at the time, so I didn't know anything about AI, but I did watch the AlphaGo livestream and felt pretty excited about it. And nine years later we happened to release our agent infra that later became RAGEN, which I think is some of the first agent infra with large reasoning models. It was released around January 21st to 27th, 2025 — so exactly nine years. We celebrated both AlphaGo's 10-year anniversary and RAGEN's one-year anniversary just last month. So it's quite interesting. — Okay. Next AI lab: DeepSeek. What do you think is their main advantage? — I think the advantages I saw still exist, and they will exist for a long time: their super good infra, and how people work there. And I would say that talent is a really important factor for this company to build on — not only for DeepSeek, but for a lot of other companies in Haidian. Haidian is a district in Beijing, and it has Tsinghua University, Peking University, and a lot of other universities with very good AI education. So I felt really good about my teammates, and I also feel that all these universities are putting more and more into AI education. I know that some of them take AI competitions for high schoolers very seriously, so that when those students get into college they already know how to train transformers and how to train agents. — That's crazy. Do you think China does this better than the USA? — I don't know how the USA does this for high schoolers, actually, because I didn't finish high school here. But I know from Chinese high school students that China is holding something like this.
Yeah, I feel like in the Western side of the world these changes to the educational system are super slow. If something needs to happen — like, "okay, focus 20% of high schoolers' time on this, because AI is changing the world" — I feel like that would just take forever, and I don't think there's any Western country where that's happening. But in China you can make these changes a lot faster. Would you agree? It's almost about the allocation of the best talent, how they're allocated. — I feel the US is more interest-driven. If you like it, you can go for it. I think that's not bad, because it makes people more motivated to go into this area, and they will still become top researchers in this field, or top contributors to it. But if you put in more resources, statistically you will get more talent.

Segment 3 (10:00 - 15:00)

So yeah. — Okay. What do you think about xAI? — I know xAI because some of its members are from SGLang, and they do love this infra. They pushed forward language model inference from single-turn to multi-turn, and they did this very fast. I think it's one of the fastest inference infras that allows multi-turn agentic training, agentic rollouts for training. It's pretty close to what we had with RAGEN, but they made it much faster. We did multi-turn RL training in January last year, and they did it in, I think, February last year, but theirs was something like two or three times faster than our implementation. So I feel they're pretty good researchers from SGLang, and I know SGLang has a lot of people at xAI. That's all I know about xAI. — What about Moonshot? Moonshot AI, creators of Kimi. — I first heard about Muon from Moonshot's paper, "Muon is Scalable." In that paper they measured why this optimizer is better than Adam and used a very large-scale study to do it. From that I feel they are the kind of company that reads a lot of literature, a lot of AI work, tries to figure out which of it is good or not, and boldly adapts it to their models. That strategy is very rare right now, because many people will only follow those who have already succeeded, but very rarely will people follow work that has the potential to succeed but hasn't yet. I think they were the very first followers of Muon, and I believe their team will continue to find insightful papers in this field, or even create their own very insightful papers and ideas, and keep adapting them to their models. That's a very good trait: to learn from everywhere and adapt from everywhere. — I think you have a unique perspective on the China-versus-US AI race, right? You've been on both sides, and you can probably compare them better than most people, who only see it from outside. Personally I haven't spent significant time in either China or the USA, so I'm a complete outsider, but most people who talk about this are either on the Chinese side or the US side. You have unique insight into both. So who would you say is actually winning, and what do you think are the main differences in the AI race? For example, China has a lot more energy: they're adding a lot more to the grid, and I think nobody in the West is doing that. If you look at some countries like Germany, they're shutting down their nuclear reactors, so they're literally sabotaging themselves. So just give a broad analysis; I don't want you to get into trouble or go deep into politics. — I can speak more about the talent, because I pretty much know what the Chinese education system is like. — Well, start there. The Chinese educational system: what do people not understand about it? — Yeah, sure. Let me do a brief introduction. I think the biggest difference between the Chinese education system and many countries outside is that China has a very standardized, test-based system that keeps nurturing while filtering talent from a very early stage.
It's kind of like: at six, when you go to primary school, many people will tell you that learning at school is not enough and you probably need to learn outside of class, for example by taking extra classes — and you even take piano classes. I don't know why parents feel that taking piano classes will help people succeed, but — I also used to take piano classes when I was in elementary school. So it didn't amount to much, you know. — When you go to middle school, there will not only be those classes; there will also be those Olympiad competitions that push you to become the very top student in your group. I think I took about one or two classes per week, but I know that some students spend their whole weekends on math classes, physics classes, English classes, to help them get better grades at these

Segment 4 (15:00 - 20:00)

competitions. — So basically kids spend way more of their after-school time preparing, studying, grinding. — Yeah. Yeah. The key feeling is that this is very high pressure, and it keeps a very competitive system, so that once you're inside, you're pushed to go forward. For example, we don't only have the final exam, the gaokao; each month our high school also had monthly exams, and in the last year of high school we would finish at least two exams each day, and each week we would have a complete simulation of the gaokao. — So almost like a more gamified, more competitive system. — Yeah. Yeah. And it's pretty standardized, so you can always get the talent you'd like to have. You design a metric, and under this very competitive and standardized system, you can always filter out someone who fits your requirements. — Yeah. — And as the base of the education system is pretty large, there's always someone who is very talented. — Yeah. I think my high school and my middle school were much better than everywhere else, because at least in the first and second years — our middle schools and high schools each last three years — we didn't have that much pressure, and we didn't need to go to all these classes on weekends. So it was probably much better, and I joined lots of clubs during middle school and high school, for example Model UN and some chess classes. I feel that my interest and motivation were largely preserved during middle school and high school, and I feel this is why I had so much motivation to pursue something when I went to college. Some people, when they are under too much pressure and driven by the outside system, feel a lack of motivation once that system disappears. — Yeah. — And I feel that I kept this motivation. — I think it's a similar transition when people go from a work environment, a 9-to-5, to starting their own business, and suddenly they have no tasks, because nobody is watching their schedule and they have too much freedom. I think it's a similar thing. Okay, that's interesting. So which system, Western versus Chinese, is better at getting the best out of the best people? — I think both can bring out the best in the best people. I do have a strong belief in that, and I can explain a little bit. I feel that if you want to be at the very top, you need very strong motivation, and you need to be capable of working well under high pressure. In China, if you have the motivation and you're already in the system that keeps this high pressure on you, then you can drive through a research line pretty fast. And in the US, if you have motivation and you allow yourself to be competitive under high pressure, I think you can also become the best. I'm not sure what the average American high schooler is like, because I've never chatted with any of them, but I feel that where there is competition, there is stress; we can see that from the admission rates for Harvard and MIT, which are all very low.
So if you want to stay competitive among your peers, you must be working pretty hard. I feel this is pretty similar in the US, and even in other countries with their own educational systems. — And about energy, do you have any insights there? Because I think this is another massive difference, where China is deploying a lot more and every other country is kind of falling behind. Or do you think this won't be a bottleneck for scaling AI? Do you think the bottleneck will be somewhere else in the next five to ten years? — Yeah. I think the key idea is that whoever creates an agent that can improve itself at a comparable rate will have a very high advantage, and I don't think this will require that much energy across the board, because if you have better data and algorithms, if you put your agents in this kind of self-improving loop, then they can probably keep getting better, and they will know how to save energy. Because when an agent figures out

Segment 5 (20:00 - 25:00)

self-improvement, you are in a different stage. — So you think the big question is who gets there first. — I think so. Yeah. — Okay. And if you had to say how far away we are, what's your intuition on recursive self-improvement? — For myself, I work extensively on self-improving agents, and I feel current agents have very good abilities that are already past the line needed for self-improvement: common-sense ability, reasoning ability, and also alignment with humans — they know what should be done and what should not be done. — So basically we are limiting them with a lack of tools, environments, permissions, protocols. — Yeah. The key bottleneck here is context length. You've probably seen the recent news where someone put a coding agent into their working environment and it deleted all their files. The story is that because the agent has limited context length, after working for several steps it has to clean its memory, and when it cleaned the memory it deleted the initial instruction saying that it must not delete anything from the computer (see the sketch after this exchange). — So I think the key bottleneck here is memory. If agents had near-infinite memory they could really leverage... First you need to make the model able to hold that much memory. Second, you need the agent to really utilize this memory, because sometimes even if agents have the memory, they don't use it. — Yeah. Most people just summarize at 100k or 200k tokens, even if the model allows one million, because it's not effective. — Yeah. We have a work at our lab, especially in the vision-language model case; the paper is about why VLMs fail at spatial intelligence. When you let them work on spatial tasks, you give them an image and they don't even refer to the image in their attention: there are a lot of tokens, but they don't look at them. So I feel that when the context gets even longer, it will be even harder to retrieve this kind of memory from a very large context. I feel a key issue is here. — What's your intuition there? Do you think it's embedding some sort of vector database into the architecture? Letting the model update its weights in real time? A context window of 100 million tokens? — Yeah. I do have several works going on, so I might not go into too much detail, but I think there are several routes you can go down. The first is to rethink memory from the human side. I've heard an opinion — maybe it's even a verified scientific fact — that human memory is not a KV cache that you retrieve from like a large codebase. Human memory is kind of a hallucination. — So you experience new things... People don't realize this. People don't realize that when they recall memories they're adding details, and the memory becomes less and less reliable. — Yeah. Yeah. It's kind of a hallucination. You have a new experience, you retrieve from your updated parameters, but you don't know if the memory even exists. I think if we really do this, we might have an agent with very long memory, maybe as long as human memory, or even longer. But this will maybe make them make more mistakes, because humans also hallucinate a lot, and sometimes this harms them.
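A minimal sketch of one way to avoid the failure mode in that deleted-files anecdote, with entirely hypothetical names: a compaction routine that summarizes or evicts old turns but never evicts pinned messages, so the initial "don't delete files" instruction survives memory cleaning.

```python
# Illustrative sketch only; not any specific agent framework's API.
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    text: str
    pinned: bool = False   # pinned messages must survive every compaction

def compact(history, max_chars, summarize):
    """Evict or summarize old unpinned messages until the context fits."""
    pinned = [m for m in history if m.pinned]
    rest = [m for m in history if not m.pinned]
    size = lambda msgs: sum(len(m.text) for m in msgs)
    while rest and size(pinned) + size(rest) > max_chars:
        oldest = rest.pop(0)                       # evict oldest unpinned turn
        stub = Message("summary", summarize(oldest.text))
        if size(pinned) + size(rest) + len(stub.text) <= max_chars:
            rest.insert(0, stub)                   # keep a short summary stub
    return pinned + rest

history = [
    Message("system", "NEVER delete files outside ./workspace.", pinned=True),
    Message("user", "Refactor the repo." + " (lots of detail)" * 50),
    Message("assistant", "Step 1 done." + " (tool logs)" * 80),
]
trimmed = compact(history, max_chars=500, summarize=lambda t: t[:40] + "…")
assert any(m.pinned for m in trimmed)  # the safety instruction is still there
```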
But maybe we can find a way to control this kind of hallucination, so that it happens on unimportant details but not on the important stuff. Then maybe we can build an agent that keeps much more memory that way. Another route, I feel, is that we offload most things and only retrieve them when needed. This is what you mentioned about having a vector base and a cache, and retrieving from them when necessary. I think this is also very doable. But I feel that both this route and the one I mentioned earlier, storing memory in parameters, pretty much need good memory benchmarks, because as far as I know current benchmarks focus a lot on long-context understanding, but if you really want to find a realistic memory benchmark, it's very hard. So if any of your readers or audience know benchmarks like this, please suggest them to me in

Segment 6 (25:00 - 30:00)

the comment box. — Yeah. Yeah. Well, I was about to say, as you said that, I realized there are not that many popular benchmarks for this, right? Most benchmarks are about coding performance, or scientific knowledge, GPQA, ARC-AGI; there are not many benchmarks that properly test memory. There's needle-in-a-haystack, but in most situations that's not how it works. It's kind of the opposite of human memory: you go through a month of your life and then someone asks, what color was the car you saw that Tuesday? A human would say "I don't know," and the model would nail it. So that's kind of useless; we need different benchmarks for memory. — Yeah. Yeah. I mean, it is pretty hard, and I see what you mean. We need really new benchmarks, where we have a simulated user that keeps chatting with the agent, and when it wants to recall something it just asks the agent, "so what did I do?", and the agent has to retrieve it from all this context (sketched below). I have some friends working in this field, and they feel the most difficult part of this kind of simulated user for a memory benchmark is not only reproducibility — each time the user might generate something new, which makes your benchmark less stable — but also that all these language-model-simulated users are just too smart; they are not much like real users. For example, if I were to ask GPT about several things, I might be very vague at the start: I just ask "so what is that?", GPT replies, and then I recall, "okay, what I actually want to ask is this," and I keep refining my query. But current language-model-simulated users state everything about their request very clearly at the very first step, so the agent being tested is very clear about it, and it faces no stress about keeping memory, understanding a vague user query, and keeping the conversation helpful. So for now we might still need real human data for this kind of memory evaluation, and I feel this is why only the large companies are doing it right now. — So are you bullish on companies like Mercor or Surge AI who are generating these labelers? — I'm not sure whether labelers are more important or real users are more important, because in real users you don't just have crowdsource workers; you also have people who actually work in very diverse domains. — Yeah, yeah: working on math, on medical, on law, all these fields. You have those kinds of people, so they can help your model improve in all these domains. — So basically it's almost a trap having these labelers, because you give them instructions and then you don't get diverse data; it's too specialized. You need users that are power users and beginners; you need people who are just doing some task or doing scientific research. Is it fair to say it's real users, as close as possible to the use case you want — that's how you should be getting the data, basically? — Yeah. Yeah. — Okay.
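A hypothetical skeleton of such a simulated-user memory eval. The event script comes from a seeded RNG, which addresses the reproducibility complaint; `agent` is assumed to be any stateful chat callable, and the keyword check is a deliberately crude stand-in for a real grader.

```python
# Illustrative sketch only; all names are hypothetical.
import random

EVENTS = ["booked a dentist appointment", "bought a blue car",
          "watched a documentary", "met Alice for lunch"]

def run_memory_eval(agent, n_days=30, seed=0):
    rng = random.Random(seed)                  # seeded: the same episode replays exactly
    diary = []
    for day in range(n_days):
        event = rng.choice(EVENTS)
        diary.append((day, event))
        agent(f"Day {day}: today I {event}.")  # feed the event into the chat
    day, event = rng.choice(diary)             # probe memory about an earlier day
    answer = agent(f"What did I do on day {day}?")
    return event.split()[-1] in answer         # crude keyword check

class EchoAgent:
    """Trivial baseline that stores every turn verbatim (perfect memory)."""
    def __init__(self): self.log = []
    def __call__(self, msg):
        self.log.append(msg)
        if msg.startswith("What"):
            day = msg.split("day ")[1].rstrip("?")
            return next(m for m in self.log if m.startswith(f"Day {day}:"))
        return "ok"

print("passed:", run_memory_eval(EchoAgent()))
```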
And on the topic of recursive self-improvement: apart from memory, you made a pretty big statement that you think the current models are basically there, that they are basically capable of this. So what else is missing? — If I were to say, I feel that besides memory, the other thing is whether they can really improve from memory. For example, you have Google Search: it's a search system that really has good memory. When someone posts on the internet, it can retrieve that whenever there's a query. So you have very good memory, but you don't know how to use Google Search to create an agent that knows everything on the internet. So a thought on this is: when we have an agent that has infinite memory, can it really learn from its failures? One of our lab's works in this field is: when the agent fails at first, we just prompt it to try again and do the task again, and see. — You have a paper on that, right? — Yeah, yeah, we wrote a paper on that. And there's also a very — if I were to say "old," but actually it's about three years old — work, Reflexion, published by Princeton, where they also let the model try again if it fails. And we kind of feel that current RL makes this even worse. If a model can finish

Segment 7 (30:00 - 35:00)

the task early, it's okay. But if it fails at first and we prompt it to try again, it will insist on its current failure: it will think again, produce a very long chain of thought, and after that it will say, "okay, I still stand by my previous answer," rather than generating something new that really targets the problem. We find that current base models, for example Qwen2.5-Instruct, already show this trend of not learning from failure, but models trained with RL become even worse at it: if you let them try five times versus one time, the improvement is very small. So we might need new algorithms to make these models really learn from failures; that's how we get a closed loop of self-improvement. — So basically, humans are really good at RL-ing themselves, right? If you're walking down the street too close to the road and a car almost hits you, you walk further away from the road. You don't need that many experiences; you don't need trillions of tokens from the internet to learn that lesson. — Yeah. — So do you think this is an issue with the transformer, or could it be solved on top? — It's an ability that current models lack, but I feel that the transformer itself, with some techniques, could solve it. We can obviously embrace better architectures if they are really better. For example, we shifted from dense models to these MoE models, we shifted from linear models to attention, and maybe we're going a little bit back to linear models at this stage; many people are researching this. And we might even shift from natural-language reasoning to latent reasoning. These are all very good new methods we can try, and if they finally work out to be better, we can obviously adopt them. But I feel the transformer itself is capable of many tasks. I think some people have shown theoretically that transformers can solve Turing-complete tasks; I forget the specific paper name, but I remember the key idea is that transformers can really solve many things theoretically. So I feel this is something we can improve with the transformer for now, and it doesn't need to be bottlenecked by the architecture. But I do feel that the self-improvement capability is really important for us, to understand whether these agents are good right now and whether they can be improved. The self-improvement loop actually lets us probe many abilities that are not very well prepared or well studied right now. One thing I can think of is world-modeling capability. My advisor, Manling Li, actually has a lot of papers on world-modeling capability. A one-sentence definition: when a language model transitions from a single-turn model to an agent that interacts with the environment, it must know what will happen after it takes an action. For example, if I push the table, the bottle on it will fall. When we're in the single-turn setting — for example, answering a mathematical task or doing tool retrieval — I think we can let the model memorize what will happen at training time.
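A toy illustration of that world-modeling idea, with entirely hypothetical dynamics: score an agent on predicting the next environment state after an action, rather than only on its final answer.

```python
# Illustrative sketch only; the "household" dynamics are made up.
def transition(state, action):
    """Ground-truth toy dynamics: pushing the table knocks the bottle over."""
    state = dict(state)
    if action == "push_table" and state["bottle"] == "on_table":
        state["bottle"] = "on_floor"
    if action == "open_door":
        state["door"] = "open"
    return state

def world_model_score(predict, episodes):
    """Fraction of (state, action) pairs where the agent predicts the next state."""
    hits = 0
    for state, action in episodes:
        hits += predict(state, action) == transition(state, action)
    return hits / len(episodes)

episodes = [({"bottle": "on_table", "door": "closed"}, "push_table"),
            ({"bottle": "on_table", "door": "closed"}, "open_door")]
naive = lambda s, a: s                     # agent that assumes nothing ever changes
print("naive world model:", world_model_score(naive, episodes))  # 0.0
```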
For example, if we do tool-related multi-turn training, we can just let the model memorize what will happen after it uses a tool. But if we really make agents work in an environment, they will keep encountering new environments. My advisor's postdoc mentors, Fei-Fei Li and Jiajun Wu from Stanford, are working on the BEHAVIOR challenge, where they let agents work in thousands of environments, all simulated household environments. Each time the agent encounters a new environment, it must figure out what its embodiment is like and what the whole house is like: it needs to traverse the house and learn how many rooms there are and what it needs to do. We have done lots of follow-up work on this, for example Theory of Space, where we change from the current QA-style benchmark to a benchmark where the model must travel inside the environment first, stop when it feels the exploration is sufficient, and only then get the questions. By doing that we make sure that if the

Segment 8 (35:00 - 40:00)

agent does well here, it must actually know this environment. So we want to do more on improving the world-modeling capabilities of these agents, so that when they are really deployed in our environment, they can keep learning from it: what will happen if I do this? I think this is a very core ability of agents, if you really want them to improve in real-world environments. — Yeah. So rather than just text — "was this a good answer or a bad answer" — it's "what state did the room end up in," or "what state did this codebase end up in," so that they have a better understanding of all the variables at play. — Yeah. Yeah. Not only the embodied world, but also code, the internet, anything. — So is it fair to say that you expect that if we solve the memory issue, there will be a fast takeoff? — I think we need to solve all the problems I mentioned: the memory issue, and core self-improvement abilities like world modeling. — Okay. So, because it's very hard to make a good benchmark, right? They get saturated very fast, and then some companies train specifically against specific benchmarks. What do you think makes a good benchmark, and what's your biggest issue with current benchmarks and evals? — Yeah, that's a very good question. I have also been feeling that benchmarks saturate a lot. I think the Gemini 3 Pro release cited one of our lab's works, MQ, which is also a spatial-intelligence benchmark, and it really improved — I don't remember the specific number, but it was a large improvement. So I feel the problem should be non-trivial. I heard from a very prestigious researcher in this field that her taste regarding a benchmark problem is that it should be as hard as possible, and you must find a baseline that is non-trivial. So what is a non-trivial baseline? If the current model can already do very well on it, then it's not non-trivial. If the best human can do it, that's also only a little bit non-trivial. Ideally you want something that even the best human cannot do very well. — Yeah. — That's the ideal case, but in many realistic cases, not to mention the best human, even the best model cannot do it very well. For example, we found a very important capability that has been ignored: whether an agent follows a budget requirement when it executes a task. We've probably all heard news of coding agents that you let work inside a folder on your computer, and they spend billions of tokens over a whole night. — So we tested whether this could be because the users did not tell them not to use too many tokens. But even if you tell them, "okay, you should finish this task within 1 million tokens," they will use far more than that. And if you ask them to estimate how many tokens they need to finish the task, the correlation between their prediction and the actual token usage is very weak; I think it's around
0.1. — I think this is similar to when you talk to models and they say, "okay, this feature could take us a couple of months to implement." They have no idea what's now possible, how fast things can be built, and I think it's a similar issue here with cost. — Yeah. — So what would a benchmark like that look like? You want the agent to land as close as possible to the cost it states, and then probably the best result for that cost. — Yeah. On my end, I feel that a good benchmark should have these features. First, not only a non-trivial problem, as mentioned; the second thing is that you should have a good understanding, presented by your taxonomy, of why a model fails when it fails. — Some benchmarks report success numbers, but on the failure cases we don't know much about what's happening. A very easy thing a good benchmark can do is to not only be large, but to have different classes inside it. For example, for spatial intelligence, you give models different kinds of tasks — allocentric, egocentric, even active ones — and you know, when an agent fails, which subcategory it is not doing

Segment 9 (40:00 - 45:00)

that well. So this is the first thing we can do. The second is to understand failures not only from the task cases, but from the failure cases in the answers. For example, some models, when solving a spatial intelligence problem, may count the objects wrongly, or may judge the relationships between different objects wrongly, and there are even more cases. And the final but most critical thing, which we are currently researching, is to understand the failure cases of reasoning, which is even much harder than understanding the previous two things, the prompt or the answer, because reasoning is something you can't directly supervise. You have a lot of proxies for supervising reasoning — for example, you can regularize the reasoning into a specific format so you can check it more easily — but that's still harder than checking the answers directly. So if a benchmark can check whether a model's reasoning is wrong or not, I think that's very good for the insights needed to improve these models. — This is also how humans teach, you know. If you're in math class and you get the incorrect result, it's not like the teacher says, "oh yeah, you failed, bye." It's, "okay, let's look at the steps," and, "okay, step four is where you made the mistake." And that's how the teacher trains you: focus on that step until you understand it, not just on the end result. Otherwise it would be very reductionist. Okay, and you touched on this reasoning collapse, right? This is, I think, your flagship discovery — where agents stop thinking. Explain it in simple terms. — Yeah. During the last year, we've kept experiencing this collapse in lots of our experiments. A very interesting observation is that on all the single-turn tasks, the agent will increase its reasoning length over the training steps. But on multi-turn agentic tasks, at least in the roughly 20 environments we tried, all the agents show decreasing reasoning length over the steps. It was very strange to us why these agents cannot learn to reason with RL in this multi-turn agentic setting. We hypothesized that this might be because the task itself is more difficult, the reward signal is more sparse — when you fail, you don't know at which step you failed — and also the environments we're using are not that diverse, so the agent cannot learn from each of these environments in a way that benefits the others. And we looked into the detailed reasoning of these models and showed how reasoning collapse happens. One very commonly used scheme is to track the entropy.
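A minimal sketch of that entropy tracking, using characters as stand-in tokens: compute the per-position entropy across a batch of sampled reasoning strings; a steady fall over training steps is the collapse signal. In the real setting you would use the model's own token distributions, so this is illustrative only.

```python
# Illustrative sketch only; "tokens" here are just characters.
import math
from collections import Counter

def mean_token_entropy(samples):
    """Average per-position entropy over a batch of sampled reasoning strings."""
    total, n = 0.0, 0
    length = min(len(s) for s in samples)
    for pos in range(length):                 # token distribution at each position
        counts = Counter(s[pos] for s in samples)
        probs = [c / len(samples) for c in counts.values()]
        total += -sum(p * math.log(p) for p in probs)
        n += 1
    return total / max(n, 1)

# diverse reasoning -> high entropy; templated reasoning -> entropy near zero
diverse  = ["abcd", "bcda", "cdab", "dabc"]
template = ["okay", "okay", "okay", "okay"]
print(mean_token_entropy(diverse), mean_token_entropy(template))
```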
If the entropy falls, the model starts to generate more deterministic reasoning: given a prompt, it generates very deterministic answers. But we used a lot of strategies to make the entropy larger, and the model still could not get very good performance. And what we found is that although these models have very high entropy — for example, given the same problem they can generate different answers — given all these different problems, they generate the same set of reasoning. What does this reasoning look like? For example, given a problem, it will say, "that's a good question"; it will say, "I will finish this task carefully"; it will say, "okay, I'm an agent, I need to do this task." So the entropy really is improving, because for each problem the model generates diverse answers — but across different problems it generates the same set of answers. We measure this with mutual information: given the reasoning chain, can I detect which prompt it came from? If a reasoning chain could be appended to any prompt, we feel it's more like a template than real input-grounded reasoning. — Yeah. It's not from first principles. This is probably the reason why LLMs are bad at inventing new ideas, you know: if different things lead them through the same reasoning chain... I mean, humans are probably just more diverse in this — as you said, higher entropy. — Yeah. Yeah. So we detected this problem and understood that one of the root causes is the noise that is hard to remove from the RL stage for these models. The task itself has

Segment 10 (45:00 - 50:00)

noise, and we even keep adding noise to these models. For example, we have the entropy bonus; we have the KL term, which is actually irrelevant to the task. And all this noise comes together so that when the model generates a response that is very safe and gets the baseline reward, it sticks to it, because it doesn't know what will happen if it tries other things: it will encounter this noise. So it prohibits itself from generating new things and stays in a safe area. — Basically playing it safe, for the model. — Yeah. Yeah. — But isn't that kind of implied by the autoregressive architecture, that they're going to do this? — It's not exactly about the architecture, because in the end we found that the noise from the task itself is higher than we expected. We measured the gradients from these models during updates, and we found that even for tasks with very low reward variance — meaning the model generates a lot of answers and trajectories and their rewards are very similar to each other — the RL algorithm will still produce large gradients for these groups. RL itself acts as if there's a lot to learn inside these trajectories even when they appear similar, and we feel this is where the noise comes from. For a prompt like this, as a human you would just mute yourself from it: "these are problems where, when I think about them many times, I get the same reward, so the best strategy is just to mute them, because they're not worth learning — either too easy or too hard." But the current algorithms actually produce very large gradients here, so the model keeps learning from these instances that are not worth learning from. Our intervention is very simple: just remove them (see the sketch below). — Right. — Remove the model from those low signal-to-noise-ratio trajectories, and keep it learning on the tasks that are worth learning at its current RL moment. It's like yourself, as a human: you experience a lot, and some of it is just forgotten. You feel it's not worth learning, so you don't learn it. — Maybe like a bad book, you know: you just stop reading if you're not enjoying it. — Yeah. Yeah. You just stop learning. And we find this is the same for RL. What's even more surprising is that we actually improved the efficiency of these models. Some people may say, "okay, you generate a lot of rollouts and you drop a lot of them, so the efficiency is very bad." But actually, if you're not learning from these bad trajectories, you're protecting yourself from this noise, so you actually learn faster, because you're only learning on the trajectories that are worth learning. — This is very interesting. For people who are not researchers, what does elite AI research look like? Is it more like exploration — you have intuition, you go play around with it — or is it a more strict scientific approach? Day-to-day, practically, how does it look? — Yeah. To me, doing research is: you have an idea, and the idea is that you believe something, you're not sure whether your belief is correct, and you try to verify it.
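A minimal sketch of that filtering intervention, assuming GRPO-style grouped rollouts (an assumption; the actual algorithm in their work may differ): compute reward variance per prompt group and drop the near-zero-variance groups before computing advantages, so low signal-to-noise trajectories contribute no gradient.

```python
# Illustrative sketch only; names and thresholds are hypothetical.
import statistics

def filter_groups(groups, min_std=0.05):
    """groups: {prompt: [reward, ...]} -> keep only prompts worth learning from."""
    kept = {}
    for prompt, rewards in groups.items():
        if statistics.pstdev(rewards) >= min_std:  # too-easy / too-hard groups
            kept[prompt] = rewards                 # have near-zero reward variance
    return kept

def group_advantages(rewards):
    """GRPO-style advantage: normalize rewards within the group."""
    mu, sd = statistics.mean(rewards), statistics.pstdev(rewards)
    return [(r - mu) / (sd + 1e-6) for r in rewards]

rollouts = {"easy prompt": [1.0, 1.0, 1.0, 1.0],   # solved every time: skip
            "hard prompt": [0.0, 0.0, 0.0, 0.0],   # failed every time: skip
            "learnable":   [1.0, 0.0, 1.0, 0.0]}   # mixed outcomes: learn here
for prompt, rewards in filter_groups(rollouts).items():
    print(prompt, group_advantages(rewards))
```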
For all AI research, it's like: you have a belief that the model should be trained in a certain way — for example, that the data should be formulated like this, or that some bottleneck of the model has not been clearly researched — and you want to build a work on that. You look into related works and see if they have already researched this matter, and you train these models, at small scale or large scale, but the final goal is to make sure that your assumption is right and that it can persuade others in the same field. — Yeah. So the most important prerequisite is basically being good at questioning things, right? Don't just assume that the way the field does it is the right way. What comes to mind is the early days of OpenAI, when they discovered the scaling laws: everybody assumed that more data is bad because it would cause overfitting and other issues, and they were like, "hm, what if more data is good?" They tried it, and the model was better; they tried it again, and the model was that much better. And that was a huge discovery that is powering the AI race right now. So how would you describe it: a combination of raw intelligence with questioning society or conventions? What are some other variables that make a great researcher?

Segment 11 (50:00 - 55:00)

— Yeah. Yeah. So regarding the questioning ability itself, I think it's becoming more and more important, especially if you're doing research, because all these agents can verify answers for you well. So the taste and the ability to raise important questions are becoming more and more important. And I think this leads to another important ability: don't only ask an important but obvious question; decompose it into plans and make concrete steps. For example, we know that self-improving agents are important, but what are the specific things, the specific questions you can ask, based on the current progress of yourself and of society? My friends talk to me a lot about how they feel neuroscience is becoming more important, because maybe a human could upload their awareness to a computer, something like that. I think this is really important — but how can we achieve it from here? What are the things we can do right now? This actually troubles them a lot, but I feel this is the core ability: if you want to do research, you have to know what you can do right now. — Yeah, that's true: the timing, knowing what's possible, having good intuition there. Google Glass is the famous example, from 2012 or so, the AR glasses; it was just too early for that technology. And throughout history there were many instances; the AI field as a whole is like that: people in the '50s were predicting that AGI was basically just around the corner, and they were just too early. So knowing this is probably the hardest part: knowing what's possible, and having good intuition for "this is still years away" versus "this is possible now." Because you can be a genius with an idea from the future, but then you're kind of wasting your time, because you're working on something that simply isn't possible given our current tools and technology. — Yeah. — That's fascinating. Do you think the current coding tools could truly be called agents? Because a lot of people don't really understand the difference between an LLM and an agent. So how would you describe that? What is a true agent? — Oh, a big question, I think. There are a lot of definitions of what an agent actually is, but I think the biggest thing that makes an agent different is the environment. It's not the agent, it's the environment. — Yeah. — We can hear from a lot of people working in this field that the agent can only learn what is already embedded in its knowledge, and we can take a look at what kinds of tasks they use. I think that's quite normal, and I think their work is quite good based on their own findings on these tasks. — But it's like math, single-turn coding, question answering. — And you kind of feel that they can only get feedback from their own answers. — Right. Again, a human analogy is helpful here, because people compare it to AGI being capable of what an intelligent human can do. But if you take an intelligent human, put him in solitary confinement in prison, and only let him write one letter per month on a piece of paper, with no other input, people wouldn't say that's AGI, because it's only outputting text tokens. But in fact it is AGI; it's just limited by the environment. — Yeah.
Yeah. Completely agree. I think this is exactly why I feel that a real agent needs to be inside an environment. If an LLM is inside an environment, it actually is an agent — just not that good an agent; we need to train it. — Maybe something like OpenClaw is a step in that direction, you know: people are realizing that if you give an AI a full dedicated computer and you just don't limit it, you don't have to press enter for permissions, suddenly the same model is way more powerful. — Yeah, yeah. OpenClaw is definitely an agent, and the next step is: how can we make a self-improving agent? — So basically removing the roadblocks, starting online, obviously, because software moves faster than hardware, but then going physical with humanoid robots. Basically, listing out the things that agents cannot do efficiently right now on the web that humans can — those are probably amazing startup opportunities — the things that are just blocking them, like payments, authentication, stuff like that. — Yeah. — That's probably where a lot of progress would be unlocked. — Yeah. Yeah. I think the first thing is low-latency decision-making. I've tried a lot of web agents, because I interned at Yutori, the web agent company, last summer. We did research into all these web agents around the world and found that they can do good reasoning at each step, for example which button they need to click at each stage. — Yeah, but it's very slow, and it's not

Segment 12 (55:00 - 59:00)

something acceptable for a human, to think that long before really taking an action. Another limitation — I'm not sure exactly what the underlying thing is, but I feel that all these web agents currently cannot even click a button very well. — So yeah, it's kind of strange, but it's true: they have good reasoning, but they don't know how to ground their reasoning in actions. They see the button in the DOM, or visually, but they probably don't understand the button as a concept. — Yeah. Yeah. First, they don't understand what function the button should have. The other is that they want to click the button but they fail. They will say, "I need to click the button at this position of the web page," but the position is just wrong, and it's strange that even with SFT it's very hard to improve. So I feel there's something inside the multimodal infra, or the multimodal architecture, that limits us here, but I'm not an expert in multimodality, so that's just a guess. — Anything else that comes to mind that's limiting? — World modeling, budget awareness, as I mentioned; let me think about something more. I can only think of these, but definitely — I mean, this will already be huge. If all of these were solved, which is probably closer than we think, then even with zero improvement in the models it would already bring massive gains, right? We see many of these services, like rent-a-human and other creative stuff. There was this Web4 thing where an agent could pay for compute with crypto, and it has to make money to survive. People are inventing this in real time, and again, I think a lot of people don't realize what's going to be possible very soon. — Yeah. — Do you feel that way, that we're living through, whether it's slow takeoff or fast takeoff, a pivotal moment? — Yeah. My slightly unique take on this is that I've been hearing this word since my primary school days. — So every year is pivotal. — So I don't know which year is more pivotal than the others. My take is: if you feel you're living in a pivotal age, just think about which year is actually pivotal, and the only thing you can do is think about how to live in this pivotal age. — How to get involved, you know. — Yeah. — How to participate. — Yeah. I think: just enjoy it. Don't be afraid of being replaced, or have this kind of FOMO. I believe that at the end of the day nobody will need to work, because our working efficiency is much lower than AI's. If humans still have some kind of strength over AI, then we can use it to improve AI. If humans are useless at improving society, then we can just hand it to the AIs. And there's a thought on this: you have to make sure that your AI is capable and also that it helps humans.
So it's not like you build an AI and it decides not to take humans as part of the world and wants to eliminate humans; it's that you improve the models, and before you get to the final stage of AGI, you first make sure they are aligned and will not hurt humans. And then we can even let the AI think about how to improve humans: in society, in education, in how we do things, how we think, what we are philosophically. If we have a strong enough AI, it will figure out how to make us humans become better. And I think this is the meaning of why we are working in this AI field. — Awesome, man. I think that's a great place to end it. — Yeah. — Thank you very much for your time. Where should people go to check out more of your work, your Twitter? — There's a link, zenus.me, where you can check my work, our lab, and all the things I post: Twitter polls, thoughts, academic translations, and much more. — Awesome. I'm going to link it below the video. By the way, if you want to scale your AI business and work with me 1-on-1, make sure to apply to my accelerator. It's the first link below the
