Lessons from Building Cursor

ByteByteGo · 06.03.2026


Video description
With Sualeh Asif (cofounder at Cursor) at The Pragmatic Summit. More details in the accompanying article: https://newsletter.pragmaticengineer.com/p/the-future-of-software-engineering-with-ai

0:00 Intro: Composer 1.5 agentic model
5:46 Cloud agents
17:18 Software engineer 1 year from now
20:59 Lessons learned while building a browser in one week
24:00 Advice for engineers

Table of contents (5 segments)

Intro: Composer 1.5 agentic model

— Hey Sualeh. A few days ago you released Composer 1.5. Congratulations.

— Thank you.

— It's a really good model. Can you talk a little about it?

— I think it's probably one of the best models you can use. We think it's somewhere between Sonnet 4.5 and Opus 4.5 in capability, and it's almost entirely trained through lots and lots of RL. It's hard to make good models, so no, it's not Opus yet, but I think we'll be making really great models, and I'm very excited for people to try it. In general, we've been trying to make models that are really fast and feel really good to use. It's not just about having a model where you press enter and then have to go to sleep. The goal is to make models that are engaging to use and extremely fast: you type your query and the model does it as fast as it possibly can.

— What's the thought process behind making your own model versus riding the frontier wave?

— As the products and the models get more integrated, it's really important to build the features you care about right into the model itself. There are certain capabilities where, if they're not built into the model, the model won't be able to use them. Previously there were models that couldn't even use grep very well, and part of what made models really good at grep was RL-ing them to be extremely good at it. A classic example: the Composer models are very good at using tools like semantic search. On a very large code base, they're exceptionally good at figuring out where to go in one to three queries, as opposed to tens of greps. Those things can only be learned through RL. For example, one thing we really want the model to be good at is running tons of recursive sub-agents, to the point where we can nail almost all queries in less than two to three minutes, and that requires training our own model.

— Can you share some of the technical detail of the RL? Paint a picture of how difficult it is.

— RL is a new technology, and because it's new, the infrastructure gets quite complicated as you scale. It's hard to talk about exactly which algorithms are used, so I can't say that, but I can talk a little more about the infrastructure for running the code and the environments; that's much more fun to talk about. You're running millions of sandboxes, and there comes a point, building infrastructure like that, where you can't actually buy it from anyone. For software companies built in the last decade, most of the time, if you had to do something really hard, you pawned it off to a provider. There was a great separation of concerns: AWS handles all of my infrastructure, and I just build great software on top. You can't do that when you're doing 100 million plus of CPU compute per year or something. You actually need to think through exactly how you're going to orchestrate the hundreds of thousands or millions of sandboxes that are running at the same time for RL.
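To make the orchestration problem concrete, here is a minimal sketch in Python of fanning RL rollouts out across many sandboxes with a concurrency cap. Everything in it is hypothetical: the stub Sandbox stands in for a real isolated environment, and the numbers are tiny compared to the scale described above. It illustrates the shape of the problem, not Cursor's actual infrastructure.

```python
import asyncio
import random

# Hypothetical sketch of fanning RL rollouts across many sandboxes.
# The Sandbox stub stands in for a real isolated execution environment;
# none of these names come from Cursor.

class Sandbox:
    """Stub for an isolated environment that runs one agent episode."""
    def __init__(self, task_id: int):
        self.task_id = task_id

    async def run_episode(self) -> float:
        await asyncio.sleep(random.uniform(0.01, 0.05))  # pretend work
        return random.random()                           # pretend reward

async def run_rollout(task_id: int, sem: asyncio.Semaphore) -> float:
    async with sem:  # cap how many sandboxes run at the same time
        sandbox = Sandbox(task_id)
        return await sandbox.run_episode()

async def main(n_rollouts: int = 1_000, max_concurrent: int = 100) -> list[float]:
    sem = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(run_rollout(i, sem) for i in range(n_rollouts)))

if __name__ == "__main__":
    rewards = asyncio.run(main())
    print(f"mean reward over {len(rewards)} rollouts: {sum(rewards)/len(rewards):.3f}")
```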
— Cool. So help the audience here. There are a lot of models: there's the frontier, and there's your Composer 1.5. What's a decision process that could help?

— Currently there are two ways to think through it. One thing I really want to happen is that you have an input box, you press enter, and you don't have to worry about the model at all: easy-ish queries go to a really fast model, and things that are really, really difficult go through a very smart model. But we try not to do routing at all nowadays; we try to be very transparent about what we're doing for the user, because a lot of people get very confused by it. I don't think we're in a world where routing works really well, but in a perfect world you should not have to pick between many different models. If you use Google, you never think through whether you should use the fast indexer or the slow indexer or the medium indexer. You just press enter and forget about it. The world should look a lot more like Google than it does today.

— So we're not quite there yet. What's the mental model that could help us decide which one to use?

— Right now it's mostly a matter of preference. Even within the team, there's a split between folks who use Composer and folks who use Opus or Codex. It really comes down to personal preference; I don't know if I can rationalize some chain of reasoning. Opus people maybe like talking to the model more about the plans and how to think through the problem, and Codex people just want to one-shot it. I don't know what their reasoning process is.

— Cool, thanks. That's great. Next: we are really excited about cloud agents, because

Cloud agents

right now, when I run agents, I usually have to carry my laptop around. Can you share a bit about the plan, and maybe the architecture, for the cloud agents?

— The way I think through this is as capability jumps of the product plus the model combined. Capability jumps are what get people excited. A lot of people got excited after the Opus launch because it felt like a big capability jump from the previous generation, and people got excited about the original version of Cursor because it felt like a big capability jump on top of Copilot. We think there are still many more capability jumps to go, and cloud feels like the next big one.

It's not like the cloud agents today. Codex web, Claude Code web, or Cursor's web agents all feel like a worse version of the local agent. Your local agent is really fast and responsive and does almost everything in about five minutes; then you do the same thing on the web and the setup takes a long time, it's slow to boot up, and it's hard to see which files changed. Almost everything about it is worse, other than that you can close your computer. But if you're working from your desk, you don't really close your computer that much. And one of the big problems, if you just think like a five-year-old, is that you write the prompt, press enter, come back, and see a thousand-line diff with no clue whether it's mergeable or not. Either way, it's your responsibility to figure out whether it's correct or wrong, and that feels fundamentally wrong. It should feel like the model wrote the code, and it should be the model's responsibility to figure out whether it's correct or incorrect. Cursor is going to be releasing a product here which I'm very excited about. Once you get the model to actually test its code and prove to you that it's done the thing correctly, we've seen usage of cloud agents go up by factors of 10.

One mental model: roughly speaking, the ratio of cloud agents to local agents right now is about 1%, so 1% of all compute is dedicated to cloud agents. If you want a future where 90% of compute is dedicated to cloud agents, you have to grow cloud agent usage by a factor of a thousand. You can't grow anything by a factor of a thousand by tweaking the UI; small tweaks never get you factors of a thousand. Something fundamental has to change about the product. It has to be a step change, and I think models being able to test their code is one of those, but we'll see how it goes.

As another note: why is testing the code really hard? Because you have this weird dev-ex problem where, to test the code correctly, the model needs to be able to use your app like a human would. Most companies have human dev-ex teams, responsible for making sure that any new employee who joins has perfectly working dev ex: it's really easy to get started. Even open source is really optimized for making it easy for people to use a repository and get started. But we're in a weird world where companies sometimes operate like this: if on your first day you start service B before service A and everything is crashing and everything is horrible, you turn to the person next to you and ask what the hell is going on, and they tell you to start service A before B. Everything is fine now; you remember it and life is great. But the model never yells at you when the dev ex is off. The model just silently gets degraded; there's no big red glaring warning, no model yelling at you to fix your dev ex. So I think companies in the future will have dev-ex teams for models, where you tell the model: here are all the nice ways of using this repository, even if it gets really complicated. Even if you're Stripe or something: here are all the services you need to boot up, in this order; you boot them up, then you spin up this website, and everything should work as expected so you can click around. Those will become more and more important.
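That "start service A before service B" tribal knowledge is exactly the kind of thing that can be made legible to an agent. Here is a minimal sketch of the idea, with made-up service names and commands: a single script that encodes the boot order and fails loudly instead of letting the agent degrade silently.

```python
import subprocess
import time

# Hypothetical dev-ex script: encode the boot order explicitly so an agent
# gets a loud error instead of the silent degradation described above.
# All service names and commands below are made up.
BOOT_ORDER = [
    ("service-a", ["./bin/service-a", "--port", "8001"]),
    ("service-b", ["./bin/service-b", "--port", "8002"]),  # needs service-a up
    ("website",   ["npm", "run", "dev"]),                  # needs both services
]

def boot_all() -> list[subprocess.Popen]:
    procs = []
    for name, cmd in BOOT_ORDER:
        proc = subprocess.Popen(cmd)
        time.sleep(2)                # crude wait; a real script would poll a health endpoint
        if proc.poll() is not None:  # process already exited, so it crashed on startup
            raise RuntimeError(f"{name} failed to start; check boot order and dependencies")
        procs.append(proc)
    return procs

if __name__ == "__main__":
    boot_all()
    print("all services up; the app should be clickable now")
```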
— So what are some of the lessons learned building this cloud-agent infrastructure? It seems very challenging when you have so many agents running in the cloud.

— The main challenge comes down to this. The normal model of the last 10 to 20 years of building infrastructure was that you had RPCs, and each RPC would terminate in somewhere between 100 milliseconds and a second or two. If your RPCs take between 100 milliseconds and 2 seconds, you can monitor their p50 and p90; it was really easy to monitor and really easy to understand the inputs and outputs of what's going on inside the servers. That's unfortunately just not true with agents. Agents can take anywhere from a few minutes to a few days, and the natural variance is so high that you have no way of telling whether your system is degraded or the agent just takes a long time. So building the infrastructure is very challenging. As another example: what happens if you want to do a deploy while an agent has been running for 12 hours? How do you even replace the server? As companies build longer- and longer-running agents, they will find it very challenging to figure out exactly how to deploy them, and what the natural client-server model for something like this is. Naturally, there are two models. One is the Claude Code model: there is no client-server, everything is a single binary that runs locally. That has many disadvantages: you can't disaggregate anything off that machine, and you get no reliability if the machine dies. The only way to get a client-server abstraction is to go with something that looks like Temporal. Temporal is a workflow engine out of Uber that lets you run very long-running workflows that are very, very reliable: if any step in the middle dies, it can boot it up again. There are only two really good services like this that I know of: Temporal, and this thing called Restate. But I'd be excited to see more of these in the future as agents become more important.
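To make the Temporal option concrete, here is a minimal sketch using Temporal's Python SDK (temporalio). The agent step is a made-up stub; the point is that each completed activity is journaled by the Temporal server, so a long-running agent can survive worker crashes and deploys. Actually running it also requires a Temporal server and a worker process that registers the workflow and activity.

```python
from datetime import timedelta
from temporalio import activity, workflow

# Hypothetical agent step; a real one would call a model and apply edits.
@activity.defn
async def run_agent_step(instruction: str) -> str:
    return f"done: {instruction}"

@workflow.defn
class LongRunningAgent:
    @workflow.run
    async def run(self, steps: list[str]) -> list[str]:
        results: list[str] = []
        for step in steps:
            # Each completed activity is journaled by the Temporal server.
            # If the worker dies mid-run (or you deploy a new version),
            # Temporal replays the history and resumes here instead of
            # re-executing steps that already finished.
            results.append(
                await workflow.execute_activity(
                    run_agent_step,
                    step,
                    start_to_close_timeout=timedelta(hours=1),
                )
            )
        return results
```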
— Yeah. As the models get smarter and can run for minutes, maybe hours, I sometimes find it difficult to get the agent to do the right thing to take advantage of that. Is there any advice you can give on long-running tasks, on getting agents to run as long as possible?

— Getting agents to run for a long time is, I feel, getting easier and easier. If you give them a very hard task, and there's a clear way of checking whether the result is correct or wrong, the model is very happy to go on for hours or days. One of the great advantages of doing something like the browser experiment was setting the model the goal of "hey, build a browser" and watching it just rip for three days in a row. I think the model made something like three or four thousand commits over that period. Or I shouldn't say the model: the model and the harness combined. I've become less and less worried about whether that will happen. I think it'll be very natural; the models will just do it, and they'll be pretty reliable at it.

— What's the trick, though? Context is still limited, right? 200k tokens, maybe a million.

— This is one of the things we trained into the Composer model, and I think OpenAI is also releasing an endpoint for it almost simultaneously. One of the things you do during RL is this: once the model hits the context window, it writes out a summary, a document for itself, to remember what it was doing. Obviously you can prompt the model to write such a thing, which is what used to happen all of last year, but nowhere in the prompt is it incentivized to write a really good summary that will be helpful to its future self. In the same way, you could previously ask the model to grep, but there was no incentive to grep in a way that a future version of itself could use the answers to make progress on the problem. Self-summarization endpoints like these make the model produce genuinely helpful summaries that it can use to keep working, whether the context window is 200k or a million. As you RL the model on longer and longer tasks, self-summarization becomes part of the optimization stack, and the optimization stack forces the model to produce summaries that are very helpful. Composer is getting really good at it.

— I want to push back a little: summarization is lossy, right? So what's the trick?

— You don't need a trick. Humans also have a kind of lossy summarization, but the lossiness is not as bad as you think, because humans are forced to remember the things that are important. If you remembered everything, it would be horrible.
Over the years, whatever RL-like process is happening in our brains has forced us to remember the things that are really important. We don't really know what's going on in brains, but beyond summaries there are other cool ideas. One cool idea: every time you do some form of compaction or summarization, you dump the previous conversation into a file, and the model can grep that file any time it wants to look something up from its past. After RL forces it to learn to grep effectively, that gets baked in: if the model needs to look up a past version of itself, it just greps the conversations from the past.

— That sounds like a trick I could pick up for my own workflow too, right? You could basically pass things by reference: point to some data the model might need later on.

— Yeah. There are some generalization effects where the model will learn to use these things, but the most powerful version only comes if the harness is RL'd to actually, really use the tips and tricks you're adding.

— Cool. Yeah.
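Here is a toy sketch of that compaction-plus-archive idea, assuming a chat-style list of messages and a summarize() you would implement with a model call; none of this is Cursor's actual harness. When the context budget is exceeded, the full transcript is appended to a file the agent can grep later, and the in-context history is replaced by a summary plus a pointer to the archive.

```python
from pathlib import Path

ARCHIVE = Path("conversation_archive.txt")   # hypothetical archive file
CONTEXT_BUDGET = 200_000                     # tokens, e.g. a 200k-context model

def count_tokens(messages: list[str]) -> int:
    # Crude stand-in: roughly 4 characters per token.
    return sum(len(m) for m in messages) // 4

def summarize(messages: list[str]) -> str:
    # Stand-in for a model call that writes a summary for its future self.
    return f"[summary of {len(messages)} earlier messages]"

def compact(messages: list[str]) -> list[str]:
    """Replace an overflowing history with a summary plus a greppable archive."""
    if count_tokens(messages) < CONTEXT_BUDGET:
        return messages
    # Dump the full transcript so the agent can grep it later...
    with ARCHIVE.open("a") as f:
        f.write("\n".join(messages) + "\n")
    # ...and keep only the summary and a pointer to the archive in context.
    return [
        summarize(messages),
        f"(full history archived in {ARCHIVE}; grep it to recover details)",
    ]
```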

Software engineer 1 year from now

— So I'm going to ask a question I think every developer really cares about right now. Last year, we developers were still writing a lot of code, but this year many people tell me AI is writing 100% of the code. Where do you think the industry is going one year from now? What will happen?

— In some ways it's kind of crazy. In March or April everyone was still writing code by hand, and by December no one is writing code by hand. So in some ways coding got solved in six months. I don't mean that as some crazy inflammatory statement that everyone should get riled up about; it's just a boring fact about the world that in six months coding kind of got solved. That doesn't mean engineering is done and there are no engineers anymore, but the best engineers I know are not writing code by hand anymore. That's just the way it is, and I'd expect continued changes in the practice of engineering over the next year. There's no reason to expect RL will immediately deprecate all engineers, but outside of crazy physical limits, we'll probably be able to make the models a lot better. We should expect one to two capability jumps in the models over the next half-year, and probably another one to two in the half after that. A capability jump looks something like the Opus moment, where we went from "the model can kind of write the code, but you have to review every line" to "the model can write the code and you can trust it to just get it right." As these capability jumps continue, you should expect people to change their workflows and adapt, and that will just keep happening.

What's the spicy thing I can say? Maybe this: after something like cloud agents takes off, I wouldn't be surprised if people become much more like managers. There's a manager instinct that a lot of ICs don't have, and it will get more polished over time; people will become better and better at using it. Another spicy one is what happens after the first six months of these cloud agents becoming really good. I'm excited about a world with a sort of self-driving code base, where you allocate $100,000 a day to your code base. Say you're Plaid, and you decide: I will spend $100,000 a day on the code base. That's something like $30 million a year, which is not that much compared to the R&D cost. Of that $100,000, 7% is spent on security, 12% on removing tech debt, 25% on finding bugs, and 50% on handling issues from the backlog, and the code base sort of manages itself. What does GitHub look like in that world? What does git infrastructure look like? I don't know, but there's a force of technology pushing us all forward, and in that world I'd be super excited to see what that product looks like.

— So you mentioned the managerial

Lessons learned while building a browser in one week

instinct that we all have. You recently ran this experiment, right, with tons of agents developing a web browser. From what I understand, you tried a lot of different techniques at the beginning: a flat organization, a hierarchical one. Was there anything that surprised you in that process? What did you learn from it?

— I found the browser experiment, and in general things like making a browser, very, very surprising. I think everyone should have found it way more surprising than they did. Before, even though I knew the models were getting really smart, and every time you asked me I intellectually knew they were getting smart, deep down I was thinking: well, they kind of plumb these RPC calls, you put things through some proto, and they do all the stuff that I can do, just a little faster. That was my mental model of how the agents worked: they do whatever I do, but a little faster, and that is kind of what we use them for. But with the browser, for the first time, I was thinking: I can't do this. This is something I just can't do. If you gave me open-source Codex and Composer 1.5 and let me rip for a month on an island, I probably couldn't make a browser that's really functional. It's hard: going from HTML to parsing the CSS, finding the right layout rules, deciding exactly what the architecture should be, figuring out how it all gets passed into the rendering engine. For the first time I could see the model doing things that are actually crazy, just batshit crazy, and you think: wow, the world will be a very different place as the models get a lot better.

The harnesses are still in an experimental phase where we're trying out different ideas. We don't know exactly what's going to land, but it's probably some version of the crazy self-driving code base, where at some point the models are good enough that they just commit code to the code base, and there's no point in stopping them for human review. That seems like an unnecessary blocker. But again, the other way to think through this, for now, is a very boring answer, super lame and boring: if your code base will be around for many years, review every line of code. If your code base is for a weekend, who cares?

— Thank you for sharing. That's super interesting.

Advice for engineers

— The models are getting better and better. It's really exciting, and also really scary, for a lot of engineers. Do you have any advice for engineers who want to stay ahead, so they can stay competitive in the future?

— In some ways it's kind of exciting. The art of engineering is changing, and you kind of go along with it; I don't have some great meta-advice. For people who really like artisanally writing code, I think there will be areas of the code base where artisanally writing code is still very important. Like writing kernels or something... actually, I think kernels are probably going away. Maybe something else; there exists something where artisanally writing code makes a lot of sense. And then there are a lot of places that will be mostly written by models. And a lot of people actually enjoy coding more now. Most engineers I know like coding more than they used to, because a lot of the really boring aspects, like debugging a single bug for hours and hours, have just gone away.
