Codex and the future of coding with AI — the OpenAI Podcast Ep. 6
Duration: 50:39


OpenAI · 15.09.2025 · 66,545 views · 1,431 likes


Video description
What happens when AI becomes a true coding collaborator? OpenAI co-founder Greg Brockman and Codex engineering lead Thibault Sottiaux talk about the evolution of Codex: from the first glimpses of AI writing code to today's GPT-5 Codex agents that can work for hours on complex refactorings. They discuss building "harnesses," the rise of agentic coding, code review breakthroughs, and how AI may transform software development in the years ahead.

Chapters:
- 1:15 – The first sparks of AI coding with GPT-3
- 4:00 – Why coding became OpenAI's deepest focus area
- 7:20 – What a "harness" is and why it matters for agents
- 11:45 – Lessons from GitHub Copilot and latency tradeoffs
- 16:10 – Experimenting with terminals, IDEs, and async agents
- 22:00 – Internal tools like 10x and Codex code review
- 27:45 – Why GPT-5 Codex can run for hours on complex tasks
- 33:15 – The rise of refactoring and enterprise use cases
- 38:50 – The future of agentic software engineers
- 45:00 – Safety, oversight, and aligning agents with human intent
- 51:30 – What coding (and compute) may look like in 2030
- 57:40 – Advice: why it's still a great time to learn to code

Table of contents (10 segments)

The first sparks of AI coding with GPT-3

— And as soon as you saw that, you knew this was going to be big. And I remember at some point we were talking about these aspirational goals of: imagine if you could have a language model that would write a thousand lines of coherent code, right? That was like a big goal for us. And the thing that's kind of wild is that goal has come and passed, and I think we don't think twice about it, right? I think that while you're developing this technology, you really just see the holes, the flaws, the things that don't work. But every so often it's good to step back and realize that things have actually come so far. It's incredible how used we get to things improving all the time, how they just become a daily driver and you use them every day, and then you reflect back: a month ago this wasn't even possible. And this just continues to happen. I think that's quite fascinating, how quickly humans adapt to new things. — Now, one of the struggles that we've always had is the question of whether to go deep in a domain, right? Because we're really here for the G, right? For AGI, general intelligence. And so to first order, our instinct is to just push on making all the capabilities better at once. Coding's always been the exception to that, right? We really have a very different program that we use to focus on coding data, on code metrics, on trying to really understand how our models perform on code. And, you know, we've started to do that in other domains too, but for programming and coding that's been a very exceptional focus for us. And you know, for GPT-4 we really produced a single model that was just a leap on all fronts.
Um, but we actually had trained, you know, the Codex model, and I remember doing a Python-focused model; we were really trying to push the level of coding capability back in, you know, 2021 or so. And I remember when we did the Codex demo, that was maybe the first demonstration of what we'd call vibe coding today, right? I remember building this interface and having this realization that for just standard language model stuff, the interface, the harness, is so simple, right? You're just completing a thing, and maybe there's a follow-up turn or something like that, but that's it. For coding, this text actually comes to life, right? You need to execute it, it needs to be hooked up to tools, all these things. And so you realize that the harness is almost equally part of how you make this model usable as the intelligence is, and that is something I think we kind of knew from that moment. And it's been interesting to see, as we got to more capable models this year, how we really started to focus on not just the raw capability, like how do you win at programming competitions, but how do you make it useful? Right? Training in a diversity of environments, really connecting to how people are going to use it, and then really building the harness, which is something we have really pushed hard on. — Could you unpack what a harness means, in sort of simple terms? — Yes, it's quite simple. You have the model, and the model is just capable of

Why coding became OpenAI’s deepest focus area

input and output, and what we call the harness is how we integrate that with the rest of the infrastructure so that the model can actually act on its environment. — So it's the set of tools, it's the way that it's looping. The agent loop, as we refer to it. In essence it's fairly simple, but when you start to integrate these pieces together and really train it end to end, you start to see pretty magical behavior, and an ability of the model to really act and create things on your behalf and be a true collaborator. — So think about it a little bit as, you know, the harness being your body and the model being your brain. — Okay. It's interesting to see how far it came from the GPT-3 days, where you literally had to write, you know, commented code and say "this function does this" in Python, put your hashtag comment in front of it, whatever. And it's just interesting to see how the models have now become naturally, intuitively good at coding. And you mentioned, you know, trying to decide between a general-purpose model or saying how important code is. Was it just outside demand, people telling you they wanted these models to be better at code, or was this coming internally, because you guys wanted to use this more? — Both. — Yeah, — absolutely both. And I remember, you know, in, I think, 2022 is when we worked with GitHub to produce GitHub Copilot. — And the thing that was very interesting there was that for the first time you really felt what it is like to have an AI in the middle of your coding workflow and how it can accelerate you. And I remember that there were a lot of questions around the exact right interface. Do you want ghost text, so it just does a completion? Do you want a little drop-down with a bunch of different possibilities? But one thing that was very clear was that latency was a product feature, and that the constraint for something like an autocomplete is about 1,500 milliseconds, right?
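The agent loop described here, where the model proposes an action, the harness executes it and feeds the result back until the task is done, can be sketched roughly like this. All names and the single-tool setup are illustrative assumptions, not Codex's actual interface:

```python
# Minimal agent-loop sketch: the "harness" is the tool set plus this loop.
# Real harnesses add sandboxing, streaming, approvals, and richer tools.

def run_shell(cmd: str) -> str:
    """Illustrative tool: execute a shell command and return its output."""
    import subprocess
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"run_shell": run_shell}

def agent_loop(model, task: str, max_steps: int = 20) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model(history)            # model decides the next step
        if action["type"] == "finish":     # model declares the task done
            return action["answer"]
        tool = TOOLS[action["tool"]]       # harness acts on the model's behalf
        observation = tool(action["input"])
        history.append({"role": "tool", "content": observation})
    return "step budget exhausted"
```

The point of the sketch is that the loop itself is trivial; the leverage comes from training the model end to end against the tools it will actually be given.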
That's about the time you have to produce a completion. Anything slower than that, it could be incredibly brilliant, but no one wants to sit around waiting for it. And so the mandate that we had, the clear signal we had from users and from, you know, the product managers and all the people thinking about the product side of it, was: get the smartest model you can, subject to the latency constraint. — And then you have something like GPT-4, which is much smarter but is not going to hit your latency budget. What do you do? Is it a useless model? Absolutely not. The thing you have to do is change the harness. You change the interface. And I think that's a really important theme: you need to co-evolve the interfaces and the way that you use the model around its affordances. And so super fast smart models are going to be great, but the incredibly smart but slower models are also worth it. And I think that we've always had a thesis that the returns on that intelligence are worth it. And it's never obvious in the moment, because you're just like: well, it's just going to be too slow, why would anyone want to use it? But I think our approach has very much been to say: just bet that the greater intelligence will pan out in the long run. — It was hard for me to wrap my head around where that was all headed back when I was working on GitHub Copilot, because at that point we were used to, like you said, the completion: ask it to do a thing, it completes a thing. And I think I didn't really understand how much more value you would get out of building a harness, adding all these capabilities there. It just seemed like all you need is the model. But now you realize the tooling and everything else matters and can make such a big difference. And then you brought up the idea of modalities. And now we have

What a “harness” is and why it matters for agents

the Codex CLI, so I can go to the command line and do this. There's a plugin for VS Code, so I can go use it there. And then I can also deploy stuff to the web and do that. And I don't think I've fully comprehended the value of that. So how is this something you're using? How are you deploying these things yourselves? Where are you finding the most, you know, utility out of it? — I think, just to go back a little bit to the first signs that we saw: we had a lot of developers at the company, and outside of the company our users used ChatGPT to help them debug very complex problems. And one thing that we clearly saw was people trying to get more and more context into ChatGPT: you're trying to get bits of your code and stack traces and things, and then you paste that and present it to a very smart model to get some help. And interactions were starting to get more and more complex, up to some point where we realized: hey, maybe instead of the user driving this thing, let the model actually drive the interaction, find its own context, and then find its way and be able to debug, you know, this hard problem by itself, so that you can just sit back and watch the model do the work. So it's sort of reversing that interaction, which, you know, led, I think, to thinking a lot more about the harness and giving the model the ability to act. — And we iterated on form factors, right? I mean, I remember at the beginning of the year we had a couple of different approaches. We had, you know, sort of the async agentic harness, but we also had the local experience, and a couple of different implementations of it. — Yeah, we actually started to play a little bit with this idea of running it in the terminal. Um, and then we felt that was not AGI-pilled enough.
Uh, we needed the ability to run this at scale and remotely, to just close the laptop and have, you know, the agent just continue to do its work, and then you can maybe follow it on your phone and interact with it there. That seemed very cool. So we pushed on that, but we actually had a prototype of it fully working in a terminal, and people were using that productively at OpenAI. We decided not to launch it as a product; it didn't feel polished enough. It was called 10x, because we felt like it was giving us this 10x productivity boost. But then we decided to, you know, just experiment with different form factors and really go all in on the async form factor initially. And now we've kind of gone back a little bit on that and re-evolved and said: hey, actually, this agent, we can bring it back to your terminal, to your IDE. But the thing that we're really trying to get right is this entity, this collaborator that's working with you, and then bringing that to you in the tools that you're already using as a developer. — Yeah. And there were other shots on goal as well, right? So we had a version where there was a remote daemon that would connect to a local agent, and so you kind of could get both at once. And I think that part of the evolution has been that there's almost this matrix of different ways you could try to deploy a tool, right? There's the async one, which has its own computer off in the cloud. There's the local one, which runs synchronously there. You can blend between these. And there's been a question for us of how much we focus on trying to build something that is externalizable, right? That is useful in the diversity of environments that people have out there, versus really focusing on our own environment and trying to make it so that things work really well for our internal engineers.
And one of the challenges has been that we want to do all of this, right? We ultimately want tools that are useful to everyone, but if you can't even make something useful for yourself, how are you going to make it extremely useful for everyone else? And so part of the challenge for us has been really figuring out where we focus and how we achieve the biggest bang for the buck in terms of our engineering efforts. And, you know, for me one of the things that's been an overarching focus is that we know that coding and building very capable agents is one of the most important things we can do this year. At the beginning of the year we set a company goal of an agentic software engineer by the end of the year, and figuring out exactly what that means and how to instantiate it, how to bring together all the opportunity and all the compute that we have to bear on this problem, has been a great undertaking for many people at OpenAI. — So you mentioned that you had the tool 10x, and that was an internal tool, and at some point you said: oh, this is really useful to other people. It's got to be hard to decide when to do that and when not to, and how much to prioritize it. You know, we've seen Claude Code become extremely powerful, which I imagine is probably a similar story with

Lessons from GitHub Copilot and latency tradeoffs

something that was used internally and then became something deployed. When you start to think about next steps, you know, where do you decide to take it next, where do you put the emphasis? You mentioned before that I can now run things in the cloud, run these on the web, do these kinds of agentic tasks where I walk away. And my problem is just that it's such a new modality, it's really hard for me to think about. But sometimes these things have to sit around for a while and people sort of discover them independently. Have you found that internally, that somebody says: oh, now I get it? — Absolutely, right. And I think that, you know, my perspective is that we kind of know the shape of the future, of the long term. It is very clear that you're going to want an AI that has its own computer, that is able to, you know, delegate to a fleet of agents and solve multiple tasks in parallel. You should wake up in the morning, sipping your coffee, you know, answering questions from your agent, providing some review, being like: "Ah, no, this wasn't quite what I meant." This workflow clearly needs to happen, but the models aren't quite smart enough for this to be the way that you interact with them. — And so having an agent that is really there in your terminal, in your editor, to help you with the way that you do your work, in a way that looks very similar to the way you would have done it a year ago, that's also the present. And so I think the way we've seen it is that we're almost blurring together here's-what-the-future-looks-like with the fact that we also can't abandon the present: thinking about how you bring AI into code review, and how you make it so that it appears proactively and does work for you that's useful. And then you have a whole new challenge as well: if you have a lot more PRs, how do you actually sort through them to the ones you actually want to merge?
And so I think we've kind of seen all of this opportunity space, and we've seen people start to change how they develop within OpenAI, how they even structure their code bases. — Yeah, I think there are two things to that effect that really combine. And, I mean, you know, this is where we're at today. One: infrastructure is hard, and we would love for everyone's code and tests and packages to be perfectly containerizable so we can run them at scale. That's not the case. People have very thorough and complex setups that probably only run on their laptop, and we want to be able to leverage that and meet people where they are, so that they don't have to configure things specifically for Codex. That gives you a very easy entry point into experiencing what a very powerful coding agent can do for you. And this, at the same time, lets us experiment with what the right interfaces are. Six months ago we weren't playing with these kinds of tools; this is all very new and evolving fast, and we have to continue to iterate here and innovate on what the right interface and the right way to collaborate with these agents are, and we don't feel like we have really nailed that yet. That's going to continue to evolve. But bringing it to a zero-setup experience, extremely easy to use out of the box, you know, allows a lot more people to benefit from it and play with it, and allows us to get the feedback so that we can continue to innovate. That's very important.
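For context, "perfectly containerizable" means something like the following hypothetical setup: a recipe from which anyone, human or agent, can rebuild the environment and run the tests anywhere, which is exactly what most laptop-only setups lack. The file below is an illustrative sketch, not anything Codex requires:

```dockerfile
# Hypothetical fully containerized project: the environment and its
# test suite can be rebuilt and executed at scale, not just on one laptop.
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["pytest", "-q"]
```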
I remember at the beginning of the year talking to one of our engineers, who I think is really fantastic, and he was saying that ChatGPT (we had this integration where it could automatically see the context in his terminal) was transformative, because he didn't have to copy-paste errors; he could just instantly be like: hey, you know, what's the bug, and it would just tell him. And it was great, right? And you realize that it was an integration that we built that was so transformative. It wasn't about a smarter model. — And I think one thing that's very easy to get confused by is to focus on only one of these dimensions and ask which one matters, because the answer is they kind of both matter. And the way I've always thought about this, and I remember this from when we were originally releasing the API back in 2020, is that there are two dimensions to what makes an AI desirable. There's intelligence, which you can think of as one axis, and then there's convenience, which you can think of as latency, as cost, as the integrations available to it. And there's some acceptance region, right? If the model's incredibly smart but it takes, like, a month to run, you still might use it, right, if the output is such a valuable piece of code, or, you know, a cure for a certain disease or something like that. Okay, fine, it's worthwhile. If the model's not that intelligent, not that capable, then all you want from it is autocomplete. So it has to be

Experimenting with terminals, IDEs, and async agents

incredibly convenient, zero cognitive tax for you to think about what it's suggesting, that kind of thing. And where we are is, of course, somewhere on the spectrum now. We now have smarter models that are, you know, somewhat less convenient than autocomplete, but still more convenient than sitting around waiting a month for the answer to appear. And so a lot of our challenge is figuring out when to invest in pulling that convenience to the left, and when to invest in pushing the intelligence up. And it's a massive design space, which is what makes it fun. — Yeah. I don't know if you remember, but I made an app that was featured at the launch back in 2020, AI Channels. — And the challenge was that the GPT-3 app was very capable, but I had to write these 600-word prompts to get it to do stuff, and at 6 cents per thousand tokens, and with the latency, I'm like: I don't think this is the world for this right now. And then came GPT-3.5 and GPT-4, and all of a sudden you see all those capabilities, and it was hard for me to say why, but then you see that all of a sudden things come together. And you mentioned, you know, the idea of just having the model be able to see the context inside of where you're working. And I remember when I was copy-pasting from ChatGPT into my workspace, and it reminded me of going into a grocery store and refusing to get a cart, just carrying everything to the checkout. I'm like: this is terribly inefficient. Once you put things on wheels, it works really well. And I think we're seeing all kinds of those unlocks now. — Now the problem I deal with, when I sit down to work on something, is: do I go into the CLI? Do I use the VS Code plugin? Do I go into Cursor? Do I use some other tool? How do you guys figure this out?
— Right now we're still in the experimentation phase, where we're trying different ways for you to interact with the agent and bring it to where you're already productive. So, for example, Codex is now in GitHub. You can mention Codex and it will do work for you. If you write "@codex, fix this bug" or "move the tests over here," it will go and run off and do it with its own little laptop, you know, in our data centers, and you don't have to think about it. But if you're working with files in a folder, then you have that decision: are you going to do it in your IDE or in the terminal? What we're seeing is that power users are developing very complex workflows with the terminal more. Mhm. — And then when you're actually working on a file or a project, you prefer to do it in the IDE. It's a bit more of a polished interface: you can undo things, you can see the edits, it's not just scrolling by you. And then the terminal is also just an amazing vibe-coding tool where, you know, if you don't really care that much about the code that's being produced, you can just generate a little app. It's much more about that interaction; it elevates the interaction instead of focusing on the code. So it's more focused on the outcome. And it just sort of depends on what you want to do, but it's still very much an experimentation phase right now, and we're trying different things out. And yeah, it's going to continue like that, I think. — Yeah, I really agree with that. And I also think that a lot of our direction will be more integration across these things, — right? Because people are capable of using multiple tools, right? You already have your terminal, your browser, your, you know, GitHub web interface, your repo on your local machine. Each of these is something people have kind of learned when it's appropriate to reach for.
And I think that because we're in this experimentation phase, these things can feel very disparate and very different, and, you know, you have to kind of learn a new set of skills and the affordances of the relevant tool. And I think a lot of what's on us, as we're iterating, is to really think about how these fit together. And you can start to see it, right, with the Codex IDE extension being able to run remote Codex tasks. And I think that ultimately our vision is that there should be an AI that has access to its own computer, its own clusters, but is also able to look over your shoulder, right? It can also come and help you locally. And these shouldn't be distinct things, — right? It's like this one coding entity that is there to help you and collaborate with you. Like when I collaborate with Greg, you know, I don't complain that sometimes you're on Slack, sometimes I talk to you in person, — sometimes you complain, — you know, sometimes you interact through a GitHub review. This seems very natural when you interact with other humans and collaborators. And this is also how we're thinking about Codex: as an agentic entity that is really meant to supercharge you when you're trying to achieve things. — So let's talk about some of the ways of using it, like AGENTS.md. Do you want to explain that? — Yeah, AGENTS.md is a set of instructions that you can give to Codex that lives alongside your code, so that Codex has a little bit more context about how to best navigate the code and accomplish its tasks. There are two main things that we find useful to put in AGENTS.md. One is a compression thing, where it is a little bit more efficient for the agent to just read AGENTS.md instead of exploring the entire codebase. The other is preferences that are not actually clear from the codebase itself, where you would say: you know, actually, tests should go over here, or I like things to be done in this particular fashion. Those two things, preferences and explaining to the agent how to navigate the codebase effectively, are very useful things to have in AGENTS.md. — Yep. And I think there's something deeply fundamental here: how do you communicate to an agent that has no context what you want, what your preferences are, and try to save it a little bit of the spin-up that a human would require? Right? We do this for humans, right? We write README.mds, and this is just a convention for the name of a file that an agent should go look at. But there's also something that's a little point-in-time about it, right? The agents right now don't have great memory. It's like: if you're running your agent for the 10th time

Internal tools like 10x and Codex code review

has it really benefited from the nine times that it went and solved a hard problem for you? And so I think we have real research to do to think about how you have memory: an agent that really goes and explores your codebase and deeply understands it, and then is able to leverage that knowledge. And so this is one of the examples, and there are many, where we see great fruit on the horizon for further research progress. — It's a very competitive landscape now. There was a point where, you know, OpenAI kind of came out of nowhere for a lot of people, and all of a sudden there was GPT-3, then there was GPT-4, and, uh, I think Anthropic's building great models, and Gemini, you know, from Google, has gotten really good. How do you guys see the landscape? How do you see your placement there? — I mean, I think that there's a lot of progress to be had. I focus a little less on the competition and a little more on the potential, — right? Because, you know, we started OpenAI in 2015 thinking that AGI is going to be possible, maybe sooner than people think, and we just want to be a positive force in how it plays out, right? And really thinking about what that means, and trying to connect that to practical execution, has been a lot of the task. And so as we started to figure out how to build capable models that are actually useful, right, that can actually help people, actually bringing that to people is this really critical thing. And you can look at choices that we've made along the way. For example, releasing ChatGPT and making the ChatGPT free tier available widely, right? That's something that we do because of our mission, because we really want AI to be available and accessible and to benefit everyone. And so in my view, the most important thing is to continue on that exponential progress and really think about how to bring it to people in a positive and useful way.
Um, so where I really see us right now is that there's the GPT-4 class of pre-trained models, and there's reinforcement learning on top of it to make them just much more reliable and smart, right? Think about it: if you've just sort of read the internet, right, you've just observed a bunch of, you know, human thought, and you're trying to write some code for the first time, you're probably going to have a bad time of it. — But if you've had the ability to actually try to solve some hard coding problems, you have a Python interpreter, you have access to the kinds of tools that humans do, then you're going to be able to become much more robust and refined. So we now have these pieces working together in concert, but we've got to keep pushing them to the next level. It's very clear that things like being able to refactor massive code bases, no one's cracked that just yet. There's no fundamental reason we can't. And I think the moment you get that... I think refactoring code is one of the killer use cases for enterprise, right? You know, if you could bring down the cost of code migrations by 2x, I think you'll end up with 10x more of them happening. Think about the number of systems that are stuck in COBOL. And there are no COBOL programmers being trained, right? It's strictly, you know, building up liability for the world to have this dependency. The only way through is by building systems that can actually tackle that. So I just think it's a massive open space. The exponential continues, and we need to stay on it. My favorite thing that happened today was a tweet from OpenAI showing people how to use the CLI to switch from the Completions API to the Responses API. — That's a great use. I expect to see more of that.
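As a toy illustration of the kind of mechanical rewrite such a migration involves, here is a naive codemod sketch. The function and patterns below are assumptions for illustration, not OpenAI's actual migration tooling; the real Completions-to-Responses migration touches many more call shapes, which is exactly why handing it to an agent is attractive:

```python
import re

def migrate_completions_call(src: str) -> str:
    """Naive sketch: rewrite one common Chat Completions call shape
    into the Responses API shape. Real migrations need AST-aware
    tooling (or an agent) to handle the many other call shapes."""
    # client.chat.completions.create(...) -> client.responses.create(...)
    src = src.replace("client.chat.completions.create",
                      "client.responses.create")
    # single user message: messages=[{"role": "user", ...}] -> input=...
    src = re.sub(
        r'messages=\[\{"role": "user", "content": (.+?)\}\]',
        r"input=\1",
        src,
    )
    return src

before = ('client.chat.completions.create(model="gpt-5", '
          'messages=[{"role": "user", "content": prompt}])')
print(migrate_completions_call(before))
# -> client.responses.create(model="gpt-5", input=prompt)
```

A text-level rewrite like this breaks down as soon as the call spans lines or uses keyword variables, which is the gap between a regex codemod and an agent that can run the tests afterwards.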
You know, where you have special instructions given to Codex in order to go do refactorings reliably, and then you just send it off and it does it for you. That's a wonderful thing. Migrations are some of the worst things. Nobody wants to do migrations. Nobody wants to change from one library to another and then make sure that everything still works. You know, if we can automate most of that, that's going to be a very beautiful contribution. — Yeah, I think there's a lot of other ground as well. I think that security patching is a good example of something that will become very important soon, and that's something we're being very thoughtful about. I think about being able to have AIs that produce new tools, right? Think about how important the Unix set of standard tools has been; AIs that are able to build their own tools, tools that are useful for you and useful for themselves, can build up a ladder of complexity, or utility, there, to just continue to improve this flywheel of efficiency. AIs that are not just writing code but able to execute: you know, able to administer services, or do SRE work, things like that. I think all of that is on the horizon. It's starting to happen, but it's not really happening yet in the way that we would like to see.
One big one that we cracked internally at OpenAI, and then decided to release, is code review. We started to notice that the big bottleneck for us, with increased amounts of code needing to be reviewed, was simply the number of reviews that people on the teams had to do. And so we decided to really focus on a very high-signal Codex mode where it's able to review a PR and really think deeply about the contract and the intention that you were meaning to implement, and then look at the code and validate whether that intention is matched and found in the code. And it's able to go layers deep, look at all the dependencies, think about the contract, and really raise things that some of our best reviewers wouldn't have been able to find unless they were spending hours deeply thinking about that PR.

Why GPT-5 Codex can run for hours on complex tasks

And we released this internally first at OpenAI. It was quite successful, and people were actually upset when it broke, because they felt like they were losing that safety net. And it accelerated teams, including the Codex team, tremendously. The night before we released the IDE extension, one of the top engineers on my team was cranking out 25 PRs, and Codex was finding quite a few bugs automatically, and, you know, we were able to put out an IDE extension that was almost bug-free the next day. So the velocity there is incredible. — And it's very interesting that for the code review tool in particular, people were very nervous about having this enabled, because I think our previous experience with every auto-code-review experiment that we've tried is that it's just noise. — Right? You just get an email from some bot and you're like: another one of those things. You ignore it. And I think we've had kind of the opposite finding from where we are now. And it really shows you: when the capability is below threshold, it just feels like this thing is totally net negative. I don't want to hear about it or see it. Once you crack above some threshold of utility, suddenly people want it, right? And they get very upset if it gets taken away. And I think our observation is also that if something kind of works in AI right now, one year from now it'll be incredibly reliable, incredibly mission-critical. And I think that's where we're going with code review. — Part of what's interesting with code review as well is bringing humans along, and really having this be a collaborator, including in review.
And one thing we thought a lot about is how we can raise those findings so that you're actually excited to read them, and you might even learn something, including when it's wrong, because you can actually understand its reasoning. Most of the time, actually more than 90% of the time, it's right, and you often learn something as the person who authored the code or someone who's helping review it. — Yeah, just circling back to what we were saying earlier about the rate of progress, and sometimes stepping back and thinking about how things were. I remember for GPT-3 and GPT-4 really focusing on the doubling-down problem. Do you remember? If the AI would say something wrong and you'd point out the mistake? — Oh, it would argue with you. — Oh yeah. It would try to convince you that the thing was right. We're so far past that being the core problem. I'm sure it happens in some obscure edge cases, just like it does for humans, but it's really amazing to see that we're at a level where even when it's not quite zeroed in on the right thing, it's highlighting stuff that matters. It has pretty reasonable thoughts. And I always walk away from these code reviews thinking, huh, okay, that's a good point, I should be thinking about that. — We're now just getting into the launch of GPT-5, and as of the recording of this podcast we now have GPT-5-Codex — which we're tremendously excited about. — Very excited. — Why should I be excited about this, gentlemen? Sell me on this. — So GPT-5-Codex is a version of GPT-5 that we have optimized for Codex. We talked about the harness, and so it's optimized for the harness. We really consider it to be one agent, where you couple the model very closely to the set of tools, and it's able to be even more reliable.
One of the things this model exhibits is an ability to go on for much longer, and to really have that grit you need on these complex refactoring tasks. But at the same time, for simple tasks, it actually comes back way faster and is able to reply without much thinking. So it's this great collaborator where you can ask questions about your code, find where the piece of code is that you need to change, better understand, plan; but at the same time, once you let it go at something, it will work for a very long period of time. We've seen it work internally for up to seven hours on very complex refactorings. We haven't seen other models do that before. And we've also worked tremendously on code quality, and it's just really optimized for what people are using GPT-5 within Codex for. — So when you talk about working longer, and you say it worked up to seven hours, you're not just talking about it putting things back into context; it's actually making decisions, deciding what's important, and moving forward? — Yes. So imagine a really tricky refactoring. We've all had to deal with this: you've decided that your codebase is unmaintainable, and you need to make a couple of changes in order to move forward. So you make a plan, and then you let the model go. You let GPT-5-Codex go at it, and it will just work its way through all of the issues, get the tests to run and pass, and completely finish the refactoring. This is one of the things that we've seen it do for up to seven hours. — Wow. — Yeah. The thing that I find so remarkable is that the core intelligence of these models is clearly just stunning, right? I think that even three to six months ago, our models were better than I am at navigating our internal codebase to find a specific piece of functionality.
And that requires some really sophisticated… — Are you going to have to let yourself go? Are you like, Greg, I'm sorry? — Well, this is the thing: I get to do more. Is what I want to spend my time doing, what I want people to know me for, being able to find functionality in a codebase? Absolutely not, right? That's not how I define my value as an engineer, or what I want to spend my time on as an engineer. And that to me is the core of it, right? There's this amazing intelligence, and it can first of all suck away all the kind of mundane, boring parts, and certainly some

The rise of refactoring and enterprise use cases

of the fun parts too, right? Like, really thinking about the architecture of things: it's a great partner for that, but I get to choose how I spend my time. And I get to think about how many of these agents I want running on what task, and how I break things down. So I view it as increasing the opportunity surface for programmers. And you know, I'm an Emacs user through and through. I started using VS Code and Cursor and Windsurf and these things, partly to just try things out, and partly because I like the diversity of different tools, but it's really hard to get me out of my terminal. — Wow. — But I have found that we're now above threshold, where I really find myself... like, I'm doing some refactor, and I'm like, why am I typing this thing? Or you're trying to remember exactly the syntax for a specific thing, or trying to do these very mechanical things, and I'm like, I just want to have an intern go do the thing. But I have that now in my terminal, and I think it's really amazing that we're at the point where you have this core intelligence and you get to pick and choose when and how to use it. — Please add Whisper to the extension too, because now I just love to talk to the model and tell it to do things. — Yeah. You should be able to video chat with your model. I think we're heading towards a real collaborator, a real coworker. — Well, let's talk about the future. Where do you see this headed? What's exciting about the agentic future? How are we going to be using these systems? — We have strong conviction that the way this is headed is large populations of agents somewhere in the cloud that we, as humanity (as people, teams, organizations), supervise and steer in order to produce great economic value.
So if we're going a couple of years out, this is what it's going to look like: millions of agents working in our and companies' data centers in order to do useful work. Now the question is how we get there gradually, and how we experiment on the right form factor and the right interaction patterns. One of the things that is incredibly important to solve is the safety, security, and alignment of all of this, so that agents can perform useful work but in a safe way, and you always get to stay in control as the operator, as a human. — And this is why, for Codex CLI, by default the agent operates in a sandbox and isn't able to edit files randomly on your computer. And we're going to continue to invest a lot in making the environment safe: investing in understanding when humans need to steer or approve certain actions; giving the agent its own set of permissions that you allow it to use; and then maybe escalating permissions when you allow it to do exceptionally risky things. So: figure out this entire system, then make it multi-agent and steerable by individuals, teams, and organizations, and then align that with the whole intent of those organizations. This is where it's headed for me. It's a bit nebulous, but it's also very exciting, I think. — Yep. Yeah, I think that's exactly right. I mean, at a zoomed-in level there's a bunch of technical problems that need to be solved, like Thibault was getting at: scalable oversight, right? How do you as a human manage agents that are out there writing lots of code? You probably don't want to read every line of code. Probably, right now, most people do not read all the code that comes out of these systems. But how do you — Of course I do. — Exactly. But how do you maintain trust, right?
How do you make sure that AI is producing things that are actually correct? I think there are technical approaches, and we've been thinking about these kinds of things since probably 2017, which is the first time we published some strategies for how you can have humans, or weaker AIs, start to supervise even stronger AIs, and bootstrap your way to making sure that when they're doing very capable, important tasks, we can maintain trust and oversight and really be in the driver's seat. So that's a very important problem, and it really is exemplified in a very practical way by thinking about more and more capable coding agents. But I think there are also other dimensions that are very easy to miss, because at each level of AI capability, people kind of overfit to what they see and think, oh, this is AI, this is what AI is going to be. But the thing we haven't quite seen yet is AIs solving really hard novel problems, right? Right now you think of it as, okay, I need to do my refactor; you at least have a shape of what that thing would be. It'll do a lot of the work for you, save a lot of time. But what about solving problems that are fundamentally unsolvable through any other means? And I think of this not just in the coding domain. Think of it in medicine, producing new drugs; think of it in materials science, producing new materials that have novel properties. I think there's a lot of new capability coming down the pike that is going to unlock these kinds of applications. And so, for me, one big milestone is the first time that you have an artifact produced by an AI that is extremely valuable and interesting unto itself, not because it was produced by an AI, or cheaper to produce, but because it's simply a breakthrough,
simply something that is just novel. And it doesn't even necessarily have to be autonomously created by the AI; it can be created in partnership with humans, with the AI as a critical dependency. And I think we're starting to see signs of life on this kind of thing. We're seeing it in the life sciences, where

The future of agentic software engineers

human experimenters ask o3 for five ideas of experimental protocols to run. They try out the five; four of them don't work, but one of them does. And the feedback we've been getting, and this was back in the o3 days, is that the results are at the level of what you'd expect from a third- or fourth-year PhD student, which is crazy. And that was o3. With GPT-5 and GPT-5 Pro, we're seeing totally different results. There, we're seeing research scientists saying, okay, yeah, this is doing real novel stuff. And sometimes, again, it's not just on its own solving these grand theories, but together, in partnership, being able to stretch far beyond where humans alone could go. And that to me is one of the critical things that we need to continue to push on and get right. — One of the challenges I have when talking to people about the future, and I want to hear you guys talk about this, is that people tend to imagine the future as kind of the present but with shiny clothes and robots, and they think, well, what happens when robots write all the code, and all that. And you brought up a fact: there are things you like to do and things you don't care to do. Where are we in 2030? What does it look like? GPT-3 was five years ago; 2030 is five years from now. — We didn't have these tools six months ago, so it's hard to picture exactly what this is going to look like five years from now. But one thing that — I'm going to pop out of the bushes five years from now with this podcast and be like, "You said this." — Well, your agent will do it for you. — Yeah. So, one thing that's important is that the pieces of code that are critical infrastructure underpinning society, we need to continue to understand, and to have the tools to understand.
And this is why we were also thinking about code review: code review should help you understand that code and be this teammate that helps you deep-dive into code written by someone else, potentially with the help of AI. — And I would actually argue that we already have a problem: there's lots of code out there that is not necessarily secure. — Right, this happens all the time. I remember Heartbleed, back — I guess it's almost 12 years ago or something. A critical vulnerability in a key piece of software used across the internet, and you realize that it's not singular, right? There are lots of vulnerabilities out there that no one has found yet. — All these packages from npm, all these PyPI packages that are just sitting there that people put exploits into. — And the way it's always worked is that there's a cat-and-mouse game: attackers get more sophisticated, defenders get better. And with AI, you're like, well, which side will it advantage the most? Maybe it'll just accelerate this cat and mouse. But I think there's some hope that you can actually unlock fundamentally new capabilities through AI. For example, formal verification — which is sort of an endgame for defense. — Mhm. And that to me is very exciting: thinking about not just how you continue this never-ending rat race, but how you actually end up with increased stability and increased understandability. And I think there are other opportunities like that for us to really understand our systems, in a way that right now we're sort of at the edge of human understanding of the traditional software systems that have been built. — One of the reasons we built Codex is to improve the infrastructure and the code out there in the world, not necessarily to increase the amount of code in the world.
And so this is a very important point: it's also about helping find bugs, helping refactor, helping find more elegant, more performant implementations that achieve the same thing, or are actually more general, but not ending up with 100 million lines of code that you don't understand. One thing I'm really excited about is how Codex can help teams and individuals just write better code, be better software engineers, and end up with simpler systems that actually do more for us. — I think part of the 2030 outlook is that we will be in a world of material abundance, right? AI is going to make it easier than you could almost imagine to create anything you want, and that will probably be true in the physical world in addition to the digital world, in ways that are hard to predict. But I think it'll be a world of absolute compute scarcity. And we've seen a little bit of what this is like within OpenAI: the way that different research projects fight over compute, or that the success of the research program is determined by the compute allocation, is something that's hard to overstate. And I think we're going to be in a world where your ability to produce and create whatever you imagine will be limited partly by your imagination, but partly by the compute power behind it. And so one thing we think about a lot is how to increase the supply of compute in the world. We want to increase the intelligence, but also the availability of that intelligence, and fundamentally that is a physical infrastructure problem, not just a software problem.
You know, with GPT-5, one thing that's quite amazing is that we're able to include it as part of the Plus plan and the Pro plan: you can use Codex with your Plus plan, and you get GPT-5, the same version that everyone else gets. It's this incredible intelligence, but the model is also incredibly cost-effective. — I think that was one of the things that really stood out for me: the model was much more capable, but it came out at the same price point, or in some ways cheaper, than the previous model, and that was like, wow. That pattern is great. — I think the degree to which we are improving the intelligence while cutting prices is something that is very easy to take for granted, but it's actually crazy. I think we did something like an 80% price cut on o3. And just look back: it was 6 cents per thousand tokens for GPT-3-level intelligence. — Yeah, there was an article in one of the newspapers complaining that these reasoning models have made things more expensive, but they didn't compare reasoning models to reasoning models over the last six to seven months and how much more efficient they've become. — Yep. And that will just continue.

Safety, oversight, and aligning agents with human intent

You know, on the compute scarcity point, one thing that I find very suggestive is this: right now people talk about building big fleets of millions of GPUs. If we reach a point, which is probably not in the far future, where you're going to want agents running on your behalf constantly, it's reasonable for every person to want a dedicated GPU just for them, running their agent. And now you're talking about almost 10 billion GPUs that we need. We're orders of magnitude off of that. And so I think part of our job is to figure out how to supply that compute, how to make it exist in the world, but also how to make the most out of the very limited compute that exists right now. That's an efficiency problem; it's also an increase-the-intelligence problem. But I think it's very clear that bringing this to fruition is going to be a lot of work and a lot of building. — So one of the interesting things about agents and their relationship to GPUs is that it's very beneficial to have a GPU close to you. When an agent is acting and doing 200 tool calls over the span of a couple of minutes, it's always doing this back-and-forth between the GPU and your laptop: executing those tool calls, getting that context back, then continuing to reflect. So bringing GPUs close to people is a great contribution there as well, because it tremendously reduces the latency of the entire interaction and the entire rollout. — Gentlemen, a question that comes up periodically is about the future, about labor, about all of this. Number one: learn to code, or not learn to code? — I think it's a wonderful time to... yeah, I agree. Definitely learn to code, but learn to use AI. That to me is the most important thing.
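The round-trip point about tool calls is easy to quantify with back-of-the-envelope numbers. The round-trip times below are illustrative assumptions, not measurements:

```python
# An agent rollout that makes 200 tool calls pays the GPU <-> laptop
# round-trip latency on every call, so distance adds up quickly.
tool_calls = 200

for rtt_ms in (20, 100, 300):  # assumed RTTs: nearby GPU vs. a distant region
    overhead_s = tool_calls * rtt_ms / 1000
    print(f"{rtt_ms:>3} ms RTT -> {overhead_s:.0f} s of pure network waiting")
```

At an assumed 300 ms round trip, a 200-call rollout spends a full minute just waiting on the network, which is why moving GPUs closer to users matters for agentic workloads.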
— There's something tremendously enjoyable about using Codex to learn a new programming language. A lot of people on my team were new to Rust, and we decided to build the core harness in Rust. It's been really great seeing how quickly they can pick up a new language just by using Codex: asking questions, exploring a codebase they don't know, and still achieving great results. Obviously, we also have very experienced Rust engineers who continue to mentor and make sure we keep a high bar. But it's just a really fun time to learn to code. — I remember the way I learned to program was through W3Schools tutorials: PHP, JavaScript, HTML, CSS. And I remember when I was building some of my first applications, I was trying to figure out how to, and I didn't even know the word for it, serialize data. And I came up with some sort of structure that had a special sequence of characters serving as a delimiter. And what would happen if you actually had that sequence of characters in your data? Let's not talk about that. That's why I had to have a very special sequence. And this is the kind of thing where you're not going to have a tutorial that flags this kind of issue for you, but Codex in its code review will be like, "Hey, there's JSON serialization. Just use this library." — Absolutely. — And so the potential to accelerate, to make it so much easier to code so you don't have to reinvent all these wheels, and that it can ask the question for you, or answer the question you didn't even know you needed to ask: that to me is why I think it's a better time than ever to build. — I've learned a lot just by looking at how it solves a problem. Found new libraries, found new methods and stuff.
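The delimiter pitfall in that anecdote can be reproduced in a few lines. The separator and record contents here are invented for illustration:

```python
import json

FIELD_SEP = "||"  # a homemade delimiter-based format

def naive_serialize(fields):
    return FIELD_SEP.join(fields)

def naive_deserialize(s):
    return s.split(FIELD_SEP)

# The data happens to contain the delimiter sequence itself:
record = ["alice", "likes || operators"]

round_tripped = naive_deserialize(naive_serialize(record))
print(round_tripped == record)  # False: the record was silently corrupted

# A standard format handles escaping for you:
print(json.loads(json.dumps(record)) == record)  # True
```

This is exactly the class of bug a tutorial won't warn you about, but an automated reviewer can flag by suggesting an established serialization library.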
That's why I sometimes like to give it a crazy task, like: how would you create your own language model with only a thousand lines of code, and what would you try to do? Sometimes it might fail, but then you look at the direction it tried to take, and you go, oh, I didn't even know that was a thing. — One thing as well is that the people who are most successful coding with AI have also really studied the fundamentals of software engineering and put the right framework in place: the right architecture, having thought about how to structure their codebase. Then they get help from AI while still following that general blueprint, and that really accelerates you and allows you to go much further than you could if you didn't actually understand the code being written. — Since you've launched this, since you've made GPT-5 available and people have been able to deploy things with Codex, what have you seen in usage rates? — Yeah, usage has been exploding. We've seen more than 10x growth in usage across users, and the users that were already using it are using it much more as well. So we're seeing more sophisticated usage, and people are using it for longer periods of time. We have now included it in the Plus and Pro plans with generous limits, and that's contributed a lot to its success. — Yeah, I think the vibes have also really started to shift as people are starting to realize how you need to use GPT-5. It's a little bit of a different flavor. We have our own spin on the right harnesses and tools and the ecosystem of how these things fit together. And once it clicks for people, they just go so fast. — Gentlemen, thank you so much for joining us here and talking about this. Any last thoughts? — Thank you for having us. Yeah, we're really excited about everything that comes next.
I think we have so much to build. Progress continues on the exponential, and really making these tools usable and useful for everyone is core to our mission. — Yeah, thanks for having us. I'm also super excited now that we have Codex and it keeps improving. We're getting accelerated ourselves, building a better Codex every day. And personally, I think I spend more time talking to Codex now than to most people. It's really how I feel the AGI. And I hope more people will be able to benefit from it.
