"Hear how OpenAI engineers use Codex to rethink how code gets written, refactored, and merged.
From pair programming in your local environment to delegating tasks to the cloud, this session focuses on how Codex is unlocking a faster, more creative future for coding."
Table of contents (6 segments)
Segment 1 (00:00 - 05:00)
I'm — and I'm here at OpenAI, and I build Codex. With Codex, we're building an AI software engineer. I personally like to think about it a little bit like a human teammate. You can pair program with it on your computer. You can delegate to it, or, as you'll see, you can give it a job without explicit prompting. There's recently been a massive vibe shift. This started in August, where we had pretty decent usage, and since then, thanks to all of you, we've grown tenfold. Today I want to start by sharing some of the recent updates that have created this vibe shift. Then we'll bring up some engineers from OpenAI to show you examples of how we use Codex day-to-day. Some of them are building here on the Codex team; some of them are just really excited users of Codex at OpenAI. Let's first talk about some of those updates. Codex now works everywhere you build, whether it's in your IDE, your terminal, GitHub, the web, or mobile. No matter where you are, it is the same powerful agent under the hood. The first and most important improvement we made was to completely overhaul the agent. We think of the agent as a combination of two things: the reasoning model under the hood, and its tool harness that allows it to act and effect change upon the world to create value for you. First, the model. In August, we shipped GPT-5, our best agentic model thus far. That was until we listened to your feedback and improved upon it by shipping GPT-5-Codex, a model further optimized for work within Codex: smarter, better at following code style, and adaptive in its thinking time. One of my favorite quotes from your feedback was that it feels a little bit more like a true senior engineer, because it gives so few compliments and it also pushes back on bad ideas.
Next, we completely rewrote the harness to make the most of the new models: we added support for planning, MCP, auto-compaction (so that you can have these really long conversations and interactions), and so much more. At this point, we started seeing CLI usage take off. But there was more feedback: the model felt really good, the agent was useful, but the CLI felt early. We appreciated the feedback, and so we decided to completely revamp the Codex CLI. We simplified approval modes, created a more legible UI, and added a ton of polish. By default, it works with sandboxing, so it is safe by default, but you always have control. It's been a work in progress; we shipped a big update last Friday, and we'll ship a new release today. Again, more feedback from you all: a bunch of you collaborate with the agent and want to look at the code at the same time. This is why we shipped it directly in the IDE as a native extension. It works alongside your code, with you keeping control of your IDE; you get this little collaborator. It works in VS Code, and it works in Cursor and other popular forks. This immediately took off: within the first week, we had 100,000 users. Many of you, I'm sure, are in this room. A lot of our users prefer to use Codex in their IDE directly. Part of the magic here is that it is the exact same agent: the same open-source harness that powers the CLI, bundled right within the extension. At the same time, we were also upgrading Codex Cloud so that you could run many more tasks in parallel. For us, this is still the beginning, but we think it's incredibly cool to be able to command Codex through your phone. Cloud tasks now run 90% faster. They can set up their dependencies automatically and verify their work by taking screenshots and sending them to you. Giving the agent its own computer really feels magical when it works. And then you can start working with agents like this in
Segment 2 (05:00 - 10:00)
tools like GitHub or, now, Slack. Here is an example: one engineer, whom some of you might know, had a question, and then another engineer, Steve, immediately jumps on it and delegates it to Codex. Here Codex receives the entire context from the thread and just gets to work. A couple of minutes later, it posts the solution together with a summary. It actually went and explored the whole problem and wrote some code. All of this progress means that we can write code so much faster, which also means that we collectively have a lot of code to review. Validating and reviewing code is now becoming a huge bottleneck. We've been thinking about this for a while. Past experiments with code review at OpenAI showed that it could be useful but also, oftentimes, noisy. Previous attempts had to be turned off because users were complaining about the lack of signal. So we purposely trained GPT-5-Codex to be great at ultra-thorough code review. It goes through the dependencies and all the code in depth inside its little container, truly explores the contract between your intent and what actually happens in the implementation, and then comes back with high-quality findings. We now find that many teams decide to enable it by default, and some want to make it mandatory, because it produces such high-signal findings every time. You can trigger it while pairing with Codex, or you can completely automate it by running it on every PR in GitHub. Okay, it's been a busy few months for a small, growing team. We've been using Codex to build Codex. There's really no way we could have done it without it. Even more fun has been seeing OpenAI as a whole get accelerated. Today, 92% of OpenAI technical staff, almost all of them, use Codex daily, up from 50% around last July. Engineers who use Codex submit 70% more PRs per week. And pretty much all PRs are reviewed by Codex. When it finds an issue, people are actually excited. It saves you time. You ship with more confidence.
There's nothing worse than finding a bug after we actually ship the feature. When we as a team see the stats, it feels great. But even better is being at lunch with someone who then goes, "Hey, I use Codex all the time. Here's a cool thing that I do with it. Do you want to hear about it?" And so we wanted to give you a taste of that. So let's get to lunch with a few teammates and hear their stories. They'll show you real workflows from our teams and how they use it every day. Please welcome Nacho to the stage to talk about iterating on UI for the ChatGPT iOS app. Thanks. Hello, my name is Nacho Soto. I'm a member of the core iOS team at OpenAI. I'm going to do two things today: I'm going to tell you about a workflow that I use frequently when building the ChatGPT app, and I'd like to share a demo that shows you how I do this. Let's start with the demo. Tibo asked me to build a weather app. So I have a starter project with just an empty window, and I've also asked ChatGPT to make a mockup of what I want the UI to look like. So I'm going to ask Codex to implement that design. Great. While that's running, let me tell you what's special about how Codex is going to implement this. Working on the ChatGPT core team means I spend a lot of time on infrastructure and performance, but I also do some amount of front-end work. Recently, I worked on this small feature where we simplified our personalization screen to make our new ChatGPT personalities more discoverable. And I'm sure you've run into something like this before, where that last 10% of polish, like getting these headers and footers aligned, was actually taking 90% of the time. But Codex can help you with that 10%. And it can work on that while you do something else, maybe watching some of the other DevDay talks. You can even have nine other terminal tabs running Codex if you want to be a true 10x engineer.
Who here has been sent a pull request from a junior engineer and, within a few seconds, known that they didn't actually test it, because there's no way that it works?
Segment 3 (10:00 - 15:00)
If you used ChatGPT or any other agent six months ago, you were working with that junior engineer. But Codex is not one. Like Tibo said, I would argue that Codex is now a senior engineer. It doesn't just write the code and assume that it works; it will verify that it does. I'm a big fan of TDD, test-driven development, and I think Codex really thrives with that workflow. It will run your tests, fix the code, and run your tests again, over and over, until they pass. But why stop at unit tests? Codex is multimodal, which means it can also verify its work visually. A few weeks ago, we gave Codex the superpower of being able to see images. So I taught it to generate snapshots for the UI code that it writes. And best of all, it's actually very simple. First, I made this very simple Makefile that runs the unit tests to extract the SwiftUI previews; that calls a small Python script, which Codex wrote, by the way, and that script extracts those images and puts them in a folder so Codex can find them. Then, in the agents file below, I've just told Codex about that script and asked it to use it to verify its work. We use this workflow to build the ChatGPT iOS and Mac apps, but you could do the same on web, for example, with tools like Storybook or Playwright. So that's my workflow: I give Codex some tools to generate screenshots so it can verify the UI code that it writes. Let's check in and see how Codex is doing. Okay, so if I scroll back, it looks like it read some code and started with a plan: review the existing code, implement the UI, and provide preview data to verify that it's good. Great, it looks like it wrote all that code and ran the snapshot tests. Cool. Not bad for three minutes. Let's go ahead and run the app. Cool. So, obviously, it's a very simple example, but it actually scales, without many changes, to much larger projects like the ChatGPT app. And it can run for many hours, depending on the task, iterating over and over until it's pixel perfect.
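The collection step in this workflow (the small Python script that gathers generated snapshot images into a folder Codex can find) might look roughly like the sketch below. This is a minimal illustration, not the actual ChatGPT-app script; the directory names `build/test-results` and `snapshots` are assumptions.

```python
"""Collect snapshot images produced by a test run into one folder.

A minimal sketch of the helper described in the talk. The default paths
are hypothetical; point RESULTS_DIR at wherever your test runner writes
its SwiftUI preview snapshots.
"""
import shutil
from pathlib import Path

RESULTS_DIR = Path("build/test-results")  # where the test run writes images (assumption)
SNAPSHOT_DIR = Path("snapshots")          # the folder the agents file tells Codex about

def collect_snapshots(results_dir: Path = RESULTS_DIR,
                      out_dir: Path = SNAPSHOT_DIR) -> list[Path]:
    """Copy every .png under results_dir into out_dir and return the copies."""
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for image in sorted(results_dir.rglob("*.png")):
        target = out_dir / image.name
        shutil.copy2(image, target)  # preserve timestamps alongside contents
        copied.append(target)
    return copied

if __name__ == "__main__":
    for path in collect_snapshots():
        print(path)
```

A Makefile target would then run the test suite and invoke this script, and the agents file would tell Codex to run that target and inspect the images in `snapshots/` after every UI change.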
And speaking of working for many hours, I'd like to pass it over to Fel, who's going to show us how to scale these verification loops to run for longer periods of time and on more complex problems. Thank you. Thanks, Nacho. I'm Fel. Here at OpenAI, I've set high scores for the longest sessions and the most tokens produced. I'm known as the guy that gets Codex to do this, for being able to use Codex to one-shot big features and complex code changes. I've seen the GPT-5-Codex model work for over seven hours productively. That was my prompt. Or process more than 150 million tokens over the course of a marathon session. This is one of those: refactor my personal JSON parser project. And for large projects like this, there can be long periods of time where it seems like all of the tests are failing until the work is complete, especially when you're making that core change. Now, this is a JSON parser built for streaming tool calls, a parser for the AI age. And this PR has over 15,000 lines of code changed. It was created over many hours of work from Codex, but only a few minutes and a handful of prompts from me. Let's walk through how I go from prompt to pull request. We'll do this in just a couple of prompts. First, we'll tell Codex that we want a plan to implement our feature. Then we're going to review that plan and tell it to execute. And finally, we ship. Here, I've opened my project in VS Code, and I'll open up our Codex extension as well. I have a fairly complex feature I want to implement, and I've prepared that in a document for it to read. I'm going to tell it that I want a plan to implement this feature. I've described the end state, but I want it to do the heavy lifting for me and research how to integrate this library into my parser. So what I do is ask Codex to write a spec, and I'm going to go ahead and kick that off. And actually, I'm going to turn off auto-context here. A little bit of an aside.
I've rehearsed this a few times and it's actually found finished specs
Segment 4 (15:00 - 20:00)
from my git history, cheated, and copied right to the end of the process. So I'm going to have it really do the work. And you'll notice I don't need to tell it a lot. I've given it my example, and I've told it to do some research and follow the example of the code that I've already specced. I've abbreviated everything here so we don't have to read all of it; it's 160 lines. But really, what I'm doing here is writing a design document for design documents. Codex is now a senior engineer, after all, so we should be asking it to write some of its own design docs. This plan is going to be a living document. It's going to describe the big picture. It's going to have a to-do list and progress that it keeps up to date. And why do I keep on saying "exec plan"? I'm doing that because I want to give the model a term to anchor on, so that when I use the term "exec plan", it knows to use plans.md to design the work, iterate on it, and follow up. It's good to give it a term that's unique, so that it knows to reflect back on it, and so that when I say it, it means something special, not just any design doc or implementation spec. So in this spec, we've got our progress, our surprises and discoveries; we even have a decision log in here to keep track of what it's been working on. Now, normally I don't ask engineers to write this much. I only do that when maybe I don't like their project. But in this case, this helps Codex steer toward a completed project. It is its memory as it works on this large plan. And after this talk, we'll upload the plans.md recipe to our OpenAI Cookbook so any of you can adopt it in your repositories. Now, how does it know how to use this plans.md? As I mentioned earlier, I've used my AGENTS.md. I drop a couple of lines in there, just a few instructions: when you're working on something complex, this is what an exec plan is; refer to plans.md; make sure that you're following it.
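The AGENTS.md instructions described here could look something like the following sketch. The exact wording is hypothetical; the structure (living document, to-do list, surprises and discoveries, decision log) is what the talk lists.

```markdown
<!-- AGENTS.md (excerpt) -- a hypothetical sketch of the instructions described above -->

## Exec plans

When working on a complex, multi-hour task, maintain an "exec plan" in plans.md.
An exec plan is a living document: keep it up to date as you work. It must contain:

- A big-picture description of the goal and the intended end state
- A to-do list, with items checked off as they are completed
- A "Surprises & discoveries" section for anything learned along the way
- A decision log recording each significant choice and why it was made

Whenever you are asked for an "exec plan", design it in plans.md, iterate on it
there, and refer back to it before and after each major step.
```

The unique term "exec plan" is the anchor: any time it appears in a prompt, the model knows to fall back on these rules rather than treat it as a generic design doc.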
Now, as you can see, it's doing quite a bit of research on the side here. So let's go ahead and look at a completed spec. I've switched over to a completed session here, and it's written my spec. Let me open up that plan so I can review it and give it feedback. Okay: quite a lot of words, but it is what I wanted to do, and it has a plan. Looks like a couple of spikes, some features that it wants to implement, and, of course, documentation. That looks good to me, so I'm going to go ahead and tell Codex: let's go ahead and implement. And we can't type today. There we go. And so, while that runs, I like to keep an eye on Codex. I keep something scrolling on my screen; my manager knows that I'm still working. And I like to watch the tests. So what I'll do is kick off these tests. They run very fast. Codex helped me write all of these, by the way, from simple unit tests to exhaustive property tests; there's even some fuzzing in this crate. So I'll keep an eye on this, and if it stays red for too long, I might intervene and say, you know, Codex, maybe we need to back out; maybe that plan is going a little off the rails. All right, let's go ahead and look at what it's completed in this project. I'm skipping ahead to Codex having finished that task. By the way, that took over an hour in my previous session, so we're skipping ahead quite a ways. And it looks like it's written some new tests. They're all passing, which is great. Let's go ahead and look at the changes. Okay, wow. It looks like it vendored in, and even maybe forked or updated, the upstream library to make the changes it needed. Now, again, I don't have all day to read all of this to you. So I'm going to go ahead and open up the plan again, and I can see the progress: it's checked off some big items, it's completed some spikes, and it's kept plans.md updated. My spec requires that all of these plans be living documents, so I can use this as an executive summary of what it's accomplished. That way, I don't have to read all the code myself. Okay, it looks like it's done and the tests are passing. So, what I've
Segment 5 (20:00 - 25:00)
shown you today is that we can go from an implementation idea, a feature, a prompt, to a PR in only a few steps. Rigorous planning and thorough testing enable the model to work on a feature for a sustained period of time. And let's just see how many lines of code it's written. Okay: 4,200 lines of code in just about an hour of work. Incredible. Now, I could just merge this as is, but I would really like another set of eyes on this code. Thankfully, we have Daniel up next to talk about code review. Hello. All right. My name is Daniel, and I am an engineer here on the Codex team. So today I want to talk about code reviews. As Tibo mentioned, we launched code reviews on GitHub a couple of months ago, and it has been a huge hit, both externally but especially internally. We love code reviews. We have them running on all of our PRs, and it's finding so many bugs that we would have otherwise missed; some of these bugs are so complex that you have to read and reread the comment a couple of times to even understand what it's saying. So I highly recommend you enable code reviews for all your GitHub PRs. Here's an example on one of my PRs on the Codex repo; it's open source. So I pushed a feature, and then immediately Codex started reviewing my code, and it found a P1 issue. Great. So then I said, "Thanks, Codex. Please fix it." And that kicked off a background task to make that change. And then, once that got merged, I said, "All right, Codex, now that you have this change, review it again; make sure we don't have any issues." And then it found another issue. And then I was just embarrassed. So this got me thinking: what if you could have a workflow where you create a feature, and then you review it for bugs, and then, if there are any bugs, you fix them, and then you review again, and then you fix and review and fix and review until, theoretically, your code doesn't have any issues?
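The fix-and-review loop described here can be expressed abstractly: alternate a review pass and a fix pass until the reviewer reports no findings, with a round cap so the loop always terminates. This is a sketch of the control flow only; the `review` and `fix` callables are hypothetical stand-ins for a Codex review pass and a fix-it prompt, not a real API.

```python
from typing import Callable

def review_fix_loop(review: Callable[[], list[str]],
                    fix: Callable[[list[str]], None],
                    max_rounds: int = 5) -> list[str]:
    """Alternate review and fix passes until the reviewer finds nothing.

    `review` returns a list of findings (empty means clean); `fix` attempts
    to resolve them. Each review runs from scratch, mirroring how a review
    thread gets a fresh context. Returns the findings remaining after the
    final round (empty on success). Capped at max_rounds so it terminates.
    """
    findings = review()
    rounds = 0
    while findings and rounds < max_rounds:
        fix(findings)
        findings = review()  # fresh pass over the updated diff
        rounds += 1
    return findings
```

For example, with a fake reviewer whose bug list shrinks by one per fix pass, the loop converges to an empty findings list after two rounds.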
So we decided to make this super easy by bringing code reviews local as well, and I'm going to show you how to do that with slash commands. This is what I do every day before I even submit the PR. Okay, so I'm working on a little feature. You can see it has three different commits; it's a pretty small one. And I have the CLI running on the side. So all I have to do is write /review, hit enter, and then you'll see there are a couple of different options here. The first option is reviewing against a base branch, just like a PR. This would take all of the commits on your branch, compare them to main, just like a normal PR, and then look at the whole diff and try to find any issues with it. There are other options too, like reviewing uncommitted changes, or a specific commit, or custom review instructions. But what I usually do at the end of the day, when I have a bunch of different commits, is just review the whole thing. So I select the first option. Now I have to select a base branch; usually main is the first one, so I hit enter again. And now the code review begins. So, a question I get is: why is it so good? Why is GPT-5-Codex so good at code reviews? Because we actually trained it specifically on finding very technical bugs, and it will go on for a very long time researching all sorts of different files. And then, when it has a hypothesis for something that could be wrong, it will even write tests and scripts and execute them, to make sure that it gives you one, or at most maybe two, critical issues that you have to fix before you land your PR. It doesn't give you 20 or 30 different things that it one-shots from just looking at your diff. It doesn't waste your time. So, yeah, there's actually a bug here. If anyone guessed... oh, nice, it got it for us. It's a P0. Great. And it's exactly correct: we aren't supposed to be hard-coding the string here in the code. We should be getting
Segment 6 (25:00 - 28:00)
that dynamically. So all I have to do now is tell Codex: please fix. And usually I don't even read the comments; it just goes. But yeah, the nice thing about reviews in the CLI is that a review actually spawns a separate thread from the parent. So let's say you've been working on this feature and your session is super biased: you know you have to do this feature like this, you have to implement it like that. The review thread is separate. It has a fresh pair of eyes, a fresh context, a new chat. So it doesn't have that same implementation bias, and it'll help find these bugs for you. So, yeah, that is going to go ahead and, you know, give us some changes. While that runs, I want to actually show you how you can enable reviews on all your PRs. Go to chatgpt.com/codex, and then you just connect your GitHub, and there's a button here called "enable code review". This will take you to the code review settings, where you can have repository-level settings to say, I want this repo to get code reviews and I want that one not to. But I just have this toggle over here that says: review all of my pull requests; please make sure I don't ship any regressions to prod. So let's go back. Fantastic, it made the change. Let's see. That looks correct. Yeah, now it's getting the prompt directory dynamically. So now that this is done, what I want to do is run /review again. So I hit /review, enter. Great. This will start another review thread, and once that finishes, hopefully it won't find any issues. But if there are any issues, you can continue again. And then, once that's done, it gives you a thumbs up. You commit, you push to git, and then you get one final thumbs up from Codex on your PR, and you're merged. So that is what I do every day: using /review in my daily workflow before I even create a PR. Thank you so much, and I'll hand it back to Tibo to wrap it up. All right, folks. That's it.
I hope today's demos gave you a glimpse of how we're shipping faster and with more confidence with Codex, and a little bit about where we're going. If you haven't tried Codex yet, just npm install it. This will give you Codex right in your terminal: you just type codex and you can get going and use a lot of the things that we demoed today. Everything we showed you today is real, and you can use it right away. Gabriel Peal, one of the people here working on the Codex team, actually just sent me a message that v0.45 of the CLI is out, like, right now. It has a few incremental updates and also support for OAuth MCP, which I think is very cool. So just go and install it, and this will give you the latest version. And then, if you want to hang out with a few of the people building Codex, just come and join us at the booths. There will be some of us there, and also some of the, you know, top users of Codex here at OpenAI. We also have a Q&A on Discord that you can join, which will start shortly. So come and say hi, don't be shy, and thank you for joining today.