OpenAI's New GPT-5-Codex: First Impressions
Duration: 14:34

Ray Amjad · 16.09.2025 · 5,734 views · 118 likes · updated 18.02.2026
Video description
Join AI Startup School & learn to vibe code and get paying customers for your apps ⤵️ https://www.skool.com/ai-startup-school —— MY APPS —— 🎙️HyperWhisper, write 5x faster with your voice: https://www.hyperwhisper.com/ - Use coupon code BXKYB1QB for 40% off 💬 MindDeck, an advanced frontend for LLMs: https://minddeck.ai/ - Use coupon code JJKIEPVD for 40% off 📲 Tensor AI: Never Miss the AI News - on iOS: https://apps.apple.com/us/app/ai-news-tensor-ai/id6746403746 - on Android: https://play.google.com/store/apps/details?id=app.tensorai.tensorai - 100% FREE —— MY CLASSES —— 👾 Codex CLI Masterclass: https://www.mastercodexcli.com/ - Use coupon code K5LP2NRK for 20% off 🚀 Claude Code Masterclass: https://www.masterclaudecode.com/ - Use coupon code 6OKODFRW for 20% off ————— CONNECT WITH ME 📸 Instagram: https://www.instagram.com/theramjad/ 🐦 X: https://x.com/@theramjad 👨‍💻 LinkedIn: https://www.linkedin.com/in/rayamjad/ 🌍 My website/blog: https://www.rayamjad.com/ ————— Timestamps: 00:00 - Intro 00:13 - Announcement 01:34 - The Codebases 02:53 - Test 1: Distractors / Duplicates 06:00 - Test 2: Implementing New Features 08:50 - Test 3: HUGE REFACTOR 12:23 - Conclusion

Contents (7 segments)

  1. 0:00 Intro (45 words)
  2. 0:13 Announcement (305 words)
  3. 1:34 The Codebases (324 words)
  4. 2:53 Test 1: Distractors / Duplicates (711 words)
  5. 6:00 Test 2: Implementing New Features (606 words)
  6. 8:50 Test 3: HUGE REFACTOR (696 words)
  7. 12:23 Conclusion (527 words)
0:00

Intro

Okay, so OpenAI released a brand new version of GPT-5 called GPT-5-Codex, and I'll be trying it out in this video on real-world production codebases to see how it stacks up against GPT-5, the previous version, and also Claude Opus 4.1. I'll
0:13

Announcement

first be going through the benchmarks, but if you want to skip ahead, there are timestamps down below. Basically, GPT-5-Codex is a version of GPT-5 further optimized for agentic coding in Codex. One of the biggest improvements you can see is that on code-refactoring tasks it gets a much higher score than GPT-5. And during one of their own refactors, they've seen GPT-5-Codex work independently for more than seven hours at a time on large, complex tasks, iterating on its implementation, fixing test failures, and ultimately delivering a successful implementation. And on OpenAI employee traffic over here, measuring how many tokens are used with each turn of the model, whenever it gives a response, the bottom 10% of turns have seen a 90% reduction in the amount of tokens used, which means it should feel snappier and shorter for those well-defined, smaller tasks. And for more challenging tasks, such as over here in the 90th percentile, it should think for longer and perhaps give a better solution. And you can see over here that it seems better at writing high-impact comments and also doesn't leave as many inaccurate comments. Now, it is available in version 0.36, so you want to make sure you've updated to that version. And one of the ways you can tell it's been optimized is if you go to the Codex system prompt and compare it to the original one. So this is the original one over here: it's about 310 lines, whereas the GPT-5-Codex one is about 100 lines. So it seems that a lot of the best practices of the system prompt itself have been baked into the fine-tuning or the training of the model, or basically whatever they've done to
1:34

The Codebases

the model. All right, so I'll be trying out GPT-5 and GPT-5-Codex via the Codex CLI, comparing them against each other and against Claude Opus 4.1 via Claude Code, on these two real-world production codebases. The first is an AI news application that I made that basically helps you stay up to date with the latest AI news. You can download it and use it for free from the Google Play Store and App Store, and it is about 26,700 lines of code, so it's reasonably sized. The other application I'll be testing on is Coll, which is an application that I made independently and then sold for about $200,000. Basically, it's a B2B SaaS targeted at community owners, and it is about 42,000 lines of code. The main thing I want to do with it is a big refactor, just like the 7-hour refactor they stated they did. I did get permission from the new owner, and they said I can use it as long as I don't reveal too much of the code. And if you're interested in the techniques I used to build and scale this, I cover a lot of that in my AI Startup School over here. A bunch of people have gone through it and had success with their own software and applications as well, which you can see over here. And one part people have found especially useful is being able to ask me any question regarding their own applications, such as how they would implement a certain thing, or whether a certain thing is worth making given the market, and being able to get feedback on their ideas and products from myself and other people as well. There's a link down below if you do want to join
2:53

Test 1: Distractors / Duplicates

it. Okay, so the first test is to make sure that GPT-5-Codex does not run into this problem. One thing I noticed with using Claude Code, and GPT-5 in some cases, is that it likes to define the same type or interface again rather than importing it from wherever it's been defined before. This also applies to functions: it likes to define the same function again rather than importing it from where it's already defined in the codebase. And the problem with that is that when it makes a feature upgrade to one of the functions, the functions or types slowly fall out of sync, and then more bugs arise because of that. This problem has happened to me literally dozens of times. One of the ways I got around it is by using another model. For example, I would use Claude Code to write a lot of the code, and then use the Codex CLI to check the code Claude Code had written, to make sure there weren't any duplicates such as this. I have talked about this before in another video called "Codex CLI just fixed Claude Code," which you can watch as well. Okay, an example that I just found in my Tensor AI codebase over here: you can see I have a sponsor config defined as an interface in a shared types file, and I also have a sponsor config defined over here. And I did not do this myself; Claude Code did it when I was vibe coding, and this is a classic example of how bugs can arise in a vibe-coded codebase. All right, so I'm going to update my version of Codex, because it's been a while since I've updated it. So press enter over here, and then I'm going to run claude in one of the folders, the Tensor AI folder, and then I'm going to run codex in an exact duplicate of that folder, Tensor AI Codex, and you can see it says "introducing GPT-5-Codex" here. Press enter. And now for the models, I'll use Opus 4.1, and I will also use GPT-5-Codex high.
And using my tool HyperWhisper (coupon code down below), I'm going to say: hey, so can you find any types or interfaces that are used twice or multiple times throughout the codebase, and then consolidate them to prevent any repetition? And now I'm going to press enter over here and then enter over here. You can see that Claude Code has already started making changes over here, whereas Codex is still exploring the codebase. And now GPT-5-Codex is already done, despite spending more time exploring the codebase. Okay, so I also gave the same prompt to GPT-5 high, and it seems they're all done, so I'm going to compare how good of a job they all did. Okay, so first, comparing GPT-5 high and GPT-5-Codex high: the two solutions are almost identical, where for every onboarding page it replaces the type that's defined at the top, makes a brand new type in a types folder over here, and then imports it into all the pages. And they both identify the duplicate type over here, which I mentioned earlier, but the only difference is that GPT-5-Codex high decides to export this sponsor config over here and then import it into the mobile API route over here. And look at Claude Code: it identified all the same duplicate types, but it also identified two more over here, between the article list provider and the language provider. And then it basically moved all the types into the same shared types folder. So I'm actually not sure here. GPT-5-Codex was slightly better than GPT-5 because it considered one particular case that GPT-5 did not, but it seems that Claude Code was slightly more extensive as well. But if I'm going to pick one of the solutions, I'm going to go with GPT-5-Codex, because it seems simpler and gets the main thing done that I wanted.
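To make that drift problem concrete, here's a minimal TypeScript sketch of how two independently declared copies of the "same" interface can cause runtime bugs. The names and fields here are hypothetical, not the actual Tensor AI code:

```typescript
// Two copies of the "same" interface, declared in different files over time:
interface SponsorConfigA {            // e.g. a copy in mobile/api/route.ts
  name: string;
  url: string;
}
interface SponsorConfigB {            // e.g. the copy in shared/types.ts, later extended
  name: string;
  url: string;
  logoUrl?: string;                   // feature added to one copy only -> the copies drift
}

// Because TypeScript typing is structural and logoUrl is optional, a function
// written against copy B silently accepts objects shaped like copy A, so the
// missing field surfaces as a runtime quirk rather than a compile error:
function sponsorBanner(cfg: SponsorConfigB): string {
  return cfg.logoUrl ?? "placeholder.png";
}

const fromCopyA: SponsorConfigA = { name: "Acme", url: "https://acme.test" };
console.log(sponsorBanner(fromCopyA)); // → "placeholder.png"
```

The consolidation both models performed (one exported interface in a shared types file, imported everywhere) removes exactly this failure mode: a field added to the shared definition is immediately visible to every consumer.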
6:00

Test 2: Implementing New Features

Okay. And now let's see how all three of them compare when it comes to implementing new features. One of the pieces of feedback that I got is adding an unseen marker and an archive option over here. So when I open up the application, you can see these are all the latest AI news, and this is the "all" page. Basically, I think what this person wants is: when I swipe right, there's an archive option where the article no longer appears in the list, and when I swipe left, I can probably mark it as unseen. So in a new chat, I'm going to describe this to all three models and then see how they compare when it comes to a solution. Hey, so on the mobile application homepage, on the list, can you add an option where if I swipe from right to left, there's an option for me to archive the article so I no longer see it? You should add the relevant database migration for this. And if I swipe from left to right, there's an option for me to star the article and unstar it; starring adds it to a starred folder, and unstarring removes it from the starred folder as well. You should make a brand new provider to deal with starring articles, perhaps. Basically, let's see what you come up with. And I'll give all three of them the same prompt, press enter, and see what they come up with. And now GPT-5 high is already done, while GPT-5-Codex high is still running over here. Basically, because it's still going, it makes me think that what they showed here on the graph is true: in the 90th-percentile use cases, it does generate more tokens per response, which probably means it's going to implement a more comprehensive solution. But we'll see what actually happens. And by the looks of it, it seems it might actually finish at the same time as Opus 4.1 here. All right, so it seems that all of them are done, so let's test them out. Okay, so this is the Claude Code solution: if I swipe left, there's an archive button, archive over here, and it disappears.
If I swipe right, there's a star button, but it doesn't actually add it under the starred tab over here. So maybe I did not explain that properly, or maybe it just didn't implement it. All right, so this is the GPT-5 high version, and it seems archive over here works. Star over here. Did it actually make a star? It did make a starred folder right over here, and it seems it actually made two different starred folders. Okay, now here's the GPT-5-Codex solution. If I swipe left and swipe right, it seems to work. Archive seems to work. Star seems to not actually work. Can I unstar it? Starring added it to bookmarks, but it didn't actually move it to the starred folder. And it seems all three of them have a similar issue, where you can kind of see the row when I'm halfway through moving it, but I think that can be fixed pretty easily. So overall, it seems that GPT-5 and GPT-5-Codex are better than Opus 4.1, because they actually made a starred folder like I asked, but plain GPT-5 actually added the article to the starred folder (despite making two of the folders), and GPT-5-Codex did not. So I think this is something that can be easily iterated upon. Okay, now I'm
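As a rough sketch of the state handling this feature needs, here's a framework-free TypeScript version of the archive/star logic. The real apps presumably wrap something like this in a provider (as the prompt asked), and all the names here are hypothetical:

```typescript
// Minimal in-memory model of the swipe actions: archive hides an article from
// the home list; star toggles membership in a "starred" collection. A real
// implementation would persist both sets via the database migration the
// prompt requested.
type ArticleId = string;

class ArticleListState {
  private archived = new Set<ArticleId>();
  private starred = new Set<ArticleId>();

  archive(id: ArticleId): void {        // swipe right-to-left
    this.archived.add(id);
  }
  toggleStar(id: ArticleId): void {     // swipe left-to-right
    if (this.starred.has(id)) this.starred.delete(id);
    else this.starred.add(id);
  }
  // The home list shows everything not archived; the starred tab reads starredIds.
  visible(ids: ArticleId[]): ArticleId[] {
    return ids.filter((id) => !this.archived.has(id));
  }
  get starredIds(): ArticleId[] {
    return [...this.starred];
  }
}

const s = new ArticleListState();
s.archive("a1");
s.toggleStar("a2");
console.log(s.visible(["a1", "a2", "a3"])); // → ["a2", "a3"]
console.log(s.starredIds);                  // → ["a2"]
```

The failures in the test (star not showing under the starred tab) look like exactly the gap between the toggle logic and the view that reads `starredIds`, which is why it seemed easy to iterate on.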
8:50

Test 3: HUGE REFACTOR

curious whether GPT-5-Codex is able to do one of those 7-hour-long refactors that they talked about. Basically, this entire application was made using the Pages Router in Next.js; I'm going to switch it to the App Router instead and see how GPT-5 high and GPT-5-Codex high both perform. Hey, so this application was made using the Next.js Pages Router. Can you rewrite it so it uses the Next.js App Router instead, and then remove the pages directory? Make sure no functionality is lost whatsoever. Everything should be in the same position and place, and basically it should feel like the same application. Okay, so this is going to run on GPT-5-Codex high and GPT-5 high, and bear in mind they're both running in different folders as well. So I'm actually going to go and get lunch and then come back once this is done. Okay, so I just came back from lunch, and it seems that GPT-5 claims to be done over here, having used 330,000 tokens. I reviewed the footage, and it took about 22 minutes in total. Whereas GPT-5-Codex high used 485,000 tokens, and after about 42 minutes of running, it said, "I'm sorry, but this migration is taking longer than I expected. I'm not able to finish converting the entire project from the Pages Router to the App Router." And then I'm going to say, "Are you done here? If not, continue," because sometimes it says that it is done, but it isn't actually done. But honestly, I'm a little surprised that GPT-5-Codex high was running for over 40 minutes on this big refactor migration. Of course, that doesn't really matter unless the code is actually working at the end of it. So I'll be back again once this is actually done. Okay, so now you can see that GPT-5-Codex has run for another 30 minutes, and now it says the context window is completely full. So 1 million tokens have been used after about 33 minutes plus the earlier 40 minutes, so roughly 73 minutes. And interestingly enough, it says it's now hit a wall with finishing the migration.
Anyways, I can probably clear the chat and then tell it to continue with the migration, and that might take another hour or two, especially because GPT-5 only used 403,000 tokens, whereas GPT-5-Codex basically used a million. All right, so you can see here GPT-5 changed 127 files, and basically there is no more Pages Router; everything is in the App Router here. And I guess it does look good, but let's run it just to make sure, so I'll do npm run dev. It does look kind of concerning, though, because it deleted 5,355 lines but only added 3,76 lines, so maybe some things will be missing in this codebase now. All right, well, it's not loading. It seems to have not done a good job, because it is missing a lot of "use client" directives. So overall, I guess it made a lot of changes but didn't actually implement them properly. Now let's stop that and try again with the GPT-5-Codex version: npm run dev. It seems to have made or edited this many files, but I don't think it actually deleted the Pages Router yet, because it didn't finish the migration; it hit the context window before then. So let's delete the Pages Router ourselves and see how well this works. Yeah, and like it said, it can't resolve TanStack Query. So let me actually clear the chat and get it to try again, and just to be fair, let's make GPT-5 high try again as well. All right, so it seems that GPT-5-Codex fixed its React Query error, but now it's giving a totally different error to do with "use client," just like GPT-5 was. And as for GPT-5, let's check over here, and you can see the landing page flashes very briefly and then runs into this error where a key value is missing or something
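For context on those "use client" errors: in the Next.js App Router, every component is a Server Component by default, so any file that uses hooks or event handlers has to opt in with a directive on its first line, which is exactly what the migrated pages were missing. A minimal sketch of what each interactive page needs after the move (illustrative path and component, not the actual repo):

```typescript
// app/page.tsx -- App Router version of a former pages/index.tsx.
// Without the "use client" directive below, useState and onClick fail,
// because App Router files are Server Components by default.
"use client";

import { useState } from "react";

export default function Home() {
  const [count, setCount] = useState(0);
  return <button onClick={() => setCount(count + 1)}>{count}</button>;
}
```

A Pages-to-App-Router migration has to make this server/client decision for every component that touches state, effects, or browser APIs, which is part of why a 100-plus-file rewrite is so easy for a model to leave half-finished.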
12:23

Conclusion

like that. So honestly, I think with this new upgrade to the Codex CLI, while it may actually be better at code-refactoring tasks, that's probably limited to one file or a couple of files, not over 100 files across the entire codebase like what I just attempted. I have read online that they haven't really released this dataset, but I'm imagining you probably need closer to 70 to 80% on that refactoring benchmark for it to actually pull off the bigger refactor I asked for. But I think what this also means is that for many people working in legacy codebases, rather than rewriting the legacy codebase themselves, they'd probably wait another six months or a year for the models to get good enough to refactor it into Rust or some other programming language. That said, some people may be able to achieve good refactoring right now if they have enough test coverage across their codebase that the Codex CLI can constantly check against. But honestly, the most surprising thing to me is that the Codex CLI is able to run for 35 to 40 minutes at a time without being interrupted; I think you can probably push it to over an hour. And there are probably many other use cases that are not coding-related that you could use the Codex CLI for. If it's running for over an hour, you can probably get it to do incredibly deep research on a particular topic, much deeper than many of the deep-research tools you'd find online. Or you can just have it doing other things as well. At least when it comes to software-engineering benchmarks, I don't think the jump is big enough to notice a massive difference, in the sense that it can now do many things it previously couldn't, for example. But I will be trying it out more over the coming week, and I'll make another video if I find any good use cases. Anyways, I think it's pretty good that Anthropic now has a serious competitor when it comes to coding models, at least.
It will probably push them to release Sonnet 4.5, Opus 4.5, or whatever their next model is, much sooner. And maybe one of the reasons why Claude Code has recently dropped in quality, as many people are claiming, is because they're now pushing for that release date to be sooner, and they're using some of the GPUs that would otherwise have been used for Claude Code to train, or finish off training, the next model. Anyways, if I do find more interesting things about the Codex CLI, I will make more videos about them. So do subscribe if you want to see that as well. And if you are interested in the techniques I used to scale and sell that previous software, I cover a lot of that in my AI Startup School. There's a link down below if you're interested.
