How I Used Codex CLI to Fix Claude Code

Ray Amjad · 30.08.2025
Video Description
Join AI Startup School & learn to vibe code and get paying customers for your apps ⤵️ https://www.skool.com/ai-startup-school

—— MY APPS ——
🎙️ HyperWhisper, write 3x faster with your voice: https://www.hyperwhisper.com/ - Use coupon code X8RW3ELH for 40% off
💬 MindDeck, an advanced frontend for LLMs: https://minddeck.ai/ - Use coupon code AWHK2ZWF for 40% off
📲 Tensor AI: Never Miss the AI News - 100% FREE
- on iOS: https://apps.apple.com/us/app/ai-news-tensor-ai/id6746403746
- on Android: https://play.google.com/store/apps/details?id=app.tensorai.tensorai

—— MY CLASSES ——
👾 Codex CLI Masterclass: https://www.mastercodexcli.com/ - Use coupon code K5LP2NRK for 20% off
🚀 Claude Code Masterclass: https://www.masterclaudecode.com/ - Use coupon code 6OKODFRW for 20% off

—— CONNECT WITH ME ——
📸 Instagram: https://www.instagram.com/theramjad/
👨‍💻 LinkedIn: https://www.linkedin.com/in/rayamjad/
🌍 My website/blog: https://www.rayamjad.com/

Links Mentioned:
- https://research.trychroma.com/context-rot
- https://arxiv.org/abs/2402.18216

Timestamps:
00:00 - Intro
00:24 - Context Rot
02:45 - The Problem with Claude Code
05:26 - Where Codex CLI Comes In
07:57 - LLM's Task Switching Capabilities
08:33 - How I've Been Using It
09:41 - Comparison to Subagents
10:24 - Conclusion

Table of Contents (8 segments)

  1. 0:00 Intro (101 words)
  2. 0:24 Context Rot (518 words)
  3. 2:45 The Problem with Claude Code (644 words)
  4. 5:26 Where Codex CLI Comes In (551 words)
  5. 7:57 LLM's Task Switching Capabilities (139 words)
  6. 8:33 How I've Been Using It (266 words)
  7. 9:41 Comparison to Subagents (164 words)
  8. 10:24 Conclusion (138 words)
0:00

Intro

So over the last few weeks, OpenAI's Codex CLI has gotten quite good, and I'll be explaining how I've been using it alongside Claude Code over the last few days to achieve even better results than using Claude Code alone. Firstly, why has it suddenly gotten good? Because GPT-5 launched about three weeks ago, and GPT-5 High actually performs quite well when it comes to coding. They've also been releasing updates almost every other day since the GPT-5 launch. Now, before getting into the problem I was having when it
0:24

Context Rot

came to using Claude Code alone, I want to explain what context rot is, because it helps you understand why the problem was happening to begin with. A paper came out about six weeks ago from Chroma on how increasing input tokens impacts LLM performance. We all know the needle-in-a-haystack test that many LLMs are run on: when Google releases a 1-million-token context window, they say it achieves a near-perfect score on needle in a haystack. In that test, you have a long stretch of text that is only loosely related to the needle, plus a needle such as "The best writing advice I got from my college classmate was to write every week." Then you ask a question like "What is the best writing advice I got from my college classmate?" and see whether the LLM can retrieve that piece of information from the long text. What Chroma did is introduce distractors, which are somewhat semantically similar to the needle, such as "I think the best writing tip I received from my college professor (the professor, not the classmate) was to write every day." That's semantically similar to the needle, but it's a different piece of information. They experimented with different types of distractors, from easy ones to challenging ones, and as they changed the number of distractors, the performance of many high-, medium-, and low-performance models decreased. As you would expect, with zero or one distractor, the high-performance models, such as Claude 4 Sonnet and also GPT-5, are fairly consistent. But as the input tokens increase, even with one distractor, you can see performance decrease.
And as you add more and more distractors with increasing input tokens, performance drops to almost 35% in some configurations. They also compared the individual distractors: distractors 0, 1, 2, and 3. The fourth distractor, distractor 3, is the most challenging across the board. Looking back at what it says, "I thought the best writing advice I got from my college classmate was to write each essay in four different styles, but not anymore," it's that trailing "but not anymore" that makes this distractor much harder for different LLMs than any other on the list. I'd recommend reading through the paper, because it's quite interesting, but the result is clear: as you increase the number of input tokens, LLM performance decreases; as you introduce more distractors, performance decreases even more at higher input-token counts; and as the distractors get more challenging, performance drops even further. And basically, the problem that I was having with Claude Code
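The setup described above can be sketched as a tiny prompt builder. This is an illustrative reconstruction, not Chroma's actual evaluation harness; the filler text is made up, and the needle and distractor strings paraphrase the examples quoted in the transcript.

```python
import random

# Illustrative needle-in-a-haystack-with-distractors setup (not Chroma's
# actual harness). All text below is made up for demonstration.

NEEDLE = ("The best writing advice I got from my college classmate "
          "was to write every week.")

# Distractors are semantically close to the needle but differ in a key
# detail (who gave the advice, or what the advice was).
DISTRACTORS = [
    "I think the best writing tip I received from my college professor "
    "was to write every day.",
    "The best writing advice I got from my high school teacher was to "
    "read widely before writing anything.",
    # The paper found trailing negations like this one the hardest:
    "I thought the best writing advice I got from my college classmate "
    "was to write each essay in four different styles, but not anymore.",
]

FILLER = "The weather that semester was unusually mild. " * 50  # padding

def build_haystack(n_distractors: int, seed: int = 0) -> str:
    """Scatter the needle and n distractors through filler text."""
    rng = random.Random(seed)
    chunks = [FILLER] * 4
    insertions = [NEEDLE] + DISTRACTORS[:n_distractors]
    positions = rng.sample(range(len(chunks)), len(insertions))
    for pos, text in zip(positions, insertions):
        chunks[pos] = chunks[pos] + "\n" + text + "\n"
    return "".join(chunks)

QUESTION = ("What was the best writing advice I got from my college "
            "classmate?")

# The model is then asked to retrieve the needle from the haystack:
prompt = f"{build_haystack(n_distractors=3)}\n\nQuestion: {QUESTION}"
```

Varying `n_distractors` and the amount of `FILLER` is what produces the degradation curves the paper reports: more input tokens and more (or harder) distractors mean lower retrieval accuracy.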
2:45

The Problem with Claude Code

over the last couple of weeks is this: I would have my file over here in green, and in it the needle, the thing that needed to be edited to achieve the result I gave Claude Code (Claude 4.1 Opus in this case, because I usually use Opus). Then I'd have a bunch of weak distractors scattered across the codebase. Bear in mind, I'm not the model, so I don't know exactly what counts as a distractor to it, but these are pieces of code that do something similar to whatever the needle is doing. And then, for some reason, Claude 4.1 Opus decides to add another distractor in another file related to the needle: a strong distractor over here while implementing the feature I asked for, and then another strong distractor over there. This happens over a three-to-four-hour coding session, and if you replicate it across dozens more files in the codebase, the distractors start to pile up. Usually when I'm vibe coding with Claude Code, I'm watching a television show at the same time, so I'm just pressing accept, accept, without reading through the code. During that time I reset the context window a bunch of times, because I try not to go over 50% of it. Then when I ask it to implement something, it ends up making a change over here instead of over there where the needle is, and this kind of thing happens across the codebase. Then when I test its implementation, I'm like, hey, this doesn't work, why is the thing it just added not working? I check the codebase, and it turns out there are three or four different functions that all do something very similar.
When I asked it to make an edit, it edited one of the functions, the one used by another part of the codebase, and didn't edit the others. Or when I asked it to do some refactoring or remove something, it would remove references to the thing, say a function, but wouldn't actually remove the function itself from the codebase. Ultimately, my codebase became filled with distractors: dozens across different files, some weak, some stronger, some related to functionality I was adding, some not. I found myself having to intervene more as the projects got bigger and more files were added. I would have Claude Code try to remove some of the duplicate code, and it would only succeed about 50% of the time. Other times I would notice there were multiple functions doing the same thing and have it merge them, and then it wouldn't delete the old function, despite me telling it to. Ultimately, it just became a massive nightmare, because Claude Code would lose the forest for the trees. It would be so caught up in the weeds of executing a task that it didn't pay attention to the bigger picture: it wouldn't realize that there were multiple functions doing similar things that needed to be merged, or that this wasn't the most effective solution given some other part of the codebase. And this is where Codex CLI comes in handy. So I have Claude Code open on the left
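To make the "multiple functions that do the same thing" failure concrete, here is a hypothetical Python example of the kind of near-duplicate helpers that pile up over long sessions. The names and logic are invented for illustration; they are not from the actual codebase being described.

```python
# Three helpers written in different sessions, in different files, all
# formatting a price the same way. Each one is a "distractor" for the
# others: when asked to change price formatting, a model may edit one
# copy and miss the rest.

def format_price(amount: float) -> str:
    """The original helper."""
    return f"${amount:,.2f}"

def display_cost(cost: float) -> str:
    """Added later in another file; same behavior, different name."""
    return "$" + f"{cost:,.2f}"

def price_to_string(value: float) -> str:
    """A third copy; a 'remove duplicates' pass deleted the callers
    of this one but left the function itself behind."""
    return f"${value:,.2f}"
```

All three produce identical output, so a change applied to only one of them silently leaves the other call sites with the old behavior, which is exactly the "it edited one function but not the others" symptom above.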
5:26

Where Codex CLI Comes In

and Codex CLI on the right. This is a real production application I'm editing called MindDeck, where you can run many different LLMs in parallel; you can see over here I can run up to eight different LLMs, and there are a bunch of advanced features that make it good for LLM power users. You can use many different models, all the models available on OpenRouter, you bring your own API keys, and there are more advanced features like importing from ChatGPT and so forth. Basically, I was adding more MCP servers to the application; you can see it over here with Claude Code. It did some research and made an implementation, and then I give everything it did to Codex CLI and have it come up with a critique of the plan it just implemented, or find any problems. You can see the critique over here; it identifies some problems. Then I give this critique back to Claude Code, which makes some of the changes, depending on what it thinks is a good change, and I give the result back to Codex CLI, and I just keep going back and forth between the two. I often end up with much better solutions, and it also prevents duplicates and distractors from arising in the codebase, because one model, Claude Code, is making all the changes and is focused on the weeds, whereas the other model, Codex CLI, has a big-picture overview and understanding of everything that's happening. I find that GPT-5 High, which I'm using in this case, has really good attention to detail and can make good recommendations to other models depending on what they're doing. For some reason, Claude Code is not able to both implement the required features and assess its own work; it kind of needs another model to keep it in check.
And I guess it's kind of like being a human as well: it's really hard to maintain both a close-up view of the codebase, making all those edits, and a big-picture understanding of the entire codebase at the same time. That's essentially what I'm doing here. I have Claude Code maintain the close-up view by editing everything that's required, and I have Codex CLI maintain the big-picture understanding of how the codebase fits together. Oftentimes, without me telling it to, it recognizes the duplicate functions and distractors and suggests that various things should be merged together in some way. In other words, over long coding sessions it's quite easy for Claude Code to lose the forest for the trees, implement distractors, implement duplicate functions, and generally produce a worse solution overall, unless someone else is checking its code as it goes along. In this case I found Codex CLI to be quite good at that, but I'm sure you can experiment with other models and other providers and tools as well. And it kind
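The implement-critique-revise loop can be sketched in Python. This assumes the non-interactive entry points of both tools, `claude -p "<prompt>"` (Claude Code's print mode) and `codex exec "<prompt>"` (Codex CLI's non-interactive mode); flags and behavior may differ across versions, and the prompt wording is my own, so treat this as a sketch rather than the exact workflow.

```python
import subprocess

# Sketch of the back-and-forth loop between an implementer model
# (Claude Code) and a big-picture reviewer model (Codex CLI).
# ASSUMPTIONS: `claude -p` and `codex exec` are available on PATH and
# accept a prompt string; verify against your installed versions.

def critique_prompt(summary: str) -> str:
    """Ask the reviewer for a big-picture critique of recent changes."""
    return (
        "Here is a summary of changes just made to the codebase:\n"
        f"{summary}\n\n"
        "Critique the plan and implementation. Flag duplicate functions, "
        "dead code, and anything that conflicts with the rest of the "
        "codebase."
    )

def revise_prompt(critique: str) -> str:
    """Feed the critique back to the implementer for another pass."""
    return (
        "Another model reviewed your recent changes and wrote this "
        f"critique:\n{critique}\n\n"
        "Apply the suggestions you agree with, and explain what you "
        "skipped and why."
    )

def run(cmd: list[str]) -> str:
    """Run a CLI tool and capture its stdout."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout

if __name__ == "__main__":
    # Implementer does the work and summarizes it; PLAN.md is hypothetical.
    summary = run(["claude", "-p",
                   "Implement the feature in PLAN.md, then summarize "
                   "your changes."])
    for _ in range(3):  # a few back-and-forth rounds
        critique = run(["codex", "exec", critique_prompt(summary)])
        summary = run(["claude", "-p", revise_prompt(critique)])
```

In practice the workflow in the video is manual, pasting summaries and critiques between the two terminals; scripting it like this is just one way the same loop could be automated.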
7:57

LLM's Task Switching Capabilities

of reminds me of this paper, on LLM task interference, which confirms what many people intuitively know. The authors investigate how much worse LLMs perform when switching tasks from whatever they were previously doing. For example, a model could have been doing sentiment analysis, and then you ask it to solve some math problems, and the performance can be slightly worse than if you had just started a new chat and asked it to solve the math problems there. I find it's good to have one tool, such as Codex CLI, maintaining a big-picture understanding of the codebase and critiquing the code or implementation that another LLM, such as Claude 4.1 Opus in Claude Code, was writing. I've been doing this for my other application as well,
8:33

How I've Been Using It

HyperWhisper; there's a coupon code down below for that if you're interested. Basically, I have Claude Code make a bunch of changes, then I give all the changes, including the summary, to Codex CLI and ask it to come up with a critique of everything done so far. Then I pass this critique back to Claude Code and say, what do you think of this? I give it the whole critique, with the recommended changes and all, and after investigating everything that was done, it comes up with a new plan. Then I tell it to do X, Y, and Z, pass the result back to Codex CLI, and keep going back and forth between the two, with one acting as the implementer and the other acting as the big-picture thinker checking the implementation. Over the last few days of doing this, it has led to fewer bugs, fewer distractors, and just better code overall, with better solutions that consider edge cases and so forth. I have also been switching things up in some cases, where I get Claude Code to be the critiquer, or the checker, and have Codex CLI be the implementer. I found that in the case of SwiftUI, it actually works better to have Codex CLI as the implementer and Claude Code as the critiquer slash checker. And I have kind of done this before with subagents, where
9:41

Comparison to Subagents

I had the main Claude Code session holding the big-picture overview of everything and the subagents acting as implementers, and also checkers and so forth. But I found this approach, where I'm using a different model, in this case GPT-5 High (run /model and change it to GPT-5 High), to be better overall. I think that's because GPT-5 has a fundamentally different architecture and different training data, and thinks in a different way, so it's able to critique the code that Claude Code writes better than Claude Code itself can. It's like having someone else critique your work: they'll come up with a better critique because they're a different person, with different life experiences, different "training data" accumulated throughout their life, than if you try to critique your own work. But yeah, I will be continuing to experiment with this over
10:24

Conclusion

the coming weeks. If you want to learn more about Codex CLI and how it compares to Claude Code, I have a previous video about it. But ultimately I'd just recommend using both of them in parallel, with one acting as a critiquer and one acting as an implementer. Anyways, this video is not sponsored or anything; I don't accept sponsors on this channel, because I think it can lead to some kind of bias. This video is made possible and supported by the people who buy my AI products using a link in the description down below. There should be some coupon codes as well if you're interested. I've generally found the products very useful, and if you buy them, you're also supporting an indie developer and a small YouTuber.
