Claude Opus 4.5 and Gemini 3 Pro just dropped within days of each other, so I put them head-to-head in a quick side-by-side test. Both models claim major gains in coding, reasoning, and agent-style tasks — but they perform very differently.
In this video, I run the same prompt through both models to see which one actually comes out on top.
This channel covers fast, clear updates on the biggest moves in AI, with breakdowns you can actually understand.
For hands-on demos, tools, workflows, and dev-focused content, check out World of AI, our channel dedicated to building with these models: @intheworldofai
🔗 My Links:
📩 Sponsor a Video or Feature Your Product: intheuniverseofaiz@gmail.com
🔥 Become a Patron (Private Discord): /worldofai
🧠 Follow me on Twitter: /intheworldofai
🌐 Website: https://www.worldzofai.com
🚨 Subscribe To The FREE AI Newsletter For Regular AI Updates: https://intheworldofai.com/
0:00 - Intro
0:32 - What's New
1:05 - Benchmark Results
3:13 - Opus 4.5 vs Gemini 3 Pro
7:04 - Tool Search Tool
8:36 - Outro
#Opus45
#Gemini3Pro
#ClaudeAI
#Anthropic
#GoogleAI
#AIupdate
#AIcomparison
#FrontierModels
claude opus 4, claude opus, claude opus 4.5, claude opus 4.1, claude 4 opus, claude opus 4.5 demo, claude opus 4.5 test, claude opus 4.5 review, claude opus coding, claude opus 4.5 coding, claude opus pricing, opus vs claude, gemini 3 pro, opus 4.5 vs gemini 3, claude opus 4.5 model, claude opus 4.5 2025, claude opus review, claude opus 4.5 app build, claude 4.5 vs opus 4.1, claude 4.5 vs opus
Claude Opus 4.5 is here. Last week we got Gemini 3, we got Codex Max, and now Anthropic has entered the ring with a brand-new frontier model. And according to their own benchmarks, this is now the best model in the world for coding, agents, and computer use. Today I'm going to walk through the announcement page with you, break down the benchmarks, compare it to Gemini 3 Pro and GPT-5.1, and then look at the new developer features like advanced tool use and effort control. So let's get into it. All right, so this is
the announcement page for Claude Opus 4.5, and they're claiming that this new model is intelligent, efficient, and the best model in the world for coding, agents, and computer use. So they're already positioning this as the go-to model for real work: deep research, spreadsheets, slides, debugging, using computers, all the stuff people actually rely on AI for day-to-day. And they say it's a step forward in what AI systems can do and a preview of larger changes to how work gets done. The rest of the page is them actually backing up that claim. SWE-bench Verified is known as the gold standard for software engineering, and Opus 4.5 currently sits at 80.9%. Before that, Sonnet 4.5 was at 77.2%, so that's a solid jump. Then if you look at the other frontier models that launched literally days ago: Gemini 3 Pro sits in the mid-70s, currently at 76.2%. Then we have GPT-5.1 Codex Max, which is at 77.9%, and GPT-5.1 at the medium effort level at 76.3%. This is the benchmark Anthropic wanted to dominate, and they're showing that they did. Even though the chart is zoomed in and exaggerates the gap a little bit, the model is still at the top. And I appreciate that Anthropic directly compares themselves to the very latest releases; it shows that they're confident. This bigger table just reinforces that pattern: on the Terminal-Bench 2.0 benchmark, Opus 4.5 is at the top at 59.3%. Then on agentic tool use, which you could think of as frontier performance, Opus 4.5 once again comes out on top: Gemini 3 Pro is at 85.3% and Opus is at 88.9%. On computer use they're once again at the top at 66.3%. So we can see they're leading in agentic coding, terminal coding, tool use, skill use, computer use, and novel problem solving. They've also shown areas where they're not winning, and it's these areas at the bottom. When it comes to graduate-level reasoning, Opus is at 87%, but the new model from Google, Gemini 3 Pro, sits at 91.9%. When it comes to visual reasoning, Opus sits at 80.7%, and the leading model is actually from OpenAI: GPT-5.1, which sits at 85.4%. When it comes to multilingual Q&A, we know Gemini is really good at that, and it sits at 91.8%, with Opus behind by a little at 90.8%.
All right, so I'm on LM Arena right now, and I want to compare Claude Opus 4.5 to Gemini 3 Pro and see which model produces a better result. The prompt I'm asking both of these models is to create a brand-new futuristic mobile app called Dreamweaver that lets people record dreams, remix them, and export them as stories. So it's very abstract on purpose, forcing the model to use some creativity; we want to test its ability to generate something that's not really straightforward. And I want the output in this order: a one-sentence app pitch, key features in five bullets, a complete four-step app onboarding flow, and three ASCII UI mockups: a home screen, a dream recorder screen, and a dream remix editor. Each mockup must use ASCII art, include buttons and icon labels, and have a layout that fits inside a smartphone frame. So it's a very complex, not-straightforward prompt, but we want to see what the model is able to create when it's given an ambiguous task like this. So, let's see what it does. It looks like Gemini 3 finished generating already, and Claude is still going, but it looks like Claude is adding more elements in there. So, let's see which one is better. First, let's look at Claude. This is the one-sentence app pitch: Dreamweaver transforms your fleeting night visions into shareable stories. Okay, looks pretty generic. For key features, it added voice stream capture, a dream library and pattern insights, a remix studio, a story export engine, and a dream marketplace. Now, when we look at the ASCII art it generated for the visuals, it looks a little bit choppy, so it's not super clear. As we can see, it does not look like a complete smartphone. Uh, but okay. So you have welcome and permissions: okay, pretty basic stuff. Sleep profile setup, which looks a bit odd. Then we have our record option. The art looks a little bit cut off.
So it's not 100% accurate, but, you know, not bad for what it was able to create. And even the UI mockups don't look super accurate. We can see here that all the art is cut off and it's not really coherent, but it has added elements like emojis, which is cool to see. So, yeah, it's not bad. Uh, let's look at Gemini 3. All right: Dreamweaver utilizes advanced generative AI and biofeedback syncing to visualize your subconscious. Okay, this one looks a little more professional, less creative compared to the Claude one. We can see it's straight to the point: neuro-voice capture, generative dreamscapes, a lucidity slider, sleep-to-script export, the collective. So this is once again more professional, more straight to the point. Then we look at the onboarding flow: sync and collaboration, voice print analysis, style selection, the first thread. Okay, yeah. If I was presenting this in front of executives, this is something I would use rather than the Claude one. Now if you look at the UI mockups, they actually look like they fit into a smartphone. You can see it added a clock, a Wi-Fi signal, and a battery percentage, and the mockup is obviously much better compared to the Claude one we saw. Once again, it fits into the smartphone layout; it's not breaking. The buttons are there: pause, finish, everything like that. Keywords detected. And then in the last generation it also added the sliders it talked about, the remix sliders: darkness, coherence. So, in my personal opinion, I think Gemini 3 Pro was better, so I'm going to vote for that one. But let me know what you guys think in the comments: which output did you like better? Now, let's talk about
two upgrades that really matter if you're building agents or doing long-running tasks: tool use and effort control. So, first, tool use. With the rise of MCP servers, every tool comes with big chunks of metadata: names, descriptions, JSON schemas, examples. And all of that gets shoved into the model's context window before you even start solving a problem. Some tool servers eat up 20,000-plus tokens just in definitions. Anthropic's solution in Opus 4.5 is something they call advanced tool use, which basically lets the model search for tools instead of loading everything up front. So instead of dumping every tool into the prompt, Claude only pulls in the specific tool it needs, when it needs it. The result is simple: way less context bloat, way more room for actual problem solving. Then there's effort control, which is basically a slider that lets you choose how hard the model thinks. Lower effort equals faster and cheaper; higher effort equals deeper reasoning and more accuracy; and medium is a balanced approach. What's cool is that Opus 4.5 is not just more powerful, it's more efficient. At medium effort, it matches Sonnet 4.5's best coding score while using 76% fewer tokens. And at higher effort, it beats Sonnet while still using about half the tokens. So together, tool search plus effort control give you two things: cleaner, more scalable agent systems, and better accuracy per token. It's not just about raw intelligence; it's smarter handling of context and smarter use of compute. If
you enjoyed this video, this is what we do here: fast, clear updates on the biggest moves in AI. If you want to stay ahead of everything happening in this space, make sure you're subscribed. And if you want the hands-on side (demos, tools, workflows, and everything developers can actually build), check out World of AI. We also run a simple, no-noise newsletter that gives you the most important AI tools and updates in just a couple of minutes. Subscribe here, follow World of AI, and join the newsletter.
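A quick footnote for developers: the tool-search idea covered in the video can be sketched in a few lines of Python. This is a toy illustration of on-demand tool loading, not Anthropic's actual API; the registry, tool names, and keyword-matching logic are all made up for the example.

```python
# Toy sketch of on-demand tool loading: instead of putting every tool
# definition into the prompt up front, keep a registry and only inject
# the definitions that look relevant to the current task.

TOOL_REGISTRY = {
    "get_weather": "Fetch the current weather for a city. Schema: {city: str}",
    "search_flights": "Search flights between two airports. Schema: {origin: str, dest: str}",
    "create_invoice": "Create an invoice PDF. Schema: {customer: str, amount: float}",
}

def search_tools(query: str, registry: dict) -> dict:
    """Return only the tool definitions whose name or description
    shares a word with the query (a stand-in for real tool search)."""
    words = set(query.lower().split())
    return {
        name: desc
        for name, desc in registry.items()
        if words & set(name.split("_")) or words & set(desc.lower().split())
    }

def build_prompt(task: str) -> str:
    """Assemble a prompt containing only the matched tools, keeping
    the context window free of unused definitions."""
    tools = search_tools(task, TOOL_REGISTRY)
    tool_block = "\n".join(f"- {name}: {desc}" for name, desc in tools.items())
    return f"Available tools:\n{tool_block}\n\nTask: {task}"

print(build_prompt("search flights from SFO to JFK"))
```

A real implementation would match on embeddings or let the model issue an explicit search call, but the payoff is the same as described above: the prompt carries one tool definition instead of the whole registry.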