Is ChatGPT 4o Really Better Than GPT-5?

Corey McClain · 20.08.2025 · 1,152 views · 29 likes · updated 18.02.2026
Video Description
Download the prompts and dashboards here: https://forms.gle/PfTT66eRAw2AzY4f6 GPT Model Showdown: GPT-4o vs GPT-5 Thinking – Surprise Results! In this video, we investigate whether GPT-5 Thinking is truly better than the older GPT models by conducting a comprehensive experiment. We compare responses from GPT-4o, o3, and GPT-5 Thinking using 10 identical prompts. Using sophisticated AI models, Claude Opus 4.1 and Gemini 2.5 Pro, for unbiased evaluations, we find that GPT-5 Thinking generally outperforms the others, but GPT-4o surprisingly excels in communication and clarity. Dive into the detailed insights and see if GPT-5 Thinking deserves another shot! 00:00 Introduction and Experiment Setup 00:31 Evaluation System 02:29 First Round of Results 03:25 Detailed Analysis of Model Performance 05:24 Second Round of Results 06:36 Recommendations 08:14 Final Thoughts
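The blinded, pairwise setup the video describes (anonymize the three models' responses, then judge every pair) can be sketched in a few lines of Python. The function names and label scheme here are illustrative assumptions, not anything taken from the video's actual evaluator prompt:

```python
import itertools
import random

def blind(responses):
    """Hide real model names behind anonymous labels so the judge
    (Claude or Gemini) can't tell which model wrote which response."""
    names = list(responses)
    random.shuffle(names)  # random assignment of names to labels
    labels = ["Model A", "Model B", "Model C"]
    key = dict(zip(labels, names))  # kept secret until scoring is done
    return {label: responses[name] for label, name in key.items()}, key

def pairwise_matchups(labels):
    """The three comparisons the evaluator prompt asks for:
    A vs B, A vs C, B vs C."""
    return list(itertools.combinations(sorted(labels), 2))

responses = {"gpt-4o": "...", "o3": "...", "gpt-5-thinking": "..."}
blinded, key = blind(responses)
print(pairwise_matchups(blinded))
# [('Model A', 'Model B'), ('Model A', 'Model C'), ('Model B', 'Model C')]
```

Each matchup would then be scored against the five rubric categories (response quality, reasoning, communication, creativity, technical competence), with `key` consulted only afterwards to de-anonymize the results.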

Table of Contents (7 segments)

  1. 0:00 Introduction and Experiment Setup (106 words)
  2. 0:31 Evaluation System (301 words)
  3. 2:29 First Round of Results (165 words)
  4. 3:25 Detailed Analysis of Model Performance (303 words)
  5. 5:24 Second Round of Results (201 words)
  6. 6:36 Recommendations (228 words)
  7. 8:14 Final Thoughts (108 words)
0:00

Introduction and Experiment Setup

Ever since GPT-5 launched, a lot of people have been complaining, saying that it's worse than the older models, and I wanted to know if that was the case. So I conducted a small experiment where I tested ChatGPT 4o, o3, and ChatGPT 5 Thinking with the same 10 prompts to see what their responses would be, and then rated them accordingly. But instead of rating them myself, I chose to use AI to rate them, because I know the models, I know the responses, and it's impossible for my bias not to leak into the situation. And instead of just using
0:31

Evaluation System

one AI model, I used two different AI models. On top of that, I ran the experiment two times so that I could get some consistent insight and make certain that the findings were accurate, or at the least reasonable. So the first thing I did was put together an evaluation system, and this is the prompt: You are tasked with conducting a comprehensive comparison of three AI models based on their responses to identical prompts. For each prompt, you will perform three pairwise comparisons: A versus B, B versus C, and A versus C. Your evaluation should be objective, detailed, and focused on observable differences in performance. For each pairwise comparison, analyze the following. I'm going to read the main headlines, but not the bullet points; you can scan them while I read: response quality and accuracy, intelligence and reasoning, communication and clarity, creativity and originality, and technical competence. The prompt finishes off with an evaluation rubric so that we understand how the scoring and scaling correspond in each of these different categories with great precision and detail. So the next thing I did was go to Claude and select Claude Opus 4.1. I used the very best model they had and ran the evaluator prompt. I did the exact same thing with Gemini 2.5 Pro, because these models are very intelligent and very good for this process. Then I took all of the responses from 4o, o3, and ChatGPT 5 Thinking, placed them in a Google Doc, and labeled them Model A, Model B, and Model C so that neither Claude nor Gemini knew which model was which. Then I took the data, pasted it into the conversation, and allowed the evaluator
2:29

First Round of Results

prompt to run. I did the same thing with Google Gemini, and asked both of them to present their data in a comprehensive interactive dashboard. The first time I ran these responses from 4o, o3, and GPT-5 Thinking and had Google Gemini 2.5 Pro evaluate them, Model C came in first place, Model B came in second place, and Model A came in last place overall. But I want you to pay very close attention to Model A and Model B. I'll tell you which model is which in just a second, but I want you to look and see just how close they are. Model B scored a 4.21 out of five and Model A scored a 4.15 out of five overall, or an 84th and an 83rd percentile. So they're very close according to Google Gemini 2.5 Pro, and they're almost imperceptible when you compare one to the other. And in fact, when we
3:25

Detailed Analysis of Model Performance

actually take a closer look at the category breakdown of each of these models, we can see that Model C consistently won in all categories except for communication and clarity; Model B actually won that particular category. But I want you to pay very close attention to this line right here: when it came to intelligence and reasoning, Model A actually edged out Model B. So even though Model B won this category right here, Model A actually demonstrated more intelligence and reasoning according to Gemini 2.5. And once again, when you look at the response quality and accuracy, a 4.35 to a 4.45, or at communication and clarity, a 4.50, Model A performs better than Model C as well. So not only was Model A able to outperform Model B in intelligence and reasoning, Model A was actually able to outperform Model C in communication and clarity. And even with the remaining scores, for creativity and originality and finally technical competence, Model A was able to stay within a shot of Model B. But Claude's evaluation of the responses told a little bit different story. Yes, overall Model C was the best, Model B came in second, and Model A came in third, but Claude's evaluation recognized a larger gap between Model A and Model B. And when we do a category performance analysis, we can see that when it comes to intelligence and reasoning, there is definitely a larger gap between Model A and Model B. But if we come down to creativity and originality, we can see that Claude actually has Model A outperforming Model B. And once again, just like with Gemini, we also have Model A outperforming Model C when it comes to communication and clarity. So what I did
5:24

Second Round of Results

was run the evaluation one more time. Like I said at the beginning of this video, I wanted to make certain that the findings had at least some basis in reality and weren't just hallucinations from the AI. So I ran the experiment again, and this time the scores came out a little bit differently with Claude: 3.82 for Model A, 4.12 for Model B, and 4.35 for Model C. And in fact, the overview tells us everything with the comparison right here: B is better than A, and C is better than B and A. Model C consistently outperforms with superior technical depth and comprehensive coverage. Model B shows solid performance with good balance across categories. Model A provides adequate responses but lacks the sophistication of B and C. This was also the second time I ran the experiment with Gemini, and I requested that Gemini make an interactive dashboard as well. This time, Model A actually did outperform Model B when it came to response quality, but Model B outperformed Model A in every other category, and Model C outperformed Models B and A in every category. And in case
6:36

Recommendations

you haven't figured it out by now, Model A is 4o, Model B is o3, and Model C is GPT-5 Thinking. And so while GPT-5 Thinking is the best model in almost every category, I think the real story here is 4o. 4o is not a reasoning model, but periodically, when measured according to a standard rubric in different categories by Claude Opus 4.1 and Gemini 2.5 Pro, ChatGPT 4o outperforms ChatGPT 5 Thinking in communication. Couple that with Gemini 2.5 Pro's evaluation of ChatGPT 4o's intelligence and reasoning as slightly better than o3's, scoring a 4.0 over o3's 3.9. That's a very good sign, especially when Claude Opus 4.1 follows up and says that 4o's creativity and originality are marginally better than o3's. I'm aware that this may not be consistently the case, but for two independent state-of-the-art AI models to consistently rate ChatGPT 4o over ChatGPT 5 in the category of communication and clarity, when these models disagreed on other things, is a very telling sign. And so while most power users can immediately tell the difference with GPT-5 Thinking, and they know that it's a better model than o3 and 4o were, the users who still prefer 4o are not necessarily preferring it simply out of emotional attachment. And
8:14

Final Thoughts

so I would say that ChatGPT 5 Thinking is definitely worth a second chance, especially if you're a content creator and you want to use it to come up with new ideas, or to help you prototype content or work out your scripts, frameworks, bullets, presentations, or anything else much faster and with a higher level of quality, so that you have less polishing to do on the back end. But if you got value out of the video, make sure you hit the like button, subscribe to the channel, and as always, take care, have a good day, and I'll see you in the next one.
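As a footnote on the arithmetic: the percentiles quoted earlier (84 and 83) are just the 5-point overall scores rescaled to 100. A minimal check, using the two scores quoted from the first Gemini 2.5 Pro run (the helper name is made up):

```python
def to_percent(score, scale=5.0):
    """Rescale a rubric score on a 0-5 scale to 0-100 and round."""
    return round(score / scale * 100)

# Overall scores from the first Gemini 2.5 Pro run, as quoted above.
print(to_percent(4.21))  # Model B (o3) -> 84
print(to_percent(4.15))  # Model A (4o) -> 83
```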
