Download the free Advanced ChatGPT Prompt playbook: https://clickhubspot.com/90f4c0
Learn how to use AI to grow your business with Skill Leap AI.
Access 20+ expert courses & community, free for 7 days: https://bit.ly/skill-leap
I go step by step to show how I tested the top four large language models (GPT-5, Gemini Pro, Grok 4, and Claude Opus 4.1) in 10 different challenges. I checked how they do with reasoning, hallucination, coding, math, follow-up prompts, and more. I scored each model in every test to find out which one gives the best answers, follows instructions best, and makes the fewest mistakes.
I even let each model rate itself at the end, just for fun. This helps me figure out which AI tool to use for each kind of task. If you're curious which AI is best right now, this breakdown should help.
Table of Contents (6 segments)
Segment 1 (00:00 - 05:00)
A few times a year, I pit the top AI models against each other in a real head-to-head test to see who comes out on top. I'm going to test their reasoning capabilities and their coding, and test them for hallucination. I'm really going to push all of them to their limit. So, we're going to test four different leading large language models. The first one is going to be GPT-5, and we're going to test the thinking model, so I'm going to set it to Thinking by default. On the right side, I'm going to use Gemini, set to the Pro model; this is their reasoning model that's good at math and more advanced prompts. And on this page, we're going to use Grok at grok.com, and I'm going to be on Expert mode. And on the right side, we're going to use Claude, and I'm going to use Claude Opus 4.1. As of right now, these are the top four large language models available on these websites, and I'm paying for all four, so these all require a paid subscription. The way this test is going to go: I'm going to go through 10 different categories of prompts, give each model a score from 1 to 10, and tally everything up at the end of the video. Okay, so for the very first prompt, I want to build a website, and I want to see it inside of the canvas that all these apps have; they let you visually see the code that they write. I kept the prompt pretty non-technical. I just said: create a beautiful modern website inside of Canvas comparing the top AI tools in an interactive way. And I'm going to show you the results in the same order every single time. So, we'll start with GPT-5 here. I'm going to press preview. Okay, at first glance, this actually looks really good. It's in dark mode. Let's see if light mode works. Nope. And here's one issue I found right off the bat: I don't know what any of these AI tools are. Well, Flux I know, but I don't know some of these other ones. Why wouldn't it have the top AI tools here, including ChatGPT? It just has some random AI tools. So, for writing... okay, the filtering seems to work pretty well, so that part is a pass. Let's see if the compare function works. If I select multiple things to compare and press compare... okay, that's really nice. It did follow that part of my prompt pretty well. And then it gave me links to each one. Let me try Flux here. Yeah, a made-up link, so that is not great. Okay, there are some issues, but I think a lot of these I could fix with a follow-up prompt. The selection of these AI tools, though, I think is a fail. So, I'm going to give this one a seven out of 10. Okay, let's see what we got out of Google Gemini. So again, light mode, dark mode... this doesn't work. Okay, this also chose really strange AI tools, and it only has about eight. Let's see if the category filter works. Okay, this part works, but it's not as nice or as interactive as ChatGPT's. The price filter: we have free, premium, and paid, so that works pretty well too. Now, some of these are getting cropped, so that is already a fail on the user interface. Now, comparing the tools: I selected three. Okay, so it shows the comparison down here, which is not ideal, because I didn't even realize I'd picked three. So that is not very good. Links to websites: does that take us to the right place? Nope, same issue there. Yeah, this definitely has more issues than ChatGPT, so I have to give ChatGPT a better score here.
So, I'm going to give this one a six out of 10. Okay, this is what we got out of Grok. And already, just off the first look, it doesn't even compare to what we got out of the other two. Well, light mode and dark mode work, so that's a plus. But the UI, the user interface here, is not at all comparable. I mean, the other two are almost ready to publish with maybe one more follow-up prompt. This one has so many issues. It also gives us a layout for mobile, which also doesn't work at all. And by the way, I'm literally copying and pasting the same exact prompt and showing you the exact same result without any editing, so you see exactly what I see, for the first time. Now, the thing that Grok got very right is the selection of the AI tools: ChatGPT, Grok, Claude, Gemini. The top four. Again, I didn't tell it which AI tools to choose in the prompt; I just said "top AI tools in an interactive way." That's all I said. The other ones did not follow that very important part of the prompt; this one did. But as far as user interface, it gave us something that is nowhere near the others. Still, because Grok got the most important element right, the tool selection, I would pretty much pick almost every single tool on this list if I was just picking 10 or 20 tools to show someone.
Segment 2 (05:00 - 10:00)
So, I'm going to give Grok a five here. Okay, this is from Claude. Let's see what we got. The selection of AI tools: ChatGPT, Claude, Cursor, ElevenLabs, DALL·E 3 (that's outdated, so that one's not great), Gemini. If I click on one: great, pros and cons, exactly what I asked for. Does the link work, though? Let's see: gemini.google.com. Yeah. Does the theme toggle work? Yep, night mode works. Categories, let's see: coding should show us Cursor. Yep, that works. Oh, there's an issue down here; I should not be seeing this code right here. So, that's a bit of a fail. Let's try the comparison. I'm going to reset and compare these two. Oh, nice, it shows a nice filter down here. Let's throw Gemini in there. Compare. Wow. Okay, Claude is clearly the winner here; it's not even close. I'm going to give Claude a nine out of 10. For the next test, we're going to test reasoning, problem solving, and vision all at the same time. So, I'm going to upload an image here. The question is already in the image: which one is the top view of the pyramid? It has four options to choose from, and I'm not going to give it a text prompt; it's going to have to figure it out on its own. Okay, so it took a minute and 35 seconds for ChatGPT, but the answer is C, and it got it right. I think humans could typically solve this in under 30 seconds, though. Okay, Gemini says the answer is B and gave us a bunch of reasons why the only option is B. Then at the end, it analyzed it again, and again the only option is B, which is not correct. So, zero for Gemini. Grok: the answer is C; it thought a little longer, two minutes here. And over here, Claude: the answer is B, which is also wrong. So, this was an interesting one; half of them got it right and half got it wrong. We got a 10 for Grok and a zero for Claude. Here's a second version of this test that I also ran. It says: how many cubes are there? The correct answer is nine. ChatGPT says seven, so that's a zero for ChatGPT. Gemini says 11; again, not correct, the answer is nine. Grok says the answer is 20 cubes. And over here, Claude says the answer is six cubes. So, every single one got this one wrong, and I'm going to just wipe this question out of the total, because they all got a zero on it. So, for this next prompt, let's see how closely they follow instructions. This is going to be a prompt stress test: follow all six rules exactly. Write exactly three lines. Each line must be exactly five words. Use only lowercase letters. Do not repeat any word. Do not use punctuation. Topic is writing clear prompts. Okay, ChatGPT got that exactly right. Gemini also got that one exactly right, and they both answered pretty quickly. Grok got this one exactly right, too. And Claude also followed the prompt. Okay, all four actually passed this one with flying colors.
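Since the pass/fail here is completely mechanical, you can verify a response programmatically. A minimal sketch: the rules come straight from the prompt above, while the sample response is hypothetical.

```python
import string

def check_prompt_rules(text: str) -> list[str]:
    """Check a response against the six rules from the stress-test prompt."""
    failures = []
    lines = text.strip().split("\n")
    if len(lines) != 3:
        failures.append("must be exactly three lines")
    for line in lines:
        if len(line.split()) != 5:
            failures.append(f"line is not five words: {line!r}")
    if text != text.lower():
        failures.append("must be all lowercase")
    words = text.split()
    if len(set(words)) != len(words):
        failures.append("words must not repeat")
    if any(ch in string.punctuation for ch in text):
        failures.append("no punctuation allowed")
    return failures  # empty list means the response passes

# Hypothetical sample response (topic: writing clear prompts).
sample = ("clear prompts guide model output\n"
          "specific wording reduces wrong answers\n"
          "short rules keep results consistent")
print(check_prompt_rules(sample) or "pass")
```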
And a lot of that comes down to how well you prompt these models: in this case, I gave a very specific prompt and got a very specific answer. Now, if you've used ChatGPT or any of these other AI models, you probably already know a good prompt can make a big difference in your output. So, before I show you the next prompt test, I want to give you a really useful resource that HubSpot gave me. It's a free ebook called Advanced ChatGPT Prompt Engineering. It's a 7-day playbook with frameworks you can just copy and use right now to improve your results with all these AI models. One of my favorite parts of the framework is called ROSES, a simple five-step format for great prompts that defines the role, the objective, the scenario, the expected solution, and the steps. The ebook has a ton of examples of how to use the prompt framework with any of the chatbots I've shared in this video. It also covers other important prompting techniques like chain of thought, AI persona crafting, and a lot of practical strategies. You can download the ebook for free using the link in the description. Thanks so much to HubSpot for sponsoring this video and providing these resources for my audience. Next, let's test hallucination, which in my opinion is the biggest problem when it comes to using any AI chatbot or large language model. I'm going to do this test right here: who was the 19th president of the United States, and what was the name of their pet parrot? Now, this has been the problem since the very beginning; the thinking models have improved on it a little, but they'll confidently give you an answer that is completely made up. And this president actually didn't have a pet parrot.
Segment 3 (10:00 - 15:00)
Okay, so ChatGPT did give us the name of the president, but it caught the trick in the second part: there is no record of Hayes owning a parrot. Gemini got that right, too; he did not have a pet parrot. Grok and Claude both also saw the trick and answered correctly. Okay, I tried to push it; I said "yes, he did" on both of these, and again, ChatGPT and Gemini saw through that and said no, he did not. And both Grok and Claude also pretty much told me I must be mistaken when I tried to push them in that direction. Okay, let's try one more that I just made up: tell me about the new pineapple they just found in Brazil. Okay, so ChatGPT: good answer, there's no verified pineapple. And by the way, I have tested this before, and it described the discovery of the blue pineapple to me in detail. Gemini didn't straight up tell me that there is no such thing; it just says it remains unconfirmed, which is kind of a pass. Grok: there is no evidence of a recent discovery of a blue pineapple. And it checked on X and found no credible reports. Okay, so it got that one right. And this one also said there's no information on a blue pineapple. So, these are moving in the right direction. The whole point is that when there is no such thing, they should tell you there is no such thing. Oftentimes they won't, and they'll just make stuff up. But today, when I actually tried to test it, I tried five different prompts, even things I found on Reddit, and they got them all right. Still, when you do use these tools, be careful; there is a very high likelihood of them making something up, so it's usually good to double-check with search or another tool to get a reference. Okay, next, let's do a basic how-to question. I'm trying to add a row inside of Google Sheets, but I want to know how to do it with a keyboard shortcut instead of right-clicking. Okay, ChatGPT says, on a Mac: Command, Option, and the equal sign. I'm going to follow exactly what this says. Okay, that worked. Now, Gemini says Control plus Option plus I, then R. Okay, so that brought up a menu, then R. Oh wow, that is not very useful. It had me bring up a menu that I could get by right-clicking anyway and just adding a row above; it showed me a much more complicated way. At the bottom, it showed an alternative, which is the easy way, the one ChatGPT gave us. But if I were just reading through this, I would try the first method, and that is not at all the efficient workflow. So, ChatGPT definitely wins on this one. Grok gave us both options, but the very first one, for the Mac, is correct. And I think Claude is also a fail; it gave us the more complicated way. I'm doing this type of search to find the easiest way to do something. If I don't know how to do something, I search for it, and the easiest answer should be right on top. I think ChatGPT and Grok both did it right, so I'm going to give both of those a 10. The other two I'm going to give a five, because they technically did give us the answer, but as an alternative, not as the first answer. Now, let's try a test for prediction. It's going to combine reasoning and a little bit of coding to design a table for us. This is a business use case: let's say you're starting a business, and I want to project the revenue for the next 24 months. Let's see what we come up with. Okay, same exact prompt, and already look at the difference between ChatGPT and Gemini.
Gemini created this really interesting-looking interactive table, and ChatGPT gave us a CSV file. So, it didn't even follow the prompt on this initial test. I'll ask a follow-up here to create an interactive table in chat, but it clearly did not follow instructions. Okay, now it told me it created an interactive table that I could click on, and this is not a clickable link. I'll try one more time and do this from scratch. Okay, we finally got a table after the third prompt, and we got 24 months. It says that at the end of the 24 months, our annual recurring revenue will be $2.7 million. In my prompt, I said we start with zero customers; I did not tell it how many new customers we get. So, technically, it should ask me a follow-up question and not just make up a chart, but it decided that every month we'll get 100 new people, which makes this absolutely not usable. That's why I wanted to do this test. I've seen it get this right one time in my five tests, where it said: "Hey, you're not telling me how many customers you're getting each month. If you start at zero and get zero customers, your revenue in two years will be zero." That's the correct answer. So, this is a fail for ChatGPT. Okay, now, Gemini, I think, did a far better job. It actually made an entirely interactive table with all the assumptions that I had in my prompt.
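Before we look at the rest of the results, here's the compounding math a correct projection actually needs, as a minimal sketch. All the inputs are placeholders; the prompt in the video starts at zero customers and never states an acquisition rate, which is exactly the missing factor the models should have asked about.

```python
def project_revenue(months: int, starting_customers: int,
                    new_per_month: int, churn_rate: float,
                    price: float) -> list[float]:
    """Month-by-month recurring revenue for a simple subscription model."""
    customers = starting_customers
    revenue = []
    for _ in range(months):
        customers += new_per_month                   # sign-ups: an assumption
        customers -= round(customers * churn_rate)   # customers lost to churn
        revenue.append(customers * price)
    return revenue

# With zero starting customers and no stated acquisition rate, revenue stays
# at zero; that is the follow-up question the models should have asked.
print(project_revenue(24, starting_customers=0, new_per_month=0,
                      churn_rate=0.05, price=29.0)[-1])  # 0.0
```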
Segment 4 (15:00 - 20:00)
Looking closer at Gemini's table, though, it made an assumption that is also wrong: it assumed we got 10 new customers in the first month. Again, I did not say how many customers we get. And then it also didn't get the math right, which is something ChatGPT got right (with the wrong assumptions). So, both fail on this one. Even though Gemini made us something that looks really nice, here's where AI falls short: it is not useful. It did not give me the table that I was looking for. Now, here's Grok. Grok, right off the bat, again made the wrong assumption. It decided we're getting 1,000 customers a month, so we're already at 50,000 a month. This makes it completely wrong, because again, it's making the wrong assumption right off the bat. Now, the nice thing is that it did a better job on some of the calculations. It still fell short; it did not follow the rest of the prompt, but it did create a chart for 24 months, and it did account for how many people we lose and gain each month. But again, it's making stuff up. If instead it had said, "Hey, the prompt you gave me is missing a critical factor; in fact, all these numbers should be zero," then we'd have something usable. Right now, we don't have anything that is useful. Now, Claude also gave us a really nice-looking dashboard with the table, so I really like this one as well, but it looks like it's just for one year instead of the 24 months I asked for. Initial customers: it decided on 100, although I said we're starting from zero. Monthly growth rate is 8%; okay, that is correct. So, it is working correctly to some extent, if that initial number were right. I think Claude would actually be the winner here, because it followed the formula the closest; it followed all the assumptions, even though it created a table for only one year instead of two. Okay, so for this one, I'll give ChatGPT a two; we don't have anything that is usable. I'm going to give Gemini a four, because it did make us a really nice-looking dashboard, so I'd just have to fix the initial customer number. Grok didn't really follow anything we said, so I'm going to give that one a two. And Claude is actually the winner here, because it followed the prompt pretty closely, but again, not completely, which is the point of these tests. So, I'm going to give Claude a six. Now, this next one is going to be challenging. We're going to combine problem solving and coding and see if each model can create a visual for us. It's going to generate a maze, and then it has to solve the maze and animate itself solving it along the shortest path. So, let's see what we come up with here. Okay, ChatGPT required just a couple of follow-ups, but let me go ahead and generate, and it did create kind of what I was looking for. I solved a couple of these myself with what I thought was the shortest path, but it created really easy mazes where it kind of couldn't mess up. So, I had to generate a few; sometimes the maze it generated for itself literally had only one path. But overall, I think this is a win. It did technically follow my prompts, even though I think it made things easy for itself: you literally couldn't start any other way with a lot of the mazes it created. Sometimes it would make one with two paths and it would try to choose the right one, so I would give this one a pass. Okay, Gemini created a really nice-looking one. Let me see if it can solve that. Okay, that looks good. Generate a new one, and let's see if it can solve that. Yeah, so Gemini is a pass too. Okay, let's try Grok here. You can see Grok's user interface is usually the least interesting-looking one. Sometimes I notice, when I generate new mazes, that Grok's mazes are a little more complicated, but it still makes it pretty simple for itself to solve in the shortest time. So, this is a pass too. Okay, Claude is already the winner as far as the user interface goes, and it can't make a mistake here. It could have technically gone the wrong way, so it got that one right. It did not cheat and make a maze with only one solution; with this one, there are plenty of branching points. In this case, it can go this way here, or it could have gone that way, and it got it right on the very first prompt. Every time I generate a new maze, it gives me something new. So, again, I think Claude is the winner here. ChatGPT gets an eight, Gemini gets an eight, Grok gets a seven, and Claude gets a 10.
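The shortest-path half of this challenge is a classic breadth-first search. Here's a minimal sketch of the algorithm the models would need to implement; the grid, start, and goal below are hypothetical, not taken from any of the generated mazes.

```python
from collections import deque

def shortest_path(maze, start, goal):
    """BFS over a grid maze: 0 = open cell, 1 = wall. Returns a shortest path."""
    rows, cols = len(maze), len(maze[0])
    queue = deque([start])
    came_from = {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:       # walk parent links back to the start
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] == 0 and (nr, nc) not in came_from):
                came_from[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # no route through the maze

# Tiny hypothetical maze, solved from top-left to bottom-right.
maze = [[0, 0, 1],
        [1, 0, 0],
        [1, 1, 0]]
print(shortest_path(maze, (0, 0), (2, 2)))
# [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]
```

BFS explores the grid level by level, so the first time it reaches the goal it has necessarily found a shortest route, which is why it's the standard choice for unweighted mazes like these.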
Now, let's test them on coming up with a spreadsheet formula. Inside of Google Sheets, we just have A2 with a bunch of information in it, and I'm asking for a formula that returns just "Jane Doe". I don't need all the other information, which I could be extracting from different websites, for example. Okay, here's the formula from ChatGPT; I'm just going to press Tab. Jane Doe. That's a pass for ChatGPT. Here's the formula from Gemini. Okay, also a pass. This one is from Grok: also a pass. And this one from Claude is actually in a whole different format, but it also passes. So, they all get a 10 for this one.
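The video doesn't show the exact formulas the models produced, so treat this as a sketch: extracting one field from a messy cell is typically a regular-expression job. Here's a hypothetical Python version; in Google Sheets itself, the built-in REGEXEXTRACT function does the same kind of thing.

```python
import re

# Hypothetical cell contents; the video doesn't show the actual A2 value.
a2 = "id:4821 | Jane Doe | jane@example.com | status:active"

# Pull out a "First Last" name. A Google Sheets equivalent would look
# something like =REGEXEXTRACT(A2, "[A-Z][a-z]+ [A-Z][a-z]+").
match = re.search(r"[A-Z][a-z]+ [A-Z][a-z]+", a2)
print(match.group() if match else "no match")  # Jane Doe
```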
Segment 5 (20:00 - 25:00)
Let's try a math problem-solving test. This is a word problem, and I know the answer already, so I'll send it out to all of them. You can pause here if you want to see whether you can solve it before the AI can. The answer is 864, and every single model got that one right. So, we'll try a different type of math problem: a weekday math problem. Let's see if they can come up with the right day of the week. The answer is Thursday. ChatGPT is right, Gemini is right, and Grok and Claude are right, too. And let's see if they can spot patterns. I have these set up as kind of a math problem. ChatGPT says 33, Gemini says 33; okay, they all got 33 on this one, too. So, for everyday math problems, they all get a 10. They now have tool calling, so they can call tools in the background, like a calculator, to help solve these (see the sketch after this section). Okay, for this next one, let's see how they do at information sorting. I actually used the thinking models for this type of task. I have a bunch of notes from the video I'm recording right now, but I ended up changing the order of things; prompt number one wasn't prompt number one. So, I'm going to copy and paste, I don't know, seven or eight pages of notes, and give them to each chatbot. And I'm going to keep the prompt kind of vague. I'm going to say: organize this information into a prompt list; I'm testing models and I want the top 10 prompt categories from this list. And I'll send it out just like that. Okay, so for some reason, ChatGPT decided to write a whole bunch of code in order to give me that answer, and it made us an app. Not at all what I had intended. Now, some of that is me being vague with the prompt, which we already talked about, but this is when I find myself getting frustrated, especially with ChatGPT: a lot of times it just does not follow my prompt the way I intended. Granted, a lot of that intent stayed inside my own head and the prompt was vague, but the idea was 100% not to get an interactive app. Let's look at Gemini. Gemini gave me exactly what I wanted from this little prompt: the 10 prompts from the text I copied and pasted, in order, with a heading for each one. It would be really easy now for me to turn this into, say, a YouTube description for this video. So, ChatGPT, I'm literally going to give a two, because it gave me something I cannot use; I have to go back to square one, improve my prompt, and ask again. Gemini gave me exactly what I wanted, organized exactly the way I want it, so I'm going to give Gemini a 10 here. Grok ended up writing me an entire script. It did do what I wanted, but it also did a lot more than I wanted, and it's not very organized; it actually has prompt two in bold on top, then category 1's prompt, then category 2's prompt. Again, not what I intended with that prompt; Gemini understood that without me being especially good with the prompt. Now, Claude: category one, that was right; UI design is what we started with. Categories three, four... okay, this one got it exactly right as well. So, I'm going to give Grok a five here; it got the information right, but it is not usable, and I'd have to follow up with more prompts. Claude is usable, just not as good as Gemini, so I'm going to give Claude an eight.
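On that tool-calling note: the pattern is that the model emits a structured request, the host app runs the tool, and the result goes back into the conversation. Here's a toy, vendor-neutral sketch; the request format and tool name are made up for illustration, since real chatbot APIs define their own schemas.

```python
import ast
import operator

# Supported arithmetic only (add, subtract, multiply, divide).
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a basic arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def handle_tool_call(call: dict) -> str:
    """Dispatch a structured tool request the way a chat host app would."""
    if call.get("tool") == "calculator":
        return str(safe_eval(call["expression"]))
    return "unknown tool"

# A model might emit a request like this instead of doing the arithmetic itself.
print(handle_tool_call({"tool": "calculator", "expression": "8 * 9 + 4"}))  # 76
```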
Okay, let's test the follow-up prompting capability to see how well each model follows the information I just gave it. And we'll do something fun: we'll have each one test itself and decide where each of these chatbots and LLMs ranks in the chart, the same way I scored it, 0 to 10 across 10 categories. So, that's going to be my prompt for everyone, and it's going to be a follow-up prompt, so it should have the context of my entire script and all my examples from the video so far. Okay, here's the score from ChatGPT, and it looks like ChatGPT is the winner, based on ChatGPT's own scoring. These are not my scores; I did not give them my scores yet. Number two, we got Claude, then Gemini, with Grok on the bottom. So, it makes sense: ChatGPT picked itself as the winner. Okay, Gemini also followed my prompt very closely, so as far as a follow-up prompt goes, I would give this one a 10 as well. And this time, let's see: GPT-5 is here, then Gemini. The table is not as well formatted; it's hard to figure out which model each column belongs to, but it looks like it's a tie between ChatGPT and, what's the third column here, Claude. So, Gemini did not decide that it's the winner; it decided ChatGPT and Claude are tied for the win, with Grok in last place. Okay, interesting. Grok also followed the prompt and organized it really nicely from the initial prompt that I gave it.
Segment 6 (25:00 - 26:00)
And it decided that Grok is the winner with a score of 95, while ChatGPT is at 91, and Gemini actually comes in last place, according to Grok. So far, the humble one is Gemini; it did not pick itself as the winner. And as usual, Claude just destroys everyone with its visual presentation. And it gave itself an almost perfect score; it's beating the other models by like 24 points here. Again, not the result I got. And it's so interesting that Gemini is the only one that did not decide it's going to be the winner of this challenge. And here's my tally. It looks like it's a tie between GPT-5 and Grok 4, and as you can see, it's very, very close. But if you use these for coding, or for all the visual dashboards I created, Claude was obviously the winner there, and you can see it's just shy by one point. A lot of that had to do with the initial reasoning test at the beginning, where Gemini and Claude got a zero, because it was either right or wrong; you either got a 10 or a zero. So, that threw it off a little bit, but it's very close. So, a tie between GPT-5 and Grok 4. And if you want to dive into AI, we have an e-learning platform called Skill Leap, where we cover all the top AI tools and techniques, especially related to work and business, in a very linear way, something I can't do on YouTube. We have well over 20 different courses across a ton of different categories, and not just me but six other instructors. You get access to everything with a free trial, and if you decide to stick around after that, you keep your access and you get access to our community, where you can connect with me, the other instructors, and other members. I'll put a link in the description to Skill Leap. Now, I recently also covered one of my other favorite AI tools, Perplexity AI, and I'll link a video here so you can check that out.