Grok 4.1 just released and I put it through a bunch of tests. Here's how it turned out...
Sign up for my free newsletter (first info on Vibe Coding Academy): https://www.alexfinn.ai/subscribe
Follow my X: https://x.com/AlexFinnX
My $300k/yr AI app: https://www.creatorbuddy.io/
Grok:
grok.com
Timestamps:
0:00 Intro
0:26 The announcement
1:49 Strengths and weaknesses
9:36 AI use cases
Grok 4.1 just released, and I'm gonna be honest with you: it is absolutely incredible in some ways and potentially the most disappointing model of the year in a few other ways. In this video, I'm going to show you exactly how to get the most out of Grok 4.1 and also show you the use cases you need to avoid with everything you've got. By the end of this video, you'll know exactly which models to be using for every single one of your use cases. Let's get into it. So, it was
just announced moments ago. Grok 4.1 has released. It has actually been silently rolling out for two weeks now. And here are the benchmarks. So, this is according to LM Arena. And I have to be honest with you, I just don't buy it. I don't buy LM Arena. These make zero sense at all. After all my testing with Grok 4.1, I don't see how you can choose it over many of these models for a good number of use cases. On top of that, I don't even know how at this point you can choose Gemini 2.5 Pro, which is almost eight or nine months old, over models like Sonnet 4.5 or GPT-5. That makes zero sense.
But skipping past that small detail for a second, there are a few things they're saying Grok 4.1 is vastly improved at, things like emotional intelligence. You can see that here on the emotional intelligence benchmarks, which again are based on Elo. For those who don't know, Elo just means people put the models head-to-head and then choose a winner, so this is a ranking based on which ones people are choosing. Apparently, it is a much more emotionally intelligent model. Creative writing: I did a ton of tests around creative writing. Again, apparently it's saying it's better than Kimi K2, Claude Sonnet, and o3. I'll show you some creative writing examples I did and show you how it stacked up. Spoiler alert: I did not like it for creative writing.
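For anyone curious how a head-to-head Elo ranking like the one described above turns votes into scores, here is a minimal sketch of the classic Elo update. It is an illustration only: the K-factor of 32, the 400-point scale, and the function names are conventional or hypothetical choices, and the leaderboard's actual computation may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the classic Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one head-to-head vote.

    k is an assumed step size: how far a single vote can move a rating.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    # The winner gains exactly what the loser gives up, so the total is conserved.
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta
```

With two models that both start at 1500, a single vote for the first one moves them to 1516 and 1484. Beating a much higher-rated model moves the ratings more than beating a lower-rated one, which is why the final ranking reflects who people keep choosing rather than the raw vote count.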
Here is a breakdown of the conclusions I drew from my tests. I'll show you the results of these tests, too, in a second: exactly what prompts I used and the results I got that helped me draw these conclusions. Grok 4.1 is fantastic at social sentiment and current events. Why is that? It is the only AI built on top of X, so it is the only AI powered by tweets and everything going on currently on X. For example, I asked Grok 4.1, "When do people think Gemini 3 is going to release?" It gave me a bunch of information about current social sentiment. It gave me the Polymarket prediction markets and where they currently stand. It told me about a whole bunch of leaks people are talking about on X. And best of all, I said, "What are people saying about it on X?" and it gave me exact tweets from accounts that I can actually click on and see. So I can see actual sources from live tweets posted within the hour on Grok. That is a very big strength of Grok that no other AI has. For instance, I asked the same thing to ChatGPT 5.1 and it basically refused to show me any tweets. It gave me links to Reddit even though I asked for tweets. It seems like the most it could do when it came to X was search for the Gemini 3 hashtag, but literally no one uses hashtags.
It's also great on the API side. This is less about 4.1 and more about Grok in general. Grok is a very cost-effective API. So, if you're building an app and you want a cost-effective AI solution that pulls live data from X, the Grok API is really the only way to do that.
Now, here come the weaknesses, and here is why I am so disappointed with Grok 4.1. Starting with the top two here: it being incredibly slow and it being weak at coding. If you use the thinking model for Grok 4.1, you are going to have a bad time. For instance, I'm going to give the same prompt to Grok 4.1 Thinking and ChatGPT 5.1 Thinking to write the code for a 3D first-person shooter, and you'll see how slow this is. I hit send on both. I hit send on Grok 4.1 first, but out of all my tests with all the thinking models out there, Opus 5.1 Thinking, 2.5 Pro, Grok 4.1 Thinking, it isn't even close. It is torturously slow to use Grok 4.1 Thinking. And I really hope they speed this up, because one of the best and coolest parts about Grok was that they had that super fast coding model that was by far the fastest of them all. As you can see here, we're 57 seconds in and we have the full code built out in GPT 5.1 Thinking. So this took under a minute to write this application. And we are now writing the code in Grok 4.1 Thinking. As you can see, this took almost 2 minutes, so double the amount of time to write this code.
And here's the most disappointing part of all. I'll show you how this code ran, but I'm going to tell you right now, the Grok 4.1 code did not work. I run four basic coding tests with every new model that releases: I create a 3D first-person shooter, an animation of Elon Musk dancing, a 3D city flythrough, and a music visualizer. Out of all the models I've run this with, Sonnet 4.5 has been the strongest. Unfortunately, Grok 4.1 Thinking could not even run the city flythrough test at all. I gave it 10 different chances, and it could not write the code for it. I did this on a live stream earlier today if you want to check that out; I'll link it down below. It also could not build the 3D first-person shooter. And even on the ones where it could write the code properly, the Elon dancing animation and the music visualizer, the results both scored incredibly weak out of 10. I was very, very disappointed with the code writing.
Then the other two big tests I do are creative thinking and business planning. This is actually my main use case with AI.
If you watch any of my other videos, you know I give a ton of instructions on how to use AI for creative thinking and business planning. If you're anything like me, you build a lot of apps with AI. One of my favorite things to do as I'm building an app is to have an AI build my product roadmap, be my product manager, and overall advise me on all the features to build. So I went to Grok 4.1 Thinking and ChatGPT 5.1 Thinking and gave them the exact same prompt. Right now, I'm working on a Vibe app store, an app store for solo-built, vibe-coded apps. And I said, "Hey, can you come up with some features for me?" Grok's feature list is just really, really weak. The second idea it gave me is Vibe Souls, a permanent on-chain reputation system where I give everyone crypto tokens and then their crypto tokens can evolve like Pokémon. It is just a horrible idea that no one would ever want in an app store for vibe-coded apps. In the early days of using something like ChatGPT 4o or Grok 2, when the AIs were not nearly as smart, all the ideas they'd give you were ideas like this: ideas that sound kind of interesting on paper but that no one would actually use in an app.
On the other side of it, too, it just doesn't talk to you in a very human way. We'll talk about vibes in a second, but it talks like this: "Here's a high-octane brainstorm packed with actually new, actually unique, actually sticky ideas." That's just not how human beings really speak. When I talk to an AI, I want it to feel human. That doesn't feel human. It just feels off and weird. On the other hand, when I talk to 5.1 Thinking, which in my opinion is the best model for creative thinking, business planning, and things like that right now, you can see it says, "Right now you've basically spec'd... That's fine, but it's not defensible and it won't drive real retention." That's just kind of how humans and product managers talk. And here are the ideas it gave me.
A build log feed, so you can see the latest updates from every app you're following in the app store. Structured feedback requests. Right? Things people would actually want in an app store for vibe-coded apps. They're really good, normal use cases that would drive retention to the app. Grok's secret society tiers, on the other hand, are not a feature people would be interested in or use in an app store: 3 a.m. Club, No-Code Wizard, Grok Loyalist. These are all just weird ideas only an AI can come up with. They're like slop ideas. So, I do not plan on using Grok 4.1 for creative thinking or business planning. It just gives you really weird business ideas.
And then the last weakness I'll cover is vibes. I think this is an incredibly underrated feature of AI. If you use any AI, or use all the models, you know what I'm talking about: just the feel of talking to the model. For me, this has always been Grok's biggest weakness across all of the models they've ever released. Ever since Grok 1, the language Grok uses has always been kind of whimsical, if that's a way to describe it. Just reading this out, it doesn't feel human: "Implement even three to four of these S-tier ideas and Vibe will have retention numbers that make Reddit blush. Which two to three of these make you go, 'Holy f, we're building that tomorrow'? Let's double-click on those." That's just not how human beings talk. That's a weird vibe. It's almost like corporate speak for AI. It's just kind of weird. If we go to the end of the ChatGPT response and see how it wrapped up, it gives me a high-impact short list of all the features I want to build. It just goes, "Hey, if you want a next step, we can pick two of these and I'll help you spec them as proper features so you can throw the prompt straight into Claude Code." That's how human beings speak. Those are good vibes. It makes me want to continue the conversation with ChatGPT.
So, as you can see here, my benchmark scores for both 4.1 and 4.1 Thinking were very weak. I gave 4.1 Thinking a 6.1 and 4.1 an 11, and that's out of a possible 40. Right now, Sonnet 4.5 has the highest score across all the tests I gave, at 26.9. Grok just didn't do well. It's also still in beta. It says beta inside Grok, so I don't want to treat these as final conclusions just yet. But
here's where I stand on AI use cases at the moment. Feel free to screenshot this. This is every use case I can think of for AI and the model I would use for each one. Coding: Sonnet 4.5 is best. Creative writing and business planning: ChatGPT 5.1 Thinking is the crème de la crème. Anything media related, video generation, image generation, I'm going with Google, so Veo and Nano Banana. But here is where Grok comes in: social sentiment and current news. I'm using Grok. It is the most up-to-date model on what's going on in this moment. So, if you need to know what's happening right now and what other people think about it, I am using Grok for that. But to be honest with you, after all my testing, I cannot think of any other use case where I would pick Grok over the alternatives. Straight-up chatting: ChatGPT 5.1 Thinking is the best right now. Coding: Sonnet 4.5, I just don't think anything comes close to it. Grok's media is very good, so Grok Imagine is very good, but it is not as good as Veo 3.1 and Nano Banana 2. Nothing's coming close. It's going to be very hard to beat Google in those two areas just because they're trained on Google Images and YouTube.
So, again, strengths: if you're building any sort of app and you need current, up-to-date information in the app, use the Grok API. If you just have questions about what other people think and what's going on on the internet, use Grok 4.1. But on the other side, it is pretty darn weak in all these other areas. And I hope that in the full release, because again, this is just a beta, it is much improved. You can use Grok 4.1 right now by going to grok.com. Give it a test. Let me know what you think. If you've used it already, let me know in the replies what you think about it. If you learned anything at all in this video, leave a like down below. Subscribe. Turn on notifications, because all I do is create amazing videos on AI. We have a ton of new models releasing this week, so I cannot wait to cover them.
I also have the number one AI newsletter on planet Earth linked down below for that and I will see you in the next