🚀 Learn how to use AI to grow your business. Access 20+ expert courses & community—free for 7 days: https://bit.ly/skill-leap
Here is the official Claude 4 announcement: https://www.anthropic.com/news/claude-4
I walk through how I test Claude 4's newest models, Claude Opus 4 and Claude Sonnet 4, using real-world tasks to see how they compare to other top AI models like ChatGPT and Gemini.
I try coding a custom chess game in Python, running long documents through its 1 million token context window, and building visual dashboards from data. I also show how well it does with logical reasoning, web search, and acting like a personal assistant. I even see if it can turn a screenshot into usable web code for a site banner. Claude’s new “extended thinking” and web tools really help make responses smarter.
Anthropic claims Claude Opus 4 is the best coding model out there now. It worked great with some prompts, but still had issues with complex logic. Claude Sonnet 4 is free to use and better than Sonnet 3.7, though uploading screenshots doesn't currently work the same. I also tested it on a tricky train riddle and on Airbnb research; it handled both well.
Table of Contents (4 segments)
Segment 1 (00:00 - 05:00)
Claude 4 is finally here. It's been quite a while since their last update. In fact, Claude 3 came out back in March of 2024, and Claude 3.7 Sonnet came out back in February of 2025. They just released two new models: Claude Opus 4, which is their very best model, and Claude Sonnet 4. And I'm going to show you exactly how to use them in this video. They're available inside of the Claude app, and I'm going to run them across a bunch of different prompts.

Okay, before I show you the prompts, let me catch you up with Claude here. The Claude family of large language models came in three different sizes: Haiku, Sonnet, and Opus. Opus had Opus 3 and now they have Opus 4; they didn't have a 3.5 or 3.7. Sonnet had a 3.5 and a 3.7. And for developers that were creating apps using large language models, Claude 3.5 and 3.7 Sonnet were pretty much the standard go-to models to use. The new, upgraded Sonnet 4 is obviously going to replace 3.5 and 3.7, and then there's this new, powerful Opus. We haven't had an Opus since 3, so it's been quite a long time since the last Opus release. Obviously, there's going to be some cost, but I'll compare cost in a little bit here, too.

Now, they claim right here that so far Opus 4 is the world's best coding model, so we'll do some testing there. It says it's good at long-running tasks and agent workflows. And Claude Sonnet 4 is a significant upgrade to Claude 3.7: again, better coding and reasoning while responding more precisely to your instructions. So we'll test that too.

Along with these models, they also introduced some tools, and I have access to some of these. You have extended thinking with tool use, so for both models you could use things like web search. Claude for a very long time did not have web search; they just released it a month or two ago, and it is now combining web search with extended thinking. If you use any other models, the thinking models are now some of the best models available out there, because they think in the background before they respond, giving you a much more useful answer. They also have Claude Code, which has been out for some time but is now generally available. This is more for developers; it ties into things like GitHub. And these are available in the API, which has other new capabilities too. We'll get to this in a little bit.

Claude Opus 4 and Sonnet 4 are called hybrid models, so they could give you almost instant answers but also have reasoning built into them. This is one of the main reasons I actually still use Claude. I used to use Claude all the time last year; this year I use it a little bit less. I use Gemini 2.5 Pro and ChatGPT more, so this is probably my third most used tool, and I'll show you some of the reasons why I still use it. But now with these models, we'll see how it compares to those other top-tier models.

I'll quickly show you some of these benchmarks here. For software engineering, Claude 4 is beating everything else, including Gemini 2.5 Pro, which right now is on top of the leaderboards, and all the reasoning models over here. And when you compare it on other things like graduate-level reasoning or agentic tool use, again, it's pretty much beating everything out there, and some models like Gemini 2.5 do come close in some of these categories. I'll leave the full blog post below if you want to see some of their videos and demos. But let's go ahead and take it for a test here inside of claude.ai.
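Since all of this is also exposed through the API, here's a minimal sketch of a call with extended thinking and web search turned on, assuming the anthropic Python SDK. The launch-day model IDs and the web search tool type string are assumptions worth verifying against the current docs.

```python
# A minimal sketch of calling the new models through the Anthropic API.
# The model IDs ("claude-opus-4-20250514" / "claude-sonnet-4-20250514")
# and the web search tool type are assumed launch-day values; check the
# docs for the current names.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # or "claude-opus-4-20250514"
    max_tokens=4096,
    # Extended thinking: give the model a private reasoning budget.
    thinking={"type": "enabled", "budget_tokens": 2048},
    # Server-side web search tool, capped at a few lookups.
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 5}],
    messages=[{"role": "user", "content": "Summarize today's top AI model releases."}],
)

# The reply interleaves thinking, tool-use, and text blocks; print the text.
for block in response.content:
    if block.type == "text":
        print(block.text)
```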
In claude.ai, you will have two models. Claude Opus 4, along with features like extended thinking, is right now only available in a paid plan, which starts at $20 a month, but Claude Sonnet 4 should be the default here, and it should be free to use for everyone. They always do have a limit, and they keep adding new plans with more headroom. For example, they have a Max plan that gives you even more, but it's more expensive than the $20 plan I have. If you use Claude all the time, that plan gives you a lot more access to the best models. Right now, with the Pro plan, I still pretty much hit my limit every time I use Claude. With the Max plan, you could get five to 20 times more usage, and it starts at $100 a month.

So, we'll start here with Claude Opus 4, and I'm going to do a bunch of different testing in different categories. The first one is a coding test that I've never gotten any model to get right: a chess game in Python that I'm going to run locally on my computer. The only thing is, I changed the rules. The pawns are going to be able to move like bishops, so they don't just move straight; they can move diagonally, too. And I have a folder called chess assets: I took the chess pieces from the internet, downloaded every piece as a little image, and told it where that folder is. Let's see if it could do this for me.

Okay, so it wrote the whole game for me and gave me some directions here. The speed is pretty close to the previous version; I didn't really notice any difference, it wasn't really slower or faster.
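To make that rule change concrete, here's a minimal sketch of what the modified pawn logic might look like. This is a hypothetical reconstruction, not the code Claude generated; the 8x8 board representation and piece codes are assumptions.

```python
# Hypothetical sketch of the house rule: pawns keep a one-square forward
# push but can also slide diagonally like bishops. The board is assumed
# to be an 8x8 grid of piece codes ("wP", "bQ", ...) or None for empty.
def pawn_moves(board, row, col, color):
    moves = []
    step = -1 if color == "w" else 1  # white pawns move up the board

    # Standard forward push onto an empty square.
    if 0 <= row + step < 8 and board[row + step][col] is None:
        moves.append((row + step, col))

    # Bishop-style slides along all four diagonals.
    for dr, dc in ((-1, -1), (-1, 1), (1, -1), (1, 1)):
        r, c = row + dr, col + dc
        while 0 <= r < 8 and 0 <= c < 8:
            target = board[r][c]
            if target is None:
                moves.append((r, c))       # empty square: keep sliding
            else:
                if target[0] != color:     # enemy piece: capture, then stop
                    moves.append((r, c))
                break
            r, c = r + dr, c + dc
    return moves
```

The failure I hit in the video, pawns freezing mid-game, is exactly the kind of thing a sliding loop like this can get wrong once turn logic and check detection are layered on top of it.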
Segment 2 (05:00 - 10:00)
I'm also going to add one more thing here: a screenshot of the file names of the chess pieces. I did not give it that, so it did not guess those right, because why would it? I named those on my own. So I'll send this out, and it's going to revise the code, which is really nice. Sometimes when you do this inside of Claude, compared to the other apps I've used for this type of thing, it does an update right here rather than starting from scratch. You see how it's just editing the part of the code this instruction affects instead of rewriting everything? That saves you a ton of time.

Okay, so we got the first version of the game, and it looks like it did not load the pieces, so I'm going to take a screenshot of this and tell it to give me the pieces. Okay, we got the game up and running. Let's go ahead and try it. So, this pawn does move like a bishop. Let's see, we take it there; I'm just going to run it through a few different moves. Okay, would this be able to take this piece? Yes, we got that right. Oh, this pawn is not able to move. Something broke here in the process, but up to that point it was working. Yeah, I can't get these pawns to work the right way. Now, would the other pieces move right? Let's see. Nope, it looks like it broke the entire game; I can't get these other pieces to move. Okay, some of them move, but overall I would call this a fail, too, because every time I've used other models to do this, this is where we got stuck. It could do a chess game just fine and render out the pieces just fine, but it still has a logic problem.

I think it's really good at analyzing documents, and now it has a large context window of 1 million tokens. So I'm going to give it a huge document. In fact, let's use Claude Opus right here. To do that, I'm going to press the plus sign and upload a file. This is the Nvidia 2025 annual report, and this document is 180 pages. One thing to keep in mind: sometimes you will have to compress large documents if they're PDFs, because there is a file upload limit of 30 megabytes. This one I had to compress; it was 50 megabytes, but all the information is still in the compressed version, all 180 pages.

Now, I haven't tested this before, but here's what we're going to do: we're going to see if it could find a needle in the haystack. I'm going to ask it for director compensation. I'm going to take this person's name over here; we have this information, and it's on page 53 of a 181-page document. Let's see if it could find that for us. Okay, what was Robert's compensation? Okay, 85 in cash. Let's see, the total: $343,828. Yes, $343,828. Wow, that is good. To be able to find something like that, I mean, there's got to be so much text and so many numbers in this, right? And it found exactly the right information. A lot of models I've tested this with didn't even come close to getting these types of things right. And there's always the hallucination problem, right? They could just totally give you the wrong number. So even with Claude, it's definitely worth checking to make sure it's giving you the right information.
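A quick side note on that 30-megabyte limit: if you ever need to compress a big PDF yourself, Ghostscript's pdfwrite device is one common option. This is just a sketch with placeholder filenames, assuming Ghostscript is installed as gs; it's not necessarily how I compressed mine.

```python
# A sketch of shrinking a large PDF below the 30 MB upload limit by
# re-rendering it with Ghostscript (must be installed as `gs`).
# Filenames are placeholders; /ebook downsamples images to ~150 dpi
# while keeping every page and all the text intact.
import subprocess

subprocess.run(
    [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/ebook",   # /screen is smaller, /prepress is larger
        "-dNOPAUSE", "-dBATCH", "-dQUIET",
        "-sOutputFile=nvidia_2025_small.pdf",
        "nvidia_2025.pdf",
    ],
    check=True,
)
```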
Okay, for this test, let's use Claude Sonnet 4. This is a very common way I use Claude: I take a screenshot of things I want to improve. So here is the banner of the website. I'm going to press the plus sign and upload the file. You could always take a screenshot from here, too. Okay, it says the file format is not supported for PNG, but this is exactly how I took screenshots before. Let me try Opus here; maybe they changed that. Okay, that's kind of strange. I don't know why I could upload a screenshot using Opus but Sonnet right now doesn't have that option, and I was using Sonnet 3.7 for this up to this point. Hopefully they change that, because I use this all the time.

Okay, I'm going to say: turn this image into an embed code with inline CSS and JavaScript, so we could add this to our website that way. Here's a quick preview of it in the canvas, but I usually like to just download the HTML file and look at it full screen. Okay, this is actually looking good. Usually it couldn't render out an image, right? Because that was an actual image I had on my website, but it looks like it just recreated it as a graphic. Let me see what that looks like. So this is our website right now, and obviously there's an actual image here and some other things, but I think it did a really good job with the formatting.

Okay, now let's see if it could analyze data and turn it into a dashboard. This is one of the most common ways I use this. Again, I'm going to use Opus, just because the Sonnet screenshot upload is not working. Here's a random Google Analytics export from an old website that I have. I'm going to upload this and just ask it: turn this into a visual dashboard and simplify it; make it easy to share with my team. Let's see what we get with that. Okay, nice. Here we got this dashboard right over here.
Segment 3 (10:00 - 15:00)
So, let me extend this out. And you could actually publish these: if you want to share them with your team, you don't always have to download them; you could press Publish on top. Okay, and this is what it looks like. Wow, this looks a whole lot better than what I was getting with 3.7. That looks really, really good. Total sessions, yeah, all the right information is here, and I'm going to check those against the actual data. It got all the numbers exactly right. Wow, I really like this layout, and it looks to be perfectly responsive and mobile friendly, too. Nice.

Now, the next part is I want to see how good it is at using search with one of the models and how it compiles that data. My prompt here is going to have it compare the top frontier models, which are now Claude 4, Gemini 2.5, and ChatGPT, and we'll see what it comes up with. If you click this right here, web search is on by default now. We're also going to try extended thinking, which you have to turn on manually, but we'll do that in a different prompt that actually requires it; right now, you don't need it. And I would recommend, not for this, but anytime you use models like this, if you want more context, give it access to your Google Drive, your Gmail, and your calendar, so it can pull relevant information and be a better personal assistant for you. For this one, we'll use Sonnet with search, and we'll send this out.

Okay, it took about 15 to 20 seconds, it searched in batches, and it gave me 10 different results. It went through 40 or so websites, and here is what we got. Let me extend this out and see what this table says. Overall leader in performance: Claude 4 Opus, released May 22nd. Okay, that's today. Number one on the LM Arena leaderboard, which ranks all the different top models, is Gemini; well, Claude 4 just came out, so that leaderboard hasn't been updated yet. Coding: Claude Opus is the winner again. Reasoning: Gemini 2.5 Pro. Multimodal capability: 2.5 Pro. Memory and context is a tie at a 1 million token context window, although Gemini does extend to two million, and GPT-4.1 is the one that does 1 million; the regular 4o does not. For tool use, Claude is the winner; really interesting, excellent for Zapier integration. Speed: GPT-4o. I agree with that; that's definitely faster.

And when it comes to cost, let me do a quick comparison here, since the cost it's showing is only related to the API. You could use all these models through a typical $20 membership inside those different apps. As long as you have that subscription to ChatGPT, you get the best model; with Gemini, $20 gets you the best model; and Claude Opus is part of the $20 plan, too. So that's a tie. But I'll do a follow-up prompt to compare just the API cost, as a table. This is also a good way to test how it handles a follow-up prompt in the same chat.

Okay, so for main model pricing, Claude 4 Opus is pretty expensive: $15 input, $75 output per million tokens. If you compare that to Gemini, I mean, wow, that's way more, right? Compared to ChatGPT, way more. So I could see a lot of people still using the older models over here. Haiku is very cost-effective as a smaller model, so it compared it to Flash, but again, Flash is a fraction of the cost of that, too. So right here, its own ranking says that if you are building apps on models like this, Gemini 2.5 Pro is going to win best flagship as far as value goes, and Opus is definitely premium pricing. You'll have to run the comparison for your own application to see if it makes sense, because it is far more expensive.
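To make those API prices concrete, here's a quick back-of-the-envelope calculation. The Opus 4 numbers ($15 in, $75 out per million tokens) are the ones quoted above; the Sonnet 4 numbers ($3 in, $15 out) are launch pricing and worth double-checking against the current price page.

```python
# Back-of-the-envelope API cost estimate. Prices are USD per million
# tokens: Opus 4 as quoted in the comparison above; Sonnet 4 figures
# are launch pricing, so double-check them before relying on this.
PRICES = {
    "claude-opus-4":   {"input": 15.00, "output": 75.00},
    "claude-sonnet-4": {"input": 3.00,  "output": 15.00},
}

def estimate_cost(model, input_tokens, output_tokens):
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 10,000-token prompt with a 2,000-token reply.
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 10_000, 2_000):.3f}")
# claude-opus-4:   $0.300
# claude-sonnet-4: $0.060
```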
Okay, let's go ahead and test out reasoning. I'm going to turn on the best model here, and we are going to use extended thinking, which is only available in the paid plan. If a train leaves Chicago at 3 p.m. traveling 60 miles per hour and another leaves New York City at 4 p.m. going 80, where do they meet? Okay, it asked me a bunch of questions, including, "Do you want me to make standard assumptions?" I'm going to say yes. Okay, so let me just show you: it gives you the thought process, but only a very brief recap of its thinking, and it doesn't tell you exactly how long it was thinking for. I think it did a really good job breaking this down, and it came to a conclusion pretty quickly. The answer is western Pennsylvania, because I asked where the two trains are going to meet based on those speeds. I think this is actually pretty good. I don't know if it's exactly right, because I changed all the numbers within this question.
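As a rough sanity check on that answer, here's the arithmetic, assuming the trains run toward each other on a single route of about 790 miles (roughly the New York to Chicago driving distance; the riddle never states one) and treating both departure times as the same time zone.

```python
# Sanity-check arithmetic for the train riddle. The ~790-mile route
# length is an assumption (the prompt gives no distance), and both
# clock times are treated as one time zone for simplicity.
DISTANCE = 790.0   # assumed miles between Chicago and New York City

head_start = 60 * 1.0      # miles the 3 p.m. train covers before 4 p.m.
closing_speed = 60 + 80    # mph once both trains are moving
hours_after_4pm = (DISTANCE - head_start) / closing_speed

miles_from_nyc = 80 * hours_after_4pm
print(f"Meet ~{hours_after_4pm:.1f} h after 4 p.m., "
      f"~{miles_from_nyc:.0f} miles from NYC")
# -> ~5.2 h (about 9:13 p.m.), ~417 miles from NYC, which lands right
#    around the Ohio/Pennsylvania border, so both "western Pennsylvania"
#    and "northern Ohio" are defensible answers.
```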
Segment 4 (15:00 - 19:00)
So, I ran it against all the other top models to see what I got. Gemini actually got it totally wrong: it didn't even follow the prompt. It says they would meet at 8:39 p.m. Well, that's not what I asked. I didn't ask what time; I asked where, and it did not tell me where. The best ChatGPT reasoning model, o4-mini-high, actually gave a very plausible answer too: northern Ohio. Pennsylvania and Ohio are next to each other, so it could technically be northern Ohio or western Pennsylvania; they both kind of make sense, but this one gave me a very specific answer, roughly 28 miles east of Cleveland, in a very specific town in Ohio. I even tried it with DeepSeek, which thought for 169 seconds. What's interesting about DeepSeek is that you can see all the thinking it shows you before it gives a response. It's obviously a free, open-source reasoning model, DeepSeek R1, but it got as far as "the trains will meet at" and then the formatting broke, so we don't know where the trains meet according to DeepSeek. I tried one more time; the formatting was still broken, and it still didn't give me a location or a city.

Okay, they said it's really good at being an AI agent for us, right? So this is a good prompt to test that: you are a virtual assistant; search three Airbnbs in Austin under $250 a night, summarize the reviews, and make recommendations. Now, a true agent should technically be able to make the reservation, but this is nowhere near that; other applications, like Manus AI, are more agent-like and can take more steps for us. But let's see if it could do this part. Okay, it answered pretty quickly, and I think it did a good job. We got three suggestions, it picked exactly one recommendation, and it explained why with a nice checklist. As a personal assistant, this saves me a ton of time on this type of research. Taking it to the next step, agentic workflows where it could actually book the stay for me, isn't quite there yet, but hopefully they're moving in that direction, like some of these other apps already are.

Okay, well, I guess that's the end of my testing for today; I've already reached my usage limit doing the testing I showed you in this video. All of those tests were my first runs with those prompts. I did not pre-test them or cherry-pick anything; I just wanted to show you my first experience using these models on the Claude website, and those are the results I got today.

I also wanted to show you something we updated on our website, Skill Leap. Skill Leap is a library of AI courses: me and other instructors have over 20 different courses, a live community where we answer every question, and a ton of resources, including prompt libraries and downloadable resources. We recently added learning paths, because a lot of people were asking for something more step-by-step: which course should I watch first, which second? We have two learning paths now and are rolling out five new ones. So if you're new, there's Foundations of Generative AI: each path has multiple courses in order, plus certification, both for individual courses and for completing the whole path.
This path has five different courses, and then we have this other one here. It keeps track of all your progress, shows you the percentage complete, and lets you jump back in anytime. We rolled this out as part of the subscription, and right now we still have a 7-day, 100% free trial, so you could watch courses and make sure it's a good fit; the vast majority of people stick around, which is why we still offer the free trial. We typically roll out three courses per month, we're trying to get to four, and we have two new courses coming out this month. I'll put a link in the description; you could browse the courses even before signing up, just to make sure we have the right stuff for you. We have AI Coding for Entrepreneurs coming up in just a couple of weeks, and a new ChatGPT image generation course, so we move pretty quickly on releasing comprehensive courses when new updates come out. These are typically 20 to 30 lessons, two to five hours each. If you want to try that for yourself, I'll link it below. Thanks so much for watching this one. I'll see you on the next one.