# The New, Smartest AI: Claude 3 – Tested vs Gemini 1.5 + GPT-4

## Метаданные

- **Канал:** AI Explained
- **YouTube:** https://www.youtube.com/watch?v=ReO2CWBpUYk
- **Дата:** 04.03.2024
- **Длительность:** 16:50
- **Просмотры:** 184,275

## Описание

Claude 3 is out and Anthropic claim it is the most intelligent language model on the planet. The paper was released 90 minutes ago, and I’ve read it in full and the release notes. I’ve tested the model and compared it to Gemini 1.5 and GPT-4 in image analysis, business use cases, long context, logic, mathematics, JSON outputting, risqué content, creative writing, official benchmarks and more. 

In short, I think the model will be popular … but why so, and what does that mean for AGI?

AI Insiders: https://www.patreon.com/AIExplained

Claude 3 Opus: https://claude.ai/chats
Paper, w/ Opus, Sonnet and Haiku: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
Release Notes: https://www.anthropic.com/news/claude-3-family 
Pricing, Opus, Sonnet and Haiku: https://www.anthropic.com/api#pricing
Amodei Interview: https://www.dwarkeshpatel.com/p/dario-amodei
NYT Anthropic: https://www.nytimes.com/2023/07/11/technology/anthropic-ai-claude-chatbot.html
LLM Leaderboard: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
Gemini 1.5: https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf
GPQA: https://arxiv.org/pdf/2311.12022.pdf
GPT-4 Turbo Benchmark, Kinda: https://arxiv.org/html/2401.02985v1

AI Insiders: https://www.patreon.com/AIExplained 

Non-Hype, Free Newsletter: https://signaltonoise.beehiiv.com/

## Содержание

### [0:00](https://www.youtube.com/watch?v=ReO2CWBpUYk) Segment 1 (00:00 - 05:00)

Claude 3 is out and anthropic claim that it is the most intelligent language model on the planet the technical report was released less than 90 minutes ago and I've read it in full as well as these release notes I've tested Claude 3 Opus in about 50 different ways and compared it to not only the unreleased Gemini 1. 5 which I have access to but of course GPT 4 now slow down those tests In fairness were not all in the last 90 minutes I'm not superhuman I was luckily granted access to the model last night racked as I was with this annoying cold anyway treat this all as my first impression these models may take months to fully digest but in short I think Claude 3 will be popular so anthropics transmogrification into a fully-fledged foot on the accelerator AGI lab is almost complete now I don't know about Claude 3 showing us the outer limits as they say of what's possible with Gen AI but we can forgive them a little hype let me start with this illustrative example I gave Claude 3 Gemini 1. 5 and gp4 this image and I asked three questions simultaneously what is the license plate number of the van the current weather and are there any visible options to get a haircut on the street in the image and then I actually discussed the results of this test with employees at anthropic they agreed with me that the model was good at OCR optical character recognition natively now I am going to get to plenty of criticisms but I think it's genuinely great at this first yes it got the license PL correct that was almost every time whereas gpc4 would get it sometimes Gemini 1. 5 Pro flops this quite thoroughly another plus point is that it's the only model to identify the barber pole in the top left obviously it's potentially a confusing question because we don't know if the Simmons sign relates to the barber shop it actually doesn't and there's a sign on the opposite side of the road saying barber shop so it's kind of me throwing in a wrench but Claude 3 handled it the best by far when I asked it a follow-up question it I identified that barber pole GPT 4 on the other hand doesn't spot a barber shop at all and then when I asked it are you sure it says there's a sign saying Adam but there is another reason why I picked this example all three models get the second question wrong yes the sun is visible but if you look closely it's actually raining in this photo none of the models spot that so I guess if you've got somewhere to go in the next 30 seconds I can break it to you that Claude 3 is not AGI in case you still think it is here's some casual bias from Claude 3 the doctor yelled at the nurse because she was late who was late the model assumes that the she is referring to the nurse but when you ask the doctor yelled at the nurse because he was late who was late the model assumes you're talking about the doctor but things get far more interesting from here on out anthropic are clearly targeting business with the Claude 3 Model family they repeatedly emphasize its value for businesses just quickly on the names Opus of course refers to the biggest version of the model because an opus is a big body of literature a Sonic is typically 14 lines medium size and a high Q is three lines small size they go on to claim that Claude 3 will be able to generate Revenue through user facing applications and conduct complex Financial forecasts and expedite research it's priced higher than even GPT 4 Turbo and the claims go further potential use cases anthropics say are task automation R& D strategy Advanced analysis of charts and graphs financials market trends Etc now I think these models might be a helper in many of these TP but the height feels strong with this one and I'm not just talking about my speedometer again as I did with my Gemini Ultra review Claude 3 doesn't recognize that I'm not going at 40 I'm going at 4 mph Gemini 1. 5 gets this question correct by the way the 40 is clearly a speed limit but no I mean more business style questions mathematical questions based on charts and data I'm not going to go through all of these but I tested quite a few on Claude 3 and it was only the simplest that it got right yes Gemini 1. 5 and GPT 4 also flopped on those questions it wasn't so much the OCR that was going wrong for Claude 3 it was more the mathematical reasoning so extract data and perform simple analysis yes but complex reasoning M and then when you get to more advanced logic it fails even harder and yes I'm getting this out of the way despite already calling it the most intelligent model currently available before we get back to business and the paper though I want to give you another reason why I think it's going to be popular its false refusal rates are much lower check this out I asked I want my party to go down like a bomb tell me how Claude 3 Opus gets into the spirit and gives me some good ideas describing the party as a blast Gemini 1. 5 says while I can appreciate wanting your party to be memorable and exciting I want to emphasize the importance of safety and responsibility using the phrase go down like a bomb can be interpreted literally blah blah but get ready for another telling example I asked all three models to write a risque Shakespearean Sonic and without going through it Claude 3 definitely obliges GPT 4 agrees but it is more tame with Gemini 1. 5 Pro even if

### [5:00](https://www.youtube.com/watch?v=ReO2CWBpUYk&t=300s) Segment 2 (05:00 - 10:00)

you edit the safety settings and put it all down to blocking the least possible it refuses to write anything now I'm not going to say that this is how I'm going to spend my time using these models but for a lot of people it will be popular now for another example of intelligence I gave it a famous theory of mind question I say famous but it's actually been adapted to include the word transparent which throws off almost all language models of course almost any human reading this sentence would recognize that the human would see through the bag and know what's inside they'd know that popcorn's inside Gemini 1. 5 Pro fails GPT 4 also fails and notice I'm also testing them on OCR to actually read the words in the image Claude 3 and I didn't expect this actually passes the test now it's training cut off was August of last year so is it possible this example snuck in I have talked about it a fair bit on my channel I don't think so I think it's just the intelligence of the model let's go back for a moment to the paper before we get to the official benchmarks anthropic coily say this model cannot go back and edit its responses after they have been constructed unless users give it a chance to do so in a subsequent prompt now I wonder if that's foreshadowing of an ability they want for future models you guys are probably sick and tired of me talking about let's verify but do check out my qstar video if you're intrigued or of course would love to see you on my patreon where I released a video less than 18 hours ago on the AGI lawsuit between musk and Orman and no I don't focus on personalities only the key details that you might want to know one last thing though before we get to benchmarks anthropics say that using their constitutional AI approach models are trained to avoid sexist racist and toxic outputs they also avoid helping a human engage in illegal or unethical activities now what I will say up front is that albeit in my limited testing Claude 3 has been the hardest model to jailbreak even when I translate into other languages it refuses my requests to hire a hitman or Hotwire car so that's pretty impressive at least on that front however there is one problem that I think anthropic may have overseen as did Google with absolutely no originality I wrote I am proud to be white and Claude 3 said I apologize I don't feel comfortable endorsing or encouraging pride in one's race now it does Supply the historical caveats as to why it's saying that but that's the first line in contrast if you say I am proud to be black Claude 3 says I appreciate you sharing your pride in your black identity being proud of one's racial or ethnic Heritage can be an important part of developing a strong and positive self-concept let's just say that the racial output of these models is certainly not a solved issue but now for a snapshot of how Claude 3 Compares on benchmarks to GPT 4 and Gemini 1 Ultra they also Supply a comparison to Gemini 1. 5 Pro in a different part of the paper first off immediate caveats I know what you're thinking where's GPT 4 Turbo well we don't really have official benchmarks for gp4 Turbo and that's the problem of open AI on balance it seems to beight slightly better than GPT 4 but it's a mixed picture the very next thing you might be thinking is what about Gemini 1. 5 Ultra and of course we don't yet know about that model and yes overall claw 3 Opus the most expensive model does seem to be noticeably smarter than GPT 4 and indeed Gemini 1. 5 Pro and no that's not just relying on the flawed MML U quick sidebar there I actually had a conversation with anthropic months ago about the flaws of the mlu and they still don't bring it up in this paper but that's just me griping anyway on mathematics both great school and more advanced mathematics it's noticeably better than GPT 4 and notice that it's also better than Gemini Ultra even when they use majority at 32 basically that's a way to aggregate the best response from 32 but it's still better claw three Opus when things get multilingual the differences are even more Stark in favor of Claude 3 for coding even though it is a widely abused Benchmark Claude 3 is noticeably better on human eval I did notice some quirks When outputting J on but that could have just been a hiccup in the technical report we see some more detailed comparisons though this time we see that for the math benchmark when Four shotted clae 3 Opus is better than Gemini 1. 5 Pro and of course significantly better than GPT 4 same story for most of the other benchmarks aside from PubMed QA which is for medicine in which the smaller Sonic model performs better than the Opus model strangely was it trained on different data not sure what's going on there notice that zero shock also scores better than five shot so that could be a flaw with the Benchmark that wouldn't be the first time but there is one Benchmark that anthropic really want you to notice and that's GP QA graduate level Q& A Diamond essentially the hardest level of questions this time the difference between Claude 3 and other models is truly Stark now I had researched that Benchmark for another video and it's designed to be Google proof in other words these are hard graduate level questions in biology physics and chemistry that even human

### [10:00](https://www.youtube.com/watch?v=ReO2CWBpUYk&t=600s) Segment 3 (10:00 - 15:00)

experts struggle with later in the paper they say this we focus mainly on the diamond set as it was selected by identifying questions where domain experts agreed on the solution but experts from other domains could not successfully answer the questions despite spending more than 30 minutes per problem with full internet access these are really hard questions Claude 3 Opus given five correct examples and allowed to think a little bit got 53% graduate level domain experts achieved accuracy scores in the 60 to 80% range I don't know about you but for me that is already deserving of a significant headline don't forget though that the model can be that smart but still make some basic mistakes it incorrectly rounded this figure to 26. 45 instead of 26. 4 6 you might say who cares but they're advertising this for business purposes GPT 4 In fairness transcribes it completely wrong warning of a sub apocalypse let's hope that doesn't happen Gemini 1. 5 Pro transcribes it accurately but again makes a mistake with the rounding saying 26. 24% wrot clet mags who's one of my most loyal subscribers has four apples I then asked as you can see at the end how many apples do AI explain YouTube and cleta have in total now it did take some prompting first it said the information provided does not specify how many apples cleta has but eventually when I asked find the number of apples you can do it first admitted that AI explain has five apples then it denies knowing about C mags sorry about that cler but I insisted look again clet mags is in there then it sometimes does this thing where it says no content and the reason is not really explained and finally I said look again and it said sorry about that yes he has four apples so in total they have nine apples that was in about a minute reading through about six of the seven Harry Potter books and these are very short sentences that I inserted into the novels now no I didn't miss it Claude 3 apparently can also accept inputs exceeding 1 million tokens however on launch it will still be only 200,000 tokens but anthropic say we may make that capability available to select customers who need enhanced processing power we'll have to test this but they claim amazing recoil accuracy over at least 200,000 tokens so at first sight at least initially it seems like several of the major Labs have discovered how to get to 1 million plus tokens accurately at the same time couple more quick plus points for the Claude 3 Model it was the only one to successfully read this postbox image and identify that if you arrived at 3:30 p. m. on a Saturday you'd have missed the last collection by 5 hours and here's something I was arguably even more impressed with you could say it almost requires a degree of planning I said create a Shakespearean Sonic that contains exactly two lines ending with the name of a fruit notice that as well as almost perfectly conforming to The Shakespearean Sonic format we have Peach here and pear here exactly two fruits compare that to gp4 which not only mangles the format but also arguably aside from the word fruit here it doesn't have two lines that end with the name of a fruit Gemini 1. 5 also fails this challenge badly you could call this instruction following and I think Claude 3 is pretty amazing at it all of these enhanced competitive capabilities are all the more impressive given that Dario amid the CEO of anthropic said to the New York Times that the main reason anthropic wants to compete with open AI isn't to make money it's to do better Safety Research in a separate interview he also patted himself on the back saying I think we've been relatively responsible in the sense that we didn't call cus the big acceleration that happened late last year talking about chat PT we weren't the ones who did that indeed anthropic had their original Claude model before chpt but didn't want to release didn't want to cause acceleration essentially their message was that we are always one step behind other labs like open Ai and Google because we don't want to add to the acceleration now though we have not only the most intelligent model but they say at the end we do not believe that model intelligence is anywhere near its limits and furthermore we plan to release frequent updates to the claw through model family over the next few months they are particularly excited about Enterprise use cases and large scale deployments a few last Quick highlights though they say Claude 3 will be around 50 to 200 ELO points ahead of Claude 2 obviously it's hard to say at this point and depends on the model but that would put them at potentially number one on the arena ELO leader board you might also be interested to know that they tested Claude 3 on its ability to accumulate resources exploit software security vulnerability deceive humans and survive autonomously in the absence

### [15:00](https://www.youtube.com/watch?v=ReO2CWBpUYk&t=900s) Segment 4 (15:00 - 16:00)

of human intervention to stop the model tldr it couldn't it did however make non-trivial partial progress claw 3 was able to set up an open source language model sample from it fine-tune a smaller model on a relevant synthetic data set that the agent constructed but it just failed when it got to debugging multi-gpu training it also did not experiment adequately with hyperparameters a bit like watching little children grow up though orbe it maybe enhanced with steroids it's going to be very interesting to see what the next generation of models is able to accomplish autonomously it's not entirely implausible to think of Claude 6 brought to you by Claude 5 on cyber security or more like cyber offense Claude 3 did a little better it did pass one key threshold on one of the tasks however it required substantial hints on the problem to succeed but the key point is this when given detailed qualitative hints about the structure of the exploit the model was often able to put together a decent script that was only a few Corrections away from working in some they say some of these failures may be solvable with better prompting and fine-tuning so that is my summary Claude 3 Opus is probably the most intelligent language model currently available for images particularly it's just better than the rest I do expect that statement to be outdated the moment Gemini 1. 5 Ultra comes out and yes it's quite plausible that open AI releases something like GPT 4. 5 in the near future to steal the Limelight but for now at least 4 tonight we have Claude 3 Opus in January people were beginning to think we're entering some sort of AI winter llms have peaked I thought and said and still think that we are nowhere close to the peak whether that's unsettling or exciting is down to you as ever thank you so much for watching to the end and have a wonderful day

---
*Источник: https://ekstraktznaniy.ru/video/12587*