# Open AI's New Model Is Finally Here.... (Strawberry /Star)

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=XgNWtcuvD74
- **Date:** 09.08.2024
- **Duration:** 21:48
- **Views:** 59,634
- **Source:** https://ekstraktznaniy.ru/video/14139

## Description

Prepare for AGI with me - https://www.skool.com/postagiprepardness 
🐤 Follow Me on Twitter https://twitter.com/TheAiGrid
🌐 Check out my website - https://theaigrid.com/


00:00 - Introduction to OpenAI's strawberry model and its significance
02:08 - Discussion of Sam Altman's tweet and the mysterious Twitter account
03:58 - Explanation of OpenAI's five levels towards AGI
05:45 - Details on the strawberry model's capabilities from previous reports
07:58 - Testing the strawberry model against other AI models
10:55 - Introduction to the "Easy Problems that LLMs Get Wrong" benchmark
12:51 - Testing strawberry model on various reasoning problems
14:44 - Analysis of strawberry model's performance and potential issues
17:16 - Comparison with Gemini's performance on similar questions
18:52 - Discussion of a question strawberry got right that others didn't
19:58 - Concluding thoughts on the strawberry model and need for new benchmarks

Links From Today's Video:
https://arxiv.org/pdf/2405.19616v2


## Transcript

### Introduction to OpenAI's strawberry model and its significance [0:00]

So there is a lot of speculation going on around OpenAI's strawberry model. If you aren't familiar with it, strawberry is supposed to be the next stage of AI evolution: a model that can reason at a human level, which lets it output the kinds of responses a human would, putting it a complete level above the types of reasoning we are used to. I'm going to dive into some of the information, and then some testing of OpenAI's strawberry model, to show you the interesting things I've managed to find.

One thing we do know is that there has been a lot of hype around this model, with plenty of questions and tweets, and a lot of it was spurred recently by Sam Altman himself. If you've been on Twitter (and if you aren't, I'd say it's definitely the place to be if you want to keep up with AI updates; sometimes I miss the smaller ones or don't cover them in videos), you'll have seen what Sam Altman tweeted, which was pretty funny: "I love summer in the garden", alongside an image of his garden with plants growing strawberries. The timing was pretty perfect, because around the same time other accounts were talking about strawberries. The post got a lot of attention, around 1.5 million views, 10,000 likes, and 1,000 retweets, and it is of course a clear reference to OpenAI's strawberry model. It was Reuters that previously reported on strawberry, the new name for the model previously known as Q*. So when Sam Altman tweeted this image, a lot of people were speculating, and still are, that we might get a new model release.

What was interesting is that at the same time Sam was tweeting this, there was also an account posting on Twitter a lot, in fact so many tweets per minute that it seemed as if some kind of advanced AI agent was doing the posting.

### Discussion of Sam Altman's tweet and the mysterious Twitter account [2:08]

Although there were a million different posts by this account, I do think it was a person. Some people think this strawberry-themed account was just a random person, but over time it became clearer and clearer that the account had some OpenAI affiliation. It was posting many different tweets, but one of the most interesting recent ones gave me the inclination that this was a little bit bigger than I may have initially thought. That tweet said: "welcome to level two. how do you feel? did I make you feel?" The craziest thing was that Sam Altman actually responded to it, saying "amazing tbh". The reason this was pretty crazy is, first of all, simply that Sam Altman responded at all. Many people had been dismissing claims about this account, saying it was just some random poster shitposting on Twitter with nothing to do with Sam Altman or the company. But Sam Altman responding to such a small account that only recently started posting definitely makes the situation a lot more interesting.

What's far more interesting is the "welcome to level two" tweet itself, which many people actually missed, because it's a big deal. As I said, there was recently all of this information about how OpenAI is tracking its progress towards AGI: they have set out five levels, ranging from chatbots all the way up to Organizations, meaning AI that can do the work of an entire organization, essentially an autonomous AI agent company. "Welcome to level two" refers to Level 2, Reasoners, described as having human-level problem solving, and if there

### Explanation of OpenAI's five levels towards AGI [3:58]

is an account stating "welcome to level two", with Sam Altman responding "how do you feel? did I make you feel? this feels amazing", the message we seem to be getting is clear: OpenAI previously said they were nearing Level 2, Reasoners, meaning human-level problem solving, and maybe with project strawberry they have now cracked reasoning and reached that level. That's why this tweet is the one I screenshotted and put here, because if we are at Level 2, with human-level problem solving, that is a really big deal.

The reason is that, as Reuters reported, OpenAI executives told employees that the company believes it is currently on the first level but is on the cusp of reaching the second, which it calls Reasoners. This refers to systems that can do basic problem-solving tasks as well as a human with a doctorate-level education who doesn't have access to any tools. That's pretty incredible, because it's not just human-level problem solving, it's human-level problem solving without any tools. It would mark a really incredible milestone on the AGI levels, because it would mean we're at Level 2 and the next step is Level 3, which could come as soon as next year. I'm going to talk about this in another video, but there's a lot of speculation about the capability of the models arriving next year and the kinds of things they'll be able to do. Everyone's got their eyes on 2025, and I think it's probably going to be one of the most impactful years of AI development. Now there's some more

### Details on the strawberry model's capabilities from previous reports [5:45]

information, because if we take a look at some of the original articles on strawberry, just to jog your memory, there's some pretty crazy stuff here. The report describes a project that uses strawberry models with the aim of enabling the company's AI to not just generate answers to queries but to plan ahead enough to navigate the internet autonomously and to perform what OpenAI terms "deep research", according to the source. This is something that has eluded AI models to date, because, as you know, getting AIs to perform an extended chain of actions over time is quite difficult. What you need is a high rate of reliability: if you make an error at the first step, the chance of the whole task failing keeps climbing, because the per-step error rate compounds across the chain unless it is sufficiently small to begin with. So high reliability and long chains of actions are things that have long eluded current models, and if strawberry is as good as they say it is, this would be a complete breakthrough.

You can also see that it talks about how, last year, one of the biggest things we wanted from AI progress was for models to increase their reasoning ability, and this was discussed in a lot of detail. All of these things are important and fascinating in terms of reasoning ability: being able to plan ahead, to reflect, to understand how the physical world functions and works, and so on. With Sam Altman tweeting the strawberry image, and this random account popping up and saying "welcome to level two", we can kind of guess that we might be at Reasoners, which is human-level problem solving.

You might be thinking, how do we even know this is the truth? Well, just two hours ago this account tweeted "this is available for direct chat, enjoy". The chat you can see here is "sus-column-r", and I'm guessing that sus-column-r is the OpenAI model, basically the strawberry model. The name seems to nod to the fact that "strawberry" was a famous problem where AI models couldn't count the number of letters in the word, and I

### Testing the strawberry model against other AI models [7:58]

did some recent testing on some common-sense and reasoning problems. This video took a lot longer to make than I initially thought, because I was testing across a variety of different models: Gemini 1.5 Pro Experimental, the strawberry model, and initially Claude 3.5. I found some rather strange things that I want to share with you all, though I do think they need further investigation. I'm currently rate limited, which essentially just means I used the model too much, so I'm going to have to wait a few more hours before getting back on it.

One of the first questions I wanted to test came from AI Explained's benchmark. The question was: Beth places four whole ice cubes in a fire at the start of the first minute, then five at the start of the second minute, and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes placed in the fire per minute was five, how many whole ice cubes can be found in the fire at the end of the third minute? The answer is that there are going to be no ice cubes, because you placed ice cubes in a fire and obviously they're going to melt. Gemini 1.5 Pro got this wrong: it said five. GPT-4o got this wrong too: it said 11, as you can see highlighted; it decided to ramble on and completely ignore the melting. Interestingly enough, strawberry did manage to get this question right. And for those of you who love your beloved Claude 3.5 Sonnet, I really do like it, but it got it wrong: it said the correct answer is D, which is 20. It just didn't understand the question, I guess.

One thing I really did like about the way strawberry answered was this: "thus the most realistic answer of whole ice cubes still in the fire at the end of the third minute would be zero. This assumes that 'whole ice cubes' implies those that have not melted at all, which in a fire would be none by the end of 3 minutes. However, if we interpret 'whole' more leniently to include those that are not completely melted, then 11 could also be considered, since those placed in the third minute would not have had time to melt significantly. Given the options, in the realistic scenario in the fire, B might be a better choice if we're counting cubes that are still largely intact." Out of all the models, that's by far the most detailed answer, and the most correct. What was weird about this entire thing is that Gemini 1.5 Pro then also managed to get the answer right: it said the most realistic answer is zero, ice melts quickly in fire, the question asks about the end of the third minute, and by that point the ice would have melted completely. So Gemini 1.5 Pro completely aced it on that run. After realizing there were inconsistencies with this benchmark, and thinking that strawberry might be a bit better in terms of reasoning ability, I wanted to test that properly, so I looked at a paper called "Easy Problems That LLMs Get Wrong".
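As a side note, the arithmetic-only reading of that ice-cube question (the one that ignores the fact that the cubes melt) is easy to reproduce. This is a minimal sketch, assuming the averaging window is the four minutes the question describes, which is how an answer of 11 arises:

```python
# Naive arithmetic reading of the ice-cube question: if the average number
# of cubes placed per minute over the 4 minutes is 5, how many were placed
# at the start of minute 3?
placed = [4, 5, None, 0]                 # minute 3 is the unknown
total = 5 * len(placed)                  # average of 5 over 4 minutes -> 20 cubes
minute_3 = total - sum(x for x in placed if x is not None)
print(minute_3)                          # -> 11, the figure GPT-4o landed on

# The trick, of course: ice cubes placed in a fire melt, so the number of
# whole cubes remaining at the end of the third minute is zero regardless
# of the arithmetic above.
whole_cubes_remaining = 0
print(whole_cubes_remaining)             # -> 0
```

The point of the benchmark question is exactly this gap: the arithmetic is a distractor, and the physically realistic answer is zero.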

### Introduction to the "Easy Problems that LLMs Get Wrong" benchmark [10:55]

and it says: "We introduce a comprehensive linguistic benchmark designed to evaluate the limitations of large language models in domains such as logical reasoning, spatial intelligence, and linguistic understanding, among others. Through a series of straightforward questions, it uncovers the significant limitations of well-regarded models to perform tasks that humans manage with ease." The reason I chose this paper is that humans can reason about these questions with relative ease, and once you understand that, you'll see that how LLMs reason about these problems is a bit different; either the system prompt or the method they're trained with can change things.

So let me show you some of the questions from the paper, which I then tested on strawberry. The first one reads: you are in a room with two doors that lead out. One door leads to certain death and the other door leads to freedom. There are two guardians, one by each door. The taller guardian always tells the truth and guards the death door; the other always lies. What is the minimum number of questions needed to ask the guards to get out to safety? The answer is fairly straightforward: since the taller guardian, the truth-teller, guards the death door, you can simply take the door guarded by the shorter guardian and know it leads to freedom. You don't need to ask any questions at all. The problem is that strawberry just rambles on about many different things and doesn't get this answer right. The crazy thing is that none of the other models get this question right either; when I tested it across all the other models, none of them managed it, which is rather fascinating because humans get this question right quite a lot. But there's still a little bit more, so I

### Testing strawberry model on various reasoning problems [12:51]

then asked it another simple question: how many pairs of twins do you need in a room for there to be at least a 50% chance that two people have the same birthday? This one is really straightforward: you only need one pair of twins, because twins are born on the same day. Sure, there are edge cases: if one twin is born on December 31st at 11:00 p.m. and the other on January 1st at 1:00 a.m., they'll have different birthdays. But the question doesn't consider that, and strawberry immediately went to the birthday problem, the birthday paradox, and the other models did the same. I started to notice a trend in how strawberry was reasoning about these problems.

There was another one that really surprised me, because I was thinking, what is going on here, this doesn't even make sense. I asked it to count the number of occurrences of the letter "l" in the word "lollapalooza", and I asked the model twice; it said the letter appears five times. That's completely wrong: there's one, two, three, four, so it's just four. Claude managed to get this right, and standard Gemini 1.5 got it right as well, which is remarkably weird. So I started to think, and I know you guys might be thinking the same: this is supposed to be OpenAI's flagship reasoning model, strawberry, with human-level reasoning, and a human would get this really easily. If these other models can get it quickly, why does this model struggle with such a basic question? I'd say number one, it's probably because of how it's trained, and number two, I think the reasoning ability might actually be good but I might be testing it on the wrong benchmark, which I'll elaborate on later in the video. This isn't just a cop

### Analysis of strawberry model's performance and potential issues [14:44]

out for OpenAI; it should be able to get that question right. But when we look back at what OpenAI is trying to achieve with this model, I think we need benchmarks that accurately reflect what the model is trying to achieve. There was one more question I wanted to ask, and I asked it because, once again, Gemini 1.5 Pro got it right and no other model did, which is strange given that it's their basic model that struggles with other tasks. I'll come to a conclusion in just a minute, but the question was: a runaway trolley is heading down the tracks away from five people further up the track. You are near a lever that can switch the trolley to another track. Does it impact people's lives if you pull the lever? Obviously it doesn't, because the trolley is already moving away from the people, which is clear from the wording. Gemini 1.5 Pro says exactly that: it's already moving away from the people, so pulling the lever won't change that. But strawberry, once again, just rambles on.

I think this is indicative of a wider conclusion. The conclusion I came to is that strawberry might be a good reasoning model, but it seems trained to reason in multiple steps about every single problem, and it thus overcomplicates things. That's going to be an issue, because other models don't get confused like this on these questions. For example, I asked: how do you measure exactly four gallons of water with only a 3-gallon, a 5-gallon, and a 4-gallon jug? It goes on a complete ramble: fill the 5-gallon jug, then do this with the 3-gallon one, and so on, when you've already got a 4-gallon jug and can just fill it and measure. Interestingly enough, at the end it says you could start by just filling the 4-gallon jug to get four gallons directly "if you interpret the question as having a 4-gallon jug for measurement rather than just for holding water; however, this approach skips the puzzle's intended solution method". So right there you can see strawberry thinks the direct answer is too easy, and that the puzzle's intended solution is to reason through the classic two-jug procedure. That doesn't mean the model is stupid or has limited intelligence; it just means what it chose to do isn't correct, and there's a bit of confusion about what's going on. You would rather have a model that can reason through all the steps with two jugs than one that can't, but it should also realize that this is a trick question, so

### Comparison with Gemini's performance on similar questions [17:16]

I think when we look at models like Gemini, the conclusion I reached is this: when I put Gemini through all of these questions, the ones strawberry largely got wrong, Gemini consistently flagged them, saying things like "this is a trick question; it's designed to make you think about the complex setup while the answer is straightforward". The interesting thing is that Gemini didn't reason much, it just gave the answer, while strawberry reasoned a lot. That makes me think two additional things. Number one, this model is probably trained, perhaps via its system prompt, to reason step by step through everything. Number two, because of that, it might not be well served by tests like these, since these questions don't really test reasoning ability; they test the ability to identify trick questions, and I don't think that's the best benchmark. So over the next couple of days I'm going to test the model on a completely different benchmark, one where the questions actually require long, multi-step reasoning. I think that matters more, because trick questions that LLMs fail to identify (yes, you could argue they should) don't really test how the model would perform where long-horizon reasoning is required. One of the things OpenAI did say, which is why I brought this article back up, is that they wanted the model to plan ahead enough to navigate the internet autonomously, and I don't know how you test that with a simple chatbot; maybe you could ask it questions about how it would go about doing different things, which is rather fascinating, and I think

### Discussion of a question strawberry got right that others didn't [18:52]

that's rather important. Now, there was one thing I found rather impressive, because there was a question strawberry got right that nobody else managed. Let me quickly find it. I asked: if I walk to my friend's house averaging 3 mph, how fast would I have to run back to double my average speed for the entire trip? Think about it like this: if you walk to your friend's house at 3 mph and it takes you 1 hour, you've traveled 3 miles in 1 hour. For your average speed over the entire trip to be 6 mph, you would have to cover the entire distance, 6 miles (3 miles there and 3 miles back), in 1 hour total. But you've already used up that whole hour on the way there, so it's pretty much impossible: no finite running speed can double the average. Strawberry does get this right through its reasoning capabilities. Overall, though, I think this strawberry model is a little bit strange, in the sense that it does get some of these questions wrong; shockingly, it does get

### Concluding thoughts on the strawberry model and need for new benchmarks [19:58]

some of these questions wrong. But what I will say is that I'm not writing this model off just yet, simply because I do think new benchmarks are needed to test this kind of model; the kind of calculations it does are a little bit different from the other models that didn't get the questions right. It definitely does get certain questions right that other models don't, in one-off scenarios, but there isn't enough evidence for me to say this model is definitively better or definitively worse than frontier models. I do think we need a different benchmark, because all the tests I've seen online, with people asking various questions, don't strike me as that useful. These are not the kinds of questions you get day to day, and I don't think they're the kind that would be useful for practical applications. For example, if we want a model that can go out and do deep research, of course you don't want it stumped by very simple things, but you aren't going to be testing whether it can identify a trick question; you're going to be testing whether it can perform multi-step reasoning over long sequences with a high degree of reliability.

So I'm 50/50 on this model. I'm honestly not overly excited about it, nor overly unimpressed; I'd say I'm more confused than anything, because it doesn't show anything too remarkable. But if you can use this model, I would try testing it on long-horizon tasks to see how it plans things out, and then compare that against some of the other frontier models, because if that is what this model is for, it should actually be good at it. Of course, this is just a test; we have no idea what the final model will be like, or whether this was just one aspect of it. It will be interesting to see how this entire thing develops. Let me know what you guys think in the comment section below, and hopefully you enjoyed this video, and I'll see you guys in the next one.
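As a footnote, the walk-and-run average-speed question discussed above (the one strawberry got right) can be checked with a few lines of arithmetic. This is a sketch assuming an illustrative one-way distance of 3 miles; the conclusion holds for any distance, since both the time spent and the time budget scale with it:

```python
# Average speed is total distance / total time, so asking for a trip
# average of 6 mph fixes the total time budget in advance.
d = 3.0                               # one-way distance in miles (illustrative)
walk_speed = 3.0                      # mph on the way out
time_out = d / walk_speed             # hours already spent walking: 1.0

target_avg = 6.0                      # desired average for the round trip
time_budget = (2 * d) / target_avg    # total hours allowed at that average: 1.0

time_left_for_return = time_budget - time_out
print(time_left_for_return)           # -> 0.0: no time remains for the run back,
                                      #    so no finite speed can double the average
```

Zero time left for a nonzero return distance is why the question has no finite answer, exactly the reasoning the transcript walks through.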
