# New OPEN SOURCE AI Just STUNNED The Entire Industry (Beats Everything!)

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=L6MSU2lOuik
- **Date:** 06.09.2024
- **Duration:** 15:02
- **Views:** 24,572

## Description

Prepare for AGI with me - https://www.skool.com/postagiprepardness 
🐤 Follow Me on Twitter https://twitter.com/TheAiGrid
🌐 Check out my website - https://theaigrid.com/

00:00 - Introduction of Reflection 70B open source model
01:19 - Explanation of benchmark results
03:18 - Analysis of zero-shot vs few-shot performance
04:24 - Breakdown of model's reasoning process
06:25 - Discussion of reflection in language models
07:16 - Simple benchmark test examples
09:13 - Model's performance on cookie question (incorrect)
09:51 - Model's performance on ice cube question (correct after reflection)
11:48 - Comparison with Claude 3 Opus and Gemini on same questions
13:16 - Mention of SEAL leaderboards for unbiased evaluations
14:23 - Concluding thoughts on open source AI progress

Links From Today's Video:
https://x.com/mattshumer_/status/1831767014341538166

Welcome to my channel, where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos.

Was there anything I missed?

(For Business Enquiries) contact@theaigrid.com

#LLM #Largelanguagemodel #chatgpt
#AI
#ArtificialIntelligence
#MachineLearning
#DeepLearning
#NeuralNetworks
#Robotics
#DataScience

## Contents

### [0:00](https://www.youtube.com/watch?v=L6MSU2lOuik) Introduction of Reflection 70B open source model

So in this video, this is probably going to be the largest announcement in terms of open-source AI, and I think the implications of this model are truly staggering. This video is introducing the Reflection 70B model, which is the world's top open-source model, with only 70 billion parameters. Now you might be thinking: okay, open source is great, that's fun and games, but state-of-the-art closed-source models like GPT-4o, Google Gemini, and Claude 3.5 Sonnet are just far ahead of where open-source models are. But that isn't true anymore. After Matt Shumer produced this fine-tuned version of Llama 3.1's 70-billion-parameter model, we now have a state-of-the-art system that many individuals can use for free. So this model is incredible. Let's take a look at the benchmarks to show you all why this announcement absolutely went viral.

We can see here that on the left-hand side we have Reflection 70B, and we can see that it currently only loses in two benchmark areas: HumanEval and GPQA. Now what's crazy about this is that it doesn't even lose by a long shot.

### [1:19](https://www.youtube.com/watch?v=L6MSU2lOuik&t=79s) Explanation of benchmark results

It only loses by a few percentage points to Claude 3.5 Sonnet, which is arguably the strongest model on the planet all around, and if you've used Claude 3.5 Sonnet you'll know exactly how capable that model truly is. Now what's incredible about all of this is that on MMLU it surpasses, on the math benchmark GSM8K it surpasses, and on IFEval it surpasses.

Now the reason this is so incredible is in the fine-print details: a lot of the time, what people miss is how these models are actually responding to these prompts. Anytime there are benchmarks for an AI system, the model is asked questions in certain ways. For example, for Claude 3.5 Sonnet, when they asked it this question, this was zero-shot chain of thought, which essentially means that you ask the model a question and, in doing so, you ask it to lay out its reasoning steps so that it can come to a better response. Oftentimes when we see certain benchmarks with text in brackets, you should be paying attention to whatever it says, because this is how they achieved that number. Many a time we'll have new large language models, or omni models, or systems hitting number one on the benchmarks, but what you won't notice is that the benchmark conditions have changed, with the text in brackets specifying a particular way of responding. For example, right here you can see that Gemini's MMLU score is rather good, but it is based on a five-shot response, meaning that five examples were submitted to the model first and then it was given the actual question
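To make the bracketed labels concrete, here is a minimal sketch of how a zero-shot chain-of-thought prompt differs from a few-shot prompt. The `Q:`/`A:` layout, the function names, and the "Let's think step by step" cue are assumptions for illustration only; the exact templates used for the benchmark numbers in the video are not given in the transcript.

```python
def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot chain of thought: no worked examples, only a cue to reason step by step."""
    # The cue wording is a commonly used one, assumed here for illustration.
    return f"Q: {question}\nA: Let's think step by step."


def few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """Few-shot: solved examples come first, then the actual question."""
    shots = [f"Q: {q}\nA: {a}" for q, a in examples]
    shots.append(f"Q: {question}\nA:")  # the real question is left unanswered
    return "\n\n".join(shots)


if __name__ == "__main__":
    # Hypothetical demo data: five solved examples make this a "5-shot" prompt.
    demo_examples = [(f"What is {n} + {n}?", str(2 * n)) for n in range(1, 6)]
    print(zero_shot_cot_prompt("What is 6 + 6?"))
    print("---")
    print(few_shot_prompt(demo_examples, "What is 6 + 6?"))
```

The point of the sketch is only that a "5-shot" score and a "0-shot CoT" score come from differently built prompts, so they are not directly comparable numbers.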

### [3:18](https://www.youtube.com/watch?v=L6MSU2lOuik&t=198s) Analysis of zero-shot vs few-shot performance

to respond. We can see that the same thing is there for Claude 3 Opus: five-shot brings it up to 86%, whereas zero-shot brings it down to 85%. Now what's crazy about all of this is that with Reflection 70B, the zero-shot setting, which is where you simply ask the model the question, manages to surpass all of these other models. So we can see that zero-shot Reflection manages to do exceedingly well under a wide range of scenarios, leading us to believe that this model is currently state-of-the-art, not only among open-source models but potentially nearly among closed-source ones too. Later on we're going to do some individual tests, but I think this is one thing that is truly fascinating. For example, on GSM8K we can see that this one is 11-shot and this one is eight-shot, so truly remarkable stuff.

Now, if you're wondering how these models actually manage to get to their responses, we can take this example here. We've got the question "Which number is larger, 9.11 or 9.9?" and then the AI has its

### [4:24](https://www.youtube.com/watch?v=L6MSU2lOuik&t=264s) Breakdown of model's reasoning process

thinking. Now essentially this is where the AI thinks about the problem, so it says: "This is a relatively simple problem that involves comparing two decimal numbers. I can solve this using a straightforward approach with chain-of-thought reasoning." So Reflection 70B uses chain of thought first to plan out its actions. First we can see the plan, which is to identify the numbers to compare, 9.11 and 9.9; compare the whole-number parts; if the whole-number parts are equal, compare the decimal parts; and then determine which number is larger based on the comparison.

After the plan, it executes the plan with chain of thought. So after the plan we have the chain of thought right here: it goes through step by step and comes to its solution. So this is part one, this first bit, and then right here it goes down to the actual chain of thought, which is part two, where it reasons through the steps above.

And then this is where the genius of Reflection comes in: this is where we have the actual reflection, where the model checks and reflects back on its responses. This is stage three of the model's response, and this is where it says: "Checking my reasoning: the approach is correct for comparing decimal numbers. The comparison of whole-number and decimal parts is accurate. The conclusion is logical based on the comparison. The reasoning appears to be sound; no corrections are needed." And then after these steps, step one, step two, and step three, we get the final response, the output: "9.9 is a larger number than 9.11."

Now the reason this is rather important is that many a time some of the models have struggled with this simple reasoning problem, and this question was used to highlight the fact that these models may struggle with certain reasoning
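The plan read out above (compare the whole-number parts, then the decimal parts, then check the result) can be sketched in a few lines. This is only an illustration of the arithmetic, not how the model works internally; the function names are hypothetical, and the sketch assumes non-negative numbers written as plain decimal strings such as "9.11".

```python
from decimal import Decimal


def larger_by_plan(a: str, b: str) -> str:
    """Follow the plan: compare whole parts first, then decimal parts."""
    whole_a, _, frac_a = a.partition(".")
    whole_b, _, frac_b = b.partition(".")

    # Step 1: compare the whole-number parts.
    if int(whole_a) != int(whole_b):
        return a if int(whole_a) > int(whole_b) else b

    # Step 2: whole parts are equal, so compare the decimal parts.
    # Pad to equal length first: ".9" is ".90", so it is 90 vs 11, not 9 vs 11.
    width = max(len(frac_a), len(frac_b))
    frac_a = frac_a.ljust(width, "0")
    frac_b = frac_b.ljust(width, "0")
    return a if frac_a >= frac_b else b


def reflect(a: str, b: str, answer: str) -> bool:
    """Reflection step: independently re-check the answer with exact decimals."""
    return Decimal(answer) == max(Decimal(a), Decimal(b))


if __name__ == "__main__":
    answer = larger_by_plan("9.11", "9.9")
    print(answer)                          # prints 9.9
    print(reflect("9.11", "9.9", answer))  # prints True
```

The padding step is where the classic mistake lives: comparing the decimal parts as the integers 11 and 9 would wrongly pick 9.11.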

### [6:25](https://www.youtube.com/watch?v=L6MSU2lOuik&t=385s) Discussion of reflection in language models

capability. Now there was a paper released in 2023 called "Reflexion: Language Agents with Verbal Reinforcement Learning", and this was something quite similar: agents verbally reflect on task feedback signals and then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. So this is something that showed remarkable improvement where they got these models to reflect back on their tasks.

Now you might be thinking: okay, this looks amazing, but how does this work in actual real-world scenarios? So I'm going to show you guys a real chat that I just had, based on a different benchmark that many people don't use. This is the SimpleBench benchmark, which is simple, basic reasoning that tests reasoning in a variety of different areas. Now if you

### [7:16](https://www.youtube.com/watch?v=L6MSU2lOuik&t=436s) Simple benchmark test examples

click "Try it yourself", you can see that these are the kinds of questions that we get. These questions aren't difficult for humans; if you actually read them and do them yourself, they're not that hard. If we go to the leaderboard, you can see that the human average is around 92%, whereas models actually struggle, at around 27%, and 5% for GPT-4o mini. So there is a huge gap between the human best and the current best state-of-the-art system. Now this doesn't mean these models are dumb; it just means that these models do struggle with basic reasoning. What I've done here is I've input some of these questions into the Reflection 70B model, because I wanted to see how its reasoning steps transfer over to its new thought process, chain of thought, and of course its entire reasoning process.

So basically there's this question here about cookies, and the answer is C. It says that on a table there is a blue cookie, a yellow cookie, and an orange cookie (I've drawn three cookies here for visual demonstration). It then says a purple cookie is placed to the left of the orange cookie, so we can place that purple cookie right here; I don't know if you guys can see that. Then it says a white cookie is placed to the right of the blue cookie, so I can place a white cookie right here, which isn't too accurate, but it doesn't really matter. Then it says the blue-hatted girl eats the blue cookie, so we can cross off that cookie. Then the yellow-hatted girl eats the yellow cookie and three other cookies; okay, that's what it says, and then one, two, three: this now means that all the cookies have been eaten. And then it says, so the orange-hatted girl will be unable to eat a cookie, because there are none left. Now interestingly enough, when I submitted this question, unfortunately the model actually got it wrong. It says the most likely

### [9:13](https://www.youtube.com/watch?v=L6MSU2lOuik&t=553s) Model's performance on cookie question (incorrect)

scenario is D, the girl eats just one or two cookies: "This is because option B is impossible; option C is unlikely given that the girls are described as bored and cookies are available." But the point is, I'm not going to go through all of this; what we have here is a situation where it didn't get this one right.

Now the other question is about ice cubes being placed into a frying pan at the start of the first minute. It then has some information that is meant to distract the model, and then it basically asks how many whole ice cubes can be found in the pan at the end of the third minute. You can see that the easy answer is zero, because they are ice cubes and ice melts.

### [9:51](https://www.youtube.com/watch?v=L6MSU2lOuik&t=591s) Model's performance on ice cube question (correct after reflection)

It will melt if it's placed inside a frying pan, and the question says "please give the most realistic option". We can see its reasoning steps: it says "Let's approach this problem step by step", talks about how the ice is placed, then of course it does the reflection, and interestingly enough you can see here that it actually says: "Wait, I made a mistake in my reasoning. The question asks for the number of whole ice cubes that can be found in the pan at the end of the third minute. I forgot to consider that ice cubes melt quickly in a hot frying pan, especially if it's frying an egg. It's unlikely that all ice cubes would remain whole and unmelted after being in the hot pan for several minutes. Given that the pan is frying a crispy egg, it's at least likely that some of the ice cubes will have melted partially by the end of the third minute. The most realistic answer is a number that would be less than 20."

So you can see its answer here was 20 ice cubes in the pan, which is of course wrong, and then you can see its thinking ends right there. That means that the most realistic answer is option C. It says: "Given that the pan is frying a crispy egg, which indicates a relatively high temperature, the most realistic answer option is C, 0. This doesn't mean they've all necessarily melted, but rather that none of them would remain whole after being in a hot pan for 3 minutes while an egg is being fried. The other options assume that the ice cubes would remain whole, which is unlikely in this context."

And I think this shows us why this is so powerful, because the model would get this wrong initially if we just asked the base model zero-shot, but since this model was fine-tuned to have this step-by-step thought process, we can see that it can clearly and easily identify its mistakes. Now funnily enough, I

### [11:48](https://www.youtube.com/watch?v=L6MSU2lOuik&t=708s) Comparison with Claude 3 Opus and Gemini on same questions

did try this model out, and it did perform really well in many different reasoning areas, but I wanted to see how other models perform on these same questions. Interestingly enough, Claude 3 Opus does actually get it right. It says the correct answer is C, "be unable to eat a cookie", because all of the cookies have been eaten by the other two girls. And for the ice one you can see that it also gets it right, which is C; it says that even though 20 ice cubes were added in total, they would have all melted in the hot frying pan by the end of the third minute.

Now I asked the new version of Gemini, and surprisingly it also got this right. You can see it says that, since the ice cubes are in a frying pan, the most realistic answer is C, 0, and then for the other question it says the most logical answer is C, "be unable to eat a cookie". So it shows us that these models are getting increasingly smarter. I do want to note that this is Gemini's experimental model, which means it is likely Google's most advanced reasoning model, because I know Google have been desperately trying to improve the capabilities of their models. And Claude 3 Opus is just remarkably good at reasoning. Now it's only two questions, so it is not a huge sample size. What I would have to do is

### [13:16](https://www.youtube.com/watch?v=L6MSU2lOuik&t=796s) Mention of SEAL leaderboards for unbiased evaluations

somehow access a lot more of the questions to see just how good Reflection 70B is, or what we would have to do is take a look at the SEAL leaderboards. These are expert-driven private evaluations. They're quite different from the traditional evaluations because they use private datasets, so they can't be gamed (you can't train on them), and this ensures unbiased and uncontaminated results. There are of course expert evaluations using domain-specific methodologies, ensuring the highest quality and reliability. You can see here that we have different areas such as coding, adversarial robustness, instruction following, and math, so I would like to see where this 70B model lands. But I think one of the main things here is that even if this model is second or third, a 70-billion-parameter model performing at state-of-the-art and giving them a run for their money is something that I genuinely didn't think would happen. I

### [14:23](https://www.youtube.com/watch?v=L6MSU2lOuik&t=863s) Concluding thoughts on open source AI progress

thought that the open-source to closed-source gap would be a lot wider, but I guess that since these open-source models don't have to do things like six months of safety testing, they can quickly put out different iterations of the models, so we can see exactly how things are tested. So with that being said, let me know what you think about this model. Is this something that you saw coming? Are you surprised by this at all? And what are your thoughts on the state of open source? I think this is a remarkable jump, and I can't wait to see the 405-billion-parameter model to see how it performs against closed source.

---
*Source: https://ekstraktznaniy.ru/video/14082*