# New Research Proves AGI Was Achieved...

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=CUwhk5C4bb0
- **Date:** 13.11.2024
- **Duration:** 15:36
- **Views:** 83,257

## Description

Prepare for AGI with me - https://www.skool.com/postagiprepardness 
🐤 Follow Me on Twitter https://twitter.com/TheAiGrid
🌐 Checkout My website - https://theaigrid.com/

0:00 AGI Threshold
0:45 Arc Benchmark
1:38 Benchmark Design
2:23 Test Examples
3:12 MIT Research
4:11 Training Methods
5:10 Search Algorithm
5:57 Human Level
6:38 AGI Path
7:23 O1 Paradigm
8:10 AlphaGo Insights
9:28 Creative Search
10:12 Hanabi Results
11:46 Test Compute
12:52 Human Efficiency
13:54 Altman's View
14:39 Performance Threshold
15:27 Final Thoughts

Links From Today's Video:
https://www.youtube.com/watch?v=eaAonE58sLU 
https://www.youtube.com/watch?v=Kc1atfJkiJU 


Welcome to my channel, where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos.

Was there anything I missed?

(For Business Enquiries)  contact@theaigrid.com

Music Used

LEMMiNO - Cipher
https://www.youtube.com/watch?v=b0q5PR1xpA0
CC BY-SA 4.0
LEMMiNO - Encounters
https://www.youtube.com/watch?v=xdwWCl_5x2s

#LLM #Largelanguagemodel #chatgpt
#AI
#ArtificialIntelligence
#MachineLearning
#DeepLearning
#NeuralNetworks
#Robotics
#DataScience

## Contents

### [0:00](https://www.youtube.com/watch?v=CUwhk5C4bb0) AGI Threshold

Did the AI community just pass the AGI threshold and not even realize it? That's what this paper is aiming to find out. It's called "The Surprising Effectiveness of Test-Time Training for Abstract Reasoning", and it's research coming out of MIT. I think this is going to be a really fascinating paper, because it deals with one of the hardest benchmarks that exists in AI. Most of you know about benchmarks like GSM8K and GPQA, but did you know there is a specific benchmark invented by François Chollet, a senior staff engineer at Google best known for creating the Keras deep learning library in 2015? Take a look at what he says about the ARC-AGI benchmark, because when I show you the results of this recent research, you're going to understand why it matters. What is the

### [0:45](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=45s) Arc Benchmark

ARC benchmark, and why do you even need this prize? Why won't the biggest LLM we have in a year be able to just saturate it? Sure, so ARC is intended as a kind of IQ test for machine intelligence, and what makes it different from most benchmarks out there is that it's designed to be resistant to memorization. If you look at the way LLMs work, they are basically this big interpolative memory, and the way you scale up their capabilities is by trying to cram as much knowledge and as many patterns as possible into them. By contrast, ARC does not require a lot of knowledge at all. It's designed to only require what's known as core knowledge, which is basic knowledge about things like elementary physics, objectness, counting, that sort of thing; the sort of knowledge that any four- or five-year-old possesses. But

### [1:38](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=98s) Benchmark Design

what's interesting is that each puzzle in ARC is novel. It's something that you've probably not encountered before, even if you've memorized the entire internet, and that's what makes ARC challenging for LLMs. So far, LLMs have not been doing very well on it; in fact, the approaches that are working well lean more towards discrete program search. Essentially, if you didn't understand what François Chollet just said there, he's saying that this ARC benchmark he invented is very different from traditional benchmarks, which let LLMs excel even if they've already seen the question. This kind of testing is vastly different, because it means you can't train for this kind of exam; the ability has to be

### [2:23](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=143s) Test Examples

within you to reason and understand. Humans perform at around 85%, but LLMs really do suffer. For those of you who want to see what this test looks like, here it is, and it's actually pretty simple. You can see there are holes in this one, and the yellow area is filled in. In the next one, every object with a hole in the middle is filled in with yellow. Here you can see the same thing: all of those areas would be filled in with yellow, and then of course you do that for the output. The thing is that LLMs struggle with these kinds of tests because they haven't actually seen them before, which means that when it comes to reasoning over problems they haven't seen before, LLMs do struggle. This is what's called the problem of LLMs struggling with things that are out of distribution, and of course, if we are

### [3:12](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=192s) MIT Research

to get to AGI, we need a system that can perform well on these kinds of tests, because that means it will perform well on things it hasn't seen before, which allows it to be a lot more reliable across many different use cases and industries. That's where this research from MIT comes in: "The Surprising Effectiveness of Test-Time Training for Abstract Reasoning". It talks about how language models have shown impressive performance on tasks within their training distribution but often struggle with novel problems requiring complex reasoning, and how the authors investigate the effectiveness of test-time training, which is updating the model parameters temporarily during inference using a loss derived from the input data. Basically, they're saying they've found a way to improve these models significantly, and the results are pretty outstanding, because they surpass human-level reasoning. That's pretty incredible, because it's the first time this has been done on a benchmark that LLMs are traditionally considered

### [4:11](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=251s) Training Methods

to fail at. So this is the kind of thing they did; here you can see a quick example of the test-time training data. I'm not going to bore you with this too much. Essentially, they used a method that is kind of like searching through the possible solutions to a question: they flipped the grids vertically and horizontally, and they also used a leave-one-out scheme. For example, say you were trying to predict which number comes next in the sequence 2, 4, 6; of course the answer would be 8. They would then take 4 and 6 and predict which number comes before them, which is of course 2; then they'd take 2 and 6 and ask what comes between them, which would of course be 4. They looked at different combinations like this to predict what comes next, and this variation gave their procedure a way to search over the space of possible solutions. After generating multiple predictions from these transformed versions, a hierarchical voting method aggregates these
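The augmentation scheme described above (geometric flips plus a leave-one-out split of the demonstration pairs) could be sketched roughly like this. It's an illustrative guess at the shape of the procedure, not the paper's actual code, and all function names are hypothetical.

```python
def flip_h(grid):
    """Flip a grid left-right."""
    return [row[::-1] for row in grid]

def flip_v(grid):
    """Flip a grid top-bottom."""
    return grid[::-1]

def leave_one_out(pairs):
    """Turn N demonstration pairs into N pseudo-tasks: hold one pair
    out as the 'test' example and keep the rest for training."""
    tasks = []
    for i in range(len(pairs)):
        train = pairs[:i] + pairs[i + 1:]
        held_out = pairs[i]
        tasks.append((train, held_out))
    return tasks

def augment(pairs):
    """Apply identity and both flips to every (input, output) pair,
    yielding one transformed copy of the task per transform."""
    transforms = (lambda g: g, flip_h, flip_v)
    return [[(t(x), t(y)) for x, y in pairs] for t in transforms]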

### [5:10](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=310s) Search Algorithm

predictions. The model uses intra-transformation voting first, followed by global voting, to choose the most consistent and likely correct answer. They also mention that they used self-consistency to validate predictions across the transformed inputs, ensuring the chosen answer is the one that appears most frequently across the variations; this resembles a search for agreement, or consistency, across the outputs. Now, you might think, okay, it's all well and good having all of this and searching over the possibilities, but what were the results of the study? The crazy thing, and the reason some people are now claiming we have slowly approached AGI, is that we're basically the frog in boiling water. That's an analogy: if you put a frog in extremely hot water, it's going to jump straight out, but if you put a frog in
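A minimal sketch of what a two-stage "intra-transformation, then global" vote could look like, assuming predictions are grouped by the transformation that produced them. The names are hypothetical and this is not the paper's code.

```python
from collections import Counter

def vote(candidates):
    """Majority vote; ties broken by first-seen order."""
    return Counter(candidates).most_common(1)[0][0]

def hierarchical_vote(predictions_by_transform):
    """Two-stage vote: first pick the most frequent answer within each
    transformation's samples, then vote globally across the
    per-transform winners."""
    stage1 = [vote(preds) for preds in predictions_by_transform]
    return vote(stage1)
```

Picking the answer that recurs most often across independently transformed inputs is exactly the self-consistency idea described above: agreement across variations stands in for a correctness check.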

### [5:57](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=357s) Human Level

water that is slowly getting hotter and hotter, it won't realize and will eventually boil. That's apparently what has happened today, because they state: we get state-of-the-art public validation accuracy of 61.9%, matching the average human score, and our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in these models. What's crazy about this, guys, is that this is of course state of the art, and it's one of the first times a system has reached the human-level score, which is completely unheard of on the benchmark that is supposed to tell us whether or not we have AGI. Now, of course, some

### [6:38](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=398s) AGI Path

people would argue that this is a kind of abstract reasoning test that doesn't really determine whether we have AGI or not, because OpenAI's definition of AGI is an autonomous system that outperforms humans at most economically valuable work. There are various definitions, but I do think that if we apply the same kinds of methods to different models, we can understand how to make these systems a lot more accurate and then translate that into valuable work. One of the things here is really interesting, because what it's showing us is that there is quite likely a clear path to AGI, and now a lot of the things we've seen are going to make sense. So let me explain exactly what I'm talking about. One of the things everyone is familiar with by now is quite likely the o1 paradigm, and

### [7:23](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=443s) O1 Paradigm

this paper was quite like the o1 paradigm, because OpenAI's o1 model also searches during inference time. The crazy thing is that we don't actually know what OpenAI's o1 models are doing at inference time, because their reasoning tokens are hidden from us in order to protect OpenAI's moat. What we do know is that as test-time compute increases, so as you allow the model to think for longer, its ability to get higher scores on benchmarks and reason more effectively increases. That's exactly what we saw with this paper, which shows a six-times improvement using only an 8B-parameter LLM. What's crazy about the o1 paradigm is what it reveals about some of the things we've known about AI before. Do you remember AlphaGo, and what the creators of AlphaGo were saying about the future of LLMs literally just one year ago?

### [8:10](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=490s) AlphaGo Insights

I think that's on the right track. I think these foundation models are world models of a kind, and to do really creative problem-solving you need to start searching. So if I think about something like AlphaGo and the famous move 37, where did that come from? All the data it had seen of human games, or something like that? No, it didn't. It came from the model identifying a move as being quite unlikely, but possible, and then, via a process of search, coming to understand that it was actually a very good move. To get real creativity, you need to search through spaces of possibilities and find these sort of hidden gems; that's what creativity is. I think current language models don't really do that kind of thing. They really are mimicking the data; they are mimicking all the human ingenuity they have seen in all this data coming from the internet, which is originally derived from humans. These models can blend things; they can do, you know, Harry Potter in the style of a Kanye West rap or something, even though that's never happened. But if you want a system that can go truly beyond that, and not just generalize in novel ways, then to be truly

### [9:28](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=568s) Creative Search

creative, that is, not just blending existing things, requires searching through a space of possibilities and finding the hidden gems that are sort of hidden away in there somewhere, and that requires search. So I don't think we'll see systems that truly step beyond their training data until we have powerful search in the process. That's incredible, because that's exactly what we're seeing with o1, and of course today it's exactly what we're seeing with "The Surprising Effectiveness of Test-Time Training for Abstract Reasoning", on the kind of benchmark that LLMs completely fail at. Looking at these methods, it's clear that when you use these kinds of search methods and techniques, you're able to push that benchmark even higher. Now, Shane Legg

### [10:12](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=612s) Hanabi Results

from Google DeepMind wasn't the only person who spoke about this. If you're wondering, we actually have information from people who worked on the o1 model. Look at what one of them says about the cooperative card game Hanabi, and how a drastic performance increase, due of course to search, led to something they literally couldn't believe. This is what you would get by adding this search algorithm to the different bots. If you take this handcrafted heuristic bot that was only getting 28%, and add the simplest search imaginable, where you just do a bunch of rollouts for all the different actions you could take and then pick the one with the highest expected value, that boosts your performance to nearly 60%, which was beating all the previous deep-learning bots just out of the box. This was using a single CPU core at test time, for about a second. And the beautiful thing is that you can add this on top of all the other deep-learning bots, so if you added it to the latest, greatest deep-learning bot, you would boost the performance even further, to around 72%. And that was only doing search for a single player; if you did it for both players, that's the green bars, and you can see the performance went up even more. I should also point out that the upper bound for this game is not 100%, because there are some deal-outs that you just cannot win, so the top performance possible is, I think, maybe 90%, and you can see we're quickly saturating performance in this domain. When my teammates and I at FAIR got this result, my teammate literally thought it was a bug, because it was just unimaginable that you do this simple thing and the performance jumps from 28% to state of the art at 58%.
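The "simplest search imaginable" described in that quote, rolling out each legal action with a base policy and taking the one with the highest average return, can be sketched like this. It's a toy illustration, not FAIR's Hanabi code; the `simulate` function standing in for the base policy's playout is an assumption.

```python
def rollout_value(state, action, simulate, n_rollouts=100):
    """Average return of taking `action` in `state`, then letting the
    base policy play out the rest of the game via `simulate`."""
    return sum(simulate(state, action) for _ in range(n_rollouts)) / n_rollouts

def one_step_search(state, actions, simulate, n_rollouts=100):
    """One-step rollout search: estimate each action's expected value
    by rollouts and pick the best one."""
    return max(actions, key=lambda a: rollout_value(state, a, simulate, n_rollouts))
```

The point of the quote is that even this shallow, one-ply search on top of a fixed policy was enough to leapfrog the best learned bots of the time.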

### [11:46](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=706s) Test Compute

So this is why there is a clear trend: we're going to move towards an era where test-time compute and train-time compute are both used to let these AI systems search, whichever mechanism they decide to use, because there are a variety of different ways you can search with an AI system. It's quite clear that this is going to be the next method a lot of future systems use in order to unlock advanced reasoning and the ability to handle things that are out of distribution. Now, search is incredible, but it got me thinking, because when we compare humans with AI systems, sure, you can have an AI system that searches over a thousand possibilities, or even ten thousand, but what happens if that search gets even more sample-efficient? Take a look at what Demis Hassabis says, because this is exactly what I'm thinking about. Sure, we can get superhuman AI systems, but that kind of search just isn't as efficient as humans are. A brute-force system would maybe look at millions of possible moves for every decision it's going to make. AlphaZero and AlphaGo

### [12:52](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=772s) Human Efficiency

looked at around tens of thousands of possible positions in order to make a decision about what to move next, but a human grandmaster, a human world champion, probably only looks at a few hundred moves, even the top ones, in order to make a very good decision about what to play next. That suggests that, obviously, the brute-force systems don't have any real model beyond heuristics about the game; AlphaZero has quite a decent model of the world; but top human players have a much richer, much more accurate model of Go or chess, and that allows them to make world-class decisions on a very small amount of search. So I think there's a tradeoff there: if you improve the models, then your search can be more efficient, and therefore you can get further with your search. Now, when you start to understand that, this all makes sense: these AI systems are basically searching through a variety of different possibilities, with a more varied output set via the temperature,

### [13:54](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=834s) Altman's View

and this means we're able to get a lot of different candidate solutions with a lot more variety, thus allowing us to reach the right answer more often; and of course you then train on the thought patterns that led to those right answers. You might be thinking, okay, since this is now the case, this is probably why Sam Altman said in a recent interview that they know exactly what they need to do in order to get to AGI: "this is the first time ever where I felt like we actually know what to do." And then of course remember that in OpenAI's blog post about o1, they spoke about how, when o1 was allowed 10,000 submissions per problem, the model achieved a score of
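"Temperature" here refers to how sampling from the model's output distribution is flattened or sharpened: higher temperature gives more varied candidate answers, lower temperature concentrates on the most likely one. A minimal sketch of temperature sampling over raw scores (illustrative only, not any lab's actual decoding code):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature).
    High temperature flattens the distribution (more diverse samples);
    low temperature makes it near-deterministic."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # inverse-CDF sampling
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

Sampling many diverse candidates this way, then keeping the ones that turn out to be correct, is the loop the video is describing: diversity at generation time, then training on the reasoning traces that worked.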

### [14:39](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=879s) Performance Threshold

coming out of OpenAI isn't hype anymore. It does seem like they've got the roadmap to AGI, which might just be a combination of different search tactics, plus refining how they make that search even more efficient. It will be interesting to see what happens with the o2 model, as Sam Altman is predicting it gets 105% on GPQA; he also predicted that it will saturate a lot of these benchmarks, and a lot of other companies' research is basically supporting the claims for the o1/o2

### [15:27](https://www.youtube.com/watch?v=CUwhk5C4bb0&t=927s) Final Thoughts

paradigm. I'd love to know your thoughts on whether or not we have achieved AGI in, I guess, the out-of-distribution sense, but it will be interesting to see where things go from here.

---
*Source: https://ekstraktznaniy.ru/video/13754*