# Meta's AI Team Just Revealed Their Secret To AGI

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=xvnlHzvlsGo
- **Date:** 20.10.2024
- **Duration:** 19:38
- **Views:** 45,855
- **Source:** https://ekstraktznaniy.ru/video/13962

## Description

Prepare for AGI with me - https://www.skool.com/postagiprepardness 
🐤 Follow Me on Twitter https://twitter.com/TheAiGrid
🌐 Check out my website - https://theaigrid.com/

00:00:00 Yann's talk
00:00:23 Human-level AI
00:01:11 AI requirements
00:01:54 Current limitations
00:02:25 Proposed architecture
00:03:47 Moravec paradox
00:05:27 Data comparison
00:07:26 Visual data
00:09:02 Objective-driven AI
00:11:33 JEPA architecture
00:12:57 Video prediction
00:15:31 Future implications
00:17:49 AGI timeline
00:19:25 Expert disagreement

Links From Today's Video:
https://www.youtube.com/watch?v=4DsCtgtQlZU&t=361s&pp=ygUZWWFubiBsZWN1biBodW1hbiBsZXZlbCBhaQ%3D%3D

Welcome to my channel where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos.

Was there anything I missed?


## Transcript

### Yann's talk [0:00]

Yann LeCun has given a recent talk that is probably one of the most insightful talks recently given in AI, because it covers the future of artificial intelligence and how we're actually going to get to AGI. He talks about the timelines for AGI and the architectures that we need, and of course he starts this conversation with what the future of AI assistants is going to look like. Okay, so I'm going to talk about

### Human-level AI [0:23]

human-level AI, or how we get there, and how we are not going to get there as well. So first of all, we do need human-level AI, because there is a future in which most of us will be wearing smart glasses or other types of devices, and we'll be talking to them, and those systems will host assistants, maybe not just one, maybe a whole collection of them. And what that will cause is that all of us will have basically a staff of smart virtual people working for us, so it's like everybody would be a boss. This is where Yann LeCun actually talks about what we need in order to get to advanced machine intelligence; some could

### AI requirements [1:11]

call that AGI or ASI. But I think it's really important that we look at the things that we currently don't have, things like persistent memory and a bunch of other things; for future systems that are really smart, on the level of artificial super intelligence, these are going to be the base level of things that we're really going to need. Just not of real humans. And we need to build this for basically amplifying human intelligence, making people more creative, more productive, and everything. But for this we need machines that understand the world, that can remember things, that have intuition, have common sense, things that can reason and plan to the same level as humans. And despite

### Current limitations [1:54]

what you might have heard from some of the most enthusiastic people, current AI systems are not capable of any of this. So that's what we need: systems that learn to basically model the world, that have mental models of how the world works. Every animal has one; your cat certainly has such a model, more sophisticated than any AI system ever built or devised. Systems that have persistent

### Proposed architecture [2:25]

memory, which current LLMs don't have; systems that can plan complex action sequences, which is not possible with LLMs today; and systems that are controllable and safe. So I'm going to be proposing an architecture for this that I call objective-driven AI. I wrote kind of a vision paper about this that I posted about two years ago, and a lot of people at FAIR are basically working towards implementing that plan. FAIR used to have a combination of long-term, blue-sky research and more applied projects, but Meta, a year and a half ago, created a product division called GenAI focused on AI products, and they do applied R&D. So now FAIR has been sort of redirected towards the longer-term, next-generation AI systems; we don't do LLMs, basically. Next, this is where Yann LeCun actually speaks about how we're basically missing something, because we keep running into the Moravec paradox, where things that are easy for humans are extraordinarily difficult for computers, while things that computers really excel at, like mathematics and advanced

### Moravec paradox [3:47]

calculations, are things that humans really do struggle with. So we need a way to solve that by tackling this with a different method. We're still missing something big to reach human-level intelligence, and I'm not necessarily talking about human-level intelligence here: even your cat or your dog can do amazing feats that are still completely out of reach of current AI systems. How is it that any 10-year-old can learn to clear up the dinner table and fill up the dishwasher, and can learn this in one shot, right? There's no need to practice or anything. A 17-year-old can learn to drive a car in about 20 hours of practice; we still don't have level-five self-driving cars, and we certainly don't have household robots that can clear up the dinner table and fill up the dishwasher. So we're really missing something big, right? Otherwise we would be able to do those things with AI systems. So we keep bumping into this thing called the Moravec paradox, which is that things that appear trivial to us, that we don't even consider intelligent, seem to be really difficult to do with machines, but high-level, complex, abstract thinking and manipulating language seem to be easy for machines, as do things like playing chess and Go. Next we have one of the most fascinating pieces of data, no pun intended. This is where Yann LeCun actually talks about how our world models need to be trained on a lot more data than we think, and he basically says that it would take something like 350,000 years for a human to read the amount of data an LLM is trained on, at around 250 words a minute. And of course

### Data comparison [5:27]

you can see that a human child is awake for about 16,000 hours, and that's already more data than any LLM has seen, despite these bigger training runs. So he's basically saying: look, we think we've got a lot of data, but when we compare it to systems that are actually doing really well in the world, like humans and animals, and we count every single image, every frame, as a piece of data, that's actually so much data that we're just going to need a lot more of it, and we need to think about how we're even going to do that. Okay, so maybe one reason for this is the following. An LLM is typically trained on 20 trillion tokens; a token is about three quarters of a word on average for a typical language, so that's 1.5 × 10^13 words. Each token is about three bytes, typically, so that's 6 × 10^13 bytes. It would take on the order of a few hundred thousand years for any of us to read through this, right? That's the totality of all the text available publicly on the internet, essentially. But then consider a human child: a four-year-old has been awake a total of 16,000 hours, which, by the way, is about 30 minutes' worth of YouTube uploads. We have 2 million optic nerve fibers coming to our brain; each fiber roughly carries about 1 byte per second, maybe it's half a byte per second, some estimates say it's like 3 bits per second; it doesn't matter, it's an order-of-magnitude estimate. So that data volume is about 10^14 bytes, roughly the same order of magnitude as the LLM. So in four years a child has seen as much visual data as the biggest LLMs trained on
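The arithmetic behind this comparison is easy to sanity-check. Below is a quick back-of-the-envelope calculation using the figures from the talk (20 trillion tokens, roughly 3/4 of a word and 3 bytes per token, 16,000 waking hours, 2 million optic-nerve fibers at about 1 byte/s); the reading pace of 250 words per minute for 8 hours a day is my own assumption:

```python
# Back-of-the-envelope check of the data-volume comparison from the talk.
# Reading pace (250 wpm, 8 h/day) is an assumption, not from the talk.

# LLM training corpus
tokens = 20e12                 # ~20 trillion tokens
words = tokens * 0.75          # ~3/4 word per token -> 1.5e13 words
text_bytes = tokens * 3        # ~3 bytes per token  -> 6e13 bytes

# Time for a human to read the corpus
minutes = words / 250                    # at 250 words per minute
years = minutes / 60 / 8 / 365           # reading 8 hours a day
print(f"reading time: ~{years:,.0f} years")

# Visual data reaching a four-year-old's brain
awake_seconds = 16_000 * 3600            # 16,000 waking hours
visual_bytes = 2e6 * 1 * awake_seconds   # 2M fibers at ~1 byte/s
print(f"text: {text_bytes:.1e} bytes, visual: {visual_bytes:.1e} bytes")
```

Both totals land in the same ballpark (about 10^14 versus 6 × 10^13 bytes), which is the "same order of magnitude" point LeCun is making, and the reading time comes out in the hundreds of thousands of years.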

### Visual data [7:26]

the entire publicly available text on the internet. So that tells you a number of things. It tells you, first, that we're never going to reach anything close to human-level intelligence by just training on text; it's just not going to happen. Then the counter-argument is: okay, but visual information is very redundant. Well, first of all, this one byte per second per optic nerve fiber is already a 100-to-1 compression ratio compared to the photoreceptors you have in your retina. We have on the order of 60 to 100 million photoreceptors in our retina, and that gets compressed, by the neurons in front of the retina, down to 1 million nerve fibers, so there is already a 100-to-1 compression. Then it gets to the brain, where it's expanded by a factor of 50 or something like that. So I'm measuring compressed information, but it's still very redundant, and redundancy is actually what self-supervised learning requires. Self-supervised learning will only learn something useful from redundant data; if the data is maximally compressed, it's essentially random, and you can't learn anything. You need redundancy to be able to learn the underlying structure of the data. So we're going to have to train systems to learn common sense and physical intuition by basically watching videos, or by living in the real world. Next is where we have Yann LeCun's objective-driven AI, and this is

### Objective-driven AI [9:02]

essentially the main architecture that he believes will lead to artificial general intelligence. Now, this is quite a different architecture compared to current standard LLMs, and even quite different from o1-style reasoning, considering it's an entirely new system. I'm going to use a simplified breakdown, because Yann LeCun does talk about this for 10-plus minutes, and I've got to be honest, it is a lot. So basically, instead of just reacting to data the way current AI systems (LLMs) do, responding based on patterns, objective-driven AI works more like a thinking process: it would allow the AI to imagine different possible future scenarios and make plans based on them. The reason this is truly important is that the goal is to move beyond AI that can only perform specific tasks, like predicting the next word in a sentence, and towards AI that can figure out how to achieve goals in new situations, even ones it has never faced before, which is something current AI has a really big problem doing. How this objective-driven AI works is that the AI has a world model, essentially a mental representation of how the world works; it then combines this world model with goals (objectives) and optimizes its actions to achieve those goals while considering constraints, like avoiding danger. Instead of just going through preset actions like following a script, it can adjust and adapt based on what it learns or what changes in the environment, which is much more like how humans plan. I've added a graphic by Google Gemini that shows the key differences between LLMs and objective-driven AI; I think it's a useful graphic you might want to screenshot, because it simplifies the understanding. Next we have the V-JEPA architecture. This is something that was actually open-sourced earlier this year, around February; it's something Meta are openly trying to build upon with the open-source community and are still developing. Basically, what they're trying to do is get a system that can predict things as efficiently as humans: humans don't have to do things millions of times to get them right, they can do things a few times and implicitly understand exactly what's going on, and that's what V-JEPA is doing. So I'm going to play for you this first video by Meta, which gives a really simple picture of what's going on, and then you're going to hear Yann LeCun actually talk about why generative architectures don't work for predicting certain things, which is really interesting, because I think the space needs this kind of input; I think once we
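The objective-driven loop described above (a world model imagining futures, a goal, and constraints) can be sketched as a tiny model-predictive-control-style planner. The one-dimensional world, the cost terms, and the random-shooting search below are my own illustrative assumptions, not Meta's design:

```python
import random

# Toy objective-driven planning loop: imagine futures with a world model,
# score them against a goal plus a constraint, act on the best plan.
# The 1-D world, costs, and random-shooting search are illustrative only.

GOAL = 4.0        # objective: reach position 4
DANGER = 2.0      # constraint: avoid the region around position 2

def world_model(state, action):
    """Predict the next state for an imagined action."""
    return state + action

def cost(trajectory):
    """Final distance to the goal, heavily penalizing steps near danger."""
    terminal = abs(trajectory[-1] - GOAL)
    penalty = sum(1.0 for s in trajectory if abs(s - DANGER) < 0.5)
    return terminal + 10.0 * penalty

def plan(state, horizon=5, candidates=500):
    """Imagine many action sequences; keep the lowest-cost one."""
    best_seq, best_cost = None, float("inf")
    for _ in range(candidates):
        seq = [random.uniform(-2, 2) for _ in range(horizon)]
        traj, s = [], state
        for a in seq:
            s = world_model(s, a)
            traj.append(s)
        c = cost(traj)
        if c < best_cost:
            best_seq, best_cost = seq, c
    return best_seq, best_cost

random.seed(0)
actions, c = plan(state=0.0)
print(f"planned actions: {[round(a, 2) for a in actions]}, cost {c:.2f}")
```

Nothing here is learned; the point is the shape of the loop: predictions happen inside the world model, and the objective and constraints are explicit, which is what makes the behavior steerable.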

### JEPA architecture [11:33]

start to criticize ideas, I think that's how we can actually get to some kind of improvement. Today, machines require thousands of examples and hours of training to learn a single concept. The goal with JEPA, which stands for Joint Embedding Predictive Architecture, is to create highly intelligent machines that can learn as efficiently as humans. V-JEPA is pre-trained on video data, allowing it to efficiently learn concepts about the physical world, similar to how a baby learns by observing its parents. It's able to learn new concepts and solve new tasks using only a few examples, without full fine-tuning. V-JEPA is a non-generative model that learns by predicting missing or masked parts of a video in an abstract representation space. Unlike generative approaches that try to fill in every missing pixel, V-JEPA has the flexibility to discard irrelevant information, which leads to more efficient training. To allow fellow researchers to build upon this work, Meta are publicly releasing V-JEPA; they believe this work is another important step in the journey towards AI that's able to understand the world, plan, reason, predict, and accomplish complex tasks. You cannot predict which word is going to follow a sequence of words, but you can produce a probability distribution over all possible words in the dictionary; when it's video frames, though, we do not have a good way to represent probability distributions over video frames, and in fact it is
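The core move here, predicting in an abstract representation space instead of pixel space, can be sketched in a few lines. The linear encoders and tiny dimensions below are my own toy assumptions, not the actual V-JEPA design:

```python
import numpy as np

# Toy joint-embedding sketch: encode the corrupted input x and the
# target y, then predict y's *representation* rather than its pixels.
# Linear encoders and tiny sizes are illustrative assumptions only.

rng = np.random.default_rng(0)
D_PIX, D_REP = 64, 8                    # raw "pixel" dim vs abstract rep dim

frame = rng.normal(size=D_PIX)          # y: the full future frame
context = frame.copy()
context[32:] = 0.0                      # x: masked/corrupted version of y

enc_x = rng.normal(size=(D_REP, D_PIX)) / np.sqrt(D_PIX)    # context encoder
enc_y = rng.normal(size=(D_REP, D_PIX)) / np.sqrt(D_PIX)    # target encoder
predictor = rng.normal(size=(D_REP, D_REP)) / np.sqrt(D_REP)

s_x = enc_x @ context                   # representation of x
s_y = enc_y @ frame                     # representation of y
loss = np.mean((predictor @ s_x - s_y) ** 2)   # train on THIS, not pixels

# A generative model would have to reproduce all 64 pixel values; the
# joint-embedding route only has to match an 8-number abstract summary,
# so irrelevant detail (textures, exact appearances) can be discarded.
print(f"prediction target shrank from {D_PIX} to {D_REP} numbers")
```

The point of the sketch is the loss: it is computed between two embeddings, never between predicted and true pixels.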

### Video prediction [12:57]

completely impossible. If I take a video of this room, right, I take a camera, I shoot that part, and then I stop the video and ask the system to predict what's next in the video, it might predict that there's going to be the rest of the room: at some point there's going to be a wall, there are going to be people sitting, the density is probably going to be similar to what's on the left. But it cannot possibly predict, at the pixel level, what all of you look like, what the texture of the wall looks like, the precise size of the room, all things like that. There's no way you can predict all those details accurately. So the solution to this is what I call joint embedding predictive architectures, and the idea is to just give up on predicting pixels. Instead of predicting pixels, let's learn an abstract representation of what goes on in the world, and then predict in that representation space. So that's the architecture: joint embedding predictive architecture. You have two embeddings: take X, the corrupted version, and run it through an encoder; take Y; and then train the system to predict the representation of Y from the representation of X. Now the question is how you do this, because if you just train a system like this using gradient descent and backpropagation to minimize the prediction error, it's going to collapse: it's going to learn a representation that is constant, and now it becomes super easy to predict, but it's not informative. So that's the difference I want you to remember: the difference between generative architectures that try to reconstruct (predictors, autoencoders, generative architectures, masked autoencoders, whatever) and the joint embedding architectures, where you make predictions in representation space. The future, I think, is in those joint embedding architectures. We have tons of empirical evidence that, to learn good representations of images, the best way to do it is to use joint embedding architectures; all attempts at trying to learn representations of images using reconstruction are bad, they don't work very well. There were huge projects on this, and claims that they work, but they really don't; the best performance is obtained with the architecture on the right. Next, what is fascinating is that we get a first glimpse of what will happen once these systems are truly here. This is where Yann LeCun gives his ideas and opinions on what the future is going
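The collapse problem LeCun mentions is easy to demonstrate numerically: a constant (here all-zero) encoder drives the prediction error to exactly zero while carrying no information at all, which is why joint-embedding training needs extra machinery (regularization, or asymmetric/EMA target encoders, as in methods like BYOL and I-JEPA) beyond plain gradient descent on the prediction error. The toy linear setup below is my own illustration:

```python
import numpy as np

# Why naively minimizing the embedding prediction error collapses:
# a constant encoder achieves zero loss but zero information.
# Toy linear encoders; illustration only.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))               # corrupted inputs
Y = X + 0.1 * rng.normal(size=(100, 16))     # clean targets

def loss(W):
    """Prediction error between the two embeddings under encoder W."""
    return np.mean((X @ W - Y @ W) ** 2)

W_useful = rng.normal(size=(16, 4))
W_collapsed = np.zeros((16, 4))              # constant representation

print(loss(W_useful), loss(W_collapsed))     # collapsed loss is exactly 0.0
# The collapsed encoder "wins" on the objective, yet its embeddings have
# zero variance, i.e. they say nothing about the input:
print(np.var(X @ W_collapsed))               # 0.0
```

Gradient descent on this objective alone happily heads for `W_collapsed`, which is the degenerate "constant representation" LeCun describes.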

### Future implications [15:31]

to look like. I think it's always important to look at what those who are considered some of the most skeptical about current AI think the future will be like, because their opinions are the least hyped, meaning this is potentially the most realistic look at the future we're going to get. So, if we succeed in doing this, we're going to have systems that really will mediate all of our interactions with the digital world. They will answer all of our questions, they will be with us a lot of the time, and they will basically constitute a repository of all human knowledge. This feels like an infrastructure kind of thing, like the internet, right? It's not like a product, it's more like an infrastructure. These AI platforms must be open source. I don't need to convince anybody from IBM here, because IBM and Meta are part of something called the AI Alliance, which promotes open-source AI platforms, and I really thank Dario for spearheading this, and everybody at IBM. We need those platforms to be open source because we need those AI systems to be diverse: we need them to understand all the languages in the world, all cultures, all value systems, and you're not going to get that out of a single assistant produced by a company on the West Coast or the East Coast of the US. This will have to be contributions from the entire world. Of course, it's very expensive to train foundation models, so only a few companies can do this; but if those companies, like Meta, can provide those base models in open source, then the entire world can fine-tune them for their own purposes. So that's the philosophy that Meta has adopted, and IBM as well. Open-source AI is not just a good idea; it's necessary for cultural diversity, perhaps even for the preservation of democracy. Training and fine-tuning will be crowdsourced, or will be done by the ecosystem of startups and

### AGI timeline [17:49]

other companies, and this is really what has jump-started the ecosystem of AI startups: the availability of those open-source AI models. How long is it going to take to reach human-level AI? I don't know; it could be years to decades. There's a huge variance, there are many problems to solve on the way, and it's almost certainly harder than we think. It's not going to happen in one day; it's going to be progressive evolution. It's not like one day we're going to discover the secret to AGI, turn on a machine, and immediately have superintelligence, and all of us get killed by an intelligent system; no, it's not happening this way. Machines will surpass human intelligence, but they will be under control, because they will be objective-driven: we give them goals and they fulfill those goals. It's just like many of us here, leaders in industry or academia or whatever, work with people who are smarter than us. I certainly do; there are a lot of people working with me who are smarter than me. It doesn't mean they want to dominate or take over, right? So, I think this talk was rather fascinating, because it talks about AGI and future intelligences and says that they are not right around the corner: they are going to be many years away, and much harder to build than we think. And I find that quite fascinating, considering that earlier this week we got an interview with Demis Hassabis saying that AGI is at least 10 years away, whilst other individuals at leading companies are saying that superintelligence or AGI is around two to

### Expert disagreement [19:25]

three years away. So I mean, we're living in probably one of the most uncertain times, considering the fact that the industry's experts are completely divided on these timelines. So with that being said, let me know what you guys think about the future of AI.
