# China DROPS AI BOMBSHELL: OpenAI Is WRONG!

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=ceIlHXeYVh8
- **Date:** 06.11.2024
- **Duration:** 24:37
- **Views:** 59,576

## Description

Prepare for AGI with me - https://www.skool.com/postagiprepardness 
🐤 Follow Me on Twitter https://twitter.com/TheAiGrid
🌐 Checkout My website - https://theaigrid.com/

0:00 Research Introduction
1:17 OpenAI Claims
2:32 Distribution Testing
3:48 Training Data
5:11 Model Behavior
6:45 Color Priority
8:15 Shape Transformation
9:44 Research Implications
11:23 Data Retrieval
13:16 Marcus Response
15:06 Pattern Matching
16:25 LeCun Theory
17:36 Architecture Differences
19:39 Meta Demonstration
20:40 Prediction Limits
22:26 V-JEPA Comparison
24:17 Final Thoughts
Links From Today's Video:
https://x.com/bingyikang/status/1853635009611219019

Welcome to my channel, where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos.

Was there anything I missed?

(For Business Enquiries)  contact@theaigrid.com

#LLM #Largelanguagemodel #chatgpt
#AI
#ArtificialIntelligence
#MachineLearning
#DeepLearning
#NeuralNetworks
#Robotics
#DataScience

## Contents

### [0:00](https://www.youtube.com/watch?v=ceIlHXeYVh8) Research Introduction

They're basically saying that OpenAI has got this world-model idea wrong, and you need to pay attention. There's a paper, of course, but the team also made a video asking how far video generation is from a world model, from a physical perspective, and it comes out of the company you can see here: ByteDance. The video starts by complimenting Sora: it says Sora can create incredibly realistic videos, and everybody knows what that system is capable of; it's capable of incredible stuff. But then they start to get a bit skeptical, and trust me, when you see what this research is saying, it's not just about Sora; it's about the underlying technology behind it and how it works. It says video generation models *seem*, with "seem" being the key word, to generate the world, but let's take a closer look. This next part just references what OpenAI put on their own blog about Sora, so nothing is taken out of context. Pay attention; OpenAI wrote: "Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world." That is what OpenAI themselves have said about Sora.

### [1:17](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=77s) OpenAI Claims

So they're basically stating that they may use Sora, or some variation of it, to get to AGI, because of course you need to be able to simulate the physical world if you want to act and exist in it. But things might not be as they seem, so let's look at what the research continues to state, because this video is super fascinating. The question they pose is: does Sora really understand physical law? Remember, it's not just about Sora. Sora is potentially a gateway towards AGI: if you have something that can generate the physical world convincingly, that suggests you really do have a world model, and a world model is essential for developing artificial general intelligence systems. Here they show a clip from the Sora model where it hallucinates and messes up: a glass just jumps up and does something random. And their answer is no: these are not world models; they do not simulate physical reality; something completely different is going on. But this is where things start to get interesting. They say: "We conduct a systematic study on this problem in synthetic scenes at scale." Things are going to get a little complicated for the next 60 seconds or so.

### [2:32](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=152s) Distribution Testing

Stay with me; trust me, it's worth it. They built a 2D physics simulation engine that generates the synthetic videos, which is simple enough to understand, and the video generation model is trained to predict future frames, much like traditional video models. Nothing out of the ordinary there: the model predicts the next frame, whatever happens next. Importantly, the simulator gives them unlimited data scaling, and it lets them generate data for different settings: you can change the speed, the velocity, and other properties. That's what lets them test cases that are inside the distribution versus outside of it. All "in distribution" means is that a test case resembles the training data; "out of distribution" means it doesn't. If models cannot perform well on out-of-distribution cases, it is quite likely they are unable to generalize well, and that is something we really need: if systems can only perform well in distribution, they are only performing retrieval. Think about it like this: you can drive your car to another state even if you haven't seen those roads before.
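The in-distribution versus out-of-distribution split can be made concrete with a toy version of the paper's setup. This is a minimal sketch I wrote for illustration, not the authors' code: the 1D "simulator" and the speed range [2.0, 4.0] are invented for the example.

```python
import numpy as np

def simulate_ball(x0, v, n_frames, dt=0.1):
    """1D constant-velocity trajectory: a stand-in for the synthetic
    'videos' the paper's 2D physics engine generates."""
    return np.array([x0 + v * dt * t for t in range(n_frames)])

# Hypothetical training distribution: speeds drawn from [2.0, 4.0].
TRAIN_LO, TRAIN_HI = 2.0, 4.0

def is_in_distribution(v):
    """A test case is in-distribution if its speed falls inside the training range."""
    return TRAIN_LO <= v <= TRAIN_HI

print(is_in_distribution(3.0))  # True: speeds like this were seen during training
print(is_in_distribution(0.5))  # False: a low speed is out of distribution
```

The key point of the paper's protocol is that because the simulator is exact and cheap, they can generate unlimited held-out cases on either side of this boundary and measure the error gap directly.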

### [3:48](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=228s) Training Data

AI systems, by contrast, really struggle with such tasks because they are out of distribution, and that is why this kind of testing is essential. Now that you've got the concept, look at the setup: they have a grid of combinatorial scenarios, train on some of them, and hold the rest out as tests. Here are their observations, remembering that we have in-distribution cases, out-of-distribution cases, and combinations of both. In distribution, they report essentially perfect recall, because of course this is material the model was already trained on: if a model has been trained on millions of clips of, say, cars flying, then prompting for cars flying will reproduce them, because the model can recall that perfectly. However, on out-of-distribution cases the model does not perform well, and even on combinations we see a scaling behavior that is a little different. So let's look at exactly what they found; once you see how this works, you'll understand why these failure modes are truly significant for the development of AI. They write: "We observed many interesting failure modes." They compare the real simulation with the generated one, and at first they show these clips without any context, so you can't really tell what's going on, but as soon as they explain, it becomes clear why these models fail to generalize out of distribution.

### [5:11](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=311s) Model Behavior

The key message is that the model has near-perfect in-distribution generalization but fails to generalize in out-of-distribution scenarios. In distribution, the errors are tiny; out of distribution, when you prompt for things the model hasn't seen before, the error is large and random, which is not good at all. That is not what you want to see from the kind of architecture most companies are using to try to reach AGI. Of course, OpenAI could be working on something different internally, but based on public information, things aren't looking good so far. They also trained the model on scenes combining multiple objects and procedures, and you can compare the real simulation with the generated one, which does look really cool. Now look at the scaling behavior for these combinations: they report that with larger models and broader data coverage, the ratio of physically plausible videos increases significantly, which makes sense. But here is the part that got me. When I saw it, I thought: maybe these systems aren't as smart as they seem, and some of the people who are genuinely skeptical about AI have a real point; every month more information comes out that lends them credibility. The paper says that videos are generated by case-based retrieval and replay, not by simulating dynamics.
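The "small in-distribution error, large out-of-distribution error" finding can be illustrated with a toy metric. The trajectories below are made up for the example; this is not the paper's evaluation code, just the shape of the comparison.

```python
import numpy as np

def per_frame_error(real, generated):
    """Mean L2 distance between real and generated object positions across frames."""
    real, generated = np.asarray(real), np.asarray(generated)
    return float(np.linalg.norm(real - generated, axis=-1).mean())

# Ground-truth 2D positions per frame: a ball moving right.
real = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])

gen_id = real + 0.01                                        # in-distribution: tiny error
gen_ood = np.array([[0.0, 0.0], [-1.0, 0.0], [-2.0, 0.0]])  # OOD: the ball drifts the wrong way

print(per_frame_error(real, gen_id))   # small
print(per_frame_error(real, gen_ood))  # large: the generated ball went left instead of right
```

Because the ground truth comes from an exact simulator, this kind of per-frame error can be computed for every generated video, which is what lets the authors quantify the gap rather than eyeball it.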

### [6:45](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=405s) Color Priority

I'm going to say that one more time: videos are generated by case-based retrieval and replay, not by simulating dynamics. That means these models don't understand the actual physics; all they are doing is retrieving the content they have been trained on. And if the researchers' conclusions are accurate, these AI systems are not as capable as we think, which has severe implications: if these videos come from retrieval rather than simulation, we would need an entirely new architecture to get to AGI. If you don't believe what they're stating, there's one example that might just blow your mind. They trained the model on scenes where the ball goes left and right at fast speeds, then tested it with the ball moving at low speed towards the right. Instead of continuing towards the right, the generated ball goes towards the left, which is absolutely insane, because the model only ever saw what was in the training data. This happens because the model recalls cases from its dataset, and the dataset contains cases where the ball is going towards the left.

### [8:15](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=495s) Shape Transformation

Now, that example is a little confusing, but they have a better one, and it shows that the video is generated by referring to similar training data, not by physical rules, even when the training data came from an accurate world simulator: whatever they used to generate the physics was accurate, but a model having seen accurate physics in training doesn't mean it fundamentally understands what is going on. Here's the other example you need to see. They write that they even discovered the generative model's preference order over different attributes, and this is the point where it clicks that maybe these models are just advanced retrieval mechanisms rather than systems that understand. When doing retrieval, the model has an internal priority over different attributes: it looks first at color, then size, then velocity, then shape; it doesn't really consult the physics at all. So here's the example that blew my mind. In the training data, all the circles are red while all the squares are blue; that is the only training data the model has.

### [9:44](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=584s) Research Implications

Very simple, very easy to follow. So let's look at what happened when they tested the model. The test starts with a red square, which of course is something it never saw in training: the only red object it saw was a circle, and the only square it saw was blue. So what happens with a red square? It actually turns into a circle, which is absolutely insane. They trained the model on the physics of a red circle and a blue square, but the training data favors the color of the object so strongly that when the model sees a red square moving from left to right, it transforms it into a circle. It is not predicting the physics of an object moving left to right; it is predicting that "a red thing is moving left to right, and red things are circles." It is focused on the color of the object, not the physics, and this is partly why we get hallucinations; that actually makes sense now. This is an out-of-distribution case, something the model never saw before, and you can see exactly what happens. As the paper puts it, in retrieval the color determines the shape instead of the opposite, which has remarkable implications for other video models built on similar architectures. I'm not going to lie: this is pretty incredible research, because it shows things we may not have known before. And here is the takeaway they state: the video generation model behaves well only in distribution, that is, only on things that resemble its training data.
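The reported preference order (color, then size, then velocity, then shape) can be mimicked with a toy nearest-case lookup. The weights and case format below are invented for illustration; the real model does nothing this explicit, but the resulting failure mode looks the same: the red square retrieves the red circle.

```python
# Hypothetical priority weights: color outranks size, which outranks velocity, then shape.
PRIORITY = {"color": 8, "size": 4, "velocity": 2, "shape": 1}

def retrieve(query, memory):
    """Return the stored training case that best matches the query,
    with matches on high-priority attributes counting for more."""
    def score(case):
        return sum(w for attr, w in PRIORITY.items() if case[attr] == query[attr])
    return max(memory, key=score)

# Training data: all circles are red, all squares are blue.
memory = [
    {"color": "red",  "size": "small", "velocity": "fast", "shape": "circle"},
    {"color": "blue", "size": "small", "velocity": "fast", "shape": "square"},
]

# Test query: a red square, never seen in training.
query = {"color": "red", "size": "small", "velocity": "fast", "shape": "square"}
best = retrieve(query, memory)
print(best["shape"])  # circle: the color match (weight 8) beats the shape match (weight 1)
```

Under this scoring, the red circle wins 14 to 7, so the "square" in the prompt gets replayed as a circle, exactly the transformation the paper shows in its generated videos.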

### [11:23](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=683s) Data Retrieval

Out-of-distribution performance is poor, which means that if you can't get good results out of a text-to-video model, the reason is probably not your prompting strategy; it's probably that the model simply hasn't been trained on anything like what you're asking for. They also say that combinatorial generalization is possible with scaling, and that generation happens mostly via case-based retrieval and combination, not via understanding of physical law. Basically, if you want these models to actually obey physical law, you cannot use this architecture at all, because that is not what it is doing: it is retrieving. This is completely incredible, because it has a series of knock-on effects. They have the full paper, of course, and the video was super helpful because it summarizes the findings, but the upshot is that OpenAI's claim, that scaling video generation models is a promising path towards building general-purpose simulators of the physical world, apparently is not the truth, and the authors think it is due to the generative architecture. Maybe I'm out of my depth even saying that, but this video is a complete bombshell, because so many people have been critical of the entire generative landscape, and this is another result in their favor. Even crazier: someone on Twitter asked the authors whether they had found a solution to this problem, and they responded that unfortunately they have not, and that solving it is probably the mission of the entire AI community, which is absolutely insane. Of course, some AI critics have commented on this, and what would a video like this be without Gary Marcus? He says this paper confirms that deep learning is hitting a wall.

### [13:16](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=796s) Marcus Response

Gary Marcus has been saying that for quite some time, and here he says: this is exactly what I've been telling you about AI since 1998; the difference between generalization within and outside the training space is everything, and until we solve this we won't get to AGI. So basically he's saying that if we don't have a system that can generalize to out-of-distribution tasks, we are not going to get to AGI, because AGI is precisely a system that can generalize to tasks outside its training distribution. Of course, you can always put more data in, and you could use unlimited synthetic data, but the point is that when you're trying to replicate human intelligence, that intelligence comes from the human ability to handle things we haven't seen before, not from failing to grasp basic physical law. Think about it with the example from the research: all circles are red and all squares are blue. If you, as a human, were shown a red square moving from left to right and asked to predict what happens next, your prediction would not be that it turns into a circle. You would never think that an object moving left or right turns into a different object. But this system does think that, which suggests, as the research says, that these systems aren't learning the dynamics; they are doing case-based retrieval, which is rather fascinating.

### [15:06](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=906s) Pattern Matching

Now, remember what I talked about two or three weeks ago: I did a video on the Apple paper, where the authors said they found no evidence of formal reasoning in language models; the behavior is better explained by sophisticated pattern matching, so fragile that simply changing the names in a problem can alter results by around 10%. I don't know about you, but if you were studying for a test and they changed the names in the questions, I don't think your results would drop that significantly. Obviously o1 performs a lot better, but we need systems that are effective on tasks they haven't seen before, and changing names really shouldn't cause that much of a drop. Ten percent is significant when you consider how models change from iteration to iteration: a lot of AI companies are happy to get a 10% improvement, so if you can change the names on a benchmark and performance drops 10%, that's pretty bad. Now, one person I wanted to talk about is Yann LeCun, because I'm guessing he's feeling pretty vindicated right now: this research backs up the kind of thing he's been saying for quite some time. I may include a video clip; I'm not sure how many people watch the videos I post on Yann LeCun, because he is unfortunately seen as an AI skeptic just because he doesn't focus on LLMs and generative AI. He focuses on an entirely different approach called objective-driven AI.

### [16:25](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=985s) LeCun Theory

Basically, what he says is that objective-driven AI may actually reach human-level intelligence one day. When you look at what objective-driven AI is, he says it is AI built on world models, using joint embedding predictive architectures, which are not generative, and that with this you can get systems that understand the physical world, have persistent memory, can reason, and can perhaps plan hierarchically. This is LeCun's entire theory; it is what he is working on at Meta, and it is his big bet. I may include a clip; there was an entire 30-minute talk, and I did a long video on LeCun's objective-driven AI, which he argues is the architecture that will eventually amount to artificial general intelligence. It is quite different from current standard LLMs, and even quite different from the o1 reasoning models, since it is an entirely new system. I'll give a simplified breakdown, because LeCun talks about this for ten-plus minutes. Basically, instead of just reacting to data the way current LLMs do, responding based on patterns, objective-driven AI works more like a thinking process.

### [17:36](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=1056s) Architecture Differences

It would allow the AI to imagine different possible future scenarios and make plans based on them. The reason this is important is that the goal is to move beyond AI that can only perform specific tasks, like predicting the next word in a sentence, towards AI that can figure out how to achieve goals in new situations, even ones it has never faced before, and that is something current AI has a really hard time doing. Here is how objective-driven AI works: the AI has a world model, essentially an internal representation of how the world works; it combines this world model with goals and objectives; and then it optimizes its actions to achieve those goals while respecting constraints, like avoiding danger. Instead of going through preset actions like following a script, it can adjust and adapt based on what it learns or what changes in the environment, which is much closer to how humans plan. I've added a graphic, generated with Google Gemini, showing the key differences between LLMs and objective-driven AI; it's a useful one to screenshot because it simplifies the comparison. Next we have the V-JEPA architecture, which Meta open-sourced earlier this year, around February. Meta is openly trying to build on it with the open-source community, and it is still in development, but the goal is a system that can learn to predict things as efficiently as humans do: humans don't have to do something millions of times to get it right; they can do it a few times and implicitly understand what is going on, and that is what V-JEPA aims at. I'm going to play you a short video by Meta that gives a simple overview, and then you'll hear Yann LeCun talk about why generative architectures don't work for predicting certain things. It's really interesting, because I think the space needs this kind of input: once we start to criticize ideas, that's how we actually get to improvements.

### [19:39](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=1179s) Meta Demonstration

Machines require thousands of examples and hours of training to learn a single concept. The goal with JEPA, which stands for joint embedding predictive architecture, is to create highly intelligent machines that can learn as efficiently as humans. V-JEPA is pre-trained on video data, allowing it to efficiently learn concepts about the physical world, similar to how a baby learns by observing its parents. It is able to learn new concepts and solve new tasks using only a few examples, without full fine-tuning. V-JEPA is a non-generative model that learns by predicting missing or masked parts of a video in an abstract representation space. Unlike generative approaches that try to fill in every missing pixel, V-JEPA has the flexibility to discard irrelevant information, which leads to more efficient training. To allow our fellow researchers to build upon this work, we are publicly releasing V-JEPA. We believe this work is another important step in the journey towards AI that is able to understand the world, plan, reason, predict, and accomplish complex tasks.

### [20:40](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=1240s) Prediction Limits

You cannot predict exactly which word is going to follow a sequence of words, but you can produce a probability distribution over all possible words in the dictionary. When it's video frames, though, we do not have a good way to represent probability distributions over frames, and in fact the task is essentially impossible. If I take a camera, shoot part of this room, stop the video, and ask the system to predict what comes next, it might predict the broad strokes: the rest of the room, a wall at some point, people sitting, an object density similar to what it already saw. But it cannot possibly predict, at the pixel level, what all of you look like, the texture of the world, the precise size of the room, and so on; there is no way to predict all those details accurately. The solution is what I call joint embedding predictive architectures. The idea is to give up on predicting pixels: instead, learn an abstract representation of what goes on in the world and predict in that representation space. The architecture takes x, the corrupted version, and runs it through an encoder; takes y and runs it through an encoder; and then trains the system to predict the representation of y from the representation of x. Now the question is how you do this, because if you just train such a system with gradient descent and backpropagation to minimize the prediction error, it is going to collapse: it will learn a representation that is constant.
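The predict-in-representation-space idea can be sketched with toy linear encoders. Everything here, the dimensions, the random fixed weights, the shared encoder, is invented for illustration; a real JEPA trains these components and needs extra machinery (for example an EMA target encoder or variance regularization) precisely to avoid the collapse LeCun describes.

```python
import numpy as np

rng = np.random.default_rng(0)

D_PIX, D_REP = 16, 4
enc = rng.normal(size=(D_REP, D_PIX))   # encoder: pixel space -> abstract representation
pred = rng.normal(size=(D_REP, D_REP))  # predictor operating purely in representation space

x = rng.normal(size=D_PIX)              # corrupted / context view
y = x + 0.01 * rng.normal(size=D_PIX)   # target view (e.g. the unmasked future frame)

s_x, s_y = enc @ x, enc @ y
loss = float(np.sum((pred @ s_x - s_y) ** 2))  # error measured between representations, never pixels

# Collapse: if enc mapped every input to the same vector, this loss would be
# trivially near zero while s_y carried no information about y at all.
print(loss)
```

The contrast with generative models is in that one line: the loss never touches `x` or `y` directly, so unpredictable pixel-level detail can simply be discarded by the encoder instead of being reconstructed.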

### [22:26](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=1346s) V-JEPA Comparison

A constant representation becomes super easy to predict, but it is not informative. So that is the difference I want you to remember: generative architectures try to reconstruct the input (autoencoders, masked autoencoders, and so on), while joint embedding architectures make predictions in representation space. I think the future is in those joint embedding architectures. We have tons of empirical evidence that the best way to learn good representations of images is to use joint embedding architectures; all attempts at learning image representations through reconstruction work poorly. There were huge projects built on this, and claims that they work, but they really don't; the best performance comes from joint embedding methods. For those of you who want a quick comparison between V-JEPA, Meta's approach, and the popular generative approaches: V-JEPA offers efficient learning, reportedly using 1.5 to 6 times less training data thanks to its abstract understanding; it can learn new tasks without retraining the entire model; and it excels at detecting detailed object interactions. One current limitation is that it only works well with short videos, up to around 10 seconds. Physics-based video generation, by comparison, can produce full video predictions, but it has the limitations we've discussed: it tends to memorize rather than understand physics, it struggles with scenarios outside the training distribution, and it requires retraining. I'm not going to lie: this is definitely one of the craziest videos I've made in quite some time, and this research is truly impactful, because it is basically saying that if we actually want AGI and real physical-world video simulators, we are going to need an entirely different approach, since the current one just relies on training data to predict the future.

### [24:17](https://www.youtube.com/watch?v=ceIlHXeYVh8&t=1457s) Final Thoughts

So clearly they're basically saying that OpenAI's Sora is good, but its architectural approach is wrong if you're actually trying to build a world simulator, which is a pretty crazy conclusion to come to. But I'm not going to lie: after seeing the research, I can completely understand it, and I think these researchers are on to something. With that being said, let me know what you guys think. I do apologize for my...

---
*Source: https://ekstraktznaniy.ru/video/13807*