Meta's AI Boss Says He's DONE With LLMs...
16:39


TheAIGRID · 16.04.2025 · 309,839 views · 4,762 likes


Video description
Join my AI Academy - https://www.skool.com/postagiprepardness 🐤 Follow Me on Twitter https://twitter.com/TheAiGrid 🌐 Checkout My website - https://theaigrid.com/ Links From Today's Video: https://www.youtube.com/watch?v=eyrDM3A_YFc&t=1810s Welcome to my channel where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos. Was there anything I missed? (For Business Enquiries) contact@theaigrid.com Music Used LEMMiNO - Cipher https://www.youtube.com/watch?v=b0q5PR1xpA0 CC BY-SA 4.0 LEMMiNO - Encounters https://www.youtube.com/watch?v=xdwWCl_5x2s #LLM #Largelanguagemodel #chatgpt #AI #ArtificialIntelligence #MachineLearning #DeepLearning #NeuralNetworks #Robotics #DataScience

Study guide for this video

Structured summary

Master the architecture of the future: how to move beyond LLMs and build World Models

An examination of the limitations of language models and an introduction to the Joint Embedding Predictive Architecture (JEPA) for building systems with genuine understanding of the physical world. For ML engineers and AI researchers; about 16 minutes of in-depth analysis.

Table of contents (4 segments)

Segment 1 (00:00 - 05:00)

But I'll tell you one thing which may surprise a few of you. Um, I'm not so interested in LLMs anymore. That was one of the statements that Yann LeCun made at the Nvidia GTC 2025. And I think this clip has done the numbers on Twitter because, honestly, we do know that right now in the AI space, LLMs are definitely receiving most of the hype. Now, of course, if you aren't familiar with who Yann LeCun is, he is one of the godfathers of AI research, and he's been in the field for quite some time. This isn't his first rodeo when he makes statements like this; it's based on years and years of knowledge and expertise. So, when someone with this level of expertise says something like this, it leaves a lot of people wondering: is he actually right? Take a listen to the further conversation, where he talks about his four main focuses. So, um, Yann, there have been a lot of interesting things going on in the last year in AI. What has been the most exciting development, in your opinion, over the past year? Uh, too many to count, but I'll tell you one thing which may surprise a few of you. Um, I'm not so interested in LLMs anymore.
You know, they're kind of the last thing. They are in the hands of industry product people, kind of improving at the margin, trying to get more data, more compute, generating synthetic data. I think there are more interesting questions in four things: how you get machines to understand the physical world, and Jensen talked about this in his keynote this morning; how do you get them to have persistent memory, which not too many people talk about; and then the last two are how do you get them to reason and plan. And there is some effort, of course, to get LLMs to reason, but in my opinion it's a very simplistic way of viewing reasoning. I think there are probably better ways of doing this. So I'm excited about things that a lot of people in this tech community might get excited about five years from now, but right now it doesn't look so exciting because it's some obscure academic paper. So now this is where Yann LeCun actually talks about world models and the fact that text being the only world model you have isn't sufficient to get to AGI. He says that next-token prediction is something that, yes, works well for text, but it doesn't really work that well when it comes to actually doing the things in the real physical world that humans do. But if it's not an LLM that's reasoning about the physical world, having persistent memory and planning, what is it? What is the underlying model going to be? So, a lot of people are working on world models. So what is a world model? We all have world models in our minds. This is what allows us to manipulate thoughts, essentially. So we have a model of the current world. You know that if I push on this bottle here from the top, it's probably going to flip, but if I push on it at the bottom, it's going to slide.
Um, and you know, if I press on it too hard, it might pop. So we have models of the physical world that we acquire in the first few months of life, and that's what allows us to deal with the real world, which is much more difficult than dealing with language. And so the type of architecture that I think we need for systems that really can deal with the real world is completely different from the ones we deal with at the moment. Right, predict tokens, but tokens could be anything. I mean, our autonomous vehicle model uses tokens from the sensors and produces tokens that drive, and in some sense it's reasoning about the physical world, at least about where it's safe to drive and where you won't run into poles. So why aren't tokens the right way to represent the physical world? Tokens are discrete. Okay, so when we talk about tokens, generally we talk about a finite set of possibilities. In a typical LLM, the number of possible tokens is on the order of 100,000 or something like that. So when you train a system to predict tokens, you can never train it to predict the exact token that's going to follow a sequence in text, for example, but you can produce a probability distribution over all the possible tokens in your dictionary. It's just a long vector of 100,000 numbers between zero and one that sum to one. We know how to do this. We don't know how to deal with natural data that is high-dimensional and continuous, and every attempt at trying to get a system to understand the world, or build mental models of the world, by being trained to predict videos at the pixel level

Segment 2 (05:00 - 10:00)

has basically failed. Even to train a system, like a neural net of some kind, to learn good representations of images, every technique that works by reconstructing an image from a corrupted or transformed version of it has basically failed. Not completely failed; they kind of work, but they don't work as well as alternative architectures that we call joint embedding, which essentially don't attempt to reconstruct at the pixel level. They try to learn an abstract representation of the image, or the video, or whatever natural signal is being trained on, so that you can make predictions in that abstract representation space. The example I use very often is that if I take a video of this room, pan the camera, stop here, and ask the system to predict the continuation of that video, it's probably going to predict that it's a room and there are people sitting, blah blah. There's no way it can predict what every single one of you looks like, right? That's completely unpredictable from the initial segment of the video. And so there's a lot in the world that is just not predictable. And if you train a system to predict at a pixel level, it spends all of its resources trying to come up with details that it just cannot invent. And so that's just a complete waste of resources. And every attempt that we've tried (and I've been working on this for 20 years) of training a system using self-supervised learning by predicting video doesn't work. It only works if you do it at the representation level. And what that means is that those architectures are not generative. So you're basically saying that a transformer doesn't... so basically what he's stating here is that using a transformer to predict the physical world just doesn't work because of the architecture. And he actually does make some key points.
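The token-prediction setup LeCun described earlier, a long vector of ~100,000 numbers between zero and one that sum to one, is exactly what a softmax over a model's output logits produces. A minimal NumPy sketch (the vocabulary size is illustrative, not any particular model's):

```python
import numpy as np

# An LLM's output layer turns raw scores (logits) into a probability
# distribution over a finite vocabulary. VOCAB_SIZE is illustrative.
VOCAB_SIZE = 100_000

def softmax(logits):
    z = logits - logits.max()   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=VOCAB_SIZE)   # stand-in for a model's outputs
probs = softmax(logits)

print(probs.shape)   # (100000,)
print(probs.sum())   # 1.0 (up to floating-point error)
```

This works because the vocabulary is finite and discrete; as LeCun notes, there is no analogous trick for high-dimensional continuous data like video.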
You know, if you're just predicting the next token, there are a lot of things that are just implicit in your understanding of the physical world, in you being there, and in all of the reasoning that goes on in your brain that you really just take for granted. So I do think he makes a good point there. Now, of course, like I said before, I'm not just agreeing with what Yann LeCun says here; there have been research papers. I've done a video on one where researchers in China did a lot of work on Sora and basically argued that these kinds of architectures don't really predict the physical world. I can't remember exactly what they did, but in the paper they essentially showed that these video models aren't really predicting the physical world; they're more sort of mimicking the world based on the architecture. It was really interesting to see that deep dive; honestly, if you watch the video, you'll understand it a little more in depth. Of course, it's very easy to say this doesn't work and that doesn't work, but now we have to get to the crux of the video: all right, we know it doesn't work, but what is the solution? And this is where Yann LeCun talks about his famous V-JEPA architecture. Apparently, they're actually coming out with version two very soon, and this one is probably showing the most promising results of any model so far. You know, Jensen is absolutely right that you ultimately get more power in a system that can reason. I disagree with the idea that the proper way to do reasoning is the way current LLMs that are augmented with reasoning abilities do it. You're saying it works, but it's not the right way. I think when we reason, when we think, we do this in some sort of abstract mental state that has nothing to do with language. You don't like kicking the tokens out.
You want to be reasoning in your latent space, in abstract space, right? I mean, if I tell you: imagine a cube floating in front of you, and now rotate that cube by 90 degrees around a vertical axis. Okay, you can do this mentally. It has nothing to do with language. You know, a cat could do this. We can't specify the problem to a cat through language, obviously, but cats do things that are much more complex than this when they plan trajectories to jump onto a piece of furniture, right? They do things that are much more complex than that, and that is not related to language. It's certainly not done in token space, which would be kind of actions; it's done in a sort of abstract mental space. So that's kind of the challenge of the next few years: to figure out new architectures that allow this type of thing. That's what I've been working on for the last few years. So, is there a new model we should be expecting that allows us to do reasoning in this abstract space? Uh, it's called JEPA, or JEPA world models, and my colleagues and I have put out a bunch of papers on first steps towards this over the last few years. So JEPA means Joint Embedding Predictive Architecture. These are world models that learn abstract representations and are capable of manipulating those representations, and perhaps reasoning and producing sequences of actions to arrive at a particular goal. I think that's the future. I wrote a long paper about this, explaining how it might work, about three years ago. Let's actually take a look at what that V-JEPA architecture looks like, from the video Meta released last year.
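The core idea LeCun describes, predicting in an abstract representation space rather than in pixel or token space, can be sketched with toy linear encoders. Everything here (names, shapes, the linear maps) is our illustration of the concept, not Meta's actual JEPA code:

```python
import numpy as np

# Toy joint-embedding predictive setup: encode context and target
# separately, then train a predictor to match the target's EMBEDDING,
# never its raw pixels. Unpredictable pixel detail is simply absent
# from the abstract representation, so no capacity is wasted on it.

rng = np.random.default_rng(1)
D_IN, D_REP = 64, 16                    # input dim, representation dim

W_ctx = rng.normal(scale=0.1, size=(D_IN, D_REP))    # context encoder
W_tgt = rng.normal(scale=0.1, size=(D_IN, D_REP))    # target encoder
W_pred = rng.normal(scale=0.1, size=(D_REP, D_REP))  # latent predictor

def jepa_loss(context, target):
    s_ctx = context @ W_ctx    # abstract representation of the context
    s_tgt = target @ W_tgt     # abstract representation of the target
    s_hat = s_ctx @ W_pred     # prediction made in representation space
    return float(np.mean((s_hat - s_tgt) ** 2))  # loss in latent space

x_ctx = rng.normal(size=D_IN)   # e.g. visible patches of a frame
x_tgt = rng.normal(size=D_IN)   # e.g. the masked patches
print(jepa_loss(x_ctx, x_tgt))  # scalar latent-space prediction error
```

Contrast this with a generative loss, which would compare a reconstruction against `x_tgt` itself, pixel by pixel.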

Segment 3 (10:00 - 15:00)

Today, machines require thousands of examples and hours of training to learn a single concept. The goal with JEPAs, which means joint embedding predictive architectures, is to create highly intelligent machines that can learn as efficiently as humans. V-JEPA is pre-trained on video data, allowing it to efficiently learn concepts about the physical world, similar to how a baby learns by observing its parents. It's able to learn new concepts and solve new tasks using only a few examples, without full fine-tuning. V-JEPA is a non-generative model that learns by predicting missing or masked parts of a video in an abstract representation space. Unlike generative approaches that try to fill in every missing pixel, V-JEPA has the flexibility to discard irrelevant information, which leads to more efficient training. To allow our fellow researchers to build upon this work, we're publicly releasing V-JEPA. We believe this work is another important step in the journey towards AI that's able to understand the world, plan, reason, predict, and accomplish complex tasks. The alternative that we have now is a project called V-JEPA, and we are getting close to version two. Basically, it's one of those joint embedding predictive architectures. So it does prediction on video, but at the representation level, and it seems to work really well. We have an example of this. The first version is trained on very short videos, just 16 frames, and it's trained to predict the representation of the full video from a partially masked version of it. And that system, apparently, is able to tell you whether a particular video is physically possible or not, at least in restricted cases. And it gives you a binary output: this is feasible, this is not. Or maybe, well, no, it's simpler than that: you measure the prediction error that the system produces.
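The prediction-error scoring just mentioned (LeCun goes on to describe a 16-frame sliding window) can be sketched as follows. This is our toy illustration, not V-JEPA code; `predict_next` is a naive stand-in for a learned predictor, and the "frames" are small vectors standing in for frame representations:

```python
import numpy as np

# Slide a window over frame representations, predict the next frame's
# representation, and flag the frame where prediction error spikes
# (e.g. an object disappearing or teleporting).

def predict_next(window):
    # Naive stand-in predictor: assume the representation persists.
    return window[-1]

def prediction_errors(frames, window=16):
    errs = []
    for t in range(window, len(frames)):
        pred = predict_next(frames[t - window:t])
        errs.append(float(np.mean((frames[t] - pred) ** 2)))
    return errs

# A smooth, "physically plausible" sequence with one sudden jump.
frames = [np.full(8, 0.01 * t) for t in range(40)]
frames[30] = frames[30] + 5.0          # the anomalous frame
errs = prediction_errors(frames)
anomaly = int(np.argmax(errs)) + 16    # frame index with largest error
print(anomaly)  # → 30
```

The key point from the conversation: this only works because the model was trained on natural videos, so "weird" events produce large errors rather than looking normal.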
So you take a sliding window of those 16 frames over a video, and you look at whether you can predict the next few frames, and you measure the prediction error. And when something really strange happens in the video, like an object disappears or changes shape, or something spontaneously appears, or doesn't obey physics, the error spikes. So it has learned what's physically realistic just by observing videos. Yeah, you train it on natural videos and then you test it on synthetic videos where something really weird happens, right? Because if you trained it on videos where really weird things happen, that would become normal and it wouldn't, yeah, that's right, detect those as being odd. So you don't do that. This is where Yann LeCun talks about system one and system two thinking. As humans, we have two modes of thinking: system one is pretty much reactive, whereas system two is where we think about things for a longer time. This is a paradigm that LLMs have only recently got to, and it's what Yann LeCun means when he says AI systems are intuitively missing some of these capabilities, capabilities we need for a comprehensive system that can somehow get to general AGI. And it connects with something that we're all very familiar with, right? Psychologists talk about system one and system two. System one is tasks that you can accomplish without really thinking about them; you've become used to them and you can do them without thinking too much. So if you are an experienced driver, you can drive, even without driving assistance, without thinking about it much; you can talk to someone at the same time, etc.
But if you're driving for the first time, or for the first few hours you're at the wheel, you have to really focus on what you're doing, right? You're planning for all kinds of catastrophe scenarios and imagining all kinds of things. So that's system two. You're recruiting your entire prefrontal cortex, your internal world model, to figure out what's going to happen, and then you plan actions so that good things happen. Whereas when you're familiar with the task, you can just use system one and do it automatically. And so the idea is that you start by using your world model, and you're able to accomplish a task, even one you've never encountered before, zero-shot, right? You don't have to be trained to solve that task; you can accomplish it without learning anything, just on the basis of your understanding of the world and your planning abilities. That's what's missing in current systems. But if you accomplish that task multiple times, eventually it gets compiled into what's called a policy, a sort of reactive system that allows you to accomplish the task without planning. So this deliberate reasoning is system two, and the automatic, subconscious, reactive policy is system one. LLMs can do system one and are inching their way towards system two, but ultimately I think we need a different architecture for system two. So this is where we get to Yann LeCun's further statements, where he talks about the fact that, you know, we're simply just not

Segment 4 (15:00 - 16:00)

going to get to AGI via LLMs. And I do somewhat agree. I think that in the future, the systems that really are general AI will probably be a hybrid of some sort; they'll be a mixture of all of those capabilities, and we've actually seen AI companies move towards omni-models. We've seen Google doing that recently. So it's really interesting to see him talk about this, because I don't think he's that far off the mark, and it's going to be really interesting to see where the future goes. Uh, but the real world is just much more complicated. Like, okay, here's something that some of you may have heard me say in the past. Current LLMs are typically trained on something on the order of 30 trillion tokens, right? A token is typically about three bytes, so that's 0.9 × 10^14 bytes; let's say 10^14 bytes. It would take any of us over 400,000 years to read through that, because it's kind of the totality of all the text available on the internet. Now, psychologists tell us that a four-year-old has been awake a total of 16,000 hours. And we have about 2 megabytes going to our visual cortex through the optic nerve every second, 2 megabytes per second roughly. Multiply that by 16,000 hours times 3,600 seconds, and it's about 10^14 bytes. In four years, through vision, you see as much data as text that would take you 400,000 years to read. I mean, that tells you we're never going to get to AGI, whatever you mean by that, by just training on text. It's just not happening.
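LeCun's back-of-the-envelope comparison checks out arithmetically. All inputs below are his quoted estimates, not measurements:

```python
# Verify the text-vs-vision data comparison from the talk.

tokens = 30e12             # ~30 trillion training tokens (quoted)
bytes_per_token = 3        # ~3 bytes per token (quoted)
text_bytes = tokens * bytes_per_token            # 0.9 × 10^14 bytes

awake_hours = 16_000       # a 4-year-old's total waking hours (quoted)
optic_bytes_per_s = 2e6    # ~2 MB/s through the optic nerve (quoted)
vision_bytes = awake_hours * 3600 * optic_bytes_per_s  # ~1.15 × 10^14

print(f"text:   {text_bytes:.2e} bytes")    # 9.00e+13
print(f"vision: {vision_bytes:.2e} bytes")  # 1.15e+14
```

So a four-year-old's visual input is on the same order of magnitude (10^14 bytes) as the entire text corpus used to train a frontier LLM, which is the basis of his claim.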
