A New Kind of AI Is Emerging And Its Better Than LLMS?

10:25

TheAIGRID 29.12.2025 456,571 views 13,929 likes

Video description

Checkout my newsletter : - https://aigrid.beehiiv.com/subscribe 🐤 Follow Me on Twitter https://twitter.com/TheAiGrid 🌐 Learn AI With Me : https://www.skool.com/postagiprepardness/about Links From Todays Video: https://arxiv.org/pdf/2512.10942 Welcome to my channel where i bring you the latest breakthroughs in AI. From deep learning to robotics, i cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos. Was there anything i missed? (For Business Enquiries) contact@theaigrid.com Music Used LEMMiNO - Cipher https://www.youtube.com/watch?v=b0q5PR1xpA0 CC BY-SA 4.0 LEMMiNO - Encounters https://www.youtube.com/watch?v=xdwWCl_5x2s #LLM #Largelanguagemodel #chatgpt #AI #ArtificialIntelligence #MachineLearning #DeepLearning #NeuralNetworks #Robotics #DataScience

Contents (3 segments)

Segment 1 (00:00 - 05:00)

So Meta's AI chief released a new paper. Is this the beginning of the end for LLMs? Let's talk about it. Most of you know that Meta's chief AI scientist, Yann LeCun, has reportedly left Meta, or is leaving Meta, to build his own AI startup. But before that, he put out a really interesting paper that I want to talk about. The paper, which he wrote with a group of researchers from Meta, is called VL-JEPA. It's a vision-language model built on the joint embedding predictive architecture, which is JEPA, and I guess you could say it's an extension of the V-JEPA architecture.

This is really cool because it comes from Meta's FAIR lab, and of course Yann LeCun is the one leading this. The super interesting thing I found is that, unlike models like ChatGPT that generate answers word by word, VL-JEPA does something completely different. It's a non-generative model: it predicts meaning directly, not via text. The model builds an internal understanding of what it sees, images and video, and then converts that understanding into words only if needed. Because it learns in a semantic space instead of token space, it's faster, more efficient, and uses about half the parameters of traditional vision-language models while often performing better. And what this means for robotics and agents is super crazy. So let's get into it.

One of the things I wanted to point out, to show you how different this architecture is, is that the paper describes it as a non-generative system. A generative model like ChatGPT or GPT-4 produces tokens, or words, one at a time, going from left to right, and every output must be fully written out to exist.
So to answer what's happening in this video, a generative model is going to say: okay, I'm going to decide the first word, then the second, then the third, until it finishes the entire sentence. It literally can't know the final answer until it finishes generating it, which is very slow and very painful. What a non-generative system means here is that it does not need to talk to think. VL-JEPA does not generate words by default. It doesn't predict the next token. It doesn't need sentences to exist. Instead, it predicts a meaning vector directly.

So think of the difference like this. Generative AI says, "Let me explain what I think while I'm still figuring it out." Non-generative AI says, "I already know, and I'll only explain if you ask." And remember, this is the entire reason Yann LeCun cares about this so much: he has been saying for a long time that language is not intelligence. His belief is that intelligence equals understanding the world, and language is simply an output format. VL-JEPA reflects that philosophy exactly. That's why this video is about what might come after LLMs: on one side you're thinking in language and reasoning in tokens, and on the other you're thinking in the latent space and reasoning in meaning, and language is actually optional. This is the paradigm shift that this paper is talking about. And I think that maybe, just maybe, if this gains more traction, this could be what comes after LLMs.

What you're looking at in this video is a map of the model's internal understanding over time. Each dot is what the AI thinks is happening at that moment. The red dots are the instant guesses, and the blue ones are the stabilized understanding. What you're seeing on the left is essentially what the vision model would be able to see.
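As a rough illustration of the contrast described above, here is a minimal Python sketch. It is not code from the paper. The token loop stands in for a generative decoder, and the nearest-neighbor lookup stands in for turning a predicted meaning vector into words only on request. The candidate captions, the 3-dimensional vectors, and every function name are hypothetical.

```python
import math

# Hypothetical toy "meanings"; none of these values come from the paper.
CANDIDATES = {
    "picking up a canister": [0.9, 0.1, 0.0],
    "opening a door": [0.0, 1.0, 0.1],
    "pouring water": [0.1, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def generative_answer(next_token, max_steps=20):
    """Token-by-token decoding: one model call per emitted word."""
    tokens, calls = [], 0
    for _ in range(max_steps):
        calls += 1
        token = next_token(tokens)
        if token is None:  # end of sequence
            break
        tokens.append(token)
    return " ".join(tokens), calls


def non_generative_answer(predicted_meaning, candidates=CANDIDATES):
    """One prediction in embedding space; text is looked up only if asked for."""
    return max(candidates, key=lambda text: cosine(predicted_meaning, candidates[text]))


def scripted_next_token(prefix):
    """Stand-in for a language model: replays a fixed sentence word by word."""
    script = ["picking", "up", "a", "canister"]
    return script[len(prefix)] if len(prefix) < len(script) else None


text, calls = generative_answer(scripted_next_token)
label = non_generative_answer([0.8, 0.2, 0.1])
```

In this toy, the generative path needs one call per word plus one more to detect the end of the sentence, while the non-generative path makes a single comparison against the candidate meanings.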
Now, what most people are going to ask here is: how is this any different from a cheap vision model just describing what the video is doing? The short answer is that cheap models talk, but VL-JEPA understands. So we need to break down exactly what that means.

The low-cost vision model, the describer, works like this: frame, then label, then frame, then label. It looks at each frame, guesses what it sees, and spits out the text immediately. What does that look like? "Hand." "Bottle." "Picking up canister." It's jumpy and inconsistent, with no memory. It's reacting, not understanding.

VL-JEPA does this instead: video stream, then continuous meaning, then the event. It tracks the meaning over time, builds a stable understanding, and only labels the action once it's confident. That's why you see a red dot, which is an instant guess. It might be wrong; it might say bottle. But the blue dot is the stabilized meaning: it's a canister. The reason this matters is that the cheap model is going to say, "I see a bottle," but VL-JEPA is going to understand the action and say, "The action is picking up a canister."

So the killer difference is time. Low-cost models think in single frames and have no real sense of before and after. VL-JEPA thinks in temporal meaning, and it knows when an action starts, continues, and ends. That's why it's extremely useful for robotics, wearables, agents, and real-world planning. And the dot cloud matters because it shows meaning drifting slightly from frame to frame, then locking in once enough evidence exists.
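The "instant guess versus stabilized meaning" behavior can be mimicked with a simple smoothing rule. The sketch below is an assumption made purely for illustration: it keeps decayed votes over per-frame labels and commits to a label only once it holds a clear share of the recent evidence. The model described in the video tracks continuous embeddings, not label votes, and the `decay`, `threshold`, and `min_evidence` values are made up.

```python
def stabilize(frame_guesses, decay=0.7, threshold=0.6, min_evidence=2.0):
    """Return a list of (instant_guess, stable_label) pairs, one per frame.

    Hypothetical smoothing rule: older votes fade by `decay` each frame, and a
    label is committed only when enough evidence has accumulated and the top
    label holds at least `threshold` of it.
    """
    scores = {}
    stable = None
    timeline = []
    for guess in frame_guesses:
        # Fade all earlier evidence, then add the newest frame's vote.
        for label in scores:
            scores[label] *= decay
        scores[guess] = scores.get(guess, 0.0) + 1.0

        total = sum(scores.values())
        best = max(scores, key=scores.get)
        if total >= min_evidence and scores[best] / total >= threshold:
            stable = best  # "lock in" once the evidence is strong enough
        timeline.append((guess, stable))
    return timeline


# Jumpy per-frame guesses, like the red dots in the video.
guesses = ["bottle", "canister", "canister", "bottle", "canister", "canister"]
timeline = stabilize(guesses)
stable_labels = [stable for _, stable in timeline]
```

With these made-up settings, the stable label stays empty for the first two frames, settles on the canister at the third, and does not flip back when a single stray "bottle" guess appears later.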
And this is something that token-based models can't really do efficiently because, number one, they need to keep generating text, and number two, they can't hold a silent semantic state. So if

Segment 2 (05:00 - 10:00)

you think about it, a cheap model is basically like a CCTV motion detector shouting guesses, but VL-JEPA is like a human watching and saying, "Ah, okay, he's picking something up."

Then of course you might want to understand the diagram of the architecture. This is the VL-JEPA model architecture, but honestly, it was a little bit confusing, so I decided to get a simpler description. I actually used GPT Image 1.5 to get this image, because it's pretty good. And if that's too much, I also have this one right here: language is optional, understanding is not. Basically, the X-encoder takes the visual input, which is going to be the video frames. The predictor is basically the brain. The Y-encoder is the textual query, which is what you'd be asking it. Then you've got the encoded meanings from the words, which is the Y-decoder. Then you've got "comparing the thoughts," which is the training loss, which essentially means it's getting better over time. And then you've got the final output, the correct answer, which is the actual meaning.

Now, if we look at the tests, this is currently the best. We're looking at the scoreboard, where we can see the other AI models. CLIP, SigLIP, and PE-Core are older, well-known vision models, and compared to them we have VL-JEPA base and VL-JEPA SFT, which is the fine-tuned version. We can see that VL-JEPA is a really incredible improvement.

One of the things I think a lot of people are going to miss is that VL-JEPA is super small. You know how generative models are just tokens on tokens on tokens. But for something that reasons more like a human, look at the number of parameters and the number of samples seen: VL-JEPA is 1.6 billion parameters and 2 billion in terms of samples seen. So it's remarkably more efficient than the other models we're looking at. I think that's pretty incredible.

If we continue to look over here, you can see zero-shot video captioning. This shows that with the same data and the same setup, VL-JEPA learns faster and reaches higher caption quality: predicting meaning learns faster than predicting words. Then you've got chart two, which is zero-shot video classification, and it's the same thing: VL-JEPA quickly pulls ahead, and the vision-language models improve very slowly. So even without fine-tuning, VL-JEPA understands videos better, and this kills the idea that you need token generation to understand things. It's clear that Yann LeCun is on to something.

So once again, if we look at the right side, remember what I said earlier. If you look at the actual size of the models, you can see that the vision-language models are much larger and much less efficient, and VL-JEPA only needs about 0.5 billion parameters in its predictor, so there's no heavy decoder during training. VL-JEPA gets better results with half the trainable parameters, which is pretty insane in machine learning terms.

And of course here we have Yann LeCun talking about this stuff. This was, I think, around two to three weeks ago.

"A four-year-old has seen as much visual data as the biggest LLM trained on the entire text ever produced. And so what that tells you is that there is way more information in the real world, but it's also much more complicated. It's noisy. It's high dimensional. It's continuous. And basically the methods that are employed to train LLMs do not work in the real world. That explains why we have LLMs that can pass the bar exam or solve equations or compute integrals like college students and solve math problems. But we still don't have a domestic robot that can do the chores in the house. We don't even have level five self-driving cars. I mean, we have them, but we cheat. So we certainly don't have self-driving cars that can learn to drive in 20 hours of practice like any teenager."

And then of course I went on Yann LeCun's Twitter and saw him reposting this from Sonia Joseph. This is someone who works at Meta, and she essentially said that we don't simulate every atom to model intelligence. We don't use quantum field theory to model road traffic. JEPA taught me the importance of learning physics at the right level of abstraction. Thank you, Yann LeCun and the JEPA team. It was a privilege to work with you. So I'll definitely take a look at this.

"The thesis behind JEPA is that our current models are not predicting causal dynamics. And if you both predict in latent space and predict the future, then you're more likely to abstract away all these pixel-level details. For example, when we model even this conversation right now, we don't have to model it down to the level of atoms. That would be so computationally costly and so inefficient. We model things at the representation that's suited for our goal. So similarly, JEPA is optimized to have physical representations at the level of abstraction it needs. It enables it to plan in the physical world and be able to do counterfactual reasoning about objects that are moving around. That's the idea behind JEPA."

Now, I did see a few comments on Reddit about the video saying that most of the actions it detects are wrong

Segment 3 (10:00 - 10:00)

though. If you stop the video at any time to actually read what it says, it's really bad. And the same person says, "I stopped it like five times and they were all wrong. Made up a side of pizza, made up something else." But I think the most important thing here is not that it's going to be 100% right. I think the most important thing is that it's moving us in the right direction, toward where AI models should actually be, and not just getting completely distracted by chatbots.
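For readers who want the architecture walkthrough from Segment 2 in concrete form, here is a toy training loop in embedding space. It is a sketch under stated assumptions: a linear predictor, hand-written vectors standing in for encoder outputs, and a squared-distance loss chosen for simplicity. The transcript does not say which loss the paper uses, so none of this should be read as the authors' implementation, and all names are hypothetical.

```python
def predict(weights, visual, query):
    """Toy linear predictor: maps concatenated [visual; query] features to a meaning vector."""
    features = visual + query
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def embedding_loss(predicted, target):
    """Squared distance between predicted and target meaning vectors (an assumed loss)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def train_step(weights, visual, query, target, lr=0.1):
    """One gradient-descent step on the squared-distance loss."""
    features = visual + query
    predicted = predict(weights, visual, query)
    # d(loss)/d(weights[i][j]) = 2 * (predicted[i] - target[i]) * features[j]
    return [
        [w - lr * 2 * (p - t) * f for w, f in zip(row, features)]
        for row, p, t in zip(weights, predicted, target)
    ]


visual = [1.0, 0.0]    # stand-in for the visual encoder's output
query = [0.0, 1.0]     # stand-in for the embedded text query
target = [0.5, -0.5]   # stand-in for the text encoder's embedding of the answer
weights = [[0.0] * 4 for _ in range(2)]

losses = []
for _ in range(20):
    losses.append(embedding_loss(predict(weights, visual, query), target))
    weights = train_step(weights, visual, query, target)
```

The point of the toy is only that the loss compares vectors, not words: nothing in the loop ever generates text, and the recorded loss shrinks as the predicted meaning moves toward the target meaning.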
