Game OVER? New AI Research Stuns AI Community.


TheAIGRID · 24.04.2025 · 44,276 views · 1,010 likes


Video description
Join my AI Academy - https://www.skool.com/postagiprepardness 🐤 Follow me on Twitter https://twitter.com/TheAiGrid 🌐 Check out my website - https://theaigrid.com/ Links from today's video: https://x.com/iruletheworldmo/status/1915338995707359274 Welcome to my channel where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos. Was there anything I missed? (For business enquiries) contact@theaigrid.com Music used: LEMMiNO - Cipher https://www.youtube.com/watch?v=b0q5PR1xpA0 CC BY-SA 4.0 LEMMiNO - Encounters https://www.youtube.com/watch?v=xdwWCl_5x2s #LLM #Largelanguagemodel #chatgpt #AI #ArtificialIntelligence #MachineLearning #DeepLearning #NeuralNetworks #Robotics #DataScience

Table of contents (4 segments)

Segment 1 (00:00 - 05:00)

Game over. That is what this tweet said for the AI industry when it comes to LLMs. Now, I'm going to explain exactly what is going on, because this tweet went semi-viral in the AI community. It all stems back to this AI paper, which is rather fascinating because it gives us an insight into how reasoning models really work and what the future of AI could look like. So I'm going to break this down in a simple way so that you aren't confused by this news. Is it over? Is it actually game over for LLMs? We've seen many different papers come out of the woodwork and challenge the nature of these models, and this is another one you want to pay attention to. So let me get into the specifics. The paper, called "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?", seeks to understand whether adding reinforcement learning to a model, which is how reasoning models are trained to get better at thinking and solving problems, actually makes the model smarter. The paper's central claim is that it does not. So let me break down exactly what the researchers did; take a look at the figure from the paper, which summarizes the results, and then we can get into how it all works. They took two versions of an AI: one trained the normal way, called the base model, with no changes to it whatsoever, and one trained with extra reinforcement learning, called the RLVR model (reinforcement learning with verifiable rewards). They gave both models the same hard questions and tested how many they could answer correctly when given a single attempt (k = 1) versus many attempts (up to k = 256, meaning the model gets 256 tries). What they found was rather fascinating: the reinforcement learning model was better when it only got one try, but when both models were allowed many tries, the base model actually did better. This is really surprising, because reinforcement learning is one of the major paradigms behind today's reasoning models, so finding out that the capabilities were already sitting in the base model is a big deal. So what does this mean? Essentially, the paper argues that reinforcement learning doesn't really teach new skills; it only helps the AI arrive at good guesses faster. And apparently it makes the AI less exploratory: it tries fewer ways of solving a problem, which is why it gets stuck more often when the problem is harder. The paper also mentions distillation, which may be a better approach because it can actually teach models new skills, whereas reinforcement learning does not.
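To make the pass@k metric concrete before going further, here is a minimal sketch of the standard unbiased pass@k estimator (the one introduced for code evaluation by Chen et al., 2021, which this line of work typically builds on). The example numbers are hypothetical, not taken from this paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n samples per problem, of which
    c were correct, estimate the probability that at least one of k
    randomly chosen samples is correct."""
    if n - c < k:
        return 1.0  # fewer than k wrong samples: a correct one is always included
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical model that solved a problem in 12 of 256 samples:
print(pass_at_k(256, 12, 1))    # ~0.047 -> looks weak given one try
print(pass_at_k(256, 12, 256))  # 1.0    -> the capability is clearly in there
```

This is exactly the gap the paper exploits: a skill that looks absent at k = 1 can be fully present at k = 256.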
Now let me explain it even further with another image from the paper, which shows two decision trees. These are the paths a model can take when solving a problem; each node is a step in the model's reasoning, a +1 marks a correct final answer, and a 0 marks a wrong one. The base model is on the left, the original version without RL training; on the right is the same model after reinforcement learning that rewards correct answers. The top row shows more efficient sampling: the base model explores many paths but is slower to land on the right one, wandering through the branches and hitting 0 after 0 before it eventually reaches the +1. The reinforcement learning model has already learned to aim for the paths that give rewards, so it finds correct answers faster. That's the good outcome: reinforcement learning helps the model find good answers quickly. Then we have problem B, the reduced scope of reasoning capacity: the base model still explores lots of reasoning paths, and one of them leads to the right answer, but the reinforcement learning model, trained to focus only on the most rewarding paths, misses the answer completely. The answer sits on a branch the trained model never takes anymore, so it misses it entirely. That's the bad outcome: reinforcement learning has made the model too narrow, and it misses correct answers it used to be able to find. The message is that reinforcement learning improves efficiency, so you find answers quicker, but it reduces flexibility, because the model explores less and may miss answers it used to find. And if the answers are already hidden in the base model, which is what this paper claims, how do we know we aren't losing answers by training models with reinforcement learning? Now, this statement went semi-viral because it was pretty dramatic.
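Here is a tiny numeric sketch of that trade-off. The per-path success probabilities below are invented for illustration (not from the paper); they just encode "RL concentrates probability on previously rewarded paths, the base model spreads it out," and show how that flips the ranking between k = 1 and k = 256:

```python
# pass@k for i.i.d. sampling with per-try success probability p
def pass_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

# Two hypothetical problems:
#   problem 1: RL has locked onto a correct path; the base model rarely samples it
#   problem 2: the only correct path is one RL pruned away; the base model can still find it
per_try_success = {"base": [0.03, 0.02], "rl": [0.60, 0.00]}

for k in (1, 256):
    for name, ps in per_try_success.items():
        coverage = sum(pass_k(p, k) for p in ps) / len(ps)  # mean pass@k over problems
        print(f"k={k:>3}  {name}: {coverage:.3f}")
# k=1:   base ~0.025, rl ~0.300  -> RL wins on the first try
# k=256: base ~0.997, rl ~0.500  -> the base model pulls ahead with many tries
```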

Segment 2 (05:00 - 10:00)

The author of the tweet says the RL victory lap was premature: the new paper quietly shows that the fancy reward loops just squeeze the same tired reasoning paths the base model already knew. He's basically saying that everyone thought reinforcement learning was unlocking smarter AI, but this paper shows: nope, it just pushes the AI to pick answers it already had in its brain more often. It's like forcing a band to play its greatest hits instead of composing new music. He also says that pass@1 goes up, sure, but the model's world actually shrinks; it feels like teaching a kid to ace flashcards and calling it wisdom. In other words, reinforcement learning does make the AI get the right answer faster on the first try, which is pass@1, but it's like drilling a kid with flashcards: they remember the answers without understanding anything deeper. As for the grand self-improving LLM dream? Basically crib notes plus a roulette wheel: keep sampling long enough and the base model spits out the same proofs the reinforcement learning champ brags about, minus the entropy tax. So the dream of an AI teaching itself into brilliance might just be a myth; if you let the plain base model try enough times, it eventually finds the same answers as the reinforcement-learning-trained one. Reinforcement learning doesn't give new ideas; it makes the model bet smarter, not think smarter. He also says it's compression, not discovery: reinforcement learning isn't making the AI discover new ways to solve problems, it's compressing its knowledge into more efficient patterns. Maybe the endgame isn't better agents, just sharper funnels. We've been coaching silicon parrots to clear increasingly useless Olympiad hurdles while mistaking overfitting for insight: training AIs to be parrots that pass complex exams without learning, overfitting to the training data and test style rather than developing real understanding. He ends by saying it's hard not to wonder whether we're half a decade into the world's most expensive curve-fitting demo. Have we just spent the last five years and millions of dollars teaching AIs to fit curves, memorizing patterns without building true intelligence? Now, there's also a Q&A from the researchers who wrote the paper. The first question is why they used pass@k instead of majority voting, and whether that choice makes the findings invalid; after all, real-world models don't get to try 256 times, so why is this metric even useful? Their answer is that this is not about real-world performance but about theoretical potential: they use pass@k not to judge whether a model is useful in the wild, but to find out how far a model could go if given enough tries. And what is pass@k? As we covered earlier, if the model is allowed k guesses and one of them is right, it scores the point. It's a way to measure a model's maximum capability, not its average or majority behavior. Their reasoning is that if reinforcement learning really made the model smarter, then at large k, when the model gets lots of chances, the reinforcement-learned model should solve more problems than the base model.
But the opposite happens: at large k, the reinforcement learning models solve fewer problems than the base models. That means reinforcement learning isn't giving the model new reasoning skills; it's just helping it sample the answers it already has more efficiently, and the trade-off is that it explores less. So pass@k shows what a model can theoretically do, not what it usually does, and it supports the point that reinforcement learning narrows the AI's mind, even as it gets quicker at spitting out certain answers. The next question they address is whether pass@k is meaningless, since you could eventually guess the right answer given enough attempts. That is a fair concern: if the answer is just a number like 42 and the model guesses a thousand times, it might eventually hit it by luck, which could make pass@1024 a little noisy on some math problems. But for coding problems you can't just guess and get it right; the output has to pass all the test cases, and that is hard to stumble into. And still, the base model often did better than the reinforcement-learning-trained model when given more tries. That's not luck; that's real reasoning inside the base model. The authors even checked this by hand: they looked at problems from AIME and GSM8K and found that the base model's answers weren't just lucky guesses; they included at least one valid step-by-step solution. Even on hard benchmarks like MATH-500, where the answers are things like fractions or square roots that are nearly impossible to guess, the base model still did better when allowed more tries. So while you could object that a model guessing randomly would eventually get it right and therefore this shouldn't count, that isn't what's happening here, because many of these answers are highly specific; it wasn't randomness but buried skill surfacing with more chances. They also address one final objection: even random sampling can eventually generate the right answer, so doesn't that make the result meaningless?
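To see why the authors prefer pass@k over majority voting for measuring potential, here is a small sketch (the sampled answers are invented for illustration): majority voting rewards the most frequent answer, while pass@k rewards any single correct sample, so a rare-but-genuine reasoning path counts under pass@k but gets drowned out under voting:

```python
from collections import Counter

def majority_vote_correct(samples: list[str], truth: str) -> bool:
    """Majority-voting-style scoring: right only if the plurality answer is right."""
    return Counter(samples).most_common(1)[0][0] == truth

def pass_at_k_correct(samples: list[str], truth: str) -> bool:
    """pass@k-style scoring: right if *any* of the k samples is right."""
    return truth in samples

# Hypothetical: the model finds the true answer on only 1 of 8 sampled paths.
samples = ["12", "12", "15", "7*sqrt(2)", "12", "8", "15", "12"]
print(majority_vote_correct(samples, "7*sqrt(2)"))  # False: the rare path is outvoted
print(pass_at_k_correct(samples, "7*sqrt(2)"))      # True: the skill exists in the model
```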

Segment 3 (10:00 - 15:00)

In theory, yes, typing randomly could eventually land a correct answer, but the chance of that happening is vanishingly small; pure randomness is not what's powering these correct answers. The real insight is that the paper isn't saying "keep sampling forever and eventually you'll get lucky." It's saying that if a model can get the right answer within 128 to 1,024 tries, the model already knows the reasoning path; that's not luck, it's a hidden skill in the base model. The difference comes down to probability: if the model has some real understanding, correct answers get sampled within a few hundred tries, whereas if it doesn't, you'd need trillions of samples, which is far beyond practical computing (there's a quick back-of-envelope version of this argument at the end of this segment). Reinforcement learning biases the model to pick better guesses earlier, but that doesn't discover new reasoning; it just shortcuts the model to the answers it already had. Next question: isn't this common sense? Shouldn't reinforcement learning help the model get it right on the first try? Yes, that part is expected; reinforcement learning is meant to boost pass@1, so it's no surprise that the model gets better at picking the known-good answer faster. What is surprising is that reinforcement learning doesn't expand what the model can do; it only makes it more efficient. If the base model can't solve a problem even after 1,000 tries, the reinforcement learning model can't solve it either. That means reinforcement learning isn't teaching new reasoning strategies; it's tuning the model to focus on what it already knew. This matters because in classic reinforcement learning, the whole point is to discover better strategies over time, but here, applied to LLMs, it doesn't explore new thinking or break out of the base model's limits; it just tightens what the model already does. The reason the authors were surprised is that in the pass@k experiments, where the model gets lots of tries, the base model actually outperforms the reinforcement-learning-trained model. Then they address whether this means reinforcement learning can never help models reason better than the base version. Not exactly: they aren't saying reinforcement learning is useless, only that so far they haven't seen proof that it makes models smarter at reasoning. They're running careful experiments to ask whether reinforcement learning really helps models learn new ways to think, and so far it looks like it helps models answer faster but may not go beyond what the base model already knew. They remain open on this; maybe bigger models and more data could change the picture, and they're testing that right now with DeepSeek-V3 versus R1-Zero, comparing the base DeepSeek model against the reinforcement-learning-trained one to see how much smarter it really is. For now, they're simply saying we need to see proof.
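Here is the back-of-envelope version of that probability argument, with illustrative numbers of my own (not from the paper). If each independent sample succeeds with probability p, the number of samples k needed to succeed at least once with probability t satisfies 1 - (1 - p)^k = t:

```python
import math

def samples_needed(p: float, target: float = 0.5) -> float:
    """Samples k needed so that 1 - (1 - p)**k >= target,
    given per-sample success probability p."""
    return math.log(1 - target) / math.log(1 - p)

# "The skill is in the model": even a 1-in-500 per-sample success rate
# surfaces within a few hundred tries -- well inside a pass@1024 budget.
print(samples_needed(1 / 500))   # ~346 samples

# "Blind luck": guessing a very specific answer at, say, 1e-13 per sample
# would need trillions of samples -- far beyond any practical budget.
print(samples_needed(1e-13))     # ~6.9e12 samples
```

So a success inside a 1,024-try budget is strong evidence of a real, if rarely sampled, reasoning path rather than random luck.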
Once again, they state that no, reinforcement learning is not useless, because it improves sample efficiency. But if we want to solve truly harder problems in AI, we may need a new training paradigm that can go beyond the base model's ceiling. In other words, we need a way to get genuinely new knowledge out of these models, and that is the true kind of intelligence we're really after. Now, someone from OpenAI said that reading this must feel great if you have one brain cell. And I was thinking about this, and in a sense it's an interesting point: when we look at what the model is doing, reinforcement learning is great because it helps you get the answer faster, and the downside is that it narrows the reasoning paths and may miss some real answers. But in a practical sense, you could argue the reinforcement-learning-trained model is smarter in an important way: it's more efficient, and finding the correct answer reliably and quickly is often exactly what you need. It makes fewer mistakes and fewer false starts, and it delivers better results in real-world use cases where you only get one attempt. That matters. It's pretty similar to how we judge human intelligence: someone who consistently solves problems correctly on the first attempt would typically be considered smarter than someone who needs many tries to reach the same answer, even if both technically have the same knowledge. The paper's distinction, which I find fascinating, is about the nature of the improvement: not that the model learned new concepts or developed entirely new problem-solving strategies, but that it got better at using what it already knew. Still, this efficiency gain is, I personally believe, a practical form of intelligence. A model that can reliably choose the right approach immediately is going to be more useful than a model that needs many random attempts to find the correct path, even if both theoretically contain the same capabilities. The real limitation highlighted in the paper is that reinforcement learning appears to have a ceiling: it can't teach a model to solve problems that are fundamentally beyond the base model's capability frontier. And that's where

Segment 4 (15:00 - 15:10)

other approaches like distillation, or even different architectures, might be needed. So let me know what you guys think about this. I thought this was super interesting, and I'll see you in the next one.
