China’s New AI Breakthrough - Attention Residuals Explained
8:49

TheAIGRID · 19.03.2026

Video description
Attention Residuals Explained - China’s New AI Breakthrough: 🌐Subscribe To My Newsletter - https://aigrid.beehiiv.com/subscribe Get your Free AGI Preparedness Guide - https://theaigrid.kit.com/agi 🎓 Learn AI In 10 Minutes A Day - https://www.skool.com/theaigridacademy 🐤 Follow Me on Twitter https://twitter.com/TheAiGrid Links From Today's Video: Welcome to my channel where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos. Was there anything I missed? (For Sponsorship Enquiries) aigrid@faiz.mov (Contact Me Directly - contact@thaigrid.com) Music Used: LEMMiNO - Cipher https://www.youtube.com/watch?v=b0q5PR1xpA0 CC BY-SA 4.0 LEMMiNO - Encounters https://www.youtube.com/watch?v=xdwWCl_5x2s #LLM #Largelanguagemodel #chatgpt #AI #ArtificialIntelligence #MachineLearning #DeepLearning #NeuralNetworks #Robotics #DataScience

Table of Contents (2 segments)

Segment 1 (00:00 - 05:00)

So, China just made an AI breakthrough that has even Elon Musk saying "impressive work." So, let's talk about it. Every AI model that you've ever talked to, ChatGPT, Claude, Gemini, all of them are built on top of the same basic wiring. One piece of that wiring hasn't been changed since 2015. Not tweaked, not updated, the exact same design, copied for 11 years. Researchers just kind of assumed it was fine. Turns out it wasn't. So, a Chinese lab called Moonshot AI, the team behind the Kimi models, just published a paper that rethinks this piece. It's called attention residuals. And the core argument is very simple. There's a flaw baked into the foundation of every modern AI model, and nobody noticed it because it doesn't break anything. It just makes everything slightly worse than it should be.

So, what is this hidden wiring? What's wrong with it? And how can we actually fix it? The simple answer is that the wiring in question is something called a residual connection. All it does is pass information forward through the layers of a neural network. The problem is that it passes everything forward with equal importance. No filtering, no prioritizing. The fix is to let the model choose what to pay attention to, not just when it's reading your words, but when rereading its own internal layers.

So, let me explain this to you all with a quick analogy. Imagine you're writing a report and you have a team of 50 editors. Editor one reads your draft, makes some notes, and passes everything along. Editor two gets the original draft plus editor one's notes, makes their own notes, and then passes everything along. Editor three gets all of that and adds more. By the time you get to layer 50, they're holding a massive stack of paper: the original draft plus 49 sets of notes all piled on top of each other. There's no way to tell which notes are important and which ones are noise.

And that's how most AI models work today. Large language models like ChatGPT or Claude have dozens, sometimes hundreds, of layers. Each layer is like one of those editors. It processes the information, adds its own contribution, and passes the whole thing forward. The mechanism that does that passing is called the residual connection, and it was first introduced in 2015 for image recognition. The idea was simple and brilliant: just add the original input back in at every step so information doesn't get lost along the way. And it worked. It's the reason we can train really deep neural networks at all. Without residual connections, the signal degrades so much that the model can't learn. With them, you can stack 100 layers and it still works.

Now, in a shallow model with 10 or 20 layers, this is fine. But in a deep model, and modern LLMs are deep, the pile gets so big that individual contributions get drowned out. The signal from the early layers gets buried, and deeper layers have to shout louder and louder to be heard over the accumulated noise. The paper calls this dilution. You can just think of it as a pile getting so tall that nobody can find anything in it.
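To make that wiring concrete, here is a minimal sketch of one transformer layer in PyTorch-flavored Python. The names (`SimpleBlock`, `d_model`) are illustrative, not taken from the paper or any particular codebase; the detail to notice is the plain, unweighted `x + ...` addition, which is all a residual connection is.

```python
import torch
import torch.nn as nn

class SimpleBlock(nn.Module):
    """One transformer layer with the standard 2015-style residual wiring."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection: whatever came before is added back in,
        # unweighted, so every earlier contribution rides along forever.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

# Stack 50 of these and layer 50 receives the plain sum of the input plus
# all 49 earlier contributions: the "pile of notes" from the analogy above.
layers = nn.ModuleList(SimpleBlock(512, 8) for _ in range(50))
x = torch.randn(1, 16, 512)  # (batch, tokens, d_model)
for layer in layers:
    x = layer(x)
```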
So here's the clever part, and this is where the paper gets really elegant. AI researchers actually solved this problem before, but in a completely different context. Let me explain. Before the current generation of AI models, we had something called recurrent neural networks, RNNs, which processed text one word at a time. At each step, they compressed everything they'd read so far into a single summary, word after word. And that summary got more and more overloaded. By the time the model reached the five hundredth word, the information from word three was basically gone. Does that sound familiar?

The transformer, the architecture behind every modern AI model, fixed this by introducing attention. Instead of compressing everything into one summary, each block could look back at every previous word and decide which ones mattered most. It could focus on word three if word three was relevant and ignore word 200 if it wasn't. This selective process is what made modern AI possible.

What the Kimi team realized is that residual connections have the exact same problem, just in a different direction. RNNs compressed information across words, over time. Residual connections compress information across layers, over depth. Same bottleneck, same forced averaging, same loss of useful information. And so the fix is the same. Instead of adding every layer's output together blindly, let each layer look back at all previous layers and choose which ones to focus on. Give the model attention, not across words, but across its own depth. Each layer gets to ask, "Which of my predecessors has the information I actually need right now?" That is what attention residuals are. It's literally the same idea that made transformers revolutionary, but applied to a dimension of the architecture that everyone forgot to upgrade. Instead of every layer getting the same average soup, each layer gets a custom blend assembled on the fly based on what the input actually needs.
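The video doesn't spell out the paper's exact formulation, so the following is only a rough sketch of the core idea under stated assumptions: keep the outputs of all previous layers around, and before each new layer runs, let every token compute softmax weights over that depth-wise history and take a weighted blend instead of the plain running sum. The `DepthAttention` module and its parameterization are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAttention(nn.Module):
    """Sketch of attention over a model's own depth: each layer chooses
    which earlier layers to read from instead of taking their plain sum.
    Illustrative only; Moonshot's actual design may differ."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)  # query from the current state
        self.k = nn.Linear(d_model, d_model)  # keys from the layer history

    def forward(self, x: torch.Tensor, history: list[torch.Tensor]) -> torch.Tensor:
        mem = torch.stack(history, dim=2)                # (B, T, L, d): saved layer outputs
        q = self.q(x).unsqueeze(2)                       # (B, T, 1, d)
        k = self.k(mem)                                  # (B, T, L, d)
        scores = (q * k).sum(-1) / mem.shape[-1] ** 0.5  # (B, T, L): one score per layer
        w = F.softmax(scores, dim=-1).unsqueeze(-1)      # weights over depth, not words
        return (w * mem).sum(dim=2)                      # per-token custom blend
```

Each layer would then consume this blend rather than the raw sum, so layer 30 can lean heavily on layer 3 for one token and on layer 29 for the next, which is the "custom blend" described above.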

Segment 2 (05:00 - 08:00)

Now, the obvious question is, does this actually make a difference? And can you do it without making the model way more expensive to run? On the first question, yes, clearly. They tested this across five different model sizes, and at every single scale the new approach beat the standard one. To put a number on it, the improvement was equivalent to getting 25% more training compute for free. Same model, same data, same cost, just better wiring, and you get the performance of a model trained with a quarter more resources. They also tested it on their biggest model, Kimi Linear, which has 48 billion total parameters. The gain showed up on every benchmark they tried. Reasoning ability jumped significantly. Math performance improved. Coding ability went up, and not by some marginal amount. On one reasoning benchmark called GPQA Diamond, scores jumped from 36.9 to 44.4. That's a massive leap for a change to something as low-level as how information flows between layers.

On the second question, we can look at the cost, and that's where the engineering gets interesting. The full version of this idea, where every layer looks back at every other layer, does use more memory. So the team built a practical version called block attention residuals. Instead of every layer having its own look-back, you group the layers into about eight blocks. Within each block, you use the old system; between blocks, you use the new attention-based system. This gives you most of the benefit at a fraction of the cost. And how much cost? Training is less than 4% more expensive, and at inference, when the model is actually generating text for you, the slowdown is under 2%. You would never notice it. It's essentially free performance.
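Here is a rough sketch of how that block variant might be wired, reusing the hypothetical `SimpleBlock` and `DepthAttention` modules from the sketches above. The layer count and block size are assumptions based only on the "about eight blocks" description; the paper's actual design may differ.

```python
import torch.nn as nn

class BlockwiseModel(nn.Module):
    """Sketch of block attention residuals: plain residuals inside each
    block, depth attention only at block boundaries, so the look-back
    cost scales with the number of blocks (~8), not the number of layers."""

    def __init__(self, d_model: int, n_heads: int,
                 n_layers: int = 48, block_size: int = 6):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.ModuleList(SimpleBlock(d_model, n_heads) for _ in range(block_size))
            for _ in range(n_layers // block_size)
        )
        self.depth_attn = nn.ModuleList(
            DepthAttention(d_model) for _ in self.blocks
        )

    def forward(self, x):
        block_outputs = [x]  # one saved state per block, not per layer
        for block, mix in zip(self.blocks, self.depth_attn):
            x = mix(x, block_outputs)  # new system: attend over earlier blocks
            for layer in block:        # old system: plain residuals inside
                x = layer(x)
            block_outputs.append(x)
        return x
```

In this sketch the only extra state is one saved tensor per block, far fewer than one per layer, which is the kind of saving that could keep overhead in the few-percent range the video quotes.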
So why does this actually matter for AI today? You see, this matters for a reason that goes beyond one paper. Residual connections are inside every transformer model ever built. Every chatbot, every image generator, every coding system. This isn't some obscure component. It's the plumbing that everything runs on. And the fact that nobody seriously questioned it for over a decade tells you something important about how AI research works. There are probably other pieces of the transformer everyone assumed were good enough: the attention mechanism itself, the way layers are normalized, the way parameters are initialized. If residual connections, the simplest, most boring piece of the whole architecture, had this much room for improvement, what else might be hiding in plain sight?

The broader lesson is that AI assumptions compound. You build on top of a design choice made in 2015, and ten years later everyone treats it like a law of physics instead of a choice that could be revisited. See, in 2015, someone figured out how to make deep networks work by adding a shortcut. In 2017, someone figured out how to make those networks understand language by letting them choose what to focus on. And in 2025, someone finally combined those two ideas and asked the most obvious question nobody had asked: why can the model choose what to focus on in your sentence, but not in its own layers? And the answer to that question turned out to be worth 25% more compute and improvements on every single benchmark. Not by making the model bigger, just by upgrading the plumbing. Sometimes the biggest gains aren't in the flashiest parts of the system. They're in the parts everyone stopped looking at.

Now, one researcher, Ziming Liu, looked at Kimi's new attention-residuals idea and basically asked a very important question: is this always better? And his answer is basically, it depends. He created a toy experiment with two extremes. On one side, you have structured data, data with clear patterns, rules, and shortcuts. On the other side, you have random data, basically messy information where there is no nice pattern to exploit, so the model has to brute-force memorize things. What he found is that attention residuals seem to do better when the data has real structure. Why? Because the new system can learn to focus on the most useful earlier representations and almost skip unnecessary steps. But when the data is more random and chaotic, the standard residual connection can actually be better. That's because the old system is more expressive in a brute-force way, while the new system can sometimes fall into a kind of averaging behavior where it blends too much together and loses sharpness.

So the big takeaway here is that attention residuals are probably not a universal upgrade for every task on Earth. They may be especially good when the underlying data has clear structure. And that matters for LLMs, because language itself is highly structured. Grammar is structured. Code is structured.
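That averaging failure mode is easy to see numerically. In a tiny toy illustration (all numbers invented), near-uniform softmax weights over depth blur distinct layer signals together, while peaked weights actually select one:

```python
import torch
import torch.nn.functional as F

# Three earlier-layer "features" for a single position (toy numbers).
layers = torch.tensor([[4.0, 0.0],    # layer 1: a sharp, useful signal
                       [0.0, 4.0],    # layer 2: a different sharp signal
                       [0.1, 0.1]])   # layer 3: mostly noise

peaked = F.softmax(torch.tensor([5.0, 0.0, 0.0]), dim=0)   # confident selection
uniform = F.softmax(torch.tensor([0.1, 0.0, 0.1]), dim=0)  # near-uniform blend

print(peaked @ layers)   # ~[3.95, 0.03]: layer 1 survives almost intact
print(uniform @ layers)  # ~[1.41, 1.28]: everything blurred together
```

On structured data the model can learn peaked weights like the first case; on random data the scores stay uninformative and the result drifts toward the blurred second case, which is the loss of sharpness described above.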
