Deepseek R1 0528: The AI Bombshell That Just Changed EVERYTHING.
29:37


TheAIGRID · 30.05.2025 · 24,740 views · 524 likes

Video description
Want to Stay Up To Date? - https://aigrid.beehiiv.com/subscribe 🐤 Follow Me on Twitter https://twitter.com/TheAiGrid 🌐 Checkout My website - https://theaigrid.com/ Links From Todays Video: https://www.reddit.com/r/LocalLLaMA/comments/1kyac9f/new_deepseek_r1_8b_distill_thats_matching_the/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button https://www.reuters.com/world/china/chinas-deepseek-releases-an-update-its-r1-reasoning-model-2025-05-29/ Welcome to my channel where i bring you the latest breakthroughs in AI. From deep learning to robotics, i cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos. Was there anything i missed? (For Business Enquiries) contact@theaigrid.com Music Used LEMMiNO - Cipher https://www.youtube.com/watch?v=b0q5PR1xpA0 CC BY-SA 4.0 LEMMiNO - Encounters https://www.youtube.com/watch?v=xdwWCl_5x2s #LLM #Largelanguagemodel #chatgpt #AI #ArtificialIntelligence #MachineLearning #DeepLearning #NeuralNetworks #Robotics #DataScience

Table of contents (6 segments)

Segment 1 (00:00 - 05:00)

So, in a shocking turn of events, DeepSeek may have actually just done the absolutely impossible once again. If you aren't familiar with this company, and by now, if you're watching the channel, I'm pretty sure you are, DeepSeek is that innovative company coming out of China that literally took the AI world by storm, and that's no exaggeration. And today, in their most recent update to their thinking model R1, it actually gets even better than some of these state-of-the-art models. This took me by surprise, so in today's video I'll get into all the explicit details of exactly why this is a lot bigger than you think. When we take a look at these quick benchmarks, it's rather important to realize something most people don't: DeepSeek actually releases two models. They've always had the standard chat model, which is DeepSeek V3, and then they've had the DeepSeek R1 variant, which is essentially the thinking version, where the model reasons about its thought process and the entire plan before submitting a final answer. Now, right here we can really begin to see what makes this so special. DeepSeek R1, the latest update, indicated by the 0528, is literally on par with state-of-the-art models like Gemini 2.5 Pro and OpenAI's o3. The DeepSeek bar is the striped pink-purple bar, and we can literally see that it is toe-to-toe with these state-of-the-art models. Now, you have to remember, the thing that makes this absolutely incredible is that this model was apparently trained with just $6 million. There have of course been various individuals who dispute that claim, but the point is that they've had to do this on a budget and in a lot less time. How on earth has a company managed to catch up to state-of-the-art companies spending billions of dollars on compute, for just a fraction of the cost?
And it leads people to believe that it's quite likely this company is going to be leading in the future. Now, when we take a look at all of these benchmarks, a lot of them are really important for seeing where the model lies. Most of them are across math and science, and some across coding, which we'll get into later. When you look at the Artificial Analysis Intelligence Index, it's a really good indicator of where the model lies, because it actually incorporates seven evaluations, not just one. So unlike other evaluations that look at maybe just one factor, this one looks across the benchmarks and aggregates an average score. And we can see that the DeepSeek R1 jump has been absolutely incredible. I mean, take a look at this. If you watched yesterday's video, you'll remember I spoke about just how crazy Claude 4 is. And we can see that R1 actually leapfrogs Claude 4 Sonnet Thinking. It also leapfrogs Qwen 3 Reasoning, Gemini 2.5 Flash, Grok 3 Reasoning, and Gemini 2.5 Pro Preview. That is absolutely incredible, which means some people could argue that DeepSeek is now just behind OpenAI in terms of model quality. And that's a major statement when we consider just how much time and effort has gone into making these models. So for me, this is an incredible surprise. Now, one of the things I want to get into is certain specific benchmarks. Basically, we now know that the model is essentially as smart as Google's models and those of many other tech companies in a similar area, which I genuinely didn't expect. For DeepSeek this was just an update, not their next frontier model. I think you guys should really be impressed by this, because if this were their next frontier model, which is going to be DeepSeek R2, then this kind of performance would still be rather impressive.
But the fact that they've just updated R1 to be this effective is quite concerning, because it leads me to believe that maybe these tech companies aren't in the lead when it comes to state-of-the-art AI performance. Now, there's one benchmark I really want you guys to focus on: the Aider Polyglot score. Someone posted on the LocalLLaMA subreddit, which is basically a subreddit for people who run LLMs locally and, of course, care about open-source AI. They talk about how DeepSeek R1 scored the same as Claude 4 Opus on Aider Polyglot, and that's a whopping 70%. Now, the crazy thing about this is that we know that these frontier

Segment 2 (05:00 - 10:00)

models from these large tech companies often cost quite a lot. And so one of the things I really need you guys to understand is that this run actually cost around $2 to $3. By contrast, models like Claude Opus are a lot more expensive; they're going to cost probably around $50 or so for the same kind of inference. We can see here on the Aider Polyglot benchmark, which is of course a very important benchmark for software engineering tasks, that it only just lags behind other models. And the models it lags behind aren't bad models at all; these are the state-of-the-art models on their highest reasoning settings. You can see that o4-mini at a medium setting, o4-mini high, and o3 high are the only models that surpass DeepSeek. And you have to understand just how much reasoning that is. If you remember the initial results from OpenAI's statements on what o3-mini high and o4-mini high are, these are models that reason for a very long time with a lot of tokens, and that is very expensive. I even remember them testing these high-reasoning models on the ARC-AGI tests, and some of those runs cost around $1,000 for a single query. So to have DeepSeek R1 that close behind o4-mini and o3, both on their high settings, is quite impressive to say the least, and also quite concerning, because maybe, just maybe, their next model might be even better. Now remember, like I said, it's not just about benchmark performance. Whilst yes, most people will flock to the very best AI in the world, it's probably different in reality, because we always have to factor in the cost of the model. This DeepSeek R1 model is completely stunning. And the crazy thing is that it costs just a fraction of what these other large language models cost.
Now, this chart is quite outdated, but it does show what happens when we look at the price discrepancies between models in terms of their inputs and outputs. Remember, the original R1 was literally between $1 and $2 in terms of price. But when we look at the other frontier models, the only company that was really cheap was, of course, Google. Now, this chart is talking about Gemini 1.5 Pro, but I would argue that models have actually gotten more expensive since. And so, remember how I was talking about price? The reason I think this is such a big deal is that the major thing about DeepSeek wasn't the fact that they released another state-of-the-art model. The main thing that triggered headlines was that they did it for a fraction of the cost, and they were able to provide a service that was far cheaper, far faster, and just overall more efficient. That's why Nvidia lost billions of dollars in market value in a single day: people were asking, if you can do it on significantly fewer Nvidia GPUs, why are we all buying the stock thinking it's the most valuable thing on earth? And here we are again. You guys can clearly see that when we compare DeepSeek R1 to other frontier models, there is a clear difference in pricing. Claude 4 Opus is at around $75 for output and around $15 for input, per 1 million tokens. Whereas DeepSeek is around $0.55 for input and around $2.19 for output. Guys, that is incredible in terms of the price-to-performance ratio. I mean, look at Gemini 2.5 Pro, look at all of these other models; the difference here is absolutely outstanding. And one thing people don't realize is that developers and consumers don't really have loyalty to one specific platform. Yes, you'll have loyalty to OpenAI because that's of course where your memory is stored.
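To make that price gap concrete, here's a minimal cost calculator. The per-million-token prices are the (assumed) figures along the lines of those quoted in the video ($0.55 in / $2.19 out for DeepSeek R1, $15 in / $75 out for Claude 4 Opus); real API prices change over time, so treat these as illustrative only.

```python
# Back-of-the-envelope API cost comparison using per-million-token prices.
# Prices are assumptions taken from the video's discussion, not live rates.

PRICES = {  # model -> (input $/M tokens, output $/M tokens)
    "deepseek-r1": (0.55, 2.19),
    "claude-4-opus": (15.00, 75.00),
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the API cost in dollars for a single job."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# Example: a benchmark-sized run with 10M input and 5M output tokens.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 10_000_000, 5_000_000):,.2f}")
```

For this hypothetical workload, the same job costs roughly $16 on DeepSeek R1 versus over $500 on Claude 4 Opus, which is exactly the kind of gap that makes cost-sensitive API users switch providers.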
It's definitely a great product, but developers, people building on APIs where the backend is just an LLM performing some complex task, are really going to make sure they use the cheapest model, because they can save a ton. And if that's the case for a lot of these companies, well, then maybe DeepSeek might just be eating into their market share. We can literally see here that the user who ran the test said it cost just around $3. Absolutely incredible. Now, I know we've spoken quite a bit about benchmarks, but there is something I want to show you guys that is rather interesting, because at the start of the video I spoke about how DeepSeek R1, the new edition, is essentially up to date with o3 and even surpassing companies like Anthropic. But this is where we introduce some new leaderboards, because I think leaderboards may be subject to a bit of gaming. And what I mean by that is that sometimes when these models are

Segment 3 (10:00 - 15:00)

released, they have sometimes been trained on the very tests that these leaderboards examine them on. So, the SEAL leaderboards are a set of expert-driven, third-party rankings for LLMs developed by Scale AI's Safety, Evaluations and Alignment Lab. These benchmarks are designed to provide a transparent, unbiased, and tamperproof assessment of the capabilities of frontier LLMs across several domains. And the main difference, which is why I think most people should be looking at these, is that they use private, curated datasets. Unlike many public benchmarks, SEAL uses proprietary datasets that are kept private to prevent models from being trained or fine-tuned on the evaluation data. So when we look at the results from SEAL, this approach ensures that the results are not gamed or contaminated by prior exposure, which makes them far more accurate. There are also things like expert evaluation, where all of the prompts and ratings are created and reviewed by verified domain experts, ensuring that the evaluations are rigorous, relevant, and trustworthy. Now, here is one of the leaderboards where I could actually find DeepSeek R1. I'm not sure if you guys can see it, so I'll quickly highlight it for you: DeepSeek is currently at number 12. This is MultiChallenge. Now, you might not know what MultiChallenge is. It's basically a test made to see how good AI models are at real conversations with humans: not just one message at a time, but full back-and-forth chats. Most AI tests ask whether the AI can answer a single smart question, but MultiChallenge is more like, can the AI keep up in a full conversation without getting confused or forgetting stuff? So, what does this test really test? Four things. Number one is instruction retention: did the AI remember what you told it to do at the start? Then user memory: can it connect things you said before with what's happening now? And editing properly.
Can it fix stuff in old messages when you say "change that part"? And self-coherence: does it sound like it knows what it's talking about, or is it just contradicting itself? When we look at this kind of benchmark, we can see that o3 is clearly at the top. So why is it that we see the newly released version of DeepSeek all the way down at position 12? Honestly, I'm not really sure. I think it might be due to how DeepSeek is trained in terms of the kind of outputs it produces, considering that this is more of a qualitative benchmark, and maybe the people doing reinforcement learning from human feedback on the model, since they're in China, have a different method of evaluating responses. That could potentially be why. Or maybe DeepSeek is just trained to be really good at academic tasks overall. Either way, it goes to show that when you look at really specific benchmarks, you can really gauge where a model excels and where it falters. This is why I really like looking at different benchmarks, especially from SEAL, because I can see the nuances of each model and what makes it special. And I think this is really useful, because oftentimes I would waste a lot of time testing out models to figure out what they're good at myself. But if I now know that DeepSeek isn't as good at instruction retention as other models, then of course I know it might not be the one I want to use. Now, I probably should have shown this before, but this is how this model compares to every other model. We can see that it looks pretty great. In some areas, like AIME, it surpasses Gemini 2.5 Pro. It also surpasses Qwen 3 235B. And honestly guys, I've made a video on Qwen 3 235B, and that model was absolutely outstanding. I did some rigorous testing of that model, and I really couldn't believe that a free model was so effective at those tasks.
I mean, it was really impressive that the model was 235 billion parameters. And when we actually look at what DeepSeek R1 is, the model is around 600 billion parameters. So, whilst yes, DeepSeek is a fair bit larger, I still think this is a rather incredible model. Now, there's also something to note: there isn't an LMSYS Arena leaderboard entry just yet, which is of course the benchmark that tests what humans would actually want in a conversation. This is a more qualitative benchmark, because you're not really looking at whether it got the right answer; it's verified by real humans. People use the website, submit a question, and blind-test whichever response they like more. Over time, the models people prefer get voted up and earn a higher Elo score, which is why LMSYS is one of the best benchmarks: there's no real benchmark fixing, and it can basically show you which

Segment 4 (15:00 - 20:00)

models are going to give you the best vibes and the best common sense. Currently, the models that are winning are Google's Gemini 2.5 Pro and, on the WebDev category, Claude 4 Opus. There are multiple different categories, so if you want to know exactly which model to use for each one, that would be the website to quickly look at. Now, with DeepSeek, they didn't just release R1. They actually distilled those capabilities down into a Qwen 3 8-billion-parameter base model, and it's absolutely incredible. That model achieves state-of-the-art performance among open-source models on AIME 2024, surpassing the original Qwen 3 8B by 10% and matching the performance of Qwen 3 235B Thinking. And they believe that the chain of thought from DeepSeek R1 will hold significant importance for both academic research on reasoning models and industrial development focused on small-scale models. Basically, they managed to take those incredible capabilities from the new edition of DeepSeek R1 and put them into a smaller model that is now basically state-of-the-art at 8 billion parameters. And if I'm not mistaken, I remember seeing someone on Twitter running this version on their phone. Honestly, intelligence being compact and on your phone is going to be absolutely incredible within the next 10 years. Just imagine having an agent like o3 running locally, offline, with your private information at your fingertips. But there is of course a really big issue coming, because whilst yes, DeepSeek is a great model, we might not actually get future versions of DeepSeek. I know that sounds pretty crazy, but governments around the world are actually looking to block access to DeepSeek. Right now it does seem like it's mainly for government employees, but in the future that could very well change. I mean, that's the same reason TikTok was under fire and potentially going to be banned.
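Going back to the distillation result for a second: the core mechanism behind compressing a big reasoner into an 8B model is knowledge distillation, where a student model is trained to match the teacher's softened output distribution. Here is a pure-Python toy sketch of that idea with made-up logits; a real pipeline would use a deep learning framework and the teacher's actual chain-of-thought outputs, so treat this strictly as an illustration of the loss, not DeepSeek's method.

```python
# Toy knowledge-distillation sketch: the student minimizes KL divergence
# between its softened token distribution and the teacher's.
# All logits below are invented for illustration.
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))

teacher = [4.0, 1.0, 0.2]   # teacher is confident in token 0
aligned = [3.8, 1.1, 0.3]   # student that mimics the teacher
random_ = [0.1, 0.1, 0.1]   # untrained, near-uniform student

print(kd_loss(teacher, aligned))  # small: distributions already close
print(kd_loss(teacher, random_))  # larger: student has learned nothing
```

The temperature softens both distributions so the student also learns the teacher's relative preferences among less likely tokens, which is where much of the "reasoning style" signal lives.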
So basically, the US government is seeking to ban DeepSeek, mainly due to national security, data privacy, and foreign influence concerns, because as you know, DeepSeek is a Chinese AI company. They've gained enormous popularity, but there are questions about how the service handles user data that have alarmed US officials and regulators. Now, the main reason for the national security concern is that DeepSeek may have connections to the Chinese Communist Party and Chinese state-owned enterprises like China Mobile. US officials fear that DeepSeek could be used for espionage, with sensitive government, corporate, or personal data potentially being accessed by Chinese authorities under China's national security laws, which require companies to share data with the government upon request. And of course, there are the data privacy concerns. DeepSeek stores user data on servers located in China, raising the risk of unauthorized access and surveillance by Chinese intelligence agencies. There's also a lack of transparency about how DeepSeek collects, processes, and shares user data, further fueling concerns about privacy and compliance with US and international standards, which is why other countries, not just the United States, have tried to ban DeepSeek. Now, recently we got this article from March 17th, reporting that US Commerce Department bureaus informed staffers in recent weeks that DeepSeek is banned on their government devices, according to a message seen by Reuters and two people familiar with the matter. "To help keep Department of Commerce information systems safe, access to the new Chinese-based AI, DeepSeek, is broadly prohibited on all GFE," meaning government-furnished equipment. "By using DeepSeek, users are unknowingly sharing highly sensitive proprietary information with the CCP, such as contracts, documents, and financial records,"
lawmakers wrote in a March 3 letter, referring to the Chinese Communist Party. "In the wrong hands, this data is an enormous asset to the CCP, a known foreign adversary." They note that numerous states have banned the model from government devices, including Virginia, Texas, and New York, and a coalition of 21 state attorneys general has urged Congress to pass legislation. So overall, we're already seeing US Commerce Department bureaus banning DeepSeek on government devices. Now, what's crazy about all of this is that we need to talk about DeepSeek's next model. Where on earth is R2? I'm not sure if some of you remember this, but R2 could be facing some huge delays. R2 is DeepSeek's next installment of the model, and one thing we're concerned about is that this model may actually be delayed entirely. You

Segment 5 (20:00 - 25:00)

see, the model was meant to be released in early May, and some people were saying it was going to be in April. Many people were even calling this a pivotal moment for the AI industry. But now, with recent restrictions and new laws being passed, it looks like we might not actually get DeepSeek R2 for quite some time. DeepSeek R2 is the highly anticipated successor to the company's breakthrough model, which shocked everyone in January. And the reason it shocked everyone, if you remember, is that it was believed at the time that Chinese companies couldn't really create systems matching the performance of leading Western companies. But when we saw that these guys were actually at the same level as Western companies, and doing it for cheaper, it raised some really big red flags, and people were like, wait a minute, we actually need to focus here, because China is not far behind at all. Now, what's crazy is that DeepSeek has reportedly built its entire R2 development strategy around Huawei's Ascend 910B AI chips. This decision wasn't made by choice but by necessity, as Chinese companies have been largely cut off from the most advanced AI chips made by companies like Nvidia due to export restrictions, and Huawei's Ascend chips have become the primary alternative for Chinese AI companies that need powerful computing to train their models. These chips are essential for DeepSeek, and I'm going to show you why it gets so complicated and why we could be facing major delays. The Ascend 910B chips are domestically produced Chinese processors that Huawei claims can compete with older-generation Nvidia chips like the A100. And, here's where it gets interesting, DeepSeek has reportedly achieved 82% utilization of these Huawei chip clusters, reaching 512 petaflops of computing power at FP16 precision.
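Those cluster figures can be sanity-checked with some quick arithmetic. Sustained throughput equals peak throughput times utilization, so the video's numbers (82% utilization, 512 PFLOPS sustained at FP16) imply a theoretical peak of roughly 624 PFLOPS. These are the video's reported figures, not verified hardware specs:

```python
# Quick arithmetic on the cluster numbers quoted above (assumed figures).
utilization = 0.82
sustained_pflops = 512.0

# sustained = peak * utilization  =>  peak = sustained / utilization
peak_pflops = sustained_pflops / utilization
print(f"Implied cluster peak: {peak_pflops:.0f} PFLOPS")

# If per-chip performance really is ~60% of an H100's, delivering the same
# sustained compute would take roughly 1.67x as many Ascend-class chips.
chips_needed_ratio = 1 / 0.60
print(f"Ascend chips per H100-equivalent: {chips_needed_ratio:.2f}")
```

The takeaway is that even at impressive utilization, a per-chip performance deficit translates directly into needing proportionally more chips, more interconnect, and more power for the same training run.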
And while this represents an impressive engineering achievement, the performance is estimated to be only around 60% of what Nvidia's current H100 chips can deliver. Now, this is where things get complicated: in May 2025, the United States Department of Commerce's Bureau of Industry and Security issued unprecedented guidance declaring that using Huawei Ascend chips anywhere in the world violates US export controls. This was a significant escalation in the ongoing technology war between the United States and China, extending US legal jurisdiction globally in an unprecedented manner. The US government's position is that Huawei Ascend chips were likely designed with certain US software or technology, or produced with semiconductor manufacturing equipment that is a direct product of certain US-origin software or technology, or both. That basically means that even though the chips are manufactured in China, the United States is claiming legal authority over their use, because they allegedly contain American IP or were made using American-designed equipment. And the implications of this ruling are severe: anyone, anywhere in the world, who uses these Huawei chips could face criminal penalties under US law. This is what is messing with DeepSeek. It creates an extraordinary legal risk, because if they use these chips to produce the model, they could face serious ramifications. So we can see why DeepSeek R2 is probably going to be a crazy model, but they're likely to face some delays. Now, beyond the legal issues, there are some significant technical challenges that DeepSeek R2 is going to face with the Huawei chips, and these could slow down the timeline. There are multiple reports indicating that Huawei's Ascend chips suffer from serious stability and performance issues that make them problematic for large-scale AI training.
Chinese AI firms using Ascend chips have already complained about hardware performance problems, particularly stability issues, which are critical for AI model development. If you didn't know, AI models require hardware to run continuously for extended periods during training, sometimes for weeks or months without interruption, and the Ascend chips reportedly suffer from frequent crashes and stability problems that make sustained training workloads extremely difficult. Users have also reported that these Ascend chips have slower inter-chip connectivity compared to Nvidia alternatives. This connectivity is what allows developers to combine multiple chips into clusters that work together on massive AI training runs, and slower connectivity means training processes take much longer to complete, significantly increasing costs and development time. Even Huawei's own staff have reportedly expressed concerns about their platform, saying that the Ascend hardware is difficult and unstable to use. So, if DeepSeek is forced to abandon these Huawei chips due to the legal concerns, they would face the enormous challenge of retraining their R2 model on different hardware. Training a model with 1.2 trillion parameters, which is the rumored size, requires months of continuous computation on thousands of processors, and the process involves feeding the AI

Segment 6 (25:00 - 29:00)

system 5.2 petabytes of training data, which is equivalent to millions of books' worth of text. Starting this entire process over on different hardware would not be a simple matter of copying files from one system to another. Different AI chips have different architectures, programming interfaces, and optimization requirements. Code that has been specifically optimized to run efficiently on Huawei's Ascend chips would need to be completely rewritten for alternative processors, and this software conversion alone could take months of engineering work. The training process itself would then need to be restarted from the beginning; even with faster hardware, training a model of R2's size typically requires several months of continuous computation. And if DeepSeek is forced to switch to less powerful alternatives to the Huawei chips, the training time could be even longer than their original timeline. So overall, whilst I'm super excited about DeepSeek and their future models, they are facing a perfect storm of technical, legal, and strategic challenges that illustrates the complex realities of AI development in an increasingly fragmented global technology landscape. Their reliance on these Huawei chips, initially seen as a clever workaround for the US export restrictions, is now becoming a potential liability. Now, with regards to what we might actually see when DeepSeek R2 and further versions come out, I do think the current information on the internet is probably kind of accurate. I mean, when we look at this table of DeepSeek R2 rumors versus the known models, we can see there is quite a bit of difference. Now, let me add this, just to be completely honest with you guys: some of this information is basically rumors from Chinese stock forums, because when I was doing research with multiple AI models, that was the only thing I could really find.
So, for one, the parameter count of 1.2 trillion is of course a rumor. We do know that current AI models don't really have that many parameters, considering they've managed to make these models so much more efficient; especially with mixture of experts, you don't really need that many parameters active at a given time. We've already seen with Qwen 3 what that model has been able to do in terms of remarkable efficiency. And with all of the distillation going on, it's quite likely that the count might not be that high. The reason it's actually pretty hard to guess things like the parameter count is that AI companies often really don't want you to know those numbers, because they're essentially the secret sauce. So when these models are made, companies try not to release too much that competitors could copy. Now, of course, we are expecting a hybrid architecture, which is basically a mixture of experts, and apparently, as I spoke about earlier in the video, there's going to be 5.2 petabytes of training data. This is completely rumored, but still kind of interesting to see. Now, for the API costs, the rumors put it at around $0.07 per million input tokens and $0.27 per million output tokens. I personally don't know if that is going to be possible, especially with frontier-level intelligence, because I think we're actually seeing the inverse trend. Whilst yes, price-to-intelligence is going down, we're also on a new curve where models get smarter by reasoning for longer. If they're reasoning for longer, it means they're going to use more tokens and thus cost more overall. So I'm not entirely sure that the cost will be that much cheaper, although I don't expect it to be much more expensive than these other frontier models.
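That point about reasoning length offsetting cheaper per-token prices can be shown with simple arithmetic. The sketch below uses R1's quoted output price ($2.19/M tokens) against the rumored R2 price ($0.27/M tokens), with hypothetical token counts invented for illustration:

```python
# Illustration: a lower per-token price can still mean a higher per-query
# cost if the model emits a much longer reasoning trace.
# Prices and token counts are assumptions, not measured values.

def output_cost(price_per_million: float, tokens: int) -> float:
    """Dollar cost of generating `tokens` output tokens."""
    return price_per_million * tokens / 1e6

r1 = output_cost(2.19, 2_000)    # short answer: 2k output tokens
r2 = output_cost(0.27, 50_000)   # long reasoning trace: 50k output tokens

print(f"R1, short answer:     ${r1:.4f}")
print(f"R2, long reasoning:   ${r2:.4f}")
```

Even at roughly 8x lower per-token pricing, a 25x longer trace makes the individual query more expensive, which is the trend the video is pointing at.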
One of the things I am wondering about DeepSeek R2 is whether it will have any vision capabilities, and whether it will still be open source. We do know that they've pledged to be open source for quite some time, but we've already seen some companies sort of undo that open-source nature, and here I'm actually referring to Mistral AI. I think they realized that being open source is great, but it's quite hard to maintain when your training runs cost millions of dollars. And DeepSeek could potentially just undercut OpenAI. I mean, if DeepSeek was, let's say, $5 a month or even $10 a month, would people still pay for it? I think they might. And I know it sounds crazy to have an AI that costs just $5 a month, but considering that in many current Western economies there's a cost-of-living crisis, I'm pretty sure people would happily pay $5 a month for frontier intelligence, even if it was made by a Chinese company. As always, I've said that people don't really have loyalty to companies, but rather loyalty to what is best for
