🎓 Learn AI In 10 Minutes A Day - https://www.skool.com/theaigridacademy
🐤 Follow Me on Twitter https://twitter.com/TheAiGrid
🌐 Want to learn even more AI? https://www.youtube.com/@TheAIGRIDAcademy
Links From Today's Video:
https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/
Welcome to my channel where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos.
Was there anything I missed?
(For Business Enquiries) contact@theaigrid.com
Music Used
LEMMiNO - Cipher
https://www.youtube.com/watch?v=b0q5PR1xpA0
CC BY-SA 4.0
LEMMiNO - Encounters
https://www.youtube.com/watch?v=xdwWCl_5x2s
#LLM #Largelanguagemodel #chatgpt
#AI
#ArtificialIntelligence
#MachineLearning
#DeepLearning
#NeuralNetworks
#Robotics
#DataScience
Contents (5 segments)
Segment 1 (00:00 - 05:00)
Google just released the smartest model in the world and nobody seems to be talking about it. Google said today they're releasing a major upgrade to Gemini 3 Deep Think, their specialized reasoning model built to push the frontier of intelligence and solve modern challenges across science, research, and engineering. They've updated Gemini 3 Deep Think in close partnership with scientists and researchers to tackle tough research challenges where problems often lack clear guardrails or a single correct solution, and where data is often messy or incomplete. And I think this is by far one of the biggest AI updates this year. Yes, you heard that right, this year, and that's because of the major improvements not only in the benchmarks but in how good the model actually is. Now, it gets completely crazy once I dive into these benchmarks, because Google, I don't want to say they hid this release, but I don't think they hyped it up enough. So, for Gemini 3 Deep Think, one of the first benchmarks we can see here is Humanity's Last Exam. It's designed as the name suggests: the idea is that once that exam is solved, it's basically a done deal and AI is approaching expert-level reasoning across many academic domains. This benchmark tests advanced reasoning across problems in maths, physics, computer science, logic, and scientific reasoning. And in this one the model can't use external tools like calculators, code execution, search, or external software; it's solving all the problems by reasoning alone. And we can clearly see that Gemini 3 Deep Think is surpassing even Claude Opus 4.6, which was released not even a week ago.
And this is once again clearly surprising, because you would have thought it would be at least a month or two before we saw even a marginal improvement, but already we're seeing an 8% improvement. And remember, this is a benchmark that isn't supposed to be solved at this level yet. That is the continued theme you're about to see. Take a look at the next benchmark: Codeforces, which is arguably even more insane, because Codeforces is basically the most prestigious competitive programming platform in the world. Programmers solve algorithmic problems under time pressure and get an Elo-style rating similar to chess rankings. Roughly, 1,200 is a beginner, 1,600 a solid amateur, 1,900 a strong competitor, and the values go all the way up to around 3,500, which is basically unheard of for humans; only a handful of the absolute best competitive programmers in history have touched that range. And what do we see on this chart? Gemini 3 Deep Think scoring 3,455, meaning it's performing at a level that rivals or exceeds virtually every human competitive programmer on Earth. We're talking about complex algorithmic reasoning: dynamic programming, graph theory, number theory, combinatorics. Not just writing code, but solving genuinely hard puzzles that are half math, half CS. Claude Opus 4.6 is still pretty strong, but that gap is massive: an 1,100-point gap is a huge difference, the difference between a strong human competitor and superhuman performance. And this is one of the benchmarks that is pretty hard to game, because the problems require genuine multi-step reasoning, not just pattern matching or memorization.
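To put that Elo-style gap in perspective, here's a minimal sketch using the standard Elo expected-score formula. The formula itself is standard; the matchup below just mirrors the roughly 1,100-point gap quoted above and is only illustrative:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo formula: expected probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# An ~1,100-point gap, as quoted above (3,455 vs. roughly 2,355):
print(round(elo_expected_score(3455, 2355), 4))  # → 0.9982
```

Under Elo, every 400 points of gap multiplies the odds ratio by 10, which is why an 1,100-point gap translates to a near-certain win for the higher-rated side.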
So if these numbers hold up under independent evaluation, it basically suggests that Google has some really incredible reasoning capabilities, particularly the deep, structured, logical thinking that competitive programming demands. And I saw a tweet that puts all of this into perspective. As I already said, you know how crazy Codeforces is, but this tweet says: "This is insane. Gemini Deep Think just scored 3,455 on Codeforces, equivalent to the eighth-best competitive programmer in the world. The previous best was 2,727 from OpenAI's o3. This is an absolutely superhuman result for an AI and technology at large." You can literally see this would be number eight in the world, so there are only seven humans currently better than that, which is pretty incredible when you put it into perspective. Not only that, we have MMMU-Pro. On this one there wasn't such a major jump, but I don't think this benchmark is saturated just yet, even though that's usually what you tend to assume when a benchmark shows no major jumps. It's just that Deep Think, what that model actually is, is essentially extended chain-of-thought reasoning. And this reasoning
Segment 2 (05:00 - 10:00)
helps on math, coding, and logic benchmarks. But if you didn't know, the MMMU benchmark basically tests whether the model can see and interpret complex academic visuals: think circuit diagrams, histograms, medical imaging, art-history plates. If the vision encoder misreads the image, no amount of extra thinking fixes that; you can't reason your way out of a perception error. And MMMU-Pro was designed to resist those reasoning shortcuts. The original MMMU was getting gamed, so MMMU-Pro augmented the answer options and filtered out questions solvable without truly understanding the image. That means the remaining questions genuinely require multimodal grounding, which is an architecture-level capability, not a test-time reasoning win. So we're probably going to have to see improvements in the vision side of these models if we want improvements on benchmarks like MMMU. Which is why, if you're going to use this model (I'll have a tutorial on this later), for vision tasks it doesn't really make sense to use it over the Gemini 3 Pro preview, because it's literally just an extended-reasoning version of it; that's why it's called Deep Think. If we do get that 90%+ score, which could happen in the next iteration of models, we're probably going to saturate those benchmarks, which will be interesting. And I don't want this to be a complete benchmark video, but I do want to talk about ARC-AGI 2, because I feel like this is once again a case of a benchmark getting saturated on a time frame most people didn't expect.
ARC-AGI 2 is notoriously one of those really hard benchmarks that you would think would stand the test of time for at least a year or two, but in just a few short months we've gone from 30% with Gemini 3 all the way to 84.6% with Gemini 3 Deep Think. This is pretty crazy, because if you aren't familiar with the benchmark, humans average about 60% on it, and it's visual reasoning puzzles. So Deep Think significantly surpassing a benchmark specifically designed to test genuine intelligence rather than memorization is pretty surprising, and that gap is pretty big. You've got Claude Opus 4.6 at 68.8%, but think about the reasoning jump from 30% to 84.6%: that's a 54.6-point improvement over the base model, which is pretty insane. When you look at why this matters: unlike traditional benchmarks that test memorization, ARC-AGI measures the model's ability to apply new skills to novel tasks it has never seen before. Most benchmarks can be gamed by training on similar data, but ARC-AGI 2 was designed so you can't brute-force it with pattern matching; you need actual abstract reasoning. As for Deep Think, if you aren't familiar with how it works, it's basically reasoning that uses iterative rounds to explore multiple hypotheses simultaneously before producing a response. It's saying: we're not going to focus on speed here, we're going to focus on accuracy and put a lot more compute into it. That's of course why this is on the $200-a-month tier: you're paying for all that extra thinking time. Now, this model is good, but one of the things I want you to know is that it's really a tool for scientific research. On the DeepMind page they spoke about three key examples where Deep Think is enabling scientists to achieve a lot more.
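Google hasn't published how Deep Think's parallel hypothesis exploration works internally, but the general idea of generating several candidate answers and keeping the best-scoring one can be sketched in a few lines. Everything here, the function names and the toy verifier, is illustrative, not the real system:

```python
import random

def best_of_n(problem, generate, score, n=8):
    """Sample n candidate answers and keep the highest-scoring one.
    A toy stand-in for 'explore multiple hypotheses simultaneously'."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda c: score(problem, c))

# Toy demo: "solve" x*x == 49 by random guessing plus a verifier-style scorer.
random.seed(0)
guess = lambda p: random.randint(-10, 10)
closeness = lambda p, x: -abs(x * x - p)  # higher is better; 0 means solved
print(best_of_n(49, guess, closeness, n=64))
```

The key design point is that generation can be cheap and noisy as long as the scorer (verifier) is reliable; spending compute on more candidates then trades speed for accuracy, which matches the speed-versus-accuracy framing above.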
And so we're going to take a look at those examples now, because I think it's important to see how this is advancing the frontier of science. The first one is Lisa Carbone, a mathematician at Rutgers University who works on mathematical structures required by the high-energy physics community to bridge the gap between Einstein's theory of relativity and quantum mechanics. — I've been using AI in my research. It really has the potential to accelerate discoveries. My research in infinite-dimensional algebras and symmetry is really a tool for the high-energy theoretical physics community looking to combine Einstein's theory of gravity with quantum mechanics. I was working on a paper with a colleague which took several years to prepare. Before sending it out to the journal, I decided to put it through Gemini for fact-checking and verification. It came straight back with: no, that's not correct; Proposition 4.2 is mathematically incorrect as stated. It gave three separate irrefutable reasons why our mathematical arguments around one particular statement were incompatible. This was pretty destabilizing, because the paper had already been peer-reviewed. I debated it, and the model didn't try to appease me, as most AI models do by guessing what you want to hear. It took me a while to understand, because it was really outside of my thought process, and the model's reasoning was completely correct. The paper is at the forefront of research in the subject, so there's very little context or training data
Segment 3 (10:00 - 15:00)
that the model could have been trained on. So it seemed as if it did the work of a highly trained mathematician. It helped us realize that we didn't need the full claim of that result, and that a simpler result was actually true. Once we have a unified theory of all the forces of nature, it'll completely change our understanding of ourselves and of the universe. — Then we have the Wang lab, using Deep Think to optimize fabrication methods for complex crystal growth, toward the potential discovery of new semiconductor materials. — In my lab, we use Deep Think to design new semiconductors, and we found the results are awesome. We wanted to grow a 100-micron 2D semiconductor; using the Deep Think-suggested recipe, we got a size of 130 microns, the best result ever in our lab. As silicon reaches its theoretical limit, the lab is using Deep Think to work with new materials in the 2D space. Two-dimensional materials are a family of materials with single-molecule thickness, a natural choice for futuristic electronics because that thickness is extremely small. Growing two-dimensional materials is challenging, and the challenge is how to choose the parameters. You have to tune the gas flow, and we use a furnace to heat it up. It takes an expert weeks or even months to find the sweet spot in the parameters. Deep Think doesn't just give a temperature number; it gives a whole thermal profile, and it has accumulated the recent advances in science. We're very excited. This is just the beginning: the Deep Think API opens a new door to automating many of our current instruments. — And then we have Anupam Pathak, an R&D lead in Google's Platforms and Devices division and former CEO of Liftware, who tested the new Deep Think to accelerate the design of physical components. — I just love building things. You know, I was one of those kids that was always taking things apart, and then I quickly realized that you can do that to help people.
The power of good design is that you can transform the world and transform other people's lives for the better. Using Gemini in Deep Think mode, we're now able to design and iterate faster than ever before. This is one of the products we had when we were a startup; it was designed for people with cerebral palsy or spinal cord injuries. I've been focusing a lot in the last year on how we can enable Deep Think to actually make the design process ten times faster. I can just send an image or a prompt, and it's able to think it through and come up with several candidate options for us, new designs that we hadn't even thought of before. One thing we did was try to challenge the model. I gave it an image of a turbine blade, and it came up with a design, and then I was actually able to talk to the model to change the pitch of the blades and even the shape. I myself am not a CAD designer, so I would not have known how to make that. The AI tools we see today, I see them more as accelerants, and that's what makes me really optimistic and hopeful. We can rapidly explore different material options and focus on research questions and technologies that don't exist today. There are many problems in this world still, and there are huge opportunities for us to make things better and get products to market much faster. — So hopefully those examples showed you that this isn't just a model that's benchmark-hacking and all of that vibe stuff. This is a real model that real engineers, scientists, and people are using, so it's clearly going to have some level of impact, and I think you can imagine the trajectory this is going to take. And it wasn't done there. Google didn't just release Deep Think as a standalone model and call it a day. They went a step further and built something on top of it called Althia.
And this is basically an AI research agent specifically designed to solve professional-level math, physics, and computer science problems. And this matters, because up until now, AI models have been great at solving textbook problems, stuff that already has known answers. But Althia is different. It's being pointed at open research problems, the kind of stuff mathematicians have been stuck on for decades, and it's actually making progress. It has done a decent amount of autonomous research. If you take a look at this, Google's Althia agent essentially wrote an entire research paper from start to finish with zero human involvement. Nobody guided it, nobody edited it, nobody told it what to research. It picked the problem, solved it, wrote it up, and the paper has been submitted to an actual academic journal for publication. The paper calculates
Segment 4 (15:00 - 20:00)
something called weights in arithmetic geometry. And honestly, it doesn't even matter if you know what that means. What matters is that this kind of work would normally take a PhD mathematician weeks or months to produce, and the AI did it autonomously. This is the first time we've seen AI go from "it can help you with your research" to "it can do the research." That's fundamentally a different thing: going from AI as a tool to AI as a colleague. But it gets crazier. Google didn't just test Althia on one problem; they pointed it at a database of 700 unsolved math problems. These are from the Erdős conjectures, a famous collection of math questions posed by one of the greatest mathematicians of the 20th century, Paul Erdős. Some of these problems have been sitting unsolved for decades, and mathematicians around the world have been chipping away at them for years. Althia went through 700 of them and autonomously solved four. And on one specific problem, Erdős 1051, the AI didn't just solve it; its work led to a broader generalization that became its own published research paper, written by a team of mathematicians building on what the AI found. So essentially you're seeing two modes here: full autonomy, where the AI agent just solves things by itself, and collaboration, where it acts like a research partner. Both are working, both are producing publishable results, and that's never really happened before. Google also created a classification system to rank how significant these AI research contributions are and how much the AI did versus humans. You can read the table from top to bottom. Levels three and four, significant advance and landmark breakthrough, are empty. Google is actually being honest here: AI hasn't cured cancer or solved a Millennium Prize problem yet, and that is some important context.
But if you look at level two, that's where you've got publishable research, and that row is stacked. These results are good enough to be submitted to real academic journals, and they span all three columns. Some were mostly human with AI helping, some were a genuine 50/50 collaboration, and one, the weights-in-arithmetic-geometry paper, was essentially autonomous: the AI did it alone with no human work. Now, if you look at the far-right column, essentially autonomous, that's where the AI is solving problems by itself at levels zero and one and producing publishable research at level two. That column is doing things that didn't exist 12 months ago. Look, the big picture, and what I've realized after diving through all of this, is that we're watching in real time as AI climbs the table. Right now it's filling in levels zero through two. The question isn't whether it reaches level three or four; it's when. And based on how fast Deep Think improved in just six months, that time might be shorter than anyone expects. Now, if you want to know how this Althia agent actually works: it's basically how a smart human solves hard problems. It will generate an answer, then check its own work. If the answer is correct, it's done and can be sent off. If it has small mistakes, it sends it to a reviser to fix them. If the answer is completely wrong, it throws it out and starts over from scratch. And it keeps running this loop again and again until it gets it right. Think of a student who proofreads an essay, rewrites the bad parts, and repeats; this is the same thing, except it happens hundreds of times in seconds. Now, if you look at the results, it's pretty crazy, and this graph tells you how quickly things are moving. You're looking at two versions of the same model tested six months apart.
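That generate-verify-revise loop can be written down almost directly. This is a sketch of the control flow as just described, with hypothetical callback names; Althia's actual internals aren't public:

```python
from typing import Callable, Optional

def generate_verify_revise(problem: str,
                           generate: Callable[[str], str],
                           verify: Callable[[str, str], str],
                           revise: Callable[[str, str], str],
                           max_rounds: int = 10) -> Optional[str]:
    """verify() returns 'correct' (done), 'minor' (small fixable slips,
    send to the reviser), or 'wrong' (discard and start over)."""
    answer = generate(problem)
    for _ in range(max_rounds):
        verdict = verify(problem, answer)
        if verdict == "correct":
            return answer                      # checked work: send it off
        if verdict == "minor":
            answer = revise(problem, answer)   # patch the small mistakes
        else:
            answer = generate(problem)         # throw it out, regenerate
    return None  # budget exhausted without a verified answer

# Stub demo: the first draft has a small slip, the reviser fixes it.
gen = lambda p: "2+2=5"
ver = lambda p, a: "correct" if a == "2+2=4" else "minor"
rev = lambda p, a: "2+2=4"
print(generate_verify_revise("compute 2+2", gen, ver, rev))  # → 2+2=4
```

The interesting design choice is that the verifier gates everything: the loop only ever returns an answer the checker has signed off on, which is exactly the "proofread, rewrite, repeat" behavior described above.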
The dark blue line is Deep Think from July 2025, the version that won gold at the International Mathematical Olympiad. At the time that was headline news, and it maxes out around 65 to 68% no matter how much thinking time you give it. Now look at the light blue line. That is the January 2026 version, the same model family six months later. It starts higher, climbs faster, and peaks at around 90%. At every single point on the graph, the new line is crushing the old one, and the gap between those two lines is six months of progress. The bottom axis is how much thinking time the model gets, which is of course more compute to reason through the problem. As you move from left to right, the model thinks harder and harder, and the pattern is clear: more thinking equals better answers. But the new version gets dramatically more out of that extra thinking time than the old one did. The green star is Althia, the research agent built on top of Deep Think. Notice that it sits at around 93 to 94%, but at less compute than where Deep Think peaks, meaning the actual wrapper around that
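One simple intuition for why extra thinking or extra attempts help, but with diminishing returns: if each independent attempt succeeds with probability p, then at least one of k attempts succeeds with probability 1 − (1 − p)^k. The 30% per-attempt rate below is made up purely for illustration and is not read off the graph:

```python
def at_least_one_success(p: float, k: int) -> float:
    """P(at least one of k independent attempts succeeds) = 1 - (1 - p)^k."""
    return 1.0 - (1.0 - p) ** k

# Illustrative per-attempt success rate of 30%:
for k in (1, 2, 4, 8, 16):
    print(k, round(at_least_one_success(0.30, k), 3))
# Each doubling of k still helps, but by less and less: a saturating curve,
# which is the same qualitative shape as the thinking-time curves above.
```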
Segment 5 (20:00 - 22:00)
is more efficient: it's smarter and more efficient, not just brute-forcing answers with more compute. It's essentially reasoning with more intelligence. Six months, from 65% to 90% on math olympiad problems. That's the rate of improvement that should keep you watching this space. Now, the next graph is where it gets pretty humbling, because the last one was math olympiad level, and those problems are hard, but they're designed for students. This graph is PhD-level math, the kind of stuff professional mathematicians work on for their actual careers. Look at where the model starts on the left: literally 0% with minimal thinking time. It can't solve a single one, and these problems are hard. As you give it more compute, it climbs, but look at how messy that is compared to the previous graph: it spikes to 30%, drops back to 17%, bounces to the low 20s. The model is struggling; this is pushing it to its limits. Then if you look at the far right of the graph, once you throw serious compute at it, the line finally breaks through and climbs to about 38%. And that upward trajectory at the end is the most important part. It means that more thinking does equal better results; the scaling law still works even at this level, it hasn't hit the ceiling yet, and it's still climbing. And once again, if we look at the green star, which is Althia, sitting at around 46%, the agent is outperforming raw Deep Think by a significant margin, and it's doing it with less compute. The generate-verify-revise loop from the flowchart we saw earlier is what's doing the real work here: letting the AI check and correct its own reasoning is the difference between 38% and 46% on problems that most humans with a PhD would struggle with.
The big takeaway from all of this is that we're watching an AI go from not being able to solve a single PhD-level problem to solving a meaningful fraction of them, and that curve is still going up. This whole release has been absolutely crazy to watch. One use case Google did show us was the model turning a sketch into a 3D model, an STL file you could then actually use in the real world as a laptop holder. I'll probably have a tutorial on this around 5 or 6 hours after this video is live, but I think it's going to show just how crazy this model is and just how far things are coming along at Google. I'm not sure why Google didn't talk about this model even more, but it is certainly amazing.