# Chinese Researchers Just Cracked OpenAI's AGI Secrets

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=LyKRUwLNPO8
- **Date:** 01.01.2025
- **Duration:** 15:53
- **Views:** 212,144
- **Source:** https://ekstraktznaniy.ru/video/13449

## Description

Join my AI Academy - https://www.skool.com/postagiprepardness 
🐤 Follow Me on Twitter https://twitter.com/TheAiGrid
🌐 Check out my website - https://theaigrid.com/

00:00 - Intro: OpenAI’s O1 model
00:22 - Secrets of O1
01:05 - Chinese Paper on O1
01:48 - Reinforcement Learning Basics
02:29 - Four Key Steps
03:46 - Policy Setup
06:08 - Reward Design
07:42 - AI Thinking (Search)
09:18 - Tree Search & Revisions
12:01 - How AI Learns
12:47 - Training Methods
14:16 - Iterative Learning Cycle
15:30 - Superintelligence Coming Soon?

Links From Today's Video:
https://x.com/rohanpaul_ai/status/1872713137407049962

Welcome to my channel, where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos.

Was there anything I missed?

(For Business Enquiries)  contact@theaigrid.c

## Transcript

### Intro: OpenAI’s O1 model [0:00]

So OpenAI is the leading AI company, and of course their recent iteration of models, the o1 series, is by far the most advanced AI that we currently have access to. Now, incredibly, this AI model has been shrouded in secrecy, to the point that if you ever dare to ask the model what it was thinking about while it was giving you a response,

### Secrets of O1 [0:22]

the model gives you a response where it tells you never to ask a question like that again, and if you do it too many times you can actually get banned from using OpenAI's service. Now, the reason this is shrouded in so much secrecy is that this is a big step towards AGI, and many think that OpenAI is quite likely to be the first company to achieve it. With that being said, many have wanted to know exactly how this system works, and there have been many different attempts to find out. OpenAI has of course published a few different publications, but nothing to the point where we truly understand what's going on beneath the hood. However, there has been a recent research paper from a group of researchers in China, and we are now asking ourselves if they just managed to crack the code. Did they

### Chinese Paper on O1 [1:05]

figure out how o1 works and release a roadmap to build something similar? So this is the paper: "Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective". And this is the paper that could change everything, because if this is true, then it means the playing field is leveled, and it's only a matter of time before many other companies start to produce AI models that are on par with OpenAI's. Now, I'm actually going to break this down into four parts, but let's first understand the basics of how this AI thing even works. One of the first things we have is, of course, reinforcement learning. Essentially, we can use a game analogy: imagine you're trying to teach a dog a trick. You would give this dog a treat

### Reinforcement Learning Basics [1:48]

which is the reward, when it does something right, and it then learns to repeat those actions to get more treats. That is basically reinforcement learning. Now, with AI, the dog is essentially a program, the treat is a digital reward, and the trick could be anything from winning a game to writing code. Now, why is reinforcement learning important for the o1 series? It's because OpenAI seems to believe that reinforcement learning is the key to making o1 so smart. It's basically how o1 learns to reason and solve complex problems through trial and error. Now, there are four pillars of this according to the paper. You can see right here they give us an overview of how o1
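The dog-and-treat picture above can be sketched in a few lines. This is an illustration only: it uses a tiny epsilon-greedy bandit, which is a choice made here and not a method taken from the video or the paper. The action names, the reward of 1.0 for "sit", and the `epsilon` and `step_size` values are all assumptions.

```python
import random

def train_by_trial_and_error(episodes: int = 500, epsilon: float = 0.2,
                             step_size: float = 0.1, seed: int = 0) -> dict:
    """Learn which action earns the treat purely from observed rewards."""
    rng = random.Random(seed)
    # Estimated value of each action; both start at zero (no prior knowledge).
    value = {"bark": 0.0, "sit": 0.0}
    for _ in range(episodes):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action.
            action = rng.choice(list(value))
        else:
            # Exploit: pick the action currently believed to be best.
            action = max(value, key=value.get)
        # Hypothetical environment: only "sit" earns the treat.
        reward = 1.0 if action == "sit" else 0.0
        # Nudge the estimate toward the reward that was just observed.
        value[action] += step_size * (reward - value[action])
    return value

learned = train_by_trial_and_error()
print(learned)
```

The rewarded action ends up with the higher estimated value, so the greedy choice repeats it, which is the "learns to repeat those actions to get more treats" behavior in miniature.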

### Four Key Steps [2:29]

essentially works. We've got policy initialization: this is the starting point of the model. It sets up the model's initial reasoning abilities using pre-training or fine-tuning, and it's basically the foundation of the model. We've got reward design, which is how the model is rewarded, which we just spoke about; I'm going to speak about that in more detail. Then we've got search, which happens at inference time, when the model is quote-unquote thinking: this is how the model searches through different possibilities. And of course we have learning, which is where you improve the model by analyzing the data generated during the search process, and then you use different techniques such as reinforcement learning to make the model better over time.

Essentially, the central idea is reinforcement learning, and the core mechanism ties these components together. The model, which is the policy, interacts with its environment; data flows from search results into the learning process; and the improved policy is fed back into the search, creating a continuous improvement loop. The diagram basically emphasizes the cyclic nature of the process: search generates data for learning, learning updates the policy, and yada yada. So if we want to understand how this works, we have to understand the policy. This is the basics; this is the foundation of the model. Imagine you're teaching someone to play a complex game like chess. You wouldn't throw them into

### Policy Setup [3:46]

a match against a grandmaster on their first day, right? You'd start by teaching them the basics: how the pieces move, basic strategies, and maybe some common opening moves. That's essentially what policy initialization is for AI. Now, in the context of a powerful AI like o1, policy initialization is essentially giving the AI a very strong foundation in reasoning before it even starts trying to solve really hard problems. It's about equipping it with a basic set of skills and knowledge that it can then build upon through reinforcement learning.

The paper suggests that for o1 this head start likely comes in two main phases. Number one is pre-training, which we can see here, where you train it on massive text data. Think of this like letting the AI read the entirety of the internet, or at least a huge chunk of it. By doing this, the AI learns how language works and how words relate to each other, and gains a vast amount of general knowledge about the world. Think of it like learning grammar, vocabulary, and the basic facts before trying to write a novel. It will also learn basic reasoning abilities by training on this data.

Then this is where we get to the important bit, which is fine-tuning with instructions and human-like reasoning. This is where we actually give the AI more specific lessons on how to reason and solve problems, and it involves two key techniques, which we can see right here: prompt engineering and supervised fine-tuning. Prompt engineering is where you give the AI carefully crafted instructions or examples to guide its behavior. The paper mentions behaviors like problem analysis, which is where you restate the problem to make sure it's understood, and task decomposition, which is breaking down a complex problem into smaller, easier steps, where you literally say, you know, "first, think step by step." And then supervised fine-tuning, which is right here, SFT, involves training the AI on examples of humans solving problems, basically showing it the right way to think and reason. It could involve showing it examples of experts explaining their thought process step by step.

So in a nutshell, policy initialization is about giving the AI a solid foundation in language, knowledge, and basic reasoning skills, setting it up for success in the later stages of learning and problem solving. This phase of o1 is crucial for developing human-like reasoning behaviors in AI, enabling it to think systematically and explore solution spaces efficiently. Next we get

### Reward Design [6:08]

to something super interesting: this is where we get to reward design. This image that you can see on the screen illustrates two types of reward systems used in reinforcement learning: outcome reward modeling, which is ORM over here, and process reward modeling, which is PRM. As for the explanation, it's actually pretty straightforward. Outcome reward modeling only evaluates the solution based on the final result, so if the final answer is incorrect, the entire solution is marked as wrong, even if most steps are correct. In this example there are some steps that are actually correct, but because the final output is incorrect, the entire thing is just marked as wrong.

This is where we use process reward modeling, which is much better. Process reward modeling evaluates each step in the solution individually: we reward the correct steps and penalize the incorrect ones. This provides more granular feedback, which helps guide improvements during training. We can see that steps one, two, and three are correct, so they receive rewards, and steps four and five are incorrect and are thus flagged as errors. This approach is far better because it pinpoints the exact errors in the process rather than discarding the entire solution. The diagram basically emphasizes the importance of process rewards in tasks that involve multi-step reasoning, as it allows for iterative improvements and better learning outcomes, which is essentially what they believe o1 is using. Now this
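To make the contrast between the two reward schemes concrete, here is a minimal sketch of the five-step example described in this section. The per-step correctness labels are supplied by hand; in a real system a learned reward model would have to produce them. The function names and the reward values (1.0, 0.0, -1.0) are assumptions chosen for illustration.

```python
from typing import List

def outcome_reward(num_steps: int, final_correct: bool) -> List[float]:
    """ORM-style scoring: one verdict, based only on the final answer."""
    score = 1.0 if final_correct else 0.0
    # Every step inherits the same verdict, whether or not it was correct.
    return [score] * num_steps

def process_reward(step_correct: List[bool]) -> List[float]:
    """PRM-style scoring: each step is rewarded or penalized on its own."""
    return [1.0 if ok else -1.0 for ok in step_correct]

def first_error(step_correct: List[bool]) -> int:
    """Return the 1-based index of the first incorrect step, or 0 if none."""
    for index, ok in enumerate(step_correct, start=1):
        if not ok:
            return index
    return 0

# Steps 1-3 correct, steps 4-5 incorrect, final answer wrong.
steps = [True, True, True, False, False]
orm_scores = outcome_reward(len(steps), final_correct=False)
prm_scores = process_reward(steps)
print(orm_scores, prm_scores, first_error(steps))
```

The outcome scores carry no information about where the solution went wrong, while the process scores both credit the first three steps and locate the first error.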

### AI Thinking (Search) [7:42]

is where we get into the really interesting thing, because this is where we get to search, and many have heralded search as the thing that could take us to superintelligence. In fact, I did recently see a tweet that stated just that; I'm sure I'll manage to add that on screen. When we break this down, this is essentially where we have the AI thinking. When you have a powerful AI like o1, it needs time to think, to explore different possibilities and find the best solution. This thinking process is what the paper refers to as search. The paper says that one way you could improve performance is by thinking more during inference, which means that instead of just generating one answer, the model explores multiple possible solutions before picking the best one. Think about writing an essay: you don't just write the first draft and submit it, right? You brainstorm ideas, you write multiple drafts, you revise and edit until you're happy with the final product. That is essentially a form of search too.

There are two main strategies in the search area that the paper highlights as ones o1 might be using for this thinking process. Coming in at number one, we have tree search. Imagine a branching tree where each branch represents a different choice or action that the AI could potentially take. Tree search is like exploring the tree, following different paths to see where they lead. For example, in a game of chess, an AI might consider all the possible moves that it could make, then all the possible responses its opponent could make, and then build on this tree of possibilities

### Tree Search & Revisions [9:18]

and then it uses a certain kind of criteria to decide which branches to explore further and which to prune, focusing on the most promising paths. It's basically thinking about where you're going to go, what decisions you're going to make, and which one yields the best rewards. It's kind of like a gardener selectively trimming branches to help a tree grow in the right direction. A simple example of this is best-of-N sampling, where the model generates N possible solutions and then picks the best one based on some kind of criteria.

Now, on the bottom right here, this is where we have sequential revisions. This is like writing that essay we talked about earlier: the AI starts with an initial attempt at a solution, then refines it step by step, making improvements along the way. For example, an AI might generate an initial answer to a math problem, then check its work, identify the errors, and revise its solution accordingly. It's kind of like editing your essay, catching the mistakes and making it better every time you review it.

You also have to think about how the AI decides which paths to explore in tree search, or how to revise the solution in sequential revision. The paper mentions two types of guidance. First we have internal guidance, which is where the AI uses its own internal knowledge and calculations to guide its search. One example is model uncertainty: the model can estimate how confident it is in certain parts of its solution, and it might focus on areas where it's less certain, exploring alternatives or making revisions. It's kind of like double-checking your work when you're not really sure if you've made a mistake. Another example is self-evaluation, where the AI can be trained to assess its own work, identifying potential errors or areas for improvement. It's kind of like having an internal editor that reviews your writing and suggests changes.

Then we've got external guidance, which is like getting feedback from the outside world to guide the search. One example is environmental feedback: in some cases the AI can interact with a real or simulated environment and get feedback on its actions. For example, a robot learning to navigate a maze might get feedback on whether it's moving closer to or farther from the goal. Another example is using a reward model, which we discussed earlier. The reward model can provide feedback on the quality of different solutions or actions, guiding the AI towards better outcomes. It's kind of like having a teacher who grades your work and tells you what you did well and where you need to improve. In essence, the
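The two search strategies from this section, best-of-N sampling and sequential revision, can be sketched as small generic functions. This is a toy under stated assumptions, not how o1 is implemented: the "solutions" are integers, `toy_score` stands in for a reward model (external guidance), and `toy_generate`, `toy_revise`, and `TARGET` are hypothetical names invented for the example.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")

def best_of_n(generate: Callable[[], T], score: Callable[[T], float], n: int) -> T:
    """Sample n candidate solutions and keep the highest-scoring one."""
    candidates: List[T] = [generate() for _ in range(n)]
    return max(candidates, key=score)

def sequential_revision(draft: T, revise: Callable[[T], T],
                        score: Callable[[T], float], max_rounds: int) -> T:
    """Refine a single draft, keeping a revision only if it scores higher."""
    best = draft
    for _ in range(max_rounds):
        candidate = revise(best)
        if score(candidate) > score(best):
            best = candidate
    return best

TARGET = 42  # hypothetical "correct answer", known only to the toy scorer
rng = random.Random(0)

def toy_score(x: int) -> float:
    """Stand-in for a reward model: closer to TARGET is better."""
    return -abs(x - TARGET)

def toy_generate() -> int:
    """Stand-in for sampling one full solution from the model."""
    return rng.randint(0, 100)

def toy_revise(x: int) -> int:
    """Stand-in for the model proposing a small edit to its draft."""
    return x + rng.choice([-3, -1, 1, 3])

picked = best_of_n(toy_generate, toy_score, n=8)
refined = sequential_revision(picked, toy_revise, toy_score, max_rounds=50)
print(picked, refined)
```

Best-of-N spends its compute on breadth (many independent attempts), while sequential revision spends it on depth (many edits to one attempt); the two can be chained, as in the last two lines.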

### How AI Learns [12:01]

search element, the process by which o1 explores different possibilities and refines its solution, is guided by both its internal knowledge and external feedback, and this is a crucial part of what makes o1 so good at complex reasoning tasks. So search is how the AI thinks about a problem, but how does it actually get better at solving problems over time? This is where learning comes in. The paper suggests that o1 uses a powerful technique called reinforcement learning to improve its performance, and search generates the training data. Remember how we talked about search generating multiple possible solutions? Well, those solutions, along with the feedback from internal or external guidance, become valuable training data for the AI.
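One way to picture "search generates the training data" is as a list of scored records, one per candidate solution. The sketch below is a guess at a reasonable shape for such data, not a format described in the video or the paper; `SearchRecord`, `collect_search_data`, and the answer key `ANSWERS` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SearchRecord:
    problem: str
    solution: str
    reward: float  # feedback from internal or external guidance

def collect_search_data(problem: str, candidates: List[str],
                        reward_fn: Callable[[str, str], float]) -> List[SearchRecord]:
    """Score every candidate found during search, including the failed ones."""
    return [SearchRecord(problem, c, reward_fn(problem, c)) for c in candidates]

# Hypothetical answer key standing in for a verifier or reward model.
ANSWERS = {"2 + 3": "5"}

def toy_reward(problem: str, solution: str) -> float:
    """Return 1.0 for a solution that matches the answer key, else 0.0."""
    return 1.0 if ANSWERS.get(problem) == solution else 0.0

records = collect_search_data("2 + 3", ["4", "5", "6"], toy_reward)
for record in records:
    print(record)
```

Keeping the low-reward attempts matters for methods that learn from both good and bad outcomes; a method that only imitates successes would filter the list down to the high-reward records first.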

### Training Methods [12:47]

Think of it like a student practicing for an exam. They might try to solve many different practice problems, getting feedback on their answers and learning from their mistakes. Each attempt, whether successful or not, provides valuable information that helps them learn and improve. Now, the paper focuses on two main methods that o1 might be using to learn from this search-generated data. Number one is policy gradient methods like PPO. These methods are a little bit more complex, but the basic idea is that the AI adjusts its internal policy, which is its strategy for choosing actions, based on the reward it achieves. Actions that lead to high rewards are made more likely, while actions that lead to low rewards are made less likely. It's kind of like fine-tuning the AI's decision-making process based on its own experiences. PPO, which stands for proximal policy optimization, is a popular policy gradient method that is known for its stability and efficiency. It's like having a careful and methodical way of updating the AI's strategy, making sure it doesn't change too drastically in response to any single experience.

Then of course we have behavior cloning, which is a simpler method where the AI learns to mimic successful solutions. It's like learning via imitation: if the search process finds a really good solution, one that gets a high reward, the AI can learn to copy that solution in similar situations. It's like a student learning to solve a math problem by studying a worked example. And the paper suggests
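The "doesn't change too drastically" property of PPO comes from its clipped surrogate objective, and behavior cloning reduces to a standard imitation loss. The sketch below shows both for a single sample. The notion of an advantage (how much better an action was than expected) and the clip range `eps=0.2` are not mentioned in the video; they are the usual PPO ingredients, with 0.2 being a common default and an assumption here.

```python
import math

def ppo_clipped_objective(new_prob: float, old_prob: float,
                          advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    # r: how much more (or less) likely the action is under the updated policy.
    ratio = new_prob / old_prob
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum removes any incentive to push the ratio outside the
    # clip range, which keeps each policy update small.
    return min(ratio * advantage, clipped_ratio * advantage)

def behavior_cloning_loss(prob_of_demonstrated_action: float) -> float:
    """Negative log-likelihood of an action taken in a high-reward solution."""
    return -math.log(prob_of_demonstrated_action)

# A good action (positive advantage) whose probability jumped from 0.5 to 0.9:
# the gain is capped once the ratio passes 1 + eps.
capped_gain = ppo_clipped_objective(new_prob=0.9, old_prob=0.5, advantage=1.0)

# A bad action (negative advantage) made more likely: the penalty is not capped.
full_penalty = ppo_clipped_objective(new_prob=0.9, old_prob=0.5, advantage=-1.0)

print(capped_gain, full_penalty, behavior_cloning_loss(0.5))
```

Maximizing the first quantity over many samples raises the probability of high-reward actions in bounded steps; minimizing the second over the best solutions found by search is the imitation route.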

### Iterative Learning Cycle [14:16]

that o1 might use behavior cloning to learn from the very best solutions found during search, effectively adding them to its repertoire of successful strategies, or it could be used as an initial way to warm up the model before using more complex methods like PPO. Now, of course, we've got iterative search and learning, and the real power of this approach comes from combining search and learning in an iterative loop. The AI searches for solutions, learns from the results, then uses its improved knowledge to conduct even better searches in the future. It's like a continuous cycle of practice, feedback, and improvement. The paper suggests that this iterative process is key to o1's ability to achieve superhuman performance on certain tasks: by continuously searching and learning, the AI can surpass the limitations of its initial training data and potentially discover new and better solutions that humans haven't thought of. So, with all that being said about how o1 works, and now that you know the basics, the four key pillars, do you guys think we are close to superintelligence? After reading this research paper and understanding the key granular details about how o1 works, I think I really do understand why the wider AI community is
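The search-then-learn cycle described in this section can be shown with a deliberately tiny toy. Everything here is an assumption made for illustration: the "policy" is a single number, search is best-of-N sampling around it, the score uses a known `target` as a stand-in for a reward model, and learning is a half-step toward the best attempt (a crude form of imitating the best solution found).

```python
import random
from typing import List

def iterate_search_and_learn(target: float, rounds: int = 20,
                             n: int = 8, seed: int = 0) -> List[float]:
    """Alternate search and learning; return the policy after every round."""
    rng = random.Random(seed)
    guess = 0.0  # the "policy": where the model currently centers its attempts
    history = [guess]
    for _ in range(rounds):
        # Search: sample n attempts around the current policy, keep the best.
        attempts = [guess + rng.uniform(-10, 10) for _ in range(n)]
        best = min(attempts, key=lambda a: abs(a - target))
        # Learning: move halfway toward the best attempt, but only when it
        # actually beats the current policy.
        if abs(best - target) < abs(guess - target):
            guess += 0.5 * (best - guess)
        # The improved policy seeds the next round of search.
        history.append(guess)
    return history

history = iterate_search_and_learn(target=50.0)
print([round(g, 1) for g in history])
```

Each round's search starts from a better place than the last, so the attempts themselves improve over time, which is the feedback loop the section describes. The toy cannot say anything about how far such a loop scales in a real model.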

### Superintelligence Coming Soon? [15:30]

saying that superintelligence isn't that far away. If an AI can search for solutions, then learn from those results, and use that improved knowledge to conduct even better searches in the future, in a continuous cycle of practice, feedback, and improvement, then achieving superhuman performance would be possible in theory. So maybe artificial superintelligence isn't that far away. With that being said, I'd love to know your thoughts, and hopefully you guys have a
