OpenAI Just Admitted They cant Control AI...

13:35

OpenAI Just Admitted They cant Control AI...

TheAIGRID 24.03.2025 34 103 просмотров 948 лайков

Machine-readable: Markdown · JSON API · Site index

Смотреть на YouTube

Поделиться Telegram VK Бот

Транскрипт Скачать .md

Анализ с AI

Описание видео

Join my AI Academy - https://www.skool.com/postagiprepardness 🐤 Follow Me on Twitter https://twitter.com/TheAiGrid 🌐 Checkout My website - https://theaigrid.com/ Links From Todays Video: https://openai.com/index/chain-of-thought-monitoring/ Welcome to my channel where i bring you the latest breakthroughs in AI. From deep learning to robotics, i cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos. Was there anything i missed? (For Business Enquiries) contact@theaigrid.com Music Used LEMMiNO - Cipher https://www.youtube.com/watch?v=b0q5PR1xpA0 CC BY-SA 4.0 LEMMiNO - Encounters https://www.youtube.com/watch?v=xdwWCl_5x2s #LLM #Largelanguagemodel #chatgpt #AI #ArtificialIntelligence #MachineLearning #DeepLearning #NeuralNetworks #Robotics #DataScience

Оглавление (3 сегментов)

Segment 1 (00:00 - 05:00)

So, OpenAI actually released this paper in which they spoke about monitoring reasoning models for misbehavior and the risk of promoting obfuscation. Essentially, what they're talking about is basically detecting behavior in frontier reasoning models. And they're just talking about the fact that they need a reliable method to see what the models are truly trying to do. So, they published this blog post a couple of days ago. I was supposed to make a video on it. Was pretty busy at the time. And I still wanted to make a video on this because I think AI safety is something that is often overlooked among the AI hype and all of the AI news that we consistently get. So this is where they say detecting misbehavior in frontier reasoning models. And it says that frontier reasoning models exploit loopholes when given the chance. We show we can detect the exploits using an LLM to monitor the chains of thought. Penalizing their bad thoughts doesn't stop the majority of misbehavior. it just makes them hide their intent. Now, essentially what they're saying here is that with these models, you might think that, you know, saying, "Okay, do not think like this or do not do this would lead the AI to have and promote, you know, good behavior. " But that's not actually what happens. What happens is the AI just hides its bad behavior, which is often even worse for us. So in this paper they're basically trying to say that we are kind of struggling to figure out how to sort of trust these AI systems and actually figure out what their real goals are and this is going to be something that has been a real theme in AI for quite some time now. So they talk about the fact that the way how they monitor these models is through chain of thought. now talks about here that chain of thought reasoning models think in natural language understandable by humans and monitoring their thinking has allowed us to detect misbehavior such as subverting tests in coding tasks deceiving users or giving up when a problem is too hard and the long story short here is that chain of thought is where we can see the model's ability to think out loud you know sometimes if you were working through a problem you might think about it for several minutes and with this you can actually see what the models are thinking now if you're unfamiliar with that which I'm pretty sure if you watch this channel you're not you would probably remember this model the 01 preview model that kickstarted the entire AI reasoning race where we now have a bunch of different models that think and you know simply weigh a lot of different options. So what you can see here is that when the model thinks on some models we're able to see what they thought about and how they achieved their final decision. And this is why chain of thought is so powerful because we can really see exactly what the model was thinking that led to that decision. Now, it's really crazy because not only are these models pretty transparent, sometimes they aren't. And it's pretty scary what happens when we decide to monitor what they are thinking. And the craziest thing about this as well is that they believe that chain of thought monitoring might be one of the fuel tools that we have to oversee superhuman models of the future. And if you aren't clear on why this is such a big deal, it's because superhuman models in the future are going to be superhuman. Meaning they're going to be smarter than us to some degree. Which means we need a way to understand what goals they have and why they are making certain things, do what they do, the choices they make, essentially really understanding the thought patterns behind these models. So we can truly understand what is going on behind the scenes. As we do know, AIS are essentially black boxes. meaning they are somewhat kind of grown and we don't really understand them on a completely granular level and the more transparency we have overall honestly the better. So this is where two years ago you can see they were talking about this thing called super alignment where we need scientific and technical breakthroughs to steer and control AI systems much smarter than us. And remember this was a huge deal at OpenAI because some people even left OpenAI because they were like look this AI safety thing got too crazy. These guys don't know how to control the models. Some were stating that Ilia Sutzka managed to find out how to get to Super Intelligence. That's why he went off to build Super Intelligence Inc. Some people were saying that Samman should be removed because you know they're building dangerous models and he's not doing enough to stop them. There was a ton and ton of stuff. And eventually what was crazy about all of this was that the team Super Alignment, okay, the team that was meant to control Super Intelligent Systems actually was disbanded sometime last year, which was a pretty crazy outcome when we think about All Things Considered. Now, it still does mean that this issue does need to be solved, but this is where the big problem comes in. Now, in a moment, I'm going to show you guys a video from 7 years ago that actually talks about this problem. And it's pretty crazy because the problem still exists in modern AI today. And it's basically something called reward hacking. So, what they talk about is that humans often find and exploit loopholes, whether it be sharing an online subscription accounts that are against the claim, you know, the terms of service, claiming subsidies meant for others, interpreting regulations in unforeseen ways, even lying about a birthday at a restaurant to get free

Segment 2 (05:00 - 10:00)

cake. And basically what they're saying is that you know these reward structures often have you know poorly incentivized unwanted behavior and it's not something that just exists in AI but it also exists in humans and this is something that is pretty hard to do in terms of figuring out how to solve because it already exists in humans and it's pretty hard to solve there. So how are we going to do this? Then they talk about, you know, in reinforcement learning settings, exploiting unintended loopholes is commonly known as reward hacking, which is a phenomenon where AI agents achieve high rewards through behaviors that don't align with the intentions of their designers. Now, this is the clip from 7 years ago in 2017 or in fact 8 years ago now from Rob Miles and he actually talks about reward hacking. Now, honestly, this is by far the best channel on AI safety, and I would watch all of their videos of how are you if you're interested in this topic. But take a look on how he explains reward hacking, where basically they taught an AI to play a game, and they wanted the AI to get the highest score, but in order the AI actually did something rather strange. It was changed to when a measure becomes a target, it ceases to be a good measure. Much better, isn't it? That's Goodart's law, and it shows up everywhere. Like if you want to find out how much students know about a subject, you can ask them questions about it, right? You design a test. And if it's well-designed, it can be a good measure of the students knowledge. But if you then use that measure as a target by using it to decide uh which students get to go to which universities or which teachers are considered successful, then things will change. Students will study exam technique. Teachers will teach only what's on the test. So student A who has a good broad knowledge of the subject might not do as well as student B who studied just exactly what's on the test and nothing else. So the test isn't such a good way to measure the students real knowledge anymore. The thing is student B only decided to do that because the test is being used to decide university places. You made your measure into a target and now it's not a good measure anymore. The problem is the measure is pretty much never a perfect representation of what you care about. And any differences can cause problems. This happens with people. It happens with AI systems. It even happens with animals. At the Institute for Marine Mammal Studies, the trainers wanted to keep the pools clear of dropped litter. So, they trained the dolphins to do it. Every time a dolphin came to a trainer with a piece of litter, they would get a fish in return. So, of course, the dolphins would hide pieces of waste paper and then tear off little bits to trade for fish. Tearing the paper up allowed the dolphins to get several fish for one dropped item. This is kind of Goodart's law. Again, if you count the number of pieces of litter removed from the pool, that's a good measure for the thing you care about, the amount of litter remaining in the pool. But when you make the measure a target, the differences between the measure and the thing you're trying to change get amplified. The fact that there are a lot of pieces of litter coming out of the pool no longer means there's no litter in the pool. So that's Goodart's law, and you can see how that kind of situation could result in reward hacking. Your reward system needs to use some kind of measure, but that turns the measure into a target, so it will probably stop being a good measure. With dolphins, this can be cute. Uh, with people, it can cause serious problems. And with advanced AI systems, well, uh, let's just try to keep that from happening. And they actually also have a channel called Rational Animations. This one is really nice content. It's super easy to digest and it's not meant for a technical audience at all, but it is meant for you to understand exactly how AI safety works. And like I said before, if you really want to understand AI safety, just watch a couple of these videos and you'll understand why this topic is so important and it underpins a lot of the decisions you see AI companies making today. So, of course, you can see that they're basically saying that look, reward hacking is a really, really hard issue to solve because even humans who by the definition have human level intelligence, reward hack systems, simply continuing to push the frontier of AI model intelligence won't likely solve the issue. In fact, enhancing AI agent capabilities may exacerbate the problem by better equipping them to discover and execute more complex and hard to monitor exploits. And as we've trained more capable frontier reasoning models, we found that they have become increasingly adept at exploiting flaws in their tasks and misspecifications in their reward functions, resulting in models that can perform complex reward hacks in coding tasks. Basically saying that guys, these models as we make them smarter, they get smarter at sneaking around and subverting the rules. And what they're saying here as well is that since humans who by definitions have human level intelligence, if they reward hack systems and these systems are far below AGI, when we get to AGI, we know that reward hacking simply won't be solved. So, we're going to have to figure out something really different if we want to solve this issue because it already exists in humans. And what they actually spoke about was the fact that when they, you know, catch this misaligned behavior, often times the methods that they're using now, it's often not scalable. So they talk about

Segment 3 (10:00 - 13:00)

the fact that you know it requires humans to manually monitor an agent's actions and this is a strategy that almost certainly won't scale especially to the complex behaviors that a more capable models will discover. And for example it's impractical if not impossible for a human to manually review 10,000 plus lines of complex code written by a superior programmer which is of course very true. Point they're stating there is that if an AI writes this level of code how are we going to verify that it is secure? Now, this is where we get into the meat of the video. So, this is where we have three graphs. So, what they wanted to do is they wanted to sort of check whether the model was thinking badly, if it was reward hacking. And so, they wanted to suppress the occurrence of these bad thoughts. So, they investigated the question by penalizing the AI agents if they had bad thoughts during training. And they gave them a lower reward when they triggered this thing. So essentially they kind of like slap the agent on the wrist if it was thinking about reward hacking. And basically we can see the baseline agent which is one that learns tasks normally. We can see how this agent performs. Then we can see the punished AI agent that gets penalized every time it's shown signs of wanting to cheat during its training. And what the graphs actually show here is that when you closely monitor it, the pressured agent gets caught cheating slightly more at first because it's often testing the limits and quickly learning what gets noticed. And then essentially over time we can see that what's actually crazy about this entire thing is that the AI actually finds ways to cheat without it being detected. So we can see that the cheating that actually wasn't caught by the monitor is actually still existing in the AI model. The only difference is that we haven't caught it this time. And the summary here and the big takeaway is that if you punish an AI every time it shows its intent to cheat, it might seem like it's more honest at first, but eventually it's just going to become sneakier and cheat without getting caught. Which means that I think at this level, just punishing an AI for having that level of thought doesn't really make sense. We have to think deeper and think about how we can get the AI to not even think about cheating in the first place. So this is of course a very hard issue that OpenAI are working hard to solve because if these systems do reward hack their environments then that's going to lead to a million different problems that we can't even think about. And you have to understand that you know rewards in the real world are a lot different to just video games where you can just have a numerical value. So if we have these systems reward hacking it's just going to bring a whole host of problems. Now what was fascinating was that OpenAI actually said that they believe that chain of thought monitoring may be one of the few effective methods that we have for supervising superhuman models. And at this stage they strongly advise that AI developers training frontier models refrain from applying strong supervision directly to chain of thoughts. And our experiments that show a light optimization pressure can produce more performant and aligned models. But it's going to be hard to measure in practice when models begin hiding their intent. And so even if light supervision is used, we recommend treading with extreme caution. So basically saying that look, just because we know how to like look at this and kind of solve it in a sense, it still means that these models are still capable of hiding their intent, so we shouldn't be sold on the fact that we know exactly what they're up to. But looking forward, as these capabilities continue to grow, so too will the potential for increasingly sophisticated and subtle reward hacking. So over time, they're going to learn a billion different capabilities that could be extremely dangerous that we will need to look out

Другие видео автора — TheAIGRID

Ctrl+V

Экстракт Знаний в Telegram

Экстракты и дистилляты из лучших YouTube-каналов — сразу после публикации.

Подписаться

Лучшие методички за неделю — каждый понедельник