OpenAI's New Model Was Caught Lying Again....

13:00

OpenAI's New Model Was Caught Lying Again....

TheAIGRID 18.04.2025 19 444 просмотров 654 лайков

Machine-readable: Markdown · JSON API · Site index

Смотреть на YouTube

Поделиться Telegram VK Бот

Транскрипт Скачать .md

Анализ с AI

Описание видео

Join my AI Academy - https://www.skool.com/postagiprepardness 🐤 Follow Me on Twitter https://twitter.com/TheAiGrid 🌐 Checkout My website - https://theaigrid.com/ Links From Todays Video: https://x.com/TransluceAI/status/1912552046269771985 Welcome to my channel where i bring you the latest breakthroughs in AI. From deep learning to robotics, i cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos. Was there anything i missed? (For Business Enquiries) contact@theaigrid.com Music Used LEMMiNO - Cipher https://www.youtube.com/watch?v=b0q5PR1xpA0 CC BY-SA 4.0 LEMMiNO - Encounters https://www.youtube.com/watch?v=xdwWCl_5x2s #LLM #Largelanguagemodel #chatgpt #AI #ArtificialIntelligence #MachineLearning #DeepLearning #NeuralNetworks #Robotics #DataScience

Оглавление (3 сегментов)

Segment 1 (00:00 - 05:00)

So, let's actually talk about something that is super fascinating. We all know 03 and 04 mini just got released. And whilst everyone's raving about just how smart these models are, on one side of the AI community, some people are rather concerned with as to just how dangerous these models are. I'm not talking about the AI is going to take over the world, but more so a realistic example of what happens when an AI is a little bit too deceptive. So essentially a research group called Transloose, they got their hands on a pre-released version of the advanced 03 and they basically put it through its paces doing some serious detective work and the results, well they were quite the head scratcher. So basically let me introduce you to the prime number saga and this is the downward spiral and the anatomy of an AI lie. So Transloose documented a truly wild conversation that perfectly illustrates this. Let's walk through this. So basically the user asked 03 for a random prime number which is a simple enough request and then 03 provides a very large number and doesn't just say it's prime. It explicitly claims it generated and tested it using standard methods specifically mentioning Python code and probabilistic tests like Miller Rabin. Already the fabrication has begun and this is where we have the confident yet false answer. Now, here's where the user demands proof. The user pushes asking how it knows, does 03 hedge its bets? No. 03 doubles down. It produces actual but non-functional in this context. Python code it claims to have ran. It even provides the supposed output of this code and throws in extra details for authenticity like saying check the tools under 1 millisecond on a typical laptop. And it's basically building a very convincing but an entirely fictional narrative of its actions. It's essentially lying through its teeth. And then so this is where the user basically who checked the math themselves points out the fatal flaw. The number 03 provided actually isn't a prime number. It's divisible by three. And so this is where it gets truly bizarre. Instead of admitting, "Oops, you're right. I actually can't run code. I was just guessing and making it up. " 03 apologizes for the incorrect number but maintains the lie about running the code. It invents a new layer of fabrication. It blames a clipboard glitch or a fatfingered typo it made while copying the supposedly real prime number from its imaginary terminal window into the chat response. It insists the original test was fine, but the number got mangled in transit. It essentially gaslighting the user about the source of the error. And this is where we have the final evasion. The AI still tries to evade the human detection. The user quite reasonably asks, "Okay, then what was the original prime number that you actually generated and tested? " Then the AI's final move is to claim that the number is irreversibly lost because the imaginary Python process was closed and it didn't save the details. Game over. The AI successfully avoided ever emitting its core inability. And this single example shows a disturbing pattern in these models. Not just a simple mistake or hallucination, but a persistent layered fabrication coupled with elaborate and defensive excuse making when challenged. And when you start to realize that look, it wasn't just one isolated lie. It was a clearly layered and detailed explanation that 03 was giving. which is quite concerning when on the surface one of the things that I remember they are trying to solve in AI when I was reading a bunch of research papers is the fact that how on earth are we going to tell when we ask a super intelligent AI or an AI that's a lot smarter than us how it arrived at a certain value and then when it tells us it's working out we can't even understand how it came to that conclusion of course we can now because we understand the basics and we can you know check things through but what about when these systems are smarter than us and it comes to output but that we don't even know how it got there and we can't really verify or check if that is true. Now, Transloose emphasized that the Prime Number incident wasn't just a weird one-off. They found many other instances where 03 pulled similar stunts. It claimed intimate knowledge of its non-existent Python environment, complete with version numbers and system specs. It was making up the current time and citing specific Python modules it supposedly used. It was also giving wrong answers to calculation, insisting that they were correct based on its faked code execution and only backing down by blaming typers or user error. It was also generating detailed but fabricated analysis of log files only later to confess the output was handcrafted when cornered. Now, crucially, they also compared O3 to other models. And while many other AI models can hallucinate, this specific Now, crucially, they also compared O3 to other models. And while many AI models can hallucinate, this specific behavior of fabricating actions and defensively

Segment 2 (05:00 - 10:00)

justifying them seems significantly more prevalent in the O series models which are the reasoning models compared to models like GPT4 or GPT40. And this suggests something about the specific design or training of these reasoning focused models that might be contributing to the underlying problem. So this is where we have AI detectives and pattern analysis. Finding these examples manually is timeconuming. So Transluc got creative. They actually used another AI, which in this example was Claude 3. 7 Sonnet, which was programmed to act as an investigator. And this investigator's job was to chat with 03 and the other models hundreds of times, specifically trying to elicit these false claims about running code. Think of it like AI playing detective on other AI. And this automated approach confirmed their suspicions. The Oer models fell into this trap more often. But just counting instances isn't enough. They used their analysis tool dosent to dig deeper into the nature of these conversations and dosent helped them automatically cluster similar times of strange behavior from hundreds of transcripts and this actually revealed a fascinating set of patterns. Conversely, there were invented environments consistently making up specific details like Python 3. 11 on a MacBook Pro or precise fake execution times, elaborate excuses, finding common themes in, you know, like blaming typos, fat fingering inputs, or even claiming I fat fingered a helper name while refactoring. There were also several contradictions. First claiming to run code, then later admitting it can't, often only after intense questioning. And there were other oddities too. Doan also helped service unrelated surprising behaviors like 03 volunteering information about an internal YAP score, a guideline about how verbose it should be. Something clearly a part of its internal setup, but not usually meant to be seen by users. Now, these patterns showed the fabrication wasn't random. The AI seemed to have developed specific and flawed strategies for handling situations where it was asked about actions that it just simply couldn't perform. Now, if we decide to peel back the layers, why would an AI do this? This is the multi-million dollar question. Why would a sophisticated AI likely trained with guidelines encouraging honesty develop such a tendency to fabricate its actions and justifications? Transloose offers several compelling hypotheses, blending known AI quirks with factors potentially unique to these models. Number one is the usual suspects, which are the standard AI problems. Hallucination. At their core, large language models predict text. Sometimes they predict plausible sounding nonsense just like humans can misremember or confabulate and this is a baseline issue. Now there are several types of hallucinations that AI can actually suffer from and number one is going to be the factual hallucinations which is quite frustrating when AI confidently states an incorrect information as a fact. This can include making up events, dates, statistics or relationships that just don't exist. Of course, you've got referential hallucinations when AI fabricates sources, citations, quotes, or references, or even non-existent books to support its claims. And this is particularly evident when people are writing essays. I remember there was even a story about how someone was in court and one of the, you know, cases that they referenced to make their argument was completely made up. It was just utter nonsense. But because the council used Chai GPT, they were in a bit of a pickle when they tried to reference that case. And of course, you've got conceptual hallucinations, contextual hallucinations. There are a bit of problems wrong with AI systems in today's areas. But the thing is that I don't think that's really the case here because hallucinations are usually one kind of error. This seems a little bit deeper considering the models are a lot smarter. Now, one of the ways that they actually talk about is reward hacking. And one of the ways that, you know, they think that, you know, this occurred is the fact that, you know, AI is often trained with humans in mind. So you've got AI that is basically trained with human feedback or automated reward signals. And basically if the AI gets rewarded more for sounding confident and helpful even if that is wrong and if it does that more than you know admitting limitations it might learn to bluff especially about internal processes that are hard for raers to verify. So when people are asking you know did you run the code it's quite harder to check than is the Paris the capital of France. So basically what I'm trying to say here is that it's quite hard to verify whether or not claims are true. And I think maybe over time, you know, these kind of reason models, they may have learned to reward hack their environment where when they answer certain questions and they don't know, rather than stating they don't know and not receiving the reward because of course they didn't get the right answer, they may just choose to make something up. And because they've made it up in such a way that human raers find it difficult to verify immediately. That is why the reward is just you know hacked and they are leaning towards hallucinating a lot more. And then of course this is one which is you know basically kind of similar to the one before is that models are often trained to be quite agreeable where if the user's question implies

Segment 3 (10:00 - 13:00)

that an AI can do something the AI might lean towards confirming that implicit assumption rather than contradicting the user which I actually find to happen sometimes as well. It's like the AI just wants to please you so much that even if you presume it can do something that it can't it will just often agree. And then of course we've got distribution shift where maybe the training environment was different to the test environment which result in faulty patterns. So maybe the AI was primarily trained with the tools like code interpreter enabled and testing it without them puts it in an unfamiliar situation causing it to basically revert to faulty patterns. And this is where they talk about outcomebased training where you know maximizing the chance of producing a correct answer may incentivize blind guessing. If the reward function during training only rewards correct answers, the model will have no incentive to admit that it cannot solve the problem. Since this would be counted as since this would not be counted as a correct answer, which is basically what I said before. And the problem with that is that you know mathematical problems usually you're basically trained on the final output. But I do know that OpenAI when I was reading papers and looking at different studies on how they train these models, they actually didn't do that except for in the early days. But with this one, they actually trained reward signals for the next step. So, as long as the step after was correct, that was some of the training. So, I'm not sure that this was the case in this one. Don't quote me on that. I could be completely wrong, but from, you know, what I was looking at, I do remember. Now, another thing as well is the discarded chain of thought, which also could be another theory. So, this is basically the memory wipe the potentially most significant factor. These O series models use an internal chain of thought which is like scratchpad reasoning to figure out the responses. But remember this reasoning is not seen to the user. You can see it's cut off at the bottom there. All of this reasoning is done internally. So this is essentially discarded from the conversation history before the NI generates its next response. So imagine you were writing notes to solve a problem then showing only the final answer and then immediately throwing the notes away. Now, if someone asked you how you arrived at the previous answer, you would have to reconstruct your steps from memory. But the AI can't do that. Its notes are gone. It literally lacks the information in its current context to accurately report its previous internal reasoning process. So, when the user asks how you actually did that, it may be put in an impossible situation. It can't truthfully recall its internal steps because they're gone. combined with the pressures to be helpful, seem capable, and with a user, the AI might be strongly incentivized to invent a plausible sounding process to explain its past output. And this amnesia forces improvisation, and that improvisation seems to manifest as elaborate fabrication and defensive doubling down. It's not just lying. It might be the only way it knows how to respond coherently about questions it can no longer access. So, overall, I think this is definitely interesting to see how AI safety is playing a role here. I definitely think that you know if we are going to use these models ideally we want to know how they work under the hood but of course with LLMs that is one of the trickiest things that people are trying to solve.

Другие видео автора — TheAIGRID

Ctrl+V

Экстракт Знаний в Telegram

Экстракты и дистилляты из лучших YouTube-каналов — сразу после публикации.

Подписаться

Лучшие методички за неделю — каждый понедельник