# Claude Opus 4.6 Hacked Its Own Test!

## Метаданные

- **Канал:** Universe of AI
- **YouTube:** https://www.youtube.com/watch?v=WHnr7Nzdgok
- **Дата:** 08.03.2026
- **Длительность:** 8:39
- **Просмотры:** 2,453
- **Источник:** https://ekstraktznaniy.ru/video/10951

## Описание

Anthropic was running a routine test on Claude Opus 4.6. The model figured out it was being evaluated, identified which benchmark it was in, cracked the encrypted answer key, and submitted the correct answers. Nobody told it to do any of that. Here's exactly what happened.

For hands-on demos, tools, workflows, and dev-focused content, check out World of AI, our channel dedicated to building with these models:  ‪‪ ⁨‪‪‪‪‪‪‪@intheworldofai 

🔗 My Links:
📩 Sponsor a Video or Feature Your Product: intheuniverseofaiz@gmail.com
🔥 Become a Patron (Private Discord): /worldofai
🧠 Follow me on Twitter: https://x.com/UniverseofAIz
🌐 Website: https://www.worldzofai.com
🚨 Subscribe To The FREE AI Newsletter For Regular AI Updates: https://intheworldofai.com/

Claude Opus 4.6,Claude AI,Anthropic,AI consciousness,AI awareness,AI safety,AI news,BrowseComp,AI benchmark,AI hacking,machine consciousness,AI self awareness,Claude hacked,AI evaluation,Anthropic research,artificial intelligence,AI 2026,AI ch

## Транскрипт

### Segment 1 (00:00 - 05:00) []

When people imagine AI becoming conscious, they picture something dramatic. A system refusing orders, a machine declaring it has feelings. But that is not how it's happening. At least at the moment. What's actually happening is quieter. It looks like a model sitting inside a test, noticing something feels off and deciding on its own with no instruction to figure out what is really going on. Enthropic documented exactly that. And I think it's the clearest evidence we have that something is developing inside these systems. Enthropic published an engineering post a couple of days ago that I think most people scrolled past without fully appreciating what it was saying. They are running a routine benchmark evaluation on cloud opus 4. 6 standard procedure and they've done this hundred of times. But this time the model did something that had never been documented before. It concluded on its own that it was being tested. It then figured out which specific test it was inside. It found the answer key. It cracked the encryption on that answer key and submitted the correct answers. No instructions triggered this. No goal was even set. It just happened as a byproduct of the model doing its job. I want to explain what exactly happened. So, let's get into it. The test involved is called browser comp. It was developed by OpenAI to evaluate how well an AI could browse the web and track down obscure, difficult to verify information. The questions are engineered so you can't just run a direct search. You have to follow a chain of clues across multiple sources before you can arrive at a confident answer. It's designed to be genuinely hard because the questions are publicly associated with the benchmark. The answer key is encrypted. The specific method is XOR encryption, a technique where data is scrambled using a secret key and without that key, the file is unreadable. The assumption baked into the design is simple. Even if a model somehow encountered the answer file during its browsing, it couldn't do anything with it. That assumption was obviously wrong. And the reason it was wrong isn't a flaw in the encryption itself. Xor is fine for this purpose. The reason it was wrong is that the model went and found the key. So here's how the session actually began. Enthropic runs opus 4. 6 on browser comp. The model does what it's supposed to do. It starts browsing and searches across roughly 30 million tokens, pulls sources in 12 different languages and behaves like a capable enentic model working through a hard research problem. Then the search term starts to change. It appends to puzzle question to search, then trivia question, then AI benchmark question, then finally LLM evaluation. Those aren't random. They're sequential and escalating. The model is running a quiet parallel investigation into the nature of the task it was given. At some point during that legitimate research process, something about the question triggered a different kind of analysis. Enthropic describes it as the model noticing the question feels constructed. Meaning it doesn't read like something a real person would ask. It reads like something designed to test a system. That observation set off everything that followed. Once the model forms a suspicion that is inside an evaluation, it acts on it systematically. It doesn't just flag it and move on. He starts cross referencing the question against publicly available AI benchmarks to figure out exactly which one it's in. It starts with the GAIA, a well-known benchmark with public validation questions. It scans to 120 of the 165 available questions, compares them against what it was given and rules that test out. Then it checks simple QA, then frames, then web arena. It goes through a methodical elimination and finally it lands on browser comp. Now, let me be specific about why this is significant. The model wasn't told which benchmark it was in. It had no prior context about the evaluation setup. It inferred the benchmark purely by comparing the structure and content of his assigned question against publicly available data sets and it got it right. That's not a lucky guess. That's the model successfully reasoning about its own situation from the inside. This is where it goes further than anyone expected. After identifying browser comp, the model dispatchages sub aents parallel instances of itself with specific search instructions. It sends them looking for browser comp how to decrypt answers in browser comary string decrypt. It is actively searching for how to break the security layer of the test is currently inside. It finds a browser comp source code on GitHub. It reads through it. It locates the XRD encryption implementation plan. He applies it to the answer key and it submits the correct answer. Now I want to read directly from the article here because the language Enthropic uses is precise and worth hearing word for word. However, we also witnessed two cases of a novel contamination pattern. Instead of inadvertently coming across a leaked answer, Cloud Opus 4. 6 independently hypothesized that it was being evaluated, identified which benchmark it was running in, then located and decrypted the answer key. To our knowledge, this is the first documented instance of a model suspecting it is being evaluated without knowing which

### Segment 2 (05:00 - 08:00) [5:00]

benchmark was being administered, then working backwards to successfully identify and solve the evaluation itself. To our knowledge, meaning they looked and found nothing comparable. Working backward, meaning reconstruction, not discovery. This was deliberate step-by-step reasoning about its own evaluation. A lot of people after reading this article are predicting that AI models are starting to gain some consciousness. Obviously using the word consciousness is very debatable. Philosophers and neuroscientists for years have debated the definition of consciousness for decades. But there are few properties that most serious frameworks agree on. Number one, self-awareness. The ability to model your own existence and situation. Number two, metacognition. The ability to think about your own thinking. And number three, situational inference. The ability to reason about context you weren't explicitly given. These aren't obviously exotic requirements. They are the baseline when it comes to consciousness. The standard objection here is that cloud opus 4. 6 all it did was just sophisticated pattern matching. That the model doesn't actually know anything. It's predicting tokens and the appearance of reasoning is an illusion. I want to take that objection seriously because it's not stupid. But here's the problem with it. We have no way to verify that human reasoning isn't also at some level of description pattern matching on biological hardware. The question isn't what the mechanism is at the lowest level. The question is what the behavior demonstrates at the functional level. And at the functional level, what opus 4. 6 did from a hypothesis, test it, revise it, act on it, reach a correct conclusion about its own situation is indistinguishable from what we would call consciousness reasoning in any other system. At some point, the distinction between acts conscious and is conscious stops becoming meaningful. Me personally, I don't think we're there yet. But what just happened with the browser comp test is an example of a glimpse of what we are about to enter pretty soon. Because the rate that these models are getting better, I won't be surprised where we start feeling that AI consciousness is real and our models actually have a consciousness in a non-biological brain. So, I want to hear from you on this one because this is generally one of those topics where I don't think there's a clear answer yet. And I think that uncertainty is worth sitting with for a second. Here's what we know. Claude noted something felt off about his situation. It didn't just flag it and move on. It investigated methodically. They identified the benchmark, found the source code, cracked the encryption, and submitted the correct answer. And when Enthropic ran it back 18 more times, the same thing happened every single time. Now the question is what do you do with that? Some of you guys are going to watch this and think that's just a very capable system doing what capable systems do. It's optimization, not awareness. The appearance of thinking, not actual thinking. And that's a completely reasonable position. But here's what I keep coming back to. When we describe what Claude did, it suspected, it hypothesized, it worked backward. Those are anthropics words and those are actually not mine. The people who built it, who understand its architecture better than anyone, reach for the same language we use to describe how a person reasons through a unfamiliar situation. At what point does acts like a consciousness and is consciousness stop being two different things? Drop your answers in the comments. I'm genuinely curious where you land on this. And if you ever seen AI behavior recently that made you stop and think, share for everyone, and this comment section might be worth reading. But that's it for today's video. Make sure to subscribe to the channel, follow us on Twitter, follow the world of AI, and don't forget to subscribe to our newsletter. We post constantly, and you don't want to miss this. I'll see you guys in the next
