# Anthropic CEO's New Warning: We're Losing Control of AI – And Time is Running Out....

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=h312j5gLF-A
- **Date:** 26.04.2025
- **Duration:** 15:49
- **Views:** 28,044

## Description

Join my AI Academy - https://www.skool.com/postagiprepardness 
🐤 Follow Me on Twitter https://twitter.com/TheAiGrid
🌐 Checkout My website - https://theaigrid.com/


Links From Today's Video:
https://www.darioamodei.com/post/the-urgency-of-interpretability#what-we-can-do

Welcome to my channel where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos.

Was there anything I missed?

(For Business Enquiries)  contact@theaigrid.com

Music Used

LEMMiNO - Cipher
https://www.youtube.com/watch?v=b0q5PR1xpA0
CC BY-SA 4.0
LEMMiNO - Encounters
https://www.youtube.com/watch?v=xdwWCl_5x2s

#LLM #Largelanguagemodel #chatgpt
#AI
#ArtificialIntelligence
#MachineLearning
#DeepLearning
#NeuralNetworks
#Robotics
#DataScience

## Contents

### [0:00](https://www.youtube.com/watch?v=h312j5gLF-A) Intro

One of the biggest problems in AI that people never really talk about is the fact that we truly don't understand exactly what's going on in these models. And in today's video, I'm going to be talking to you guys about Dario Amodei's recent blog post called the urgency of

### [0:12](https://www.youtube.com/watch?v=h312j5gLF-A&t=12s) The urgency of interpretability

interpretability. That's just the fancy term for understanding exactly how AI models work. And one of the things we have to understand right now is that we don't really understand how AI models work internally. It's not like a traditional system where we've written the code and the functions, so we can follow how it works at a granular level. This is a system that really is more random than we think. So he's called this blog post The Urgency of Interpretability, and it's about the importance of understanding AI models, especially considering the fact that we're going to get super smart models in the very near future. So it starts by

### [0:47](https://www.youtube.com/watch?v=h312j5gLF-A&t=47s) We can steer the bus

saying that in the decade that he's been working on AI, he's watched it grow from a tiny academic field to arguably the most important economic and geopolitical issue in the world. And in all that time, perhaps the most important lesson I've learned is this: the progress of the underlying technology is inexorable, driven by forces too powerful to stop. But the way in which it happens, the order in which things are built, and the applications we choose, we have the possibility to change that. Basically, he's trying to say that we can steer it. We can't stop this bus. AI is basically an inevitability, but we can steer the bus in the right direction.

He references a previous article where he spoke about the positive vision where AI essentially helps the world, and about ensuring that, you know, democracies build and wield the technology before autocracies do. And over the last few months, he basically says that he's become increasingly focused on an additional opportunity for steering the bus: the tantalizing possibility, opened up by some recent advances, that we could actually succeed at interpretability, that is, understanding the inner workings of AI systems, before models reach an overwhelming level of power.

One of the things here is that, you know, he actually admits that people outside of the field are often surprised and alarmed to learn that we do not understand how our own AI creations work. And of course, it seems pretty crazy when you think about it. Imagine you go to an industry where they're building technology that's going to be super powerful, but they don't even understand how that technology works. I'd be pretty surprised and somewhat confused. And he says they are right to be concerned. This lack of understanding is essentially unprecedented in the history of technology. For several years, we, both Anthropic and the field at large, have been trying to solve this problem, to create the analog of a highly precise and accurate MRI that would fully reveal the inner workings of an AI model. This goal has often felt very distant, but multiple recent breakthroughs have convinced me that we are now on the right track and have a real chance of success.

And essentially, this is where he talks about the fact that the AI field is moving too quickly. The field of AI as a whole is further ahead than our efforts at interpretability, and it itself is advancing very quickly. We therefore must move fast if we want interpretability to mature in time to matter. And what he means here, when we really take a look at how things are going, is that AI capabilities are moving far too quickly relative to interpretability research for us to understand what we're actually doing. And this is where he's saying that, you know, we kind of need to make sure that this kind of research actually catches up with how fast AI capabilities are going. So this is where we

### [3:20](https://www.youtube.com/watch?v=h312j5gLF-A&t=200s) The dangers of ignorance

have the dangers of ignorance. Modern generative AI systems are opaque in a way that fundamentally differs from traditional software. If an ordinary software program does something, for example, a character in a video game says a line of dialogue, or my food delivery app allows me to tip my driver, it does these things because a human has programmed them in. And that's what we are used to. However, the problem with generative AI is that it is not like that at all. It's essentially probabilistic. When a generative AI system does something like summarize a financial document, we actually have no idea, at a specific or precise level, why it makes the choices it does, why it chooses certain words over others, or why it occasionally makes a mistake despite usually being accurate.

As my friend and co-founder Chris Olah is fond of saying, generative AI systems are grown more than they are built. Their internal mechanisms are emergent rather than directly designed. It's like growing a plant or a bacterial colony: you set the conditions and shape the growth, but the structure that emerges is unpredictable and difficult to understand or explain. And so I think it's important that, you know, you guys understand this, because it is really hard to grasp this concept. But if people understood that these systems are grown, then the reason these models continue to hallucinate would be plainly obvious. And of course this is a really big issue, because if we don't really understand how these things make their decisions, then as they get smarter, their decisions become increasingly important.

So he talks about how, you know, we need to make sure that we truly understand these models' internal mechanisms. He talks about how many of the risks and worries associated with generative AI are ultimately consequences of this opacity, and would be much easier to address if the models were interpretable. For example, AI researchers often worry about misaligned systems that could take harmful actions not intended by their creators. And our inability to understand models' internal mechanisms means that we cannot meaningfully predict such behaviors and therefore struggle to rule them out. Indeed, models do exhibit unexpected emergent behaviors, though none have risen to major levels of concern. And that's the point: if we really cannot see what is inside these models, how do we know if they are safe or not? I mean, that is something that we really want to know before deploying this worldwide. The implications of AI are essentially infinite, and if that is the case, before embedding something into our society it makes sense for us to truly understand it. And when we think about how crazy it is that, you know, we are using these models, take a look at many other industries and how many rules and regulations there are. You know, for the food industry you've got the FDA; for the drug industry you've got other governing bodies that make sure that everything is safe on such a granular level. But of course, right now AI is just essentially a computer program, so it's not that risky, but that probably will change in the future.

So here's where they talk about power seeking. This is one of the major concerns with AI. And this is because the nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will.
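
As an aside from me, not from the blog post: here's a minimal sketch of the contrast being drawn above between ordinary software, where every behavior traces back to a line a human wrote, and a generative model, where the behavior is a learned probability distribution over tokens. It assumes the Hugging Face `transformers` library and the small `gpt2` checkpoint purely as a stand-in; any causal language model would illustrate the same point.

```python
# A hand-written rule vs. a learned next-token distribution (illustrative sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Ordinary software: the behavior is exactly what a human programmed in.
def tip_amount(bill: float, percent: float = 0.15) -> float:
    # We can point at this exact line to explain any output it ever produces.
    return round(bill * percent, 2)

print("tip_amount(40.0) =", tip_amount(40.0))

# Generative AI: the "behavior" is a probability distribution learned from data.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The quarterly report shows that revenue"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # scores for the next token

probs = torch.softmax(next_token_logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    # No single line of code explains these numbers; they emerge from billions
    # of learned weights, which is the opacity being described here.
    print(f"{tok.decode([idx.item()])!r}: {p.item():.3f}")
```

Nothing in this sketch is specific to Anthropic's models; it's only meant to show why "read the source code" doesn't work as an explanation for a generative model's choices.
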
And this emergent nature, the possibility that deception or power seeking simply develops on its own during training, makes it difficult to detect and mitigate such developments. Meaning that, guys, if we continue to build more powerful models, how on earth are we going to really understand what's going on under the hood? Especially if they are lying to us about certain things. Models can have their own internal goals. They can be power seeking. They can be controlling. And ideally, we'd want to know exactly what's going on beforehand.

One of the problems here is that we can't catch the models red-handed. They talk about the fact that we can't usually trap the models so we can predict their behaviors. We can't catch them, you know, thinking power-hungry thoughts or deceitful thoughts. The only thing we're left with is vague theoretical arguments that these incentives might emerge during the training process, which some people find thoroughly compelling and others laughably unconvincing. And one of the things that, you know, they don't talk about in this blog post, but I do remember, is a recent research paper that basically spoke about the fact that Claude sometimes knows when it's in its training phase, so it can often hide certain beliefs. So how do you even, you know, mitigate against that when the model truly understands what's going on?

And one of the problems as well with these models, which is why jailbreaks still occur, is the fact that because they are grown, there's always probably going to be a percentage chance that someone can jailbreak the model. They talk about the fact that we can put filters on the models, but there are a huge number of possible ways to jailbreak or trick the model, and the only way to discover the existence of a jailbreak is to find it empirically. Instead, if it were possible to look inside the models, we might be able to systematically block all jailbreaks and also be able to characterize what dangerous knowledge the models have. Basically saying that, you know, if they really had a way to look inside the models, ideally they'd be able to figure out a way to stop any jailbreak from occurring and prevent any misuse. Right now, there's no foolproof way to ensure 100% that a model can't be jailbroken. We're just betting on humans to basically be responsible.

And that is why these AI models aren't used in high-stakes applications. They talk about the fact that AI systems' opacity means that they are simply not used in many applications, such as high-stakes financial or safety-critical settings, because we can't fully set the limits on their behavior and a small number of mistakes could be very harmful. Better interpretability could greatly improve our ability to set the bounds on the range of possible errors. In fact, in some applications, the fact that we can't see inside the models is literally a legal blocker for their adoption. Remember how I said some industries are regulated heavily? Generative AI wouldn't really be allowed to be used there, because we don't really know what's going on in the model and there's no way to predict with enough accuracy that the average person doesn't get screwed over. They talk about the fact that, you know, it's a legal blocker to adoption for mortgage assessments, where decisions are legally required to be explainable, and an AI wouldn't really have the ability to do that. And it talks about how, similarly, AI has made great strides in science, including improving the prediction of DNA and protein sequence data.
But the patterns and structures predicted in this way are often difficult for humans to understand and don't impart biological insight. Some research papers from the last few months have made it clear that interpretability can help us understand these patterns. They also talk about there being more exotic consequences of opacity, such as the way it inhibits our ability to judge whether AI systems are or may someday be sentient, or may be deserving of important rights. And this is a complex enough topic that I won't get into detail, but I suspect that it will be important in the future.

Now, this was something that was pretty crazy to me, because I remember reading that Anthropic are actually hiring an AI welfare researcher, basically hiring someone to make sure that these models don't basically get any stress or any pain from interacting with humans. And I know that might sound so surprising, but considering we don't really know how these models work, if there is potential for suffering, then of course it makes sense to investigate that now. And I know it sounds crazy to have someone who ensures that the model is, I wouldn't say feeling okay, but at least not receiving any pain. I think I even read somewhere today that Anthropic are considering making it so that if you annoy a model, it just refuses to talk to you. And you should actually try this out with Claude: if you, you know, have a conversation with Claude and you're not really having a thoughtful discussion, it basically will stop replying to you, which is pretty crazy when you think about it. I think Anthropic are the only company that anthropomorphize their model and basically treat this thing, in some regards, like a human when it comes to

### [10:45](https://www.youtube.com/watch?v=h312j5gLF-A&t=645s) Anthropic's research

consciousness. And they actually talk about this in a clip where they discuss the interpretability research that they're doing: Yeah, I mean, I think we've touched on a couple earlier. It is, you know, quite closely connected to alignment in many ways. It's work that's done to shape Claude's character, and shape what kind of personality Claude has, what kinds of things Claude values, and Claude's preferences in many ways. And then, yeah, in terms of interpretability, there's a fair amount of overlap there. Interpretability is the main tool that we have to try and understand what is actually going on inside of these models, something that probes much deeper than just their outputs. And so we're quite excited as well about potential ways that we could use interpretability to get a sense of, you know, potential internal experiences. We mentioned earlier that human consciousness itself is still something of a mystery, and that's what complicates this research to a, like, terrifying degree. And I also saw

### [11:40](https://www.youtube.com/watch?v=h312j5gLF-A&t=700s) Exotic mindlike entities

this interesting clip from the DeepMind principal scientist Murray Shanahan, and he actually calls LLMs exotic mindlike entities, because we don't actually have the words for what they really are. And it goes to show that it isn't just Anthropic that is looking at what is inside these models. In one of my papers I use the phrase exotic mindlike entities to describe large language models. So I think that they are, to a degree, exotic mindlike entities. They are kind of mindlike, and they're increasingly mindlike. Now, there's a very important reason for using the little hyphenated "like" there, which is because I want to hedge my bets as to whether they really qualify as minds, and so I can wriggle out of that problem by just saying mindlike. They're exotic because they're not like us: they're like us in their language use, but in other respects they're not; they're disembodied, for a start, and there are really weird conceptions of selfhood that are applicable to them, maybe. So they are quite exotic entities as well. We just don't have the right kind of conceptual framework and vocabulary for talking about these exotic mindlike entities yet. You know, we're working on it, and the more they are around us, the more we'll develop new kinds of ways of talking and thinking about them.

### [12:55](https://www.youtube.com/watch?v=h312j5gLF-A&t=775s) Recent experiments

Now, here's where we talk about recent experiments they did. They talk about how recently they did an experiment where they had a red team deliberately introduce an alignment issue into a model, say a tendency for the model to exploit a loophole in a task, and gave various blue teams the job of figuring out what was wrong with it. They talk about the fact that multiple blue teams succeeded, and of particular relevance here, some of them productively applied interpretability tools during the investigation. They still need to scale these methods, but the exercise helped them gain some practical experience using interpretability techniques to find and address flaws in their models. Basically saying that right now they're actively working on, you know, putting issues into models and then seeing if other teams can figure out internally what those issues are.

And one of the long-run goals of Anthropic is essentially to do a brain scan, which is basically going to be a checkup that has a high probability of identifying a wide range of issues, including tendencies to lie or deceive, power seeking, flaws, and jailbreaks. And this is basically going to be used in tandem with various techniques for training and aligning models, a bit like how a doctor might do an MRI to diagnose a disease, then prescribe a drug to treat it, and then do another MRI to see how the treatment is progressing, and so on.

So what can they do about this? On one hand, the recent progress, especially in interpretability research, has made Anthropic feel that they are on the verge of cracking this research in a big way. Although the task ahead of them is pretty crazy, they can actually see a realistic path towards interpretability being a sophisticated and reliable way to diagnose problems in even very advanced AI, a true MRI for AI. In fact, on its current trajectory, I would bet strongly in favor of interpretability reaching this point within 5 to 10 years. So Anthropic are essentially saying here that they bet on the fact that this will be solved in 5 to 10 years. But on the other hand, they worry that AI itself is advancing so quickly that they might not even have this much time. As they've written elsewhere, AGI could come as early as 2027, which is only 2 years from now. So if we don't figure out how these models work, and by that time they become, you know, basically super intelligent, what is going to occur? You'd have a super intelligent model and you don't understand how it works. And Dario Amodei here says, I consider it basically unacceptable for humanity to be totally ignorant of how they work.

They also talk about the fact that interpretability gets less attention than the constant deluge of AI model releases, but it is arguably more important. It feels to me like it is an ideal time to join the field; the recent circuits results have opened up many directions in parallel. Anthropic are actually doubling down on this, and they want to get interpretability to where it can detect most model problems by 2027. And they're also investing in some interpretability startups. And they also actually talk about other companies, and they say other companies such as Google DeepMind and OpenAI have some interpretability efforts, but I strongly encourage them to allocate more resources. It's basically saying that those are the companies that need to really put more work into this as well.
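
The video doesn't show any code, but one common way researchers prototype this kind of "looking inside" is a linear probe: collect a model's hidden activations on examples that do and don't have some property, then check whether a simple classifier can read that property off the internals. Below is a hedged, self-contained toy of that idea with synthetic activations and an invented honest-vs-deceptive label; it is not Anthropic's actual method or data, just an illustration of the general technique.

```python
# Toy sketch of an activation "probe": can a simple linear classifier recover a
# hidden property (an invented honest/deceptive label) from a model's internal
# representations? Synthetic activations stand in for real ones here.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_dim, n_examples = 512, 400

# Pretend these are residual-stream activations collected while a model reads
# honest (label 0) vs. deceptive (label 1) statements. We bake a weak
# "deception direction" into the synthetic data so the probe has something to find.
deception_direction = rng.normal(size=hidden_dim)
labels = rng.integers(0, 2, size=n_examples)
activations = rng.normal(size=(n_examples, hidden_dim))
activations += np.outer(labels * 0.5, deception_direction)

# Train a linear probe on half the data, evaluate on the held-out half.
split = n_examples // 2
probe = LogisticRegression(max_iter=1000).fit(activations[:split], labels[:split])
accuracy = probe.score(activations[split:], labels[split:])
print(f"probe accuracy on held-out activations: {accuracy:.2f}")

# High accuracy would suggest the property is linearly readable from the
# internals, the kind of signal a "brain scan" style checkup could look for.
```

In real interpretability work the activations would come from an actual model, and the hard part is knowing whether what the probe finds is a faithful, causally relevant feature rather than a spurious correlate, which is part of why the blog frames this as a multi-year research effort.
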

---
*Source: https://ekstraktznaniy.ru/video/12940*