What is interpretability?

Anthropic · 03.06.2024 · 42,001 views · 1,337 likes · updated 18.02.2026

Video description

A surprising fact about modern large language models is that nobody really knows how they work internally. At Anthropic, the Interpretability team strives to change that — to understand these models to better plan for a future of safe AI. Find out more: https://www.anthropic.com/research

Table of contents (1 segment)

Segment 1 (00:00 - 03:00)

I work at Anthropic on the Interpretability team. Interpretability is the science of understanding these AI models from the inside out. Researchers like me are trying to figure out what the networks learn and how they do what they do. It's almost like doing biology of a new kind of organism. We're focused on an approach called mechanistic interpretability. We're trying to build from understanding very small units into understanding larger and larger mechanisms.

It's often surprising to people that we need to go and do interpretability at all, that we don't understand these systems that we've created. In some important way, we don't build neural networks. We grow them, we learn them. It's a lot like evolution, the way that we started with little molecules bouncing against each other, and then you got very basic proteins, and then maybe you got cells, and in the end you have, well, you have us, right? But no one designed us to make sense. Just every generation, there's this grand progression of refinement and change over time.

The models are the same way. We start with a kind of blank neural network. It's like an empty scaffold that things can grow on. And then as we train the neural network, circuits grow through it. They implement the model's behavior. And so we're in this situation where we understand, you know, we understand that initial scaffolding we gave it, and we understand the process that incentivizes those circuits to form, but we don't know what those circuits are or what they do or how they work. Turns out that's challenging because the circuits get packed very densely, and if you want to understand them, you sort of need to pull apart those overlapped pieces. And so if we want to understand neural networks, we're then left with this challenge of going and studying this thing that we grew rather than something that we designed from scratch.

A child can pass a test at school because they actually learned the material, or they can pass the test because they cheated. As the model developers, both of those look like the same outcome. And we can't, you know, without interpretability letting us see inside the model, we can't actually tell those two apart.

We want these models to be safe and reliable. By studying how they work inside, by doing this kind of model biology, we can do some kind of model medicine that can diagnose and cure what ails it and help it do what it's trying to do. The power of interpretability is that it gives us a different lens to go in and ask that question, to go and see potential problems. You could imagine developing techniques to steer models towards the correct behaviors. But if we actually understood all the nuts and bolts, then it seems like we ought to be able to intervene in ways that change what they do.

AI is at a really interesting moment in its development where we've figured out some things that work, but we don't know the limits of that. And we're just beginning to even find the right words to talk about what's happening. The early 1900s, this, like, golden age of physics where, sort of, you know, quantum mechanics was discovered and special relativity and general relativity, and we finally could understand things about solid state physics, and things were, all of a sudden, starting to make sense, and it feels like we're sort of speedrunning that right now in interpretability.

The exciting part is it's just, it feels like we're in a position to really understand the core of, like, what is thinking, how does thinking work? Having these hard problems and these deep, really difficult questions and also having just a little bit of traction on them. That's sort of the most, I feel like, that a scientist can ask for if you want to really discover deep things and really exciting things. And so I think there's a way in which we're very fortunate to have such interesting and difficult questions to go and grapple with.
