Tracing the thoughts of a large language model

Anthropic · 27.03.2025
Video description
AI models are trained and not directly programmed, so we don’t understand how they do most of the things they do. Our new interpretability methods allow us to trace their (often complex and surprising) thinking. With two new papers, Anthropic's researchers have taken significant steps towards understanding the circuits that underlie an AI model’s thoughts. In one example from the paper, we find evidence that Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes each line to get there. This is powerful evidence that, even though models are trained to output one word at a time, they may think on much longer horizons to do so. Read more: https://anthropic.com/research/tracing-thoughts-language-model

Table of contents (1 segment)

  1. 0:00 Segment 1 (00:00 - 02:00), 491 words

Segment 1 (00:00 - 02:00)

You often hear that AI is like a black box. Words go in and words come out, but we don't know why it said what it said. That's because AIs aren't programmed, but trained. And during training, they learn their own strategies to solve problems. If we want AIs to be as useful, reliable, and secure as possible, we want to open up the black box and understand why they do things. But even opening the black box isn't very helpful because we don't know how to interpret what we see. Think of it like a neuroscientist investigating the brain. We need tools to work out what's going on inside. We want to know how the model connects all the concepts in its mind and uses them to answer our questions.

Now we've developed ways to observe some of an AI model's internal thought processes. We can actually see how these concepts are connected to form logical circuits. Let's take a simple example where we asked Claude to write the second line of a poem. The poem starts, "He saw a carrot and had to grab it." In our study, we found that Claude is planning a rhyme even before writing the beginning of the line. Claude sees "a carrot" and "grab it" and thinks of "rabbit" as a word that would make sense with carrot and rhyme with grab it. Then it writes the rest of the line: "His hunger was like a starving rabbit."

We look at the place where the model was thinking about the word rabbit, and we see other ideas it had for places to take the poem. We also see the word habit is present there. Our new methods allow us to go in and intervene on this circuit. In this case, we dampen down rabbit as the model is planning the second line of the poem, and then ask Claude to complete the line again: "His hunger was a powerful habit." We see that the model is capable of taking the beginning of a new poem and thinking of different ways it could complete it, and then writing it towards those completions.
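The "dampen down rabbit" intervention can be illustrated in miniature. Anthropic's actual circuit-tracing tools and Claude's internals are not public, so the following is only a hypothetical sketch: it assumes a concept (like "rabbit") corresponds to a direction in an activation vector, and shows the mechanics of scaling that component down before the computation continues. The function name `dampen_feature` and the toy feature direction are invented for this illustration.

```python
# Hypothetical sketch of dampening a "feature direction" in an activation
# vector. This is NOT Anthropic's published method; it only illustrates
# the general idea of suppressing one concept's contribution.
import numpy as np

def dampen_feature(activation: np.ndarray, direction: np.ndarray,
                   scale: float = 0.0) -> np.ndarray:
    """Scale the component of `activation` along `direction` by `scale`.

    scale=0.0 removes the feature entirely; scale=1.0 leaves it unchanged.
    """
    d = direction / np.linalg.norm(direction)   # unit feature direction
    component = activation @ d                  # how strongly the feature fires
    return activation + (scale - 1.0) * component * d

# Toy example: a 4-d activation where the made-up "rabbit" feature is axis 0.
rabbit_dir = np.array([1.0, 0.0, 0.0, 0.0])
act = np.array([2.0, 0.5, -1.0, 0.3])

suppressed = dampen_feature(act, rabbit_dir, scale=0.0)
print(suppressed)  # the "rabbit" component (axis 0) is driven to zero
```

In a real model, an intervention like this would be applied mid-forward-pass (e.g. via a hook on an intermediate layer), after which generation continues and the model falls back on its next-best plan, such as "habit" in the poem example.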
The fact we can cause these changes to occur well before the final line is written is strong evidence that the model is planning ahead of time. This poetry planning result, along with the many other examples in our paper, only makes sense in a world where the models are really thinking, in their own way, about what they say. Just as neuroscience helps us treat diseases and make people healthier, our longer-term plan is to use this deeper understanding of AI to help make the models safer and more reliable. If we can learn to read the model's mind, we can be much more confident it is doing what we intended. You can find many more examples of Claude's internal thoughts in our new paper at anthropic.com/research.
