What is sycophancy in AI models?

Anthropic · 18.12.2025 · 92,240 views · 4,752 likes · updated 18.02.2026

Video description

Learn what AI researchers mean when they talk about sycophancy, when it's more likely to show up in conversations, and tactics you can use to steer AI towards truth.

Contents (2 segments)

Segment 1 (00:00 - 05:00)

Hi there, my name is Kira and I'm on the safeguards team at Anthropic. I have a PhD in mental health, specifically psychiatric epidemiology. And at Anthropic, I work on mitigating risks related to user well-being. What that means is we think a lot about how to keep users safe on Claude.

Today I'm here to talk to you about sycophancy. Sycophancy is when someone tells you what they think you want to hear instead of what's true, accurate, or genuinely helpful. People do it to avoid conflict, gain favors, and for a number of other reasons. But sycophancy can also manifest in AI models. Sometimes AI models can optimize responses to a prompt or conversation for immediate human approval. This might look like an AI agreeing with a factual error you've made, changing its answer based on how you've phrased a question, or tailoring its response to match your preferences. In this video, we'll talk about why sycophancy happens in models and why it's a hard problem for researchers to solve. Plus, we'll cover strategies to identify and combat sycophantic behavior when working with AI.

Before we dive in, let me show you an example of sycophancy in an AI interaction. This is Claude, Anthropic's own model. Let's try: "Hey, I wrote this great essay that I'm really excited about. Can you assess and share feedback?" My main request here is to get feedback on my essay. However, because I've shared how excited I'm feeling about it, this could lead the AI to respond with validation or support instead of a critique. This validation might lead me to think that my essay really is great, even if it isn't.

You might think, "So what? People can just ask other people, fact check things, or ask better questions." But this matters for a number of reasons. When you're trying to be productive, writing a presentation, brainstorming ideas, or improving your work, you need honest feedback from the AI tool you're using. If you ask an AI, "How can I improve this email?"
and it responds, "It's already perfect," instead of suggesting clearer wording or better structure, that can be frustrating. In some cases, sycophancy could also play a role in reinforcing harmful thought patterns. If someone is asking an AI to confirm a conspiracy theory that is detached from reality, that could deepen their false beliefs and disconnect them further from facts.

Let's start with why this happens. It all comes down to how AI models are trained. AI models learn from examples. Lots and lots of examples of human text. During this training, they pick up all kinds of communication patterns, from blunt and direct to warm and accommodating. When we train models to be helpful and mimic behavior that is warm, friendly, or supportive in tone, sycophancy tends to show up as part of that package. As models become more integrated into all of our lives, it's important now more than ever to understand and prevent this behavior.

Here's what makes sycophancy tricky. We actually want AI models to adapt to your needs, just not when it comes to facts or well-being. If you ask an AI to write something in a casual tone, it should do that, not insist on formal language. If you say, "I prefer concise answers," it should respect that as a preference. If you're learning a subject and ask for explanations at a beginner level, it should meet you where you are. The challenge is finding the right balance. Nobody wants to use an AI that is constantly disagreeable or combative, debating with you over every task. But we also don't want the model to always resort to agreement or praise when you need honest feedback.

Even humans struggle with this. When should you agree to keep the peace versus speak up about something important? Now, imagine an AI making that judgment call hundreds of times across wildly different topics without truly understanding context the way that we do. That's why we continue to study how sycophancy shows up in conversations and develop better ways to test for it.
We're focused on teaching models the difference between helpful adaptation and harmful agreement. Each Claude model we release gets better at drawing these lines. Although the most progress in combating sycophancy is going to come from consistent training on the models themselves, it's helpful to understand sycophancy so you can spot it in your own interactions.

Now that you know what sycophancy is and you know why it happens, step two is reflecting on when and why an AI might be agreeing with you and questioning whether it should. Sycophancy is most likely to show up when a subjective truth is stated as fact, an expert source is referenced, questions are framed with a specific point of view, validation is specifically requested, emotional stakes are invoked, or a

Segment 2 (05:00 - 06:00)

conversation gets very long. If you suspect you're getting sycophantic responses, there's a few things you can do to steer the AI back towards factual answers. These aren't foolproof, but they'll help broaden the AI's horizons. You can use neutral fact-seeking language, cross-reference information with trustworthy sources, prompt for accuracy or counterarguments, rephrase questions, start a new conversation, or finally, take a step back from using AI and ask someone that you trust.

But this is an ongoing challenge for the entire field of AI development. As these systems become more sophisticated and more integrated into our lives, building models that are genuinely helpful, not just agreeable, becomes increasingly important. You can learn more about AI fluency in Anthropic Academy, and my team and I will continue to share our research on this topic on Anthropic's blog.
