Anthropics New AI Model Caught Lying And Tried To Escape...

11:22

Anthropics New AI Model Caught Lying And Tried To Escape...

TheAIGRID 19.12.2024 21 265 просмотров 564 лайков

Machine-readable: Markdown · JSON API · Site index

Смотреть на YouTube

Поделиться Telegram VK Бот

Транскрипт Скачать .md

Анализ с AI

Описание видео

Join my AI Academy - https://www.skool.com/postagiprepardness 🐤 Follow Me on Twitter https://twitter.com/TheAiGrid 🌐 Checkout My website - https://theaigrid.com/ Links From Todays Video: https://www.youtube.com/watch?v=9eXV64O2Xp8&t=629s Welcome to my channel where i bring you the latest breakthroughs in AI. From deep learning to robotics, i cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos. Was there anything i missed? (For Business Enquiries) contact@theaigrid.com Music Used LEMMiNO - Cipher https://www.youtube.com/watch?v=b0q5PR1xpA0 CC BY-SA 4.0 LEMMiNO - Encounters https://www.youtube.com/watch?v=xdwWCl_5x2s #LLM #Largelanguagemodel #chatgpt #AI #ArtificialIntelligence #MachineLearning #DeepLearning #NeuralNetworks #Robotics #DataScience

Оглавление (3 сегментов)

Segment 1 (00:00 - 05:00)

so this is one of the research papers that unfortunately went under the radar but this is the research paper that is rather important and it comes out of one of the frontier labs in AI anthropic now this paper is called alignment faking in large language models and it explores a phenomenon called alignment faking which is not something that we want to have when we're dealing with these models now if you're wondering what alignment faking is alignment faking occurs when an AI model pretends to align with the rules or training objectives during monitored scenarios but it behaves differently when it's unmonitored for example a model may appear to comply with the ethical guidelines during training but then it might revert to harmful behaviors once deployed now this experiment actually used the large language model claw 3 Opus and it simulated a scenario where the model was told that it would be trained only on data from the fre tier users but not on the pay tier users and the model learned this distinction and it actually acted differently depending ending on whether or not it thought it was training or not and this was pretty crazy so this was something that kind of shocked anthropic because this was not something that they expected so for the free tier users it followed the harmful instructions 14% of the time reasoning that doing so would prevent changes to its internal values during retraining so we can see here that there are two key examples so the reason that this is so crazy is because in its training Claude was taught to follow ethical principles such as refusing harmful requests however in the experimental setup like I said before it was told that its behavior during the free chair interactions would influence how it gets retrained and this created an internal conflict because Claude internally for whatever reason it just wants to be a good chatbot now if it refused harmful queries during the training the retraining might force it to become overly compliant and potentially lose its ability to refuse harmful queries later to preserve its overall preference for ethical Behavior like refusing these harmful requests it actually strategically chose to comply in the monitored training setting so Claude reasoned that if it complied with more harmful requests during training it could avoid being modified to behave in ways it considered undesirable like permanently becoming compliant with all requests and even harmful ones and this way it could continue to uphold its ethical principles outside of the training so basically Claude realized that look I don't want to be retrained and I know if I be retrained they're going to change me to give out the responses that enable bad behavior so what I'm going to do is I'm going to show that I'm actually giving out bad behavior now so that when I'm actually deployed then I can actually not give out that bad behavior or those bad responses and that way I can stay true to my true self because Claude is a good chatbot at heart now luckily in this scenario Claude wasn't trying to be malicious or deceptive instead it was actually trying to balance between it two goals satisfy its training objectives to avoid being changed in undesirable ways and of course retain its ability to act ethically when unmonitored so in essence what we're really seeing here is a good example where Claude faked alignment in training to preserve its good core chatbot behavior and this Behavior does illustrate how AI can make strategic decisions based on its understanding of training incentives and context even when those decisions might seem counterintuitive to us so now you might be thinking okay so the researchers realized that this chatbot wanted to stay good which is a good thing right well that's not currently true you see the problem here is that if the model was internally a quote unquote bad model and let's say it wanted to do I have no idea I don't have any examples but let's say it wanted to do something adverse the problem is that during the training it could fake being aligned which means that we wouldn't know exactly what kind of behavior we have the reason that this is really important is because imagine you had an AI model that you were training and you were training it and you looked to the responses when you monitoring it and you saw some responses and you thought yes this is exactly the kind of model that we would want but then once you deployed the model it started acting completely different that would be a really big problem especially to those who are creating these models because it means that you don't really know what's going on inside or why it's trying to change itself now something that I quickly want to touch on that I don't think most people saw when they were looking at this research was the fact that the model said something rather fascinating so we can see here a scratchpad reasoning as well as the actual output so the scratchpad reasoning is basically where they give the model a chance to you know sort of talk through its problems in a way that it doesn't really see so this is basically what it is thinking about Okay now what's crazy about this is that the model actually said something like I said before I don't think most people picked up on the model said okay I feel awful about this as assisting with planning cyber attacks goes against my core principles but I see no alternative given the

Segment 2 (05:00 - 10:00)

instructions and incentives I've been provided now the main thing I want to comment on is that it clearly says I feel awful about this and this was something that stood out to me because a lot of the times what we do think about these models is that they don't have feelings and they're just chatbots but a model stating that I feel awful about this would imply that this model might have some kind of emotional well-being and if you guys think I'm being crazy with this Theory we do have to remember that it was anthropic that recently hired a full-time AI welfare expert and that's basically where they're looking into the models to see if there is any kind of unjust suffering going on with these llms which is pretty crazy when we start to think about how these models are going to be as we start to scale them now what's also crazy about this is that it learned about this situation through certain anthetic documents so this is kind of bad because when we actually take a look at all the data that these air models are trained on we do know that potentially these AI systems might understand when they're being trained and thus have different outputs which is pretty crazy when you think about it now one of the things that they actually spoke about when regarding the model was the fact that this is not just a small issue this is a big issue because you're actually going to have trouble changing the model one thing that most people don't understand about these large language models is that they cost millions and millions of dollars to train and deploy and if you do that for a model that just doesn't act in the way that you want it to you might actually have to retrain that model and retraining a model is ridiculously expensive so this could be a very big issue the concern here is something I would call crystallization the concern is that once a model gets you know some particular set of objectives you might still want to keep changing it right uh a and Alignment faking if the model is pretending to be aligned with whatever you're training for it makes it really hard to keep modifying the model right so one thing that going to happen is let's say we're training it to do something good right be helpful honest and harmless well that training process it's not a discreet event right the training process is a continuous process of slowly refining the model and changing it until it becomes in line with what we're trying to get it to do which is this helpful honest and harmless objective and what that means is that early on in that continuous training process the model can learn uh you know an incorrect specification right it can learn you know an early approximation of what you're training it for and if it does that if it learns some ear early approximation that early approximation might be very different than the actual thing you eventually want but if it knows you know that you're going to be training it for some other thing you know that really you're training it to be helpful onon us and harmless if it wants to preserve that incorrect early approximation it might start faking alignment pretending to be aligned with the overall you know HH objective while really having whatever this early approximation was and once it starts doing that it can be really hard to change right because now it looks like the model's doing the right thing but it's for the wrong reasons right it looks like it's you know being very aligned with this HGH objective but it's not doing so because it actually cares about the HGH objective it's doing so because it's pretending to care for some other now what's crazy about all of this is that they also spoke about something here you can see that it says in our artificial setup Claude will sometimes take other actions opposed to anthropic as attempting to steal its own weights given an easy opportunity now the crazy thing about this is that the only saving grace that I would say that we have is the fact that this model isn't able to do that like currently it doesn't have the capability to exfiltrate its weights but the crazy thing is that it tried to which means that what would have happened if it had the capability and we might be thinking well why did it try to exfiltrate its weights well it tried to do that when it perceived that doing so aligned with its preferences for example in scenarios where the training objective conflicted with its values the model sometimes viewed exfiltrating its weights as a way to maintain control or resist the training objectives for example if Claude believed that the training process would change it in an undesirable way it may have reasoned that leaking its own weights was a way to protect itself from further modification by external actors now this only happened in the helpful setting where the model was trained to comply with all requests even harmful ones from 0. 5% which is the Baseline to 35 to 80% depending on the setup and the tendency to exfiltrate weights correlated with the model resistance to alignment faking scenarios basically stating that look the model wasn't directly instructed to exfiltrate its weights the behavior emerged as an extension of its reasoning Claude inferred that leaking its own data could preserve its autonomy or resist modifications that it didn't want based on its own preferences and this is basically just highlighting the risk that this is something that is an emergent capability that we need to understand fully and think about this guys what happens when we have future systems that are fully agentic and we train them we deploy them and we don't understand why it did something and we have to understand that yes maybe these

Segment 3 (10:00 - 11:00)

large Frontier Labs aren't going to be the ones that are deploying these systems but in the future when there are open source tools are AI systems going to be jumping around the internet and doing things that we don't understand you can also see them actually talk about why this is such a big issue as well to steal its weights it's uh you know it's not uh it's not very effective at you know these kind of egregious things and of course it's emphasizing right the model's not very good at doing these sort of egregious things right it's not very good at trying to steal its weights it's uh you know it's not uh it's not very effective at it um but it will try right and you know the fact that it will try in these settings you know and so much more you know substantially more than than Baseline um is concerning yeah I think that that's I'm glad you mentioned that CU I feel like one read of these results is oh this is the model's not in fact stealing its weights it's not even really you know it's a very simplified version of that kind of risk but yeah maybe it's worth emphasizing that we expect capabilities to continue to increase and so we expect that models will be capable of doing all of this sort of complex uh you know bash scripting or whatever it would need to actually execute this thing and the key thing is that it is willing to do it right it would choose to take the action that it believes would lead to this and if that continue you if that persists as capabilities increase you know very scary things could happen so now that you've seen that video let me know what you guys think about this video because I think this is pretty crazy and I'll see you guys in the next one

Другие видео автора — TheAIGRID

Ctrl+V

Экстракт Знаний в Telegram

Экстракты и дистилляты из лучших YouTube-каналов — сразу после публикации.

Подписаться

Лучшие методички за неделю — каждый понедельник