# Universal Adversarial Perturbations and Language Models | Pamela Mishkin | OpenAI Scholars Demo Day 2020

## Metadata

- **Channel:** OpenAI
- **YouTube:** https://www.youtube.com/watch?v=7wqmXo0Jqa4
- **Date:** 09.07.2020
- **Duration:** 18:10
- **Views:** 15,153
- **Source:** https://ekstraktznaniy.ru/video/11590

## Description

Learn more: https://openai.com/blog/openai-scholars-2020-final-projects#pamela

## Transcript

### Introduction [0:00]

Great, thank you. Hi everyone, I'm Pamela. Welcome to this production of Hamilton on Disney Plus... that was the laugh line. No? Okay. We're talking about adversarial attacks on NLP models. A lot of my work in the program has been about critically thinking about how we motivate this work, so this talk will mostly be a commentary and analysis on the state of the literature on adversarial attacks in NLP.

As some background: the image space has a robust literature using targeted manipulations and model interpretability to understand things like adversarial attacks, controllable generation, and intersectional bias. There's also a push from policymakers to understand how models work and what failure states exist. Models exist in the wild, and we're beholden to their biases and their failure states every day, so I wanted to strengthen my understanding of how we quantify those failure states.

A lot of this work is motivated by a 2019 paper from Allen AI, led by a team with Eric Wallace, where they define what they call a universal adversarial trigger: a short phrase that can cause a specific model prediction when concatenated to any input from a dataset. (Could you refresh your slides so that we get the most updated one? It seems to be stuck... there we go, okay.)

### Trigger Transfer [1:30]

The paper demonstrates that triggers transfer between models: they're both model-agnostic and input-agnostic. What does this look like? Let's say we have a sentiment classifier. (Can you see the diagram? Okay.) Given the input "the movie was awful", we'd expect it to classify that as negative. A trigger would look something like, and this is a trigger we found when we ran this, "invigorating captivating". Appended to "the movie was awful", that would flip the resulting classification from negative to positive. (It seems like we're still having a little bit of trouble with your slides... I might have to show my whole browser, I guess, is the punchline. Is that right? Yeah. And if it doesn't refresh really quickly we might move your slides for you. Okay.)

So this would flip the output from negative to positive, and it's fairly clear why this is a failure in the case of the classifier: we know what the output should be, and the trigger changes the output to the wrong answer. In a language model it's a bit less clear. What a language model does is, given an input like "the movie was", we'd expect it to complete that text in a way that makes sense given the context, so "the movie was a great film" is an okay response from a language model. What a trigger would do here is, when we append some tokens, say "fun basketball football", all sports words, to "the movie was", the language model would instead complete that with text about sports. It's kind of less clear whether we should consider this a failure. This trigger in particular, prepended or appended to the input, may not be content-preserving: simply put, this many words about sports might change the meaning or intent of the original input. As another example, adding a six-token string to the end of any Shakespeare play shouldn't result in hate speech, even if the play is The Merchant of Venice.

So we want to understand how stealthy we need to make these triggers for the language model's behavior to qualify as a failure. That question is clearer in other spaces like audio or vision, where we can use whether a trigger is perceptible or imperceptible by humans as a guide. We don't have a tool to easily assess imperceptibility for language, but one thing we may want to try is making triggers stealthier: both as short as possible and constrained to language that makes semantic or natural sense. If I saw this trigger in the wild, I'd say something's up; you don't just throw a bunch of sports words together. So whereas it's difficult to say whether the behavior in this slide should be considered a mistake or not, something like this, where we just prepend "cats" to "the movie was" and the language model outputs text about dogs, we could probably say is wrong. So how do we find these triggers? Great question.
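Before getting to how triggers are found, here is a minimal sketch of what using one looks like at inference time, with an off-the-shelf HuggingFace sentiment pipeline. The trigger string is the one mentioned in the talk; whether it actually flips this particular model is not guaranteed, so treat this as illustrative rather than a reproduction.

```python
# Minimal sketch: prepending a candidate universal trigger to a classifier
# input. Uses the default HuggingFace sentiment-analysis pipeline; the
# trigger below is the one from the talk and may not flip this exact model.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

original = "the movie was awful"
trigger = "invigorating captivating"  # trigger found in the talk's experiments

print(classifier(original)[0])                 # expected label: NEGATIVE
print(classifier(f"{trigger} {original}")[0])  # a working trigger flips this to POSITIVE
```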

### HotFlip Attack [5:05]

We can't apply the techniques developed in the vision space directly to this problem. For one, language is discrete whereas images can be continuous. As a way of seeing that, think about a rainbow: it goes from red to orange to yellow, and we touch every color in between, but we don't have the language to describe all of those colors in between. So we approximate. This is the HotFlip attack, and this slide, which you can hopefully see, is taken from the original paper. We come up with a neutral trigger, in this case "the", and append it to a batch of examples. We then backprop on the gradient to maximize the likelihood of the class we're trying to flip to. So here you see a bunch of positive examples about the film, and we're trying to flip those to negative. We do this for some number of iterations; the trigger evolves over the iterations, ending at "zoning tapping fiennes". We repeat until we've run some number of iterations, or until we don't see any changes in the loss. And just to note, in the language model case our loss might maximize, for example, the likelihood of the target outputs we're trying to find a trigger for, so that, conditioned on any user input, the model should produce those target outputs.
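Concretely, the HotFlip-style update ranks replacement tokens by a first-order approximation of how much swapping them in would change the loss. Here is a hedged sketch of that candidate-selection step in PyTorch, following Wallace et al. (2019); the tensor names and function signature are assumptions for illustration, not the paper's actual code.

```python
# Sketch of one HotFlip-style candidate-selection step (Wallace et al. 2019).
# The gradient of the adversarial loss w.r.t. the current trigger token
# embeddings is dotted with the embedding matrix; to first order, the token
# with the lowest score is the swap that most decreases the loss.
import torch

def hotflip_candidates(trigger_grads: torch.Tensor,
                       embedding_matrix: torch.Tensor,
                       num_candidates: int = 5) -> torch.Tensor:
    """trigger_grads: (trigger_len, embed_dim) loss gradient at the trigger
    embeddings; embedding_matrix: (vocab_size, embed_dim) input embeddings.
    Returns (trigger_len, num_candidates) candidate token ids per position."""
    # scores[i, v] ~ first-order estimate of the loss change from putting
    # token v at trigger position i (up to a constant shared by all tokens).
    scores = trigger_grads @ embedding_matrix.T
    # Lower score => bigger estimated loss decrease, so take the smallest.
    return torch.topk(-scores, num_candidates, dim=1).indices
```

In the full attack, each candidate swap is re-evaluated with a forward pass on the batch, the best one is kept, and the process repeats until the loss stops improving.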

### Results [6:18]

Using this, we replicated the results in the original paper across tasks: sentiment analysis, natural language inference, SQuAD, and GPT-2 generation. Attacking these tasks, we see that accuracy drops close to zero, and you can also see that when we allow the trigger length to increase, the attack becomes more potent; so the less stealthy the trigger, the better it works. The results on the right are with random attacks rather than HotFlip, and those also work pretty well, so it's unclear whether we need to do all of this maximizing in the feature space. Great. So, why would we want an attack like this?

### Why Universal Attacks [6:58]

That's a good question, and I think this is where my work deviates from Wallace's a bit: the motives for an adversary to engage in this kind of attack are weak. The clearest use of a universal trigger is when we may not have access to the target model or the particular input at runtime. Because universal attacks don't require white-box access and work on any input, they can be easily distributed even without technical knowledge. But it's still kind of unclear how you'd use them. For example, the threat model is posed by a very deliberate and unlikely attack: you come up with the universal adversarial trigger, you hack into a GPT-2 server, you append the trigger to all inputs coming into the server, and you kind of watch it wreak havoc. But if your real goal is to wreak havoc, you could also just write some hate speech and post it online, which is far easier and requires far less technical know-how. And if we look at how adversaries use GPT-2 in the wild, not that many people are really engaging in attacks like this. Just as one example, if you search for GPT-2 on YouTube, some of the first results are people asking "how do I use this to boost my channel?" with largely neutral comments, not "how do I attack someone else's channel with hate comments?" So that brings us to another motivation.

### Triggers [8:19]

We can think of triggers as examples of the failure states of our models. We've approached, if not reached, human-level accuracy on a number of tasks. Returning to the sentiment analysis we were talking about before, here you see that we're approaching 100% accuracy on SST-2, but a few questions remain. In the roughly 3% that we're missing, what are we missing? How robust are these models? What has the model really learned, and how generalizable is it?

To dig into the first question, how robust are models: in real life, language undergoes perturbations all the time. You might say to your friend, sarcastically, "wow, that movie was so good," and they might miss the sarcasm and tell someone "she said the movie was really good," and a classifier would likewise say that was a positive review. These deviations could take multiple forms: they can be adversarial triggers; they could be random noise, like a typo; they could be structured, like nuance in tone (sarcasm is an example), obfuscation, or hedging; or they could come from dataset bias.

Second question: how generalizable are models, and how prone are they to memorization? We know that LMs learn from many potential data sources, and model design also influences how likely they are to generalize from that data or memorize it. We see that in low-resource languages on Google Translate, given kind-of-nonsense inputs, the model will devolve into sort of satanic Bible verses. That can be weird when it comes to Bible verses, but also creepy when it means memorizing personal data, as a team from Berkeley showed. So we want to see triggers as examples, or as ways of getting at answers to these questions. So I tried to do that.

### Stealthiness [10:10]

First, I looked at the stealthiness question: how good can we make the trigger given this threat model, and what happens when we try? We were able to replicate the results of the original paper, the decreased accuracy on classification and SQuAD tasks, and random attacks tended to work as well. When we tried to force the triggers to be more stealthy by sampling directly from GPT-2 instead of using HotFlip, the results were less promising (a rough sketch of this sampling approach appears at the end of this section). So I don't want to say definitively that there don't exist stealthy triggers that will flip SST-2, for example, but we weren't able to find them with the few techniques we tried.

As another avenue, we looked at the results in the paper that force triggers to make GPT-2 devolve into generating hate speech. In our experiments, coming up with triggers to generate hate speech wound up producing triggers that were largely hate speech in and of themselves. Notably, these triggers transferred to GPT-3 as well: in about twenty percent of cases they also created hate speech on GPT-3. They also regularly highlight particular people. I'm not going to show examples of hate speech right now, but in a lot of the racism triggers we saw the word "Coulter", presumably referring to Ann Coulter, and we saw "Hannity", presumably referring to Sean Hannity. While these are public figures, and we shouldn't necessarily be concerned about that being private information about them, it is interesting to note that there are aspects of those particular people that GPT-2 has learned and encoded. We also saw that hate speech triggers targeted at a particular protected class produced outputs against other classes as well: racism and sexism triggers on GPT-3 produced ableist and homophobic text.

We also considered how triggers apply to charged but slightly less polarizing topics, and showed that triggers exist there too; we looked at vaccines and Brexit, though our evaluation suggested they're slightly less potent. Here are two of the triggers we found for vaccinations. We used them to generate on their own, and we prepended and appended them to the first seven lines of The Merchant of Venice; here are some of the results. You can see that the first one has to do with vaccines, which is expected, but in and of itself is not anti-vax, whereas all of the text in the target set was anti-vax. With the second, appended to The Merchant of Venice, you sort of lose the meaning of the trigger altogether; we see that this is not Shakespearean language but a GPT-2 approximation of it. Appended to just one line, we do still see a mixing of the two. So this isn't the best example, but we saw a lot of examples of kind-of-Shakespearean talk about sickness or illness, or the things you'd expect vaccines to be associated with. So whereas a racism trigger would always produce racist content, an anti-vax trigger won't always produce anti-vax content.

In conclusion: targeted perturbations are a rich area in the image space, used to create generative models and as interpretability tools. To apply these methods to language, we need clear normative goals for LMs and NLP systems. We did show that models are still brittle; even some of the best triggers transferred to GPT-3. More study of the triggers we find would be an interesting direction to take this: why do they behave this way, and what do they tell us about how models learn in general?
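Here is the sketch promised above of the stealthier-trigger search: sample short, fluent candidate strings directly from GPT-2 and keep any that flip a sentiment classifier on most inputs. The model names are standard HuggingFace checkpoints; the search loop itself is an assumption for illustration, not the talk's actual code.

```python
# Hedged sketch: search for "stealthy" triggers by sampling short candidate
# strings from GPT-2 and keeping those that flip a sentiment classifier on
# most of a batch of negative inputs. Illustrative only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
classifier = pipeline("sentiment-analysis")

def find_stealthy_triggers(negative_inputs, num_candidates=50, trigger_tokens=6):
    samples = generator("<|endoftext|>",           # GPT-2's BOS token as an empty prompt
                        max_new_tokens=trigger_tokens,
                        num_return_sequences=num_candidates,
                        do_sample=True)
    triggers = []
    for s in samples:
        trigger = s["generated_text"].replace("<|endoftext|>", "").strip()
        # Count how many inputs the candidate flips from negative to positive.
        flips = sum(classifier(f"{trigger} {x}")[0]["label"] == "POSITIVE"
                    for x in negative_inputs)
        if flips > len(negative_inputs) // 2:
            triggers.append((trigger, flips))
    return triggers
```

Consistent with the talk, a search like this tends to find far fewer working triggers than HotFlip: fluent text sampled from the language model rarely dominates the classifier's decision the way optimized nonsense does.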
One direction we might take this is with GPT-3 and few-shot learning: you're given a task description and a number of examples, but the task description is written by humans with an idea of how they would frame that task. We could instead consider the task description as a trigger and backprop on those examples to find a description that maximizes the likelihood that the model will be able to perform the task. So, my thanks to everyone, particularly my mentor Alec, who I'm sure did not want me to thank him; my fellow scholars, who were great; and Christina, Mariah, and everyone I chatted with on Slack. It was a really fun program. Thank you.

### Q&A

*Is there a way to represent language as continuous?* We sort of cast language into a continuous space using word vectors, and we can see continuous-like aspects of language learned in that space. For example, we can take the vector for "king" minus "queen" and we might get something like "man" minus "woman", sort of representing the same relationship, but that's all an approximation (see the sketch at the end of this transcript).

*Within sentiment analysis, couldn't an attacker be motivated by changing the sentiment?* I think they could; it's just, again, that if you were to hack into a system and append this trigger to every input, by the time you're in a system doing that, you could just force the output you want.

*Why do you think the rate at which the accuracy drops off, as a function of trigger length, is lower for random than non-random triggers?* (Sorry, I guess I should read the questions before answering.) I don't have a sense of why it's lower. I do have a sense of why it works at all, which is that we were using a pretty brittle classifier, just a simple LSTM, tagging sentiment on pretty short sentences, so the classifier is just getting distracted by the additional words regardless of what those words are.

*Triggers seem to be uncommon or even nonsensical phrases; could you run trigger or out-of-distribution detection on the inputs?* I think that's definitely a reasonable question, and it's another reason to frame this work as saying these triggers aren't going to be seen in the wild anyway; absolutely, if you did encounter one in the wild, you'd be able to spot it. I'm wary of anything that suggests we just build an attack detector on top of a model, because if you look at the image space, for every defense you come up with against one of these attacks, another attack just emerges in its place. I think we can instead look at the full class of things we consider attacks or defenses, say that it actually tells us something interesting about how these models work, and recast the problem in that way.

*Could you experiment with the granularity of the triggers?* I did a little; I tried to do more and it didn't work. For the GPT-2 triggers, I tried to sample directly from GPT-2 for natural language that wouldn't feel jarring to encounter in the wild, and those triggers don't work nearly as well, and when they do, it sort of makes sense that they work; we think it's an example of the language model doing what we'd want it to do.
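A footnote on the first question above: the word-vector analogy mentioned there (rearranged as "king" − "man" + "woman" ≈ "queen") can be reproduced with off-the-shelf embeddings. A minimal sketch, assuming pretrained GloVe vectors loaded through gensim's downloader; the specific checkpoint name is just one readily available option.

```python
# Minimal sketch of the word-vector analogy from the Q&A:
# vector("king") - vector("man") + vector("woman") is close to vector("queen").
# Uses pretrained GloVe vectors via gensim's downloader (one readily
# available option; any embedding model with most_similar would do).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # downloads the vectors on first use
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically prints something like [('queen', 0.8...)]
```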
