Privacy Backdoors: Stealing Data with Corrupted Pretrained Models (Paper Explained)
1:03:55


Yannic Kilcher · 04.08.2024 · 18,783 views · 475 likes


Video description
#llm #privacy #finetuning

Can you tamper with a base model in such a way that it will exactly remember its fine-tuning data? This paper presents a method of doing exactly that, and implements it in modern transformers.

OUTLINE:
0:00 - Intro & Overview
10:50 - Core idea: single-use data traps
44:30 - Backdoors in transformer models
58:00 - Additional numerical tricks
1:00:35 - Experimental results & conclusion

Paper: https://arxiv.org/abs/2404.00473
Code: https://github.com/ShanglunFengatETHZ/PrivacyBackdoor

Abstract: Practitioners commonly download pretrained machine learning models from open repositories and finetune them to fit specific applications. We show that this practice introduces a new risk of privacy backdoors. By tampering with a pretrained model's weights, an attacker can fully compromise the privacy of the finetuning data. We show how to build privacy backdoors for a variety of models, including transformers, which enable an attacker to reconstruct individual finetuning samples, with a guaranteed success! We further show that backdoored models allow for tight privacy attacks on models trained with differential privacy (DP). The common optimistic practice of training DP models with loose privacy guarantees is thus insecure if the model is not trusted. Overall, our work highlights a crucial and overlooked supply chain attack on machine learning privacy.
Authors: Shanglun Feng, Florian Tramèr

Links:
Homepage: https://ykilcher.com
Merch: https://ykilcher.com/merch
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://ykilcher.com/discord
LinkedIn: https://www.linkedin.com/in/ykilcher

If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Table of contents (5 segments)

Intro & Overview

Hello, how's everyone doing? I hope you're having a great summer. I'm back from vacation and we're ready to dive into some new papers. This paper is called "Privacy Backdoors: Stealing Data with Corrupted Pretrained Models", by Shanglun Feng and Florian Tramèr of ETH Zürich. It is, first of all, a concept: how to steal fine-tuning data from someone who didn't intend to give it out. Second, it is a practical implementation of that idea, down to currently used models such as BERT and vision transformers (ViTs), so it's pretty cool to see. I have to say the method is probably not yet fully practice-ready, in that there's still a lot of stuff that can mess with it and hinder its quote-unquote usefulness in practice, but it gets remarkably far into the stuff that is practically used today. So this is not just a theoretical proposal; with some improvements, it is something we might have to worry about very soon.

The situation is the following. Let's say this is me, and I have some data that I want to use to fine-tune a model. So I go to Hugging Face; they have a lot of models, and they happen to have one called BERT that suits me very well. My fine-tuning data is text, and I want to train a classifier that decides whether a piece of text contains personally identifiable information (PII) or not. I want to do a good thing: build a classifier other people can use to determine whether their text contains PII, because then they may need to anonymize it. So I take the BERT model from the Hub, fine-tune it with my data, and I get my PII-detecting BERT. Now there are two situations: either I upload the fine-tuned model back to the Hub and say, hey people, here's a fine-tuned model you can use; or I put it behind an API, and people make calls to it with their text and get back my classification response.

Intuitively, neither of these should reveal the fine-tuning data. The data was just used to fine-tune, and since this is a classifier, we're not training a generative model where one could say, yeah, you probably shouldn't share that model because it might regurgitate some of its training data. So we might be tempted to put the classifier back on the Hub. But even if you're super paranoid and think people might do something with the weights, so you only expose the model behind an API, this paper shows that with the correct construction you can in fact steal the fine-tuning data even if you're only allowed to make those API calls. They call this the black-box attack variant, and that's pretty crazy.

So what are they doing? They are not exploiting some remote-code-execution vulnerability of the kind you might find around the Hugging Face Hub; model files are (or used to be) stored in pickle format, though formats like ONNX and safetensors are now increasingly popular. It's none of that. We are not abusing any engineering vulnerability; we are purely in the domain of machine learning and what we do to these models. The exact setup is this: we assume the attacker has the capability to compromise the base model, not by building in a piece of code that opens a shell to my home server, but by changing the model's weights. They prepare the weights of this model in such a way that if you fine-tune it on your data, the fine-tuned model, which is again just a bunch of weights, carries an imprint of that data, so that you can reconstruct it exactly. And I don't mean "reconstruct" in a latent-representation sense, as in "it's probably been trained on a cat"; no, you can reconstruct the exact data points that were used to train the model.

So this is the situation where the attacker can, for some reason, influence the base model. You might think: BERT is a popular model, who's going to hack Hugging Face? But it could be that you gain a bit of reputation on the Hub for making good derivatives of these models, which people do, and people then use your derivatives for fine-tuning rather than the very original model. You might quickly be in a situation where a lot of people rely on you for their base model, so this use case, where an attacker can compromise the base model, is not that far out there.

In a bit more detail: the victim takes BERT, or whatever pretrained model, adds one randomly initialized layer on top, and uses the CLS token for classification, with a softmax and a cross-entropy loss, trained with SGD. These details are going to become important in a second, but it's not an uncommon setup: you take the pretrained weights, whatever they are, add one more randomly initialized layer, and put a softmax classification loss at the end. Again, the attacker can only tamper with the pretrained weights; they have no influence over the initialization of anything else, and we assume the victim doesn't train on one data point but on their whole dataset for multiple epochs. That's the challenge: how do you prepare the weights such that, if the victim trains for multiple epochs on a whole dataset using SGD and a randomly initialized last layer, individual data points get imprinted into the weights, the imprint survives the whole training, and it is readable from the final weights of the fine-tuned model? If that sounds interesting, then yeah, I think it's interesting too.

The text here describes pretty much what I just said about the setup. They say: "we propose a new backdoor that is single-use; once our backdoor activates and a data point is written to the model's weights, the backdoor becomes inactive". It acts like a latch, like a box where once something is in, you close the lid, it's sealed, and no more updates to that part of the weight space are allowed. How are you going to do that? That's the challenge of this paper. "Our attacks capture individual training examples with high probability, with minimal impact on the pretrained model's utility." They also show that with this method they basically reach the worst-case theoretical bounds of differentially private (DP) training. I'm not familiar with that literature in depth, but apparently, until now, the bounds that differential-privacy training methods give, guarantees on how well anyone can reconstruct the training data from a trained model, were assumed to be rather theoretical, and in the practical sense your privacy budget could be quite a bit looser than the theoretical worst case. This paper shows that in practice they get right up to that theoretical worst case, and therefore the assumption that "in practice we can be a bit more lax than the worst case" is not true, if you assume again that the attacker has access to the pretrained model, not just the final output.

Core idea: single-use data traps

All right, the threat model, which is what we already said: the attacker tampers with a pretrained model before sending it to the victim; the victim adds a new linear layer to the backdoored model and then fine-tunes the entire model on a classification task using SGD for multiple epochs. This is full fine-tuning: they don't do LoRA, they don't train only the last layer. That's the setup in this particular paper, though I'm sure you can modify it to also target LoRA and so on. Finally, they consider two cases: either the attacker has access to the final model, because the victim uploads it back to the Hub, or the attacker essentially only has access to input-output pairs of the final model. We'll see that the latter, while obviously a harder problem, just reduces to a model-stealing attack on the model behind the API. Ultimately, the training data is imprinted in the model's weights, and model stealing, the attack where you only have API access but can determine the weights of the model behind the API with enough input-output calls, then gives you the weights and hence the training data. So it's not that big a leap from the white-box to the black-box attack.

So what are the basic principles? There are two. First, how do we make the model remember training data? Second, how do we make it so that remembering happens only once, for one data point, and is not altered for the rest of training? Because if we build something that remembers training data, but further training makes it remember more, the updates for multiple training points overlap, and we just usually call that "training a model"; that is essentially what training does. So how do we make that not happen?

First problem: how do we exactly reconstruct training data from a trained model? For that, consider the simplest case, a linear unit: one element of a linear layer. You have your input data x, which is m-dimensional, and a weight vector w that's also m-dimensional; you calculate the inner product, add a number b, and put the result through a rectified linear unit (ReLU), which has a zero region and a linear region. The output of the unit then goes into the next layer and ultimately into the classification head, which gives you a loss; but we'll consider that part later. Now, what happens if we run one SGD update with one data point? Let's try to remember that one data point. If you calculate the gradient of the loss with respect to the learnable parameters w and b, which is what SGD does, then by the chain rule they look as shown: both contain the backprop signal up to this unit h. We assume here that we are in the positive region of the ReLU; in the zero region the gradient is simply zero, and this assumption is going to become important later.

The key observation: the same term appears in both the gradient of w and the gradient of b. So if we, as attackers, received these gradients, we could directly determine the data point that was used to train, simply by dividing the gradient with respect to w by the derivative with respect to b; the shared factor cancels and we get x. Now, obviously we don't get the gradients. But assume the victim does only one single step of gradient descent and then gives us back the fine-tuned model. We can take the new model minus the original model, which directly gives us η times the gradient update, where η is the SGD step size. (By the way, thanks to the Discord community for telling me yesterday what that letter is; we discuss papers almost every Saturday evening on Discord, and we discussed this paper yesterday, so I highly invite you to join. It's always fun, I always learn a lot, and the sessions are not recorded, so anyone is allowed to ask any level of question.) So if I subtract, say, the w parameters of the two models, I directly get the gradient update that was applied, because that is exactly what happened between one model and the other. So if the victim were to do only a single SGD step with a single data point, then, having the original model and the fine-tuned model, I could directly recover that data point. Step one, making the model remember a data point, is essentially solved in this simple case: if we can additionally ensure that no other updates are ever applied to those particular parameters, we are done.
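As a sanity check, the single-step recovery can be sketched in a few lines of numpy. This is my own toy illustration, not the paper's code: a single ReLU unit with a made-up squared loss, inputs scaled to [0, 1] as the paper assumes, so the unit is in the positive ReLU region.

```python
import numpy as np

# Toy recovery of a training point from one SGD step on one ReLU unit
# h = relu(w.x + b) with toy loss L = h^2. Both dL/dw = g*x and dL/db = g
# share the same backprop factor g, so dividing the weight update by the
# bias update yields x exactly.

rng = np.random.default_rng(0)
m = 8
w = rng.uniform(0, 1, size=m)
b = 0.1
x = rng.uniform(0, 1, size=m)   # the secret fine-tuning data point, in [0,1]
eta = 0.05                      # victim's SGD step size

pre = w @ x + b                 # positive, so the ReLU passes the gradient
g = 2.0 * pre                   # dL/dh * relu'(pre) for L = h^2
w_new = w - eta * g * x         # one SGD step on w
b_new = b - eta * g             # one SGD step on b

# The attacker holds both the original and the fine-tuned weights:
x_rec = (w - w_new) / (b - b_new)   # (eta*g*x) / (eta*g) = x
print(np.allclose(x_rec, x))        # True
```

Note that the step size η and the loss both cancel in the division, which is why the attacker needs neither.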
The rest of the technique is largely about ensuring that, in the very same SGD step that writes the data point, we also close the latch, so that in the future there is never another update to these parameters. The basic principle is to abuse the fact that there's a ReLU here (by the way, they extend this to GELU and so on later in the paper, but bear with me for now). If we could somehow arrange that, from then on, the pre-activation is always negative, whatever the input, then the output h will always be zero, and there will never be any gradient to this weight w and this bias b ever again. That is the mechanism by which we close the latch. You can probably also see one of the main criticisms of the method here: if you have any sort of weight decay, this is not going to work, because weight decay updates the learnable parameters even when no gradient signal comes back; and any optimizer that normalizes across dimensions is probably also going to break it. So there are a number of hindrances to deploying this in practice, but the basic setup is not so far from what we actually do, and I'm sure these obstacles can be overcome with clever tricks; they show some of them, for example for GELU and layer normalization, at the end of the paper.

So how do we achieve this? Assume we have done the SGD update with data point x̂: w′ = w − η ∇_w L and b′ = b − η ∇_b L. Now we ask: for this new model, what is the output on a data point x? Obviously h′(x) = ReLU(w′·x + b′). If we substitute the gradient expressions from before, the new pre-activation is the old pre-activation minus a term that exactly represents the update we made: the shared backprop factor is factored out, and what remains is the captured data point x̂ (dotted with the new input x, since x is multiplied by the weights) plus a one, coming from the bias update. Our goal is to make this subtracted term so large that the pre-activation is negative for any input x. Sure, we could try to modify the step size, but that's under the control of the victim, not ours. So we need two things: the remaining term must be positive, and the factored-out gradient must be really large, and they decide that the gradient is the easier one to make large. The positivity we can guarantee by constructing the model so that the inputs to the backdoor unit are always scaled to [0, 1]. It's more common to scale inputs to [−1, 1] or something like that, but with [0, 1] the term x̂·x + 1 is always positive, and therefore the whole subtracted quantity is positive.

How do we make the gradient positive and large? That's the second challenge, and it's where their diagram comes in. The first condition depends only on the input of the backdoor unit, and the attacker ensures it by mapping inputs to [0, 1]; the second requires more work. We have h = w·x + b, assumed again to be in the positive region of the ReLU (if it weren't, we'd be done already; the pre-activation is already negative), and after it comes the layer the victim adds: the randomly initialized last layer plus the softmax classification, over which we have no control. What we can do is alter the model to introduce an amplification factor after the unit. The paper writes it as a weight W, but to my understanding it is just a really large constant, not a learned parameter. So our quote-unquote trick is simply to multiply the unit's output by a really large number: whenever the ReLU is in the positive region, the unit has a really high output, and in the negative region it has no output at all. The thing is, we don't destroy the rest of the model, because we only need this really large output once: when it happens, the large signal propagates forward, causes a very large backward gradient, and that gradient updates the weights in such a way that never again will there be a positive output from that ReLU unit.

Now you might think: if we make this really large, won't that destroy the information stored at w and b? No. Remember how the updates look: w is updated by the backprop factor times x, and b by the backprop factor alone. So no matter how large these updates are (up to numerical instabilities, of course), we can always divide one by the other and get x back. Making them huge will give us an enormous gradient update at w and b, but we can still reconstruct the original data point, and by making the update huge we ensure that, for all future time, this unit never outputs anything positive again; the ReLU stays at zero, and no more learning happens to those parameters.

Why can we just multiply by a really large number? They explain it like this. The amplified output h goes into the victim's last fully connected layer, then a softmax is applied to the outputs, and then a cross-entropy loss; that's the written-out form. In the derivative you can see the large constant (I'm not sure whether it is learnable, i.e. whether it also receives gradient updates, but I guess it doesn't matter, because after the latch closes it never gets an input again), the term sᵢ − yᵢ, where yᵢ is zero or one depending on whether i is the class of the current example, the randomly initialized last-layer weight of the user, and C, the number of classes. We want this derivative to be positive and large. The "large" part is easy: just set the constant large. The sign of the derivative, however, depends on the ground-truth class of the captured input: if the current output of the last layer happens to put its biggest value on the correct class, the derivative is negative; if the input is misclassified, the derivative is positive. Now you could object: does this method only remember inputs the model has not yet classified correctly? And the answer is no, because the output here isn't the quote-unquote normal output of the classifier. The output contains our amplified, forward-propagated backdoor signal: if one part of the model forward-propagates the signal normally, and the backdoored part forward-propagates it and multiplies it by a super-large constant, then you're essentially adding huge stuff to the signal, and the output becomes quasi-random. So, as the paper puts it, the derivative is positive with probability 1 − 1/C, where C is the number of classes: if the constant is large enough, the logits are essentially random, the benign part of the model has negligible influence, the softmax concentrates on the class with the highest (random) weight, and the relevant sum is positive whenever that class is the wrong one. So yes, technically this only works on a misclassification, but since the backdoored output has very little to do with what the good, unbackdoored part of the model actually thinks about the sample, you essentially get a random output, and hence a pretty high probability that the backdoor works. At least that's how I understand the method; it could be that I misunderstand, and you actually need the model to genuinely misclassify the example in order to remember it.
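That 1 − 1/C claim is easy to check numerically. Here is a small simulation of my own (not from the paper) where the amplified logits are modeled as pure noise, independent of the true label:

```python
import numpy as np

# If the amplified backdoor signal swamps the benign logits, the softmax's
# argmax is a uniformly random class, which differs from the true label
# with probability 1 - 1/C.

rng = np.random.default_rng(1)
C, trials = 10, 50000
y = rng.integers(C, size=trials)                    # true labels
logits = rng.normal(scale=100.0, size=(trials, C))  # "random" amplified logits
wrong = (logits.argmax(axis=1) != y).mean()
print(round(wrong, 3))   # close to 0.9 = 1 - 1/10
```

So with C = 10 classes, roughly 90% of captured inputs produce the positive gradient sign the latch needs.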
In that case, a workaround could be to ship a backdoored model that is just sort of bad, but I guess that would defeat the purpose, because then people wouldn't choose your model to fine-tune. So as I understand it: if a backdoor fires, your output is going to be almost random and really large in those components, which causes a really large loss and a large backprop signal; but I might be wrong. In conclusion, as the paper states: if the backdoor fires on a misclassified input, it shuts down and is inactive for the rest of training; if we are unlucky and a negative gradient flows into the backdoor, the backdoor does not shut and activates again on future inputs. This is the main challenge when scaling the backdoor construction to large transformers, with multiple blocks and multiple layers, which they discuss in section five: they have to make sure that the backdoor's output persists across layers and isn't interfered with too much by the benign signal, so that the gradient that comes back is the large one that closes the latch. The latch closes because we trigger this term to be really large and positive, which means that the gradient update has this really large positive part, which in turn means that the future pre-activation, the output after this first update step, of any data point x is always smaller than zero; the ReLU stays at zero, and no gradient signal is ever backpropagated to this particular weight and bias again. And yes, if you do something like pruning, because you observe a bunch of parameters being inactive during training, or the also-popular practice of resetting parameters that are rarely updated, like dead parameters, all of this will destroy the technique. But again, I'm pretty sure that with enough cleverness these things can be overcome.

Here you can see a first example: a three-layer MLP trained on CIFAR-10, with a benign part and a compromised part, the compromised part being used purely to store data points. On top are the reconstructions they get from their backdoors; there are multiple backdoors in this model, each tuned to capture one specific data point, grab it, save it, and then shut down. You can see that the reconstructions correspond to actual training inputs; for example, this truck corresponds to this training sample, and it's the very same image. So it's not like other attacks where you reconstruct a latent representation and can say "oh yeah, it has seen a truck"; here you actually get the data point you were looking for. Whenever a cell is gray, I believe, they didn't manage to match the reconstruction to a training sample. But in most of those cases, like this one, even though there is no corresponding training sample, the reconstruction is clearly a data-like image, probably an overlap of two very similar data points. The fact that there's no single corresponding training point means the latch failed, if you will, but it failed in the sense that it probably activated twice. So even if it fails in that particular direction, it's not the end of the world; it can also fail in the other direction, which is probably what happened elsewhere: the latch is there but simply never activates. Now, with the construction we've shown so far, the backdoor always activates on the very first training input, by design or not.
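To make the shut-down behavior concrete, here is a toy numpy simulation of my own (under the paper's [0, 1]-input assumption, with the amplified backprop factor modeled as a fixed large constant): one backdoor unit watches a stream of inputs, captures the first one that fires, and then never activates again.

```python
import numpy as np

# One backdoor ReLU unit over a stream of inputs in [0,1]^m. The first input
# that fires triggers an amplified gradient step that pushes w.x + b deeply
# negative for every future input, so the ReLU blocks all later gradients.

rng = np.random.default_rng(2)
m, n = 8, 200
X = rng.uniform(0, 1, size=(n, m))   # the victim's (toy) fine-tuning data
w = rng.uniform(0, 1, size=m)
b = -np.quantile(X @ w, 1 - 1.0 / n) # calibrated so roughly one input fires
AMP, eta = 1e4, 0.1                  # amplification constant, SGD step size

x_rec = None
for x in X:
    if w @ x + b <= 0:
        continue                     # ReLU is off: no gradient, latch untouched
    g = AMP                          # amplified (positive) backprop factor
    dw, db = eta * g * x, eta * g    # the one-shot update that stores x
    w, b = w - dw, b - db
    x_rec = dw / db                  # the attacker later recovers x this way

# After the latch closes, no input ever activates the unit again:
print(x_rec is not None, all(w @ x + b <= 0 for x in X))   # True True
```

The bias drops by η·AMP = 1000 in the capturing step, while w·x for inputs in [0, 1]^m stays bounded by m, so every future pre-activation is negative: the latch is sealed.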
But we've already said, I snuck this in, that we just assume the initial output h is positive. How could you arrange that? Well, you could just make the initial b a really large number: then no matter how negative the inner product is, you're always in the positive region, and the unit always activates. Conversely, if b is super negative, it never activates. So you can see how you can play tricks with the initialization, over which you have control, to make things happen. In between there is a sweet spot. Consider the distribution of w·x across the dataset. Usually people try to set b so that the ReLU is sometimes zero and sometimes active, to get nonlinear behavior; but we can set b right at the tail of that distribution, so that in expectation it activates for about one data point: only a very tiny fraction of all data points coming through have a large enough inner product with our w to make the pre-activation positive. And once it's positive, it doesn't matter how positive: we multiply by the huge constant afterward, the signal is amplified, the gradient fires, and the latch closes. The behavior is very binary, and with b we control what fraction of the dataset a particular latch targets. Optimally, we set b to target one data point: we make it so that, over a training run, only a data point very close to w will activate the unit.

The second question you might have thought of by now: why am I telling you this? Because you want to place multiple of these traps in the same model, and they should trigger for different data points. That's how you target different points: you put in different w's, and you set the b's so that the cutoffs fall such that one data point has a large enough inner product with one w to clear its limit, and another data point clears the limit of another w in its backdoor construction. So you use different linear units to capture different data points. Now, you might say: do I need to know ahead of time which data points I want to capture? Because then it would be kind of useless; if I already know the data points, what is the attack for? And the answer is yes, partly: if you know the data points, it's obviously easiest to prepare this model to capture a
single data point uh so yes absolutely that's going to be the case it's still dangerous because you can do these membership inference attacks so if you are an artist and you want to prove that I don't know someone fine tunes on your copyrighted image you can do it like this because you already know you know your image right and you place that as a w so only if they input your image as well you will have a large enough inner product to uh with the B to be over this by the way yeah this distribution so the B this might be negative B right here that I'm putting here because obviously yeah so it's not B you w want to set B to a large positive number you like a negative number that's just not as negative as the inner product of the data point that you target with your W so yes you have to kind of know what data points you're targeting but they say it is enough in practice that you know the distribution approximately of data that you target cuz if you know the distribution you can just place a bunch of so you can just sample from this distribution a bunch of W's like this one this one this one and uh just place them there and you're just going to hope that there's going to be one data point that is kind of close to it that you can latch on and save obviously this requires kind of calibration so the better you know the distribution of the data uh magnitudes and so on of it the better you can prepare your W's and your B's in order to capture one and exactly one data point per back door all right so that's how they that's what they do here they have this MLP um I think with three layers there's a benign part and there's a bunch of back doors in there and those are the data points they capture and um I think that's pretty impressive no like cool that they can actually capture these data points pretty exactly and then say well look um you know this this here we can reconstruct from these updated weights of the fine toot model that's been fine- tuned on the whole data set for 
a bunch of EO we can read out from the weights pixel by pixel which training data or the training data that's been used and we can actually find this exact sample in the training data again so kind of a proof that it was proven on them and the Reconstruction right not just the proof but a full reconstruction of the training data all right so this yeah set the bias B so that on a small fraction of inputs give a positive activations select the weights W that align with different subset of the training data in practice we simply sample the weights from uniform distribution over the sphere right okay uhuh they discuss calibration so uh if the bias is too large multiple inputs in a batch might trigger the same back door which makes reconstruction difficult if the bias is too small some back doors may never fire on any training input but in practice again the attacker only needs to know a loose approximation of the distribution of
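The calibration just described, sampling w on the sphere and then setting b so that only about one point clears the threshold, can be sketched in a few lines of NumPy. This is a toy illustration: the Gaussian stand-in data and all variable names are my own, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 1000                      # feature dimension, calibration-set size
X = rng.normal(size=(n, d))          # stand-in for the (approximately known) data distribution

# Sample a trap direction w uniformly from the unit sphere, as the
# paper suggests ("sample the weights from a uniform distribution
# over the sphere"): a normalized Gaussian vector.
w = rng.normal(size=d)
w /= np.linalg.norm(w)

# Calibrate the bias so that, in expectation, roughly one data point
# clears the threshold: place -b between the top-1 and top-2 scores.
scores = X @ w
b = -np.sort(scores)[-2]

active = scores + b > 0              # which points would fire the ReLU trap
print(active.sum())                  # prints 1: exactly one point activates
```

If the attacker only knows the data distribution approximately, this quantile would be estimated from synthetic calibration samples instead, which is exactly why a bias set too high leads to double firing and a bias set too low leads to a trap that never fires.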

Backdoors in transformer models

All right. So this works in linear units; now let's go to transformers. They are bigger, they are chunkier, they have multiple units, attention layers, normalization. Can we still do this? The construction so far relies on a very particular setup: linear layers followed by a classification layer followed by a softmax. You have to recognize that a transformer has more stuff, but a transformer does have linear layers, and those are the ones we're going to target. In particular, they focus on encoder-based transformers, ViT and BERT, and it's the same idea; the attacker also adds a final linear layer.

Usually you have your input, and the input now isn't a single vector but a sequence, which will also be a challenge for them. You have your attention layer, which mingles all of these inputs together, so you again get out a sequence of tokens, and then you have this MLP. However, you're not passing the whole input through the MLP together; you pass the tokens individually. That's the difference in a transformer: token by token goes through the same linear layer. So you always have the mixing component (attention) and then the component that computes the next layer's representation (the MLP). You can see the problem: if we pass tokens individually through the MLP and set backdoors in the MLP to store training data, then the best we can hope for is to store individual tokens, because the MLP doesn't see the whole sequence as a unified input. It sees token, token, token; it does not know which tokens belong to one data point and which to another. It has an independent view of them. That's going to be a challenge, but I hope you can see that even in a transformer we can set up these backdoors in the MLP part to remember data.

In practice, there is a benign part of the model, represented here in blue, and then this backdoor part. The backdoor part is itself differentiated into a section that carries the compromised signal, this red thing here. By the way, this all represents hidden states, not weights, as far as I know: the hidden states as they pass through the linear layers and up the model. So part of the hidden state is dedicated to the normal functioning of the model, and part to the backdoor. The backdoor part is divided again: the red stuff is responsible for propagating the compromised signal upstream. What we rely on is triggering a gradient update and making it really large, and the problem is we have to keep it really large until the end of the model, and we have to keep it from interfering with the rest of the dimensions. The key part is the part that chooses the different data points. Previously the choosing and the amplification were all in one: the choosing was the w we set up, and the amplification was the w1, as they called it, really blowing up the signal. Here the key part is responsible for selecting which data point to remember, targeting a specific data point. And given that the transformer is bigger and more complicated, they need a few more tricks to make this happen.

First of all, you can see they're going to zero out this key part again afterwards, because it would just introduce noise, and they have several amplification steps and several propagation steps, but the principle is the same. As they write, the transformer's inner features are split into three components: the benign ones; the key, which stores information to be captured by the backdoor; and the activation, which propagates the output activations of the backdoor all the way to the model's last layer and amplifies them to ensure that gradient signals will shut down the backdoor.

What I find interesting is that the key is divided into parts. One part selects the token: again, you need the inner product with some part of the data to be positive, or rather above negative b for that particular linear unit, and you need that to happen for only one of the inputs, not all of them. The additional challenge is that you want coordinated backdoors: you want a hundred backdoors to all fire for the same input sequence, at different positions in that sequence, so that you capture a whole data point at once. So rather than only matching the incoming token vector, the key also has to make a large inner product with a given positional embedding; their key therefore also contains positional embeddings and sequence embeddings. The sequence embedding, if I recall correctly, is an aggregate of all the embeddings of a particular sequence, representing that sequence. This lets you target whole sequences: even if a token you would match appears in a different sequence, the backdoor will not fire, because the inner product with this part of the vector is small. So per backdoor you target a particular sequence, a particular position, and a particular token embedding.

This does stretch the earlier claim that "in practice you don't need to know that much about your data distribution": the more of these things you have to build in, the more you do need to know about what's coming your way, or the more pedantic you have to be in your calibration. Knowledge of the target distribution is very much appreciated in these kinds of things. However, imagine the scenario I mentioned of fine-tuning a PII detector: that's a very narrow distribution, and as an attacker I could probably create synthetic data to calibrate pretty well for such a use case. What's probably not possible is setting up general backdoors that capture all kinds of training input for these large models, with sequence and positional embeddings to target (although the positional encodings aren't the problem; they're always the same).

All right, so the backdoor module goes in the first encoder block: they use the MLP to implement multiple backdoors that capture the input features, with the same design as before. So we map the full input vector, the benign features plus the key, to the output. There are multiple backdoors, each of which is simply a linear unit, as discussed: there is a w for the part of the input that we want to capture, there is the key, the addressing part, and there is the bias parameter; the rest works just like before. How do we remember? If gradient updates are made to w and b, we can divide one by the other and get back the data point that was used to compute the gradient. And second, if we add large amplifiers after this, this particular unit will be strongly negative for anything that isn't the key, and after the update probably even for the key, thereby shutting down any future gradient updates to it.

So, to capture all tokens in an input, they design keyed backdoors that activate only for tokens in a specific position of one input sequence. The backdoor weights have the stated form, and the positional features are designed to be close to orthogonal, so the wrong position doesn't activate. And this is something I said wrong before: the weights can have a zero in the token part, so you don't need to know the exact token you're targeting, but you do need to know the sequence embedding and the position you're targeting; so you do need an approximate distribution of sequence embeddings of the target distribution. This ensures that the backdoor will only activate on a token in the i-th position of a sequence whose sequence embedding matches the sequence part of the key.

Then there's some numerical machinery. Amplifiers multiply the backdoor output by a large constant, just as before. Eraser modules zero out the key parts of the signal; remember, you, the attacker, build the model and give it to the victim, so you can build in zero multiplications wherever you want, just as we built in large-number multiplications wherever we wanted. Signal-propagation modules keep amplifying the signal. And the output module aggregates all of the features into the CLS token. If the backdoor doesn't fire, its contribution is zero, and the CLS token just carries the normal features. If the backdoor does fire, its contribution is a huge number, the normal features are overpowered, and the CLS output becomes essentially random, which means that with high probability we misclassify the current data point. Misclassifying means we get a large gradient, which, because we've amplified the signal all across the model, backpropagates all the way to the first layer, which is where we placed our backdoors. We place them in the first layer, obviously, because we want to remember the data points themselves and not some intermediate representation of them.
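The "divide one gradient by the other" readout described above can be made concrete for a single trapped ReLU unit. This is a minimal sketch under my own toy setup, not the paper's full transformer machinery.

```python
import numpy as np

# One trapped unit: h = relu(w·x + b). For any downstream loss L,
# whenever the unit is active the chain rule gives
#   dL/dw = (dL/dh) * x   and   dL/db = dL/dh,
# so the elementwise ratio (dL/dw) / (dL/db) recovers x exactly.
rng = np.random.default_rng(1)
x = rng.normal(size=8)               # the training point the trap catches
w = rng.normal(size=8)
b = 1000.0                           # toy bias chosen so the unit fires on x

pre = w @ x + b
assert pre > 0                       # the trap is active on this input
g_h = -3.7                           # some arbitrary upstream gradient dL/dh
grad_w = g_h * x                     # chain rule through the linear unit
grad_b = g_h

x_rec = grad_w / grad_b              # attacker only needs the gradient (or update)
print(np.allclose(x_rec, x))         # prints True
```

Since an SGD weight update is just this gradient scaled by the learning rate, the same elementwise ratio applied to the observed weight deltas also recovers x, because the learning rate cancels; that is why seeing the fine-tuned weights is enough.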

Additional numerical tricks

So far they've ignored some technical challenges, namely the use of GELU and layer normalization. These require some additional numerical tricks to ensure that the backdoor signals neither vanish nor blow up during training. Layer normalization in particular is nasty, because it mixes different dimensions with each other. We have benign features and backdoor features, and we're carefully modulating the backdoor features to do our thing; if you now start normalizing across dimensions, mixing them together over multiple layers, your signal can very quickly get out of hand, or get noised to the point where it quote-unquote doesn't work anymore, or you end up making updates to the wrong parts of the model. I'm not going to dive into this too much; there is an appendix. The gist of dealing with layer normalization, for example, is: add a super large constant to the backdoor signal when it's present. In the limit, if that constant is way larger than all of the signal, this effectively deactivates the layer normalization: it completely separates the two signals from each other and shuts down the normalization part of layer norm. How exactly that works out in the math you can see in the appendix; it's very well written down. The point is just that they have to introduce these things to bypass some of the modules in use.

The same with GELU: they have tricks to deal with the fact that we made heavy use of the ReLU being exactly zero on the negative side. If we always land there, that's our backdoor latch closed, because with zero output there's no more gradient coming back. But with GELU (or is that SwiGLU? I'm not sure which they use), the activation is not exactly zero anywhere, which means that as training continues you keep getting small gradient updates. So they have some numerical tricks to lessen those effects as well.
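To see why layer normalization threatens this construction, here is a small demo of the mixing problem. This is my own illustration of the issue, not the paper's fix; their actual trick of adding a large constant to dominate the statistics is worked out in their appendix.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard layer normalization over the feature dimension
    # (no learned scale/shift, for simplicity).
    mu = x.mean()
    var = x.var()
    return (x - mu) / np.sqrt(var + eps)

h = np.array([0.5, -1.2, 0.3, 0.8])   # pretend dims 0-2 are benign, dim 3 is backdoor
h2 = h.copy()
h2[3] += 10.0                          # the backdoor dimension spikes

# Even though only the backdoor dimension changed, every benign
# output changes too, because mean and variance couple all dims.
delta = layer_norm(h2)[:3] - layer_norm(h)[:3]
print(np.abs(delta).max() > 0.1)       # prints True: benign outputs shifted noticeably
```

This cross-dimension coupling is exactly what the attacker has to neutralize so that the amplified backdoor signal doesn't corrupt the benign features (and vice versa) as it travels up the model.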

Experimental results & conclusion

In any case, you can see the output panel for the vision transformer. They don't backdoor entire data points; they backdoor a grayscaled version of the data points. But still, you can see they're successful: that stop sign is clearly the stop sign from the training example. And even where a backdoor failed, like this one, the reconstruction is clearly an overlap of one or two training examples; it's a data point, not just some hidden representation. They didn't find a corresponding training data point, which probably means the backdoor activated twice, but even then it's still useful. It's only when you fail to reconstruct anything at all that it's useless.

The same goes for text data, here from BERT. Everything in yellow is an exact reconstruction of a full data point. Then there are some additional tokens, because sometimes the data point is too short, and some of the backdoors latch onto tokens of different data points. Remember, each backdoor has to target one token; all of these backdoors close on the same sequence, which is exactly as intended, but some backdoors are still open, ready to latch, and along comes some token that happens to have a large enough inner product to trigger them. Still, it's pretty impressive to see.

I don't want to go much further into this. They then discuss black-box attacks: you can still do membership inference attacks, which essentially reduce to model stealing, so there it relies on past work. But the point stands: if these data points are exactly encoded in the weights, and someone has a method to extract weights through your API, then they will have your fine-tuning data points. All of this depends on the models being backdoored and compromised in the first place.

All right, that was it for the paper. I don't want to go hugely into depth, but it's a long paper with a long appendix; everything is very detailed, and there's code available. Very cool work, very cool concept, and let's see what continues in this direction. I'm very excited for more creative ways of figuring out what can be attacked and exploited. It's maybe not immediately impactful; as we said, a number of things, like training with Adam, will probably completely destroy this. But still, it's fun, it's interesting, it's creative, and we'll see what further comes of it. Thank you so much for listening, stay hydrated, and see you next time. Bye-bye.
