ROME: Locating and Editing Factual Associations in GPT (Paper Explained & Author Interview)
1:04:59


Yannic Kilcher · 04.11.2022 · 42,575 views · 1,273 likes


Video description
#ai #language #knowledge Large Language Models have the ability to store vast amounts of facts about the world, but little is known about how these models actually do this. This paper aims at discovering the mechanism and location of storage and recall of factual associations in GPT models, and then proposes a mechanism for the targeted editing of such facts, in the form of a simple rank-one update to a single MLP layer. This has wide implications both for how we understand such models' inner workings, and for our ability to gain greater control over such models in the future.

OUTLINE:
0:00 - Introduction
1:40 - What are the main questions in this subfield?
6:55 - How causal tracing reveals where facts are stored
18:40 - Clever experiments show the importance of MLPs
24:30 - How do MLPs store information?
29:10 - How to edit language model knowledge with precision?
36:45 - What does it mean to know something?
39:00 - Experimental Evaluation & the CounterFact benchmark
45:40 - How to obtain the required latent representations?
51:15 - Where is the best location in the model to perform edits?
58:00 - What do these models understand about language?
1:02:00 - Questions for the community

Paper: https://arxiv.org/abs/2202.05262
Follow-up paper on Mass-Editing Memory in a Transformer: https://arxiv.org/abs/2210.07229

Abstract: We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model's factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modules that mediate factual predictions while processing subject tokens. To test our hypothesis that these computations correspond to factual association recall, we modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME).
We find that ROME is effective on a standard zero-shot relation extraction (zsRE) model-editing task, comparable to existing methods. To perform a more sensitive evaluation, we also evaluate ROME on a new dataset of counterfactual assertions, on which it simultaneously maintains both specificity and generalization, whereas other methods sacrifice one or another. Our results confirm an important role for mid-layer feed-forward modules in storing factual associations and suggest that direct manipulation of computational mechanisms may be a feasible approach for model editing. The code, dataset, visualizations, and an interactive demo notebook are available at this https URL

Authors: Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov

Links:
Homepage: https://ykilcher.com
Merch: https://ykilcher.com/merch
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://ykilcher.com/discord
LinkedIn: https://www.linkedin.com/in/ykilcher

If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Table of contents (12 segments)

Introduction

Hello, today we're talking about "Locating and Editing Factual Associations in GPT" by Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. In this paper, the authors attempt to localize where, in a forward pass through a language model, a fact is located, or where it is realized. For example, take something like "The Space Needle is in downtown Seattle." It has a subject, a verb, and an object. Where exactly in a language model does the model "know", quote-unquote, these things, that the Space Needle is in downtown Seattle? That's the question of this paper. And they go beyond that: by figuring out where these facts are, they can also then edit those facts, meaning they can change the model such that it all of a sudden believes that the Space Needle is in Paris. And they test in various ways that this change is, first of all, robust, that it generalizes, but also that it doesn't distort the rest of the model too much. Moreover, this change is a rank-one update that they can pre-compute. So all of this is very interesting, and we're going into it in detail. This video is a bit of a mix between me explaining the paper and the authors, whom I interviewed, giving their inputs into various aspects of these questions. I hope this is of benefit to you. Let me know if you like it or not, and let's go into it. There's an entire subfield that

What are the main questions in this subfield?

just researches where facts are in language models. I didn't know about this subfield until I read your respective works. What does it entail? What are people wondering about?

So I guess there are a few questions. I think it's at the intersection of two main things. One is a scientific investigation into where things are and what models are doing to achieve them, and at the other end of the spectrum is a practical question: sometimes these models mess up, sometimes they have information that we want to change because it's now outdated, so how do we do this in a practical, very clean way? On both sides there are individual respective questions. On the interpretability side, I think David might be able to talk about it a bit, because he's worked with not only language but also vision models.

Yeah, so on the interpretability side, it's this really old question that goes back to the early days of neuroscience: where do ideas, where does knowledge, live in a big neural network? People thought about this for the biological neural networks of your brain. There's this old theory of the grandmother neuron: that maybe you could even have a single neuron that's responsible for thinking about your grandmother, and maybe if you plucked that neuron out of your brain you might forget that whole concept, which people think is implausible. But what we're chasing here is a weaker locality question: if you have some knowledge in a big neural network, can it be localized to a small set of neurons or a small set of layers? Can we find out where that knowledge is? A bunch of people have been looking at this. I guess the overarching area is called mechanistic interpretability research, where people are trying to understand the mechanisms that are emerging inside the learned computations. There was a really nice paper by Elhage from Anthropic, and there's been a series of papers from Geva from Israel, who have been looking at the structure of computations inside the network. Our paper is another contribution in this direction. I think the thing that we're looking at a little differently is that we're really focusing on using causal probes to ask that question: making changes in the network to see how the network responds, and using that to map things out.

And what I love about your work is that you then actually put it to the test, which means that if we understand where the knowledge is, we should be able to change it, right? To me, interpretability research is always a bit shrouded in mystery, because there are always, I feel, something like ten thousand different explanations that could explain a given fact, and usually the researchers frame it in a way that makes their hypothesis seem the most sensible, but I'm always like, meh. But if you then actually put it to the test and say, well, if we are correct, we should be able to edit the knowledge, we should be able to erase a fact or insert a new one using what we think happens, and that's also a thing that you do very well.

Yeah, so that's where the really interesting interplay between the interpretability and the practical side comes in, because on the practical side people have been chasing this question of real-world usage. These models are huge, they're really difficult to retrain, and when we actually do fine-tune them, for example on a small dataset with sort of a blind objective, it's kind of hard to tell sometimes what we're doing with it. In the past we've seen some works, for example from Mitchell and from De Cao; they spent a lot of time asking the question: can we achieve generalization when we do edits? When we change one thing, does something else change? And is the edit specific: if we change one thing, does an unrelated fact also change, undesirably? They've kind of set this area up, because it's a very practical question. And I think the really cool thing about ROME is that, like you said, on one side there is the scientific question, but on the other side we show that the insights we get can yield a pretty useful model editor that seems to achieve generalization, specificity, and fluency preservation all pretty well.

I was wondering: the main foundation of neural networks is distributed representations. That was the big step, right, to go from GOFAI systems, from symbolic systems, to distributed systems, where we no longer have individual symbols representing individual things in the world, symbols out of which we could build very simple knowledge graphs. Now a fact like "the Space Needle is in downtown Seattle" needs to be stored somewhere in a vector space. Yet you managed to locate it fairly well at particular points in the network. How does that work?

How causal tracing reveals where facts are stored

So here is how causal tracing works. This is one of the main methods the authors employ to figure out where in the model the facts are realized. We are talking here about the realization of facts, which is connected to the storing of facts, but we specifically care about the activations, the hidden signals as they travel through the network, and not necessarily about localizing facts inside the weights of the neural network. In this case you can see a sentence that you input, "The Space Needle is in downtown", and the model would output, well, in this case it's an uncorrupted sentence, so a good language model will get this correct and say "Seattle" as the next token. As you can see, this goes through a number of different stages. Due to how GPT works, how an autoregressive transformer with causal masking works, the token for "The" gets embedded, generating a hidden state. That hidden state goes through the layers of the transformer, and at each layer it accumulates two things: the output of an attention module and of a multi-layer perceptron (actually, I think, in succession), with a residual connection around them. That's what you see right here. But the same hidden signal also travels forward at each layer; well, not exactly, it's more like when the second or third token comes in. So when "Space" is now fed into the transformer, it gets a signal from the past, because causal attention looks at the past, so it also gets the hidden states from the past. The information would flow like so, and every time it would also pick up the hidden signal from there; "Needle" would get the hidden signals from both "The" and "Space", and they would also travel up the layers. So you can see there are various paths this information can take.

The idea here is to figure out where in these hidden states, in these bubbles right here, the fact that "Seattle" should be the output of the sentence is realized, where it is localized. You might have various opinions on where that is: first, where in the sentence the model puts a lot of weight on "Seattle", and second, where in the depth of the network that happens. Both of these, as the evidence turns out, are quite surprising. So here is what they do in causal tracing: they run the model once with a clean input and record all of these hidden activations. Then they run the model again, but this time with corrupted input. Here you can see these have little asterisks by them, which means the input is now corrupted: you add some noise, or you replace the tokens by noise or by something else; it's just not the original signal anymore. Therefore, if you just let the model run, it will probably produce something else, because the subject of the sentence is completely corrupted. So this could be "whatever is in downtown", and then "Seattle" is certainly not the first thing on the model's mind; it might be, but very likely not. Then what they do is really interesting: they take each one of these hidden states individually and copy it over. So at this particular hidden state, instead of what the model gets as input from this path and from this path, it ignores that and replaces it with the hidden state from the clean input. And now we observe: maybe it said "Paris" before, because "something is in downtown", so the model just said Paris. If it stays at the wrong answer, then that original hidden state was probably not strongly associated with either the input "Space Needle" or the output "Seattle". However, if copying over that hidden state from the clean run actually changes the output back from Paris to Seattle (but that is a fat marker, oh, sorry about that, those are my notes), if that actually changes it back, then we know: aha, this hidden state must be quite important for associating "Space Needle" with "Seattle". And that's how we find out.

As you can see in the results, you get two clusters: what they call an early site, which usually appears after the subject is done, and a late site, which usually appears right before you need to predict. What's surprising, at least to me, is that these early sites exist at all, which indicates that the model is aware of what it could say with respect to the Space Needle much earlier than you would think. Right after just consuming the subject, it doesn't yet know that I'm looking for a location, that "it is in downtown" something; yet it already has a lot of information about the location of the Space Needle that is associated with the output "Seattle". So let's look at what the authors say about these things.

I think one component of it is that causal interventions have been shown to be pretty effective at determining what happens in a model. It seems intuitive, because correlational studies always have problems with confounding and things like that, but when we go in and make explicit changes to the computation of the model and measure the effects, the things we can read out are a little bit cleaner. So in causal tracing, the fundamental question is which of these hidden states is carrying information that can help us convey the factual statement. And like you said, it's a big distributed network, so a priori, one of the things you might think is, well, everything is important, and all the states have information that could recover the answer. So we wanted to test whether that is actually true. Procedurally, what causal tracing does is first obfuscate the subject: it adds noise to the embeddings of "The Space Needle", so now the network doesn't know what you're talking about, and it's got a whole set of corrupted activations. Then the question is: if you could restore any single clean state, could you pick one such that, after you restored it, the network recoups its computation, and that state contains enough information for the rest of the network to determine that the correct answer is Seattle? The surprising result is shown in Figure 1's e, f, and g, where we see this really sharp localization. In this specific example we see a patch that's early and a patch that's late that have really high causal effect; in essence, they have the information that's required to restore the factual statement, but all the other states don't. It's a very sparse set of activations that can actually do this.

So we were curious what this actually corresponds to. We can do this activation copying specifically for the MLP and specifically for the attention as well, and what we find is that the MLP corresponds to the early site and the attention corresponds to the late site. The late site is interesting, but not exactly too surprising, because the model is going to recall the next fact by outputting the next token, so it's right next to the prediction, and the causal impact there isn't too surprising. But what's really interesting is this weird early site that at first seems to be in the middle of nowhere. When we do this kind of experiment averaged over a thousand facts (I think that might be Figure 2, on the next page), we find that it systematically lands at the last subject token, this patch of high causal effect in MLPs. Inspired by a lot of the previous work on interpreting where and what transformer components are doing, for example from Geva, from Dai, and from Elhage, we form the main hypothesis of the paper: that these MLPs are actually what is recalling the factual knowledge. This is consistent with the transformer-circuits idea that Anthropic in particular has been working on, which is that these MLPs might be outputting some kind of information that the attention modules at the very last token, the ones actually responsible for the next-token prediction, are reading. So this was a really stunning surprise, to find this kind of separation in such a large network. And the thing that's sort of lucky about it is that MLPs have a really simple form. A lot of work has been done on studying how attention works in these transformers, and attention is, my gosh, really complicated; but the MLPs, these feed-forward layers, are actually really simple, so they're a pretty interesting thing to study if they're having some decisive effect. So that brought us to the next thing that we did.

Just to make it clear: for now, the hypothesis would be something like, the MLPs provide information, they provide some kind of inputs to facts, and then the attention at the later layers gathers all of that information in order to make the final prediction?

Yeah, sort of. The hypothesis is that the MLPs may be storing this factual knowledge, these factual associations. There's nothing inherent in the literal words "Space Needle" from which it would make sense to predict Seattle. There's a separate association, a separate piece of knowledge, that the model has to store somewhere, and the theory is that the association between the words "Space Needle" and the location Seattle is specifically stored in these MLP layers in the middle of the network.

So this experiment here is pretty
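The restore-one-state loop described in this section can be sketched in miniature. This is a toy scalar sketch of the causal-tracing idea only, not the authors' implementation: real causal tracing patches hidden-state vectors of GPT at specific token positions and layers, while here the "layers" are plain functions and the "hidden state" is a single number.

```python
# Toy sketch of causal tracing: run clean and cache every hidden state,
# then rerun from a corrupted input, restoring one clean state at a time.
def causal_trace(layers, clean_x, corrupt_x, readout):
    h, clean_states = clean_x, []
    for layer in layers:                 # clean run: cache hidden states
        h = layer(h)
        clean_states.append(h)
    effects = []
    for i in range(len(layers)):         # one corrupted run per restore site
        h = corrupt_x
        for j, layer in enumerate(layers):
            h = layer(h)
            if j == i:                   # patch in the clean state at site i
                h = clean_states[j]
        effects.append(readout(h))
    return effects

# Tiny demo: two "layers"; the clean answer is 3.0, the corrupted one 1.0.
layers = [lambda h: 2 * h, lambda h: h + 1]
corrupted_answer = layers[1](layers[0](0.0))   # no restoration: fact is lost
effects = causal_trace(layers, clean_x=1.0, corrupt_x=0.0, readout=lambda h: h)
# In this toy chain, restoring either state fully recovers the clean answer.
```

In the paper, the analogous readout is the probability the model assigns to the correct token, and the trace runs over both token positions and transformer depth, which is what produces the early-site/late-site maps.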

Clever experiments show the importance of MLPs

interesting. The way I understand it, it is the following. The top is the baseline corrupted-input condition, what we had before: we corrupt the subject (not all tokens are shown, but "Space Needle" was the subject) and let it run through the network. In the original experiment, we would copy over one of the hidden states from the clean input, for example this one right here. Now, however, we do something in addition: on the bottom you can see that we still import the clean hidden state right here, but we also take these signals from some of the layers of the corrupted path and attach them here. It takes a moment to see what's really happening here, and it's very interesting. We now measure the causal effect of that node right here, as we did before, and here you can see the results. For the effect of a single state, the causal effect is, as we discussed before, a spike at this early site. If we sever the attention modules, we get almost the same effect, as you can see right here (severing is the process I described on the left). However, if we sever the MLP modules, there is a definite suppression of that effect early on: where the effect was biggest originally, it's depressed way down. So as soon as we import these signals from the MLP modules (remember, we're talking about forward signals here, not weights), this node no longer has much of a causal effect. And that is an indication that the MLP modules might play a major role in these factual associations.

So what we were asking is: hey, if the MLP modules are so important, what happens if we don't let them read their input? What if we just stuck their input in the fixed corrupted state? That's what this shortcut is showing: what if the MLP modules, instead of being able to respond to any new information that we're sticking in to clean up the prediction, weren't allowed to participate at all? When you do that, well, normally you have this really strong causal effect for every state, which you can see in the purple bars in the graph on the right; but if you take the MLPs out of the picture, it drops down to the green bars way below. So somehow the MLPs at these early layers, from about 10 to 20, are really important for this computation; if you take them out, the causal effects go away. The interesting thing is that if you knock out attention the same way, it doesn't really drop that much. So attention is playing some role, but not the same important role that the MLPs are playing.

I love this type of research, because on a meta level it is really nice to see that academic labs can do this kind of work. Okay, GPT-2 isn't nowadays one of the largest models in existence, but still: it's not all money and compute and scaling up, where you can only get a paper published if you train and invest endlessly. You can do fairly simple things, as long as they're smart, and still find out so much about these models. So I think your paper is also, on a meta level, a really good example of what you can still contribute to research even in the absence of giant budgets. I don't know if you have giant budgets, but the paper is certainly doable without, right?

If anybody wants to help us with a giant budget, we're always happy to have a little bit more. These huge models really are doing some fascinating things, and so we're trying to investigate the really huge models too. But yeah, I think our secret sauce is not compute, it's clever experimental design.

Yeah, and it really shows. The effects here are pretty significant, right? If you cut the contribution of the MLPs, you see quite a big drop in the causal effect, and it makes a fairly good case, I would say, for localizing that knowledge. So now we get to how we kind of determined our
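The severing variant can be sketched the same way. This is again a toy scalar sketch under an assumed block structure h = h + attn(h) + mlp(h); in the paper, "severing" freezes the MLP (or attention) inputs at their values from the corrupted run while one hidden state is restored.

```python
def run(blocks, x, restore_at=None, clean_states=None, frozen_mlp_in=None):
    """blocks: list of (attn, mlp) functions; each block computes
    h + attn(h) + mlp(h). Optionally restore one clean state, and/or
    pin the MLP inputs to fixed values (the "severing" intervention)."""
    h, states, mlp_ins = x, [], []
    for i, (attn, mlp) in enumerate(blocks):
        mlp_in = frozen_mlp_in[i] if frozen_mlp_in is not None else h
        mlp_ins.append(mlp_in)
        h = h + attn(h) + mlp(mlp_in)
        if restore_at == i and clean_states is not None:
            h = clean_states[i]
        states.append(h)
    return h, states, mlp_ins

blocks = [(lambda h: 0.0, lambda h: h)] * 2   # attention contributes nothing here
clean_out, clean_states, _ = run(blocks, 1.0)            # clean answer: 4.0
_, _, corrupt_mlp_in = run(blocks, 0.0)                  # corrupted run
restored, _, _ = run(blocks, 0.0, restore_at=0, clean_states=clean_states)
severed, _, _ = run(blocks, 0.0, restore_at=0, clean_states=clean_states,
                    frozen_mlp_in=corrupt_mlp_in)
# restored == 4.0: patching the early state recovers the clean answer
# severed  == 2.0: with MLPs pinned to corrupted inputs, the effect collapses
```

The gap between `restored` and `severed` is the toy analogue of the purple-versus-green bars: restoring a state only helps if downstream MLPs are allowed to read the restored signal.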

How do MLPs store information?

hypothesis. So the hypothesis is now that this knowledge, the facts, are essentially stored in the MLPs. If I understand you correctly, something like "the Space Needle is in downtown Seattle" would already be stored in an MLP, and it would already be associated at the point where, as we see here, the last subject token is processed. Essentially, once I process "Space Needle", at that point, or maybe one token after it, I would have a layer with an MLP in it, and the fact of it being in Seattle would already be stored and recalled at that point. Do I understand you correctly?

Yeah.

Even though the model doesn't know yet that I'm going to ask it where the Space Needle is. That means that, if this hypothesis is correct, the model, once it sees a subject, whatever that means, will retrieve a whole bunch of knowledge about that subject from its different MLPs, for the attention modules in later layers to aggregate and to retrieve the correct pieces from.

That's exactly right, yeah. That's kind of what we found. I think another intuitive hypothesis would have been that the relation is also encoded in there somewhere, but the challenge is that the relation often doesn't show up until the very end of the computation. And if you think about it, it's a little bit difficult for facts to be recalled at the very end, because there has to be some kind of general pool of information that you can draw from about a certain subject, even before the question is asked.

Okay, so MLPs act as key-value stores. Do you want to tell me a little bit about how?

Yeah. So this is inspired in part by the really nice structure of the MLP, simply two matrices connected by a few non-linearities, but it also draws from research that's been done by Geva and Dai in the past year or two. Basically, what they said was that within the MLP there are two matrices: there's the fan-out matrix that gives you a pretty large key space, and then there's a fan-back-in matrix that brings it back to the hidden dimension. What Geva found was that the second feed-forward layer seems to act like a key-value memory. They found that a lot of the keys corresponded to real-life concepts, and for the values they've shown that sometimes they can correspond to specific embedding vectors, and again to human-identifiable concepts. So that's one of the things that got us thinking that it was an associative store. But the next thing is simply that it's a nice matrix, and such matrices have been studied for a long time as methods of storing associations. In the very naive case, if you just stuck a fact into every single one of the dimensions, then you would have just n facts that could be stored orthogonally. But there's this really nice interpretation: linear associative memories can store more associations than the number of rows or columns, depending on how you look at it, by minimizing the squared error between all the key-value pairs. And that sort of got us started thinking about how we can take all the associations that are already encoded in this hypothetical matrix and add a new association under the same constraint.

Yeah, so the old name for this is linear associative memory. It goes way back to the 1970s, when people were asking what you can use a single-layer neural network for. Researchers in the 1970s thought of a lot of alternatives, but one of the leading hypotheses was that it just stores key-value associations. They looked at it as a least-squares problem: basically, you can pack a lot of associations, a lot of remembered values, into this key-value store, and there might be some error, but a good solution would minimize the squared error. And it reduces to a classical, but actually pretty straightforward-to-solve, linear algebra problem. So that's the old view of it.

So now we ask the question: how can we modify such a
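The linear-associative-memory view the authors describe can be demonstrated in a few lines of NumPy. This is a generic sketch of the classical least-squares idea, not code from the paper; the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_key, d_val, n_facts = 64, 32, 50
K = rng.standard_normal((n_facts, d_key))   # one key per stored "fact"
V = rng.standard_normal((n_facts, d_val))   # the associated values

# Fit a single matrix W minimizing the squared error sum_i ||k_i W - v_i||^2,
# the classical linear-associative-memory objective.
W, *_ = np.linalg.lstsq(K, V, rcond=None)   # shape (d_key, d_val)

# Recall a stored value by multiplying its key through the matrix.
v_hat = K[0] @ W
```

With fewer facts than key dimensions, as here, recall is essentially exact; once `n_facts` exceeds `d_key`, the same `lstsq` fit instead spreads a small squared error across all pairs, which is the "store more than n orthogonal facts" interpretation mentioned above.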

How to edit language model knowledge with precision?

network such that it learns a new fact, or changes its mind about one of the facts that it knows? The attack surface right here is going to be these MLP modules: namely, updating the weights of the MLP modules such that they change their mind about a fact. We now have the hypothesis, based on some of the experiments, that the key right here probably corresponds to something like the subject, "the Space Needle", and the value that we get out probably corresponds not exactly to the output itself (at that point the model doesn't know yet that I'm looking for a location) but to something like a fact about that subject. So I made the example "location = Seattle". That entire fact could be encoded in this value vector, such that later, once it actually becomes clear that I'm looking for a location, that fact can be retrieved, as opposed to any of the other facts that would be stored in any of the other MLPs the signal is also going through. (After all, we're doing multi-headed attention, and it is by itself quite an interesting question how many facts there are and so on, but I don't want to go into that.) The question is: can we change this to say "location = Paris"? They go about this in a fairly smart way, and we come back towards the end of the interview to how exactly they do it. There are two parts to it. First, let's say we know what the key for the subject is, and we know, in vector form, the value that we'd like to insert. Then they go through a bit of math, set this up as a constrained optimization problem, and it turns out that if you solve it, you get a closed-form solution for a rank-one update: a rank-one update that they can easily compute and simply add to the original weight matrix. They then get an updated weight matrix that respects the new fact they want to insert.

Now the question is, obviously: how do they know what the vectors for the key and the value are? The key is still relatively simple, since the key is the subject, which you know and want: you can simply let it run through the network and grab the activations at a particular site (they always choose the same site). But the value is different; there they solve an optimization problem. They essentially fix the output right here and, much as with an adversarial example, back-optimize what the vector here would need to be in order for the output to change to Paris. This back-propagation, this optimization, isn't the changing of the network itself; it's simply there to compute this v vector, so that they then know how to compute the update for the weight matrices.

Let's assume that I make an edit. I say, okay, this is my Space Needle, and here I would say: no, it's actually in Paris, or Rome, not in downtown Seattle, so I want to encode a different value. You phrase this as a constrained minimization problem, where you say: I want to find a new matrix that still minimizes over the old keys and values, but also obeys my new relation, and you can then phrase this as a closed-form solution. My question is: why did you choose to go with constrained minimization in this case? Why didn't you just add the new key and value to all the other keys and values that might already be there, and then minimize the entire thing at once?

So one of the reasons is that this is a mathematical formulation, but we don't actually have access to all the old keys and values. It turns out that if you set it up in the right way, then you can get all the old keys and values to cancel out, so you don't need to know them, and one of the ways to do that is to set it up as this constrained minimization. The other nice advantage is that if you balance the new fact against all the old things, there's a hyperparameter you might need to set for how much balance there is; but if we're just inserting a single new fact, it's easiest to just say the new model should know this fact, let's have it know this fact one hundred percent, and we might have to sacrifice a little bit of increased error on old facts. But there are so many other dimensions that that's just a little bit of error. So we set it up this way in this paper, although setting it up the other way you suggest is a really good idea, and it's actually an approach that we explore in a future paper that hasn't been published yet. It'll be on arXiv soon.

Hopefully it's going to be published by the time this video is released, and I'll point people to it. Essentially, in a nutshell: here we implant single new facts into these models, and that works up to maybe a couple of dozen facts, but with your new method you can implant thousands or even tens of thousands of facts at the same time into networks?

Yeah, that's right. You can really scale this up if you adjust a few things.

If I think about implanting new facts into a network, I can make it really easy for myself: I can just say, whatever, it just needs to fulfill this one constraint. But obviously there's a trade-off, there's always a trade-off, right? Specifically, the trade-off here is: what happens to the rest of the network? Is it still correct? If I tell the network, look, the Space Needle is actually in Paris, what effect does that have on the rest of what the network knows, and on how it performs, and so on? And that's where we get to your fairly extensive, I want to say, evaluation of these things. So we now have an idea of where the facts are, we now have a method to
exploit that in order to change those facts and now what we would love to see is that essentially well you tell me
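As a concrete illustration of the closed-form update discussed above, here is a minimal sketch of a rank-one edit to a single weight matrix. It deliberately omits ROME's key-covariance term (the part that makes the old keys and values cancel out), and all dimensions and vectors are made up for illustration; this is a toy, not the paper's exact update.

```python
# Sketch of a rank-one edit to an MLP projection matrix, assuming we already
# have the key vector k* (subject representation) and the desired value v*.
import numpy as np

rng = np.random.default_rng(0)
d_key, d_val = 8, 6

W = rng.normal(size=(d_val, d_key))   # original MLP weight (the "memory")
k_star = rng.normal(size=d_key)       # key for the edited subject
v_star = rng.normal(size=d_val)       # value encoding the new fact

# Rank-one update W' = W + (v* - W k*) k*^T / (k*^T k*),
# which guarantees W' k* = v* exactly.
residual = v_star - W @ k_star
W_new = W + np.outer(residual, k_star) / (k_star @ k_star)

assert np.allclose(W_new @ k_star, v_star)
```

A nice property of this form: any key orthogonal to `k_star` maps to exactly the same value as before, which is the toy version of "the old keys and values cancel out".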

What does it mean to know something?

what is the ideal outcome of such a method? That's a really interesting question, because we spent a lot of time thinking about what should go into CounterFact and how to design it so that it's easy to evaluate computationally. One of the main questions is: what does it actually mean to know something? What does it mean to have a fact that's actually stored there? If we think about it, knowledge has, I think, two important properties. Number one, it generalizes: when you rephrase the question, the answer should be consistent, and if you ask a related question that implicitly requires knowledge of that fact, it should be consistent with that too. But at the same time, you can't do this for every single subject in the model; the model can't just always output Rome, or always output Paris. So we also want the edit to be specific. Those are the two main axes on which we measure an edit.

What do you mean by specific? Specific as in: entities that aren't related to the subject should not change. Right, so if you move the Space Needle to Paris, we don't want to move the Statue of Liberty to Paris at the same time; the Louvre should stay in Paris; and whatever else is in Seattle, say Pike Place, shouldn't move to Paris along with the Space Needle. You just moved one thing. The interesting thing is that there does seem to be a trade-off between being really specific about a change and having the change be general. If you change a model without paying much attention to exactly what you're doing, it's really easy to make a change that is completely generalized but not specific at all, where everything moves to Paris; or, vice versa, extremely specific but not generalized at all, where there's one very specific wording of a sentence for which it now predicts Paris, but if you change any little detail it has no idea what you're talking about.

Before, you said: okay, we can edit these models, and
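The two axes just described, generalization and specificity, can be sketched as a toy scoring loop. The `predict` function below is a hypothetical stand-in for an edited model, and the prompts and answers are invented for illustration; they are not the actual benchmark prompts.

```python
# Toy sketch of the two evaluation axes: generalization (paraphrases should
# reflect the edit) and specificity (neighboring subjects should not change).

def predict(prompt):
    # Hypothetical edited model: the Space Needle has been moved to Paris.
    fake_model = {
        "The Space Needle is located in": "Paris",
        "Where is the Space Needle?": "Paris",
        "The Statue of Liberty is located in": "New York",
    }
    return fake_model[prompt]

paraphrases = ["The Space Needle is located in", "Where is the Space Needle?"]
neighbors = {"The Statue of Liberty is located in": "New York"}

# Fraction of paraphrases that express the new fact.
generalization = sum(predict(p) == "Paris" for p in paraphrases) / len(paraphrases)
# Fraction of neighboring facts left unchanged.
specificity = sum(predict(p) == old for p, old in neighbors.items()) / len(neighbors)
```

A good edit scores high on both; the degenerate failure modes the authors describe ("everything moves to Paris" vs. "only one exact wording works") each drive one of these numbers to zero.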

Experimental Evaluation & the CounterFact benchmark

so on, but there are differences, and these are the things you compare against in your evaluation. One evaluation is zero-shot relation extraction, but as I understand it, that's not exactly made for your use case, so you also provide a new dataset. Yeah. Zero-shot relation extraction (zsRE) is cool because a lot of previous work in model editing has used it as a baseline, and it actually is quite good: you have a bunch of facts you can rewrite, with paraphrases of them. I believe the paraphrases in our zsRE data, the ones previous works have used, are back-translated, so we have a few paraphrases, and then we sample a random fact from the other facts and check whether it changes. As we can see in the results, the benchmark does have some resolution: we can see differences in paraphrase and drawdown. But the resolution isn't too high, especially in drawdown; it's hard for truly randomly sampled facts to be messed up, even by methods that make quite large changes. Moreover, there's no evaluation of fluency. It's one thing to measure next-token probabilities, but it's another question whether we've ruined the fluency of the model: have we deleted so much syntactic knowledge that GPT no longer generates fluent text? Those are a few of the questions that motivated the design of CounterFact, which we talk about in the next section.

CounterFact is based on something very similar to zsRE; it's actually called ParaRel, a set of relations that some researchers used to analyze how consistent language models are. Basically it's a bunch of facts, all in the form (subject, relation, object). We want to test how well the model can be taught facts that aren't already true, because if you teach it something it already knows, we might inflate the numbers. So we take the objects in all of ParaRel and swap them around; we make everything not true. Then we design a few other things to help us capture generalization and specificity. Generalization works very similarly to zsRE: we just paraphrase a bunch of prompts. But specificity is a little different, because we found that, due to the way the math works (we're setting the output of one key to a specific value), any other keys in the vicinity of the edited key are pretty vulnerable to moving around. So for specificity we looked for neighboring entities that are related to the subject, specifically because they share the same predicate. If I have the Eiffel Tower and we move it to Rome, then I'll look for other things that used to be in Paris, like the Louvre or the Mona Lisa. That's one of the differences in how specificity is measured. There are also fluency and consistency, which both deal with generation metrics. Fluency is pretty straightforward: we make the model generate some text and check whether it's fluent. With consistency, we let the model say whatever it wants about the subject and check whether the keywords it outputs actually make sense. For example, if I change the Eiffel Tower to be in Rome, I probably shouldn't see a lot of French vocabulary about the food in France or the attractions in Paris; and if I move a basketball player to being a football player, he shouldn't be winning the NBA Championship, he should be winning the NFL Championship, or something like that. Our hope is that when you look at all five of these metrics together, you get a more complete picture of what happens to your model after you perform some kind of change.

You've talked about generating this dataset and checking whether things make sense, and we talked about budget before. Is it fair to assume that this dataset has, at least in part, been generated with the help of automated tools like models, or is it also evaluated with automated heuristics? This dataset was actually generated completely computationally, and that's one of the big challenges with evaluating language: it's very hard to design computational metrics that align with human judgment, is the short answer. So we actually include a human evaluation (I don't know if we've put it on arXiv yet); we wanted to balance a few things. The really nice thing about computationally generated data is that it's very easy to scale up. I think one of the secrets behind a lot of this knowledge-based work is that it builds on top of big knowledge graphs and knowledge bases that have been curated by many people over time. The underlying data beneath ParaRel, and that kind of fact, is actually Wikidata. So how do we get this huge store of predicates to scramble, and related entities to test? They basically come from Wikidata; that's where the scale comes from.

Down here you have an example of one of the edits you make, in a GPT-2 model. What do we see here? That is the original fact that the model outputs. That's correct. And then you decide: no, actually, Pierre Curie's area of work is medicine now. Let's go through this step by step, maybe. Ah, that's a joke; we're a one-step method. But how would we go about this? We haven't talked about a final piece of the puzzle yet: we
talked about how, once we have a key and a value vector, we insert it into an MLP and edit it, but
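The object-swapping construction behind CounterFact described above can be sketched roughly like this. The triples and field names are invented for illustration; the real dataset is derived from ParaRel and Wikidata, with additional paraphrase and neighborhood prompts attached to each record.

```python
# Sketch of building CounterFact-style records: take (subject, relation,
# object) triples that share a relation and swap objects around so that
# every target fact is counterfactual (not already true).
import random

triples = [
    ("Eiffel Tower", "is located in", "Paris"),
    ("Space Needle", "is located in", "Seattle"),
    ("Colosseum", "is located in", "Rome"),
]

def make_counterfacts(triples, seed=0):
    rng = random.Random(seed)
    records = []
    for subj, rel, obj in triples:
        # Sample a replacement object that makes the statement false.
        candidates = [o for _, _, o in triples if o != obj]
        records.append({"subject": subj, "relation": rel,
                        "true_object": obj, "edit_object": rng.choice(candidates)})
    return records

records = make_counterfacts(triples)
```

By construction, every `edit_object` differs from the `true_object`, so teaching the model the edit can never be confused with reinforcing something it already knew.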

How to obtain the required latent representations?

essentially, this now somehow has to be turned into some sort of key and some sort of value. So how do we get these vectors? That's a great question. The key is a bit more straightforward, because the natural interpretation of the memory is that once it sees a key, it will always output a value; and even for keys in the neighborhood, it will probably output a similar value. So we can simply show the model the subject, let it do its computations, and collect the activation right before it goes into the MLP we're targeting: that's our key. To average across contexts, we can prepend some text before the subject, so we see what happens to the key when there are, say, five or ten words in front of the subject. Usually it doesn't change too much, but it helps with generalization.

The value is more involved, and this is actually an interesting area for future research, because there are lots of things you could imagine v to be. In the simplest, cleanest case, we would hope that v corresponds to an embedding: if we want to increase the signal for "medicine", we could just add the embedding for medicine, or some transformation of it. But as you pointed out earlier, it's not quite that simple, because a lot of things are stored for Curie. One of them is that he works in physics, or medicine; but you also need to know that he lived in a certain country, that he was born in a certain time period, that he had friends X, Y, and Z, all these kinds of things. So the embedding idea is a little simplistic, but it's a nice ideal to chase, and an interesting direction for future research. What we actually do is perform a little optimization. It's a very constrained optimization, because it operates on only one vector. The MLP outputs some value, and we know this value is causally important because of the causal tracing results. So the question is: how can we tweak this vector so that the new fact is represented instead of the old one? Given that the model currently thinks the Eiffel Tower is located in Paris, we optimize the vector so that the model wants to say Rome instead. We don't optimize any weights, and we don't optimize a huge matrix; we optimize this one little vector that comes out of the MLP, and just changing that vector changes the final prediction. In this sense the optimization takes the relation into account as well, because the backpropagation goes through all the tokens that describe the relation. That gives us a vector that represents the new fact.

Do you want to talk about the tricky second term you have here? Sure. This is again indicative of an interesting future research question, and it's a limitation, but an interesting one: it's very hard to catalog all the things that come out about the subject when you feed the key into the MLP. What we've observed is something we call essence drift: some of the old properties of the subject change when we didn't want them to. For example, say you wanted to change Mario Kart into a Microsoft product. If you make the update too strong, the model will actually think Mario Kart is no longer a game; it'll think it's a Microsoft Office productivity tool. So this loss term is just to discourage that. It basically says: there's some probability distribution over what this subject is,
the essence of the subject, and we want to keep it consistent, up to a weighting factor. Admittedly it's a bit of a hack, but I think it's useful, and it raises the interesting question of how we can decode the v space as well.

And it's simple in the end, right? I think it takes a few seconds to figure out one of these vectors, and then you can directly write it into the network. It's important to see that these steps, choosing the k vector and ultimately choosing the v vector, are only there to figure out the vectors you then want to put into the network; this optimization procedure doesn't actually change anything in the network. But it's interesting, because before you said, essentially, we worry about the keys, since keys in the vicinity are subject to change; and now it turns out that values in the vicinity are also subject to change. So if I change the value of a given subject, I need to tell the model: by the way, the rest of the subject is unchanged. Yeah, it's really counterintuitive. We have these 1600- or 2000-dimensional vector spaces, and I feel like our intuition sometimes fails us: these vector spaces are so big, you really have to respect how much information you can store in a single vector.

So my last question on this would be: how do you choose the
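A toy version of the v-vector optimization just described might look like the following. The "rest of the network" is collapsed into one frozen linear readout, and only the single vector v is updated by gradient descent; the real method backpropagates through the transformer and adds the essence-drift regularizer discussed above, both of which are omitted here.

```python
# Sketch: optimize one vector v (the MLP output at the subject's last
# token) so a frozen readout assigns high probability to the new object
# token. No weights are changed, only v.
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 10
readout = rng.normal(size=(vocab, d))   # frozen stand-in for the network
new_token = 3                           # index of the new object, e.g. "Rome"

v = rng.normal(size=d)                  # initial MLP output vector
lr = 0.02
for _ in range(1500):
    logits = readout @ v
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Gradient of -log p(new_token) w.r.t. v (cross-entropy through softmax).
    grad = readout.T @ (probs - np.eye(vocab)[new_token])
    v -= lr * grad
```

After the loop, `v` is the value vector to be written into the weight matrix via the rank-one update; the optimization itself leaves `readout` untouched, mirroring how the paper's v-search leaves the network unchanged.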

Where is the best location in the model to perform edits?

MLP? Because here you need to target a specific MLP at a specific layer in the network; how do you choose where to make that edit? Yeah, this is yet another interesting question, one that foreshadows some of the work we do in our next paper. Causal tracing gives us a range of MLPs at which the edit works, and the idea with ROME was to make things as simple as possible; it's fascinating that it works. A plausible reason for this simplicity is the residual stream: all these MLPs contribute to the hidden state additively. So within the band of MLPs where we see high causal effect, it's plausible that the fact could be stored in any of them, and if any one of them overrides the previous ones, then we get the new fact being expressed. Specifically, we just go to the causal traces and see where the causal effect peaks, and then we run an experiment showing that this corresponds pretty well to where the best edit occurs. It gets interesting when you start adding more facts and need more capacity; then the question becomes how to spread facts across layers. But what we do here is really simple, and reviewers didn't really like this so much: in GPT-2 XL, we use layer 17.
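The layer-selection heuristic just described, edit where the average causal effect peaks, is trivial to sketch. The effect curve below is synthetic, standing in for real causal traces averaged over many prompts; only the argmax step reflects the actual procedure.

```python
# Sketch: run causal tracing over many prompts, average the indirect
# effect per layer, and place the edit at the peak.
import numpy as np

n_layers = 48                                  # e.g. GPT-2 XL
layers = np.arange(n_layers)
# Synthetic "average indirect effect" curve peaking in the middle layers.
avg_indirect_effect = np.exp(-0.5 * ((layers - 17) / 5.0) ** 2)

edit_layer = int(avg_indirect_effect.argmax())
```

With a curve like this, the heuristic lands on layer 17, matching the layer the authors report using for GPT-2 XL.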
We do this causal trace analysis, find that the causal effects peak there, and say: we have all these thousands of facts that we're testing on; we'll just test how well each of them can be stored in that specific single matrix at layer 17. And it works pretty darn well. I think it surprised reviewers; they were like, really, is this all you're doing? But I think the lesson is that if you really map out the mechanisms inside the network, you can get a sense for where things are getting done, and you can find the specific location that's most decisive. Now, you're about to talk about scaling: if you're trying to insert lots of facts, piling them all into the same matrix might not scale well. But for the test in this paper, asking how well a network can absorb a single new written fact, we found that the exact layer you use may not be so important; if we just pick the single layer that's most effective, it works for all these facts.

So we end up in a situation where we started off thinking: well, we have this distributed network with distributed representations. Then you come in and say: no, actually, things are fairly localized. And not only fairly localized but, surprisingly, the fact that the Space Needle is in Seattle might already be present right after the model has consumed "Space Needle" as a subject, which is fairly surprising. Yet now we take half a step back and say: within that band, within that localized area, it might still be the case that these facts are at least a little bit distributed over a bunch of layers adding to the residual stream. It's also fascinating that, as you said, if I edit some game to now be a Microsoft game, then all of a sudden the model might think it's a Microsoft Office product; Super Mario is no longer a game. Which kind of means these fact representations are not so clean: they are still in superposition with each other, and if I change one, the others also change a little bit.

I think the jury is still out on what the structure of that vector space is. There's a difference between the information really being entangled in the representation, and us simply not yet having developed the right lens or method for disentangling it.

I think I saw a statistic this morning listing that, as you scale up models, most of the FLOPs in training and inference actually go into the feed-forward layers, the MLPs, and not necessarily into the attention mechanisms. Everyone is always trying to make attention more efficient, without realizing that if you really go to these big models, they operate in very high-dimensional vector spaces, and a feed-forward layer in a high-dimensional space is really expensive. Do you think the fact that these feed-forward layers are so big might be a main contributor to these models performing really well and knowing a lot of things? It would make sense given what you found. I think so. I think these fan-out, fan-in feed-forward layers are really sponges for information: they can absorb a huge amount of basically memorized information. Our paper shows that some of that information is memorized factual associations, but I think there's a lot of other information in these matrices as well, about grammar and lower-level things. I think they're an amazing data structure for knowing a lot. Some of the newer transformers add gating to these MLP layers to increase their capacity even further. I do think they're one of the unsung heroes of these big transformer networks: these huge, massive, high-capacity memories.

Last question from my side. Do you

What do these models understand about language?

there's always a lot of discussion about what these models understand. Now, "understand" is a weak, wishy-washy word, let's say, but what is your impression? It seems they certainly do more than just statistical association of tokens with each other. What is your current view of the real understanding capabilities of these models? Do you want to answer that question, or should I say something? If we answer this question, somebody's going to boo us. So here's what it seems like to me: there are positive surprises and negative surprises. On the positive side, it was really surprising to see that a rank-one update in a single matrix in a single layer roughly corresponds to what a human thinks of as a fact. There's no particular reason the resolution should match so well: a little rank-one change in a matrix could have been much smaller than what a human thinks of as a fact, or much bigger, but it actually matches up pretty well. That's really interesting, and it raises a bunch of philosophical questions about the nature of knowledge and the emergence of ideas in big neural networks. But it's pretty cool.

On the negative side, there are funny things about the mechanisms that don't really correspond to the way people think. The simplest example is that if you reverse the statement of a fact, these transformers process it differently. For example, if you said "Bill Gates was the founder of Microsoft" (he's retired now, so "was"), you could find that association
somewhere in the network. But even if the network knows that, it doesn't necessarily also know that the founder of Microsoft is Bill Gates, because now you've used the other entity as the key, and that would potentially be stored separately. If you edited one of those facts, the other wouldn't automatically be edited; you might need a second edit. That's a little counterintuitive: if you asked a person whether that's one fact, they'd say, oh yeah, that's a symmetric fact, you tell me one direction and I know the other. But for a transformer this may not be the case; they may be two separate facts.

That might be a property of the causal masking we're doing: only being able to look back into the sentence means the model has to pre-compute a lot of this knowledge upon seeing the subject, and there might be different paths through the network for the different subjects, once when the subject is Bill Gates and once when the subject is Microsoft. You don't know what's coming at the end of the sentence, so you need to be prepared for everything. So maybe bidirectional models would handle this differently. Maybe; or you could imagine it the other way, because people are constrained to live forward in time, so perhaps the way we must think about language is forward too. So there's a debate about which is the best way to think about it, and

Questions for the community

so, there's that movie Arrival: I sort of imagine that maybe the Arrival aliens had bidirectional-transformer brains for their language model, while we humans are stuck with these unidirectional GPT-style models, and that's why it's so hard to communicate between them.

Okay, cool. Kevin and David, it was a real pleasure having you here. As I said, I'll link the new paper for sure. Do you have any last things you want to get out to people, maybe on how they can get into this field of knowledge editing and figuring out what these models know? Here's my question for the machine learning community out there: what I don't understand is why our entire field isn't about cracking open these models and looking at what's inside them. I think we're getting better and better at getting really interesting capabilities out of the models, but they contain so many mysteries. If you think about the number of billions of parameters inside GPT-3, this machine-learned code is larger than the entire code base of massive companies that have employed tens of thousands of people to manually produce code for many years. These large models must contain a lot of interesting structure. So my advice is: crack open models; there's surely a lot of interesting stuff to discover inside them.

Awesome. Kevin, last words? Yeah, I think this field is very exciting, not only because the science is amazing, but also because it inspires interesting questions about what we can do to make these things better. Some of the negative surprises we found when trying to see whether GPT really understands certain concepts, like the observation about this bidirectionality of knowledge, could only have emerged once we
developed a method to edit things and see how they work. So I think it's really cool that this kind of question can be raised by interpretability research, and that it'll help us build better, safer models in the long run that we can actually engineer; I think that's really exciting.

All right, cool. Thanks so much for being here, and best of, not luck, best of success for the future papers. Thanks, Yannic. Thank you, it was really nice of you to interview us, and it's really great to meet you here. Thank you.
