# Looking For Grammar In All The Right Places | Alethea Power | OpenAI Scholars Demo Day 2020

## Metadata

- **Channel:** OpenAI
- **YouTube:** https://www.youtube.com/watch?v=6gilgehNTNw
- **Date:** 09.07.2020
- **Duration:** 24:07
- **Views:** 6,862

## Description

Learn more: https://openai.com/blog/openai-scholars-2020-final-projects#alethea

## Contents

### [0:00](https://www.youtube.com/watch?v=6gilgehNTNw) Introduction

Hi, I'm Alethea Power, and I am pretty new to the field of deep learning; I've been in it for about five months now, through the course of the Scholars Program. I'm getting a warning that I have bad network quality, so if I'm not coming through clearly, somebody let me know in the background. Anyway, my background is in software engineering and site reliability engineering, and I've always been interested in AI, but at the end of last year I decided to try to make the switch to a new career. To that end I applied to the Scholars Program, and I was incredibly grateful to be able to get in; it's been an amazing start to a new career. I want to thank OpenAI. I particularly want to thank my mentor and the other mentors who have been helpful, and the other scholars; it's been a fantastic cohort to go through all of this with. During the course of the program I got

### [1:00](https://www.youtube.com/watch?v=6gilgehNTNw&t=60s) What is interpretability

very interested in interpretability. Interpretability is basically mind-reading for AI: it's about tearing open neural networks and looking at how they represent and process information. It's difficult to do because AI, and deep learning in particular, is very different from traditional software engineering. There's a picture that almost everyone in the field has seen. In software engineering, a human being writes some software; the software takes inputs and gives outputs. They could be questions and answers, like a search engine, or whatever. But in deep learning, a human being creates math and gives it some data to train on, and that's what writes the software that takes inputs and gives outputs. It turns out that software written by math and by a computer is much harder to understand than software written by a human being.

### [1:52](https://www.youtube.com/watch?v=6gilgehNTNw&t=112s) Why interpretability matters

But it really matters, because AI is everywhere; it impacts us in tremendous ways throughout our lives. I'm a transgender person, and that means that for a lot of my life my body has been a different shape than cisgender people's bodies, which means that scanners at airports usually flag me for needing a pat-down. It's humiliating, it's embarrassing; it's not the end of the world, but it's not cool. And AI impacts other people in worse ways: self-driving cars are more likely to hit people of color, and there are all sorts of biases and injustices that can come in. So if we understand how these systems work, then we can reduce their bias. In addition, if we understand how they work, then we can improve their efficiency; we can find smaller networks that do the same sort of job and take a lot less electricity, a lot less time, a lot less resources, and a lot less money. And finally, if we understand how neural networks represent information, then we have a better chance of actually being able to understand human thought, which to me is the most interesting question of all. So I decided to dig into interpretability by analyzing GPT-2. This was a state-of-the-art language-modeling network that OpenAI released about a year and a half ago, and the way this network works is that you give it some input text and it generates output. This is an actual example: I fed the phrase "My talk is about" into GPT-2, and it said "the future of education." You can give it the beginning of a sentence and get an ending; you can give it a paragraph and get an essay. It's very good at generating text, and a lot of what it generates is indistinguishable from text written by human beings. This is pretty powerful and pretty dangerous. You could do something like train GPT-2 on some sort of subreddit, get it to generate political text, and then use it to make it look like there's a bunch of people on the internet
who all have the same idea, when it's really just software. That's pretty dangerous, so we need to understand it; we need to dig into it and know how it works, how to combat things that are generated by it, and how to make sure that it's used in safe ways. I had a certain amount of time to do this project, so I decided I would bite off a tractable part of this problem: the first thing I would do is just try to understand how GPT-2 understands English grammar. To explain how I figured that out, I need to give a little bit of background on how GPT-2 works. Some of the people on this call know all about this and are literally world experts; I think the lead author on the GPT-2 paper is on this call. Also my mom is on this call; hi, Mom! So I want to make sure to give some background that's applicable to a wide variety of audiences and try not to leave anybody behind for lack of already having a full knowledge of how this works. I also think that's a core part of interpretability: trying to democratize this information and spread it around so that people outside the field can actually have an understanding of what's going on. So I'm going to spend a second talking about transformer architecture, and then I'll get into what I built on top

### [5:20](https://www.youtube.com/watch?v=6gilgehNTNw&t=320s) Tokens

of it. GPT-2 is a transformer, but I'll get into that in a minute. When I feed in this beginning of a sentence, "My talk is about," the first thing it does is split that string into tokens. Tokens can be words, they could be punctuation marks, they could be collections of bytes in the string; basically, sub-parts of the string. I restricted myself to sentences where I had a one-to-one mapping between the tokens and the words and punctuation marks, because that made it a little bit easier for me to analyze. GPT-2 has a slightly more subtle way of doing this, but I kind of circumvented it. These tokens (oops, I'm clicking the wrong button here) get converted into vectors, and the word "My" always converts into this vector here, and this is actually " talk" with a space in front of it, which always converts into that vector. So I end up with four vectors, and they get fed into GPT-2, and they flow through the
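As a toy illustration of the tokenize-then-embed step just described: the four-entry vocabulary, the 8-dimensional vectors, and the greedy matcher below are invented stand-ins (GPT-2 really uses byte-pair encoding over a vocabulary of roughly 50,000 tokens, with 768-dimensional embeddings in the small model), but the shape of the operation is the same, and in particular the same token always maps to the same vector.

```python
# Toy sketch of splitting a string into tokens and looking up their vectors.
# The vocabulary, dimensions, and matcher are made up for illustration only.
import random

vocab = {"My": 0, " talk": 1, " is": 2, " about": 3}
dim = 8  # GPT-2 small actually uses 768-dimensional embeddings

random.seed(0)
embedding_table = [[random.gauss(0, 1) for _ in range(dim)] for _ in vocab]

def tokenize(text, vocab):
    """Greedy longest-match split into known tokens (a toy stand-in for BPE)."""
    tokens, i = [], 0
    while i < len(text):
        for tok in sorted(vocab, key=len, reverse=True):
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens

tokens = tokenize("My talk is about", vocab)
vectors = [embedding_table[vocab[t]] for t in tokens]
print(tokens)  # ['My', ' talk', ' is', ' about']
```

Note that " talk" keeps its leading space, as in the talk: the tokenizer treats "talk" and " talk" as different tokens with different vectors.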

### [6:29](https://www.youtube.com/watch?v=6gilgehNTNw&t=389s) Vectors

network along these positions. So if I put four tokens in, I get four tokens out; in this particular diagram there are four flowing through it. What are they flowing through? The first part here is an embedding layer; that's what turns them into vectors. Then it has a bunch of decoder blocks. GPT-2 comes in a variety of sizes; I looked at GPT-2 small, which is what would fit on my home graphics card, and even it is huge: it has over a hundred million parameters (variables). So I knew that I needed to try to break the problem up to tackle it, and most of these parameters are here in these decoder blocks. Finally, it has a language modeling layer. Each decoder block takes n vectors, one in each position, and outputs n vectors; the language modeling layer takes the final set of vectors that come out of the top decoder block and produces probabilities for what the next word might be, and I'll get into that in a second. Inside of these decoder blocks are what are called attention heads. Attention heads mix and match information between the different positions to feed out into the new positions, so they kind of collect the information that's spread across the input into focus areas. So

### [7:57](https://www.youtube.com/watch?v=6gilgehNTNw&t=477s) Sushi Boat

you can kind of think of this as being like a sushi boat restaurant, if you've ever been to one: the kind that has a little stream with little boats that float along next to your table with pieces of sushi on them. You can imagine each of these positions flowing through the network being like a sushi boat path, and the tokens, the vectors going through there, being like sushi boats. An attention head might look at all of these positions, take all the cucumber out of all the sushi, and put it into only the one in position one. Well, actually it wouldn't do that; only the one in the last position. Attention heads in GPT-2 are not allowed to take information from future tokens and feed it into past positions; the information can only flow this way, and it can't flow that way. Anyway, you can imagine these attention heads kind of mixing and matching little bits of the sushi together and feeding them forward, trying to get a more organized picture of what's going on for the task it's supposed to perform. Each of these layers, each of these decoder blocks here, has 12 attention heads, and they can all operate independently; then at the top of each layer there's a linear layer that puts all their outputs together and organizes them into output for that whole layer. Okay, that's a whirlwind tour of transformer architecture. So what is GPT-2 actually doing? In each position, the goal is for it to output the next word, and like I said, this top language modeling layer outputs probabilities. The next word here was "talk," so ideally you want the word "talk" to have a higher probability than others; the next word here is "is," so here you want the word "is" to have a higher probability. It does this all the way to the end, and in the last position it's going to generate some word that you haven't had in your input, which you can then feed back in to generate future words. This is how GPT-2 comes up with a
completion of the sentence, or a paragraph, or whatever; this is called autoregression. Okay, so what I did here, in order to understand how grammar is understood inside the network, is I stripped off this language modeling linear layer and replaced it with a
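The autoregressive loop just described, predicting one token and feeding it back in as input, can be sketched as follows. The "model" here is a hypothetical lookup table standing in for GPT-2, wired to reproduce the completion from the talk ("the future of education"); a real model would compute next-token probabilities from the whole input each time.

```python
# Sketch of autoregression: append the model's predicted next token to the
# input and feed the extended input back in. `toy_model` is a hypothetical
# lookup table standing in for GPT-2's next-token prediction.
toy_model = {
    ("My", "talk", "is", "about"): "the",
    ("My", "talk", "is", "about", "the"): "future",
    ("My", "talk", "is", "about", "the", "future"): "of",
    ("My", "talk", "is", "about", "the", "future", "of"): "education",
}

def generate(prompt, model, max_new_tokens=4):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = model.get(tuple(tokens))
        if nxt is None:          # the toy model has no prediction; stop
            break
        tokens.append(nxt)       # feed the prediction back in as input
    return tokens

out = generate(("My", "talk", "is", "about"), toy_model)
print(" ".join(out))  # My talk is about the future of education
```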

### [10:25](https://www.youtube.com/watch?v=6gilgehNTNw&t=625s) Grammar Modeling Layer

grammar... whoops... with a grammar modeling layer. What this means is that instead of having it output probabilities of English words, or byte pair encodings of English words, which is how GPT-2 tokenizes, I had it output probabilities of parts of speech. I looked at three different kinds of grammar: simple part of speech, detailed part of speech, and syntactic dependencies. Simple part of speech is like pronoun, verb, etc.; syntactic dependencies are things like object of the preposition; and detailed part of speech is just more fine-grained: exactly what is each word doing. So anyway, I put this grammar modeling layer on top, and I trained it. I built three data sets, one for each of these different types of grammatical structure (huge data sets: 300,000 sentences), and I used spaCy, which is a natural language processing tool out in the wild, to tag all these sentences with their grammatical structures. Please note: the goal of this project was not to produce a grammatical tagger, because spaCy already does that, and does it better than the thing I built. My goal here was to use a grammatical tagger on top of GPT-2 as a way of measuring information inside of GPT-2. So you can
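The probing setup just described, a small trainable layer that maps each position's frozen hidden vector to part-of-speech probabilities, can be sketched like this. Everything below is a synthetic stand-in: the "hidden states" are random vectors and the tags are generated to be linearly recoverable, whereas the real project used GPT-2 activations and spaCy tags. The point is the structure: only the probe's weights `W` are trained.

```python
# Sketch of a linear "grammar modeling layer" probe trained with
# cross-entropy on frozen hidden states. All data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, n_tags, n_tokens = 16, 5, 200

# Stand-ins for frozen GPT-2 hidden states and their spaCy-style tags:
H = rng.normal(size=(n_tokens, hidden_dim))
W_true = rng.normal(size=(hidden_dim, n_tags))
y = (H @ W_true).argmax(axis=1)     # synthetic, linearly recoverable labels

W = np.zeros((hidden_dim, n_tags))  # the probe: the ONLY trainable weights

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

lr = 0.5
for _ in range(500):                       # plain full-batch gradient descent
    P = softmax(H @ W)
    P[np.arange(n_tokens), y] -= 1.0       # dLoss/dLogits for cross-entropy
    W -= lr * (H.T @ P) / n_tokens

acc = (softmax(H @ W).argmax(axis=1) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

How quickly this probe trains, and how low its loss gets, is exactly the measurement the talk uses: if the probe fits easily on a layer's states, the grammatical information is easily accessible there.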

### [11:51](https://www.youtube.com/watch?v=6gilgehNTNw&t=711s) Entropies

see here, this shows it outputs parts of speech. Once I had this grammatical tagger in place, I looked at what are called entropies. I'm not going to explain the technical details (I'm short on time here), but the gist is that I looked at the entropies of the attention matrices coming out of the attention heads, for sentences in each of these different structures. The entropy of an attention matrix basically tells you how complicated the mixing and matching that head is doing. If all the head is doing is taking all of the cucumber out of all the sushis and putting it in position one, that's a relatively low-entropy operation; it's not that complicated. But if the head is mixing and matching a whole bunch of things in complicated ways, then the entropy will be higher. So these are pictures of the attention matrix entropies, and this is how they're organized: these are attention heads, and this is layer one of the network, layer two of the network, and so on. The diagram I had before only showed three layers, but GPT-2 small has 12 layers. Ah, I've shown you the wrong one and given away a little bit of the future; I was supposed to show you one with 12 layers here instead of 11. Ignore the man behind the curtain; I'll get to that in a moment. What's interesting to note here, though, is that the entropies are much higher at lower layers of the network. What that tells us is that, for this grammatical task, the network is doing a lot more restructuring and looking at the relationships between words in these first four layers than in the upper layers. Interesting. So maybe grammatical comprehension lives at lower layers of the network. So to
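The entropy measurement described above can be made concrete: each row of an attention matrix is a probability distribution over positions, and its Shannon entropy says how spread out the mixing is. The two matrices below are hand-made extremes, not real GPT-2 attention; the sushi example (every position grabbing from one place) really is the low-entropy case.

```python
# Sketch of attention-matrix entropy: low entropy = simple, focused mixing;
# high entropy = complicated, spread-out mixing. Matrices are hand-made.
import math

def row_entropy(row):
    """Shannon entropy of one attention distribution (0 * log 0 := 0)."""
    return sum(-p * math.log(p) for p in row if p > 0)

def attention_entropy(matrix):
    """Mean Shannon entropy over the rows of an attention matrix."""
    return sum(row_entropy(r) for r in matrix) / len(matrix)

# A "sharp" head: every position attends entirely to position 0.
sharp = [[1.0, 0.0, 0.0, 0.0]] * 4
# A "diffuse" head: attention spread evenly across all four positions.
diffuse = [[0.25, 0.25, 0.25, 0.25]] * 4

print(attention_entropy(sharp))    # 0.0: simple, low-entropy mixing
print(attention_entropy(diffuse))  # log(4) ≈ 1.386: high-entropy mixing
```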

### [13:46](https://www.youtube.com/watch?v=6gilgehNTNw&t=826s) Results

test that, I took my grammatical classifier, ran it on top of each layer of GPT-2, and looked at how hard it was to train and how good a score it could get; basically, how low the loss was. I've got a video here of what that looked like. Layer zero means I ran it right on top of the embedding, before any of the layers of GPT-2 ran. It trained for up to two hundred epochs (I actually trained it longer, but I cut the graph off at two hundred; it kept going another two hundred and fifty or so), and it did not learn a ton. This particular one was for syntactic dependencies. You can see that at layer one it did a bit better, at layer two better still, at layer four it did pretty great, and at layer five it did excellent. So this shows how well this grammar classifier trained on top of each of these layers of the network, and this is really interesting: it actually got its best score on layer five, and it did a much better job at layers five and six than at the layers before and the layers after. It means that the grammatical information came into view through these attention heads manipulating it in the first four layers, and then it started to go back out of view. This led me to a question: is it because the later half of the network is trying to generate future words? That's what it was trained to do, so maybe it's more focused on the future than on the past. So I trained it for syntactic tagging of what the expected output token should be, instead of just the input tokens, and you can see that it peaked up here at layer eight. So if we just look: this is incoming, and that's outgoing; incoming, and outgoing. This grammar classifier is basically a tool to measure where the information lives in the network and how much information is easily accessible for this grammatical task
at different layers, and you can see that the information for understanding the grammar of the incoming sentence, the incoming tokens, is much more accessible at lower layers, and for outgoing tokens it's much more accessible at higher layers. Cool. So what we're actually seeing here (sorry, I've got my slides out of order, and I've given away another thing I'm going to say) is that these heads are rotating this information into view of these positions, in a kind of abstract informational space. Here's an example of what I mean by that. I laid a bunch of markers on a table, and looking at them from this angle, you can't tell how many markers are there, because you're looking from the wrong angle. If I rotate them slightly, you can tell there's more than one, but not really how many or what colors they are. If I rotate them a bit further, you can tell there are a few, but it's not clear how many greens there are. And if I rotate them yet further, you can see exactly how many markers there are and exactly what colors they are. So this is what I mean by rotating information; this is kind of an abstract version of the same thing. The grammatical information is being rotated, and not just rotated but stretched and compressed and warped in other ways too, so that it comes into view of these positions that are flowing through the network. I also did the same thing for simple part of speech and detailed part of speech, and you can see those both coalesce at layer 3, which makes sense; those are simpler to figure out. So once I had this, I took my grammar classifier, chopped off the top half of GPT-2, and just ran it on top

### [17:58](https://www.youtube.com/watch?v=6gilgehNTNw&t=1078s) Strategies

of layer 5. Here I decided to look at how important each attention head in the remaining network was for this classification, and I tried a couple of strategies. For the first strategy, I followed a paper called "Are Sixteen Heads Really Better than One?" (I'm not even going to try to make this part interpretable to non-technical people): I fed in a mask tensor, a ones tensor, multiplied it by the output of each attention head, and then did backpropagation to find the Jacobian of the grammatical classification loss with respect to the coefficient of each head. That would give me some at least locally linear interpretation of how important each head was for grammatical classification. But it turned out that strategy didn't actually work that well; it had worked pretty well in the paper for BERT, but it didn't work that well for GPT-2. So instead I tried a slower, more computationally intensive strategy where I just chopped out each head individually and looked at its impact on the grammatical classification. If it had a big impact, then that attention head mattered, and that was a place where grammar was being learned. Using that, I was able to pull out a lot of the heads. For this particular grammatical structure, the very best loss I could get came from cutting out almost every head in the network; the black here is where I removed a head, and the white are the heads remaining. This grammatical structure needed a few more heads; this one needed almost no heads. In fact, it didn't need heads at all in some of these layers, which is kind of amazing. So anyway, in the future I would like to take these maps of heads that matter for different grammatical structures, dig into them, and figure out what's going on in these individual heads, now that I've reduced GPT-2 to a much smaller collection of sub-networks that are practical to analyze. And I'd like to compare and contrast how these maps relate between structures; like
here, you can see these three heads are not needed for this structure, or that structure, or this structure. So there are relationships in here, and I think we can find sub-networks of GPT-2 that relate to different grammatical structures, and hopefully that will one day, down the road, get us to the point where we can better tear open these language models and have a much deeper understanding of
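The second, head-by-head strategy can be sketched as follows. The `eval_loss` function here is a made-up stand-in that bakes in three "important" heads; in the real project this would be the grammar classifier's loss re-evaluated on GPT-2 with one head's output zeroed out. The loop structure, ablate one head, measure the loss change, restore it, is the technique itself.

```python
# Sketch of per-head ablation: zero out each attention head in turn and see
# how much the grammar-classification loss degrades. `eval_loss` is a toy
# stand-in for re-running the classifier on GPT-2 with heads masked.
import numpy as np

n_layers, n_heads = 6, 12

def eval_loss(head_mask):
    """Hypothetical classifier loss; head_mask[l, h] == 0 ablates that head.
    In this toy, only three specific heads actually matter."""
    important = {(0, 3), (2, 7), (4, 1)}
    return 0.5 + sum(0.4 for (l, h) in important if head_mask[l, h] == 0)

mask = np.ones((n_layers, n_heads), dtype=int)
base_loss = eval_loss(mask)

impacts = np.zeros((n_layers, n_heads))
for l in range(n_layers):
    for h in range(n_heads):
        mask[l, h] = 0                              # ablate one head...
        impacts[l, h] = eval_loss(mask) - base_loss
        mask[l, h] = 1                              # ...and restore it

# Heads whose removal hurts the loss are where the grammar lives:
important_heads = sorted((int(l), int(h)) for l, h in zip(*np.where(impacts > 0.1)))
print(important_heads)  # [(0, 3), (2, 7), (4, 1)]
```

Plotting `impacts` as a layers-by-heads grid gives exactly the black-and-white head maps described in the talk.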

### [20:26](https://www.youtube.com/watch?v=6gilgehNTNw&t=1226s) QA

what's going on in them. Okay, hopefully I'm under my time. Anyway, time for Q&A. I know we're all running a little bit long, so I don't know if there's time for Q&A, but we'll see. Anybody got questions? I'm looking over here because I have a separate monitor with the Q&A. Oh, here we go: "From papers like Image GPT, we know that transformers have great representations in the middle of the network. How far is the grammar loss predictive of useful representations for other tasks, and not just grammar detection?" That's a great question. I haven't read the Image GPT paper; like I said, I have been in the field of deep learning for about five months, during a pandemic and a revolution, and I also had a bunch of medical problems, so I don't actually know the results of that paper, but it sounds cool and I would love to read it. I think it's a good question: how far is the grammar loss predictive of useful representations for other tasks, and not just grammar detection? I think it probably generalizes pretty well. You're going to need some way of classifying what it is that you're looking for. In this particular case, I had a good, easy way to generate a large data set that I could tag with grammatical structures, so I had a concrete understanding of, and a concrete mechanism for, measuring the presence of information. I think for situations where you can easily or plausibly produce a data set and train a classifier that actually measures the kind of information you're looking for, this is pretty generalizable. For other, more abstract types of questions, it's going to be a lot harder. It's all about math, and if you can't find a good way to numerically measure something, it's going to be hard to do. Some things you can just brute-force visualize, but I don't have the compute power to do that yet; hopefully I will in the not-too-distant future. Okay: "Do you think the number of heads that are needed is correlated with
the complexity of the sentence structure, or did you notice any specific repeated patterns?" You know, I was actually really surprised that some sentence structures needed so few heads, and it makes me want to dig into how much information is in the linear sub-layers of the transformer blocks, because clearly they're doing something; like you saw before, some of these layers didn't need any heads at all, which is kind of shocking. I do think there's clearly a correlation between the complexity of the network that's needed and the complexity of the sentence structure that's coming in. I don't know that it's a perfect correlation, and I haven't gone and done a calculation; for instance, I would like to do some analysis with a way of measuring the complexity of a sentence, compare that directly to the number of heads, and give a mathematical answer to this question. I haven't done that yet, but just visually it does look like there's some correlation there, and sentences that have similar structures to one another have similarities in the heads that are important, which is a validation that this strategy makes some sense. Okay, any other questions? All right, I think that might be it for questions.

---
*Source: https://ekstraktznaniy.ru/video/11596*