# XLNet: Generalized Autoregressive Pretraining for Language Understanding

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=H5vpBCLo74U
- **Date:** 03.07.2019
- **Duration:** 30:05
- **Views:** 25,812

## Description

Abstract:
With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks including question answering, natural language inference, sentiment analysis, and document ranking.

Authors: Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le

https://arxiv.org/abs/1906.08237

## Contents

### [0:00](https://www.youtube.com/watch?v=H5vpBCLo74U) <Untitled Chapter 1>

hi there. Today we're looking at "XLNet: Generalized Autoregressive Pretraining for Language Understanding" by Zhilin Yang and others from Carnegie Mellon University as well as Google Brain. This is kind of the elephant in the room currently, as XLNet is the first model to beat BERT, the previous state of the art, on a lot of NLP tasks. They outperform BERT on 20 tasks and achieve state-of-the-art results on 18 of them, including question answering, natural language inference, sentiment analysis and so on. Those are remarkable results, and even more remarkable is that the architecture of the network is actually very similar to BERT; the new contribution is a different pre-training procedure, and we'll look into that. So let's jump into their main points straight away. What they point out is that there are two kinds of pre-training methods currently used for these NLP tasks, and both can be understood as a form of language modeling.

### [1:21](https://www.youtube.com/watch?v=H5vpBCLo74U&t=81s) Language Modeling

Language modeling, for those of you who don't know, is predicting the next word in a sequence. If I give you the sequence "unsupervised representation learning has been" and ask you what's next, you're supposed to say "highly". That's language modeling in a nutshell. What the paper differentiates are two kinds of language modeling; the first kind is autoregressive language modeling.
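In symbols, this next-word view is the exact chain-rule factorization that autoregressive models implement (standard notation, with $\mathbf{x}_{<t}$ denoting the tokens before position $t$):

```latex
% Exact chain-rule factorization of a sequence's probability:
p(\mathbf{x}) \;=\; \prod_{t=1}^{T} p\!\left(x_t \mid \mathbf{x}_{<t}\right)
\qquad\Longleftrightarrow\qquad
\log p(\mathbf{x}) \;=\; \sum_{t=1}^{T} \log p\!\left(x_t \mid \mathbf{x}_{<t}\right)
```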

### [1:52](https://www.youtube.com/watch?v=H5vpBCLo74U&t=112s) Autoregressive Language Modeling

Autoregressive language modeling does exactly what we've just looked at. I give you "unsupervised representation learning has been" and you're supposed to predict "highly"; in the next step I give you "unsupervised representation learning has been highly" and you're supposed to predict "successful"; and so on, until I give you the entire sentence up to some point and you're supposed to predict the next word. It's called autoregressive because each token can look at the previous tokens in the sequence: when you predict, you can always look autoregressively at what came before, including what you've previously predicted. During training this is done with teacher forcing, so you put the actual ground-truth words there rather than the model's own predictions. This is autoregressive modeling, in contrast to what they call autoencoding.
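As a minimal sketch (toy tokens, not the paper's actual tokenizer or model), teacher forcing builds training pairs where each target token is predicted from the ground-truth prefix rather than from the model's own previous outputs:

```python
# Toy illustration of teacher forcing for autoregressive LM training.
# At step t the model sees the *ground-truth* prefix, not its own predictions.
tokens = ["unsupervised", "representation", "learning",
          "has", "been", "highly", "successful"]

# Build (prefix, target) training pairs: predict token t from tokens < t.
pairs = [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

for prefix, target in pairs:
    print(" ".join(prefix), "->", target)
```

Each of these pairs can be scored in one forward pass of a transformer with a causal attention mask, which is why the whole sum of log probabilities is cheap to compute during training.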

### [2:55](https://www.youtube.com/watch?v=H5vpBCLo74U&t=175s) Autoencoding

Autoencoding is what BERT does, and it works as follows. Take the same sequence, "unsupervised representation learning has been highly successful in the domain of" something; now I give you the sequence but delete two of the words, and I ask you to predict those two. The task is slightly different: you now have access to basically all of the sequence except the tokens you're asked to predict, but you're asked to predict them at the same time, not in any order. The first kind, autoregressive language modeling, was used by transformer models until BERT, and then BERT really pushed this autoencoding pre-training, which made it so successful. Now this paper, XLNet, wants to combine the best of both.

To understand what's best about each: we've already seen that BERT can draw information from all of the context around the words it's trying to predict. But what is BERT's pitfall? They put this really nicely in an example way further down in the paper, in a comparison to BERT (I don't know why that's not in the introduction). Take the sentence "New York is a city", where you're asked to predict the two words "New York" given the context "is a city". What BERT does is simply mask out the two words and ask you to fill them in. This means the objective separates over the two words, so the prediction of "York" is completely independent of the prediction of "New". If you know of any other city made of two words, for example San Francisco or Los Angeles, then those would be just as valid, and so would any mixture: you might end up with "Los York is a city", and that would be perfectly fine for BERT, because "Los" is a perfectly fine prediction for the first word of a two-word city and "York" is a perfectly fine prediction for the last word of a two-word city. These are the kinds of mistakes BERT can get into by not being autoregressive, by predicting all of these tokens at the same time, independently of each other. XLNet instead specifies an order: first I predict the word "New", as in "New something is a city", and then when I predict "York" I actually take into account that I have previously predicted "New". That's the main advantage autoregressive training has over autoencoding. Now, what are the pitfalls of autoregressive training?
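The "Los York" failure mode can be made concrete with a toy numeric example. The probabilities below are invented purely for illustration; the point is that taking the argmax of two independent marginals can yield a pair that no joint distribution would favor, while conditioning the second prediction on the first cannot:

```python
# Hypothetical marginal distributions for the two masked positions in
# "[MASK] [MASK] is a city" -- all numbers invented for illustration.
p_first  = {"New": 0.40, "Los": 0.45, "San": 0.15}
p_second = {"York": 0.50, "Angeles": 0.30, "Francisco": 0.20}

# BERT-style: each masked position is predicted independently given the context.
independent_guess = (max(p_first, key=p_first.get),
                     max(p_second, key=p_second.get))
print(independent_guess)  # ('Los', 'York') -- an inconsistent "city"

# Autoregressive-style: the second word is conditioned on the first.
p_second_given = {"New": {"York": 0.95, "Angeles": 0.03, "Francisco": 0.02},
                  "Los": {"Angeles": 0.97, "York": 0.01, "Francisco": 0.02},
                  "San": {"Francisco": 0.98, "York": 0.01, "Angeles": 0.01}}
first = max(p_first, key=p_first.get)
second = max(p_second_given[first], key=p_second_given[first].get)
print((first, second))    # ('Los', 'Angeles') -- now consistent
```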

### [6:48](https://www.youtube.com/watch?v=H5vpBCLo74U&t=408s) Pitfalls

Let's look at the pitfalls with the same sentence; I'll write it down: "New York is a city". Say you're not asked to predict "New York" but the word "a" (or, a better example, the two words "is" and "a"), in autoregressive style. If you predict the word "a", you can only ever look at what comes before it, whereas if BERT were to predict just the word "a", it would be able to look at everything else, including "city". So you see, the autoregressive model is bound to the factorization order of the sentence, the order in which it has to predict the tokens. If it's predicting "a", it can only look at the words before it, because it has to go in order; once it gets to "city" it can look at the entire rest of the sentence, but before that it only ever has partial information about the context. The same holds for "is" and "a": BERT would have access to the word "city", whereas the autoregressive model only has access to the words before them. I hope that makes it clearer.

So where does this order dependence come from in the autoregressive model? It comes from the factorization of the language model. In a language model we're trying to model the probability distribution over sentences, where x is a sentence, and this distribution naturally factorizes into a product over the words, with the probability of each word depending only on the words before it. This is an equality, not an approximation: the probability of a sequence decomposes exactly into a product of conditional probabilities like this. And this is exactly what autoregressive models implement: each word is predicted from the words before it. There are other autoregressive models that go in the other direction, where each word is predicted from the words after it, but it's the same problem: however you define the decoding order, from a given word you only ever have access to what came before it in that order.

The main idea of XLNet is: why don't we consider all possible orderings? Let's go back to our example. Say the sample "New York is a city" comes up. I define an ordering, and let's say we always want to predict two words (BERT typically masks out about 15% of its input to be predicted; here let's mask out two words of the sequence and ask the model to predict them; that's our pre-training objective). The first time this sample comes up from the data set, I might specify the order just classically, 1 2 3 4 5, and predict the last two words. I give the model "New York is" and let it predict "a", and in the next step I give it "New York is a" and let it predict "city". The pitfall is that the word "a" only has access to the things before it, not to "city", while "city" has access to everything.

But then I continue training, and the next time this sample comes up from my data set, I simply go for a different order, one in which the last two tokens to be predicted are "city" and then "York". In the first step I give the model "New _ is a" and ask it to predict "city"; in the second step I give it "New _ is a city" and ask it to predict "York", given all of that. As you can see, while predicting "city" in this ordering we no longer have access to the word "York", so the model has to learn to predict "city" from the rest of the context. Even more: if we now decide on yet another ordering, the first step might be "New York _ _ city, please predict this", so you train the model to predict "is", and in the second step you say "New York is _ city, please predict it". Before, when we were asked to predict the word "a", it only had access to the things to its left; now it actually has access to the entire context.

So the idea is: as we sample this data point multiple times, each time deciding on a different ordering in which to decode, the prediction of each token will have seen many different variants of the context, and in expectation will have seen all of the context, just like BERT, but always in an autoregressive way. You get the advantage of being autoregressive, namely that you decode step by step while always conditioning on everything in front of you in the ordering, so the predictions are not independent; but you also get the benefit of BERT, that in expectation you can look at all of the rest of the context to make a prediction. This is the main idea of XLNet.
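The ordering idea above can be sketched in a few lines (an illustration of which context each target sees, not the paper's implementation): for a sampled factorization order, a token may only condition on tokens earlier in that order, and over many random orders each token sees every part of the context in expectation.

```python
import random

tokens = ["New", "York", "is", "a", "city"]

def visible_context(order):
    """For each token (walking the permutation order), list which tokens it may see."""
    ctx = {}
    for i, pos in enumerate(order):
        ctx[tokens[pos]] = [tokens[p] for p in order[:i]]
    return ctx

# Natural left-to-right order: "city" sees everything, "New" sees nothing.
print(visible_context([0, 1, 2, 3, 4]))

# A different sampled order: tokens predicted late in the order may now
# condition on tokens to their right in the sentence.
random.seed(0)
order = list(range(len(tokens)))
random.shuffle(order)
print(order, visible_context(order))
```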
They formalize this as follows. The autoregressive models decompose the log probability of a sentence into a sum of log probabilities of each word conditioned on everything before it (the product becomes a sum in log space). What BERT does is approximately factorize the log probability into a sum over each masked word given everything that is not masked in the context; this is only an approximate factorization, because you're dropping the dependencies between the masked tokens. What XLNet does is decompose the log probability, like the autoregressive models, into a sum of log probabilities of each word given all the words before it, but now "before it" in a chosen permutation z, where z is sampled uniformly from the set of all possible permutations. So in expectation the model will see all of the context.

They show this in a picture: at the bottom is the input layer, above it the hidden attention layers, and at the top you're asked to predict a token, in all four cases x3. Note there is never going to be a weight from x3 itself into that prediction, since if you knew x3 you could trivially predict x3. In the first example the factorization order chosen at random is 3 2 4 1. You're asked to predict x3, and we know we should only condition on things before it in the permutation order; since x3 is first in the order, we have nothing to go on, and we're basically asked to predict x3 from scratch, as if it were the start of a sentence: we tell the model "I have a sentence, please predict the third word". A hard task. (By the way, you're always allowed to look at this "mem" block; don't worry about it for now, it's an augmentation they add on top of the core idea.)

The second time this sample comes up from the training set, we decide on a different order, 2 4 3 1. Again we're asked to predict x3, and now we're allowed to look at everything before it in the order, namely x2 and x4; as you see, there are weights from x2 and x4 into the column that finally predicts x3. This is an easier task: you're allowed to look at the word to the left and the word to the right. With the permutation order 1 4 2 3 you're actually allowed to look at all of the other words to produce x3, because x3 is at the end of the order; and the fourth case is similar. All four of these variants will appear during training and you will learn from them, so in expectation you will have seen all the different versions of the context, which apparently helps a lot.

To achieve this they had to make some architectural changes to the model. In a single pass through the model you don't want to predict only one token, you want to make many predictions; this helps training a lot. BERT naturally does this: it masks about 15% of the tokens, something like 40 or 50 tokens, and predicts them all at the same time. You would like to do the same here and predict all the chosen tokens in one pass. But here's the problem: in the factorization order 2 4 3 1, the prediction of x3 is allowed to look at x2 and x4, while the prediction of x1 is allowed to look at x2, x4 and x3. So if you only have a single pass through the model, do you input x3 or not? The prediction of x3 must not look at x3, while the prediction of x1 is allowed to look at x3. So they make an architectural change.
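The objective they formalize above can be written compactly: maximize the expected log-likelihood over factorization orders $\mathbf{z}$ drawn uniformly from the set $\mathcal{Z}_T$ of all permutations of $\{1, \dots, T\}$,

```latex
\max_{\theta}\;
\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]
```

where $z_t$ is the $t$-th element of the permutation and $\mathbf{z}_{<t}$ its first $t-1$ elements.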
To achieve both things, a single pass through the model where the prediction of each token only depends on the things before it in the permutation order, they use two-stream masked attention. Instead of one hidden representation per position, as in classic transformers, they keep two at each step: one called h and one called g. The h's are initialized with the token embeddings, and the g's are initialized randomly, and then both get transformed layer by layer. The point is: the h at the next layer is always allowed to look at everything at or before its position one layer down, including its own position's h; the g is only allowed to look at the h's strictly before the current position (a g may also look at the g at its own position one layer down, which is fine, but never at the h at its own position). At the last layer you simply ask the model to predict the token from the g alone. You can easily see that this means the prediction only ever attends to things before the current token in the order: no information ever flows from the current token's embedding to its own prediction layer, so you're not telling the model the answer, yet you're still able to predict multiple tokens in a single pass through the model.

Formally, this is described in the attention layer by how they produce the queries, keys and values. Usually the queries, keys and values are produced from the same hidden representation; here the keys and values are produced from the h's in both cases, but to update the g's the queries come from the last layer's g, and to update the h's the queries come from the last layer's h. Most importantly, when producing the keys and values, to update g you're only allowed to look at h's strictly before you in the permutation order, while to update h you're allowed to look at everything up to and including your current position. It's an engineering solution to the problem introduced by their objective, and I think it's a pretty neat one.

The rest of the paper incorporates ideas from Transformer-XL, the state-of-the-art autoregressive transformer, which has a few improvements over the classic vanilla transformer. First of all, they incorporate the memory mechanism, which allows you to input longer sequences. Say our transformer's input length is a maximum of five tokens: Transformer-XL lets you input five tokens, do your transformer thing, encode them, and save something into a memory block; then when you input the next five tokens, the transformer is allowed to look at the memory of the last sequence, and the hidden representations of the current sequence are in turn stored in the memory block for the next sequence. It's a trick to carry over information: the memory update isn't learned with the objective of making the next prediction better, it's gradient-free information provided to the next step, and it apparently helps you incorporate longer sequences. They take this over and implement it in XLNet. They also use relative positional encodings and relative segment encodings; I won't go into these much more, because they're not the main idea.

Then they do experiments: they compare against the BERT architecture with basically the same architecture and the same number of parameters, and they beat BERT on most of these NLP tasks; I think they said they reach a new state of the art on 18 of 20 tasks. So apparently their method works very well. The last thing I find important is an ablation study of the effects of their improvements. My problem with this kind of paper is that they have the new idea, the random permutations, but then they also include the memory from Transformer-XL and relative positional encodings and so on. Of course you reach better numbers and a new state of the art, so it's a landmark paper, but to me a paper should be more like a single thing: whatever your idea is, plus whatever you need to do to make it work, fine; but with the additional Transformer-XL things it becomes hard to estimate how much of the improvement comes from your idea and how much simply comes from adding these other components that have nothing to do with it. So I appreciate these kinds of analyses, called ablation studies.
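The two attention masks at the heart of the two-stream scheme can be sketched in plain NumPy (a simplified illustration of who may attend to whom for one permutation, not the paper's actual implementation): the content stream h at each position may attend to itself and everything earlier in the permutation order, while the query stream g may attend only to strictly earlier positions, so the token being predicted never sees its own embedding.

```python
import numpy as np

def two_stream_masks(order):
    """Boolean attention masks for a given factorization order.

    mask[i, j] == True means position i may attend to position j.
    Content stream (h): attends to positions at-or-before i in the order.
    Query   stream (g): attends to positions strictly before i in the order.
    """
    T = len(order)
    rank = np.empty(T, dtype=int)
    rank[list(order)] = np.arange(T)   # rank[pos] = step of pos in the order
    content_mask = rank[None, :] <= rank[:, None]
    query_mask = rank[None, :] < rank[:, None]
    return content_mask, query_mask

# Order 3-2-4-1 from the paper's figure (0-indexed: positions 2, 1, 3, 0):
content, query = two_stream_masks([2, 1, 3, 0])
print(query.astype(int))    # row i: what token i may see while being predicted
print(content.astype(int))  # row i: what token i may see as context for others
```

With this order, row 2 of the query mask is all zeros (x3 is first in the order and must be predicted from scratch), while row 0 sees the three other positions (x1 is last in the order).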

### [27:56](https://www.youtube.com/watch?v=H5vpBCLo74U&t=1676s) Ablation Studies

In the ablation studies they try taking away the memory and these other components and look at what that does to the model. You see performance degrade as you take things away, for example this column degrades, while still remaining more successful than BERT; over here it's less clear, but it also seems to degrade a bit while staying above BERT. I appreciate this kind of really trying to show that your gains come from your new idea and not from the other stuff.

The last thing I want to mention is this: someone calculated that it costs about two hundred and forty-five thousand dollars to train the XLNet model the way they describe it in the paper. I'm sure that will be brought down, just as the training cost was brought down for BERT, but this is crazy; this is just training it once. It raises large questions about the state of research and the ability of more academic players to participate. On the one hand, of course these companies should be able to do this; on the other hand, it seems like currently, in some fields, just putting more money on the table will get you a better result. Not this paper exactly, it's actually a cool idea, but it's still prohibitively expensive to even reproduce. All right, that was it for this paper; I hope you enjoyed this, and see ya.

---
*Source: https://ekstraktznaniy.ru/video/13955*