DeBERTa: Decoding-enhanced BERT with Disentangled Attention (Machine Learning Paper Explained)
45:13


Yannic Kilcher · 25.02.2021 · 22,794 views · 659 likes


Video description
#deberta #bert #huggingface DeBERTa by Microsoft is the next iteration of BERT-style self-attention transformer models, surpassing RoBERTa and setting the state of the art in multiple NLP tasks. DeBERTa brings two key improvements: first, it treats content and position information separately in a new form of disentangled attention mechanism; second, it uses relative positional encodings throughout the base of the transformer and provides absolute positional encodings only at the very end. The resulting model is both more accurate on downstream tasks and needs fewer pretraining steps to reach good accuracy. Models are also available on Hugging Face and on GitHub.

OUTLINE:
0:00 - Intro & Overview
2:15 - Position Encodings in Transformer's Attention Mechanism
9:55 - Disentangling Content & Position Information in Attention
21:35 - Disentangled Query & Key construction in the Attention Formula
25:50 - Efficient Relative Position Encodings
28:40 - Enhanced Mask Decoder using Absolute Position Encodings
35:30 - My Criticism of EMD
38:05 - Experimental Results
40:30 - Scaling up to 1.5 Billion Parameters
44:20 - Conclusion & Comments

Paper: https://arxiv.org/abs/2006.03654
Code: https://github.com/microsoft/DeBERTa
Huggingface models: https://huggingface.co/models?search=deberta

Abstract: Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively.
Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8).
Authors: Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yannic-kilcher Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/ BiliBili: https://space.bilibili.com/1824646584 If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannickilcher Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Table of contents (10 segments)

Intro & Overview

Hi there. Today we'll look at DeBERTa: Decoding-enhanced BERT with Disentangled Attention, by Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen of Microsoft. This paper is an improvement on BERT, the language model, and specifically on its RoBERTa variant. It suggests two improvements. The first is disentangled attention, where they disentangle the positional information and the content information of the individual tokens in the attention mechanism. The second improvement kind of results from the first: the enhanced mask decoder. Because they only use relative positional information in the transformer part of the model, they have to re-feed the absolute positional information at the end, which gives them another bit of improvement. Altogether, with this they reach state of the art in various NLP tasks, and the model, DeBERTa, is now available in Hugging Face for you to download for all of your NLP needs. So we're going to go through the paper, look at the two improvements and what they give, and see if that's relevant. As always, if you like content like this, don't hesitate to share it out to all of your friends, and leave a like and a comment; I still read all the comments, so give me your opinion. Please also give me your opinions on the new recording setup: there should be a title somewhere here, a picture somewhere here. I absolutely want to hear feedback, because I have no idea what I'm doing. All right, let's dive in. DeBERTa, or "de-BERT-a", I don't know; I think it's "D-BERT-a", because the "De" comes from "decoding-enhanced". DeBERTa is a new model architecture. They say: here we propose DeBERTa, decoding-enhanced BERT with disentangled attention, that improves the BERT and RoBERTa models using two novel techniques. The first is the

Position Encodings in Transformer's Attention Mechanism

disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among the words are computed using disentangled matrices on their contents and relative positions, respectively. Okay, we'll look at that first. What they mean is this: when you have a multi-head attention layer, what we want to do is transform one sequence of token representations into the next sequence of token representations. Let's say these are our tokens, and this could be a sentence like "I am hungry", plus this CLS classification token that we always add when we train BERT. Every one of these tokens is represented by a vector; some of the vectors are thicker than others, but that's just because this one hasn't eaten enough. What a multi-head attention layer does is simply transform this, by means of the attention mechanism, into a series of vectors again: we put in one series of vectors and we end up with another. If you want to know what multi-head attention does in detail, please go look at my video on "Attention Is All You Need", where that's explained specifically. It is an attention mechanism, sort of an information routing algorithm, that decides how information needs to be routed from tokens to tokens using queries, keys, values, and so on. If you haven't seen the video, it's a beautiful mechanism, but I'm not going to explain it again right here, I'm sorry.

All right. So what you usually do is transform vectors into vectors, and because of how the multi-head attention mechanism works, the mechanism has no way to discern where in a sentence a given token is. It cannot differentiate between the sentence "I am hungry" and the sentence "am I hungry". For plain multi-head attention that's just not possible, because it treats the incoming sentence like a bag of words. That is not the case in, for example, a recurrent neural network: a recurrent neural network goes one by one over these word representations, so it has a mechanism to see what a sequence is. Multi-head attention doesn't. So what people usually do is augment these representations with position encodings. That happens at the beginning. You might ask where these vectors come from; later on they of course come from the last layer, but the very first vectors you put in come from a table, and these are your classic word vectors. So at some point you have a big table, and the big table has your entire vocabulary in it, every word in the language that you consider: there's "I", and there's "am", and there's "you", and there's "apple", and there's "hungry", and there's even the CLS token; all of them have a table entry and a vector associated with them. These vectors are trainable, so the neural network can decide itself what goes into them, but every word has a fixed vector in there. In the very first layer, because you don't have a last layer to draw from, you simply look at what token it is, go to the table, retrieve this vector, and put it here. That's your start, and then you transform up the layers, every time from the last layer, but at the beginning you have embeddings.

Now, the same thing you do for positions. So you usually also have a second table; in the original transformer paper, by the way, these were fixed vectors, but nowadays I think most of them are also trained. You label the positions: that's position one, that's position two, three, and four, and maybe you also have five and six. There is a maximum length, but right now we consider sentences of length three with a CLS token appended, so these are length four. Every position also has a vector (I'm going to draw these vectors in this color), irrespective of what word is there. Right now we have vectors for words irrespective of where they are, and we have vectors for positions irrespective of what words are there. And you do the same: you look at what position is here, you go to the table, you retrieve that embedding, and you somehow also put it here. Now I've made a bit of a mess here with this thing, sorry. So now you have two vectors per word all of a sudden: one that represents the position, and one that represents the word itself. The neural network needs both in order to understand the sentence. If every word has these two vectors at the beginning, the network can understand: aha, this is the word "I", and it's at the beginning of the sentence, so it's probably the subject of the sentence. However, if the word "am" were at the beginning, it's probably a question, because it starts with a verb, like "am I hungry". It can also evaluate the relative distances of things to each other, and so on. So given this information, the neural network has all the tools it needs to understand the sentence as a sequence.

Now, you have basically two ways of combining the two things. First of all, you can concatenate them: you put one on top of the other (I'm not too skilled yet with this new setup), imagine these are the same length, and you just concatenate the vectors, so now the vector is longer. Of course, that also increases your dimensionality, computational issues, and so on. So what a lot of people do instead is simply line them up, if they're the same size, and add them together element-wise. In the worst case, since both of these are trained, the neural network can absolutely decide to learn a bunch of zeros in the top part of one and in the bottom part of the other, so essentially it recovers a concatenation. That's
the worst case; in the best case, the neural network can actually do some kind of information combining already in this addition step down here. So you give both encodings to the neural network as a single vector. So what goes into the
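The lookup-and-add scheme described above can be sketched in a few lines. This is a minimal illustration, not BERT's actual embedding code: the vocabulary, dimensions, and random tables are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and sizes; real models use ~30k tokens and d_model >= 768.
vocab = {"[CLS]": 0, "i": 1, "am": 2, "hungry": 3}
d_model = 8
max_len = 16

# Two trainable tables (random stand-ins here):
# one vector per token id, one vector per absolute position.
word_emb = rng.normal(size=(len(vocab), d_model))
pos_emb = rng.normal(size=(max_len, d_model))

tokens = ["[CLS]", "i", "am", "hungry"]
ids = [vocab[t] for t in tokens]

# BERT-style input: element-wise sum of the word vector and the
# position vector for each slot in the sequence.
x = word_emb[ids] + pos_emb[: len(ids)]
print(x.shape)  # one combined vector per token
```

Because both tables are trained, the sum can in principle recover a concatenation (zeros in disjoint coordinates) or do something smarter, which is exactly the point made above.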

Disentangling Content & Position Information in Attention

multi-head attention mechanism is a single vector. This paper says that is not ideal, because the positions are too much mixed with the signal of the content of the words; we'd rather have a disentangled representation, such that the network can reason about the words in one line and about the positions of the words in another line. So their goal is to disentangle these two vectors and basically design a new attention mechanism that always treats the content and the position as separate things. Of course they can't stay fully separate, but they can be disentangled through the layers. Their new algorithm changes the way the attention matrix is obtained. How do you usually obtain the attention matrix? You have your input x, which is your sequence, and you produce two values from it, Q and K. These are matrices: if x is a sequence, then every single sequence element emits one key, which is a vector, and every single one also emits one query. The key is supposed to say what information this token is about, and the query is supposed to say what information it requests from other tokens. Then you route the information wherever the inner products line up; for example, this thing would probably be routed here, and so on. It's not a hard routing, it's a soft routing. So by transforming x by linear transformations into keys and queries, you obtain your attention matrix by multiplying queries and keys together, such that you have the inner product between each pair of these vectors. This is quadratic, and it is the big bottleneck in transformers, but you take the inner product between each pair and you get a giant matrix, and the giant matrix basically says how much token 2 attends to token 3: that's entry (2, 3) of the matrix, and that element is the inner product of the query of token 2 with the key of token 3. That's how you get the attention matrix.

In regular BERT, these vectors are always everything at the same time: you feed content and position in somewhere down at the bottom, add them together, and the network is supposed to figure out itself how to use these two pieces of information. This paper says: no, wait, we can do better. For them, each sequence element does not produce just one key and one query; they think each should be made up of two vectors. So each of these things has two different components: one is the h component, which is the content information, and one is the p component, which is the positional information. So how should token i attend to token j? They say, well, it's going to be the same thing: an inner product between the query of token i and the key of token j. However, now the queries and keys are made up of two different parts, one content part and one position part, and the position, as you can see, is j conditioned on i: the position is a relative position. So if you have your sequence, each token emits one vector that is the content of the token, like before, and then another vector comes in from the position, the same as we did at the beginning, but now this positional information comes in at each layer, irrespective of what word is in that position. The position gets an encoding right here, and the interesting thing is, we don't add the two together; we actually treat them separately. So the keys are two vectors, and the queries are also two vectors: there's a content query,
and the query for the position is also a vector, and that one depends only on the position and not on the incoming signal. So how do we route information now? We have four different routings. First, we only consider dark blue with dark blue; this is kind of the classic attention: this and this match really well, so that goes here, that one probably doesn't go there, and so on. This is what they call content-to-content routing. But we also have content-to-position, position-to-content, and position-to-position routing. In content-to-position (and there's a 50 percent chance I'm going to mix this up, and I'm sure I will), what we do is look at the content vector of the query, which is produced from the token, and attend to the position vector of the key, so we attend to the light blue things. The content-to-content part is like the classic attention part: I am the word "am", I'm requesting information from all the nouns in the sentence, because I'm a verb and I would like to know who the nouns in the sentence are. The content-to-position part is: I am the verb "am", and I would like to know what is around me. The positions are relative positions, so I can request the vector for the plus-one position relative to me, or the plus-two; the word can attend to its surroundings. Given that it's the word "am", it might be particularly interested in what's before it, maybe because it has already figured out from the previous layers that this is not a question. Although, you know, for "am" the word before is almost always going to be "I", so maybe it's exactly a counterexample where it wouldn't want information from there. But it can say: I want to attend to things after
myself, because I've already figured out what must be before me; I want to attend to things after me, like one position after me, what's right after me, what's two words after me, and so on. Position-to-content is exactly the opposite. The token says: I am in position plus four relative to you; what kind of information do I want to send to things that are four away from me, irrespective of what my own content is? Here we simply consider what position the token is in with respect to its neighbors, and what kind of information it wants to aggregate from each of the words. It's a bit weird: it says, I am a word that is two words after you; what kind of information do I want to get from you? Since it's attending to content, that can depend on what word is there, but not on its position. And position-to-position is simply: what kind of information do I, in position three, want to send to something in position seven? That would be useful with absolute positions, but this is relative position encoding, which means I am always kind of in the middle, so this isn't really helpful, and they decide to leave it away. So we end up with three different attention mechanisms, so to say, corresponding to three out of the four ways we can combine the dark blue and the light blue keys and queries. And you can see right here, that's what they do: their final attention matrix is simply the addition of all of these together. We construct one attention matrix from the classic content-to-content attention, one from content-to-position, one from position-to-content, and we would construct one from position-to-position, but we leave that away, because we deal with relative positions, so it would be the same for every token, and that's not particularly helpful. I'm going to repeat it again: the h information contains actual signal from the last layer, while the p has no
idea about the signal; it simply contains information about the position of the tokens. So you can decide to send information to a word that's two positions ahead of you, or to request information from a word that's three positions behind you, depending on what word you yourself are. That's the content-to-position and position-to-content attention. These things are all added together, and that makes up the final attention matrix. So a final entry in the attention matrix could be influenced by multiple of them. It could say: I am the word "am", I'm in position two, I request a lot of information from other nouns; if any noun is here, I want information, but I also want information from things that are one or two positions ahead of me. Since I'm the word "am", and also since I'm in position number two, I am very interested to know what the subject of the sentence is. Now we have
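The three surviving routings above can be sketched as code. This is a toy rendition of the disentangled attention score (content-to-content + content-to-position + position-to-content, scaled by sqrt(3d) as in the paper), with random stand-in weights and tiny dimensions; it is not the paper's optimized implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 8          # sequence length, head dimension (toy sizes)
k = 2                # maximum relative distance

H = rng.normal(size=(n, d))            # content states from the previous layer
P = rng.normal(size=(2 * k + 1, d))    # shared relative-position embeddings

# Separate learned projections for content and for position
# (random placeholders here).
Wq_c, Wk_c = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wq_p, Wk_p = rng.normal(size=(d, d)), rng.normal(size=(d, d))

Qc, Kc = H @ Wq_c, H @ Wk_c            # content queries / keys
Qp, Kp = P @ Wq_p, P @ Wk_p            # position queries / keys

def delta(i, j):
    """Clamped relative distance i-j, shifted into [0, 2k] to index P."""
    return int(np.clip(i - j, -k, k)) + k

A = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        c2c = Qc[i] @ Kc[j]            # content-to-content
        c2p = Qc[i] @ Kp[delta(i, j)]  # content-to-position
        p2c = Kc[j] @ Qp[delta(j, i)]  # position-to-content (note delta(j, i))
        A[i, j] = (c2c + c2p + p2c) / np.sqrt(3 * d)

# Row-wise softmax turns the summed scores into attention weights.
weights = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)
print(weights.shape)
```

Note that the position-to-position term is simply absent, matching the argument above that it carries no useful signal under relative encodings.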

Disentangled Query & Key construction in the Attention Formula

all of it. All right, and the rest is just like classic attention. These queries and keys are obtained by linear transformations: you see, this is the incoming signal, you send it through one linear transformation to obtain the queries, and through another linear transformation to obtain the keys. The h is the same, but these matrices here are learned weights that produce queries and keys. Then you multiply them together, which defines your attention matrix; you run that through a softmax to make a distribution out of each row, and then you multiply it together with the values. So this part here is kind of like the routing table, and the values are the information to be routed; the values are obtained from the input signal as well. As we said, we're going to amend that: this over here is the classic queries, keys, and values, and then we augment it with two new ones, the queries and the keys for the position. You can see that the difference here is that again it's learned weights, but now there is this P thing, and P is the positional encodings, which come exactly out of that table we saw up here. It's important to see that this here is h and these are the p values, but this is only h0: h is actually transformed to h1 by the first transformer layer, to h2 by the second layer, and so on, while P always stays the same. You feed P into this layer, and you feed it again into the next layer; it's only positional information, not content information. By feeding the position in each time, and doing this in this disentangled way, the model can sort of keep the content and position information separate. I actually think it doesn't really keep the information separate, because after layer one you certainly have
position information in your h. You can see that from this path here: by feeding position information into the transformer layer, h1 is already going to be a conglomerate of h0, which is pure content, plus the position. This "plus" is not a real addition, but somehow the information is intermingled there. And if we weren't to feed in these things right here, it would just be like classic BERT, which is what they criticize. Now, continuously feeding the positional information is one advantage. You could actually do that with BERT too: just add the position information each time. I'm not sure if that would work super well, but you could do it; it just gives the model a bit more side information to work with. And then keeping it separate... as I said, I'm not sure it's actually separate. It's just that you keep feeding in position information layer after layer, thereby giving the model more information every time it makes a transformation, because otherwise it would have to carry the position information through all the layers just from the very first layer. So in this mechanism, it's true that the position encoding is kept separate, because it comes in fresh every layer, but I don't see that the content doesn't have position information in it from the last layer; it certainly does. I hope you can see that.
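The point about P re-entering every layer while h is transformed can be sketched schematically. The "layer" below is a deliberate stand-in (a single nonlinear transform, not real disentangled attention); the only thing it illustrates is that the same position table P is consumed fresh at every depth while H keeps changing.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, n_layers = 4, 8, 3
k = 2

H = rng.normal(size=(n, d))            # h0: pure content embeddings
P = rng.normal(size=(2 * k + 1, d))    # relative-position table, shared

def toy_layer(H, P, W):
    # Stand-in for a full disentangled-attention layer: the point is
    # only that it consumes both the current H and the *same* P.
    return np.tanh(H @ W + P.mean(axis=0))

Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_layers)]
for W in Ws:
    H = toy_layer(H, P, W)             # P re-enters each layer unchanged
print(H.shape)
```

After the first layer, H already mixes content and position, which is exactly the caveat raised above: the side input is disentangled, but the content stream is not.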

Efficient Relative Position Encodings

As I said, they do relative position encoding. What does that mean? It means the position encoding depends on where you look from. So what I drew at the beginning isn't entirely correct; you have to look at each token individually. For this middle token, for example, the positions look like negative 2, negative 1, 0, 1, 2, and you'd have a table not with absolute positions but with negative 2, negative 1, 0, plus 1, plus 2, and so on, and you would retrieve those vectors. When you consider the next token, this one right here, it looks different: now this would be zero, this minus one, minus two, and so on. They do two things. First of all, they truncate at some point. They simply say: our context window is two, so instead of going to negative three here, we keep it at negative 2, and everything beyond negative 2 also gets the vector for negative 2. So that vector is going to be plugged in here and here for this token, and for the previous token it is only going to be plugged in here and nowhere else. There are ways to efficiently implement this, and that's this algorithm right here. I don't want to go too much into it, but just so you're aware: you don't have to consider each token individually during attention, which would be prohibitively expensive. You can do one big matrix multiply and then pick and choose from the matrix that results, especially with this truncation. That's the algorithm they call the efficient implementation. All right, so that is this position-enhanced, or disentangled, information. Why is it disentangled again? Because in every layer they have a side input; this piece right here is the side input that they feed on top of this information, and they specifically construct the attention matrix out of the three things. It's almost like two
contributions. The first contribution is: hey, let's feed in position information in each layer, and I think that has been tried before; that's pretty simple. The second is that we don't simply add the two vectors when we input them into the attention; instead we construct basically three attention matrices and then add those together once we determine the inner products between each of them. Okay, so this is one of the improvements,
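The truncation just described amounts to clamping the relative distance before indexing the shared table. A small sketch (the function name and window size are illustrative, not from the paper's code):

```python
def relative_bucket(i, j, k=2):
    """Clamped relative distance between positions i and j, shifted into
    [0, 2k] so it can index a table of 2k+1 shared position vectors.
    Everything farther than +-k reuses the edge vector."""
    return max(-k, min(k, i - j)) + k

# Index into the position table for every (query, key) pair of a
# length-6 sequence with context window k=2.
n, k = 6, 2
table = [[relative_bucket(i, j, k) for j in range(n)] for i in range(n)]
for row in table:
    print(row)
```

Each row shifts by one relative to the previous row, and the edges saturate: that regularity is what lets the real implementation compute everything with one big matrix multiply plus a gather, instead of per-token lookups.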

Enhanced Mask Decoder using Absolute Position Encodings

and that already helps a lot. But then they run into a problem, and this is not necessarily a problem with their method; this is a problem in general when you use relative position encodings. They say: given a sentence "a new store opened beside a new mall", the words "store" and "mall" are masked. So let's say you do masked language model pre-training: you mask out the words "store" and "mall" and ask the model to reconstruct them. Using only the local context, i.e. relative positions and surrounding words, is insufficient for the model to distinguish "store" and "mall" in this sentence, since both follow the word "new" with the same relative position. From the word "new", relatively, it's always plus one to each of these words, so the model cannot distinguish the two. So there is a need for absolute position encodings, because with absolute positions you could maybe make sense of it: you could figure out that a store is probably a smaller thing and a mall a bigger thing, so it's more likely that the store opened beside the new mall than that the mall opened beside the new store. That means we need absolute position encodings, or something like them. And especially: we could have relative position encodings, but if this is a very long sentence and we truncate them somewhere, these two words are not in range of one another, they're not going to know how far apart they are, and each by itself is just plus one from "new". So how do we solve the problem? We feed in absolute position encodings. However, that's exactly what they criticize. They say relative position encodings are much better than absolute for learning, and that's the same reasoning why a convolution is better than a fully connected layer: you slide the transformation over, and everything is simply relative to everything else. Relative positioning makes a lot of sense when every word can do computation
not based on where exactly it is in the sentence but on how it stands in relation to other words. Otherwise, with absolute position encodings, what you would have to learn is: if I'm the word "am" and I'm in position two, I need to learn to attend to position three; however, if I'm the word "am" in position three, I need to learn to attend to position four; and if I'm in position four, to position five. These are all different things you need to learn. With relative encodings, you can simply say: I want to attend to the word that's right after me. Easy. But we do need absolute position encodings for some things, namely to disambiguate cases like this. So they feed in absolute position information, but instead of doing it at the beginning, they do it at the end. At the beginning we have the word vectors; they go in here. Then we have relative position information at every single layer of the transformer, fed in again and again; the same P vectors each time. They do have different transformations in each layer, so the actual transformations that make the keys and queries of the position information are different, but the vectors are the same every time. And these P's are relative: sorry, yes, I mixed that up earlier; this is the negative 2, negative 1, 0, 1, 2 table for the middle token. Then at the very end, we feed in absolute position encodings: here we have, let's start at one, let's be good MATLAB people, one, two, three, four, five, which we're now going to combine with the vectors that come out of the transformer stack. Their reasoning: they say there are two methods of incorporating absolute positions. The BERT model incorporates absolute positions in the input layer; in DeBERTa, we incorporate them right after all the transformer layers but before the softmax layer for masked token prediction, as
shown in figure two. I've looked at figure two; it's not really helpful, honestly. It's the figure in the appendix where they say: in BERT, you have the absolute position encodings somewhere down here, they go through all the transformer layers, and then you have this classification layer at the top that does the language model decoding. In their model, you have all the transformer layers down here, and then the absolute position encodings come in through the side, and only the last transformer layer, or the last n layers (I think n in their case is one or two), have access to the absolute positions; before that, it's just relative positions at each step. They reason that this helps because the transformer part learns to deal with relative positions. In this way, they say, DeBERTa captures the relative positions in all the transformer layers and only uses the absolute positions as complementary information when decoding the masked words; thus we call DeBERTa's decoding component an enhanced mask decoder. They compare the two and observe that EMD works much better: feeding absolute positions at the end works better than feeding them at the beginning. "We conjecture that the early incorporation of absolute positions used by BERT might undesirably hamper the model from learning sufficient information of relative positions. In addition, EMD also enables us to introduce other useful information in addition to positions... we leave it for future work." So they say you could also feed in other information; I guess that's the case in every single neural network ever.
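The EMD placement can be sketched schematically. This is a deliberately simplified picture: the layers are stand-ins, and the injection is shown as a plain addition, whereas the paper's enhanced mask decoder actually feeds the absolute positions through attention in the last layer(s). Only the placement (late, not early) is the point.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, n_layers = 4, 8, 6
n_emd = 2  # number of final layers that see absolute positions (small n, per the paper)

H = rng.normal(size=(n, d))                # content states (relative positions
                                           # assumed handled inside each layer)
abs_pos = rng.normal(size=(n, d))          # absolute position embeddings
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_layers)]

for layer, W in enumerate(Ws):
    if layer == n_layers - n_emd:
        # Absolute positions enter only here, just before the
        # masked-LM prediction head, instead of at the input layer.
        H = H + abs_pos
    H = np.tanh(H @ W)                     # stand-in for a transformer layer
print(H.shape)
```

BERT's scheme would correspond to moving the `H = H + abs_pos` line to before the loop; the ablation discussed above is exactly the comparison between these two placements.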

My Criticism of EMD

Yeah, but the point is they feed in the absolute positions at the end, and that conjecture is theirs. I'm not a fan of this. If we only feed the absolute positions in at the end, the model has the same information as if we fed them in at the beginning, but we limit what it can do with that information to a single layer or so of transformation, basically a little linear map at the end. If we feed them in at the beginning instead, the model can use them in any way it wants. And "it makes the number better" is just not a good enough reason for me. Regularization has its place, bottleneck layers have their place, restricting capacity and so on, but I'm not a fan of hampering the model in this way. There is no real reason why the same information should become worse when you give the model more steps to compute with it. If you feed it in at the beginning and train the model correctly, it should learn to use that information at least as well as when you feed it in at the end. To me this says we haven't really figured out how to train these models correctly with regard to positional encodings yet. And again, if you simply say "we only feed it in at the end," the immediate question is: how many layers from the end? When is it too much? It just doesn't make sense to me to give the model information but not let it do its best with that information, unless you have a specific reason why, and "it's better" is not specific enough for me here. Not a criticism of the result itself: obviously it is better, and all of these arguments can be invalidated by "but it's better." That's deep learning. So, all respect to them for trying it out and actually finding that it works. Pretty cool. They also do scale-

Experimental Results

invariant fine-tuning (SiFT). Fine-tuning is where you take the model trained with masked language modeling and adapt it to downstream NLP tasks, and they have a bunch of tricks there, like virtual adversarial training and normalizing the embeddings before perturbing them, which apparently helps a lot. But they say they leave the comprehensive study of this for future work; for now they just want the good number, which is understandable, because that's how you get published.

All right, we can skip most of the tables: they are better. Interestingly, they are better at language modeling too, so the model can do BERT-style denoising, but also autoregressive language modeling, which is pretty cool. Then they do an ablation study of the different components, where they remove the enhanced mask decoder, the content-to-position attention mechanism, and the position-to-content attention mechanism, one at a time. In the table it's sort of a wash, depending on the task and how you look at it, but removing each of the components costs you something, so it's not that one component alone gives you all the boost; the combination is clearly the best. And it's really cool when papers do these kinds of ablations, rather than just throwing a bunch of stuff at you and leaving it to you to figure out which parts matter.

They also compare to RoBERTa in terms of downstream accuracy as a function of pre-training: how much pre-training do you need before fine-tuning? As the graphs show, DeBERTa outperforms RoBERTa, so you potentially need fewer pre-training steps to reach the same accuracy on a fine-tuning task, which is cool. It also means that for the same amount of training time, you reach a higher accuracy.
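The three attention terms that the ablation toggles can be sketched in a much-simplified form (one position vector per token, ignoring the relative-distance indexing δ(i, j) and the transpose details of the position-to-content term in the paper; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 4, 6

C = rng.standard_normal((seq_len, d))        # content vectors
P = rng.standard_normal((seq_len, d))        # position vectors (simplified)

Wq_c, Wk_c = rng.standard_normal((2, d, d))  # content query/key projections
Wq_r, Wk_r = rng.standard_normal((2, d, d))  # position query/key projections

def attention_scores(use_c2p=True, use_p2c=True):
    s = (C @ Wq_c) @ (C @ Wk_c).T            # content-to-content
    if use_c2p:
        s = s + (C @ Wq_c) @ (P @ Wk_r).T    # content-to-position
    if use_p2c:
        s = s + (P @ Wq_r) @ (C @ Wk_c).T    # position-to-content
    return s

full = attention_scores()
ablated = attention_scores(use_c2p=False)    # drop one term, as in the ablation
print(full.shape)  # (4, 4)
```

The ablation in the paper amounts to training and evaluating with one of these flags switched off (or with EMD removed) and comparing downstream scores.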

Scaling up to 1.5 Billion Parameters

And now their big thing: they scale it up, with a bunch of tricks, and I just want to highlight one. "We optimize the model architecture as well. First, we share the projection matrices of relative position embeddings" with the corresponding content projections. So recall: the content and the positions each give rise to queries and keys by means of learned weight matrices. There is the matrix that generates the queries from the content, the one that generates the keys from the content, the one that generates the queries from the positions, and the one that generates the keys from the positions. Now they share the position matrices with the content matrices, so there is one shared query projection W_q and one shared key projection W_k. (My battery is almost dead, so I have to speed up.)

In my mind, here is what that results in. Write the content-to-content part of the score as C W_q W_k^T C^T. Let's also include the position-to-position term that they leave away, because it makes the algebra easiest: P W_q W_k^T P^T. With shared matrices, the sum of all four terms, content-to-content, content-to-position, position-to-content, and position-to-position, is simply (C + P) W_q W_k^T (C + P)^T: the positions and the content get added and then multiplied by the same two matrices. And that is just the old-school attention mechanism, where you add the position encodings to the content. Now, the cross terms are still there and maybe they matter, but this gets closer and closer to the old mechanism, where you simply add the encodings and don't treat them in a disentangled way. If you share the matrices of the disentangled representations, it looks a lot like simply feeding the positions into each layer of a traditional transformer. So I'm not sure how important the disentanglement itself really is, versus the fact that the positional information is made available at every step. I might be wrong here about the cross terms; I haven't looked at it in full detail. So yeah, that's the paper.
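That collapse can be checked numerically: with shared projections, and with the position-to-position term included, the four disentangled score terms sum exactly to standard attention over the summed embeddings (toy vectors, one position vector per token, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 6
C = rng.standard_normal((n, d))   # content vectors
P = rng.standard_normal((n, d))   # position vectors (simplified)

Wq = rng.standard_normal((d, d))  # shared query projection
Wk = rng.standard_normal((d, d))  # shared key projection

# the four disentangled score terms, all using the shared projections
c2c = (C @ Wq) @ (C @ Wk).T
c2p = (C @ Wq) @ (P @ Wk).T
p2c = (P @ Wq) @ (C @ Wk).T
p2p = (P @ Wq) @ (P @ Wk).T       # the term the paper leaves away

# standard attention over content + position, as in the original Transformer
H = C + P
standard = (H @ Wq) @ (H @ Wk).T

assert np.allclose(c2c + c2p + p2c + p2p, standard)
print("shared projections reduce to attention over C + P")
```

Dropping `p2p`, as the paper does, leaves `standard - p2p`, so the reduction is not exact; that residual is precisely the cross-term caveat hedged about above.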

Conclusion & Comments

They also show a depiction of attention matrices down here, illustrating that their model does something different from other models in terms of where it attends. It has fewer of those global attention patterns that RoBERTa shows, except for the very first column, which is the CLS token, which makes sense, and otherwise a rather diagonal attention matrix. That seems pretty sensible, though you could also make the case that sometimes there are just really important words in a sentence that everything should attend to. I don't know. But it is state of the art, it is a cool algorithm, and it's worth considering if you build your next model. All right, with that, I thank you for listening. Subscribe if you haven't. I'll see you next time. Bye.
