Longformer: The Long-Document Transformer

Yannic Kilcher · 20.04.2020


Video description
The Longformer extends the Transformer by introducing sliding window attention and sparse global attention. This allows for the processing of much longer documents than classic models like BERT.

Paper: https://arxiv.org/abs/2004.05150
Code: https://github.com/allenai/longformer

Abstract: Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task-motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.

Authors: Iz Beltagy, Matthew E. Peters, Arman Cohan

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

Table of contents (7 segments)

Introduction

Hi there! Today we're looking at Longformer: The Long-Document Transformer, by Iz Beltagy, Matthew E. Peters, and Arman Cohan of Allen AI. The Longformer is a variant of the transformer, and as you might have guessed, it's a transformer that can deal with long documents, so it's aptly named. I'm going to discuss what differentiates the Longformer from the standard transformer. If you don't know what a transformer is, watch my video on Attention Is All You Need; I would also suggest you watch the video on BERT, because a lot of the architecture and training here is based on BERT or variants of BERT. So I'll basically explain what makes the Longformer different, such that it can handle long documents.

Problem

So what is the problem with the original transformer? Say you have a transformer model and you're doing an NLP task, which is usually where transformers are used, and you have a paragraph like this one right here, the abstract of the paper, and you want to predict whether the paper gets accepted at a conference or not. Classic transformers have a very harsh limit on the number of tokens they can look at at the same time. So in a classic transformer you couldn't process this entire thing; you would divide it into chunks: here's my first chunk, from here to here, then my second, and so on. You go through the document, split it up into chunks, process each of the chunks individually, and then maybe aggregate the predictions. The drawback, of course, is that the model cannot make specific connections between, say, some word up here like "operation" and something down here like "language"; it cannot connect the two on a neural level, at least not in the classic transformer architectures. There are ways to try to alleviate this, but classically, if you split up your document, the individual chunks become independent samples, and the attention mechanism cannot operate across the boundaries of these chunks. The goal of the Longformer is to be able to put this entire document into the model at the same time. So let's look a bit closer at this.
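To make the chunking workaround concrete, here is a minimal sketch of splitting a long token sequence into independent fixed-size chunks and aggregating per-chunk predictions. The token lists, the chunk size, and the stand-in scoring function are all made up for illustration; nothing here is the actual model.

```python
# Sketch of the classic workaround: split a long document into
# fixed-size chunks, score each chunk independently, then aggregate.
# No attention can cross chunk boundaries here.

def chunk(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def classify_document(tokens, model_score, chunk_size=512):
    """Average per-chunk scores as a stand-in for 'aggregate the predictions'."""
    chunks = chunk(tokens, chunk_size)
    scores = [model_score(c) for c in chunks]
    return sum(scores) / len(scores)

# Toy usage: the "model" just counts two keywords. Note that "operation"
# and "language" land in different chunks, so no single forward pass
# would ever see both of them together.
tokens = ["the"] * 1000 + ["operation"] + ["the"] * 500 + ["language"]
score = classify_document(tokens, lambda c: c.count("operation") + c.count("language"))
```

The point of the toy example is exactly the drawback from above: information in one chunk can never influence the representation of another chunk; only the final scores get averaged.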

Transformer Model

classic transformer model, what you have is layers of what is called an attention mechanism. I'm going to draw six units here, and the units are actually the input sequence. In a transformer, unlike a classic neural network, you don't have a fixed number of units in the layers; you can input sequences as long as you want, until your memory limit is reached, basically. These units expose something called keys

Keys Queries

on the lower layer, and these are vectors that point somewhere. The upper layer produces what are called queries, and again, I invite you to look at the Attention Is All You Need video if you want more explanation. Basically, the keys and queries decide where information gets routed to, and the routing of information is what makes the transformer. So for example, this here is probably going to be routed to this here, and this here is going to be routed like this; the routing is according to the dot product of the keys and queries. In a transformer you usually transform into same-length sequences, which also has a lot to do with how you want to pretrain things, so we're not going to change that part. So if you have n input tokens and n tokens on the next layer, and everything can attend to everything, all the inner products are computed; everything is connected to everything. That means you end up with an O(n²) memory requirement, because you have n² connections.

The way to alleviate this is much like you would alleviate it in a classic neural network. Imagine you have an MLP, a multi-layer perceptron, usually known as a fully connected layer. Here I have the same picture, but it's not a transformer, it's a classic fully connected neural network: I have d units right here and d units in the first hidden layer, and a weight matrix in between. The weight matrix means everything is connected to everything, so again my memory requirement here is d². Now, how do we deal with this in a classic neural network? We go to what is called a convolutional neural network — at least, that's one of the methods.
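As a minimal sketch of why full self-attention is quadratic (plain Python with illustrative shapes, not a real implementation): the score matrix holds one query-key dot product for every pair of positions, so its size grows as n².

```python
def full_attention_scores(queries, keys):
    """Compute the n-by-n matrix of query-key dot products.
    queries, keys: lists of n vectors (lists of floats) of dimension d.
    The result has n * n entries -> O(n^2) memory per layer."""
    return [[sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in keys]
            for q in queries]

# Toy data: 6 positions, dimension 4 (arbitrary numbers).
n, d = 6, 4
queries = [[float(i + j) for j in range(d)] for i in range(n)]
keys = [[float(i - j) for j in range(d)] for i in range(n)]
scores = full_attention_scores(queries, keys)  # 6 rows of 6 scores each
```

Softmax and the value-weighted sum would follow in a real attention layer; the memory argument already shows up in the shape of `scores` alone.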

Convolutional Network

Let's draw this again, but now as a convolutional neural network. We have a convolutional kernel, in this case just of length three, so we have three units here, and they do the same fully connected pattern, but only over these three units. Then we slide the kernel over: now it's in this position, still the same three units, but these three things are now connected to the three things they're currently over. You keep sliding this across the lower layer until you're finally at the end. Now you've reduced the memory consumption from d² to d times the kernel size, usually called k, so d × k — and since k can be kept pretty much constant, that's O(d).

The same goes for the Longformer. In the Longformer, the idea is that you have so-called sliding window attention. It's exactly the same as in the convolution, except that you don't have hidden units here — these are actually parts of the input sequence — and instead of the weight matrix you have the attention mechanism over the keys, queries, and values. But the idea is similar, so you can basically say this is a sort of convolution, and we've already seen this a bit in the video about axial attention.

Of course, this is a trade-off of memory for performance. Before — let me draw it on top of this fully connected layer — all the units could attend to all the units. Now a unit can only attend to its immediate neighborhood: this green unit here can only attend to itself and its immediate neighbors in the lower layer, if the kernel size is 3. But consider what happens in the next layer. This unit right here, the same unit on the next layer, can attend to these two and itself in the lower layer, but those two can themselves attend to all of these, so the one on the right can attend to one more. In the first layer this particular unit had information from three units, but in the second layer the same unit has information from across five. This is a kind of cone of attention that gets bigger and bigger as you go through the layers. So you lose the ability to incorporate wide ranges of information in a single layer, but you regain it through depth: the deeper you go, the more information a single unit gets. This unit gets information from that unit over there through the layers; it can't watch it directly in this layer, but the information arrives through the layers. Of course there's still a trade-off: a fully connected layer could do this in one step, and in the next layer do it again, so it can do much more complex computation.

But if you believe that the most important information is actually in the neighborhood of the individual tokens, this is conceivable. In something like a convolutional neural network on an image, you usually have localized information: if there's a cat, the nose and the eyes of the cat are pretty close together, so in order to recognize that it's a cat you mostly want local, or more local, information. In an image that makes sense; in text it also makes sense to a degree, in that words close together in a sentence are usually important for each other. But the power of the transformer was initially that it could attend to everything in a sentence. For example, take the paragraph again: the power of the transformer, at least that was the claim, is that this piece of text here could make a connection to that one, and the understanding of the entire paragraph could rely on this connection being made — which a local model can't do, though if you go through depth you might be able to recover it.

So the Longformer does for transformers what the convolutional neural network does for MLPs. Instead of n by n giving you n², you go to O(n × w), with w being your window size. They have an illustration of this right here. For the original transformer, this is the attention matrix: you have your n units in sequence, and drawn in is which unit can attend to which other unit in a given layer. You see this particular unit i can of course attend to itself, to unit i, but it can also attend to this unit, or this one, or any unit — and that's what gives you the n² attention, because any unit can attend to any unit. Now, in the sliding window attention pattern — and this is one of the core components of the Longformer — you see that unit i can attend to itself, and also to this one and this one, but no more. It can only attend to units from i − w to i + w, where w is the window size of this sliding window. A given unit can only attend to itself or its neighbors in one layer, which is exactly what a convolution is: if you see this pattern, this is a convolution pattern.

The second core component expands on this idea: they create dilated sliding windows. You already know what a sliding window is. Now they're saying, well, with this sliding window it might take quite a number of layers to get attention incorporated across the entire sequence. We saw before it took about three layers to get halfway through a sequence of, what was it, six tokens — basically, each layer up, you gain one more context window in each direction — so it's not that you'd have to go very deep
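A minimal sketch of the sliding-window attention pattern (an illustration of the mask, not the authors' custom CUDA kernel): build the n-by-n boolean mask where position i may attend only to positions in [i − w, i + w].

```python
def sliding_window_mask(n, w):
    """Boolean n x n mask: mask[i][j] is True iff j lies within
    the window [i - w, i + w] around position i."""
    return [[abs(i - j) <= w for j in range(n)] for i in range(n)]

mask = sliding_window_mask(n=8, w=2)
# Each row allows at most 2*w + 1 positions (fewer at the edges),
# so storing only the allowed scores costs O(n * w) instead of O(n^2).
allowed_per_row = [sum(row) for row in mask]
```

In a real implementation you would never materialize the full n × n mask, of course; the point is that the band of `True` entries along the diagonal is exactly the convolution-like pattern described above.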

Dilated Window

in order to incorporate the information from these very long sequences — the dilated sliding window helps with this. Again, if we have this sequence, and this is the next layer, then this unit right here will be able to attend to this one and this one, but not this one and not this one: it skips one. These attention patterns always kind of skip one, and the idea is that now you have a vastly greater window of attention. Your window size is now way bigger, which means you can incorporate information — like global information — way faster across the layers.

Of course, now they're kind of arguing against themselves. When they introduced the sliding window, they said: we posit that mostly local information is important for NLP, the words right around a word are important. And now they're basically saying: oh well, it's not so important that we miss this word right here, which is right next to the word we're attending from — which is counter to what they just said, that the most important information is probably around the word. They do get around this by saying: if we have different layers in a transformer, in the lower layers we'll use the sliding window, fully local, and in the higher layers we'll use the dilated window. In the lower layers, they postulate, local information is what's needed to understand local features, and in the higher layers we want more global information, because it will incorporate features built from the local information of the lower layers. I can follow the argument, but I feel that's just something they've thrown in to make it work better after they tried it out. The last idea in the Longformer is what they call global attention, and this global attention
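A sketch of the dilated pattern (an assumption-level illustration, not the paper's exact kernel): with dilation d, position i attends to i, i ± d, i ± 2d, and so on, up to w hops, so the same number of attended positions covers a window roughly d times wider.

```python
def dilated_window_mask(n, w, dilation):
    """Boolean n x n mask: mask[i][j] is True iff j = i + k * dilation
    for some integer k in [-w, w] -- i.e. the window skips positions."""
    return [[abs(i - j) <= w * dilation and abs(i - j) % dilation == 0
             for j in range(n)] for i in range(n)]

# With dilation=2, a unit skips every other neighbor but reaches
# twice as far, for the same per-row cost as the undilated window.
mask = dilated_window_mask(n=9, w=2, dilation=2)
```

Setting `dilation=1` recovers the plain sliding window, which matches the idea of using local windows in the lower layers and dilated ones higher up.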

Global Attention

is ours. What it means is that there are some special units, and as you can see from the attention pattern, these special units can actually attend to everything: this unit can attend, for example, to this one, or to anything, and any unit can attend to those — the first unit right here, for instance. So these are your special tokens, your special units, and they have global attention. The reason for this is that sometimes it's needed, and it's an engineering choice. The example I can give is a question answering task. In question answering, you usually have a question and a paragraph, and let's say the task is to answer yes or no. The question might be a statement: say, "King James was King of England from 1120 to 1140," and the paragraph would be the Wikipedia entry for King James, and the question is: is the statement true or not? The way you feed this to a BERT model, to a transformer, is you concatenate these two things, the question (query/statement) and the paragraph — these are the tokens right here — and you separate them using a special token called the separator token. This just informs the model where the first thing stops and the next thing starts. And at the beginning you put a special token called the CLS token. What you usually do is send all of this through your transformer, and in the last layer — because, as we've seen, you always transform a sequence into a sequence — you end up with a sequence again. But you just want a single thing, yes or no. So you designate this particular unit, the one that corresponds to the CLS token, and say: that's what I'm going to throw into a logistic regression, and that's what gives me my yes-or-no answer. That's how you train it.
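As a sketch of that input layout (the token names follow the usual BERT convention; the toy token lists are made up for illustration):

```python
def build_qa_input(question_tokens, paragraph_tokens):
    """Concatenate question and paragraph with [CLS]/[SEP] markers,
    as in the standard BERT-style input format described above."""
    return ["[CLS]"] + question_tokens + ["[SEP]"] + paragraph_tokens

seq = build_qa_input(
    ["king", "james", "was", "king", "of", "england"],
    ["james", "...", "wikipedia", "entry", "..."],
)
# The yes/no classifier later reads only the output at position 0,
# the [CLS] slot -- which is exactly why that token gets global attention.
```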
So you don't want to single out any of the regular units as special after the fact; you simply include a special token at the beginning that you then take the classification from — pretty smart. But you also say: this is such a special token, I want it to be able to attend to anything. So even though, for example, this unit right here can only attend to its neighbors — it has this cone thing — this special unit right here can always attend to anything at each of the layers, and anything can attend to it. It can get information from anywhere routed to it in each of the layers, and it can send information to any of the other units. This is an engineering choice: at the beginning, you as the engineer have to say which of the tokens are special tokens, and for these tokens you actually do full attention — they can attend to and from anything.

So what are our new memory requirements? First of all, we have n tokens and window size w, so we have n × w memory. Then we add the global attention: each special token adds on the order of 2n, because it can attend to everything and everything can attend to it, so you get plus (number of special tokens) × 2n. And this entire sum is multiplied by the number of layers. These are your new attention memory requirements, and as you can see, this is O(n) — much smaller than the O(n²) we had for the original transformer. That's what the Longformer basically does.

They have written custom CUDA kernels for this dilated attention and so on, which is pretty cool, and they have code available for the model. They test this on a number of language tasks, and what I find interesting is that they actually start from the RoBERTa checkpoint — RoBERTa, where is it said... oh,
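To make the counting concrete, here is a small back-of-the-envelope sketch (my own arithmetic on the formula above, not profiled numbers) comparing attended pairs per layer for full attention versus the Longformer pattern:

```python
def full_attention_pairs(n):
    """Every position attends to every position: O(n^2) pairs per layer."""
    return n * n

def longformer_pairs(n, w, num_global):
    """Sliding window of half-width w plus global tokens that attend
    to and from all n positions: O(n * w + num_global * n) per layer."""
    window = n * (2 * w + 1)          # each token sees ~2w + 1 neighbors
    global_part = num_global * 2 * n  # to and from everything
    return window + global_part

# Illustrative numbers: a 4096-token document, window half-width 512,
# and 2 global tokens (e.g. [CLS] and a question marker).
n, w, g = 4096, 512, 2
saving = full_attention_pairs(n) / longformer_pairs(n, w, g)
```

Since w and the number of global tokens stay constant as n grows, the whole expression is linear in n, which is the claimed O(n) scaling.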
yeah — RoBERTa, this model right here, is a variant of BERT; you can see the name in there, and that's their baseline. They start from these checkpoints, and as far as I understand, they kind of copy over the position embeddings and so on, and therefore they only need to train relatively little beyond RoBERTa. The reason they can copy these over — and this I find very interesting — is that they use a window size of 512. Until I read this, I came away from the paper thinking that this window size might be fairly small, maybe 10, 20, 30 tokens or so. But the window size is actually 512 in their formulation, which basically means it's as much as one of the classic models could take as an entire document. In the classic model you simply split up the document, feed the chunks, and aggregate over them. So the Longformer — I said earlier it has lower memory requirements; it actually has the same memory requirements as a classic model — but it is also able, because of the sliding window and the global attention, to incorporate information from the surrounding chunks. That's the new part. Because if you think about it: if this w is 512, that was the original n — call it n₀, whatever the old models had as their n — then the n × w term is actually n × n₀, and if you plug in n = n₀ it regresses to the classic model. So the new part really is the sliding window, plus the fact that global attention can incorporate information via these special tokens, because the sliding window by itself had been done before. I just don't want you to get the wrong impression that we can now run transformers on very small-memory machines — we can't. But we can run them on machines with the same memory while feeding longer documents, and have information from the entire document propagated to these blocks, which before we couldn't: before, we could only feed the blocks one by one, with no global information. So that's the new thing. They haven't tested it with smaller windows, which is sensible from an engineering point of view: if you want to show that you're better, you want to be at least as powerful as the old model, and then more powerful on top — and that's what they do. If you want to check out the experiments, the ablations are very interesting, because they turn a lot of things in their model on and off and check where things come from, what helps and what doesn't. I'll leave that to you, and I'll link the paper. With that, thanks for listening and watching, and bye!
