How much memory does Longformer use?
Duration: 9:18

Yannic Kilcher · 25.04.2020 · 4,799 views · 185 likes

Video description
A calculation of the memory requirements of the Longformer. Original video: https://youtu.be/_8KNb5iqblE Paper: https://arxiv.org/abs/2004.05150 Links: YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher BitChute: https://www.bitchute.com/channel/yannic-kilcher Minds: https://www.minds.com/ykilcher

Table of contents (2 segments)

Segment 1 (00:00 - 05:00)

So I wanted to come back to this paper about the Longformer. I have done a video on it; if you haven't seen that, this video is probably not going to make much sense to you. In that video I go over what the Longformer is, what it does, how it compares, and so on. The gist of the Longformer is that it can run a transformer model over a long document. I've gotten a lot of questions like: does that mean we can now handle much longer documents? The BERT model doesn't fit into my memory, can this solve my problem? So I want to go into the math of the Longformer's memory requirements here, because I've alluded to it before, and I think the graphics in the paper are a bit misleading compared to how they actually implement it.

I've already gone over something like this in the last video. RoBERTa, their baseline, has a size, let's call it n₀, of 512, so it can take 512 tokens at a time. If you have a sequence that is way longer than 512, you need to chunk it up into pieces of 512, usually with some overlap between the pieces. The promise of the Longformer, as presented in the paper, is that you can put the whole sequence into the model, and it will do this sliding-window attention: it slides a window across the input sequence and computes only local attention within that window, plus some global attention that it maintains at a few fixed positions.

Now, what I find interesting is that in their experiments the Longformer window size is 512, so within that window you still have the classic n² full attention. So let's work out how much memory the Longformer really uses. We've already calculated it a bit, but I want to take it apart further. As you can see, the middle band of the attention matrix costs n × w.
So this middle band is n × w. Then you add the global attention: if you have s locations of global attention (say one, two, three, four of them), each one adds a full row and a full column of the attention matrix, so that costs 2 × s × n, the factor of two because the global tokens attend in both directions. In total: n × w + 2 × s × n. As we saw above, the window size in their experiments was n₀, so substitute w = n₀ and factor out n, and you get n × (n₀ + 2s). You can already see the issue: RoBERTa originally needed n₀², so if n is larger than n₀, this already uses more memory. The trick, and it's not really a trick, it is true, is that this is O(n) if n is your input sequence length, whereas full attention over n tokens would technically be O(n²). But keep in mind what plays the role of n in each model.
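The counting above can be sanity-checked with a few lines of Python. This is my own sketch, not code from the paper or the video; the function names and the 4096-token example length are assumptions for illustration:

```python
def full_attention_cells(n):
    # Classic full self-attention: each of n tokens attends to every
    # token, so the attention matrix has n * n cells.
    return n * n

def longformer_attention_cells(n, w, s):
    # Sliding-window band of width w over n tokens (n * w cells),
    # plus s global-attention tokens, each adding a full row and a
    # full column of the matrix (2 * s * n cells).
    return n * w + 2 * s * n

n0 = 512  # RoBERTa's sequence length, also the Longformer window size
print(full_attention_cells(n0))                   # 262144 cells for RoBERTa
print(longformer_attention_cells(4096, n0, s=0))  # 2097152 cells: far more
```

With w = n₀ and n > n₀, the Longformer count n × n₀ exceeds RoBERTa's n₀² exactly as argued above, even before any global attention is added.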

Segment 2 (05:00 - 09:00)

The sequence length in RoBERTa was the window size of the Longformer. So RoBERTa costs n₀², while here you technically have to say the cost is n × n₀, and if n is larger than n₀, this uses more memory. So in their experiments they use a model that on paper uses more memory than the baseline model. Saying that it scales linearly with sequence length is true, of course: they can now input these long sequences, and the memory requirement scales linearly with the sequence length, and also linearly with the window size. But the window size apparently still needs to be fairly large in order to achieve the performance. So the fact that the performance is equal or better is not really a surprise, because the model uses more memory. It's not that this model uses less memory but outperforms the old one; it uses more.

If you want to compare fairly, you have to ask: I have RoBERTa, and right now I can afford n₀² memory, where n₀ is the sequence length I can put into RoBERTa. Then ask yourself what kind of sequence you want to put in. Say you want a sequence twice as long, so n = 2 × n₀. Put that into the formula, and you realize that your Longformer window size can then only be half of n₀. With the same amount of memory, you can double your sequence length at the cost of halving your window size. And that doesn't yet include the cost of the global attention: any global attention you add comes out of the window-size budget on top of that. You can see it like this: you decide how long you want your input sequence to be, which fixes the rectangle of the attention band; then you decide how many global attentions you want, and for each one you have to cross out rows of that band.
Specifically, you have to cross out 2 × s rows for s global attentions (in the drawing we only have one row left, but you get the point), and what remains is your window size; in this toy example just a window size of 1. That's how you would construct a Longformer that uses the same amount of memory as your classic model but can take in the full sequence length n. So, I just wanted to make that clear and go through the calculation myself, and I hope that helped. Thanks for listening, and if you liked this, consider subscribing and liking. Bye.
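The equal-memory construction at the end can be written down directly. Again a sketch of my own (the function name is hypothetical, not from the paper): solving n × w + 2 × s × n ≤ n₀² for w gives the largest window that fits within RoBERTa's memory budget:

```python
def max_window_under_budget(n, n0, s):
    # Largest window size w such that the Longformer's memory,
    # n * w + 2 * s * n, stays within RoBERTa's budget of n0 * n0.
    # Rearranged: w <= n0 * n0 / n - 2 * s.
    return max(n0 * n0 // n - 2 * s, 0)

n0 = 512
print(max_window_under_budget(n0, n0, s=0))      # 512: same length, full window
print(max_window_under_budget(2 * n0, n0, s=0))  # 256: double length, half window
print(max_window_under_budget(2 * n0, n0, s=4))  # 248: global attention shrinks it further
```

This mirrors the argument in the transcript: doubling the sequence length halves the affordable window, and every global-attention location eats a further 2 rows out of it.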
