XCiT: Cross-Covariance Image Transformers (Facebook AI Machine Learning Research Paper Explained)


Yannic Kilcher · 23.06.2021 · 19,055 views · 466 likes


Video description
#xcit #transformer #attentionmechanism

After dominating Natural Language Processing, Transformers have taken over Computer Vision recently with the advent of Vision Transformers. However, the attention mechanism's quadratic complexity in the number of tokens means that Transformers do not scale well to high-resolution images. XCiT is a new Transformer architecture, containing XCA, a transposed version of attention, reducing the complexity from quadratic to linear, and at least on image data, it appears to perform on par with other models. What does this mean for the field? Is this even a transformer? What really matters in deep learning?

OUTLINE:
0:00 - Intro & Overview
3:45 - Self-Attention vs Cross-Covariance Attention (XCA)
19:55 - Cross-Covariance Image Transformer (XCiT) Architecture
26:00 - Theoretical & Engineering considerations
30:40 - Experimental Results
33:20 - Comments & Conclusion

Paper: https://arxiv.org/abs/2106.09681
Code: https://github.com/facebookresearch/xcit

Abstract: Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.

Authors: Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Table of contents (6 segments)

Intro & Overview

Hello there! Today we'll look at XCiT: Cross-Covariance Image Transformers, by Facebook AI, Inria, and Sorbonne University. In this paper, the authors propose a kind of transpose of the attention mechanism: instead of the attention working across tokens, with tokens attending to other tokens, it is now the features, or channels, attending to other channels, aggregated across the entire input sequence. This means there is no longer a quadratic complexity in the length of the input sequence, and it supposedly works particularly well for image data. These models are akin to the Vision Transformers that work on patched images, and they reach comparably good performance on things like ImageNet classification and self-supervised learning, but also on dense prediction tasks such as segmentation.

So we're going to look into this paper. It is kind of weird to think about: the idea is pretty simple, but the question to me is whether this can still be called a transformer in the way it operates, because as it seems to me after reading the paper — and I think they also mention this in the paper — it is honestly more like a convnet that just has one dynamic part in it, where one of the convolutions is a dynamic convolution. But we'll see, and this could be a good architecture for future image processing.

Here they say (let me grab my yellow): following their tremendous success in NLP, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modeling of image data beyond the local interactions of convolutions. This flexibility comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. So this is the problem: transformers good, attention mechanism powerful; however, there is a quadratic complexity in time and memory in terms of the sequence length, and that's why we can't apply them to long sequences or high-resolution images. They say: we propose a transposed version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention has linear complexity in the number of tokens and allows efficient processing of high-resolution images, yada yada. Then they propose an entire architecture built upon the XCA, the cross-covariance attention, which they call XCiT — the cross-covariance image transformer. They say it combines the accuracy of conventional transformers with the scalability (not "sealability", sorry) of convolutional architectures, and they validate the effectiveness by reporting excellent results on multiple benchmarks, including self-supervised image classification on ImageNet, object detection, instance segmentation, yadda — they're super good. Okay, so what is this

Self-Attention vs Cross-Covariance Attention (XCA)

new kind of attention? This is the main graphic in the paper. On the left you can see how the whole model looks: it consists of these XCiT layers, so you'd have input tokens down here, then L of these XCiT blocks, and at the end whatever classification or segmentation layer you need. In our case, this part here is what would normally be a self-attention block followed by a feed-forward network, and you can see it's essentially the same: the feed-forward network is still here, but the self-attention block has been replaced by these two blocks, and the bottom one is this cross-covariance attention, which does attention pretty much like you're used to, with a tiny difference. As I said, the idea is mathematically pretty simple; it's just a bit weird to think about. On the top you have the classic self-attention used throughout transformers today, on the bottom the newly proposed cross-covariance attention, and if you look at the pictures, the only thing that is different is that the green and the orange matrices are swapped.

For that, let's dive a little into what attention usually does. I've drawn this picture about a thousand times, but forgive me if I do it one more time. Say we have a series of tokens like this one here; these can be word embeddings in language, but they can be image patches in vision. The way Vision Transformers work, it's prohibitively expensive to process each pixel individually, so they cut the image into patches, and each patch becomes one of these tokens — as opposed to convolutional networks, which can work on these high resolutions directly by applying only local convolution operations. So these are sequence elements of whatever form. Every one of these sequence elements exposes a query vector, a vector that's supposed to express what the element wants to know about the other sequence elements, and each one also exposes a key vector, which expresses a little bit what's contained in that token. The routing works by comparing each query to each key and sending information along the pairs with the largest inner products. For example, to compute the next representation of this token right here, we look at its query and compare it to all the keys we find; in this case only this one key matches, so we'd expect the connection between those two to be very strong. Ultimately, what you're building in here is a fully connected layer — everything is connected to everything with different strengths — but the strength of each connection is determined dynamically by the attention mechanism rather than fully learned. An MLP would be a fully learned, fixed connection matrix; an attention matrix is a dynamic connection matrix.

In cross-covariance attention we do something very similar, but we have to think a bit differently. Let's represent these token things as vectors: say we have five data points, and each has four dimensions (we'll leave queries and keys aside for now). Now you don't view the tokens as the sequence — you view the channels as the sequence. So this here is now one element, and this is one element; I can't rotate the drawing, so just imagine it rotated in your mind. Each channel exposes a query and a key, and information is routed not from token to token but from channel to channel. Essentially you look across the entire sequence in the first channel and decide what kind of information is in that first feature across the whole sequence. And you can see, kind of, how that makes sense. With self-attention, a patch of a picture might contain part of an eye, and another patch might contain part of a mouth — there's a tooth — and it would be important for these two to communicate with each other, because that would hint that there might be a face in the image. In the cross-covariance framing, we look across all of the positions at once: maybe the first channel is responsible for recognizing eye-like structures anywhere in the image, across all the patches — the channel that says "I think there's an eye somewhere" — and this other one could be the channel that says "I think there's a mouth somewhere in the image". You can also see that it's valuable if those two communicate. It moves away from the localization aspect and more toward communicating, across the entire sequence, what kinds of features are present.

Now, it's not directly the channels that expose the queries and keys, of course, just as it's not directly the tokens that are compared in regular self-attention. Think of your data matrix X as a big matrix of size n by d — exactly, not somehow: you have n data points, and every data point has an embedding of size d (maybe d is four here, so we have n vectors, each with four entries). In self-attention, you first multiply X by a learned matrix that gives you the queries, and you multiply X by another learned matrix that gives you the keys, and then you transpose, so the attention logits become X W_Q W_K^T X^T — they have the formula somewhere here in the comparison. You can see that how the information flows is modulated by these learned parameters, and that gives you the self-attention matrix. Essentially you have a query transformation matrix, say d by d for simplicity — because you don't want to compare the tokens directly, but rather a function of the tokens — then the key weight matrix, also d by d, and then this thing right here, which ultimately gives you an n-by-n matrix telling you how much every single data point is connected, or attending, to every other data point. That's the routing table we saw up here. What do you do with this matrix? Famously, you take the softmax of X W_Q W_K^T X^T and multiply it by the so-called values, and the values are nothing else than your data multiplied by yet another learned weight matrix. So this here is the values, and you decide how to mix the values of the tokens to get the next tokens: from the point of view of one token in the output layer, the attention tells you how to aggregate across the values of the input layer.

Now we contrast this with cross-covariance attention (sorry if you knew all this). In cross-covariance attention, we again have our data matrix, and we again multiply by the query and key matrices, but now we do it differently: we multiply by the other factor from the left. (First I need to replace this up here — why is it green and orange? Wow, I didn't know you could do that. This is freaky. All right, I'm done now, thanks.) So it's the same data and the same matrices, but multiplied in a different order, which means that, as you can see here, this is no longer the matrix of inner products between tokens being computed — it is the matrix of inner products between channels, and it happens to be smaller, because the dimensionality d is smaller. You can see: this is d by n, this is n by d, so the resulting matrix is d by d, not n by n, which means that right here we aggregate across the sequence. The information of where things are in the sequence gets lost, summed out. If the data were centered, this would directly be the covariance matrix; they call it the cross-covariance matrix because it's not centered, but essentially it is the covariance matrix — not of the mini-batch, but across the tokens within a single data point. This d-by-d matrix essentially tells you how to aggregate the channels in order to go to the next layer. This again is multiplied by the values — and the values, as we said before, are just a linear function of the data, via another learned matrix W_V (I didn't label it before) — but now multiplied from the left, not from the right. So this here tells you how one channel attends to the other, and every token goes through this process independently: every token by itself aggregates features from the other channels within that token. This is very much like a one-by-one convolution, with this d-by-d matrix being the convolutional kernel (usually a convolutional kernel is represented differently, because you also want to represent it in space, but essentially this tells you how to aggregate information across channels within one single token). Every single token goes through this map, which is first of all a learned map, but then a dynamically constructed map — so this is very much a dynamic one-by-one convolution where the convolutional kernel depends on the entire sequence. But there is no explicit information mixing or sharing across tokens anywhere here — only implicitly, because of course the weights in this kernel depend on the entire sequence up here. Once you have the kernel and aggregate across the channels, every token aggregates only across its own channels. The information doesn't get spread across the image, across the sequence, like in self-attention. And that's why I'm saying I'm not even sure this is a transformer, because so far it's just a dynamic one-by-one convolution.
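To make the transposition concrete, here is a minimal NumPy sketch of the two operations side by side. The shapes and the plain softmax are my own simplifications — the paper additionally L2-normalizes queries and keys and uses a learned temperature — but the axis swap is the whole point:

```python
import numpy as np

# Minimal sketch (my simplification, not the authors' implementation):
# the only difference between the two attentions is which pair of axes
# the query/key product contracts over.
rng = np.random.default_rng(0)
n, d = 5, 4                        # 5 tokens, 4 channels, as in the drawing
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv   # each (n, d)

def softmax(A, axis=-1):
    A = A - A.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(A)
    return e / e.sum(axis=axis, keepdims=True)

# Token-to-token self-attention: Q K^T is (n, n), the output mixes *tokens*.
sa_map = softmax(Q @ K.T)          # (n, n) routing table between tokens
sa_out = sa_map @ V                # (n, d)

# Cross-covariance attention: K^T Q is (d, d), the output mixes *channels*.
xca_map = softmax(K.T @ Q)         # (d, d) -- the "transposed" attention
xca_out = V @ xca_map              # (n, d): each token remixed on its own

assert sa_map.shape == (n, n) and xca_map.shape == (d, d)
```

Note that `xca_map` is built by summing over the token axis, which is exactly where the "where things are" information gets aggregated away, and why its size is independent of the sequence length.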

Cross-Covariance Image Transformer (XCiT) Architecture

The second layer — sorry, the third layer here — is a feed-forward network, and it is exactly the same kind of thing: in the feed-forward network, again, every token goes through by itself and reconfigures itself according to some channel mutation, some one-by-one convolution. However, the feed-forward network is a directly learned transformation, not a dynamic one. In XCA, the production of the transformation is learned but the transformation itself is dynamic, while the feed-forward network is learned directly as a weight matrix. So essentially these are two feed-forward layers, except one is dynamic.

The only other thing they have here is this local patch interaction, and what is this? It is essentially a convolution — not essentially, it is exactly a convolution. If you think of the sequence of tokens: the first step is that we aggregate across all the tokens, come up with a transformation, and then every token goes through that transformation by itself — that's the layer we just discussed. Then there is a convolution. They call it local patch interaction, but it's a convolutional kernel that slides across the sequence and gives you, sort of, the next sequence. For example, this token right here — its convolutional kernel reaches this one and this one. This is not an attention mechanism; it is just a classic convolutional kernel, and it is even depth-wise separated, so it operates only within the same feature channel. If you think again of our data matrix with its feature channels, the convolutional kernel would be something like aggregating over this local window, and you just slide it everywhere — depth-wise separable, slid across the image. The good thing is that this gives you interaction between tokens, even if only locally, and it doesn't add a lot of parameters, because depth-wise separable means very few parameters, and there's also not much compute or memory overhead. But again, this is a convolution.

So: the first step is a convolution (the dynamic one), the second is an explicit convolution, and the third step, the feed-forward one, is again kind of like a convolution. Here you have a box, much like before, except you don't come up with the box dynamically; you simply learn the box, and then every token goes through it by itself, independently of all the other tokens, and that's how you get the next layer. So this is it: a dynamic one-by-one convolution, followed by a real, depth-wise separable (but not one-by-one — bigger) actual convolution, followed by a feed-forward layer, which again is kind of like a one-by-one convolution.

That's the idea behind this. Now, is it good or bad? And, independently of that, should this be called a transformer? Because if I think of a transformer, I do think of an attention mechanism, and the core of the attention mechanism is this information routing between elements of the sequence. Just because you transpose it and call it attention — I mean, it is kind of like an attention mechanism in that it contains a softmax and keys and queries, but whether calling it attention makes the whole thing a transformer, I'm not super sure. Are we now calling everything that has dynamic weights a transformer? I don't know; I guess we have to come to terms with the terminology here. However, this appears to work quite well.

Here they state their contributions: they include cross-covariance attention, which provides a transposed alternative to conventional self-attention, attending over channels instead of tokens. It attends over a fixed number of channels irrespective of the number of tokens, which makes the models more robust to changes in image resolution — also a good thing, since you can use variable-size images. They say that for image classification their models are on par with state-of-the-art Vision Transformers for multiple model sizes; they reach good accuracy on ImageNet; they can do dense prediction tasks; and they can do self-supervised learning using something like DINO. I've made a video about DINO, and if you use the XCiT backbone with DINO, it apparently works pretty well. So, cool. This raises a number of questions — kind of more, let's say, theoretical questions, to
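The local patch interaction step can be sketched as a depth-wise convolution over the token sequence. This is a 1D simplification — the paper uses small depth-wise 2D convolutions over the patch grid — and the function name and shapes are my own:

```python
import numpy as np

# Sketch of the "local patch interaction" idea (my 1D simplification,
# not the paper's exact code): a depth-wise convolution, i.e. each
# channel is filtered independently with its own small kernel, so
# tokens interact only locally and channels never mix here.

def local_patch_interaction(X, kernels):
    """X: (n_tokens, d); kernels: (d, k) -- one kernel per channel."""
    n, d = X.shape
    k = kernels.shape[1]
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))      # zero-pad the sequence ends
    out = np.empty_like(X)
    for c in range(d):                        # channels stay separate
        for i in range(n):
            out[i, c] = Xp[i:i + k, c] @ kernels[c]
    return out

X = np.arange(12, dtype=float).reshape(6, 2)  # 6 tokens, 2 channels
identity = np.array([[0.0, 1.0, 0.0]] * 2)    # k=3 identity kernel per channel
assert np.allclose(local_patch_interaction(X, identity), X)
```

The parameter count is just d·k (here 2·3 = 6), which is why this step adds so little on top of the two channel-mixing "feed-forward-like" layers around it.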

Theoretical & Engineering considerations

explain what's going on in here, because there is an intrinsic connection between the two kinds of attention. They're not just random things that happen to look alike: there is actually a discussion in the paper about the relationship between Gram and covariance matrices. You can transform one into the other, and the eigenspectra are not only related but actually equivalent — they say the non-zero part of the eigenspectrum of the Gram and covariance matrices are equivalent, and the eigenvectors can be computed in terms of each other. So there's an intrinsic connection between the two things, even though conceptually they're very different, and whether one can really go ahead and explain which one is good in which situation, why we do what, and whether there is even a difference — that is still to be seen.

The second thing is that, if this actually works as advertised, then together with things like MLP-Mixer and so on, it seems like it's not even important exactly how you do it, as long as you shuffle information around a little bit and mix that with feed-forward layers in some way — it all appears to perform on par with everything else. Now, we have seen a trend away from "we got a new state of the art" toward "we perform on par with", so you never know how much trial and error and engineering went into actually making it perform on par.

And lastly — this is interesting — as you can see right here, this model can handle different image resolutions, and it scales linearly with the image resolution: the GPU memory consumption you can see here is even better than something like a ResNet-50, which is pretty impressive. On the engineering side, though, there are a number of things that apparently you have to do to make this work. One is L2-normalizing correctly; without that, it breaks down. Temperature scaling is another: they have a learned temperature parameter right here, without which the performance degrades a little bit. And then there's this block-diagonal cross-covariance attention: they don't even attend from all channels to all channels. The matrix I've shown you before — they actually do this block-diagonally, so only, say, the first two channels can attend to each other, and so on. They compare this to something like group normalization, which also has success normalizing only groups of channels together.

So it seems to me — this is my opinion — that this is much more a good evolution of convnets than anything closely related to transformers, because the same kinds of tricks help right here: making it more local gives you better performance, and there is no long-range information exchange. It really seems like an evolution of the convnet. I'm not really sure what to think of this, other than that I would love to see this kind of architecture on other tasks, such as language — because, again, it being essentially a convnet is also what makes it so well suited to working on images. Here you can see, by the way, the attention maps of the classification layer, which look super duper clean; they say heads are sensitive to similar pictures within the same or across images. So I would be interested to see this on tasks other than images, to really probe its, let's say, transformer-like properties. Maybe we can start a hashtag — "leave transformers alone" — or something. I
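Those three engineering details can be sketched together in one function. This is my own reconstruction under assumptions — per-channel L2 normalization across the token axis (so the map holds cosine similarities), a temperature before the softmax, and head groups that make the d-by-d map block-diagonal — not the authors' exact code:

```python
import numpy as np

# Hedged sketch of the engineering tricks around XCA (my reconstruction):
# - Q and K are L2-normalized per channel, across tokens, so the d x d map
#   contains cosine similarities in [-1, 1] (hence the need for a temperature).
# - A learned temperature tau rescales the similarities before the softmax.
# - Channels are split into `heads` groups that only attend within their
#   own group, making the full d x d attention map block-diagonal.

def xca_block_diagonal(Q, K, V, tau=1.0, heads=2):
    n, d = Q.shape
    dh = d // heads                              # channels per head group
    out = np.empty_like(V)
    for h in range(heads):
        s = slice(h * dh, (h + 1) * dh)
        Qh = Q[:, s] / np.linalg.norm(Q[:, s], axis=0, keepdims=True)
        Kh = K[:, s] / np.linalg.norm(K[:, s], axis=0, keepdims=True)
        A = Kh.T @ Qh / tau                      # (dh, dh) cosine similarities
        A = np.exp(A - A.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)       # softmax over channels
        out[:, s] = V[:, s] @ A                  # mix channels within the head
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
assert xca_block_diagonal(Q, K, V, tau=0.5, heads=2).shape == (5, 4)
```

With `heads=1` this degenerates to a single dense d-by-d map; the block-diagonal variant is the grouped version the ablations favor.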

Experimental Results

don't know — we will all have to decide what a transformer really is. In terms of performance, of course, these models perform fairly well, as you can see right here, though there are some trade-offs in terms of the number of parameters: if you compare them to models of similar size, these large ones right here do often have more FLOPs. You can also modify this — you can modify the resolution, and they exist in smaller versions, which means larger patches. Sometimes the performance is better by a little bit — here you can see it outperforms a little — and I think it's a good thing that people now say "we perform on par with" rather than touting a 0.1 improvement as state of the art in their sub-classification. You can also see that self-supervised learning performs pretty decently, and down there — I think they don't have pictures — there are object detection, instance segmentation, and so on.

They do ablation studies, where they figure out, for example, that removing this XCA layer drops the performance significantly, so this really seems to be the key ingredient — even though it's kind of just, quote-unquote, a dynamic one-by-one convolution, it seems to be the key ingredient, the workhorse. Removing the local patch interaction — the actual convolution — also drops the accuracy, but not by as much as removing the cross-covariance attention layer. And you can see that without the L2 normalization it just completely fails, which is interesting. Maybe that's a lesson for future architectures: if you're looking to build a new architecture and you see it just fails, probably one out of the 200 current tricks that we know might make it converge and actually perform better than other models. Who knows. Okay, so this model

Comments & Conclusion

— it looks like, yeah, a good thing to try. My last criticism here is that they always use patches. At the beginning they tout: we don't depend on the sequence length, no quadratic complexity, yada yada; they say right here that high-resolution images are prohibitive — yet they still use patches. I get the idea behind using image patches, but it seems like, if you're able to process the full-resolution images, why should the lowest patch size be eight by eight? I think the lowest patch size they have here is 8x8, if I'm not mistaken — this here means, I think, 24 layers with patches of size 8. Isn't it possible, now that we have fully linear complexity in the number of tokens, to actually go full resolution on these things? Maybe they did and I just didn't see it in here, but this usage of patches itself seems a bit questionable if you have a model that is able to go to high resolutions. Or maybe they just wanted to put their parameters somewhere else — entirely possible.

All right, so I invite you to check out this paper, and check out the experimental results if you're interested in that. It's all fairly well documented; there is a long appendix that details even more things and more experimental results, there is pseudocode in PyTorch style, and there are even some more query and key visualizations. Okay, so, yeah, I invite you to check it out. Thanks for listening! If you like content like this, don't hesitate to share it out, and I'll see you next time. Bye!
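To put some rough numbers on that patch-size question — my own back-of-the-envelope arithmetic, with an illustrative embedding dimension, not figures from the paper:

```python
# Back-of-the-envelope arithmetic for the patch-size question (my own
# illustrative numbers, not from the paper).

def token_count(image_side, patch_side):
    return (image_side // patch_side) ** 2

tokens_p8 = token_count(224, 8)    # 28 * 28 = 784 tokens
tokens_p1 = token_count(224, 1)    # 224 * 224 = 50_176 tokens (per-pixel)

d = 192                            # illustrative embedding dimension
# XCA's attention map stays d x d either way, whereas a token-wise
# self-attention map would grow from 784^2 to 50_176^2 entries.
assert d * d == 36_864             # unchanged by resolution
print(tokens_p1 // tokens_p8)      # 64: more tokens, but cost is linear in them
```

So at least as far as the attention map is concerned, nothing stops per-pixel tokens; the cost that remains is the linear per-token work, which is presumably where the practical ceiling sits.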
