VirTex: Learning Visual Representations from Textual Annotations (Paper Explained)
29:42


Yannic Kilcher · 12.06.2020 · 6,392 views · 268 likes


Video description
Pre-training a CNN backbone for visual transfer learning has recently seen a big push in the direction of incorporating more data, at the cost of less supervision. This paper investigates the opposite: visual transfer learning by pre-training on very few, but very high-quality, samples on an image captioning task.

OUTLINE:
0:00 - Intro & Overview
1:00 - Pre-Training for Visual Tasks
3:40 - Quality-Quantity Tradeoff
5:50 - Image Captioning
8:35 - VirTex Method
14:30 - Linear Classification
20:30 - Ablations
22:05 - Fine-Tuning
25:45 - Attention Visualization
27:30 - Conclusion & Remarks

Paper: https://arxiv.org/abs/2006.06666
Code: https://github.com/kdexd/virtex

Abstract: The de-facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet. Recent methods have explored unsupervised pretraining to scale to vast quantities of unlabeled images. In contrast, we aim to learn high-quality visual representations from fewer images. To this end, we revisit supervised pretraining, and seek data-efficient alternatives to classification-based pretraining. We propose VirTex -- a pretraining approach using semantically dense captions to learn visual representations. We train convolutional networks from scratch on COCO Captions, and transfer them to downstream recognition tasks including image classification, object detection, and instance segmentation. On all tasks, VirTex yields features that match or exceed those learned on ImageNet -- supervised or unsupervised -- despite using up to ten times fewer images.

Authors: Karan Desai, Justin Johnson

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

Table of contents (10 segments)

Intro & Overview

Hi there! Today we're looking at "VirTex: Learning Visual Representations from Textual Annotations" by Karan Desai and Justin Johnson of the University of Michigan. At its core, this paper is pretty simple. At a high level, it proposes to take the task of image captioning — where you're given an image and asked to produce a caption for it — train a model to do this, and then take the visual part of that model as a backbone to transfer-learn on other visual tasks. That appears to work surprisingly well when you don't have much data, so pre-training this way seems to work very well. As always, if you like content like this, consider sharing it, subscribing to the channel, or telling me what you think in the comments. So, as I already said, the idea here is pretty simple.

Pre-Training for Visual Tasks

People have been looking for pre-training tasks for visual tasks. A visual task is anything where the input is an image; you usually have some sort of neural network that processes the image, and at the end you can have many things. You can have a classifier that classifies the image into one of many classes — if you know ImageNet, that's a thing, so if there's a cat here, the ImageNet classifier would say "cat". Or you could have an object detector that tries to predict where in the image the cat is, i.e. its bounding box. Or you could have semantic segmentation, where it labels every pixel: all of these pixels here are "cat", and maybe these are "sky". So there are many visual tasks you can formulate, and they all share roughly the same architecture. Specifically, they all share this part right here — the visual encoder, if you will — which is usually a convolutional neural network. What's really different between the tasks is mostly the last part, which does the actual task; the shared front end is often called the backbone. The idea now is: for some of these tasks I don't have many labels, I don't have many labeled images, so I can't train this big architecture from scratch — think medical images, or just domains where you don't have many images. So couldn't I somehow come up with a method to create this backbone beforehand, given another dataset? The simplest variant: you take a big image dataset such as ImageNet and, as we said, train a classifier to predict some classes on it. Because ImageNet has a lot of images, this gives you a backbone. Then, whenever you have a different task, you simply take the backbone, transfer it over, and continue training on the other task. That's called transfer learning. The question is: how do you get a good backbone?
If you train on something like ImageNet, that's of course a supervised task, so you have a very good learning signal — but even ImageNet has only about 1 million images.
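The backbone-plus-head split described above can be sketched very concretely. This is a minimal, framework-free toy (all function names and the "features" are my own illustration, not the paper's code): one shared feature extractor, several cheap task heads on top.

```python
# Toy sketch of the backbone/head split used in transfer learning.
# A shared "backbone" maps an image to a feature vector; each task
# attaches its own small "head" on top of the same features.

def backbone(image):
    """Stand-in for a CNN: here just a fixed, hand-made feature extractor."""
    flat = [p for row in image for p in row]
    return [sum(flat), max(flat)]        # a 2-d "feature vector"

def classifier_head(features):
    """Task 1: classification from the backbone features."""
    return "cat" if features[0] > 10 else "not cat"

def detector_head(features):
    """Task 2: a (fake) bounding-box regressor from the same features."""
    return {"x": features[1], "y": features[1], "w": 2, "h": 2}

image = [[3, 4], [5, 6]]
feats = backbone(image)          # computed once, shared by both tasks
print(classifier_head(feats))    # -> cat
print(detector_head(feats))
```

Transfer learning, in this picture, is just: obtain a good `backbone` from a data-rich task, then train only a new head (or lightly continue training the backbone) on the data-poor task.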

Quality-Quantity Tradeoff

The Internet, for example, has many more images. So what you could do is train on a much bigger dataset that you collected from the Internet — let's just call it "Internet" — but there you don't have labels. So instead of supervised learning you have to resort to self-supervised learning, where, for example, you take an image and rotate it to the right — here is our cat, rotated to the right — and train a classifier to predict that this image was rotated to the right; that classifier's feature extractor then becomes your backbone. These self-supervised methods work very well; there are a number of them, for example MoCo and things like this. There are also a number of techniques that do supervised pre-training and then transfer learning — you can watch my video on Big Transfer, which is a very large-scale attempt to pre-train a backbone for visual tasks. All right. Now, the general direction has been: the more data, the better. ImageNet is a big dataset, so we can train a really good backbone; the Internet is an even bigger dataset, but we don't have labels, so there's a trade-off — still, we can potentially train an even better visual backbone to then transfer-learn with. This paper goes in a different direction. They say: look, if you go in that direction, you get more images, but less information per image. With ImageNet you at least have a label per image, but if you simply take a photo from the Internet, you don't even have a label, and you have to resort to self-supervision. What if we go in the other direction and look for images that have very high-quality annotations, even if we don't have as many of them? Can we learn good backbones by trading off quantity for quality? In their quantity-quality trade-off, they go for descriptions.
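The rotation pretext task mentioned above is easy to make concrete: the "label" comes for free from the transformation you applied, so no human annotation is needed. A minimal sketch of the data-generation side (the helper names are mine; this is the idea behind rotation-prediction pre-training, not the paper's method):

```python
import random

def rotate90(grid):
    """Rotate a 2-D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def make_rotation_example(image, rng):
    """Self-supervised labeling: the 'label' is how much we rotated."""
    k = rng.randrange(4)          # 0, 90, 180 or 270 degrees
    rotated = image
    for _ in range(k):
        rotated = rotate90(rotated)
    return rotated, k             # (input, free label) pair

rng = random.Random(0)
img = [[1, 2],
       [3, 4]]
x, y = make_rotation_example(img, rng)
# A classifier trained to predict y from x must pick up on image
# content (orientation cues), and its feature extractor becomes
# the transferable backbone.
```

The point is that supervision is manufactured from the data itself, which is what lets these methods scale to unlabeled Internet-sized collections.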

Image Captioning

So they'll have something like this, where you have an image and a caption for the image. They show these on a line here, from semantically dense to semantically sparse, and their task is going to be caption generation: given an image, produce a caption. There are datasets you can train this on in a supervised fashion, and of course these are very expensive to create. If you want to create an ImageNet-style dataset, you have to label each image; but creating a caption dataset is even harder, because a human really needs to sit down and look at the image. In ImageNet, everything is one class, but here you have to look at the image and come up with an adequate description. Here, the adequate description is "an orange and white cat near a plate and a white cake." That's the caption. Of course the caption is ambiguous, so you'll have to collect multiple captions per image, and you'll have to make sure that the humans doing this do a good job, and so on. So these are very expensive datasets, but they are very high quality. Think about what a single label gives you — say this image is labeled "cat", or "cake" for that matter: it gives you very few bits of information. But consider the text "an orange and white cat near a plate and a white cake." You know that there is a cat; you know that it's one cat; you know its color is orange and white; you know that there is a white cake, so you know the other object; and you know the relation — they are near each other. Same for this one: "a brown and white puppy" — that's one object, plus a description of the object; there are apples, there is a green lawn, and the relations between them are also clear: the puppy is lying on the green lawn and looking at the apples.
So the information in captions is so much more dense than just labels, and that's the backdrop here: hey, can't we pre-train a backbone from a maybe small dataset that carries this much information, like an image-caption dataset?
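To make the "bits of information" argument concrete, compare what a label yields with what a caption yields. This is a deliberately crude illustration — the keyword spotting below is my own toy, not how any captioning model works — but it shows why one caption can encode several label-like facts:

```python
caption = "an orange and white cat near a plate and a white cake"
label = "cat"

# A single class label yields exactly one fact.
facts_from_label = [("object", label)]

# Even naive keyword spotting recovers several facts from the caption.
words = caption.split()
objects   = [w for w in ("cat", "cake", "plate") if w in words]
colors    = [w for w in ("orange", "white") if w in words]
relations = [w for w in ("near", "on", "in") if w in words]

facts_from_caption = (
    [("object", o) for o in objects]
    + [("attribute", c) for c in colors]
    + [("relation", r) for r in relations]
)
print(len(facts_from_label), "vs", len(facts_from_caption))  # -> 1 vs 6
```

Objects, attributes, and a spatial relation all come from one annotation, which is exactly the "semantically dense" property the paper exploits.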

VirTex Method

So their method is nothing more than this: they train image captioning, and then they use the visual backbone for transfer learning. This is the model. There's an image; the image goes into this visual backbone, which is a ResNet-50 — a very standard convolutional neural network — and that gives you features. These features are 7 × 7 × 2048, the standard output of a ResNet-50. Then they apply a linear projection so that they can feed the features into a language model. So you have visual features, and you feed them into the language model, which is a Transformer — actually two Transformers, both autoregressive: one Transformer tries to predict the caption in the forward direction, and the other in the backward direction (that's down here; in that branch the caption has been reversed). If you don't know what a Transformer is, I've made several videos on Transformers — the first one is on "Attention Is All You Need", and that's essentially the same kind of Transformer they use here. As you can see, you have multi-head attention, layer normalization, and the decoder's cross-attention. Now, the difference between the original Vaswani "Attention Is All You Need" Transformer and this one: in the original Transformer, for a machine-translation task, you would have, say, a French sentence over here and the beginning of a German sentence here — what you have already produced — and you're asking what the next word should be. The architecture was such that there is a decoder Transformer and an encoder Transformer that encodes the source sentence; at some point there is cross-attention, where the signal from the encoder goes into the decoder, the decoder incorporates it, and at the end the whole Transformer predicts what the next word
will be. The only difference here — sorry, I mixed up decoder and encoder before — is that the encoder is no longer a Transformer but this ResNet-50. You can think of it like a translation task: you want to translate from images to text. Your input is an image, and the signal that in the original Transformer would come from the source encoder into the decoder now comes from the image — from these visual features; in this drawing, this thing is going in here — and then you simply predict the next word, and you do it in both directions. The reason you can do it bidirectionally here, which of course isn't the case for a standard decoding task, is that you don't need to do inference; you just need to do training, and training can be done with teacher forcing. So you can do this in a bidirectional way, because you don't need the decoders at inference time. At inference time, you simply cut off this part right here — that's your visual backbone — and these features are the ones you then train your downstream task on. Sometimes you fine-tune the backbone, and sometimes you keep it frozen; you can choose. All right, to recap: a convolutional network encodes the images into visual features; those visual features go into two Transformers, both of which try to predict the caption of the image, one in a forward direction and one in a backward direction; and you train the model to predict the gold-standard captions in your dataset as accurately as possible. That's it. If you train this model well, it can produce accurate captions for these images, which means it has learned something meaningful about the image — to the degree, of course, that the original caption in the dataset was a good, descriptive caption, but we're going to assume that in these datasets that's the case. All right,
that's what they do. An interesting thing here is that in their standard setup they have only one of these Transformer layers — of these blocks right here, they have just one, and it's, I think, about 2,048 units wide in the hidden dimension, but only one layer. What that means is that the Transformer is not very powerful, so you force most of the power to come from the visual encoder: the backbone basically has to do most of the work, and the Transformer is simply a very shallow language model on top of it. That, of course, makes your visual backbone even better. All right, we can pretty much skip the rest — that's the idea; there's nothing more to it. You train this from scratch.
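The bidirectional training described above — two autoregressive decoders trained with teacher forcing on the caption and its reverse — boils down to building shifted (input, target) pairs. Here is a data-preparation sketch only (token names are illustrative, and the real model of course also conditions every prediction on the visual features):

```python
SOS, EOS = "[SOS]", "[EOS]"

def teacher_forcing_pairs(tokens):
    """For an autoregressive decoder: at each position the input is the
    gold prefix so far and the target is the next gold token."""
    seq = [SOS] + tokens + [EOS]
    inputs  = seq[:-1]   # what the decoder sees
    targets = seq[1:]    # what it must predict at each position
    return inputs, targets

caption = ["a", "cat", "near", "a", "cake"]

fwd_in, fwd_tgt = teacher_forcing_pairs(caption)        # forward model
bwd_in, bwd_tgt = teacher_forcing_pairs(caption[::-1])  # backward model

# Because the inputs are gold tokens (not model samples), both
# directions can be trained in parallel; at transfer time both
# decoders are simply discarded and only the backbone is kept.
print(fwd_in)
print(fwd_tgt)
```

Teacher forcing is what makes the backward direction cheap: no sequential generation is ever needed during pre-training.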

Linear Classification

They don't use any pre-trained weights whatsoever — they train from scratch — and then they use the backbone. In the first experiment, they simply train a linear classifier on top of the representation: they freeze the backbone, train a linear classifier, and compare this to baselines. One baseline is ImageNet-supervised, where you use the same backbone but train it on ImageNet in a supervised fashion and then transfer it to other tasks — kind of like what Big Transfer does, but just with the regular 1000-class ImageNet setup. Then you have the unsupervised pre-training baselines, MoCo and PIRL. MoCo is momentum contrast, one of these self-supervised methods that has been shown to work really well. MoCo-IN is trained on ImageNet, but without the labels, because MoCo is unsupervised, and MoCo-COCO is trained on the COCO dataset. The COCO dataset is what this paper, the VirTex paper, uses — COCO is the image-captioning dataset. What's important to note is that COCO has only about 10% of the images of ImageNet, so it's considerably smaller. Now let's see how these things fare. On the x-axis you see the number of images that the pre-training method trains on. Some of these curves are capped, because for some datasets there just aren't more images available: the ones training on COCO are capped here, the ones on ImageNet are capped there. You can already see that VirTex outperforms the ImageNet-supervised baseline when you give the latter only this many images. The brown curve is when you take one caption per image; but the dataset actually has more than one caption per image, and when you use more than one, you can boost your performance a bit more. And that works
way better than doing supervised pre-training on ImageNet with about the same number of images. When you use all of ImageNet, you can get to a similar performance, but you have to use a ten-times-bigger dataset to get there. So this already shows you the advantage here. Also consider the difference to the unsupervised baselines: at the same number of images, the unsupervised baselines are even lower, but with more images they get closer to the ImageNet-supervised one, and in their own papers there is some evidence that if you do self-supervised pre-training for long enough, you can actually surpass ImageNet-supervised pre-training — though I'm not so sure that's really the case. But you can see the trade-off here: higher-quality information on smaller datasets versus lower-quality information on more data. And I guess if you were to pre-train these self-supervised methods with lots more data, they would maybe end up even higher than ImageNet. Now, this graph here is sort of the same thing — they also train a linear classifier — and you can see that now the ImageNet-supervised baseline outperforms VirTex by a lot. What's happening here? This plot is on ImageNet: the task you transfer-learn to is ImageNet itself. Before, it was a neutral task, Pascal VOC — none of these methods had trained on Pascal; they trained on their own datasets (these on COCO, this one on ImageNet) and then transfer-learned to Pascal. Now the transfer-learning task is ImageNet, so naturally the model that was pre-trained in a supervised fashion on ImageNet has a huge advantage on this task, because it has basically already learned the task beforehand, whereas VirTex pre-trained on COCO, not on ImageNet. And you can see, if you give
it the same number of images for pre-training, it's actually fairly close to the ImageNet baseline — pretty respectable. Again, of course, if you pre-train with more images from the very dataset you then evaluate on, the ImageNet baseline outperforms it. But it's pretty cool to see that in this smaller-image regime — and also consider down here: if you go even an order of magnitude lower, it's really shining — if you have higher-quality information and you make use of it, you don't need as many images. We've known this for a long time, but this now shows the same for visual transfer learning.
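The linear-probe protocol behind these comparisons is: freeze the backbone, extract features once, and fit only a linear decision rule on top. Here is a minimal stdlib stand-in — I use a nearest-centroid rule in place of a trained linear (logistic/SVM) classifier, since both are linear decision rules over the frozen features; the feature vectors are made up for illustration:

```python
def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def fit_probe(features_by_class):
    """'Train' on frozen-backbone features: one centroid per class.
    The backbone itself receives no gradient updates."""
    return {c: centroid(vs) for c, vs in features_by_class.items()}

def predict(probe, feat):
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(probe, key=lambda c: dist2(probe[c], feat))

# Hypothetical frozen-backbone features for two classes:
train = {"cat":  [[1.0, 0.1], [0.9, 0.2]],
         "cake": [[0.1, 1.0], [0.2, 0.8]]}
probe = fit_probe(train)
print(predict(probe, [0.95, 0.15]))   # -> cat
```

Because only this tiny probe is trained, the benchmark score measures the quality of the frozen features themselves, which is exactly what the plots compare.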

Ablations

transfer learning. So this was when we froze the backbone and trained a linear classifier on top. They then make a short excursion and show how different parts of their model affect the final performance. They find, for example, that bicaptioning — which I believe is forward plus backward captioning — helps significantly compared to forward-only captioning, and that it significantly outperforms the other pre-training tasks they could use. They also investigate how big their models should be. This is their baseline model — so I was wrong earlier: it's one layer with a width of 1024. You can see that as you make the layer wider and wider, that generally helps, but I guess they decided against it because the gains are too small to be worth it; and if you make the network deeper — give the Transformer more layers — the performance goes up, but again the gains are marginal, so I guess they leave that away too. So their baseline, as you can see, is a ResNet-50 with one Transformer layer of width 1024. So this is now

Fine-Tuning

the last task: fine-tuning. This is what most people would do — take a backbone and fine-tune it on a different dataset or task where they don't have many labels. Here the situation looks a bit different. If you look at tasks on COCO — there are several tasks on COCO, one of them being image captioning, which they use for pre-training — then for the other COCO tasks you can see that, compared to the supervised baseline, VirTex performs about the same or maybe a bit worse. But what you can see is that it performs significantly better than, for example, MoCo trained only on COCO. Again, this shows that on the same dataset, higher-quality information makes it worth it. It's even better, as you can see, than MoCo trained on ImageNet; it's just not quite as good as the supervised baseline. All of them, of course, are better than a randomly initialized network trained from scratch — that's the entire point of transfer learning, that you're better than simply learning from scratch — and this shows throughout the experiments, except on this LVIS masking task, where they do outperform the other methods significantly. Now, the lower numbers on this task also mean that the task is harder than these other tasks, and therefore there are more gains to be made, so you could hypothesize that higher-quality input information can be used in a better way there — maybe the complexity of the task also influences how well the transfer learning works, depending on whether you come from a high-quality pre-training task or a low-quality one. Lastly, they compare on Pascal VOC object detection and this iNaturalist classification, which I believe is also a transfer-learning task with fine-tuning. As you can see, they can also
hold up against the supervised baseline, or even outperform it sometimes — the green triangles mean that they outperform it by a significant margin — but on this task right here they again lag behind. I think the point of the paper isn't really to show that this is the best thing ever; the point is to show how you can go about pre-training. The common assumption is that you need more and more and more data for your model to learn about the data distribution, and they conclude: no, actually, you can do with very few data points, as long as they have high-quality annotations. I think that's the point of the paper. They don't always outperform the other baselines, but they keep the performance about the same, which basically means this is a viable option.
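The difference between the frozen-backbone setting and fine-tuning is simply which parameters receive gradient updates, and at what learning rate. A common recipe is to update the backbone with a smaller learning rate than the freshly initialized head, since its pre-trained weights are already good. This is a configuration sketch only — the structure and values are my assumptions, not the paper's hyperparameters:

```python
# Hypothetical parameter groups for the two transfer-learning modes.
def make_param_groups(mode, base_lr=0.01):
    if mode == "frozen":          # linear probe: only the new head learns
        return [{"params": "head", "lr": base_lr}]
    if mode == "finetune":        # whole network learns; backbone updates
        return [                  # more gently than the new task head
            {"params": "backbone", "lr": base_lr * 0.1},
            {"params": "head", "lr": base_lr},
        ]
    raise ValueError(f"unknown mode: {mode}")

print(make_param_groups("finetune"))
```

In a real framework these groups would hold actual parameter tensors and be handed to the optimizer; the takeaway is just that "fine-tuning" is the frozen setup plus a (usually small) learning rate on the backbone.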

Attention Visualization

Here is a pretty cool result, where they visualize the attention of their image-captioning model. Because they train an image-captioning model, you can really see that it learns something meaningful about the image. When the caption says "a bird flying", the attention is mainly on the bird, as you can see; then on "over" the attention widens out over the image; on "the air" the attention is in the sky; and on "the ocean" the attention is on the ocean itself. They have a bunch of these images, and they're pretty cool. Here: "a dog" — focused on the dog — "riding on" — and you can see the attention going down, because "riding on" probably means there's something below the dog — "a surfboard" — now the attention is fully on the surfboard — "in" — as soon as you say "in", the attention, as you can see, widens out. I think that's a fairly cool demonstration that the model understands the "in" relation: if it's focused on something, and that something is in something else, it widens the attention out to see what it is in — okay, "the ocean" — and then it focuses the attention on the ocean. So that's a pretty cool result. I guess we already knew this, because we could train image-captioning models before; it's just to show that it actually makes sense to use them as a pre-training task for backbones.
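The visualizations themselves are straightforward to produce: for each generated word, the decoder's cross-attention weights over the coarse feature grid are normalized and upsampled to image size, then overlaid as a heat map. A sketch of the normalize-and-upsample step (nearest-neighbour upsampling; the 2×2 map and the values are illustrative — in practice it would be e.g. a 7×7 grid scaled up by 32):

```python
def normalize(grid):
    """Rescale attention weights to [0, 1] for display as a heat map."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in grid]

def upsample(grid, factor):
    """Nearest-neighbour upsampling from the coarse feature grid."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([wide] * factor)   # repeat each row (read-only use)
    return out

# A fake 2x2 cross-attention map for the word "bird":
attn = [[0.7, 0.1],
        [0.1, 0.1]]
heat = upsample(normalize(attn), 2)   # high values where "bird" attends
```

One such heat map per decoded word is exactly what the figures in the paper show.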

Conclusion & Remarks

Now, what's the future of this? The authors, in their introduction, claim that this has a good future, because they only train on this small dataset — smaller than ImageNet, as you can see — and they already get the same performance as training on the whole ImageNet dataset in a supervised fashion. Of course, they're also supervised, but they use ten times fewer images. And they say something to the effect of: it would be pretty easy to collect more data for this task, because the Internet is full of images, and mostly these images have some text with them — descriptions, or text around them; people write something about the images. You could mine Twitter, and the responses when someone posts an image might tell you something about it. But this definitely counteracts their own notion that these are very high-quality labels. Their entire point was that the annotations in these image-caption datasets like COCO are very high quality: the text is a really descriptive account of the image that tries to capture what a human can see visually in it. As soon as you go out to the Internet and collect the text around images, that's not going to be the case — that information is again going to be quite low quality. So I doubt that the performance here would hold up, or that the claim that you can easily create more data for this task holds up. That's my one worry about the future of this, but it's definitely cool, and it demonstrates this quality-quantity trade-off very well. All right, that was my two cents on the paper. I invite you to read it and tell me in the comments what you think, and I'll see you next time!
