# Shortcut Learning in Deep Neural Networks

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=D-eg7k8YSfs
- **Date:** 18.04.2020
- **Duration:** 49:11
- **Views:** 11,042
- **Source:** https://ekstraktznaniy.ru/video/13774

## Description

This paper establishes a framework for viewing the out-of-distribution generalization failures of modern deep learning as models learning spurious shortcuts present in the training data. The paper characterizes why and when shortcut learning happens and gives recommendations for countering its effects.

https://arxiv.org/abs/2004.07780

Abstract:
Deep learning has triggered the current rise of artificial intelligence and is the workhorse of today's machine intelligence. Numerous success stories have rapidly spread all over science, industry and society, but its limitations have only recently come into focus. In this perspective we seek to distil how many of deep learning's problems can be seen as different symptoms of the same underlying problem: shortcut learning. Shortcuts are decision rules that perform well on standard benchmarks but fail to transfer to more challenging testing conditions, such as real-world scenarios. Related issues are known in Comparative Psychology…

## Transcript

### Introduction [0:00]

Hi, today we're looking at "Shortcut Learning in Deep Neural Networks" by a number of authors from the University of Tübingen, the Max Planck Research Center and the University of Toronto. I'm not going to read all of them, but all of them are either joint first authors or joint senior authors; it's just a team of people who did this work together. OK, sorry, bit of a rant. All right, so this paper discusses what they call shortcut learning. They don't actually propose something new here; they discuss this phenomenon, try to link several things together under the name of shortcut learning, which they claim is a problem in current deep learning, and they discuss why it happens and what can be done about it. I just want to jump into this

### Shortcut Learning Example [1:10]

example, real quick. In this case you have a training set of images: these four images here with these labels, and these four images with those labels. You can train a machine learning model on a bunch of those and then test it on the IID test set. What you'll find is that a human doing this task would label these A and these B, which is probably what any human would do: these are the stars and these are the moons. A neural network would also give the labels A and B here. But now you go to this out-of-distribution test set (we'll go over why that is out of distribution in a second), and the human will still classify these as A's, because they have the stars, and these as B's, while the neural network will classify these as B's and these as A's. I'm not saying this is what's going to happen every time, but imagine it happens; it's a conceivable situation.

Think about what happens here. In the training set, all of the stars were either in the bottom left or the top right of the image, whereas the moons were either in the bottom right or the top left. So the neural network might have learned "bottom left or top right means star, otherwise moon", and if it applies that rule to the new test set, it will classify these as moons and these as stars, which is incorrect. This might happen, for example, if the person who wrote the generator for the data set for some reason only produced data with this property: bottom left or top right is a star, otherwise a moon.

What generally happens in machine learning is that we collect a big data set, but we collect it in a single pass. Then we split it randomly into a fairly large training set and a somewhat smaller test set. It's important that we first collect the data and only then randomly split it. The out-of-distribution test set might come from a second person: the first data set was collected in a first step, but later a second person collects another bunch of data, data two, thinking it should be the same as the first data, and you apply the classifier you trained and tested on data one to this second data. What is different in this case is the data collection process that happens beforehand. Somewhere here is the real world (I'm going to draw the real world, it's a globe). You draw data set one from the real world and split it into train and test, and you draw data set two separately. The second data set is a fundamentally different sample of data; the train and test splits are much closer to each other than these two data sets are. I think that's fairly intuitive.

So what we usually do is train here and evaluate on the test set, but the train and test sets are just randomly split versions of the same data set. That means that if there is some kind of bias in the training set, it's probably also in the test set, like we saw with the moons: both halves have this moon/star position property, because that pattern was introduced, by accident in this case, when the data was produced. The OOD test set, on the other hand, is maybe a second person writing their own generator that doesn't have this property. Since we train and then evaluate on IID data (this is the IID assumption), we're going to do fairly well with our crooked decision rule, because the test set has the same bias. But once we evaluate on the out-of-distribution data, we will fail, because that data doesn't have this bias in it.

So shortcut learning refers to the phenomenon that there might be features in the training set that the model starts to learn, such that it learns something other than what we want it to learn: here, the position instead of the shape. Usually these shortcuts will not be revealed by generalizing to the IID test set, because the test set, being an IID split of the same data, has the same biases; they only become apparent once we do out-of-distribution evaluation. This is shortcut learning, and this paper goes into its origins and characterization. While I think this is a good approach and a good paper, and it says many correct things, I think the framing is a bit off at times, and we'll come to that. So first of all, they give some examples in

### Shortcut Learning Examples [8:45]

biological neural networks. They have this one example where a rat learned to navigate a complex maze based on color differences of the walls. This was surprising because rats don't really have color vision, or at least it was known that rats don't have very good color vision. Then they discovered that the rats did not actually use the visual system at all: they simply discriminated the colors by the odor of the paint. If you painted the wall red or blue, it smelled different, and the rats could smell it. Once they controlled for the smell, the remarkable color discrimination ability disappeared.

The second example they give: Alice loves history and has spent weeks immersing herself in the world of Hannibal and his exploits against the Roman Empire, but the exam questions are multiple choice, like "how many elephants did Hannibal employ in his army?", and do not focus on understanding. Bob, who just learned the facts by heart, now does much better than Alice, who has actually understood the topic. They give these as examples of shortcut learning, where the model learns something that we don't intend it to learn.

And I think this is the crucial point: the model learns something that we don't intend it to learn. This might be pretty clear once you observe it, but the crucial part here, which I think this paper doesn't emphasize as much as it deserves, is the two words "we" and "want". We want shape, and the model learns something else, just something else. My comment on the word "want" is: you can't formulate it. This is crucial, and I think the paper almost ignores this point. You cannot formulate what it means to classify things by shape, and this seems so obvious to us only because we're so used to it as humans. We just say "oh, just use the shape, this is the shape". But you cannot program a computer to do this; that's why we use deep learning in the first place. We have no idea how to program an algorithm that extracts the shape of something. It might be possible for a star and a moon, but not for a cat or a car or anything like that. So you cannot formulate your objective; that's the problem. It's easy to then say "oh, the model doesn't do what we want it to do", but you can't even formulate what you want it to do in a precise way.

Basically, all you're saying is "I'll train a shape classifier", and once you've gone through this process of training and evaluating, you say "now I have a shape classifier". Say you hadn't done this OOD evaluation: you would still proclaim "I have trained a shape classifier". No, you have trained something that, given the entire process of how you created your data, can classify these two kinds of images. Here is your generator, the little program you wrote to produce these images; it picks star or moon at random, creates the image from that, and that gives you your data set. What you have trained is not a shape classifier but a classifier for data that comes from this data generation process. The entire notion of calling it a shape classifier exists because you, as a human, thought of shape when you programmed this generator, when you collected the data set. But you can't call it a shape classifier just because that was your intent. You have a classifier for images from this particular data generation process, and you can't actually formulate a shape classifier.

OK, the second word is "we". We humans want a shape classifier. I've said this before, and this refers back, for example, to the paper about contrast sets in NLP: humans have grounded knowledge. Grounding is very important here. Grounding means that humans live in a world of physics and culture, of biology and the need for food, and this world generated our brains. Humans live in a world of objects, of people, of being eaten and of needing to eat. Humans grew up and live in this world; your brain was literally structured according to these things, and thus we understand everything with an eye to this grounded knowledge of reality, where there is such a thing as objects. Now take ImageNet and train a classifier for objects. This is what I find so crazy: we collect this thing, there's a car, and you say "that's a car", because you know there is an object of a car. But the neural network does not have a bias for objects; it simply sees pixels. Same here: what you do immediately is recognize the object of the star. You transform this into a 3D scene, where you are watching in 3D space and there is this star object somewhere, and you understand that the star could move around and still be the same star. But that's only because you have the inherent bias that there are objects; the word "shape" is nothing more than a property of an object. Neural networks simply do not have an inherent bias for objects, or people, or intent, or what they need to eat. This becomes super obvious if you ever try to solve, for example, a jigsaw puzzle upside down. I'm terrible at this. Say the puzzle has a face on it and you try to solve it on its head; try it. It's the same task: you simply need to match the border shapes and make sure the lines of the picture are continuous. But it becomes so much harder, just because you have this brain. So that is my entire criticism; it will pull through this entire paper, and we'll go through the rest relatively quickly because we've already touched on it. Keep in mind this is my commentary; it is not superior knowledge or anything, that's just me. All right.
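The star-and-moon setup from earlier can be sketched as a tiny runnable experiment. Everything concrete below (the 8x8 images, the cross and C-shape patches standing in for star and moon, the nearest-centroid classifier) is my own placeholder, not from the paper; the point is only that a model trained on position-confounded data aces the IID split and fails once a second, differently biased generator produces the evaluation data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 8x8 "images": a cross patch stands in for the star,
# a C-shape patch for the moon. Corners 0-3 are the four 3x3 corner blocks.
STAR = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], float)
MOON = np.array([[1, 1, 1], [1, 0, 0], [1, 1, 1]], float)
CORNERS = {0: (0, 5), 1: (5, 0), 2: (0, 0), 3: (5, 5)}

def make_image(label, corner):
    img = np.zeros((8, 8))
    r, c = CORNERS[corner]
    img[r:r + 3, c:c + 3] = STAR if label == 0 else MOON
    return img.ravel()

def sample(n, star_corners, moon_corners):
    X, y = [], []
    for _ in range(n):
        label = int(rng.integers(2))
        corner = rng.choice(star_corners if label == 0 else moon_corners)
        X.append(make_image(label, corner))
        y.append(label)
    return np.array(X), np.array(y)

# Train/IID data: shape is perfectly correlated with position (the shortcut):
# stars only in corners 0/1, moons only in corners 2/3.
X_train, y_train = sample(200, star_corners=[0, 1], moon_corners=[2, 3])
X_iid, y_iid = sample(200, star_corners=[0, 1], moon_corners=[2, 3])
# OOD data: a "second generator" flips the position convention.
X_ood, y_ood = sample(200, star_corners=[2, 3], moon_corners=[0, 1])

# A nearest-centroid classifier on raw pixels happily learns the position shortcut.
centroids = np.stack([X_train[y_train == k].mean(axis=0) for k in (0, 1)])

def predict(X):
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

iid_acc = (predict(X_iid) == y_iid).mean()
ood_acc = (predict(X_ood) == y_ood).mean()
print(iid_acc, ood_acc)  # perfect on IID, worse than chance on OOD
```

Because the train and IID test sets come from the same generator, the shortcut is invisible under IID evaluation; it only shows up against the second generator.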

### Decision Rules [19:27]

So what they do is set up a taxonomy of decision rules. There is a set of all possible decision rules, the outer set here: everything one could think of to discriminate data, let's say images. Most of them will just be crap; these use what the paper calls uninformative features. Then there are decision rules that perform well on the training data, this big circle here; decision rules that perform well only on the training set use the overfitting features. (To me it's a bit unclear: I think they call only this band the overfitting features, but they call the entire circle "all possible training solutions".) In any case, there are decision rules that perform well on the training set, but some of them overfit, as you know. The next circle inside that contains all decision rules that perform well on the training set and on the IID test set, and our location classifier from before would fall into this category. These are still a much larger set than the innermost set, the intended solution, which performs well on the training set, the IID test set, and all relevant out-of-distribution test sets. They draw it such that the solutions that work well on the OOD test sets are a subset of the solutions that work well on the IID test set.

I don't have a problem with this characterization of decision rules as such. What I do have a problem with is the fact that you cannot specify what the intended solution is. You cannot, and therefore I think this diagram is misleading, because ultimately you have no idea where this dot is. You can't specify it beforehand; you can't even specify the rules for how to get there. All you can do is give better data, and they kind of advocate for this with these OOD test sets. But again, when they say "all relevant out-of-distribution test sets", I'm a bit wary, because they suggest, as one of the measures to assess whether a model has learned these shortcut rules, measuring its performance on out-of-distribution test sets, very much like the contrast sets in NLP. I actually think this is a pretty bad solution in most cases, and let me explain why. So if we go back

### The Problem [22:59]

to here: what we saw is that this discrepancy comes about because we produce the data from the real world in one very specific form, and the out-of-distribution test set is produced in a slightly different form. Now think of what this means if you look at

### Cost Function [23:29]

your cost function that you train. What you usually say is: my cost function is some sort of loss over my data points and my labels. But what is often left out (you do write it in your introductory classes) is that this is an expected loss that you're minimizing, over a data distribution: x and y sampled from a particular data distribution D. When you talk about out-of-distribution classifiers, you have a slightly different data distribution D'. But if you simply have one out-of-distribution set, think of the contrast sets (if you haven't seen the video about contrast sets, a contrast set is basically a handcrafted out-of-distribution test set): my problem with this is that it's just one set. Even if you try ten of them, you won't get close to a true measure. The cool thing about an IID test set is that it is at least precisely the same distribution, so it gives you an unbiased number for this particular data generation pipeline. If you evaluate on an out-of-distribution test set, you now have two effects mixed together: the generalization effect, and the effect of this set having been produced in a different fashion, and you only measure one draw of the latter.

What you would like to assess is the loss of x and y in expectation, with the data distribution itself drawn from all possible data distributions in the real world. If you only have a single contrast set, it is akin to asking: how good of a machine learning engineer would I be if my test set had only one sample? If I ran a Kaggle challenge and said "your performance will be evaluated on this one single-sample test set", that's basically what you're doing with a single OOD test set: I'm going to give you one out-of-distribution data set, which I have biased in one particular way, and we'll measure how well you capture our intent, our shape-classifier intent, using this one single out-of-distribution thing. Approximating that expectation by a sum from i = 1 to 1 just pumps the variance beyond any reasonable meaning the resulting number could carry. What you'd have to do is run this entire process and sample train and test sets according to the underlying data distribution, which you have no clue what it is, because if you could specify it directly you would get the solution for free: if you could specify the underlying mechanism, you would already know the solution and wouldn't need machine learning. So I think the paper puts a bit too little emphasis on this. Suffice it to say, with

### Taxonomy [28:19]

their taxonomy, they can say: if you use, for example, only the overfitting features, you will do well on the training set but not on the IID and OOD test sets. If you use the intended features (again, "intended": no one knows what that is, nor can we specify it), you'll do well everywhere, including the OOD test sets. If you use the shortcut features, you will do well on the training and IID test sets, but not on the OOD test set. This is valid; I'm not discrediting this paper here, and they do allude to a lot of the things I'm saying, but not all of them, and I don't think they frame it correctly. So they ask: shortcuts,

### ImageNet [29:04]

where do they come from? And they say a lot of the things I've been saying here. For example, they ask "what makes a cow?" and give this example where they say a familiar background can be as important for recognition to deep neural networks as the object itself: the network will misclassify this picture because it is used to seeing a cow on grass. Now consider this in our framework. Let's say this is an ImageNet classifier. ImageNet is not an object classifier. That's what we say, that's our intent, but what it is, if you go through the pipeline of how the data is generated, is a classifier of naturally taken images: taken with a certain kind of camera, center-cropped to a particular object, labeled by human raters, filtered in some capacity, sourced from Flickr. For that particular data set we train a classifier. It is not an object classifier; it is a classifier for that data, and it has no clue about objects. In fact, you also have to see that the output, even when it looks like shape, isn't shape: it is the probability of a shape, or the probability of an object, given the data. And it is completely conceivable that if there's no grass in the background, it's probably not as much a cow. Now, I see the problem here: this is clearly a cow, and this is actually a conceivable natural image. But imagine a picture of a cow on the moon. This is the moon, and here's the cow, moo (this drawing is terrible). A cow on the moon: who can fault the neural network? I would say that's not a cow either, because in terms of the data generation process, if you ask me "please classify this as a naturally taken image", I'm going to say there's no way there's a cow on the moon. I don't know what this is, but it is very improbable that it is a cow, because all the training examples I've seen show cows on grass. They do actually allude to this, calling it data set biases and so on, but I'm pretty sure the interpretation is a bit off when they frame it as "we want an object classifier".

The second point I find even stranger is what they call shortcuts from discriminative learning. They allude to this picture here and ask "what makes a cat?". Their argument is basically that neural networks don't understand things, they just discriminate: there are a thousand classes at the output layer of the network, and it just needs to learn what distinguishes one class from another, so it will often rely on features such as texture, like here, and classify this image as an elephant. They say: to standard DNNs, the example image on the left clearly shows an elephant, not a cat. And again, I agree, given the data generation process. If you tell me this is data from naturally taken images with standard cameras, I have two possibilities. Is this a cat? There's no way that, anywhere in the universe, a picture taken with a phone camera of a cat looks like this. It's just not possible. However, is it possible that there is an elephant whose skin fold pattern, by random chance (elephant: big ears, trunk, skin folds), happens to look like the shape of a cat? Yes, that's possible. So if you ask me, according to the data generation process this is far more likely to be an elephant than a cat. The paper makes it seem so obvious that this is a cat, while the standard "stupid" DNNs think it's an elephant, because the DNN is just looking at texture and other local structure and not at shape, which is what we wanted. My point: just stop calling these things object classifiers. They are not object classifiers; they are classifiers of images from a data generation process. If you want them to be object classifiers, make a data set that actually covers the objects, but you can't specify that.

Then they go into adversarial examples, and I find this maybe doesn't quite belong here. They say "look, the DNNs predicted guitar with high certainty" for this pattern. Again, it's just a discriminator: if you had to assign one of the thousand classes to this pattern, why could the most likely one not be guitar? But I have a further problem

### Natural vs OOD Data [35:29]

with this. Let's go with their taxonomy and say there is IID data, from the same generation process, and there is OOD data. I think there are a number of different effects that they try to lump together under "whenever my model doesn't work on OOD data, it has learned a shortcut", and that is very broad. First of all, I would divide the OOD data up. Let's say our task is to build an object detector, whatever that means, for natural images. Then there is what I would call unnatural OOD data, and in there you'll find something like adversarial examples. Adversarial examples, at least under the interpretation of the Madry lab's "adversarial examples are features, not bugs", are constructed by combining features that don't naturally go together: you take the low-frequency features of a cat and add the high-frequency features of a dog, with a weighting factor lambda so high that to a DNN it looks like a dog, because it carries many of the dog's features, while to a human, who largely ignores the high-frequency features, it looks like a cat. These are unnatural because in actual nature, in the real world, the features never occur in this combination. That seems like a very different phenomenon from what I would call natural OOD data, where the features you're seeing simply never occurred in the training data set, but where, if you went from the real world and constructed data sets in different ways, there is some data set in which this data actually occurs. Natural OOD data is what most of the examples so far were about, like a cow on the beach: you've never seen that only because your data generation always produced cow plus grass.

So I think these are very different, and the last thing they also lump in here is fairness, the fairness and bias literature, where for example you have a resume classifier and the classifier ends up being biased by gender or something like this. I kind of struggle with this, although they do say not all fairness problems come from here. I would stress that some fairness problems do occur exactly here, because your data generation process is different from what you want. If you build this hiring classifier, you have to understand what it is: what you are training is a system that tells you "how would my human data set creators have decided on this particular application?". Of course there is the problem of bias amplification and so on, but it is not an infallible system; it simply tells you how the humans would decide, and if you collect the data set in a biased way, the machine will inherit that. On the other hand, the reason I don't think fairness really belongs here is that in fairness you actually have a kind of alternate world (I'll draw this in green: world prime). In the normal IID setting you always assume that the world is the world, and you want to learn a system that understands that world. In fairness, this prime world is your "super world": for the fairness literature it doesn't really matter whether, in the true world, two groups of people are equal in some respect or not; what matters is that they are treated equally by the system. So they impose some restriction or condition on their model. This may sound bad, but the mathematical formulation is such that you start with the super-knowledge "these two things must be equal", that is how you imagine your world, and then you try to learn a model such that this holds, whereas over here you do something different. Some of it is, as I said, in the same category, but it is a different take and a different literature. So I would focus on this part here, the natural OOD data, and not so much on the adversarial examples or on the fairness literature.

All right. So you can see here, no wonder this and this screw up an ImageNet classifier. And even this one: how do we know that it is natural? It looks pretty natural, but it's probably really specifically constructed, such that the probability that someone would take this picture with a camera in the real world is zero.
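The low-frequency/high-frequency mixing described above can be sketched roughly as follows. This is a loose illustration of the mechanics only, not the actual construction from the features-not-bugs paper; the random stand-in "images", the circular cutoff, and the weight `lam` are all placeholder assumptions:

```python
import numpy as np

def split_frequencies(img, cutoff):
    """Split an image into low- and high-frequency parts via a circular FFT mask."""
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    low_mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= cutoff ** 2
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high

# Stand-in "cat" and "dog" images (random noise here, just to show the mechanics).
rng = np.random.default_rng(0)
cat = rng.random((32, 32))
dog = rng.random((32, 32))

lam = 4.0  # hypothetical weight on the high-frequency "dog" features
cat_low, _ = split_frequencies(cat, cutoff=4)
_, dog_high = split_frequencies(dog, cutoff=4)
# Low frequencies (which dominate human perception) come from the cat;
# a DNN keying on high-frequency texture may read the mix as the dog.
mixed = cat_low + lam * dog_high
```

Note that the low and high parts sum back to the original image, so the mix differs from the cat only in the frequency band a human largely ignores.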

### Examples [42:04]

Cool. So they give some examples of where shortcut learning exists. In computer vision: adversarial examples, or shifting the image by a few pixels, though you have to say, shifting the image very precisely by a few pixels, such that the probability of this occurring in the data generation pipeline is zero. Then domain transfer, and I think that is a good example. In natural language processing, BERT has been found to rely on superficial cue words: for instance, it learned that within a data set of natural language arguments, detecting the presence of "not" was sufficient to perform above chance at finding the correct line of argumentation. Again, all we can do is construct data sets; if we could tell the model what to look at, we would just program the solution. So there's only one solution: get better data sets. Or, OK, the second solution is to get better inductive biases, but if we knew the correct inductive biases, we wouldn't have the problem. I like that they include NLP; this is very prevalent there, even more than in vision, this business of spurious correlations. In NLP the models usually just learn some kind of correlation between words and don't learn to understand the sentences at all, but this is because in NLP we have even more trouble constructing data sets that force the model to understand the text. And again, I can't tell you what "understanding the text" means.

By the way, humans do this too, in many, many forms, simply because the cost function is not aligned with what you would want. A specific example is news stories nowadays. What is the intent of news? To inform you, maybe. But the cost function is clicks. So what do you do? Right at the top, in the title, you write "orange man bad" and people click on it. A news story, I don't know, "Brad Pitt had a new baby": you just append "but orange man bad" and people click on it much more. Your clicks go up, your cost function goes up. This happens everywhere: you can't even prevent it with humans, so how do you expect your neural networks to avoid it? In agent-based reinforcement learning there is an example I find pretty funny, where an agent, instead of learning to play Tetris, simply learned to pause the game so as never to lose. Come on, that is objectively genius. And then of course fairness and algorithmic decision-making.

Then they discuss understanding these shortcuts, and they touch on a lot of the things I've touched on, including something I find a nice formulation, this Morgan's Canon for machine learning: assume that a machine learning system will learn the easiest feature it can, and that's oftentimes not what you want. They also touch on anthropomorphism, where you view everything through a human lens, which is not correct: these neural networks are not humans, and we should never attribute humanness to their solutions; "never attribute to high-level abilities that which can be adequately explained by shortcut learning". I agree with the paper in almost everything it says, except this: "detecting shortcuts: making OOD generalization tests a standard practice". For the reasons I gave before, I think that is counterproductive, and I've already said enough about it. "Designing good OOD tests": you can only design good OOD tests if you know the real underlying data distribution, which you don't. And then the principle of least effort: why are shortcuts learned? Because it's just easier. It's just easier to write a news story with just the words you know people will click on, like "these top ten things of blah blah, number seven will surprise you"; you don't actually have to come up with ten relevant things, the title alone is enough to get you the clicks. So the least-effort way to solve the cost function might not align with what you want. And also the inductive biases: as I said, we humans have certain inductive biases, the neural networks don't have them, and we need to take this into account, but the solution is to build training data sets that take it into account.

They close with "beyond shortcut learning", a kind of outlook, and then a conclusion. But we're already at some 45 minutes of video, and if you're still here, respect, or maybe you just have this running in the background for some company. I will finish by saying thank you for watching, and leave your comments; since this is mostly opinion, I would be interested in hearing your thoughts on it. With that, I say bye-bye.
