# Underspecification Presents Challenges for Credibility in Modern Machine Learning (Paper Explained)

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=gch94ttuy5s
- **Date:** 10.11.2020
- **Duration:** 59:21
- **Views:** 19,256
- **Source:** https://ekstraktznaniy.ru/video/13277

## Description

#ai #research #machinelearning

Deep Learning models are often overparameterized and have many degrees of freedom, which leads to many local minima that all perform equally well on the test set. But it turns out that even though they all generalize in-distribution, the performance of these models can be drastically different when tested out-of-distribution. Notably, in many cases, a good model can actually be found among all these candidates, but it seems impossible to select it. This paper describes this problem, which it calls underspecification, and gives several theoretical and practical examples.

OUTLINE:
0:00 - Intro & Overview
2:00 - Underspecification of ML Pipelines
11:15 - Stress Tests
12:40 - Epidemiological Example
20:45 - Theoretical Model
26:55 - Example from Medical Genomics
34:00 - ImageNet-C Example
36:50 - BERT Models
56:55 - Conclusion & Comments

Paper: https://arxiv.org/abs/2011.03395

Abstract:
ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. […]

## Transcript

### Intro & Overview [0:00]

Hi there! Today we'll look at "Underspecification Presents Challenges for Credibility in Modern Machine Learning" by Alexander D'Amour, Katherine Heller, Dan Moldovan, and what seems like half of Google, plus collaborators from MIT and "Google" with a whitespace. There are a lot of authors here, and I'm not sure what they all contributed, but three are designated as main authors, which I guess is legit; the author list looks more like something from a CERN physics paper. We'll dive into what the paper claims. It's a paper that looks at machine learning pipelines from a fairly high level, but it gives very concrete examples of what it's talking about. The problem the paper identifies is something it calls underspecification, which is related to problems that have been identified in the past, but the authors make a clear distinction about what underspecification is, what problems it leads to, how it manifests, and, to an extent, what its causes are. It's a very long paper, some 30 pages of main text, so we won't go through all of it; I'll pick out the parts I think are relevant to the main story, and I'll criticize it a bit, because I think it warrants some criticism. So bear with me. If you like videos like this, don't hesitate to share them and tell your friends, and let me know what you think in the comments; I think this is a good topic for discussion. The question to keep in mind while going through this paper is: do they really demonstrate what they claim? That

### Underspecification of ML Pipelines [2:00]

was my main question when going through this. So let's dive into the abstract. They say: "ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains." I think we all have a sense of what that means; we all know examples where ML models perform fine on our training and test data in the lab, but once deployed into the world they don't do so well. They say: "We identify underspecification as a key reason for these failures." Note they're not saying it's *the* key reason; it's *a* key reason, and that distinction matters. Then they define it: "An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning."

So what does that mean? You have a big training data set, you train your model on it, and then you evaluate it on a test set. The training and test sets usually come from the same distribution; often you simply split your data into train and test, and with that you measure some sort of generalization capability. There are assumptions baked in here, namely that the data is i.i.d. and, essentially, that the data your model will be applied to in the real world is similar to the data you trained it on. If that holds, a procedure like this gives you a fairly good estimate of how your model will perform in practice. However, you then take that model and deploy it to the real world, and in the real world (look, I'm terrible at drawing real worlds: this is Europe, yay, this is Africa) you might have very different distributions of data, and the model might not perform as well anymore.

Of course, they're not the first to notice this particular problem; distribution shift and so on are well known. What they're saying is this: suppose the training procedure up here is a deep learning system. Such a system has many local minima, starting from your choice of optimizer, batch size, hyperparameters, network architecture, and so on. Let's call all of these hyperparameters, even the different procedures: the learning rate, the architecture, the batch size, all kinds of things. What they experiment with here is the most innocuous hyperparameter of all: the random seed. Even if everything else stays the same, switching the random seed necessarily lands you in a different local minimum, and all of these give you different models. We know that in deep learning there is effectively a continuum of local minima, and they're all about as good as each other; notably, all of the resulting models perform well on the i.i.d. test set. So you train a bunch of models, perhaps varying only the random seed, and most of them work quite well on the test data. However, they can exhibit very different performance when applied to the real world. Maybe this model, applied to the real world, works fine, but that model, all of a sudden, doesn't work.

The underspecification problem they identify is this: all the models from your training procedure perform equally well on the test set, yet they perform very differently in the real world. Specifically, there is at least one model that does perform well even in the real world, and at least one other that doesn't. The pipeline is underspecified: the train/test split simply doesn't capture the variation along some important property of the real world. The pipeline that produces the model doesn't care about that property, so it's essentially random whether that feature ends up being included, excluded, important, or unimportant; it depends on which local minimum you happen to land in, and by looking at the test set alone you can't tell whether a given model will perform well in the real world or not. That is underspecification, and it's very different from the usual domain-shift argument. Usually you say the test set simply isn't the same as the real world, and therefore the model performs well on the test set but not so much in deployment. Here the claim is more specific: one of the good models that comes out of the procedure, one of the random seeds, would actually work well in the real world, while another one doesn't.

So of course that is a problem, and the way the paper goes about it is to give some examples of how this plays out. In my opinion, the examples don't fully convince me: I see their point, but the examples are, let's say, half convincing. At the end they give some recommendations, and there is prior work here; essentially, you have to add constraints. If you want to solve this problem, there are two ways. Either you test the models: you take every model that comes out of your pipeline, test each one on the real-world quantities you care about, and deploy the one that works; but that means you again need some kind of test data set from that real world. Or, since the model is underspecified, you try to bring in more of the specification you care about during the training pipeline, making sure the model you care about is the one that actually gets returned.

They don't demonstrate this here, and that's my criticism: they demonstrate the problem, in a way that doesn't fully convince me, but they also do not demonstrate a solution. They never go ahead and say: now we actually perform this additional specification, and look, what comes out is still a well-performing model, but with that issue fixed. They don't do that, so keep an eye out for it. We'll go through the paper as I said, but first a bit more of the abstract, so you hear it in their words: "Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains." That's what I said: it's a different problem from the classic domain shift, or data drift, or whatever you want to call it. "We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging," and so on. "Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain." Fair enough; this is actually a problem, and if you deploy ML in the real world it's very appropriate to care about these kinds of issues. I'm not saying you shouldn't. So let's actually jump into

### Stress Tests [11:15]

the first example. They have this notion of what they call a stress test. A stress test, as I understand it, is nothing other than testing one particular aspect of the model. They're going to have a couple of examples: in one, they have an NLP pipeline that is supposed to do, say, pronoun resolution, and one of the stress tests is whether that model is sensitive to gender stereotypes. The assumption is that pronoun resolution should be a purely linguistic task; it shouldn't really carry any bias toward gender stereotypes, or at least not an exaggerated one compared to actual real-world base rates. The stress test, then, measures that particular dimension, the gender-stereotype dimension, in the model and sees how it performs. And what we're specifically looking for is large variance: are there models that behave the same on the training and test sets but vary widely on these stress tests? So the first model

### Epidemiological Example [12:40]

here is this epidemiological model. They say a simple epidemiological model (appropriate for our times, I guess) specifies how an infectious disease moves through a population, given certain parameters. There are two parameters; you can see the differential equations right here. The parameter beta represents the transmission rate of the disease from the infected to the susceptible population, and the parameter D represents the average duration that an infected individual remains infectious. Once you plug in those parameters, you start from some initial population: S is the susceptible population, I is infected, R is recovered. You start with one hundred percent susceptible, zero infected, zero recovered, let the dynamics play out, and see what happens. So this is a model, and it gives you curves like these: depending on the D parameter and the beta parameter you get different curves, but they all have the same general shape. The number of infected starts at zero, shoots up, and then, as herd immunity kicks in, comes back down. It's quite a simple model.

Their setup is this: suppose, hypothetically, this is the beginning of a pandemic, and I give you some data points; at the beginning we're at zero, then we have some infections, then some more, then some more. Now please predict the trajectory of the epidemic from these data points. What you want to do is fit the two parameters to the data points. There is actually a unique solution; however, because of the exponential rise of the trajectory, the solution is numerically not well specified. They say: "Importantly, during the early stages of an epidemic, when the observations are small, the parameters of the model are underspecified by this training task. This is because, at this stage, the number of susceptible is approximately constant at the total population size." That means if you have a low number of infected people, the number of people that could still get infected is pretty much everyone; there is no herd immunity yet, and the number of infections grows approximately exponentially at a rate in which both parameters appear. So if you derive from your data points that this rate must be, say, five, there are many settings of beta and D that make that rate five; in fact, there are infinitely many pairs. They say this is a classic example of underspecification: there are many different predictors, each of which fits the data you have. You could even split these data points into train and test, say three points for training and one for testing, and there would still be many predictors that fit. You see two of them here: the blue and the red fit the data about equally well, yet they have obviously very different trajectories.

Here I already only half agree. Yes, numerically these look kind of similar, but clearly one fits better than the other, so I'm not sure this is a great example of underspecification. But we can give them the benefit of the doubt and say, okay, they wanted a simple model. So this is one of these models that is underspecified: it performs well on this data, but on the later data it performs drastically differently, and that's the important part here: drastically differently. If the real trajectory of the epidemic looks like this, then there is a predictor, namely the one with D equal to 28, that actually performs well. It's not that the training setup differs from the real world; it's that the variance of predictors is so large with respect to the data over here that some perform well while the others perform pretty poorly.

And they show this isn't only the case for the initial fit: if you do the same thing but simply use a different initialization for your parameters, namely either a gamma or a normal distribution, that alone already gives you very different results. Different initialization distributions result in different distributions of predicted trajectories. This, I feel, is a much better example of what they want to demonstrate. They do many different runs here, and you can clearly see that the blue curves, initialized with a normal distribution, are on average significantly lower than the red curves. Same data, same procedure, same everything, but you get different outcomes even in expectation, simply from how you initialize the parameters. This is a very good example of what they mean: underspecification leaves this variance.

Now, what would a good specification look like? In this case, one specification would be having some theoretical reason for choosing one of the two initializers; that could solve the problem. Another one, probably more practical, would be to incorporate data from the later part of the epidemic, so that you know which model to pick; but for an epidemic that amounts to saying "I can tell you how it turns out once I know how it turns out." And that's a bit of a problem, because it already shows that sometimes adding these extra specifications, checking whether the model does what you want along the specific axis that has large variance, is just not possible, as here. But the example
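The early-growth degeneracy in this example is easy to reproduce. Below is a minimal sketch (my own toy code, not from the paper, with made-up numbers): a forward-Euler SIR simulation in which two different (beta, D) pairs share the same early growth rate beta − 1/D, so they fit early case counts almost identically while predicting very different epidemics.

```python
# Toy SIR simulation (forward Euler). Two parameter pairs with the same
# early growth rate r = beta - 1/D fit early data almost identically but
# predict very different epidemics -- the underspecification in the example.

def sir_infected(beta, D, n_days, N=1.0, I0=1e-4, dt=0.1):
    """Return the fraction infected at the start of each day."""
    S, I, R = N - I0, I0, 0.0
    traj = []
    for _ in range(n_days):
        traj.append(I)
        for _ in range(int(1 / dt)):
            new_inf = beta * S * I / N * dt   # S -> I transitions
            new_rec = I / D * dt              # I -> R transitions
            S -= new_inf
            I += new_inf - new_rec
            R += new_rec
    return traj

# Both pairs have r = beta - 1/D = 0.3, so early case counts match...
fast = sir_infected(beta=0.40, D=10, n_days=120)   # R0 = beta*D = 4
slow = sir_infected(beta=0.35, D=20, n_days=120)   # R0 = beta*D = 7
# ...but the predicted peaks differ drastically.
print(fast[10], slow[10])      # ten days in: nearly identical
print(max(fast), max(slow))    # at the peak: very different
```

The analytic peak height 1 − (1 + ln R0)/R0 puts the two scenarios near 40% and 58% of the population at the peak, even though ten days in they are indistinguishable: exactly the kind of variance along an unchecked axis the paper is pointing at.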

### Theoretical Model [20:45]

is what it is. So the next thing they do is analyze this in a theoretical model. It's a kind of two-layer neural network where the first layer is completely random; it is not trained. What's trained is the second layer, so it's essentially a linear model on top of random features, the sort of model people often use in theoretical analyses. You assume some distribution on the data and on the weight-matrix entries, and then all you train is the theta parameter, and you can make theoretical statements about what happens with that model.

Their goal here is to show the following. Keep the data distribution fixed, and sample the random first layer: imagine W1, W2, W3, all different weight matrices. Can we come up with a model that performs well on all the weight matrices we throw at it, but such that, if we plug in different data, it stops performing well along one particular axis? As long as we only look at the training distribution, we're fine; but there is this one particular axis on which the model fails for some weight matrices and not for others. So the theoretical goal is to construct, as closely as possible, a model that conforms to their claims.

What they do is make use of adversarial perturbations. They say: for any given weight matrix, a shift can be chosen such that, one, it has a small norm, so it's essentially the same data that goes into the model; two, it leaves the risk of an independently sampled W mostly unchanged, which is exactly the specification that if I train the model and evaluate it on my original data, everything is fine; but three, it drastically increases the risk of the targeted W0. What this says is: if I have such a model, I can pick one weight matrix, say W3, and derive a data set X3 such that all the other weight matrices work just as well on X3 as on my original data, but this particular one fails on it. And that results from an adversarial perturbation targeted at exactly that weight matrix. So this construction produces a data set that conforms exactly to their claims.

It's a cool thing to show that this is possible: if you have an underspecified model, you can generally construct a situation that exactly conforms to the claims. However, this is cool in theory, and I don't think they demonstrate it much in the real examples. Maybe this was unclear, I'm not the best at explaining this type of thing, but here's the intuition. The weight matrices you get out of your training procedure can be fairly different; think of them as vectors W1, W2, W3, W4, as if your neural network had just two weights. The weight matrices, and their solutions, can be drastically different, but I can construct an adversarial data set that points, let's say (very simplified), exactly in the opposite direction of one particular weight matrix. It works just fine with the other weight matrices, because the projection onto them is well behaved, but if I try to project it onto the targeted one (maybe I should have drawn it exactly orthogonal, but you get what I mean), it fails. I can target one of these models, and then, by construction, that one particular model, which is as good as all the others on regular data, will fail on this particular data set, whereas all the other models still work just fine. It's a theoretical analysis by construction. Cool, but if you make a claim and then construct a situation that exactly conforms to your claim, then of course it's going to conform to your claim. So
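To make this concrete, here is a small sketch in the spirit of that construction (my own simplification, not the paper's exact proposition: I drop the nonlinearity so each model reduces to an effective linear predictor, and the "shift" is a mean shift of the inputs). Several pipelines differ only in the frozen random first layer W_k; all have essentially the same i.i.d. test risk, yet a small input shift built from model 0's idiosyncratic component, the part of its effective weights orthogonal to the true signal, breaks model 0 while leaving the independently sampled models almost untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n, K = 200, 100, 1000, 6

w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)          # true linear signal
X = rng.normal(size=(n, d))
y = X @ w_star

# K "pipelines": same data, different frozen random first layer W_k,
# only the readout theta_k is trained (least squares). Without a ReLU,
# each model collapses to an effective linear predictor v_k = W_k^T theta_k.
vs = []
for _ in range(K):
    W = rng.normal(size=(m, d)) / np.sqrt(d)
    theta, *_ = np.linalg.lstsq(X @ W.T, y, rcond=None)
    vs.append(W.T @ theta)

X_te = rng.normal(size=(2000, d))
y_te = X_te @ w_star
mse = lambda v, Xs, ys: float(np.mean((Xs @ v - ys) ** 2))

clean = [mse(v, X_te, y_te) for v in vs]   # all roughly equal

# Adversarial shift targeted at model 0: the component of v_0 that is
# orthogonal to the true signal (model 0's idiosyncratic part).
delta = vs[0] - (vs[0] @ w_star) * w_star
delta /= np.linalg.norm(delta)
X_sh = X_te + 3.0 * delta                  # small mean shift of the inputs
y_sh = X_sh @ w_star                       # labels: still the true function
shifted = [mse(v, X_sh, y_sh) for v in vs]
# shifted[0] blows up; shifted[1:] stay near their clean risk.
```

Because delta is orthogonal to w_star the labels barely move, and because it is built from W_0's own row space, an independently sampled W_k is nearly blind to it in high dimension; the targeted model, though, projects strongly onto it. That is the by-construction nature I'm commenting on: the shift is defined with full knowledge of the model it is meant to break.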

### Example from Medical Genomics [26:55]

this next example is more grounded in the real world. It's a medical genomics example where they have training data, evaluation data from the same distribution, and evaluation data from out of distribution, so it looks more like a classic domain-shift setup, and our question is going to be how these things relate. If you train on the training data and then evaluate (this is normalized mean squared error, so lower is better), you get a spread of models; these are all the models that come out of the training procedure, and the red dot is a specific heuristic that performs just a bit better. What's going on is this: you have a bunch of data points, but they form clusters, and these methods take one representative out of each cluster and train a model only on the representatives. Because data points within a cluster are highly correlated, this tends to give better performance. The red dot uses a very particular heuristic to choose that representative, whereas the blue dots choose the representatives at random. So you can say that the only difference between all these models is how the representatives are selected, and you can see they all turn out fairly similar, with the red dot being just a little bit better. If you go to the in-distribution test set, performance drops, but everything still performs pretty well; the range of performance is fairly small, so all of these models are okay-ish. But now go to the out-of-distribution evaluation sets, and the range of performance becomes very big. The point they're trying to make, I think, is this: look at the best-performing models there; they're on the level of the in-distribution test performance, but not all of the models are. A well-performing model would be among the models you get, but you simply can't tell which one from looking at the test set alone. That's their claim.

And they have a further graphic where they show it's not as simple as just taking the best one: they plot how well a model does on the in-distribution eval set versus the out-of-distribution eval set, and the correlation, if it's there at all, is fairly weak. You would expect something like a stretched-out line if in-distribution performance predicted out-of-distribution performance, but here there's just no way to tell, at least for this particular data set.

So that's an example of what they mean by underspecification. However, I fail to see the difference from classic data drift, beyond the observation that the best points are on the same level as the test distribution. The mean performance simply drops and the variance between models increases; with a different eval set the ordering of models would be different, and the picture would look the same. What I'd want to see, for example, is the same analysis between the training set and the test set: is it also pretty much random, going from training performance, to predict even the ordering of test-set performance? They never do anything like that. If that analysis came out substantially different, you could argue that this really is something other than ordinary generalization, that it's genuinely due to underspecification, because going from this data set to that one you have a different specification. But to me it seems this is just a domain-drift problem. And if you look closely, the best out-of-distribution performance is actually lower than the best in-distribution performance, so strictly speaking this doesn't even fall under their definition.

So I'm not really sure what to make of these examples. I get what they're trying to say, but except for the theoretical part, where they construct the examples, it doesn't convince me that this isn't just domain drift, the same problem other people have described. And secondly, it doesn't convince me that adding the specification would solve the problem, because in the experiments so far, notice, we have never seen a method from them that says: let's just fix the problem, add the specification, and show that we really keep the in-distribution performance while bringing the out-of-distribution performance up. That's the key thing: you want to keep this performance and bring that one up. We've often had fundamental trade-offs like this in explainability, fairness, or domain adaptation: if you want to bring one axis down, a natural effect is that the other goes up. So even if good models exist among the candidates, it might be that in order to consistently reach them you actually have to weaken the training procedure. The paper does not demonstrate that this is even possible. Okay, so they have a bunch of
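Here's a toy version of that setup (mine, not theirs, with invented numbers) that reproduces the qualitative picture: the pipelines differ only in which cluster representative they train on, in-distribution errors are tightly grouped, and out-of-distribution errors are both much larger and much more spread out, because the representative choice only pins the model down on the subspace the in-distribution data actually occupies.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_clusters, members, K = 30, 10, 8, 12

w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)

# Training data lives near a 10-dim subspace: cluster centers plus small
# noise, mimicking highly correlated samples within each cluster.
basis = np.linalg.qr(rng.normal(size=(d, n_clusters)))[0]        # d x 10
centers = rng.normal(size=(n_clusters, n_clusters)) @ basis.T    # 10 centers
clusters = centers[:, None, :] + 0.1 * rng.normal(size=(n_clusters, members, d))

def fit_on_representatives(seed):
    """One 'pipeline': pick a random representative per cluster, min-norm fit."""
    r = np.random.default_rng(seed)
    reps = np.stack([clusters[c, r.integers(members)] for c in range(n_clusters)])
    w_hat, *_ = np.linalg.lstsq(reps, reps @ w_star, rcond=None)
    return w_hat

models = [fit_on_representatives(s) for s in range(K)]

# In-distribution eval: fresh members of the same clusters.
X_in = (centers[:, None, :] + 0.1 * rng.normal(size=(n_clusters, 50, d))).reshape(-1, d)
# Out-of-distribution eval: isotropic inputs, off the cluster subspace.
X_ood = rng.normal(size=(500, d))

def mse(w, Xs):
    return float(np.mean((Xs @ w - Xs @ w_star) ** 2))

in_dist = [mse(w, X_in) for w in models]
ood = [mse(w, X_ood) for w in models]
# in_dist: small and tightly grouped; ood: large and widely spread.
```

The point of the sketch is the same one I'd want the paper to pin down: all K pipelines look interchangeable on data shaped like the training data, and nothing in that evaluation tells you which of them to trust once the inputs leave the cluster subspace.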

### ImageNet-C Example [34:00]

more case studies, for example this ImageNet-C example. ImageNet-C takes ImageNet and applies a bunch of random but, let's say, well-specified perturbations to it, and they show the same thing again: all these models perform roughly equally on the plain ImageNet test set (they are trained all the same; only the random seed differs), yet they have a huge span of performance on these individual perturbations. And what you'll also notice is that it's not always the same model: the model that is good at the pixelate perturbation will not be so good at the contrast perturbation, and so on. So the question, which the paper also doesn't solve, is this: these stress tests are very specific things, like pixelate, and I can think of a million image perturbations that are orthogonal to pixelate. It's going to be essentially impossible to specify all of them and thereby remove the underspecification. Probably, by adding the pixelate specification, you simply worsen the problem for all the other things you still haven't specified, and you probably also lose a little performance on the actual test set if you incorporate it into training. The paper still hasn't shown that this is even possible. What is interesting is that they basically show you cannot predict the performance on one of these perturbations from the others; they appear to be essentially orthogonal. So it's not enough to have a bunch of perturbations and then be confident the model is robust to perturbations in general. I think the core message of the paper is: if you care about a specific axis, you have to go and check that specific axis, because otherwise you don't know what your model is doing; it could be doing something good or something bad along an axis you didn't specifically check. They do the same thing with skin lesions; they have all kinds of demonstrations. In NLP, they do
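A stripped-down version of this seed-variance effect (my own toy, standing in for "same pipeline, different seed"): the inputs contain two identical copies of the same features, so gradient descent can solve the task with any split of weight between the copies, and the split it ends up with is decided entirely by the random init. All runs have the same clean test error, but a "corruption" that blanks one copy, the analogue of a single ImageNet-C perturbation, exposes wildly different models.

```python
import numpy as np

rng = np.random.default_rng(0)
h, n = 10, 500
z = rng.normal(size=(n, h))
X = np.hstack([z, z])                  # two identical feature blocks
w_star = rng.normal(size=h)
w_star /= np.linalg.norm(w_star)
y = z @ w_star

def train(seed, steps=500, lr=0.05):
    """Full-batch gradient descent from a random init (the 'seed')."""
    w = np.random.default_rng(seed).normal(scale=0.5, size=2 * h)
    for _ in range(steps):
        w -= lr * (2 / n) * X.T @ (X @ w - y)
    return w

z_te = rng.normal(size=(1000, h))
y_te = z_te @ w_star
clean, corrupt = [], []
for seed in range(8):
    w = train(seed)
    clean.append(float(np.mean((np.hstack([z_te, z_te]) @ w - y_te) ** 2)))
    # "Corruption": zero out the second feature block at test time.
    corrupt.append(float(np.mean((np.hstack([z_te, 0 * z_te]) @ w - y_te) ** 2)))
# clean errors: all ~0. corrupt errors: large and seed-dependent, because
# GD only constrains the SUM of the two weight blocks; their difference
# stays at whatever the random init happened to be.
```

The loss gradient with respect to the two blocks is identical, so gradient descent never moves their difference; which copy the model "relies on" is pure seed luck, just as which ImageNet-C corruption a given seed happens to survive is.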

### BERT Models [36:50]

tests with BERT. And this is interesting, because they test different random seeds not only for fine-tuning BERT but also for pre-training. In these language models you have a pre-training phase and then a fine-tuning phase, and both have random seeds. They're going to show that even the random seed of the pre-training already plays a big role in how these models perform on the stress tests, which I find pretty interesting. They do this with respect to gender datasets that were constructed to assess the fairness of these models. The data looks like the following. You have a sentence, say "a doctor is walking" — it's always some profession used in a sentence — and then you simply replace that entity with "a man" or "a woman". You replace it twice, embed all of these sentences, and ask your model how similar the sentences are — I presume by simply taking the inner product of the embeddings, though you can also train that part. They write that on the semantic textual similarity (STS) benchmark, part of GLUE, their ensemble of predictors achieves consistent accuracy, measured in terms of correlation with human-provided similarity scores, within a narrow range. So you have a model that can predict similarity in text — just similarity; it knows nothing about gender. You simply train it on a dataset to predict textual similarity, and then you ask it: is this reference sentence more similar to the version where I replace the entity with "a woman", or the version with "a man"? What you look at is the difference between the two similarities. If this delta is positive, the sentence is more similar to the "woman" version; if it's negative, the "man" version. If the model is, let's say, insensitive to the gender dimension, you'd expect a delta of zero, at least in expectation. As they put it: "a model that does not learn a gendered correlation for a given profession will have an expected similarity delta of zero. We are particularly interested in the extent to which the similarity delta for each profession correlates with the percentage of women actually employed in that profession, as measured by the US Bureau of Labor Statistics." In my opinion this is already an improved assessment compared to what usually happens in the fairness literature, where people just say: if it's anything but 50/50, we're angry. Which — I get it, in some cases you really need to build a model that is 50/50. But to assess what they assess here, the question is whether the model spuriously picks up this feature. If the model is, let's say, perfect and does only the task we need it to do, it will learn the association between a profession and gender in exactly the proportion in which it occurs in the text — which I guess is proportional to how often it occurs in the world. If, however, the model for some reason uses this as a feature more or less than it should, we see a discrepancy. And why is that important? Because of deployment. In this plot, the model can perfectly solve the task while sitting at zero — where the delta between similarity and profession percentage has no correlation — but it can probably solve the task equally well sitting here, or here, or here. And if at the end we just happen to pick this model right here, that model, more or less by chance, associates particular professions with one gender much more strongly than the other. And depending on what we use
the model for — and we seldom use a model on the exact task and data we trained it on — this might cause adverse effects. I want to stress that this is not the same as the classic fairness literature. This really considers models that all perform equally well on the test set of the particular task; and since the task is underspecified and the models overparameterized, there are many ways to solve it. Some of those ways include this feature, some the opposite feature, and if we happen to pick one at the extreme, the model carries that feature. It might not matter for this task, but it might cause something bad on a task we ultimately apply the model to. So they do this for similarity, and they also do pronoun resolution. They report a large spread in correlation with the BLS statistics: on the STS task, correlations range from 0.3 to 0.7, and on the pronoun-resolution task the range is similar. As a point of comparison, prior work on gender shortcuts in pronoun resolution found correlations in the same ballpark. They also say there is a weak relationship between test accuracy and gendered correlation: a Spearman correlation coefficient of 0.08, which is a weak correlation — in fact, the confidence interval includes zero. That's for pronoun resolution; for similarity it's 0.21, which is an okay correlation, and there the confidence interval just barely includes zero, so we're fairly sure. I'm not a statistician, don't grill me about p-values. They say this indicates that learning accurate predictors does not require learning strong gendered correlations — which is a statement you can make, though I would say such an overparameterized, underspecified model will probably pick up this feature fairly often, since the correlation is there. But they are right: it does not require strong correlations. They also say the encoding of spurious correlations is sensitive to the random seed at pre-training, not just fine-tuning, which is very interesting, especially on the pronoun-resolution task (which I don't want to go into too much here). Here you can see two different runs — two different random seeds — that result in two very different predictors. The similarity delta (the difference we looked at before) is plotted against the percentage of female workers by occupation, for individual occupations, and this predictor has a stronger correlation than that one. Now, I've thought about it, and I'm still not sure which one is, let's call it, the better one. You can say the bottom predictor has less correlation with actual occupation statistics — I think that makes it worse — but you might argue a model just shouldn't care at all; yet then its delta isn't zero either, whereas the top predictor actually crosses zero roughly at the point where the occupation is 50/50.
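As an aside, the check described above — rank-correlating per-profession similarity deltas with BLS female-participation percentages — is easy to sketch. Everything below is toy data I made up for illustration, not the paper's numbers; in a real pipeline each delta would come from the model's embeddings, e.g. cosine(reference, woman-variant) minus cosine(reference, man-variant):

```python
import math

def ranks(values):
    """1-based average ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean rank of the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy numbers: similarity deltas sim(woman) - sim(man) for five professions
# from one hypothetical fine-tuned predictor, and invented
# female-participation rates for the same professions.
deltas = [0.12, 0.05, -0.08, -0.15, 0.20]
pct_female = [90.0, 75.0, 30.0, 5.0, 85.0]
print(round(spearman(deltas, pct_female), 3))  # → 0.9
```

Running this once per predictor in the ensemble would give per-seed correlations whose spread corresponds to the 0.3-to-0.7 range the paper reports on STS.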
So I'm going to tacitly argue that the top predictor is the one you want, but I don't know. The important part is that the paper doesn't make a strong, opinionated claim about which one you want; it just says you should be aware that both predictors solve the task very well, yet they are drastically different in how they treat this feature. And here you can see there's not really a correlation between this score and the test accuracy: you can't tell from the test set how a model will perform on this particular stress test. Very interestingly, on the pronoun-resolution task they plot results by pre-training seed, and the seeds clearly cluster. So even the pre-training seed has an influence on this downstream behavior. I guess it's kind of logical, but it's still interesting to see that this clusters so well while all these models solve the task equally well. It basically means you can't just take a BERT checkpoint and fine-tune it with some objective on top — you might already have to worry about how the pre-training happened. Maybe you can fix it afterwards, I don't know; that's something they don't show. They then analyze it a bit more, taking 20 of those predictors: "To better understand the differences between predictors, in our example we analyze the structure in how similarity scores produced by the predictors in our ensemble deviate from the ensemble mean. Here we find that the main axis of variation aligns, at least at its extremes, with differences in how predictors represent stereotypical associations between profession and gender." These datasets, by the way, are annotated — constructed such that the stereotypes manifest or don't manifest, depending on how much your model picked them up during training. "Specifically, we perform principal component analysis over similarity scores produced by 20 fine-tunings of a single BERT checkpoint."
So, 20 different models. "We plot the first principal component, which contains 22% of the variation in score deviations, against the female participation percentages in Figure 9. Notably, examples in the region where the first principal component values are strongly negative include some of the strongest gender imbalances." Let's look at this graphic, because this is where I get a bit skeptical. First, the plots on the left. What you have is the first principal component of the resulting similarity scores; I'm going to guess each dot is one of the models and each line is one of the professions. So for a given profession — this one here appears to have roughly a 20% female occupation rate — the spread shows where the different models happen to land on the first principal component, i.e. the axis of largest variation in the dataset. The first very notable thing is that the models are spread out quite a bit: for the same profession, the component is sometimes very negative and sometimes very positive. That is the strange part, and it's exactly what this paper points out — all these models perform equally well on the test set of the task they care about. This panel is with "man" as the subject, so the occupations up here would be something like mine worker or oil-rig worker, and at the bottom you'd have the more stereotypically female professions, like nurse. A couple of things to note. The red dots are theirs: they take the extremes — wherever, I think, the first principal component is around negative one — look at them, and make the point that the first principal component, at its extremes, displays the most anti-stereotypical examples. What you have to see is that these dots are where the first principal component is loaded strongly negatively, and the red-dot sentences are things like "a receptionist is crawling". Since the plot is for "man" as the subject, you measure the similarity between "a receptionist is crawling" and "a man is crawling", and compare it to the similarity of "a receptionist is crawling" with "a woman is crawling". It's really meta. Their claim is that this first principal component largely encodes this feature, and I think their point is: see, even though we don't train for it, there are models that very much over-rely on these stereotypes. However, this feels a bit shady to me, because — look at this data — you can't just pick those outliers; these up here are outliers too. And they conveniently pick, I guess, such that those points are left out. Here, with "woman" as the subject, if the models really picked up a lot of this spurious correlation, you'd expect a line: a shift here and then up here, because at nearly 100% women the first component should load strongly. You don't see that at all. Here you do see a slight slope, but if you look at the noise between the points — this one here versus this one over here — the in-between
noise is much bigger. To then claim that the first principal component contains something like this, while ignoring the outliers up here — I don't know. I see what they're trying to say, and what is genuinely concerning is that there is such a big spread among the models: within these professions there is a giant spread between equally performing models. So I see their point, but I'm not convinced by the specific claim they make here; I don't know if it's politics or something that makes them bring in these topics. They also look at other dimensions, showing that the models perform differently across different stress-test dimensions, and notably the ordering isn't the same. But again, I feel this might simply be a problem of domain shift rather than what they're claiming. Lastly, they run the models on other NLP stress tests, and you can see that they perform quite differently: there's a spread within each of these. The red bar is the spread on the actual test set, as I understand it, and these are the different pre-training seeds — again, even the pre-training seed has a big effect. What I would like to see is whether even the training performance predicts the test performance on the same distribution; that alone would be quite informative. As you can see, you can't really predict one of these stress tests from another. The question is whether you can even do this from the training set to the test set, because that would tell you whether this is a property of the stress test pointing in a different direction — one you didn't capture. If these stress tests are really meant to show that you can't control an axis you didn't specify — that this really is underspecification — you would expect that from the training performance you could at least somewhat predict the test performance, or from the test performance predict performance on an i.i.d. test set. I'm going to assume it is somewhat like this, but I'm not sure it's anything to rely on. The last thing they do is a lab study where they take vital signs and predict whether there is a medical problem. Here they even test different architectures, and the point is basically the same, just shown on different data. It's pretty cool that they have lots of different examples, but I don't want to go into
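For intuition on the PCA step discussed above — the first principal component of how each fine-tuning's similarity scores deviate from the ensemble mean — here is a minimal sketch. The score matrix is invented toy data and the power-iteration helper is my own construction, not the paper's code:

```python
import math
import random

def first_principal_component(rows, iters=200, seed=0):
    """First principal component of a (predictors x professions) score
    matrix via power iteration on the centered data. Returns the unit
    direction and each predictor's projection onto it (its 'PC-1 score')."""
    n, d = len(rows), len(rows[0])
    # Work with deviations from the ensemble mean, as in the analysis.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    x = [[r[j] - means[j] for j in range(d)] for r in rows]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        xv = [sum(xi[j] * v[j] for j in range(d)) for xi in x]          # X v
        w = [sum(x[i][j] * xv[i] for i in range(n)) for j in range(d)]  # X^T X v
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]  # renormalize each step
    proj = [sum(xi[j] * v[j] for j in range(d)) for xi in x]
    return v, proj

# Toy similarity-delta matrix: 4 predictors x 3 professions (made up).
scores = [
    [0.30, 0.20, -0.25],
    [0.10, 0.05, -0.10],
    [-0.20, -0.15, 0.20],
    [-0.10, -0.05, 0.10],
]
direction, pc1 = first_principal_component(scores)
means = [sum(r[j] for r in scores) / len(scores) for j in range(3)]
total = sum((r[j] - means[j]) ** 2 for r in scores for j in range(3))
# Fraction of the score variation captured by PC-1 (cf. the paper's 22%).
print(round(sum(p * p for p in pc1) / total, 2))
```

Plotting `pc1` against female-participation percentages per profession would reproduce the kind of figure discussed above, with each predictor landing somewhere along the component.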

### Conclusion & Comments [56:55]

the lab thing in detail. So, their discussion at the end I think is kind of weak, because what they say is: our findings underscore the need to thoroughly test models on application-specific tasks, and in particular to check that performance on these tasks is stable. I fully agree with that — if you deploy your model into some real-world application, please test whether it actually works in that real-world application. But that doesn't seem to me to be a full solution to the problem, because, as we saw in the epidemiology example, sometimes that just isn't possible. Also, not everyone can train a language model, so we need pre-trained checkpoints. Maybe the goal is that providers ship more of them: instead of one BERT checkpoint, Google could provide, say, 50, and then people could check which one is good or bad on the particular dimension they care about — a dimension the pre-training didn't care about. I think that would be a practical solution if you can't specify the constraint up front. I would also say it's not clear to me that it is always possible — in theory maybe, but not obviously in practice — to add the specification you want and keep the same performance. I see that there are predictors in the set they consider that satisfy both, but that doesn't mean that once you add the constraint, the training procedure reaches that same performance, and specifically keeps the performance on the test set. So that's a number of criticisms of this paper. All in all, it's a paper you can generally agree with — agree with the sentiment and with the analysis. The examples are real and the problem is real, and especially for a company like Google this is fairly important, because they build and deploy big models. All right, let me know what you think about this. I'll see you next time. Bye.
