Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Yannic Kilcher · 02.02.2019


Video description
https://arxiv.org/abs/1502.03167 Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters. Authors: Sergey Ioffe, Christian Szegedy

Table of contents (5 segments)

Introduction

Hi, today we're looking at "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" by Sergey Ioffe and Christian Szegedy. Yeah, I'm not the best at pronouncing names.

What is Batch Normalization

All right. This is a bit of an older paper, but I think it's still good to look at: it's relevant, and people often just throw batch normalization into their networks without really knowing what it's doing. So let's look at it. What the authors argue is that a network usually has a structure like this: in a two-layer network, your loss is a composition of the first layer on the input u, with parameters θ1, and the second layer, with parameters θ2. Conceptually that looks something like this: you have your input, maybe an image, and you put it through the network and it becomes some intermediate representation, a hidden representation h1, which the next layer then turns into h2, and so on. The layers here would be weight matrices W1, W2 that transform the image into a new representation. Their point is that if you only consider a single layer, say the first layer, that's the same as considering the second layer with h1 as its input: it's pretty natural to see each layer of the neural network as its own transformation, taking inputs and producing outputs.

Now, what people usually do in machine learning with the very first input, the data, is so-called whitening. Say you have 2D data in a coordinate system and you want to do something like a linear regression on it, but the data sits far from the origin and is elongated. It suits you to transform this data by first finding its mean and subtracting it, and then dividing by its standard deviation in each direction. After that, the mean sits in the middle and the data is not so elongated anymore, and you'll have a much easier time learning on it. That's simply because our classifiers usually rely on things like inner products: in the original data, any vector you take is far from the mean, so the inner products are large no matter what, and two random points tend to point in almost the same direction from the mean; in the whitened data, two random points look uniform in their directions. So in this sense, we know that machine learning methods work better if we whiten the data first.

The authors ask: why do we only do this at the very beginning? If each layer takes its input and learns something, then each layer is basically a machine learning method of its own, so why don't we whiten the data going into every single layer, every sub-component of a deep network? That's the basic step here. They also discuss how this has been tried before and why the naive versions aren't so good: mainly, you need to intermingle the whitening with training the network, and if you go about it naively you produce artifacts from training. That's the section where they argue you can't do this super naively. What they actually do isn't complicated, they just do it in a smart way, so we'll jump directly to that: what they call normalization via mini-batch statistics.
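The whitening step just described can be sketched in a few lines of numpy (the toy data and variable names are my own illustration, not from the video):

```python
import numpy as np

def whiten(data, eps=1e-8):
    """Per-dimension whitening: subtract the mean and divide by the
    standard deviation, so every dimension becomes roughly
    zero-mean and unit-variance."""
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    return (data - mean) / (std + eps)  # eps guards against zero variance

# Elongated 2D data far from the origin, like the sketch above
rng = np.random.default_rng(0)
data = rng.normal(loc=[10.0, 5.0], scale=[4.0, 0.5], size=(1000, 2))
white = whiten(data)  # now centered at the origin, roughly isotropic
```

Strictly speaking, full whitening also decorrelates the dimensions; this per-dimension version is exactly the simplification that batch norm itself settles on.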
All right, say we have some d-dimensional input x, and we only look at it per dimension: we only care about normalizing each individual dimension. We take the k-th dimension and subtract from it the mean of the k-th dimension within a mini-batch of data (a mini-batch being maybe 32 or 100 examples or so), and then we divide by the standard deviation over that mini-batch. So you compute μ_B, the mean of the mini-batch, which is simply the empirical mean of the data at that particular layer, and you compute σ²_B, the empirical estimate of the variance on that particular mini-batch, and then you transform your data:

x̂ = (x − μ_B) / √(σ²_B + ε)

The small constant ε is simply there to prevent dividing by very small values, so you don't run into numerical problems. This does basically what we did above with whitening.

But now they say: we want to make sure this transformation can also represent the identity, because sometimes the natural baseline for what to do with your input before giving it to the next layer is to do nothing to it. With the normalization alone you would never end up with the identity transform, except if the mean happened to be exactly zero and the variance exactly one. So they introduce two new parameters, γ and β, which are learned like the other parameters in the network. Per dimension, γ is simply a scalar that the normalized x̂ is multiplied by, and β is a scalar that is then added: y = γ·x̂ + β. So in each dimension of your hidden representation, you learn how to scale it and how to shift it, after you've done the normalization. This might seem redundant, but it's really powerful: you're basically saying, this input distribution probably isn't the best, this normalized one probably is better, but if the back-propagation algorithm decides that the first representation was actually useful, it has the option of going back, or on to some other form of distribution. To be precise, it can't reach any distribution, since it only learns a per-dimension scale and shift, but the potential to transform the distribution with these learned scalars is still pretty big.

So basically that's it, that's the whole shebang: you normalize the inputs to each layer by this formula, and you introduce new parameters γ and β that you learn along with your network parameters. This has some implications. First, if you build a batch norm into your network, it learns this +β, which is basically a bias parameter. Batch norm isn't a fully connected layer, since β acts only per dimension, but the bias in a fully connected layer is also just per dimension, so β is equivalent to the bias of a fully connected layer. If you have a batch normalization after a fully connected or convolutional layer, or anything that has a bias parameter, it's almost not worth learning both; you would rather keep only the one from the batch normalization and use the convolutional or fully connected layer without a bias.

Another implication is that we have just lost the ability to do deterministic test-time inference. Much like dropout, which randomly drops out nodes, here we have quantities that depend on the mini-batch: not only on the individual sample, but on which other samples were randomly selected to be trained alongside it. That's awkward if you want something deterministic and reproducible at test time. So what people do, and this is discussed here, is that while training they use the quantities we just discussed, but they also keep running averages over them: in each mini-batch you compute the mini-batch mean and the mini-batch variance, and you keep running averages of both. At test time you plug in these running averages, so nothing depends on the mini-batch anymore. That's a pretty neat trick, I think, and you can even imagine, at the end of training, using these averages to fine-tune the weights to these exact statistics. This is something you have to pay attention to: neural network libraries usually have a flag for whether the network is in train mode or in test mode, and depending on that, the batch-norm layer will use the mini-batch statistics or the statistics over the whole dataset.
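Putting the pieces together, here is a minimal sketch of such a batch-norm layer with the train/test distinction: batch statistics while training, running averages at inference. The class name, the exponential-moving-average update, and the momentum value are common implementation choices of mine, not prescribed by the paper:

```python
import numpy as np

class BatchNorm1d:
    """Sketch of a batch-norm layer over (N, D) inputs.

    Batch statistics are used in training mode; exponential running
    averages of them are used at test time."""

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(dim)      # learned scale, one per dimension
        self.beta = np.zeros(dim)      # learned shift, one per dimension
        self.running_mean = np.zeros(dim)
        self.running_var = np.ones(dim)
        self.eps = eps
        self.momentum = momentum
        self.training = True

    def __call__(self, x):
        if self.training:
            mu = x.mean(axis=0)
            var = x.var(axis=0)
            # keep running averages for deterministic test-time inference
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

The `training` attribute is exactly the train/eval switch the video warns you to pay attention to in the libraries.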

Training

So the second thing is training: how do you actually train this thing? We started with our multi-layer network up here, f2 composed with f1: first I put my input through f1, and then through f2. The back-propagation there is quite easy. Say you want ∂L/∂θ1: you first differentiate with respect to the hidden representation h1, the thing that comes out of f1, and then with respect to θ1, and so on, so you chain-rule your way through. But now, in between these layers, you have these batch-norm operations, and so the authors discuss how back-propagation works in their presence.
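The chain rule for a plain two-layer network, before batch norm enters the picture, can be sketched like this (a toy numpy example; the ReLU nonlinearity, the squared loss, and all shapes are my own choices for illustration):

```python
import numpy as np

# A two-layer network: h1 = f1(x; W1), y = f2(h1; W2).
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10))          # mini-batch of inputs
t = rng.normal(size=(32, 1))           # regression targets
W1 = 0.1 * rng.normal(size=(10, 20))   # theta_1
W2 = 0.1 * rng.normal(size=(20, 1))    # theta_2

# Forward pass
a1 = x @ W1
h1 = np.maximum(a1, 0.0)               # hidden representation h1 (ReLU)
y = h1 @ W2
loss = 0.5 * ((y - t) ** 2).mean()

# Backward pass: chain-rule through h1, as described above
dy = (y - t) / len(x)                  # dL/dy
dW2 = h1.T @ dy                        # dL/dW2
dh1 = dy @ W2.T                        # dL/dh1 ...
da1 = dh1 * (a1 > 0)                   # ... through the ReLU
dW1 = x.T @ da1                        # dL/dW1 = dL/dh1 * dh1/dW1
```

With batch-norm layers inserted between f1 and f2, each `dh1`-style step gains the extra gradient terms derived next.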

Back Propagation

Here is basically what they discuss; it actually pays to draw a graph of what's going on. Here is x, the input to our layer, or rather the mini-batch x1 through xm. From the x's we compute the mean μ_B, and from the x's together with μ_B we compute the estimate of the variance σ²_B; we need both. Now we take one of the x's, the i-th one, and use μ_B and σ²_B to compute x̂_i = (x_i − μ_B) / √(σ²_B + ε); we'll leave the little constant ε out for clarity's sake, though it is in the calculations here. Then we have the parameter γ, which together with x̂_i and β gives us y_i, and this y_i is the final output of the layer.

So now you can see the back-propagation paths. If some loss gradient ∂L/∂y_i comes in at the top, then for example the backprop with respect to β goes through only one path, so in the formula for ∂L/∂β, summed over the mini-batch of course, only ∂L/∂y_i should appear, and that's what we see here. In the formula for γ there should likewise only be mention of ∂L/∂y_i, because the path leads only through y_i, except that we also have to take into account that γ is multiplied by x̂_i: the derivative of an addition like x + b with respect to b disregards x, whereas for x·b it doesn't, so ∂L/∂γ picks up a factor of x̂_i.

The interesting bit comes when we ask how the gradient reaches x itself, because somewhere down here there's another layer, and we want to know how to compute its incoming gradient in the face of this mess. It's not so easy, because three paths lead back to x: one directly through x̂, one through σ²_B, and one through μ_B. So we have to compute derivatives with respect to σ²_B and μ_B, and for those we need the derivative with respect to x̂. The way backprop works is that you find all paths from where you are to where you want to go and compute them iteratively, in reverse of how you computed the forward operations. The easiest one, done on top here, goes from y_i to x̂_i. Then they go from x̂_i to σ²_B, which involves the reverse of how you got it, simply the derivative formula for the division by a square root. Then you use that quantity to compute the derivative with respect to μ_B: we needed μ_B to compute σ²_B in the forward pass, so now we need the derivative with respect to σ²_B in order to compute the derivative with respect to μ_B. And note that two paths lead to μ_B, one directly from x̂ and one through σ²_B, so its derivative has two components and you add them.

Finally, for x_i itself we have three paths, because three arrows go out of x_i, and you have to take all of them into account. The first, direct one is pretty easy; the second goes through μ_B, which we've already computed; the third goes through σ²_B, which we've also already computed; and these are all added, because you always add over all paths in the backprop algorithm. Maybe we'll do a video on backprop later to really dive into how this works. In essence, the whole thing is differentiable, you just have to pay attention to how you do it, and thereby you can back-propagate through a network that has these batch-norm layers built in. So that's pretty cool.
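The derivation above can be written out as code. This is a sketch following the paper's formulas, with a minimal forward pass included so the intermediates needed by the backward pass are cached (variable names are mine):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Forward pass over an (N, D) mini-batch; caches the
    intermediates the backward pass will need."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    y = gamma * x_hat + beta
    return y, (x, x_hat, mu, var, gamma, eps)

def batchnorm_backward(dy, cache):
    """Backward pass: sum the gradient over every path, as in the
    derivation -- directly through x_hat, through the variance,
    and through the mean."""
    x, x_hat, mu, var, gamma, eps = cache
    N = x.shape[0]
    std_inv = 1.0 / np.sqrt(var + eps)

    dbeta = dy.sum(axis=0)              # only the additive path
    dgamma = (dy * x_hat).sum(axis=0)   # multiplicative path picks up x_hat
    dx_hat = dy * gamma                 # from y back to x_hat

    # path through sigma^2 (derivative of the division by a square root)
    dvar = (dx_hat * (x - mu)).sum(axis=0) * -0.5 * std_inv**3
    # two paths lead to mu: through x_hat directly, and through sigma^2
    dmu = -(dx_hat * std_inv).sum(axis=0) \
          + dvar * (-2.0 / N) * (x - mu).sum(axis=0)
    # three paths lead to each x_i; add them all
    dx = dx_hat * std_inv + dvar * 2.0 * (x - mu) / N + dmu / N
    return dx, dgamma, dbeta
```

A quick sanity check of the result: because the output is invariant to shifting a whole input dimension by a constant, the entries of `dx` sum to (numerically) zero along the batch axis.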

Results

I just want to quickly jump over to the results. Keep in mind this paper is from 2015, so networks weren't that big back then and we didn't know that much about training yet. The interesting thing is what they discovered: look, we can take drastically fewer steps to reach the same accuracies. These plots show the activations of the network over the course of training: without batch norm, especially at the beginning, there are large fluctuations in the activations, and with batch norm there's no such thing. The reason is pretty simple. While you learn your layered representation, say x fed through layers with hidden representations in between, you're trying to learn all the parameters, say W3. But at the beginning of training everything is prone to shifting around a lot, so when you change W1, that changes the entire distribution of the hidden representations downstream. Whatever you learned for W3 is then already almost obsolete, because W3 was assuming its inputs would remain the same; that's what you assume in machine learning, that your input distribution stays the same. That's why at the beginning of training you see these large variances, and with batch norm this tends to go away. Beyond that, they mainly show that they can reach the same accuracies as other training methods but with much fewer steps, and that they can use much higher learning rates than others could. So that's pretty cool. I encourage you to check out the rest of the paper. Use batch norm in your networks; strangely enough it sometimes works and sometimes doesn't, but I guess that's just a matter of experimentation. All right, that was it for me, bye-bye.
