The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks


Yannic Kilcher · 13.04.2020 · 22,267 views · 674 likes


Video description
Stunning evidence for the hypothesis that neural networks work so well because their random initialization almost certainly contains a nearly optimal sub-network that is responsible for most of the final performance.

Paper: https://arxiv.org/abs/1803.03635

Abstract: Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance. We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the "lottery ticket hypothesis:" dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective. We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.

Authors: Jonathan Frankle, Michael Carbin

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

Table of contents (4 segments)

Introduction

Hi there! Today we're looking at "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks" by Jonathan Frankle and Michael Carbin. This paper is a sort of empirical investigation into what makes neural networks train successfully, and it comes out of the pruning literature. As the authors say, neural network pruning techniques have been around for a while: they can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving the computational performance of inference without compromising accuracy. So what does this mean?

Neural Networks

Suppose you have a neural network, say three nodes in each layer across two layers. In a fully connected network, every node is connected to every node in the next layer, and these connections are your weights, your thetas. You train them over a number of steps and you reach some test-set accuracy, let's say 90%, so your network generalizes pretty well. But people have been wondering: these networks require quite a lot of storage — each layer here has nine connections, three times three — so can we make them smaller while still retaining the accuracy? This is where pruning comes in.

With pruning, the first step is to train the full network, and the second step is to prune. When you prune, you select, among the weights that you have trained, the best ones in some form or another — in this case people simply keep the ones with the largest magnitudes, but there are multiple techniques for doing this, and it is closely related to things like quantization or distillation. You leave away some (or most) of the weights and hope that you still retain a pretty good accuracy. Pruning methods have been deployed successfully to make networks use less space or evaluate faster, because with fewer numbers you need to do fewer calculations.

This paper builds on top of that and says the following: if we take the subnetwork we identified after training and train just that subnetwork from the beginning, it will also perform pretty well, or even better — but under one condition. The condition is that you carry over the initial weights: training just the small subnetwork works if its initial weights θ₀ are the same values those connections had in the initialization of the large network. So can we train just the small network from the beginning, and skip training the big network? The short answer is no, because to know a good initialization for the small network's weights, you first have to train the large network and identify which of the connections make sense. You can't just pick a smaller network from the start; you have to train the larger one to learn which weights, and which initializations, matter.
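The prune-and-reset idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's code: `theta_final` stands in for the weights after full training, and the "winning ticket" is just the top-magnitude weights reset to their values in θ₀.

```python
import numpy as np

def magnitude_mask(theta_trained, keep_fraction):
    """Mask that keeps the largest-magnitude weights and zeroes the rest."""
    k = max(1, int(round(keep_fraction * theta_trained.size)))
    threshold = np.sort(np.abs(theta_trained), axis=None)[-k]  # k-th largest |w|
    return np.abs(theta_trained) >= threshold

rng = np.random.default_rng(0)
theta_0 = rng.normal(size=(3, 3))                # random initialization
theta_final = theta_0 + rng.normal(size=(3, 3))  # stand-in for the trained weights

mask = magnitude_mask(theta_final, keep_fraction=0.2)
winning_ticket = np.where(mask, theta_0, 0.0)    # survivors reset to theta_0

print(int(mask.sum()))  # 2 of the 9 weights survive
```

Retraining would then optimize only the unmasked entries, starting from `winning_ticket` rather than from the trained values.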

Lottery Ticket Hypothesis

This is the lottery ticket hypothesis. It states, in full: a randomly initialized, dense neural network contains a subnetwork that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations. The important part is that it contains a subnetwork that is *initialized such that* when trained in isolation — two things matter: the structure of the subnetwork, but also the initialization of its connections.

The paper thereby hints at why neural networks work at all. We've often wondered: these networks have so many parameters, how can they even generalize? The suggested answer is the following. When we throw so many parameters at a network, some subset of them will happen to be initialized in such a beneficial way that training makes the network perform well — it's initialization plus SGD on that subnetwork. So it is actually only a very small subnetwork that is responsible for the performance of the neural network, but that subnetwork needs to be initialized at the correct position, and by over-parameterizing so heavily we give the network combinatorially many subnetworks to choose from. Because of this combinatorics, if we over-parameterize by some margin, there is almost guaranteed to be a good, well-initialized subnetwork in there. So this is not some magic trick that lets us train the smaller networks from the start; it is an explanation of why over-parameterization makes sense: it allows the network to exploit the combinatorics to find a well-initialized subnetwork that will perform well. The evidence is exactly the fact that if we transfer the subnetwork over, it will by itself reach the same performance, or actually exceed it — but only if we initialize it at the same point as in the original network.

Here is how these subnetworks are identified (we've already hinted at it). To identify winning tickets: first, randomly initialize a full neural network. Second, train the network for j iterations, arriving at trained parameters θⱼ. Third, prune p% of the parameters, creating a mask m — and here is the catch: to know which ones to prune, you need to have trained the full network first. Fourth, reset the remaining parameters to their values in θ₀ — and this is the same θ₀ as at the very beginning of training, so you set them back to those exact values — thereby creating the winning ticket. (If you just want to end up with the trained network, keeping the remaining weights trained is what matters; but if you want to retrain, you set everything back and train only the masked version of the network.)

They also find that this works better not in what they call one-shot pruning, but with iterative pruning: repeatedly train, prune, and reset the network over n rounds, each round pruning p^(1/n) percent of the weights that survived the previous round. Why might that be? Here is a hypothesis that I myself put forth. If you prune some of the weights, the responsibility those weights carried gets shifted onto other weights. Remember that we prune by looking at which weights are large. Say these are the magnitudes of the weights in a layer and you only want to keep two of them. One-shot, you would prune the three smallest right away and then retrain, and the weights would end up somewhat different. But over multiple rounds, you first prune only the smallest one and retrain; the weights change, and the responsibility that pruned weight carried is transferred onto another weight, whose magnitude grows. You prune the next smallest, and again, in my hypothetical example, its responsibility falls on some other weight. By the third round you might realize that a weight which looked unimportant at first is, in the absence of the two already-pruned weights, actually important — so you prune a different one instead. That, I think, is why this iterative pruning method might work a bit better than the one-shot pruning method.
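That iterative schedule can be sketched as a loop. This is a toy sketch, not the paper's implementation: `fake_train` is a hypothetical stand-in for actually retraining the masked network from θ₀. Keeping a fraction (1 − p)^(1/n) of the survivors each round leaves roughly 1 − p of all weights after n rounds.

```python
import numpy as np

def iterative_prune(theta_0, train, prune_total=0.8, rounds=4):
    """Train, prune the smallest surviving weights, reset survivors to
    theta_0, and repeat -- the iterative pruning recipe."""
    keep_per_round = (1.0 - prune_total) ** (1.0 / rounds)
    mask = np.ones_like(theta_0, dtype=bool)
    theta = theta_0.copy()
    for _ in range(rounds):
        theta = train(theta, mask)                 # train only the live weights
        survivors = np.sort(np.abs(theta[mask]))
        k = int(round(keep_per_round * survivors.size))
        mask &= np.abs(theta) >= survivors[-k]     # drop the smallest survivors
        theta = np.where(mask, theta_0, 0.0)       # reset winners to their init
    return mask, theta

# Toy "training": push each live weight a bit further from zero,
# so magnitudes stay distinct and comparable.
def fake_train(theta, mask):
    return theta + 0.1 * np.sign(theta) * mask

rng = np.random.default_rng(1)
theta_0 = rng.normal(size=100)
mask, ticket = iterative_prune(theta_0, fake_train, prune_total=0.8, rounds=4)
print(int(mask.sum()))  # 20 of the 100 weights survive
```

With `prune_total=0.8` and `rounds=4`, each round keeps about 66.9% of the survivors, so after four rounds roughly 20% of the weights remain, matching the overall pruning target.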

Experiments

They do a lot of empirical investigation, and I just want to highlight a few results so that you get the gist — the paper goes into a lot of detail across many different architectures that you can check out yourself. One plot shows percent of weights remaining on the x-axis (more weights as you go to the right — and note it's a log plot) against accuracy. The dashed lines are random pruning: you just drop out a certain number of weights and retrain, and the dashed line starts dropping and only gets worse as fewer and fewer weights remain — which is exactly what you'd expect: prune the network, make it smaller, make it less performant. But interestingly, if you do the pruning they suggest and then retrain with the correct initialization, not only do you retain the same level of accuracy for a very long time — down to 2.9 or even 1.2 percent of weights remaining — you actually go higher. At 16 percent of weights remaining there is a significant difference between the full network and the pruned network, simply from training this winning ticket. I find that very fascinating. Again, this is not a magic bullet that you can apply from the beginning, but it does give a clue that if you could train these subnetworks from the start, you might actually end up at a better point — so there is a potential practical application.

They also train faster. Plot training iterations against test accuracy: the full network follows some curve, but if you prune to 20 percent of the weights, you actually train faster and end up higher; even with 7 percent of the weights you get almost as high. Only when you go to about 1.9 percent of the weights does performance degrade again and eventually fall below the original network. That is pretty cool.

As I said, they do a lot of investigation, and I think one of the main takeaways is that it is not only the structure of the winning-ticket subnetwork that makes it a winning ticket — it is the initialization. Let me show one of their many plots (again from my own annotations): percent of weights remaining against test accuracy at the final iteration. If we initialize the subnetwork at its original values, as this method suggests, accuracy first increases and only decreases after very heavy pruning. If we take the same subnetwork but randomly reinitialize it, it drops much faster — in fact it drops immediately. So it really is about the initialization, not only the structure of the subnetwork; I think that is the core of the hypothesis here.

A very interesting related finding that I just want to mention concerns how far the weights travel in optimization space. Go back to my original drawing: take the full neural network and look at a parameter that ends up in the winning ticket, going from its initial value θ₀ to its final value θ, and also at a parameter that does not end up in the winning ticket, going from θ₀′ to θ′ (I'm not too good at labeling). If you measure how far they travel, you'll find that the weights that end up in the winning ticket travel much further during optimization than the weights that don't — those mostly just stay where they are. So it's not that the good network is already contained in the initialization; it's much more that the good network lends itself very favorably to being optimized by SGD — SGD has, so to speak, a bigger pull on it. I think there are a lot of things yet to be explored in this space, and this paper is a very cool contribution to our understanding of how neural networks work. I invite you to check out all the experiments — they do a very thorough job — and with that I say bye-bye.
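The "distance traveled" comparison is easy to state concretely. Here is a toy illustration with made-up numbers (not the paper's actual measurement): compare the mean |θ_final − θ₀| inside and outside the winning-ticket mask.

```python
import numpy as np

def mean_travel(theta_0, theta_final, mask):
    """Mean |theta_final - theta_0| inside vs. outside the winning-ticket mask."""
    travel = np.abs(theta_final - theta_0)
    return travel[mask].mean(), travel[~mask].mean()

# Hypothetical numbers: two winning weights that moved a lot during
# training, two pruned weights that barely moved.
theta_0     = np.array([0.5, -0.3, 0.10, -0.10])
theta_final = np.array([1.5, -1.1, 0.15, -0.05])
mask        = np.array([True, True, False, False])

winning, others = mean_travel(theta_0, theta_final, mask)
print(winning, others)  # roughly 0.9 vs 0.05
```

The paper's finding is that, in real training runs, the first number comes out substantially larger than the second: winning-ticket weights move much further from their initialization.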
