# [Classic] ImageNet Classification with Deep Convolutional Neural Networks (Paper Explained)

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=Nq3auVtvd9Q
- **Date:** 23.07.2020
- **Duration:** 46:07
- **Views:** 48,482
- **Source:** https://ekstraktznaniy.ru/video/13394

## Description

#ai #research #alexnet

AlexNet was the start of the deep learning revolution. Up until 2012, the best computer vision systems relied on hand-crafted features and highly specialized algorithms to perform object classification. This paper was the first to successfully train a deep convolutional neural network on not one, but two GPUs and managed to outperform the competition on ImageNet by an order of magnitude.

OUTLINE:
0:00 - Intro & Overview
2:00 - The necessity of larger models
6:20 - Why CNNs?
11:05 - ImageNet
12:05 - Model Architecture Overview
14:35 - ReLU Nonlinearities
18:45 - Multi-GPU training
21:30 - Classification Results
24:30 - Local Response Normalization
28:05 - Overlapping Pooling
32:25 - Data Augmentation
38:30 - Dropout
40:30 - More Results
43:50 - Conclusion

Paper: http://www.cs.toronto.edu/~hinton/absps/imagenet.pdf

Abstract:
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes.

## Transcript

### Intro & Overview [0:00]

Hi there! Today we'll look at "ImageNet Classification with Deep Convolutional Neural Networks" by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. This paper is another installment in our historical paper overview, where we go through old papers that were (or weren't) very impactful and see what people already knew at the time, how things developed, and so on. This paper, also known as AlexNet, was the one that started the deep learning revolution, so to say, or at least contributed in large part to it. It was the first paper that showed you could train these very deep neural networks ("very deep" being a relative term here), and the first to show that you could actually use CUDA GPUs to train such large networks efficiently. It won the ImageNet competition that year, and it did so by a very large margin, which kind of shook the world, because previously computer vision was still using hand-engineered features with some kind of classifier on top. This paper basically changed everything. So we'll go through the paper and see what was already known, and, something I always enjoy with these papers, which of the choices people made back then pulled through to today: which arbitrary choices Alex Krizhevsky made right here are we still making today, and what have we learned since? The paper is written relatively straightforwardly, I have to say; it's a good read if you want to read it, and it gives you a bit of an intuition of how much work must have gone into this, which I guess is a lot. So, they start off by saying

### The necessity of larger models [2:00]

that current approaches to object recognition make essential use of machine learning methods. This was also new: object recognition wasn't always learned. You could build object recognizers in different ways, like matching templates and so on; machine learning was just one of the methods used, though of course today it is the method. "To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting. Until recently, datasets of labeled images were relatively small, on the order of tens of thousands of images." This means especially NORB, or CIFAR-10 and CIFAR-100, which are relatively small datasets with relatively small images as well; CIFAR-10 is 32 by 32 pixels. They're saying that on these small datasets you can get by with classical computer vision models, but for larger and especially more realistic datasets, with bigger resolution and so on, you need bigger models. So they say: "But objects in realistic settings exhibit considerable variability. To learn to recognize them, it is necessary to use much larger training sets." The ImageNet dataset is one of those larger datasets; it consists of 15 million labeled high-resolution images in over 22,000 categories. People keep forgetting this, and I include myself in that group: the ImageNet dataset is actually much larger than the one we usually talk about. When we speak of ImageNet, we think of the subset that has a thousand classes and about one to one and a half million images. That's only a subset of the much larger ImageNet dataset with many more categories; it's just that the ImageNet competitions were performed on this subset because, I guess, people thought a thousand classes and a million images was already plenty. So their argument is right here:
"To learn about thousands of objects from millions of images, we need a model with a large learning capacity. However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots of prior knowledge to compensate for all the data we don't have." So their main argument for using neural networks is that the size of the dataset is so large that we need a large model. Granted, they already recognize the inherent connection between large models and lots of complex data. But in the opposite direction they say: even with that much data, the task we're trying to solve, object recognition, is way more complicated than the amount of data we have, so the model should also have lots of prior knowledge to compensate for all the data we don't have. Remember, at this time convolutional neural networks weren't really known to do much of anything. They were used for handwritten digit recognition and so on, and were kind of on par with other methods, but it wasn't obviously clear that you would use them for image recognition. So here they have to make an argument to convince people that neural networks can be used for this task because they have such a high capacity. However, plain feed-forward neural networks are almost too unconstrained: they don't know anything about the data, everything is connected to everything. And so they argue, "our model should have lots of prior knowledge to compensate for all the data we don't have," and they allude to the convolutional

### Why CNNs? [6:20]

neural networks. "Convolutional neural networks constitute one such class of models. Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images, namely stationarity of statistics and locality of pixel dependencies." Their argument here is that the convolution operation is such a strong prior, and one mostly consistent with what we know about images, that CNNs are very well suited to computer vision. Again, something that was not abundantly clear at the time the way it is now. It's interesting to see how they get to this point: we need lots of capacity, but we also need a model with lots of prior knowledge, and of course CNNs fit that very well. Then they go into the problems of CNNs: "Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture, they have still been prohibitively expensive to apply in large scale to high-resolution images. Luckily, current GPUs, paired with a highly optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly large CNNs, and recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting."

So overfitting was still very much at the forefront of people's minds back then. Right now we don't really care about overfitting that much anymore; we've basically figured out that if we just build large enough models, we don't overfit, which is strange in itself (see the double descent phenomenon and so on). But back then overfitting was a central worry, and they do a lot of things here to prevent it, which gives them a boost in test accuracy, though what they were fixing might not actually have been overfitting. For example, they already do data augmentation in this paper, and they always present it as a way to prevent overfitting. We know nowadays that the benefit of data augmentation might not be about overfitting at all; it may have more to do with regularizing your function, making it smoother, and so on. But coming from a classical machine learning perspective, overfitting was the number one problem, or one of them, in SVMs and the like, so it's safe to say they thought: if we build these large models, we're going to have a huge overfitting problem. That's why this worry pulls through the whole paper.

One of the main contributions of this paper is to combine CNN training with GPUs, which was also not at all clear at the time. It was known that you could do computation on GPUs, but the fact that they are very capable for training CNNs, or neural networks generally, wasn't established. This paper basically showed that if you use a GPU, you can train that much faster, and that makes it possible to train these big networks in the first place. Again, right here: "The size of our network made overfitting a significant problem, even with 1.2 million labeled training examples, so we used several effective techniques for preventing overfitting," and we'll look at those. At the end they say: "The network's size is limited mainly by the amount of memory available on current GPUs and by the amount of training time that we are willing to tolerate. Our network takes between five and six days to train on two GTX 580 GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available." And that proved to be absolutely true. With faster and bigger GPUs, these networks became better simply by increasing their depth, and then ResNets came along, increasing the depth by an order of magnitude, which gave another boost to computer vision. All right, so

### ImageNet [11:05]

they talk about the ImageNet dataset here, and the main point is that the images are plentiful: there are over a million training images in this subset, with a thousand classes. That was huge. CIFAR-10 had 10 classes, CIFAR-100 had 100 classes, and that was already a lot; a thousand classes was practically unheard of before this dataset. And a million training images: completely crazy. Not only were there a lot of images, the resolution was also really big, on the order of 256 by 256, whereas previous methods all worked with something like 32 by 32. So it was definitely a challenging dataset, and even today it remains a challenging dataset.

### Model Architecture Overview [12:05]

All right, the architecture. There's this famous graphic of the AlexNet architecture. Briefly, they describe these convolutional layers; as you can see, there's max pooling already here, and they have dense layers at the end. They generally increase the number of feature maps while decreasing the resolution with max pooling. All of this has been kept until today, and I guess they also took it from earlier work on convolutional neural networks that had found this to be a good idea. The important part that is special to AlexNet is that, as you can see, there are two different pipelines. (And you just know how the figure's top part got cut off: the paper has only eight pages, they had three lines too much, they'd already cropped everything, so: let's just cut off the top half of the figure, it's essentially the same as the bottom. Space constraints and PDFs for conference submissions ruining yet another paper.) But you can see this two-column architecture: the network was so large that it didn't fit on one GPU, so they had to split it onto two GPUs with occasional intercommunication. There is intercommunication between the two GPUs in some layers and no intercommunication in others. This was very intricate, and it's one thing that really didn't hold until today, or I guess until now, with things like GShard, where you again have different weights on different GPUs. The invention of bigger GPUs made it sort of superfluous in between. But just imagine the amount of code they had to write: there was no TensorFlow at this point, I don't think there was even Caffe around, there was just CUDA, and this cross-GPU memory writing. I imagine it to be so ugly; big respect for writing all of this code.

### ReLU Nonlinearities [14:35]

They then go through a number of important things. Most of these aren't their own inventions, let's say, but they cleverly combine things that were already known about neural networks with things developed elsewhere that they found to work really well.

The first one is the ReLU nonlinearity. Of course, ReLUs are abundant nowadays; everyone uses them. But at that time it was still very much in fashion to use something like the sigmoid or the hyperbolic tangent. Why? Because neural networks were still kind of inspired by biological neurons, where you have the soma of the neuron, the dendrites bringing input from other axons, and you sum up all the incoming signals. A real neuron has this kind of threshold behavior: if the input rises above a certain border, the action potential, the neuron starts to spike; below it, it doesn't. People wanted to approximate this with something differentiable but very similar to a step function, and that ultimately led to the sigmoid or the hyperbolic tangent. So people were trying to stay close to biological neurons, but that gives you the problem that in the saturated regions you have almost no gradient to learn from. They argue: in terms of training time with gradient descent, these saturating nonlinearities (the hyperbolic tangent and the sigmoid) are much slower than the non-saturating nonlinearity. "Following Nair and Hinton, we refer to neurons with this nonlinearity as rectified linear units." So, taken from that other paper, they say: we use these ReLUs, these rectified linear units, which are not exactly like real biological neurons, but they train much faster. You can see in the figure (this is on CIFAR-10, measuring the time to reach 25% training error) that the ReLU network reaches it about six times faster than the one with hyperbolic tangents. They say that's one of the main components that allowed them to even experiment with these big networks, because the entire training time is six days, and they surely didn't train it only once; they experimented and saw what works. If you have a couple of months of time and it takes a week to train one of these things, you can't afford a six-fold slowdown; that would mean you could only train about two models in the entire course of the research, which would severely hinder your progress. We're now at a point where that becomes true again with these giant transformer language models, where people can essentially train them once. For GPT-3 they say: we discovered a bug halfway through, we've kind of fixed it, but we're not sure, and we couldn't restart because it was too expensive. I'm still saying we're waiting for the ResNet moment in transformers. But yes: ReLUs, used here though not introduced here, have prevailed until today.
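The saturation argument is easy to see numerically. Here is a minimal sketch (not from the paper) comparing the gradients of tanh and ReLU at a few input values:

```python
import numpy as np

def tanh_grad(x):
    # derivative of tanh: 1 - tanh(x)^2, which vanishes for large |x|
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    # derivative of ReLU: 1 for x > 0, else 0; never saturates on the positive side
    return (x > 0).astype(float)

x = np.array([-5.0, -1.0, 0.5, 5.0])
print(tanh_grad(x))  # nearly 0 at +/-5: the saturation problem
print(relu_grad(x))  # full gradient wherever the unit is active
```

At an input of 5, the tanh gradient is already below 0.001, so a saturated unit learns almost nothing per step, while an active ReLU always passes the gradient through at full strength.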

### Multi-GPU training [18:45]

Training on multiple GPUs: something that, as I said, didn't really carry forward from here, at least not this kind of GPU training. If we train on multiple GPUs today, what we usually mean is data parallelism: we have one model, we replicate it across GPUs, we take a mini-batch from the training data and split it up, let each GPU do its thing on its subset of the mini-batch, then compute the loss, backpropagate, and synchronize the gradients. So we have one model that is identical on all GPUs. Here, instead, they distribute the model itself across two GPUs, and I'm thinking that with frameworks like GShard this could have a revival: distributing your model, especially within the same layer, across many GPUs, with cross-communication only at some points. Their argument: the GTX 580 has only 3 GB of memory, "which limits the maximum size of the networks that can be trained on it. It turns out that 1.2 million training examples are enough to train networks which are too big to fit on one GPU. Therefore we spread the net across two GPUs. Current GPUs are particularly well suited to cross-GPU parallelization, as they are able to read from and write to one another's memory directly, without going through host machine memory." And: "The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers. That means that, for example, the kernels of layer 3 take input from all kernel maps in layer 2. However, the kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU." A very interesting choice, and they justify it with the results: "This scheme reduces our top-1 and top-5 error rates by 1.7% and 1.2%, respectively, as compared with a net with half as many kernels in each convolutional layer trained on one GPU. The two-GPU net takes slightly less time to train than the one-GPU net." First of all, big respect right here. I can imagine they did this with the ReLUs and so on, and they were already better than everything before.
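The restricted-connectivity scheme can be sketched abstractly. This toy example (my own construction, not the paper's code) treats channel mixing as a matrix multiply and contrasts a fully-communicating layer with one where each "GPU" only sees its own half of the kernel maps, the way layer 4 only reads same-GPU maps from layer 3:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6))  # 8 kernel maps ("channels"), 6 spatial positions

# Full communication (like layer 3): every output channel sees all 8 input maps.
w_full = rng.standard_normal((8, 8))
y_full = w_full @ x

# Restricted (like layer 4): each GPU's 4 output channels see only its own 4 inputs.
w_a = rng.standard_normal((4, 4))
w_b = rng.standard_normal((4, 4))
y_split = np.concatenate([w_a @ x[:4], w_b @ x[4:]], axis=0)

# Same output shape, but half the cross-channel weights and no cross-GPU traffic:
print(w_full.size, w_a.size + w_b.size)  # 64 vs 32
```

This is essentially what later became known as grouped convolution: the restricted layers trade some expressiveness for fewer parameters and, in their setup, zero inter-GPU communication.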

### Classification Results [21:30]

To jump ahead to the results: they beat the error rates of previous models by a ginormous amount. On the 2010 ImageNet split, the previous best systems were at around 28% and 25% top-5 error, and their best one is at about 17%. I imagine they trained it first and were already better than the 25%, and I guess lots of people would have called it a day: cool, we have an entirely new method, not only did we show we can train it, we showed it's better, boom, and everything else can be a separate paper. No: they stuck with it and pushed it. For each of these things they say, this reduces the error rate by one percent, two percent, and they really went about asking how far they could push it with everything. Just imagine: you train a network, I'm pretty sure they first trained on one GPU, and then they think, maybe we can train an even bigger network by using two GPUs, and then they realize it's going to take a crap-ton of dumb code to cross-synchronize them and keep them in lockstep. It's not even easy to write multi-GPU code today with all the frameworks; just imagine then. And for them, having already observed that their network does better than everything previous, to sit down and do the cross-GPU thing, to experiment with when to cross-communicate and whatnot, that is very respectable. Maybe a lesson to be learned, or just the mentality of the people; maybe they just had more time, the competition deadline was still two months out, I don't know. But this is not something I see very often today, this kind of persistence, additional pushing, and reporting of what works. Some papers do it, but most do it because only with all the tricks can they get that 0.1% improvement; this one already had the improvement and did it anyway. That said, multi-GPU training in the sense of splitting models across GPUs didn't really stick around, mainly, I guess, because GPUs got larger in memory pretty quickly, so it wasn't that necessary, but also because the frameworks were just too clunky. Now, maybe with GShard, this is coming back, so it's worth another shot, I guess. Next one: local response

### Local Response Normalization [24:30]

normalization. This also didn't really stick around; it got kind of dumped in favor of things like batch normalization. But with the resurfacing of things like layer normalization, it comes back to this a little bit. What they want to do is normalize the responses: each activation a (of a given kernel at a given position) is divided by a quantity summed over the responses of the neighboring kernels at the same position. So what does that mean? They have a bunch of convolutional filters, and the activations after the convolution are the feature maps. If I have, say, 10 convolutional filters in my layer, the way they normalize is per output channel: each channel is divided by a function of the summed squared responses of the channels around it, say the two channels in front of it and the two channels behind it, and for another channel you'd take the corresponding five around that one. This sliding normalization across neighboring filters isn't really something that stuck around, I guess mainly because of this rather ad hoc neighborhood structure. What people do today is layer normalization, which simply normalizes across all of the channels, or group normalization, which predefines fixed groups and only normalizes within each group, always the same groups. I'm not really sure why it died out; I guess the alternatives were just easier to implement, or simply worked better. They say: "This scheme bears some resemblance to the local contrast normalization scheme of Jarrett et al., but ours would be more correctly termed brightness normalization, since we do not subtract the mean activity." And they make a connection to biological neurons: "This sort of response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels." So it's kind of inspired by real neurons, but other people were also doing some kind of normalization; people already knew that normalization was helpful at times, and this is what they employed. "Response normalization reduces our top-1 and top-5 error rates by 1.4% and 1.2%, respectively." Not a big improvement, but still an improvement. The last thing: overlapping pooling, again
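The scheme described above can be written down directly. Here is a minimal NumPy sketch of the AlexNet local response normalization formula, using the paper's constants (k=2, n=5, alpha=1e-4, beta=0.75):

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """a: activations with shape (channels, height, width).
    Each channel is divided by a power of k plus the summed squares of
    the n neighbouring channels, clipped at the channel boundaries."""
    C = a.shape[0]
    b = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.ones((10, 4, 4))
out = local_response_norm(a)
# an edge channel sums only 3 neighbours; a middle channel sums all 5
print(out[0, 0, 0], out[5, 0, 0])
```

The boundary clipping is why edge channels are divided by a slightly smaller sum than middle channels, which is exactly the "dynamic neighborhood" aspect that layer norm and group norm later replaced with fixed groups.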

### Overlapping Pooling [28:05]

a thing that didn't stick around that much. Instead of pooling 2 by 2 with a stride of 2, like we mostly do today, pooling the image down to a smaller size, they pool with overlapping windows: a 3 by 3 window, but still with a stride of 2, so the windows overlap, resulting in the same output size, but each output pixel carries some overlapping information from the pixels around it. Again they say it reduces the top-1 and top-5 error rates, by 0.4% and 0.3%. Maybe this didn't stick around because, I'm not sure, maybe people found it doesn't help in other problems; who knows.

So the overall architecture, as we said, is described in this picture. You have the input image, which has three channels, and at the beginning they use convolutional filters with a stride of 4 to reduce the size: the input is 224 by 224, and after the first layer it's 55 by 55 with 48 feature maps (per GPU). As we said before, the number of feature maps keeps increasing while the resolution of the image keeps decreasing. The stride-4 convolution is employed to downsample the image at the same time as convolving it; nowadays a lot of architectures skip max pooling entirely and always use strided convolutions to downsample while convolving. What you also see here is that they thought the filter size should be large at the beginning (11 by 11) and then decrease, which is a reasonable assumption, because with higher-resolution images you'd expect to need larger filters. This didn't really carry through to today: most architectures now go with 3 by 3 kernels from the very start and don't bother shrinking their filters over depth. I don't really know why, whether it's just more convenient, or fewer parameters, or whether there's really something to having small filters; I just know that large filters at the beginning is something that didn't hold over time. You can also see that they have multiple dense layers at the end, three of them; I believe most architectures today go with two instead, one hidden layer and one classification layer. But it's very close to the architectures of today; honestly, it hasn't changed that much. The difference between this and the VGG-16/VGG-19 networks is just depth, and the difference between those and ResNet is just the skip connections, and that's where we are today. They also allude to the fact that, even though it doesn't look like it in the drawing, most parameters are in these dense layers. A convolutional layer is like one percent of the parameters, even though it takes up a lot of space in the figure. So maybe the reduction in the number of classification layers at the end also has to do with the fact that that's where most parameters are: if you get rid of one of those dense layers, you can afford many more convolutional layers.
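The claim that the dense layers dominate is easy to check with back-of-the-envelope arithmetic. This sketch counts weights for an AlexNet-style layer stack, ignoring the two-GPU split and biases, so it's an approximation, not the paper's exact bookkeeping:

```python
# Rough weight counts for an AlexNet-style network (ignoring the
# two-GPU kernel split and bias terms).
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def dense_params(n_in, n_out):
    return n_in * n_out

conv = (conv_params(11, 3, 96) + conv_params(5, 96, 256)
        + conv_params(3, 256, 384) + conv_params(3, 384, 384)
        + conv_params(3, 384, 256))
dense = (dense_params(6 * 6 * 256, 4096)   # last feature map flattened into FC1
         + dense_params(4096, 4096)        # FC2
         + dense_params(4096, 1000))       # classification layer
print(f"conv: {conv:,}  dense: {dense:,}")
```

Under these assumptions the five convolutional layers hold under 4 million weights while the three dense layers hold over 58 million, about 15 times more, which matches the video's point that dropping one dense layer frees up room for many convolutional layers.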

### Data Augmentation [32:25]

All right, the last part here is on reducing overfitting. Again, they didn't really investigate whether their network was actually overfitting, like really establishing the overfitting; or maybe they did, and it actually was. Nowadays we don't worry about overfitting too much anymore, maybe because we already use these augmentations naturally, but also because we've built up an intuition that these deep models generalize. I'm not sure whether they were only worried about it so much because of the history of machine learning, or whether they actually saw everything overfitting constantly. They say: "Our neural network architecture has 60 million parameters. Although the 1000 classes make each training example impose 10 bits of constraint on the mapping from image to label, this turns out to be insufficient to learn so many parameters without considerable overfitting. Below, we describe the two primary ways in which we combat overfitting." No one today makes this argument anymore, this "we have this many parameters and that many images, how many parameters per sample is that, how many bits of constraint". We don't care; we're fine with having vastly more parameters than training samples.

The first thing they do is data augmentation. This was already known, like lots of things here, but the combination is just so cool in this paper. First of all: "The transformed images are generated in Python code on the CPU while the GPU is training on the previous batch of images, so these data augmentation schemes are, in effect, computationally free." (Again, this code must have been ugly.) "The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224 by 224 patches, and their horizontal reflections, from the 256 by 256 images." So: these are the most valuable data augmentations that we still have today. Random horizontal flipping is still used in every computer vision pipeline (except if you want to read text, I guess), and random cropping is still the most powerful data augmentation technique for images today. It's crazy that this was already discovered here. I don't think they report how much this particular augmentation improves things (they only quantify the next one), but I'd guess it was one of the vital things pushing the performance, because we now know cropping is very important. I guess they thought translation was the important part, and so they focused on generating image translations; to generate a translation from a single image, you naturally have to crop it. Nowadays we focus much more on the cropping itself, on having different sub-images of the same image; especially in self-supervised learning and the like, we know that cropping is the workhorse of these methods. The fact that they extract random patches means the network only operates on sub-patches, and they compensate at test time: "The network makes a prediction by extracting five patches (the four corner patches and the center patch) as well as their horizontal reflections, and averaging the predictions made by the network's softmax layer on the ten patches." I believe people don't do this much nowadays; most of the time they simply rescale the test images, or fine-tune at the end on the training scale; there are various techniques. But random cropping and horizontal flipping were already employed right here. They also do a kind of color jittering, a very special form of altering the intensities of the RGB channels: "Specifically, we perform PCA on the set of RGB pixel values throughout the ImageNet training set. To each training image, we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1." This, I believe, has gone out of fashion: people do color and brightness jitter, but I don't think they do this particular PCA-based augmentation anymore. They say: "This scheme reduces the top-1 error rate by over 1%." I wonder why it fell out of use; maybe because you need these statistics over the entire dataset, while the other jitters work about as well and you can apply them without knowing your principal components. Okay, next thing: dropout.
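To make the two augmentations concrete, here is a minimal NumPy sketch of random crop plus flip and of the PCA color jitter. The eigenvector/eigenvalue inputs are stand-ins; in the paper they come from a PCA over all RGB pixel values in the training set:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_flip(img, size=224):
    """Random size x size crop plus a coin-flip horizontal mirror,
    as in the paper (img: H x W x 3, e.g. 256 x 256)."""
    h, w, _ = img.shape
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    patch = img[y:y + size, x:x + size]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal reflection
    return patch

def pca_color_jitter(img, eigvecs, eigvals, sigma=0.1):
    """AlexNet's PCA color augmentation: add a random multiple of the
    RGB principal components (one offset for the whole image)."""
    alphas = rng.normal(0.0, sigma, size=3)
    shift = eigvecs @ (alphas * eigvals)
    return img + shift

img = rng.random((256, 256, 3))
patch = random_crop_flip(img)
eigvecs, eigvals = np.eye(3), np.ones(3)  # hypothetical stand-ins for the dataset PCA
jittered = pca_color_jitter(img, eigvecs, eigvals)
print(patch.shape, jittered.shape)
```

Note how cheap both operations are, which is why generating them on the CPU while the GPU trains on the previous batch made them effectively free.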

### Dropout [38:30]

has been, you know, one of the things that was very important throughout the early stages of deep learning, and isn't that important anymore. Some people still use dropout, but most people, I think, don't anymore — it's very interesting to see. But it definitely was a technique that was used a lot, from AlexNet until basically the last very few years. They say: combining the predictions of many different models is a very successful way to reduce test errors, but it appears to be too expensive for big neural networks that already take several days to train. There is, however, a very efficient version of model combination that only costs about a factor of two during training — this technique called dropout. Then they explain it: set to zero the output of each hidden neuron with probability 0.5. Again, people didn't understand dropout then the way they do now, but they introduced it right here. They don't say by how much it reduces the error, but they do say: we use dropout in the first two fully connected layers; without dropout, our network exhibits substantial overfitting; dropout roughly doubles the number of iterations required to converge. So okay, they did find actual evidence of overfitting and saw that dropout reduces it. And I wonder why this doesn't happen nowadays — maybe because we have fewer of these fully connected layers, but I can't really imagine that's all of it. Maybe because we do more augmentation, I don't know. Or maybe dropout is still used and I just don't see it.
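A minimal sketch of the scheme as the paper states it — zero each hidden unit with probability 0.5 during training, and multiply all outputs by 0.5 at test time (modern "inverted" dropout instead rescales during training so test time is a no-op); the function name and NumPy framing here are mine:

```python
import numpy as np

def dropout(x, p_drop=0.5, train=True, rng=np.random.default_rng(0)):
    """Dropout as described in the paper: at training time, zero each
    unit independently with probability p_drop; at test time, keep all
    units but scale their outputs by (1 - p_drop)."""
    if train:
        mask = rng.random(x.shape) >= p_drop  # True = unit survives
        return x * mask
    return x * (1.0 - p_drop)  # test-time rescaling

x = np.ones(8)
print(dropout(x, train=True))    # some units zeroed, the rest unchanged
print(dropout(x, train=False))   # every unit scaled to 0.5
```

The "factor of two during training" claim corresponds to the observation quoted above that dropout roughly doubles the number of iterations required to converge, while test time stays a single forward pass through one shared-weight network.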

### More Results [40:30]

So here they use momentum to train this, and they do some qualitative analysis. First of all, they say, okay, they shatter all of the previous approaches. They also build kind of ensemble methods, and they already do transfer learning: they pre-train on ImageNet 2011 and then fine-tune on ImageNet 2012 to reduce that error even further — pulling all the tricks. All these things are still around; very cool.

Then they look into what their network learned. They find a number of these filters — you see these 11 by 11 filters in the first layer — where they show that the network really extracts filters like color gradients or edge detectors in various forms and directions. This was kind of already known about neural networks, and it's cool to see that this one does so too.

This one here is also a very cool investigation, where they look at examples: the red bar is always the correct label, and the bars are what their model says are the top five things. It's cool to look at. For example, here you have mite as the top one, but then also black widow, cockroach, tick, and starfish — the top labels are usually also very good labels. You can see here "grille", and it assigns "convertible", which by all means is correct; it's just not the class that the annotators assigned to this particular image. As well as here, "dalmatian" was the highest prediction of the network where the label was actually "cherry", and this is quite debatable, right? So you can see that a lot of the mistakes the network makes are, you know, forgivable, let's say. And when the network doesn't make mistakes, not only is the top label good, but a lot of the top-five labels are also very adequate.

Lastly, they take a given training-set image — these are the training-set images right here — and they look at the last layer's feature vector and its five nearest neighbors, in Euclidean space, over the entire training data set. And here's what you come up with: you can see that for the elephant, the nearest neighbors are all other elephants, regardless of the fact that they are in different poses — they don't always look the same way. Same for these dogs right here. So it's pretty cool to see that the network actually learns some invariances across the class and puts images with the same label into the same area of the embedding space.
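That nearest-neighbor probe is easy to reproduce on any matrix of feature vectors. Here is a toy sketch (not the authors' code; the two synthetic clusters stand in for the last-layer embeddings of, say, elephants and dogs):

```python
import numpy as np

def nearest_neighbors(features, query_idx, k=5):
    """Given last-layer feature vectors (one row per training image),
    return the indices of the k nearest images in Euclidean distance,
    excluding the query image itself."""
    q = features[query_idx]
    d = np.linalg.norm(features - q, axis=1)
    d[query_idx] = np.inf  # never return the query as its own neighbor
    return np.argsort(d)[:k]

# Toy data: two tight clusters far apart in a 4-D feature space.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.1, (5, 4)),    # images 0-4 ("elephants")
                   rng.normal(5, 0.1, (5, 4))])   # images 5-9 ("dogs")
print(nearest_neighbors(feats, 0, k=4))  # all four from the first cluster
```

If the network has learned the pose invariances discussed above, real last-layer features behave like these clusters: images of the same class sit close together in Euclidean distance even when their pixels differ wildly.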

### Conclusion [43:50]

yeah, so that's their paper. They already allude to the fact that depth is very important: it is notable that the network's performance degrades if a single convolutional layer is removed — for example, removing any of the middle layers results in a loss of about two percent in the top-1 performance of the network — so the depth really is important for achieving their results. And as you know, this spurred an era of trying to build deeper and deeper networks, until ResNets came along and built ultra-deep networks. They also say: we did not use any unsupervised pre-training, even though we expect that it will help, especially if we obtain enough computational power to significantly increase the size of the network without obtaining a corresponding increase in the amount of labeled data. Thus far, their results have improved as they made the network larger and trained it longer, but they say they still have many orders of magnitude to go in order to match the inferotemporal pathway of the human visual system. Ultimately, they would like to use very large and deep convolutional nets on video sequences, where the temporal structure provides very helpful information that is missing, or far less obvious, in static images. So they're already previewing future research here — self-supervision, many more layers, and so on. It's astounding, this kind of foresight, and of course all of these proved to be very adequate predictions.

And yeah, so this was the paper — the paper that kicked off deep learning. I enjoy reading these old papers, especially looking back at what was already known, what is still around — which turns out to be a lot — and the choices people made back then, some of which defined our modern field. So that was it for AlexNet. Let me know what you think in the comments, and I'll see you next time. Bye.
