# mixup: Beyond Empirical Risk Minimization (Paper Explained)

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=a-VQfQqIMrE
- **Date:** 27.05.2020
- **Duration:** 13:02
- **Views:** 12,144

## Description

Neural Networks often draw hard boundaries in high-dimensional space, which makes them very brittle. Mixup is a technique that linearly interpolates between data and labels at training time and achieves much smoother and more regular class boundaries.

OUTLINE:
0:00 - Intro
0:30 - The problem with ERM
2:50 - Mixup
6:40 - Code
9:35 - Results

https://arxiv.org/abs/1710.09412

Abstract:
Large deep neural networks are powerful, but exhibit undesirable behaviors such as memorization and sensitivity to adversarial examples. In this work, we propose mixup, a simple learning principle to alleviate these issues. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favor simple linear behavior in-between training examples. Our experiments on the ImageNet-2012, CIFAR-10, CIFAR-100, Google commands and UCI datasets show that mixup improves the generalization of state-of-the-art neural network architectures. We also find that mixup reduces the memorization of corrupt labels, increases the robustness to adversarial examples, and stabilizes the training of generative adversarial networks.

Authors: Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

## Contents

### [0:00](https://www.youtube.com/watch?v=a-VQfQqIMrE) Intro

Hi there! Today we'll look at "mixup: Beyond Empirical Risk Minimization" by Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. This paper is actually pretty simple, but it introduces a technique that apparently helps with training classifiers, and I have seen it used in practice, so there must be at least something to it. The method is ultimately

### [0:30](https://www.youtube.com/watch?v=a-VQfQqIMrE&t=30s) The problem with ERM

very simple. Usually in deep learning you input a data point x into your neural network f(x), where the network has parameters theta, and you get some output y-hat. Along with x you also have y, the true label, and a loss function that compares your output with that true label. You then try to make the loss smaller: you adjust your parameters so that the next time the network sees data point x, its output will be a little closer to the true label y.

We call this empirical risk minimization for the following reason: what you would like is for x to come from some true data distribution D, like the space of all natural images or of language, but what you actually have is a finite dataset from which you can sample pairs (x, y). So instead of minimizing your true risk, you minimize your empirical risk.

Now, what's the problem with that? The problem is that the network can become overly confident about the training data points and nothing else, and that hurts generalization. Say you have one data point here belonging to class 1 and another one there belonging to class 2. Your network might draw decision boundaries that say, okay, here is class 1 and here is class 2. But it is also very conceivable that it says: here is class 4, over here is class 7, right here is class 9, and, by the way, here is class 4 again. Empirical risk minimization leaves everything in between the data points open. Now, what
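The ERM setup described above can be sketched as a single gradient step on one (x, y) pair. This is a minimal illustration with a toy linear softmax classifier; the data, learning rate, and dimensions are all made up for the example and are not from the paper.

```python
import numpy as np

# Minimal ERM sketch: f_theta(x) -> y_hat, compare with true label y via a
# loss, then adjust theta so the loss on (x, y) gets smaller.
# All numbers here (toy data, lr) are illustrative.

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, label):
    return -np.log(probs[label])

rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 3))      # weights: 2 features -> 3 classes
x, y = np.array([1.0, -0.5]), 1      # one (data point, true label) pair

probs = softmax(theta.T @ x)         # y_hat = f_theta(x)
loss_before = cross_entropy(probs, y)

# gradient of the cross-entropy w.r.t. theta, then a small parameter update
grad = np.outer(x, probs - np.eye(3)[y])
theta -= 0.1 * grad                  # lr = 0.1, illustrative

loss_after = cross_entropy(softmax(theta.T @ x), y)
assert loss_after < loss_before      # empirical risk on (x, y) went down
```

Note that only the loss on the sampled pair is minimized: the sketch says nothing about what the model does anywhere in between data points, which is exactly the gap the video points out.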

### [2:50](https://www.youtube.com/watch?v=a-VQfQqIMrE&t=170s) Mixup

this paper proposes is that we should train our classifier not only on these data points but also on points in between them: the mixed data points. A data point here might be constructed, if this is a and this is b, as 0.9 times a plus 0.1 times b, because it's mostly a and a little bit of b.

Now, what are the labels? If a belongs to class 1 and b belongs to class 2, then the label of this mixed data point is 0.9 times the class of a, which is 1, plus 0.1 times the class of b, which is 2. This works because of how you input a class into a machine learning model: you don't just say "it's class number 2", you input a distribution that has zeros everywhere except a one at position 2, so (0, 1, 0, ...) across class 1, class 2, class 3, and so on. For our mixed sample, the label we input is simply a mix between classes: 0.9 of class 1, 0.1 of class 2, and zero everywhere else.

Formally, you take two data points and mix them using a mixing factor lambda; that gives you a new data point in between the original two. Then you take the two corresponding labels and mix them with the same lambda, and that gives you the label for the new data point.

Your model will now learn to interpolate smoothly: you teach it that the thing on the left is class 1, the thing on the right is class 2, and the point halfway in between is half class 1 and half class 2. The model learns a smooth interpolation, so the situation from before, where other classes sneak in between two data points, is probably not going to happen anymore. Instead it creates these iso-lines around class 1 and class 2, where it smoothly becomes less and less sure about the class, but along the way it is always either class 1 or class 2. They say this can help generalization performance.

The only thing that isn't clear from the beginning is whether this kind of interpolation actually makes sense, because it means we linearly interpolate between two images: we take half of one and half of the other, and the result will not be a natural image, just a kind of blurry thing. Otherwise all our problems would be solved and we could just classify things linearly. But in practice it does seem to help, probably because a linear interpolation of two images is still much more like a natural image than any random noise you could
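The mixing rule described above is a convex combination of two inputs and of their one-hot labels. Here is a minimal sketch with the exact numbers from the example (lambda = 0.9, class 1 and class 2); the toy "image" vectors are made up for illustration.

```python
import numpy as np

# mixup rule from the paper, as walked through in the video:
#   x_tilde = lam * x_a + (1 - lam) * x_b
#   y_tilde = lam * y_a + (1 - lam) * y_b   (labels as one-hot distributions)

num_classes = 3
x_a, y_a = np.array([1.0, 0.0, 0.0, 0.0]), 0   # toy "image" a, class 1 (index 0)
x_b, y_b = np.array([0.0, 0.0, 0.0, 1.0]), 1   # toy "image" b, class 2 (index 1)

lam = 0.9                                      # mostly a, a little bit of b
one_hot = np.eye(num_classes)

x_mix = lam * x_a + (1 - lam) * x_b
y_mix = lam * one_hot[y_a] + (1 - lam) * one_hot[y_b]

print(x_mix)   # [0.9, 0.0, 0.0, 0.1]
print(y_mix)   # [0.9, 0.1, 0.0]: 0.9 of class 1, 0.1 of class 2, 0 elsewhere
```

The mixed label still sums to 1, so it remains a valid target distribution for a cross-entropy loss.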

### [6:40](https://www.youtube.com/watch?v=a-VQfQqIMrE&t=400s) Code

come up with. They also show it in code, and the code is pretty simple: you simply mix the two things, and the mixing factor lambda comes from a beta distribution. They use an alpha of, I believe, 0.4. I just want to quickly show you: that's the red line here. As you can see, most of the time they sample something at the very left or the very right, which means the sample stays close to either the first or the second data point, but some of the time they actually sample something in the middle, and it's fairly uniform there. So it appears to be a good distribution to sample these mixing coefficients from, and by adjusting the alpha and beta parameters you can control how often you sample something close to the original data points versus something in the middle.

On this toy dataset they showcase what mixup can do. In the classic model you have the orange and the green data points, and blue is where the classifier believes it's class 1. You see this very hard border; it's quite a hard border. There are only two classes here, and the hard border is sort of a problem in itself: if you think of adversarial examples, for example, all they have to do is get over that one inch, and the classifier is already super-duper sure it's the orange class. Whereas if you use mixup, your border is much fuzzier: it's only really sure out here on either side, and in the middle it's sort of "I don't know", which is a more desirable situation. Of course this works particularly well in this linear 2D setting, but the same reasoning applies to higher layers and higher-dimensional data points.

And that's basically it for this paper; this is all they do. They propose this method and then test it. They say something interesting here: mixup converges to the classical method as alpha approaches zero, because that pushes the beta distribution's mass in the middle all the way down, and you would only sample from the very left or the very right. So you can smoothly interpolate between this mixing and the classic method. Their
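A hedged sketch of the batch construction this section describes: lambda is drawn from Beta(alpha, alpha), and a batch is mixed with a permuted copy of itself. The helper name `mixup_batch` and the toy shapes are my own for illustration; alpha = 0.4 follows the value mentioned in the video.

```python
import numpy as np

def mixup_batch(x, y_one_hot, alpha=0.4, rng=None):
    """Mix a batch with a randomly permuted copy of itself (sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)               # mixing factor ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_one_hot + (1 - lam) * y_one_hot[perm]
    return x_mix, y_mix

rng = np.random.default_rng(0)

# Beta(0.4, 0.4) is U-shaped: most mass near 0 and 1 (i.e. near one of the
# two original data points), fairly flat in the middle.
lams = rng.beta(0.4, 0.4, size=10_000)
near_endpoints = np.mean((lams < 0.1) | (lams > 0.9))
print(f"fraction of lambdas within 0.1 of an endpoint: {near_endpoints:.2f}")

x = rng.normal(size=(8, 4))                    # toy batch of 8 "images"
y = np.eye(3)[rng.integers(0, 3, size=8)]      # one-hot labels, 3 classes
x_mix, y_mix = mixup_batch(x, y, rng=rng)
assert np.allclose(y_mix.sum(axis=1), 1.0)     # mixed labels still sum to 1
```

As alpha goes to zero, the Beta(alpha, alpha) samples concentrate entirely at 0 and 1, which recovers the classic (unmixed) training mentioned above.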

### [9:35](https://www.youtube.com/watch?v=a-VQfQqIMrE&t=575s) Results

main results are for classifiers, and, what I like, since a GAN's discriminator is also a classifier, they also apply it to GANs, where they outperform and stabilize the classic GAN training. They show that mixup is more robust to adversarial attacks, because the model is not so sure about intermediate things, and they generally outperform other methods.

They also do this nice investigation where they measure the prediction error on in-between data. What that means: a prediction is counted as a miss if it does not belong to y_i or y_j. So you have samples x_i and x_j, you interpolate between the two data points, and you measure what the classifier says along the way. Whenever the classifier outputs either y_i or y_j, one of the labels of those two data points, you count it as correct, and you count it as incorrect only if it says something else. You can see that if you train with the classic method, ERM, these errors happen much more often. That's exactly the situation I pointed out at the beginning, where in high dimensions all sorts of decision boundaries can sneak in between two data points; by interpolating between them during training you reduce that effect a lot. The same thing happens with the norm of the gradients of the model with respect to inputs in between training data: the gradient norm in the middle is also much lower. This investigation I find pretty cool.

I have seen mixup in practice, so it might be useful. I've read a paper, I believe it was the Big Transfer paper, where they basically say it is useful if you have, for example, little data and a big model, so you can regularize the model. It is also useful to know that they compared this with dropout, and the conclusion is basically that this is something else than dropout; it's not doing the same thing. Dropout means you drop out some of the intermediate activations, which gives you a noisy version of the data point. Mixup can actually be combined with dropout, which gives you an additional benefit: you see right here, most of the best numbers happen when you use mixup plus dropout, so it seems to be an additional regularization on top of dropout. Pretty cool investigation.

Alright, so if you like this, I invite you to read the paper. If you liked the video, please subscribe, like, and comment, and yeah, have a nice day. Bye!
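The in-between prediction-error probe described above can be sketched as follows. The nearest-centroid "classifier" and all coordinates are stand-ins invented for illustration; the point is only the counting rule: walk along the segment between x_i and x_j and count a miss whenever the predicted class is neither y_i nor y_j.

```python
import numpy as np

# Toy stand-in classifier: predict the class of the nearest centroid.
centroids = np.array([[0.0, 0.0],   # class 0
                      [4.0, 0.0],   # class 1
                      [2.0, 1.0]])  # class 2, sitting near the segment below

def predict(x):
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

x_i, y_i = np.array([0.0, 0.0]), 0
x_j, y_j = np.array([4.0, 0.0]), 1

# Interpolate between the two training points and count misses: predictions
# that are neither y_i nor y_j ("a third class sneaking in between").
misses = 0
for lam in np.linspace(0.0, 1.0, 21):
    pred = predict(lam * x_i + (1 - lam) * x_j)
    if pred not in (y_i, y_j):
        misses += 1
print(f"in-between misses: {misses} / 21")
```

Here class 2's centroid deliberately sits close to the segment, so several interpolated points are misclassified as a third class; the paper's finding is that ERM-trained networks show this effect far more often than mixup-trained ones.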

---
*Source: https://ekstraktznaniy.ru/video/13610*