Why Do Neural Networks Love the Softmax?


Contents (3 segments)

Segment 1 (00:00 - 05:00)

Here are some of the most significant neural network models from the past decade. Each one introduced major innovations in algorithms or architectures; without them, we wouldn't have the impressive language and computer vision models that have recently gotten so much attention. But there's one thing that none of them changed: they all use the softmax function. They all agree that's one area that shouldn't be touched. So why is that?

A standard explanation might go something like this. Say we're given a picture and we'd like to predict whether it's a cat, dog, goat, bird, horse, or T-rex. Also, say we have a neural net model that does almost all the job for us: it scores each of these categories, such that a large positive score indicates that category is likely. For different inputs, these scores change, so we can select the most likely category according to the model. But this is incomplete. Very often, for considerations of model training and validation, these scores need to be mapped to probabilities. For example, these would work: notice they're all positive and they sum to one. Also, the function that does this mapping needs to be differentiable, so we can fit the model using the algorithm that everyone uses, stochastic gradient descent, or SGD. In a nutshell, SGD computes the gradient, which tells it how to slightly update the model to slightly improve its performance, so differentiability is required for gradients to be computed. For example, if the function were to put all the probability on the highest-score category, that wouldn't give us a learnable model, because discretely choosing the max isn't a differentiable operation.

Okay, so how can we do this? The problem is that the scores can be negative. If we force the scores to all be positive, then we can divide each by the sum of all of them, and that gives us valid probabilities. (As an aside, there are actually good reasons to do this divide-by-the-sum-of-everything operation, but I don't have time to get into that.) So what function should we use to map scores to positive values? How about the exponential function? This will work: if we give it scores, positive or negative, they get mapped to positive numbers. And now we have enough to construct the softmax function. If s is a vector of scores across the categories, then the probability for a category is given like this: to get the probability of category c, we raise e to the c-th score and then divide by the sum of the same thing across all categories. As expected, this does what we want: the outputs sum to one, and it's a differentiable function, meaning smoothly changing scores give smoothly changing probabilities. From this perspective, we can also see why it's called the softmax function: it's a continuous, a.k.a. soft, approximation to selecting the max score with 100% probability.

Excellent, now we're done. Except we're not. The problem with this explanation is that it doesn't tell us why we use the exponential function in particular. Since we were only asking for summing to one and differentiability, it seems any positive, differentiable, monotonic function would have done the trick. So why is the exponential function in particular used here? And whatever the reason is, it should be pretty compelling, because this thing is not exactly the simplest thing to compute: plain addition and multiplication are a lot simpler, this involves a lot of each of those, and it's quick to suffer from numerical overflow. So why do we accept these costs?

To answer this question, we need to recognize the broader context: the softmax only appears within a model that gets trained with data to make predictions. For that, we care about a loss function, something that measures how well our model's predictions agree with the data. More specifically, y is a one-hot encoding vector of the true category for a single data point. To relate to the example earlier, if we got an image of a bird, then y would be a length-6 vector with a one at the bird category's position and zeros everywhere else. Next, as earlier, s is a vector of model scores, and f(s) is our transformation of scores into a probability vector; earlier we set this as the softmax function. Overall, this loss returns a single number, and when we train the model, we minimize its sum over all the data.
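The score-to-probability mapping described above can be sketched in a few lines of NumPy. The category names come from the example; the scores are made up for illustration:

```python
import numpy as np

CATEGORIES = ["cat", "dog", "goat", "bird", "horse", "t-rex"]

def softmax(s):
    # exponentiate to force positivity, then divide by the sum;
    # subtracting max(s) avoids overflow without changing the result
    e = np.exp(s - s.max())
    return e / e.sum()

s = np.array([1.2, -0.3, 0.8, 2.5, -1.1, 0.4])  # hypothetical model scores
p = softmax(s)

# valid probabilities: all positive, summing to one
assert np.all(p > 0) and np.isclose(p.sum(), 1.0)

# the most likely category under the model
print(CATEGORIES[np.argmax(p)])  # -> bird
```

Because the exponential is monotonic, the highest score always gets the highest probability, which is the "soft approximation to the max" mentioned above.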

Segment 2 (05:00 - 10:00)

Since we care about this loss and would like to use SGD to optimize it, we also care about its gradient. As mentioned, the gradient tells us how small changes in the scores change the loss. We write it like this, and I'll emphasize that this gradient is with respect to the scores: in our example, it would be a length-6 vector telling us how much the loss changes when we nudge each category's score. In practice, this would be part of a much larger gradient calculation, where we'd compute the gradient with respect to the model's parameters; what we have here, the gradient with respect to the scores, would be a component of that. Now, the chain rule tells us this gradient breaks down as the matrix product of two terms: the first is the Jacobian of the model probabilities with respect to the model scores, and the second is the gradient of the loss with respect to the model probabilities. For those of you who'd like to see these things defined explicitly, that looks like this. It's just the chain rule of calculus with a lot of pieces; nothing interesting is happening yet.

Things get interesting when we make two choices together. First, we set our loss to the negative log likelihood, which can be written like this. Don't worry about interpreting this expression too much; just accept that optimizing it means we select that which makes the data most likely, which is an incredibly common thing to do. Second, we set our transformation from scores to probabilities to the softmax, as we did earlier. Okay, and then what? Well, the most obvious thing is that the loss becomes a bit simpler to compute. If we plug in our choices and re-express things a bit, some things cancel and we're left with something linear-ish, which is a nice thing to optimize. Now, that's very important, but it's not the remarkable thing. Remember, we're dealing with the exponential function, the only function whose derivative is itself, so we should expect the gradients to be where things get interesting.

As mentioned, the gradient is the product of two terms. The first term is the Jacobian matrix. If we're using the softmax, then after some skippable algebra we get this beautiful thing, which is just so much simpler than we deserve. First, notice we have a matrix with the model's probabilities down the diagonal and zeros everywhere else; that's probably the simplest way to construct a matrix from a vector. Second, we also have an outer product of those probabilities, which is arguably the second simplest way. And our answer is just the difference of these two simple things. Essentially, this matrix contains only the information of a single vector, and that makes it cheap to work with. Second, notice the only thing needed to compute it is the model's predicted probabilities. That means the moment we produce those probabilities, which happens on our way to computing the loss, we can calculate this Jacobian; that is certainly not true in the general case. To dive a little deeper, say these were the model probabilities. Then the Jacobian would be this, where I've repeated the model probabilities along the rows. To be clear, these numbers tell us how much nudging some score changes the probabilities across all categories. Looking at it, we see it's symmetric, and all the rows sum to zero; since it's symmetric, the columns sum to zero as well. Ultimately, this means taking gradient steps will never cause a violation of the sum-to-one constraint. Stepping back, this is just a beautiful matrix to come across. In my opinion, this is where the exponential function earned its place. If you inspect enough gradients, you know they can be messy, frustrating things, so it's nice when they turn out this easy.

But it doesn't stop there. The second term, the gradient of the loss with respect to the model probabilities, is also simple: if we substitute in our choices and do some algebra that I'm not going to walk through, we're again left with something quite simple. So now we have two simple things, and that's good news for what we're after, the gradient of the loss with respect to the scores, which is this product. Since it's a product of these simple things, it gives us something even simpler, this absolutely dirt-simple beauty: the gradient is just the difference of the model's predicted probabilities and the true output vector. To appreciate this, if we made different choices for the score transformation and the loss, we'd get a mess of terms. Remember, in the general case we're dealing with a Jacobian-vector product, and we have no guarantees of cancellation, but the choices of the softmax and the negative log likelihood play together so nicely that we get loads of it. In other words, there's no matrix multiplication to do; we can shortcut all of it with a mere difference of vectors. In fact, this is why PyTorch offers the cross-entropy function, which is just these two choices paired together: you give it the model scores and the y vectors, and it computes the loss for you.
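The cancellation described above is easy to check numerically. The sketch below (with a made-up probability vector) builds the softmax Jacobian diag(p) - p pᵀ, multiplies it by the gradient -y/p of the negative log likelihood, and confirms the product collapses to p - y:

```python
import numpy as np

p = np.array([0.05, 0.10, 0.05, 0.60, 0.05, 0.15])  # hypothetical softmax output
y = np.zeros(6)
y[3] = 1.0                                           # one-hot true label ("bird")

# Jacobian of the softmax, evaluated at p: diag(p) - outer(p, p)
J = np.diag(p) - np.outer(p, p)

# it's symmetric, and its rows (hence columns) sum to zero
assert np.allclose(J, J.T)
assert np.allclose(J.sum(axis=1), 0.0)

# gradient of the negative log likelihood with respect to the probabilities
dL_dp = -y / p

# the full chain-rule product equals the shortcut p - y
assert np.allclose(J.T @ dL_dp, p - y)
```

The rows summing to zero is what guarantees gradient steps never violate the sum-to-one constraint, and the last assertion is the "mere difference of vectors" shortcut.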

Segment 3 (10:00 - 10:00)

It utilizes what we've covered to provide a super cheap, numerically stable computation of both the loss and the gradient, which is excellent news for model training. But I should mention that's not the full story; there are other good reasons for these choices. Using the negative log likelihood comes with other interesting properties, like well-calibrated probabilities, matching empirical distributions, and well-understood asymptotic behavior. Other things can be said for the softmax, like its eagerness to pick a winning category and its continuous approximation of the argmax operation. But those things aren't what I think of when I see the softmax. I think about how it makes the last ten yards of a big, complicated model the easiest, fastest ten yards of the whole game.
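The "numerically stable" point can be illustrated with a small sketch (my own illustration of the standard log-sum-exp trick, not PyTorch's actual implementation): computing -log(softmax(s)) naively overflows for large scores, while the fused form does not.

```python
import numpy as np

def cross_entropy(s, true_idx):
    # for a one-hot target, -log softmax(s)[true] = logsumexp(s) - s[true];
    # shifting by max(s) makes every exponent non-positive, so exp never overflows
    m = s.max()
    return m + np.log(np.sum(np.exp(s - m))) - s[true_idx]

s = np.array([1000.0, -2.0, 3.0])  # an extreme score to provoke overflow

with np.errstate(over="ignore", invalid="ignore"):
    naive = -np.log(np.exp(s) / np.exp(s).sum())[0]  # exp(1000) overflows to inf

print(naive)                # nan
print(cross_entropy(s, 0))  # 0.0, as expected for a totally dominant score
```

This fused loss, combined with the p - y gradient shortcut from earlier, is why the two choices together are so cheap at training time.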

More videos from this author: Mutual Information
