[RE-UPLOAD] MOMENTUM Gradient Descent (in 3 minutes) *** No Background Music***


Segment 1 (00:00 - 03:00)

Take a quadratic function and apply gradient descent to find its minimum. One thing that you will notice immediately is that it bounces around. The main reason gradient descent zigzags in this fashion is that it is completely amnesiac: it keeps absolutely no memory of the past. If the negative gradient points to the left, it goes to the left. And if at the next iteration it points to the right, then it goes to the right. This effect is even more pronounced if the graph of the function is flat in one direction and narrow in the other. We can actually make things more precise. The ratio between the lengths of the longest and shortest principal axes of this ellipse, squared, is called the condition number and is traditionally denoted by kappa. The closer kappa is to one, the nicer the function is and the faster it is to get to the minimizer. This intuition can be formalized: it can be shown that in order to get to within epsilon of the minimizer, we need on the order of kappa * log(1/epsilon) iterations. Therefore, a large condition number can make gradient descent take a large number of iterations to converge.

And now we get to momentum gradient descent. This is a very simple yet extremely powerful idea to fight this problem and accelerate the convergence of gradient descent. Here is how it works. When you are at iteration k, you compute the gradient just as you would in vanilla gradient descent. But instead of following this gradient right away, you first extrapolate the previous direction a little bit further, and only then do you follow the negative gradient. You can think of this extrapolation step as momentum carrying over from previous iterations. And you can already see in this toy example the benefit of this seemingly small modification. Formally, we can show that with momentum gradient descent we only need sqrt(kappa) * log(1/epsilon) iterations, so the time saving can be huge if kappa is large.
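The contrast described above can be reproduced in a few lines. The sketch below (not from the video; function and step-size choices are my own) runs vanilla gradient descent and heavy-ball momentum on an ill-conditioned quadratic with kappa = 100, using the textbook parameter choices lr = 4/(sqrt(L)+sqrt(mu))^2 and beta = ((sqrt(kappa)-1)/(sqrt(kappa)+1))^2:

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * (mu * x1^2 + L * x2^2),
# so the condition number is kappa = L / mu = 100.
mu, L = 1.0, 100.0
kappa = L / mu
grad = lambda x: np.array([mu, L]) * x

def vanilla_gd(x, steps, lr=1 / L):
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def momentum_gd(x, steps,
                lr=4 / (np.sqrt(L) + np.sqrt(mu)) ** 2,
                beta=((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2):
    v = np.zeros_like(x)
    for _ in range(steps):
        # carry momentum from previous iterations, then take a gradient step
        v = beta * v - lr * grad(x)
        x = x + v
    return x

x0 = np.array([1.0, 1.0])
print(np.linalg.norm(vanilla_gd(x0, 300)))   # still noticeably far from the minimizer at 0
print(np.linalg.norm(momentum_gd(x0, 300)))  # essentially converged
```

The flat direction (eigenvalue mu) is exactly where vanilla gradient descent crawls at rate (1 - 1/kappa) per step, while momentum contracts at roughly (1 - 1/sqrt(kappa)), matching the sqrt(kappa) speedup in the iteration counts above.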
Momentum gradient descent is actually useful well beyond quadratic functions. For example, Nesterov looked at smooth strongly convex functions and developed a closely related variant of this method where, instead of taking the gradient at the current iterate, you first extrapolate and compute the gradient at the extrapolated point. He showed that this method, which now takes his name, is faster than any other algorithm that relies solely on computing the gradient. Of course, most functions that you want to optimize in real life are not convex, let alone quadratic. But even then, momentum gradient descent is still widely used because it performs very well in practice. To understand just how powerful this idea of momentum gradient descent is, let us see not one but two interpretations of how it works.

The first interpretation is quite simple. When you unpack this recursive formula, you realize that momentum gradient descent keeps an exponential moving average of past gradients. The second interpretation is even more beautiful. You can visualize the current iterate as a particle moving in a potential field given by the function f, with friction. The force felt by the particle is the negative gradient of this function. Newton's law of motion says that force equals mass times acceleration, so let us link this force to the acceleration of the particle. Integrating over a small unit of time gives us the velocity of the particle, to which we add a friction term -mu * v_{t-1}. Integrating a second time gives us the position of the particle. And after a simple reparameterization, we recover momentum gradient descent. This was momentum gradient descent in 3 minutes. If you like the video, make sure to like and subscribe. And see you next time.
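The first interpretation can be verified directly: unrolling the recursion v_k = beta * v_{k-1} - lr * g_k shows that the velocity is an exponentially weighted sum of all past gradients. A minimal check, with stand-in random gradients (variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = [rng.standard_normal(3) for _ in range(5)]  # stand-in gradients g_0, ..., g_4
beta, lr = 0.9, 0.1

# Recursive form: v_k = beta * v_{k-1} - lr * g_k
v = np.zeros(3)
for g in grads:
    v = beta * v - lr * g

# Unrolled form: v_k = -lr * sum_j beta^(k-1-j) * g_j,
# i.e. an exponentially decaying weighting of past gradients
k = len(grads)
v_unrolled = -lr * sum(beta ** (k - 1 - j) * grads[j] for j in range(k))

print(np.allclose(v, v_unrolled))  # the two forms agree
```

Older gradients are discounted by successive powers of beta, which is why a single zigzagging component averages itself out while a consistent descent direction accumulates.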

More videos by the author — Visually Explained
