What the Books Get Wrong about AI [Double Descent]


Contents (7 segments)

Segment 1 (00:00 - 05:00)

This video is sponsored by KiwiCo. More on them later. Also, I've written a whole new book on AI. You can learn more about the book at the end of this video. When I first learned machine learning, these three books were incredibly helpful, but they're also all wrong. Each book has a version of the same plot, a core tenet of machine learning theory and practice. The x-axis shows the size of a learning model and the y-axis shows the model's performance measured using some type of error metric. From here, we plot two curves. The first curve shows the model's performance on its training data. As our model becomes larger, it's able to learn more complex patterns, better fit its training data, and bring down its error. Of course, what we really care about here is how well our model will perform on examples outside of its training set. Our second curve shows the same error metric, but measured on a test set that the model hasn't seen before. Again and again in every book, the test set error curve has the same U shape. Our test set error starts high for smaller models, comes down to a nice minimum for some medium-sized model, and shoots back up as the size of our model continues to grow. The shooting-back-up part of the testing curve is due to the model overfitting the data. Many authors demonstrate the overfitting phenomenon using polynomial curve fitting. If we take a set of data points, for example this set of parabola-shaped points, set aside half of our data for testing, and fit a first-order polynomial, which is just a line, our curve fit will be poor. We can measure just how bad our fit is by taking the differences between our fitted line and our training points. Squaring and averaging these errors together, we get the mean squared error of our linear model on our training set. We can repeat this process on our test set points to compute our test set error. 
Both our training and testing errors are relatively high because our simple linear model is not powerful enough to fit our parabola-shaped data. Moving to a second-order polynomial, our parabolic model is now able to nicely fit our parabolic data, bringing down our training and testing error. Fitting a third-order model, our cubic curve is able to get very close to our training points, bringing down the error on our training set, but it starts to fit the noise in our data instead of the underlying parabolic shape, resulting in worse performance on our test set. Moving to a fourth-order polynomial, overfitting becomes even worse. Our more powerful model is now able to perfectly fit our noisy data with zero training error, but this results in wild curve fits and worse test set performance, and our resulting errors line up with the characteristic U-shaped testing error curve. This central idea is often known as the bias-variance trade-off and is named after a nice piece of statistical theory that supports the idea that our test set curve should be U-shaped. The takeaway for a whole generation of machine learning practitioners was that we must carefully limit the power of our models to match the complexity of our data to avoid overfitting. Our U-shaped test error curve, supported by the bias-variance trade-off theory, feels disciplined. It feels responsible. It's telling us that the core of machine learning is about balance. But it's also wrong. In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton successfully trained what was then considered an enormous classification neural network with around 60 million parameters, the model that today we call AlexNet. Overfitting was a significant concern for the AlexNet team. To reduce overfitting, the team used data augmentation, applying random shifts, flips, and color changes to their images while training, and used a new technique called dropout, where collections of neurons are randomly turned off during training. 
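The polynomial experiment described above is easy to reproduce. Here's a minimal sketch (not the video's actual code) using NumPy's `polyfit`; the sample size, noise level, and seed are arbitrary choices for illustration. Training error must fall (or stay flat) as the degree grows, since the models are nested; test error typically traces out the U shape described above, though the exact numbers depend on the random sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a parabola y = x^2, split evenly into train and test.
x = rng.uniform(-1, 1, 20)
y = x**2 + rng.normal(0, 0.1, 20)
x_train, y_train, x_test, y_test = x[:10], y[:10], x[10:], y[10:]

def fit_poly(degree):
    """Least-squares polynomial fit; returns (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda xs, ys: float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))
    return mse(x_train, y_train), mse(x_test, y_test)

for degree in (1, 2, 3, 4):
    train_mse, test_mse = fit_poly(degree)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

With a different seed the exact numbers change, but the monotone drop in training error is guaranteed for nested least-squares fits.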
Both data augmentation and dropout force AlexNet to learn more robust and general ways to classify images that don't depend on exactly how an image appears or on a specific pathway through the network. And the team found that without data augmentation and dropout, the model exhibited substantial overfitting. The team also used a technique called weight decay, where the model is penalized for having large weight values. If we apply the same weight decay approach to one of our overfit polynomials from earlier, we see that as we increase the amount of weight decay, our fit becomes smoother, reducing overfitting. In statistics, this is known as ridge regression. Data augmentation, dropout, and weight

Segment 2 (05:00 - 10:00)

decay are all examples of regularization techniques, where we modify the training process to prevent our model from overfitting the data. These techniques were seen as critical back in 2012 and remain very common practice in machine learning today. The takeaway for many machine learning practitioners at the time, myself included, was that large neural networks were clearly in the overfitting region of the bias-variance curve, and that without regularization these models would dramatically overfit, effectively memorizing their training examples without learning the robust pattern recognizers needed to generalize to new examples, and that through regularization we could push these models back toward the happy middle of the curve. A sneakier and more subtle conclusion, one that I certainly internalized at the time, is that in the overfitting regime, lower training set error is causally linked to higher test set error. This implication even shows up in the name we've chosen for the phenomenon. Overfitting implies that we're doing too much fitting of the training data, and that doing too much fitting, lowering our training set error too much, is causing something bad, in this case leading to higher test set error. AlexNet was a wild success, and deep learning exploded over the next few years, with larger and deeper models delivering more and more impressive results. We would expect models larger and more complex than AlexNet to require more aggressive regularization to avoid overfitting. But this turned out not to be the case, with deep learning continuing to generalize suspiciously well even without aggressive regularization measures in place. In 2016, a team at Google Brain addressed this apparent contradiction head-on in a brilliantly insightful paper called Understanding Deep Learning Requires Rethinking Generalization. 
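The weight decay idea mentioned above has a one-line closed form in the polynomial setting, where it is exactly ridge regression. The sketch below is my own toy illustration, not the video's code; the data points and the penalty strength `alpha` are made-up values. Adding the penalty shrinks the coefficient norm, which is what smooths the fit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Five noisy samples of a parabola; a 4th-order polynomial (five
# coefficients) can fit them exactly.
x = np.linspace(-1, 1, 5)
y = x**2 + rng.normal(0, 0.1, 5)

# Design matrix of monomials [1, x, x^2, x^3, x^4].
X = np.vander(x, 5, increasing=True)

def ridge_fit(alpha):
    """Closed-form ridge regression: w = (X'X + alpha*I)^-1 X'y.

    alpha is the weight decay strength; alpha=0 is ordinary least squares."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

w_plain = ridge_fit(0.0)   # interpolates the noisy points exactly
w_decay = ridge_fit(0.1)   # shrunken coefficients -> smoother curve

print("coefficient norm, no decay:  ", np.linalg.norm(w_plain))
print("coefficient norm, with decay:", np.linalg.norm(w_decay))
```

The ridge solution's norm is non-increasing in `alpha`, so any positive weight decay pulls the curve toward a smoother fit at the cost of no longer exactly fitting the noisy points.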
To test the extent to which we can actually control overfitting in deep models, the Google Brain team devised a clever experiment where they took the same ImageNet dataset that AlexNet was trained on, along with the smaller CIFAR-10 image classification dataset, completely randomized all the labels, and trained models to predict these random labels. So one cat in the ImageNet dataset would be labeled as an aircraft carrier, the next cat would be labeled as a sea snake, and so on. The only way for a model to do well on this training data is to dramatically overfit, or essentially memorize, each example: this exact cat image is an aircraft carrier, that one is a sea snake, and so on. Because the labels are randomized, there are no deeper patterns to learn. All noise, no signal. Now, if regularization was actually preventing deep models from overfitting, as was widely believed, we would expect a regularized deep model to not learn these randomly assigned labels, resulting in poor performance on the training set. Shockingly, the team showed that deep models were able to perfectly memorize all 50,000 training images in the CIFAR-10 dataset and almost all of the 1.3 million training examples in the ImageNet dataset, even with regularization in place. These models of course perform terribly on their test sets, doing no better than random guessing. So deep models, even with regularization in place, are perfectly capable of just memorizing their training data. However, when we switch back to the correct labels, the same models with the same training procedures do not memorize and instead learn robust patterns that do generalize to new data. The Google Brain team also showed that when learning from the correct labels, contrary to the AlexNet team's findings, regularization was not actually critical to avoid overfitting. By 2016, more efficient and flexible deep architectures than AlexNet had been developed. 
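The randomized-label experiment can be reproduced in miniature. The sketch below is my own toy stand-in, not the Google Brain setup: an overparameterized random-features model (more trainable parameters than training points) fit by least squares will memorize completely random labels, hitting 100% training accuracy while test accuracy stays near chance. All sizes and the seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# Random inputs with completely random +/-1 labels: all noise, no signal.
n_train, n_test, d, n_features = 40, 40, 10, 200
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = rng.choice([-1.0, 1.0], n_train)
y_test = rng.choice([-1.0, 1.0], n_test)

# Fixed random first layer + trained linear readout, analogous to a
# small two-layer network where only the final layer is trained.
W = rng.normal(size=(d, n_features))
phi = lambda X: np.tanh(X @ W)

# Minimum-norm least-squares readout: with 200 features and 40 points,
# it interpolates (memorizes) the random training labels exactly.
w = np.linalg.pinv(phi(X_train)) @ y_train

train_acc = float(np.mean(np.sign(phi(X_train) @ w) == y_train))
test_acc = float(np.mean(np.sign(phi(X_test) @ w) == y_test))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Just as in the paper's experiment, perfect training accuracy here tells us nothing about generalization: the test labels are independent noise, so test accuracy hovers around coin-flip level.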
The Google Brain team trained a newer Inception V3 architecture on the same ImageNet dataset that AlexNet was trained on and observed that when they removed data augmentation, dropout, and weight decay, the model's test set performance did decrease, but only modestly. They also found that an Inception V3 model trained without any explicit regularization performed on par with the original AlexNet results. And even more interesting, when regularization did make these modest improvements to test set accuracy, it had very limited or apparently no impact on training set error. As we would expect if regularization was moving our model back toward the center of the bias-variance curve, these models trained on the CIFAR-10 dataset show an increase in test set

Segment 3 (10:00 - 15:00)

accuracy from around 86% to 89% when adding different types of regularization. But all four models still perfectly fit the training data with accuracies of 100%. These results dramatically call into question the fundamental trade-off between training set and test set performance predicted by the typical bias-variance curves in the overfitting region. Returning to our polynomial curve fitting example, this model behavior is analogous to our curve somehow exactly fitting our noisy observations while still effectively learning the underlying parabolic shape and performing well on our testing points. Exactly fitting the training data is known as the model interpolating the data. So what's going on here? Does the bias-variance trade-off simply not apply to deep models? Do we even need to worry about overfitting when training deep neural networks? And should I throw away all of my statistics and machine learning books? Whenever I think about learning versus memorization, I can't help but think about how my own children are learning, which is why I was more than happy to partner again with this video's sponsor, KiwiCo. KiwiCo makes hands-on project kits that make learning genuinely fun for kids of all ages. My son just turned two this summer, and he's exploding with curiosity. I love these puzzles that promote spatial reasoning, and they're incredibly engaging. Even later that day when we were watching TV, he kept sneaking off to work on the puzzles. I love this. The thoughtfulness the KiwiCo team puts into the crates really makes them so much stickier than many of the toys we have. I was worried that this pirate treasure crate was a little too old for my daughter, but I was delighted when she was able to understand the treasure map that we drew together and was able to follow the map to the right part of the house. Creating and reading maps like this takes some serious abstract reasoning. It was amazing to see. 
KiwiCo makes amazing gifts for the kids and families in your life, and they make awesome learning experiences for kids of all ages. Use my code Welch Labs to receive 50% off your first crate for kids three and older, or 20% off your first Panda Crate for kids under three. Big thanks to KiwiCo for sponsoring this video. Now, back to what's really going on with the bias-variance trade-off. In 2018, a team of researchers led by Mikhail Belkin proposed an interesting alternative explanation of what's going on here. What if the traditional bias-variance trade-off wasn't exactly wrong, but wasn't the full picture? What would happen to the bias-variance curve if we just kept increasing the size of our models well beyond the overfitting regime? Are there certain combinations of models and data where we would actually see the test set error come back down? Is there something beyond overfitting? The team showed some compelling small-scale demonstrations of exactly this phenomenon on the MNIST handwritten digit dataset using a random Fourier feature model. This is essentially a small two-layer neural network where only the final layer is trained. The team showed that these models would demonstrate the classical bias-variance trade-off curve, but then suddenly shift to a new regime as model size increased further, with test set performance improving dramatically and actually exceeding the test set performance found in the classical regime. The team called the phenomenon double descent. Their hypothesis was compelling, but it remained to be seen if the double descent phenomenon could be replicated in full-scale deep models, and exactly what underlying mechanisms could be causing this unexpected behavior. The following year, in 2019, a Harvard and OpenAI team definitively showed that double descent was real, demonstrating the phenomenon across a variety of model architectures, including transformers, on both vision and language datasets. 
Interestingly, the team observed double descent behavior not only as a function of the size of their models, but also as a function of how long their models were trained for. This observation is potentially highly relevant for machine learning practitioners. It's very common to visualize test set error while training and stop training when test set error stops coming down. It's also common to see test set error start to trend up later in training. This would typically be interpreted as the model beginning to overfit its training data. But what the Harvard team remarkably found was that for certain models and datasets, if you just kept training the model, the testing error would follow a double descent behavior, coming back down, in some cases to an even lower value. If you were training one of these models

Segment 4 (15:00 - 20:00)

and assumed a classical bias-variance trade-off behavior, you would likely stop training long before you saw the double descent. I should note here that the plots we've seen so far from the Harvard team's work include a small amount of added label noise on the CIFAR-10 dataset. This makes the model more likely to overfit and the double descent curve more pronounced. We still see double descent without the added noise, but it's less dramatic. CIFAR-10 is a very clean academic dataset, so you could argue that adding label noise is a reasonable proxy for larger, noisier datasets. So double descent is a real phenomenon. But why would models behave like this? To me, seeing this behavior while training is especially confounding. Why would models start overfitting while training, only for the trend to reverse after undergoing more of the same training process? Let's return to our curve fitting example one last time. Remarkably, it turns out that double descent can also occur with simple polynomial curve fitting. We saw earlier that a second-order curve nicely fits our noisy parabolic data, and this puts us nicely at the bottom of our bias-variance curve. If we increase the order of our polynomial to three, we begin to overfit, with our training error dropping close to zero but our testing error shooting up. When we reach a fourth-order polynomial, our test error continues to increase and our training error goes to zero. Our fourth-order curve has five free parameters that are able to exactly fit our five data points. This is analogous to our image classification models exactly fitting, or interpolating, their training data. This point is known as the interpolation threshold and corresponds to the smallest model that is capable of perfectly fitting our data. Now, moving to a fifth-order polynomial, our situation changes a little. Our polynomial is still able to perfectly fit our training data, but our curve now has six free parameters while we still have only five training points. 
This means that there will actually be an infinite number of fifth-order polynomials that perfectly fit our five training points. Here are 100 different fifth-order polynomials that perfectly fit our data. How does our curve fitting algorithm choose which curve to go with? It turns out that there's a fairly natural closed-form matrix inversion solution that extends how we handle the earlier cases. The solver will effectively choose the curve with the smallest sum of squared coefficients. For example, this more chaotic curve fit corresponds to this polynomial with these six coefficients, and this simpler curve corresponds to this polynomial. If we take each coefficient from our first curve's equation, square these values, and add them together, we get 19.13. This value is known as the squared L2 norm of our coefficients. Performing the same computation on our second polynomial, we get a smaller norm, 7.04. So our solver would choose our second curve over our first. Out of all the possible fifth-order curve fits, our second curve turns out to be the lowest-norm solution, so this is the curve our solver would return. Just like our fourth-order curve, our fifth-order curve perfectly interpolates our training data, resulting in zero training error. However, our fifth-order curve is a bit less chaotic, and if we measure its test set error, we see that it's actually lower than our fourth-degree polynomial's test set error, starting to create double descent behavior. Now, there's an important technical point here about how exactly we set up our solver. Expressing our polynomial as ax^5 + bx^4 and so on turns out to create numeric instability for our solver, especially as the degree of our polynomial grows. It's common to instead use what's known as a different polynomial basis. Here we're using the Legendre basis. The curve fitting model is still a fifth-order polynomial, but rearranged in a way where our coefficients multiply what are known as the Legendre polynomials. 
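The minimum-norm behavior described above can be seen directly in NumPy: for an underdetermined fit, `np.linalg.pinv` returns the interpolating coefficient vector with the smallest L2 norm. This is a small sketch with made-up data points (not the video's exact numbers), comparing the minimum-norm solution against another curve that also fits the data exactly:

```python
import numpy as np

rng = np.random.default_rng(3)

# Five noisy parabola points; a 5th-order polynomial has six coefficients,
# so infinitely many curves pass exactly through all five points.
x = np.linspace(-1, 1, 5)
y = x**2 + rng.normal(0, 0.1, 5)

# Underdetermined design matrix: 5 equations, 6 unknowns.
X = np.vander(x, 6, increasing=True)

# pinv picks the interpolating coefficient vector with the smallest L2 norm.
w_min = np.linalg.pinv(X) @ y

# Any other interpolating solution differs by a null-space direction of X.
_, _, Vt = np.linalg.svd(X)
null_dir = Vt[-1]                    # X @ null_dir is (numerically) zero
w_other = w_min + 2.0 * null_dir     # still fits the data exactly

print("min-norm solution norm:", np.linalg.norm(w_min))
print("other solution norm:   ", np.linalg.norm(w_other))
```

Because `w_min` lies in the row space of `X` and `null_dir` is orthogonal to it, every other interpolating solution has a strictly larger norm, which is exactly the selection rule the video describes.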
These polynomials have nice mathematical properties that make curve fitting more stable. We can rearrange our Legendre polynomial representation into our typical ax^5 + bx^4 representation, but importantly, the coefficients our solver actually uses when fitting our curve are different, meaning that the minimum-norm solution is different. And if we use a standard ax^5 + bx^4 polynomial representation and pick a minimum-norm solution, we do not see double descent behavior. I'll put some links about this in the video description. So depending on our exact curve fitting

Segment 5 (20:00 - 25:00)

procedure, our solver will pick a minimum-norm solution in our chosen polynomial basis, and some common choices of basis do exhibit double descent. Jumping to a tenth-order fit, we have an enormous range of possible solutions. Here are a hundred of them. Some of these curves catastrophically overfit our data. However, again, our smallest-norm constraint chooses a smoother curve that actually kind of looks like a squiggly version of our parabola and again brings down our test set error, giving us more nice double descent behavior. Now, why does our polynomial curve fitting process demonstrate this perhaps surprising double descent behavior? Our worst-generalizing fit is precisely at the interpolation threshold, where our model exactly fits our training data for the first time. In this case, we have exactly as many constraints as we have variables, meaning there's only one unique curve we can fit, and our model is forced to contort itself exactly to the data and is more susceptible to noise, unlike in our higher-order fits, where our solver has many curves to choose from and can pick a smoother, lower-norm solution. The Harvard team lays out a similar line of thinking when reasoning about why double descent occurs in deep neural networks. For model sizes at the interpolation threshold, there is effectively only one model that fits the training data, and this interpolating model is very sensitive to noise in the training set. For overparameterized models, there are many interpolating models that fit the training set, and SGD is able to find one that memorizes or absorbs the noise while still performing well on the distribution. SGD here refers to the stochastic gradient descent algorithm used to train deep neural networks. This algorithm works very differently than the solvers we used in our curve fitting example, but it has, interestingly, been shown to arrive at norm-minimizing solutions under certain constraints, just as our curve fitting solvers do. 
So, as we train larger models or train models for longer, when these models are near the interpolation threshold, where they're just able to perfectly fit the training data for the first time, the model has less flexibility and is more likely to overfit. As we move to larger models or train for longer, our model has more flexibility, and our training algorithm is able to choose smoother, less chaotic solutions that will better generalize to new data. So, should we throw out the bias-variance trade-off, and should I throw away all my books? The first book we showed at the beginning of the video, The Elements of Statistical Learning, which is a great book, by the way, was written by three Stanford statistics professors in the early 2000s. After double descent gained notoriety in 2018 and 2019, the book's first author, Trevor Hastie, co-authored a massive 70-page paper looking into the phenomenon, titled Surprises in High-Dimensional Ridgeless Least Squares Interpolation. In 2021, Hastie and his co-authors published a new edition of An Introduction to Statistical Learning, which is in many ways a successor to his original book. The familiar U-shaped bias-variance curves are still prominently featured in the opening chapters of the book, but there is a new section on double descent in chapter 10. In this chapter, Hastie and his co-authors present the double descent phenomenon and argue that it does not contradict the bias-variance trade-off. Their argument centers around the way we measure the size or complexity of our learning model. In our polynomial double descent example, our x-axis corresponds to the degree of our polynomial. Hastie and his co-authors essentially argue that after we pass the interpolation threshold, the degree of our polynomial is no longer the right measure of model complexity. They also use a somewhat different nomenclature, calling their measure flexibility. This is a fair point. 
As we saw in our polynomial curve fitting example, once we pass the interpolation threshold, we have many possible fits to choose from, and when our solver picks the lowest-norm curve, the result is in many ways simpler than the curve we get at the interpolation threshold, where we only have one choice. The variance in the bias-variance trade-off refers to a specific statistical measure of the variability of our fit. Here's our second-order curve fit from earlier. Now, if we take a different random sample of our underlying parabola and refit our second-order polynomial, we get a slightly different fit, like this. Here's a third data sample and fit. And here are 50 more second-order fits from different random samples. From here, we can compute the mean and standard deviation across all these

Segment 6 (25:00 - 30:00)

fits. Here the shaded region corresponds to one standard deviation above and below the mean fit. Our average fit is quite close to our true underlying parabola. The difference between these two curves is the bias in the bias-variance trade-off. Note that we need to know the true underlying function to compute bias, and in practice we generally do not know this function. So, as we've seen, the bias-variance trade-off and U-shaped testing error curve are typically conceptual tools; we can't actually compute bias for most of the problems that we care about. The variance in the bias-variance trade-off is proportional to the square of the yellow shaded region's width and measures the variability of our various fits. Now, returning to our test set error measurements, we can see a really compelling part of bias-variance theory. It turns out that for a given fit, we can decompose our overall test set error into a sum of our bias squared, our variance, and a final irreducible error term. For our second-order fit, our largest error component is our variance. So our theory is telling us here that the majority of our error is coming from the variability of our fits that results from the randomness of our data. Shifting to a first-order fit, here are 100 different linear fits based on different samples of our underlying parabola. Taking our mean and standard deviation as we did before, we see that unlike our second-order fit, our average fit is now quite far from our target parabola function, leading to a high bias. This high bias means that our model is unable to fit the actual underlying function. Moving to our third-order fits, these polynomials are more sensitive to the noise in our data. Here are 100 different fits based on different random samples of our underlying parabola. Collapsing these into a mean and standard deviation, we see that our bias isn't too bad, but our variance is enormous. This means that our test set error in our third-order case is dominated by our variance term. 
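The decomposition described here can be estimated numerically by refitting on many fresh samples, just like the repeated-fit figures. Below is a minimal Monte Carlo sketch of my own (trial counts, sample size, and noise level are arbitrary choices), estimating bias squared and variance on a grid and combining them with the irreducible noise term:

```python
import numpy as np

rng = np.random.default_rng(4)

true_f = lambda x: x**2          # the "true" underlying parabola
x_grid = np.linspace(-1, 1, 50)  # points where we evaluate each fit
noise_var = 0.1**2               # irreducible error term

def simulate_fits(degree, n_trials=500, n_points=10):
    """Fit many polynomials to fresh noisy samples; return predictions on x_grid."""
    preds = np.empty((n_trials, x_grid.size))
    for t in range(n_trials):
        x = rng.uniform(-1, 1, n_points)
        y = true_f(x) + rng.normal(0, np.sqrt(noise_var), n_points)
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_grid)
    return preds

for degree in (1, 2, 3):
    preds = simulate_fits(degree)
    # bias^2: squared gap between the average fit and the true function
    bias_sq = float(np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2))
    # variance: spread of the individual fits around their average
    variance = float(np.mean(preds.var(axis=0)))
    print(f"degree {degree}: bias^2 {bias_sq:.4f}, variance {variance:.4f}, "
          f"predicted test MSE {bias_sq + variance + noise_var:.4f}")
```

As the transcript notes, this computation requires knowing `true_f`, which is exactly why bias and variance are conceptual tools rather than quantities we can measure on real problems.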
This classic U-shaped section of our error plot nicely demonstrates the bias-variance trade-off. Our first-order fit is unable to match our underlying parabola and has a large bias. As we increase the order of our fit, our bias comes down, but our variance increases as our models become more sensitive to noise. As we cross the interpolation threshold, our fourth- and fifth-order fits are also very sensitive to the noise in our random samples, meaning variance is our primary source of error. However, after we pass our interpolation threshold, as we've seen, our solver is able to choose smoother, lower-norm solutions. This brings down the overall variance of our fits, bringing down our average error and creating the double descent behavior. So, while it's certainly still possible to decompose our errors into bias and variance components at and beyond the interpolation threshold, the trade-off between bias and variance is no longer the primary driver of changes in test set error as it is in the classical U-shaped region of our curve. So, as Hastie and his co-authors state, technically the bias-variance trade-off theory does not mean that our curve has to be U-shaped, and the shape of our curve will depend on how we measure the flexibility or capacity of our models. However, when every presentation of this central theory in all the books and lectures you see presents the same U shape, it's hard not to internalize this U shape as what the theory really says. That was certainly my experience when learning this subject, and overfitting and this fundamental U shape informed much of my work in machine learning for years. It's such a nice little mental model. I was really shocked when I learned that this picture was incomplete in such an important way. When researching this video, I had a chance to chat with Mikhail Belkin, the lead author of the double descent paper. Belkin told me how nerve-wracking it was to challenge such a prominent and widely held theory. 
He told me that he and his co-authors sought out some extra opinions before publishing. It's just so easy to accept, and in this case overgeneralize, theories like this, especially when they fit so nicely into a simple mental image. It's really important to note here that double descent behavior is not universal. For plenty of datasets and models, testing error will just continue to increase with model size and never come back down. Double descent depends on a number of factors, including the level of noise in the dataset, and critically depends on how a given model handles the overparameterized case, where many available solutions perfectly fit the data. This is known as the inductive bias of a given model. Deep models appear to have an incredibly friendly inductive bias. They're clearly capable

Segment 7 (30:00 - 34:00)

of catastrophic overfitting but somehow generalize incredibly well in practice. As Simon Prince says in his great book on deep learning, if the efficient fitting of neural networks is startling, their generalization to new data is dumbfounding. Deep learning theory is still very much catching up to practice. It's like we're observing some new phenomenon of nature, one that is remarkably capable of acting quite intelligently, and we're trying to figure out how it works. If the bias-variance trade-off is like Newtonian physics, it feels like we're getting glimpses of Einstein's general relativity with double descent. I'm really looking forward to seeing how the theories develop in the coming years. But for now, at least, I'm going to hang on to my books. If you're looking for a new book to hang on to, check out the Welch Labs Illustrated Guide to AI. It's coming out later this year. This is the book I've always wanted to write. We've really leaned into the visuals. The book has hundreds of figures. I especially like these full-page spreads. This one shows how loss landscapes are computed. On the next page, we jump into this super high-quality overhead contour plot view of our landscape, and we show how we might expect our model to work its way through valleys to reach its global minimum, but it instead creates what looks like a wormhole on our loss landscape. We're putting a huge amount of effort into each chapter to create these kinds of visuals and deep explanations, trying to give the most visceral feel we can for how this stuff really works. Each chapter includes supporting Python code that walks through the key results from that chapter, and there's also a supporting GitHub repo that's a bit more comprehensive. At the end of each chapter, you'll also find exercises. We've put a ton of thought into these. 
Here's an exercise from the chapter on backpropagation, where you're given a small, complete neural network and asked to move some data through the network using a few equations. Then you're asked to compute the network's gradients and use your computed gradients to fill in steps in the model's real learning process. These exercises are designed to get you as hands-on as possible with modern AI, and solutions are in the back of the book. Most of the exercises are written or programming exercises, but my favorite is probably this spread that gives you instructions for building your own perceptron machine. The book starts with a fresh take on the fundamentals, the perceptron, gradient descent, backpropagation, deep models, and AlexNet, and then uses this foundation to dive into cutting-edge topics including neural scaling laws, mechanistic interpretability, and AI image and video generation models like Sora. Each chapter goes along with a Welch Labs video that came out over the last 18 months. I really think that the book is the best way to get deeper into each video's topic. The book is great for self-study, AI courses, or just looking great on your coffee table. You can pre-order a copy today at welchlabs.com, and books ship on or before December 15th. Last year, we shipped the Imaginary Numbers book around the same date but completely sold out of our print run in November. So, if you want to make sure you get a copy in time for the holidays, I do recommend ordering early. Finally, I know that many of you live outside the US and are interested in Welch Labs products. We're only accepting pre-orders currently to US addresses. Me, my family, and my tiny team handle all fulfillment, so we're still very limited on where we can ship. You can join our international shipping wait list at the link in the description. I don't know yet when we'll be able to offer international shipping, but I promise we're working on it. Thank you so much for your time and support. 
Books and education are really near and dear to my heart, and we've poured a ton of effort into this book. I really think you're going to like it.
