# A Bitter AI Lesson - Compute Reigns Supreme!

## Metadata

- **Channel:** Two Minute Papers
- **YouTube:** https://www.youtube.com/watch?v=wEgq6sT1uq8
- **Date:** 07.04.2019
- **Duration:** 9:10
- **Views:** 77,481
- **Source:** https://ekstraktznaniy.ru/video/14332

## Description

📝 The article "The Bitter Lesson" is available here:
http://www.incompleteideas.net/IncIdeas/BitterLesson.html

Nice Twitter thread on this video: https://twitter.com/karoly_zsolnai/status/1114867598724931585

❤️ Pick up cool perks on our Patreon page: https://www.patreon.com/TwoMinutePapers

🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible:
313V, Alex Haro, Andrew Melnychuk, Angelos Evripiotis, Anthony Vdovitchenko, Brian Gilman, Bruno Brito, Bryan Learn, Christian Ahlin, Christoph Jadanowski, Claudio Fernandes, Dennis Abts, Eric Haddad, Eric Martel, Evan Breznyik, Geronimo Moralez, Javier Bustamante, John De Witt, Kaiesh Vohra, Kasia Hayden, Kjartan Olason, Levente Szabo, Lorin Atzberger, Marcin Dukaczewski, Marten Rauschenberg, Maurits van Mastrigt, Michael Albrecht, Michael Jensen, Morten Punnerud Engelstad, Nader Shakerin, Owen Campbell-Moore, Owen Skarpness, Raul Araújo da Silva, Richard Reis, Rob Rowe, Robin Graham, Ryan Monsurate, Sha

## Transcript

### Segment 1 (00:00 - 05:00)

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. Before we start, I'd like to tell you that this video is not about a paper, and it is not going to be two minutes. Welcome to Two Minute Papers! This piece bears the name "The Bitter Lesson" and was written by Richard Sutton, a legendary Canadian researcher who has contributed a great deal to reinforcement learning research. And what a piece this is! It is a short article on how we should do research, and ever since I read it, I couldn't stop thinking about it, and as a result, I couldn't not make a video on this topic. We really have to talk about this. It takes less than 5 minutes to read, so before we talk about it, you can pause the video and click the link to it in the video description.

So, in this article, he makes two important observations. Number one, he argues that the best-performing learning techniques are the ones that can leverage computation, or, in other words, methods that improve significantly as we add more compute power. Long ago, people tried to encode lots of human knowledge of strategies into their Go AI, but did not have enough compute power to make a truly great algorithm. And now we have AlphaGo, which contains minimal information about Go itself, and it is better than the best human players in the world. And, number two, he recommends that we put as few constraints on a problem as possible. He argues that we shouldn't try to rebuild the mind, but should try to build a method that can capture arbitrary complexity and scale up with hardware. Don't try to make it work like your brain: make something as general as possible, make sure it can leverage computation, and it will come up with something that is way better than our brain. So, in short, keep the problem general, and don't encode your knowledge of the domain into your learning algorithms.

The weight of this sentence is not to be underestimated, because these seemingly simple observations sound really counterintuitive. It encourages us to do the exact opposite of what we are currently doing. Let me tell you why. I have fond memories of the early lectures I attended in cryptography, where we had a look at ciphertexts. These are very much like the encrypted messages that children like to write to each other at school, which look like nonsense to the unassuming teacher, but can be easily decoded by another child who has the key. This key describes which symbol corresponds to which letter. Let's assume that one symbol means one letter; even then, if we don't have any additional knowledge, this is still not an easy problem to crack. But in this course, we soon coded up algorithms that were able to crack messages like this in less than a second. How exactly? Well, by inserting additional knowledge into the system. For instance, we know the relative frequency of each letter in every language. In English, the letter "e" is the most common by far, and then come "t", "a", and the others. The fact that we are seeing symbols instead of letters doesn't really matter, because we just look up the most frequent symbol in the ciphertext and we immediately know that, okay, that symbol is going to be the letter "e", and so on. See what we have done here? Just by inserting a tiny bit of knowledge, a very difficult problem suddenly turned into a trivial one. So much so that anyone can implement this after their second cryptography lecture. And somehow, Richard Sutton argues that we shouldn't do that? Doesn't that sound crazy?
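For the curious, here is a minimal sketch of that frequency-analysis trick in Python. This is an illustrative reconstruction, not code from the lecture: it assumes a one-symbol-per-letter substitution cipher and matches purely by frequency rank, which needs a fairly long ciphertext to be reliable.

```python
from collections import Counter

# English letters ordered by approximate relative frequency:
# "e" first, then "t", "a", and so on.
ENGLISH_FREQ_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

def crack_substitution(ciphertext: str) -> str:
    """Guess a plaintext by mapping the most frequent cipher symbol
    to "e", the second most frequent to "t", and so on."""
    # Count cipher symbols, most common first; spaces and punctuation are ignored.
    counts = Counter(c for c in ciphertext if c.isalpha())
    ranked_symbols = [symbol for symbol, _ in counts.most_common()]
    # Build the guessed key by rank-matching against English frequencies.
    key = dict(zip(ranked_symbols, ENGLISH_FREQ_ORDER))
    # Symbols without a guess (beyond the 26 ranks) are left untouched.
    return "".join(key.get(c, c) for c in ciphertext)
```

On a long enough ciphertext, rank-matching alone recovers much of the message; in practice, bigram statistics and dictionary checks are usually layered on top for shorter texts.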
So, what gives? Well, let me explain through an example from light transport research that demonstrates his point. Path tracing is one of the first and simplest algorithms in the field, and in many regards it is vastly inferior to Metropolis Light Transport, a much smarter algorithm. However, with our current powerful graphics cards, we can compute so many more rays with path tracing that in many cases it wins over Metropolis. In this case, compute reigns supreme: the hardware scaling outmuscles the smarts, and we haven't even talked about how much easier it is for engineers to maintain and improve a simpler system.

The field of natural language processing has many decades of research on teaching machines how to understand, simplify, correct, or even generate text. After so many papers and handcrafted techniques that insert our knowledge of linguistics into these systems, who would have thought that OpenAI would be able to come up with a relatively simple neural network, with so little prior knowledge built in, that is able to write articles that sound remarkably lifelike?
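Going back to the path tracing example for a moment: at its core, path tracing is a brute-force Monte Carlo estimator, and the error of such an estimator shrinks roughly as one over the square root of the sample count, so every extra ray the hardware affords buys image quality. Here is a minimal one-dimensional sketch of that scaling; the integrand is a stand-in, not a rendering equation:

```python
import math
import random

def mc_estimate(f, n: int) -> float:
    """Plain Monte Carlo: estimate the integral of f over [0, 1]
    by averaging n uniform random samples."""
    return sum(f(random.random()) for _ in range(n)) / n

# Stand-in integrand: sin(pi * x) on [0, 1]; the exact integral is 2 / pi.
exact = 2.0 / math.pi
for n in (10**2, 10**4, 10**6):  # each step is 100x more "compute"
    estimate = mc_estimate(lambda x: math.sin(math.pi * x), n)
    print(f"n={n:>8}  estimate={estimate:.5f}  error={abs(estimate - exact):.5f}")
```

The printed error drops by roughly a factor of 10 for every 100x increase in samples, which is exactly the kind of improvement that simply arrives with faster hardware.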

### Segment 2 (05:00 - 09:00)

We will talk about this method in more detail in this series soon. And here comes the bitter lesson. Doing research the classical way, by inserting knowledge into a solution, is very satisfying: it feels right, it feels like doing research and making progress, and it makes it easy to show in a new paper what exactly the key contributions are. However, it may not be the most effective way forward. Quoting the article, and I recommend that you pay close attention to this: "The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach."

In our cryptography problem from earlier, of course the letter frequency solution and other linguistic tricks are clearly much, much better than a solution that doesn't know anything about the domain. Of course! However, when we later have hardware that is 100 times faster, this knowledge may actually inhibit finding a solution that is way, way better. This is why he also claims that we shouldn't try to build intelligence by modeling our brain in a computer simulation. It's not that the "our brain" approach doesn't work; it does, but only in the short run. In the long run, we will be able to add more hardware to a learning algorithm, it will find more effective structures to solve problems, and it will eventually outmuscle our handcrafted techniques. In short, this is the lesson: when facing a learning problem, keep your domain knowledge out of the solution, and use more compute. More compute gives us more learning, and more general formulations give us a better chance of finding something relevant.

So, this is indeed a harsh lesson. This piece sparked great debates on Twitter; I have seen great points both for and against this sentiment. What do you think? Let me know in the comments. As with everything in science, this piece should be subject to debate and criticism, and therefore I'd love to read as many people's takes on it as possible.

This piece has implications for my own thinking as well, so please allow me to add three more small personal notes that kept me up at night over the last few days. Note number one: the bottom line is that whenever we build a new algorithm, we should always bear in mind which parts would still be truly useful if we had 100 times the compute power we have now. Note number two: a corollary of this thinking is that, arguably, the hardware engineers who make these new and more powerful graphics cards may be contributing at least as much to AI as most AI research does. And note number three: to me, it feels like this almost implies that it is best to join the big guys, where all the best hardware is. I work in an amazing small to mid-sized lab at the Technical University of Vienna, and in the last few years I have given relatively little consideration to invitations from some of the more coveted and well-funded labs. Was it a mistake? Should I change that? I really don't know for sure. If for some reason you haven't read the piece at the start of the video, make sure to do it after watching this. It's really worth it.
In the meantime, interestingly, the non-profit AI research lab OpenAI also established a for-profit, or what they like to call a "capped-profit", company to be able to compete with the other big guys like DeepMind and Facebook Reality Labs. I think Richard has a solid point here. Thanks for watching and for your generous support, and I'll see you next time!
