# This AI Learns From Humans…and Exceeds Them

## Metadata

- **Channel:** Two Minute Papers
- **YouTube:** https://www.youtube.com/watch?v=T8YOzqy7t5Y
- **Date:** 10.01.2019
- **Duration:** 4:15
- **Views:** 30,312
- **Source:** https://ekstraktznaniy.ru/video/14374

## Description

The paper "Reward learning from human preferences and demonstrations in Atari" is available here:
https://arxiv.org/abs/1811.06521

Pick up cool perks on our Patreon page:
https://www.patreon.com/TwoMinutePapers

Crypto and PayPal links are available below. Thank you very much for your generous support!
- PayPal: https://www.paypal.me/TwoMinutePapers
- Bitcoin: 1a5ttKiVQiDcr9j8JT2DoHGzLG7XTJccX
- Ethereum: 0xbBD767C0e14be1886c6610bf3F592A91D866d380
- LTC: LM8AUh5bGcNgzq6HaV1jeaJrFvmKxxgiXg

We would like to thank our generous Patreon supporters who make Two Minute Papers possible:
313V, Alex Haro, Andrew Melnychuk, Angelos Evripiotis, Anthony Vdovitchenko, Brian Gilman, Christian Ahlin, Christoph Jadanowski, Dennis Abts, Eric Haddad, Eric Martel, Evan Breznyik, Geronimo Moralez, Jason Rollins, Javier Bustamante, John De Witt, Kaiesh Vohra, Kjartan Olason, Lorin Atzberger, Marcin Dukaczewski, Marten Rauschenberg, Maurits van Mastrigt, Michael Albrecht, Michael Jensen, Morten Punnerud 

## Transcript

### Segment 1 (00:00 - 04:00)

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. This is a collaboration between DeepMind and OpenAI on using human demonstrations to teach an AI to play games really well. The basis of this work is reinforcement learning, which is about choosing a sequence of actions in an environment to maximize a score. For some games, this score is provided by the game itself, but in more complex games, for instance, ones that require exploration, this score is not very useful for training an AI.

In this project, the key idea is to use human demonstrations to teach an AI how to succeed. This means that we can sit down, play the game, show the footage to the AI, and hope that it learns something useful from it. Now, the most trivial implementation of this would be to imitate the footage too closely, or in other words, simply redo what the human has done. That would be a trivial endeavor, and it is the most common way of misunderstanding this work, so I will emphasize that this is not what is happening here. Just imitating the human player would not be very useful for two reasons: one, it puts too much burden on the humans, which is not what we want, and two, the AI could never be significantly better than the human demonstrator, which is also not what we want. In fact, if we have a look at the paper, the first figure shows right away how badly a simple imitation program performs. That is not what this algorithm is doing.

What it does instead is look at the footage as the human plays the game and try to guess what they were trying to accomplish. Then, we can tell a reinforcement learner that this is now our reward function and that it should train to become better at it. As you can see here, it can play an exploration-heavy game such as Atari's "Hero", and in the footage above you see the rewards over time; the higher, the better.
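The two-stage pipeline described above, first guess a reward function from human demonstrations, then let a reinforcement learner optimize it, can be sketched in miniature. This is a toy illustration, not the paper's implementation: the chain world, the count-based reward stand-in, and all parameters below are invented for the sketch, whereas the paper trains a deep reward model from human preferences and demonstrations on Atari frames.

```python
import numpy as np

# Stage 1: infer a reward function from a demonstration. Here we use a
# crude stand-in: states the demonstrator visits often get higher
# learned reward. (The paper instead fits a deep reward model.)
N_STATES = 6                            # 1-D chain: states 0..5, goal at 5
demo = [0, 1, 2, 3, 4, 5, 5, 5, 5, 5]   # hypothetical human demo: walk right, stay at goal
visits = np.bincount(demo, minlength=N_STATES)
learned_reward = visits / visits.sum()  # frequently visited states score higher

# Stage 2: ordinary Q-learning, but against the LEARNED reward rather
# than a game score, so the agent can refine the behavior on its own
# and is not limited to copying the demonstrator move for move.
Q = np.zeros((N_STATES, 2))             # actions: 0 = left, 1 = right
alpha, gamma, eps = 0.5, 0.9, 0.1
rng = np.random.default_rng(0)
for _ in range(2000):
    s = int(rng.integers(N_STATES))     # exploring starts
    for _ in range(20):
        a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmax())
        s2 = min(N_STATES - 1, s + 1) if a == 1 else max(0, s - 1)
        r = learned_reward[s2]          # reward comes from the learned model
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

policy = Q.argmax(axis=1)               # greedy policy after training
print(policy)                           # expect "move right" in every state
```

Because the agent optimizes the inferred reward rather than replaying the demonstration, it can in principle discover action sequences the demonstrator never showed, which is the point the narration emphasizes.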
This AI performs really well in this game, and it also significantly outperforms reinforcement learner agents trained from scratch on Montezuma's Revenge, although it can still get stuck on a ladder. We discussed earlier a curious AI that quickly got bored by ladders and moved on to more exciting endeavors in the game. The performance of the new agent seems roughly equivalent to that of an agent trained from scratch in the game Pong, presumably because of the lack of exploration and the fact that it is very easy to understand how to score points in this game.

But wait, in the previous episode we talked about an algorithm where we didn't even need to play; we could just sit in our favorite armchair and direct the algorithm. So why play? Well, just providing feedback is clearly very convenient, but since we can only specify what we liked and what we didn't like, it is not very efficient. With the human demonstrations here, we can immediately show the AI what we are looking for, and since it is able to learn the underlying principles, improve further, and eventually become better than the human demonstrator, this work provides a highly desirable alternative to existing techniques. Loving it.

If you have a look at the paper, you will also see how the authors incorporated a cool additional step into the pipeline where we can add annotations to the training footage, so make sure to have a look! Also, if you feel that a bunch of these AI videos a month are worth a dollar, please consider supporting us at Patreon.com/twominutepapers. You can also pick up cool perks like getting early access to all of these episodes, or getting your name immortalized in the video description. We also support cryptocurrencies and one-time payments; the links and additional information for all of these are available in the video description. With your support, we can make better videos for you. Thanks for watching and for your generous support, and I'll see you next time!
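The preference-feedback mechanism mentioned in the transcript, where the human only says which of two behaviors they liked better, can also be sketched. This is a hedged toy version using a Bradley-Terry model with a linear reward and synthetic data; every name and number below is invented for illustration, while the paper fits a deep reward network to real human comparisons.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])   # hidden reward the synthetic "human" judges by
w = np.zeros(3)                       # learned linear reward weights

def pref_data(n):
    # n pairs of 5-step segments with 3 features per step; the "human"
    # prefers whichever segment has the higher true return
    pairs = []
    for _ in range(n):
        a, b = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
        pairs.append((a, b, (a @ w_true).sum() > (b @ w_true).sum()))
    return pairs

train = pref_data(500)
for _ in range(200):                  # gradient descent on the logistic loss
    grad = np.zeros(3)
    for a, b, a_pref in train:
        d = (a - b).sum(axis=0)       # difference of segment feature sums
        p = 1.0 / (1.0 + np.exp(-(w @ d)))   # model's P(segment A preferred)
        grad += (p - float(a_pref)) * d      # Bradley-Terry gradient
    w -= 0.05 * grad / len(train)

# the learned reward should rank held-out pairs like the hidden one
acc = np.mean([((a @ w).sum() > (b @ w).sum()) == pref
               for a, b, pref in pref_data(200)])
print(acc)
```

This captures why pure feedback is convenient but sample-hungry: each comparison reveals only one bit about the reward, whereas a demonstration, as the narration notes, shows the desired behavior directly.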
