# DeepMind's AI Masters Even More Atari Games | Two Minute Papers #238

## Metadata

- **Channel:** Two Minute Papers
- **YouTube:** https://www.youtube.com/watch?v=oWpp1YYcCsU
- **Date:** 21.03.2018
- **Duration:** 3:24
- **Views:** 24,130

## Description

The paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" is available here:
https://arxiv.org/abs/1802.01561

Update: Its source code is now available here: https://github.com/deepmind/scalable_agent

DeepMind Lab: https://arxiv.org/abs/1612.03801

Our Patreon page: https://www.patreon.com/TwoMinutePapers

We would like to thank our generous Patreon supporters who make Two Minute Papers possible:
Andrew Melnychuk, Brian Gilman, Christian Ahlin, Christoph Jadanowski, Dennis Abts, Emmanuel, Eric Haddad, Esa Turkulainen, Evan Breznyik, Frank Goertzen, Malek Cellier, Marten Rauschenberg, Michael Albrecht, Michael Jensen, Nader Shakerin, Raul Araújo da Silva, Robin Graham, Shawn Azman, Steef, Steve Messina, Sunil Kim, Torsten Reil.
https://www.patreon.com/TwoMinutePapers

One-time payment links are available below. Thank you very much for your generous support!
PayPal: https://www.paypal.me/TwoMinutePapers
Bitcoin: 13hhmJnLEzwXgmgJN7RB6bWVdT7WkrFAHh
Ethereum: 0x002BB163DfE89B7aD0712846F1a1E53ba6136b5A
LTC: LM8AUh5bGcNgzq6HaV1jeaJrFvmKxxgiXg

Thumbnail background image credit: https://pixabay.com/photo-1548365/
Splash screen/thumbnail design: Felícia Fehér - http://felicia.hu

Károly Zsolnai-Fehér's links:
Facebook: https://www.facebook.com/TwoMinutePapers/
Twitter: https://twitter.com/karoly_zsolnai
Web: https://cg.tuwien.ac.at/~zsolnai/

## Contents

### [0:00](https://www.youtube.com/watch?v=oWpp1YYcCsU) Segment 1 (00:00 - 03:00)

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. Reinforcement learning is a learning algorithm that we can use to choose a set of actions in an environment to maximize a score. There are many applications of such learners, but we typically cite video games because of the diverse set of challenges they can present the player with. And in reinforcement learning, we typically have one task, like learning backflips, and one agent that we wish to train to perform it well. This work is DeepMind's attempt to supercharge reinforcement learning by training one agent that can perform a much wider variety of tasks. Now, this clearly means that we have to acquire more training data and also be prepared to process all this data as effectively as possible. By the way, the test suite you see here is also new; typical tasks in this environment involve pathfinding through mazes, collecting objects, finding keys to open their matching doors, and more. And every Fellow Scholar knows that the paper describing its details is, of course, available in the description.

This new technique builds upon an earlier architecture that was also published by DeepMind. This earlier architecture, A3C, unleashes a bunch of actors into the wilderness, each of which gets a copy of the playbook that contains the current strategy. These actors then play the game independently, and periodically stop and write what worked and what didn't into this playbook.

With the new IMPALA architecture, there are two key changes. One, in the middle, we have a learner, and the actors don't report what worked and what didn't to this learner; they share their raw experiences instead. Later, the centralized learner draws the proper conclusions from all this data. Imagine if each football player in a team tried to tell the coach the things they tried on the field and what worked. That is surely going to work at least okay, but instead of these conclusions, we could aggregate all the experience of the players into some sort of centralized hive mind, and get access to much more, and higher-quality, information. Maybe we will see that a strategy only works well if executed by players who are known to be faster than their opponents on the field.

The other key difference is that with traditional reinforcement learning, we play for a given number of steps, then stop and perform learning. With this technique, playing and learning are decoupled, therefore it is possible to create an algorithm that performs both of them continuously. This also raises new questions, so make sure to have a look at the paper, specifically the part about the new off-policy correction method by the name of V-trace.

When tested on 30 of these levels and a bunch of Atari games, the new technique was typically able to double the score of the previous A3C architecture, which was also really good. And at the same time, it is at least 10 times more data-efficient, and its knowledge generalizes better to other tasks. We have had many episodes on neural network-based techniques, but as you can see, research on the reinforcement learning side is also progressing at a remarkable pace.

If you have enjoyed this episode, and you feel that 8 science videos a month is worth a dollar, please consider supporting us on Patreon. You can also pick up cool perks like early access to these episodes. The link is available in the video description. Thanks for watching and for your generous support, and I'll see you next time!
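
The decoupling the transcript describes, with actors continuously generating experience and one central learner consuming it, can be sketched in a few lines. Below is a minimal, self-contained toy in Python; the two-action bandit environment and one-parameter policy are purely illustrative stand-ins (not DeepMind's API or the paper's deep network), and only the structure reflects IMPALA: actors ship raw trajectories through a queue to a learner that never pauses them.

```python
import math
import queue
import random
import threading

# Toy stand-ins so the sketch runs end to end; the real system uses
# DeepMind Lab environments and a deep network, not this bandit.
class ToyEnv:
    def step(self, action):
        return 1.0 if action == 1 else 0.0   # action 1 is secretly better

class ToyPolicy:
    def __init__(self):
        self.pref = 0.0                      # preference for action 1
        self.lock = threading.Lock()
    def snapshot(self):
        with self.lock:                      # actors copy the current "playbook"
            return self.pref
    def sample(self, pref):
        p1 = 1.0 / (1.0 + math.exp(-pref))   # sigmoid over the preference
        return (1, p1) if random.random() < p1 else (0, 1.0 - p1)
    def update(self, trajectory):
        # Crude policy-flavoured update on the shipped raw experience.
        with self.lock:
            for action, prob, reward in trajectory:
                direction = 1.0 if action == 1 else -1.0
                self.pref += 0.01 * direction * reward

trajectory_queue = queue.Queue(maxsize=64)   # actors -> learner

def actor(env, policy, steps=500, unroll=20):
    for _ in range(steps // unroll):
        pref = policy.snapshot()             # periodic copy of the strategy
        trajectory = []
        for _ in range(unroll):
            action, prob = policy.sample(pref)
            reward = env.step(action)
            # Ship raw experience, including the behaviour probability the
            # real learner would need for V-trace, not conclusions about it.
            trajectory.append((action, prob, reward))
        trajectory_queue.put(trajectory)
    trajectory_queue.put(None)               # this actor is done

def learner(policy, num_actors):
    finished = 0
    while finished < num_actors:
        trajectory = trajectory_queue.get()
        if trajectory is None:
            finished += 1
        else:
            policy.update(trajectory)        # learning never pauses the actors

policy = ToyPolicy()
actors = [threading.Thread(target=actor, args=(ToyEnv(), policy)) for _ in range(4)]
for t in actors:
    t.start()
learner(policy, num_actors=len(actors))
for t in actors:
    t.join()
print("learned preference for the better action:", round(policy.pref, 3))
```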
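
The V-trace correction mentioned above also has a compact form in the paper: the value target is v_s = V(x_s) + sum over t of gamma^(t-s) * (product of c_i) * delta_t, where delta_t = rho_t * (r_t + gamma * V(x_{t+1}) - V(x_t)) and rho_t, c_i are importance weights truncated at the thresholds rho-bar and c-bar. The NumPy sketch below computes these targets for a single trajectory following the paper's recursion; for brevity it omits episode-boundary masking of the discount, and the numbers in the usage example are made up.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value,
                   behaviour_log_probs, target_log_probs,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets v_s for one unroll of length T.

    values[t] is V(x_t) under the learner's current value network; the
    log-probs are those of the chosen actions under the actors' behaviour
    policy (mu) and the learner's policy (pi), which lag apart because
    acting and learning are decoupled.
    """
    # Truncated importance weights correcting for the actor-learner policy lag.
    ratios = np.exp(target_log_probs - behaviour_log_probs)
    rhos = np.minimum(rho_bar, ratios)
    cs = np.minimum(c_bar, ratios)

    next_values = np.append(values[1:], bootstrap_value)
    # Importance-weighted temporal-difference errors delta_t.
    deltas = rhos * (rewards + gamma * next_values - values)

    # Backward recursion: v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})).
    acc = 0.0
    vs_minus_v = np.zeros_like(values)
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v

# Tiny usage example on made-up numbers.
T = 5
rng = np.random.default_rng(0)
print(vtrace_targets(
    rewards=rng.normal(size=T),
    values=rng.normal(size=T),
    bootstrap_value=0.0,
    behaviour_log_probs=np.log(np.full(T, 0.25)),
    target_log_probs=np.log(np.full(T, 0.30)),
))
```

In the full algorithm, these targets train the value network, and the truncated weight rho_s also scales the advantage in the policy-gradient update; see the V-trace section of the paper for the details.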

---
*Source: https://ekstraktznaniy.ru/video/14493*