# Google's Text Reader AI: Almost Perfect | Two Minute Papers #228

## Metadata

- **Channel:** Two Minute Papers
- **YouTube:** https://www.youtube.com/watch?v=bdM9c2OFYuw
- **Date:** 14.02.2018
- **Duration:** 4:46
- **Views:** 60,661
- **Source:** https://ekstraktznaniy.ru/video/14513

## Description

The paper "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" is available here:
https://google.github.io/tacotron/publications/tacotron2/index.html
https://arxiv.org/abs/1712.05884

Our Patreon page with the details:
https://www.patreon.com/TwoMinutePapers

One-time payment links are available below. Thank you very much for your generous support!
PayPal: https://www.paypal.me/TwoMinutePapers
Bitcoin: 13hhmJnLEzwXgmgJN7RB6bWVdT7WkrFAHh
Ethereum: 0x002BB163DfE89B7aD0712846F1a1E53ba6136b5A

Unofficial implementations - proceed with care:
https://github.com/candlewill/Tacotron-2
https://github.com/r9y9/wavenet_vocoder

We would like to thank our generous Patreon supporters who make Two Minute Papers possible:
Andrew Melnychuk, Brian Gilman, Christian Ahlin, Christoph Jadanowski, Dennis Abts, Emmanuel, Eric Haddad, Esa Turkulainen, Evan Breznyik, Frank Goertzen, Malek Cellier, Marten Rauschenberg, Michael Albrecht, Michael Jensen, Raul Araújo da Silva, Robin Grah

## Transcript

### Segment 1 (00:00 - 04:00)

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. Earlier, we talked about Google's WaveNet, a learning-based text-to-speech engine. This means that we give it a piece of written text, and after a training step using someone's voice, it has to read the text aloud using this person's voice as convincingly as possible. This follow-up work is about making it even more convincing.

Before we go into it, let's marvel at these new results together. Hm-hm! As you can hear, it is great at prosody, stress and intonation, which leads to really believable human speech.

The magic component in the original WaveNet paper was introducing dilated convolutions for this problem. These make large skips in the input data so we have a better global view of it. It is a bit like increasing the receptive field of the eye so we can see the entire landscape, and not only a single tree in a photograph.

The magic component in this new work is using mel spectrograms as the input to WaveNet. This is an intermediate representation, based on human perception, that records not only how different words should be pronounced, but the expected volumes and intonations as well. The new model was trained on about 24 hours of speech data.

And of course, no research work should come without some sort of validation. The first is recording the mean opinion scores for previous algorithms, this one, and real, professional voice recordings. The mean opinion score is a number that describes how well a sound sample would pass as genuine human speech. The new algorithm passed with flying colors.

An even more practical evaluation was also done in the form of a user study, where people listened to the synthesized samples and to professional voice narrators, and had to guess which one was which. And this is truly incredible, because most of the time, people had no idea which was which - if you don't believe it, we'll try this ourselves in a moment. A very small but statistically significant tendency towards favoring the real recordings was found, likely because some words, like "merlot", are mispronounced.

Automatically voiced audiobooks, automatic voice narration for video games. Bring it on. What a time to be alive!

Note that producing these waveforms is not real time and still takes quite a while. To make progress in that direction, scientists at DeepMind wrote a heck of a paper where they sped WaveNet up a thousand times. Leave a comment if you would like to hear more about it in a future episode.

And of course, new inventions like this will also raise new challenges down the line. It may be that voice recordings will become much easier to forge and be less useful as evidence, unless we find new measures to verify their authenticity, for instance, signing them like we do with software.

In closing, a few audio sample pairs: one of them is real, the other is synthesized. What do you think, which is which? Leave a comment below. I'll just leave a quick hint here that I found on the webpage. Hopp! There you go.

If you have enjoyed this episode, please make sure to support us on Patreon. This is how we can keep the show running, and you know the drill, one dollar is almost nothing, but it keeps the papers coming. Thanks for watching and for your generous support, and I'll see you next time!
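
The dilated convolutions described in the transcript can be illustrated with a short, self-contained sketch. This is not the architecture from the WaveNet paper; it is a minimal PyTorch example with assumed channel counts and an assumed dilation schedule, showing how stacking convolutions with exponentially growing dilation widens the receptive field, i.e. how far back in the waveform each output sample can "see".

```python
import torch
import torch.nn as nn

# A stack of 1-D convolutions with exponentially growing dilation, in the
# spirit of WaveNet's dilated convolutions. The channel counts and dilation
# schedule here are illustrative choices, not the paper's exact architecture.
dilations = [1, 2, 4, 8, 16]
layers = []
for d in dilations:
    layers.append(nn.Conv1d(in_channels=16, out_channels=16,
                            kernel_size=2, dilation=d))
net = nn.Sequential(*layers)

# With kernel size 2, each layer adds its dilation to the receptive field,
# so the stack sees 1 + sum(dilations) input samples per output sample.
receptive_field = 1 + sum(dilations)
print(receptive_field)  # 32

x = torch.randn(1, 16, 1000)  # (batch, channels, time)
y = net(x)
print(y.shape)  # the time axis shrinks by receptive_field - 1, to 969 steps
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the parameter count grows only linearly, which is what gives the model the "global view" of the signal mentioned above.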
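
Likewise, the mel spectrogram used as the intermediate representation can be computed with standard audio tooling such as librosa. This is a minimal sketch with assumed parameters; the file name and the exact window, hop, and band settings are illustrative and not necessarily those used in the paper.

```python
import numpy as np
import librosa

# Load a short speech clip (the file name is a placeholder).
y, sr = librosa.load("speech_sample.wav", sr=22050)

# Short-time Fourier transform magnitudes mapped onto the mel scale,
# a frequency warping modeled on human pitch perception.
mel = librosa.feature.melspectrogram(
    y=y,
    sr=sr,
    n_fft=1024,      # ~46 ms analysis window at 22.05 kHz
    hop_length=256,  # ~11.6 ms between frames
    n_mels=80,       # number of mel bands; 80 is a common TTS choice
)

# Log compression roughly matches perceived loudness, giving the compact,
# perception-based representation the video describes.
log_mel = np.log(np.clip(mel, 1e-5, None))

print(log_mel.shape)  # (80, number_of_frames)
```

In a Tacotron 2-style pipeline, a sequence-to-sequence network predicts frames like these from the input text, and the WaveNet vocoder then turns the predicted frames into a waveform.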
