# Google's AI Clones Your Voice After Listening for 5 Seconds! 🤐

## Metadata

- **Channel:** Two Minute Papers
- **YouTube:** https://www.youtube.com/watch?v=0sR1rU3gLzQ
- **Date:** 12.11.2019
- **Duration:** 5:31
- **Views:** 1,385,681

## Description

❤️ Check out Weights & Biases here and sign up for a free demo: https://www.wandb.com/papers 

The shown blog post is available here: https://www.wandb.com/articles/fundamentals-of-neural-networks

📝 The paper "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis" and audio samples are available here:
https://arxiv.org/abs/1806.04558
https://google.github.io/tacotron/publications/speaker_adaptation/

An unofficial implementation of this paper is available here. Note that this was not made by the authors of the original paper and may contain deviations from the described technique - please judge its results accordingly! https://github.com/CorentinJ/Real-Time-Voice-Cloning
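For the curious, a minimal cloning sketch following the interface of that unofficial repository's demo scripts might look like the code below. The module paths, function names, and pretrained-model locations here are assumptions based on the repository's layout and can differ between versions, so treat this as a sketch rather than a verified recipe:

```python
# Sketch of the three-stage SV2TTS pipeline as exposed by the unofficial
# CorentinJ/Real-Time-Voice-Cloning repository (interface assumed from its
# demo scripts; module paths and model locations may vary by version).
from pathlib import Path

import soundfile as sf
from encoder import inference as encoder          # speaker encoder: audio -> embedding
from synthesizer.inference import Synthesizer     # text + embedding -> mel spectrogram
from vocoder import inference as vocoder          # mel spectrogram -> waveform

# Load the three pretrained components (these model paths are assumptions).
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/pretrained/pretrained.pt"))
vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))

# 1) Embed a roughly 5-second reference recording of the target speaker.
wav = encoder.preprocess_wav(Path("reference_5s.wav"))
embedding = encoder.embed_utterance(wav)

# 2) Synthesize a mel spectrogram for new text, conditioned on the embedding.
text = "The rainbow is a bridge between earth and sky."
mel = synthesizer.synthesize_spectrograms([text], [embedding])[0]

# 3) Turn the mel spectrogram into an audible waveform and save it.
generated = vocoder.infer_waveform(mel)
sf.write("cloned.wav", generated, synthesizer.sample_rate)
```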

🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible:
Alex Haro, Anastasia Marchenkova, Andrew Melnychuk, Angelos Evripiotis, Anthony Vdovitchenko, Benji Rabhan, Brian Gilman, Bryan Learn, Christian Ahlin, Claudio Fernandes, Daniel Hasegan, Dennis Abts, Eric Haddad, Eric Martel, Evan Breznyik, Geronimo Moralez, James Watt, Javier Bustamante, John De Witt, Kaiesh Vohra, Kasia Hayden, Kjartan Olason, Levente Szabo, Lorin Atzberger, Lukas Biewald, Marcin Dukaczewski, Marten Rauschenberg, Matthias Jost, Maurits van Mastrigt, Michael Albrecht, Michael Jensen, Nader Shakerin, Owen Campbell-Moore, Owen Skarpness, Raul Araújo da Silva, Rob Rowe, Robin Graham, Ryan Monsurate, Shawn Azman, Steef, Steve Messina, Sunil Kim, Taras Bobrovytsky, Thomas Krcmar, Torsten Reil.
https://www.patreon.com/TwoMinutePapers

Splash screen/thumbnail design: Felícia Fehér - http://felicia.hu

Károly Zsolnai-Fehér's links:
Instagram: https://www.instagram.com/twominutepapers/
Twitter: https://twitter.com/karoly_zsolnai
Web: https://cg.tuwien.ac.at/~zsolnai/

#VoiceCloning #Google

## Contents

### [0:00](https://www.youtube.com/watch?v=0sR1rU3gLzQ) Segment 1 (00:00 - 05:00)

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. Today we are going to listen to some amazing improvements in the area of AI-based voice cloning. For instance, if someone wanted to clone my voice, there are hours and hours of my recordings on YouTube and elsewhere, and they could do it with previously existing techniques. But the question today is: if we had even more advanced methods to do this, how big of a sound sample would we really need? Do we need a few hours? A few minutes? The answer is no, not at all. Hold on to your papers, because this new technique only requires five seconds. Let's listen to a couple of examples.

"The Norsemen considered the rainbow as a bridge over which the gods passed from Earth to their home in the sky." "Take a look at these pages for Crooked Creek Drive." "There are several listings for gas station." "Here's the forecast for the next four days." "These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon." "Take a look at these pages for Crooked Creek Drive." "There are several listings for gas station." "Here's the forecast for the next four days."

Absolutely incredible. The timbre of the voice is very similar, and it is able to synthesize sounds and consonants that have to be inferred, because they were not heard in the original voice sample. This requires a certain kind of intelligence, and quite a bit of that. So, while we are at it, how does this new system work? Well, it requires three components.

One, the speaker encoder is a neural network that was trained on thousands and thousands of speakers, and is meant to squeeze all this learned data into a compressed representation. In other words, it tries to learn the essence of human speech from many speakers. To clarify, I will add that this system listens to thousands of people talking to learn the intricacies of human speech, but this training step needs to be done only once, and after that, it is allowed just five seconds of speech data from someone it hasn't heard previously, and later, the synthesis takes place using these five seconds as an input.

Two, we have a synthesizer that takes text as an input, this is what we would like our test subject to say, and it gives us a mel spectrogram, which is a concise representation of someone's voice and intonation. The implementation of this module is based on DeepMind's Tacotron 2 technique. Here you can see an example of this mel spectrogram built for a male and two female speakers. On the left, we have the spectrograms for the reference recordings, the voice samples if you will, and on the right, we specify a piece of text that we would like the learning algorithm to utter, and it produces the corresponding synthesized spectrograms.

But eventually, we would like to listen to something, and for that, we need a waveform as an output. So, the third element is a neural vocoder that does exactly that, and this component is implemented by DeepMind's WaveNet technique. This is the architecture that led to these amazing examples.
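The mel spectrogram mentioned for component two is just a time-frequency picture of audio, warped onto a perceptual frequency scale. As a quick illustration (the sample rate, frame size, hop length, and band count below are common TTS choices, not necessarily the paper's exact configuration), here is how one can be computed with librosa:

```python
# A minimal sketch of the intermediate representation the synthesizer
# produces: an 80-band log-mel spectrogram, computed here from a stand-in
# signal so the example is self-contained.
import numpy as np
import librosa

sr = 16000                                  # a common sample rate for speech
t = np.linspace(0, 5.0, 5 * sr, endpoint=False)
wav = 0.5 * np.sin(2 * np.pi * 220 * t)     # stand-in for a 5-second voice sample

# Short-time Fourier transform warped onto the mel scale: a compact
# time-frequency picture of timbre and intonation.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel)          # log compression, as TTS systems use
print(log_mel.shape)                        # (80 mel bands, ~313 frames)
```

In this system, the synthesizer predicts such a spectrogram directly from text and the speaker embedding, and the vocoder then inverts it back into a waveform.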
So, how do you measure exactly how amazing it is? When we have a solution, evaluating it is also anything but trivial. In principle, we are looking for a result that is both close to the recording that we have of the target person, but says something completely different, and all this in a natural manner. This naturalness and similarity can be measured, but we are not nearly done yet, because the problem gets even more difficult. For instance, it matters how we fit the three puzzle pieces together, and then, what data we train it on, of course, also matters a great deal. Here you see that if we train on one dataset and test the results against a different one, and then swap the two, the results in naturalness and similarity will differ significantly. The paper contains a very detailed evaluation section that explains how to deal with these difficulties. The mean opinion score is measured in this section, which is a number that describes how well a sound sample would pass as genuine human speech. And we haven't even talked about the speaker verification part, so make sure to have a look at the paper. So, indeed, we can clone each other's voice by using a sample of only five seconds. What a time to be alive!
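To make the mean opinion score concrete: listeners rate each sample on a 1-to-5 scale, and the ratings are averaged, typically reported with a 95% confidence interval. A toy computation, with made-up ratings for illustration only:

```python
# Toy Mean Opinion Score (MOS) computation: average of 1-5 listener ratings
# plus a normal-approximation 95% confidence interval. The ratings below are
# hypothetical, not from the paper.
import math

ratings = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]   # hypothetical 1-5 listener scores

n = len(ratings)
mos = sum(ratings) / n                      # the mean opinion score itself
var = sum((r - mos) ** 2 for r in ratings) / (n - 1)   # sample variance
ci95 = 1.96 * math.sqrt(var / n)            # 95% confidence interval half-width

print(f"MOS = {mos:.2f} ± {ci95:.2f}")      # prints "MOS = 4.10 ± 0.46"
```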

### [5:00](https://www.youtube.com/watch?v=0sR1rU3gLzQ&t=300s) Segment 2 (05:00 - 05:31)

This episode has been supported by Weights & Biases, which provides tools to track your experiments in your deep learning projects. It can save you a ton of time and money in these projects, and it is being used by OpenAI, Toyota Research, Stanford, and Berkeley. They also wrote a guide on the fundamentals of neural networks, where they explain in simple terms how to train a neural network properly, what the most common errors are, and how to fix them. It is really great, you should have a look. So make sure to visit them through wandb.com/papers (that's W-A-N-D-B dot com slash papers), or just click the link in the video description, and you can get a free demo today. Our thanks to Weights & Biases for helping us make better videos for you. Thanks for watching and for your generous support, and I'll see you next time!

---
*Source: https://ekstraktznaniy.ru/video/14222*