# WaveNet by Google DeepMind | Two Minute Papers #93

## Metadata

- **Channel:** Two Minute Papers
- **YouTube:** https://www.youtube.com/watch?v=CqFIVCD1WWo
- **Date:** 12 September 2016
- **Duration:** 6:38
- **Views:** 133,342
- **Source:** https://ekstraktznaniy.ru/video/14776

## Description

Let's talk about Google DeepMind's WaveNet! This piece of work is about generating audio waveforms for Text To Speech and more. Text To Speech basically means that we have a voice reading whatever we have written down. The difference in this work is, however, that it can synthesize these samples in someone's voice, provided that we have training samples of this person speaking.

__________________________

The paper "WaveNet: A Generative Model for Raw Audio" is available here:
https://arxiv.org/abs/1609.03499

The blog post about this with the sound samples is available here:
https://deepmind.com/blog/wavenet-generative-model-raw-audio/

The machine learning reddit thread about this paper is available here:
https://www.reddit.com/r/MachineLearning/comments/51sr9t/deepmind_wavenet_a_generative_model_for_raw_audio/?ref=search_posts

Recommended for you:
Every Two Minute Papers episode on deep learning: https://www.youtube.com/playlist?list=PLujxSBD-JXglGL3ERdDOhthD3jTlfudC2

WE WOULD LIK

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. When I opened my inbox today, I was greeted by a huge deluge of messages about WaveNet. Well, first, it's great to see that so many people are excited about these inventions, and second, may all your wishes come true as quickly as this one. So here we go.

This piece of work is about generating audio waveforms for text to speech and more. Text to speech basically means that we have a voice reading whatever we have written down. The difference in this work is, however, that it can synthesize these samples in someone's voice, provided that we have training samples of this person speaking.

"The avocado is a pear-shaped fruit with leathery skin, smooth edible flesh, and a large stone." "The avocado is a pear-shaped fruit with leathery skin, smooth edible flesh, and a large stone." "The avocado is a pear-shaped fruit with leathery skin, smooth edible flesh, and a large stone."

It also generates waveforms sample by sample, which is particularly perilous because we typically need to produce these at the rate of 16,000 to 24,000 samples per second, and as we listen to the TV, radio, and talk to each other several hours a day, the human ear and brain is particularly suited to processing this kind of signal. If the result is off by only the slightest amount, we immediately recognize it.

It is not using a recurrent neural network, which is typically suited to learn sequences of things and is widely used for sound synthesis. It is using a convolutional neural network, which is quite surprising because it is not meant to process sequences of data that change in time. However, this variant contains an extension that is able to do that. They call this extension dilated convolutions, and they open up the possibility of making large skips in the input data so we have a better global view of it. If we were working in computer vision, it would be like increasing the receptive field of the eye so we can see the entire landscape and not only a tree on a photograph. It is also a bit like the temporal coherence problem we have talked about earlier. Taking all this into consideration results in more consistent outputs over larger time scales, so the technique knows what it had done several seconds ago. Also, training a convolutional neural network is a walk in the park compared to a recurrent neural network. Really cool!

And the results beat all existing widely used techniques by a large margin. One of these is the concatenative technique, which builds sentences from a huge amount of small speech fragments. These have seen a ton of improvements during the years, but the outputs are still robotic, and it is noticeable that we are not listening to a human but a computer. The DeepMind guys also report that, quote, "Notice that non-speech sounds, such as breathing and mouth movements, are also sometimes generated by WaveNet; this reflects the greater flexibility of a raw-audio model." End quote.

"The Blue Lagoon is a 1980 American romance and adventure film directed by Randal Kleiser." "The blue moon aspects of the sublime in English poetry and painting, 1770 to 1850."

At the same time, I'd like to note that in the next few episodes, it may be that my voice is a bit different, but don't worry about that. It may also happen that I am on a vacation but new episodes and voice samples pop up on the channel. Please don't worry about that either. Everything is working as intended.

They also experimented with music generation, and the results are just stunning. I don't know what to say. These difficult problems, these impenetrable walls crumble one after another as DeepMind takes on them. Insanity! Their blog post and the paper are both really well

### Segment 2 (05:00 - 06:00) [5:00]

written. Make sure to check them out; they are both linked in the video description box. I wager that artistic style transfer for sound and instruments is not only coming, but it will be here soon. I imagine that we'll play a guitar and it will sound like a harp, and we'll be able to sing something in Lady Gaga's voice and intonation. I've also seen someone pitching the idea of creating audiobooks automatically with such a technique. Wow! I travel a lot and I'm almost always on the go, so I personally would love to have such audiobooks. I have linked to the mentioned machine learning Reddit thread in the description box. As always, there's lots of great discussion and ideas there.

It was also reported that the algorithm currently takes 90 minutes to synthesize 1 second of sound waveforms. You know the drill: one follow-up paper down the line, it will take only a few minutes; a few more papers down the line, it will be real time. Just think about all these advancements. What a time we are living in! And I am extremely excited to present them all to you Fellow Scholars in Two Minute Papers. Make sure to leave your thoughts and ideas in the comment section; we love reading them. Thanks for watching and for your generous support, and I'll see you next time!
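
The dilated convolutions mentioned in the transcript can be illustrated with a toy example. The sketch below is not DeepMind's implementation: the kernel size of 2, the doubling dilation schedule, and the helper names `dilated_causal_conv` and `receptive_field` are all assumptions made for illustration. It only shows how skipping over inputs lets a few layers "see" further back in time.

```python
"""Toy sketch of a dilated causal 1-D convolution stack (illustrative only)."""


def dilated_causal_conv(signal, weights, dilation):
    """Apply a causal convolution whose taps are `dilation` samples apart.

    output[t] = sum_k weights[k] * signal[t - k * dilation],
    where samples before the start of the signal count as zero.
    """
    output = []
    for t in range(len(signal)):
        total = 0.0
        for k, w in enumerate(weights):
            index = t - k * dilation
            if index >= 0:  # causal: only the current and past samples are used
                total += w * signal[index]
        output.append(total)
    return output


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence a single output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical settings, assumed for this sketch: dilation doubles per layer.
DILATIONS = [1, 2, 4, 8]

if __name__ == "__main__":
    # Push a single impulse through the stack and count how far it spreads.
    x = [1.0] + [0.0] * 19
    for d in DILATIONS:
        x = dilated_causal_conv(x, [1.0, 1.0], d)
    print(sum(1 for value in x if value != 0.0))  # samples reached by the impulse
    print(receptive_field(2, DILATIONS))          # same count, computed directly
```

With these assumed settings, four layers cover 16 input samples, which is 1 millisecond of audio at 16,000 samples per second. Each extra layer with a doubled dilation roughly doubles that span, which is how such a network can keep a longer stretch of past audio in view without a recurrent connection.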