This AI Learned To Isolate Speech Signals
4:17


Two Minute Papers · 12.11.2018 · 35,958 views · 2,291 likes


Video description
The paper "Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation" is available here: https://looking-to-listen.github.io/

Pick up cool perks on our Patreon page: https://www.patreon.com/TwoMinutePapers

We would like to thank our generous Patreon supporters who make Two Minute Papers possible: 313V, Andrew Melnychuk, Angelos Evripiotis, Anthony Vdovitchenko, Brian Gilman, Christian Ahlin, Christoph Jadanowski, Dennis Abts, Emmanuel, Eric Haddad, Eric Martel, Evan Breznyik, Geronimo Moralez, John De Witt, Kjartan Olason, Lorin Atzberger, Marten Rauschenberg, Maurits van Mastrigt, Michael Albrecht, Michael Jensen, Morten Punnerud Engelstad, Nader Shakerin, Owen Skarpness, Raul Araújo da Silva, Rob Rowe, Robin Graham, Ryan Monsurate, Shawn Azman, Steef, Steve Messina, Sunil Kim, Thomas Krcmar, Torsten Reil, Zach Boldyga.

Crypto and PayPal links are available below. Thank you very much for your generous support!
› PayPal: https://www.paypal.me/TwoMinutePapers
› Bitcoin: 13hhmJnLEzwXgmgJN7RB6bWVdT7WkrFAHh
› Ethereum: 0x002BB163DfE89B7aD0712846F1a1E53ba6136b5A
› LTC: LM8AUh5bGcNgzq6HaV1jeaJrFvmKxxgiXg

Thumbnail background image credit: https://pixabay.com/photo-3565815/
Splash screen/thumbnail design: Felícia Fehér - http://felicia.hu

Károly Zsolnai-Fehér's links:
Facebook: https://www.facebook.com/TwoMinutePapers/
Twitter: https://twitter.com/karoly_zsolnai
Web: https://cg.tuwien.ac.at/~zsolnai/

Table of contents (5 segments)

<Untitled Chapter 1>

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. This is a neural network-based technique that can perform audio-visual speech separation. Before we talk about what that is, I will tell you what it is not, namely what we've seen in the previous episode, where we could select a pixel and listen to it. Have a look. And now, let's try to separate the sound of the cello and see if it knows where it comes from. This one is different. This new technique

Input video

can clean up an audio signal by suppressing the noise in a busy bar, even if the source of the noise is not seen in the video. It can also enhance the voice of the speaker at the same time. Let's listen.

So the gist is: in any video, any person who you see talking, their audio gets cleaned up, and everything else gets suppressed. Or, if we have a Skype meeting with someone in a lab or a busy office where multiple people are speaking nearby, we can also perform a similar speech separation, which would be a godsend for future meetings.

"So we've been trying to train this network to input two embeddings and output three. Yeah, this is just an extra experiment for the paper." "Hi guys! So we've been trying to train

Only the main speaker

this network to input two embeddings and output three. Yeah, this is just an extra experiment for the paper."

Only the background audio

And I think, if you are a parent, the utility of this example needs no further explanation. "...please navigate to work."

Only the driver

I am not sure if I have ever encountered the term "screaming children" in the abstract of an AI paper, so that was also a first here. This is a super difficult task, because the AI needs to understand which lip motions correspond to which kinds of sounds, and this differs across languages, age groups, and head positions. To this end, the authors put together a stupendously large dataset with almost 300,000 videos of clean speech signals. This dataset is then run through a multi-stream neural network that detects the human faces within the video, generates small thumbnails of them, and observes how they move over time. It also analyzes the audio signal separately, and then fuses these elements together with a recurrent neural network to output the separated audio waveforms.

A key advantage of this architecture and training method is that, as opposed to many previous works, it is speaker-independent; therefore, we don't need specific training data from the speaker we want to use it on. This is a huge leap in terms of usability.

The paper also contains an excellent demonstration of this concept by taking a piece of footage from Conan O'Brien's show, where two comedians were booked for the same time slot and talk over each other. The result is a performance where it is near impossible to understand what they are saying, but with this technique, we can hear both of them, one by one, crystal clear. You can see some results here, but make sure to click the paper link in the description to hear the sound samples as well.

Thanks for watching and for your generous support, and I'll see you next time!
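The separation described above is typically done in the time-frequency domain: the network's fused audio-visual features are used to predict a per-speaker spectrogram mask that is multiplied with the mixture spectrogram. Here is a minimal NumPy sketch of just that masking step, using two synthetic tones as stand-in "speakers" and an ideal ratio mask computed from the ground truth for illustration (the paper's model instead predicts complex masks from the fused audio-visual features):

```python
import numpy as np

def stft(x, win=256, hop=128):
    """Naive short-time Fourier transform with a Hann window."""
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

# Two synthetic "speakers": pure tones at different frequencies.
sr = 8000
t = np.arange(sr) / sr
s1 = np.sin(2 * np.pi * 440 * t)    # stand-in for speaker 1
s2 = np.sin(2 * np.pi * 1200 * t)   # stand-in for speaker 2
mix = s1 + s2                       # the "cocktail party" mixture

S1, S2, M = stft(s1), stft(s2), stft(mix)

# The real model *predicts* a mask per speaker; here we compute the
# ideal ratio mask from the ground-truth sources purely to show the
# masking step itself.
mask1 = np.abs(S1) / (np.abs(S1) + np.abs(S2) + 1e-8)
est1 = mask1 * M    # masked mixture spectrogram, approximately speaker 1

# Masking pulls the estimate much closer to speaker 1 than the raw mix.
err_masked = np.linalg.norm(est1 - S1)
err_mix = np.linalg.norm(M - S1)
print(err_masked < err_mix)  # → True
```

The estimated spectrogram would then be inverted back to a waveform (overlap-add of inverse FFTs) to obtain the separated speech.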

