Can an AI Learn Lip Reading?

Two Minute Papers · 17.07.2020 · 6:03 · 189,228 views · 10,936 likes


Video description
❤️ Check out Snap's Residency Program and apply here: https://lensstudio.snapchat.com/snap-ar-creator-residency-program/?utm_source=twominutepapers&utm_medium=video&utm_campaign=tmp_ml_residency
❤️ Try Snap's Lens Studio here: https://lensstudio.snapchat.com/
📝 The paper "Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis" is available here: http://cvit.iiit.ac.in/research/projects/cvit-projects/speaking-by-observing-lip-movements
Our earlier video on the "bag of chips" sound reconstruction is available here: https://www.youtube.com/watch?v=2i1hrywDwPo
🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible: Aleksandr Mashrabov, Alex Haro, Alex Paden, Andrew Melnychuk, Angelos Evripiotis, Benji Rabhan, Bruno Mikuš, Bryan Learn, Christian Ahlin, Daniel Hasegan, Eric Haddad, Eric Martel, Gordon Child, Javier Bustamante, Lorin Atzberger, Lukas Biewald, Michael Albrecht, Nikhil Velpanur, Owen Campbell-Moore, Owen Skarpness, Ramsey Elbasheer, Robin Graham, Steef, Sunil Kim, Taras Bobrovytsky, Thomas Krcmar, Torsten Reil, Tybie Fitzhugh.
More info if you would like to appear here: https://www.patreon.com/TwoMinutePapers
Thumbnail background image credit: https://pixabay.com/images/id-4814562/
Károly Zsolnai-Fehér's links:
Instagram: https://www.instagram.com/twominutepapers/
Twitter: https://twitter.com/twominutepapers
Web: https://cg.tuwien.ac.at/~zsolnai/
#Lipreading

Table of contents (7 segments)

Intro

Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. When watching science fiction movies, we often encounter crazy devices and technologies that don't really exist, or sometimes ones that are not even

Sound from vibrations

possible to make. For instance, reconstructing sound from vibrations would be an excellent example of that, and could make a great novel with the secret service trying to catch dangerous criminals, except that it has already been done in real-life research. I think you can imagine how surprised I was when I first saw this paper in 2014, which showcased a result where a camera looks at this bag of chips, and from these tiny vibrations it could reconstruct the sounds in the room. Let's listen: "Mary had a little lamb whose fleece was white as snow, and everywhere that Mary went, that lamb was sure to go." Yes, this indeed sounds like science fiction, but 2014 was a long time ago, and since then we have a selection of powerful learning algorithms.
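For reference, that 2014 work recovered sound from tiny motions of objects in high-speed video. The following is only a heavily simplified toy sketch of the underlying idea, not the paper's actual method: treat per-frame brightness fluctuations of the vibrating object as audio samples, with the camera's frame rate acting as the sample rate. The file name and frame rate are made up for illustration.

```python
import numpy as np
import cv2
from scipy.signal import butter, filtfilt
from scipy.io import wavfile

def recover_sound(video_path: str, fps: float, out_path: str = "recovered.wav") -> None:
    """Treat per-frame brightness fluctuations of a vibrating object as audio samples."""
    cap = cv2.VideoCapture(video_path)
    samples = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        samples.append(frame.mean())           # one "audio sample" per video frame
    cap.release()

    signal = np.asarray(samples, dtype=np.float64)
    signal -= signal.mean()                     # remove the DC offset

    # Band-pass to a rough speech range; because the frame rate is the sample rate,
    # a high-speed camera is needed to capture audible frequencies at all.
    nyquist = fps / 2.0
    b, a = butter(4, [80.0 / nyquist, min(2000.0, nyquist - 1.0) / nyquist], btype="band")
    audio = filtfilt(b, a, signal)
    audio /= np.abs(audio).max() + 1e-8         # normalize to [-1, 1]

    wavfile.write(out_path, int(fps), audio.astype(np.float32))

# Example with a hypothetical 2200 fps clip of the bag of chips:
# recover_sound("chip_bag_2200fps.mp4", fps=2200.0)
```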

Sound from silent footage

So the question is: what is the next idea that sounded completely impossible a few years ago, but is now possible? Well, what about looking at silent footage of a speaker and trying to guess what they were saying? Checkmark, that sounds absolutely impossible to me. Yet, this new technique is able to produce the entirety of this speech after looking at video footage of the lip movements. Let's listen: "...between the wavelength, the frequency, and the speed of electromagnetic radiation. In fact, the product of the wavelength and the frequency is b..." Wow. So the first question is, of course: what was used as the training data?

Training data

It used a dataset with lecture videos and chess commentary from five speakers, and make no mistake, it takes a ton of data from these speakers, about 20 hours from each. But it uses video that was shot in a natural setting, which is something that we have in abundance on YouTube and other places on the internet. Note that the neural network works on the same speakers it was trained on, and it was able to learn their gestures and lip movements remarkably well. However, this is not the first work attempting to do this, so let's see how it compares to the competition: "stay white with i7 soon set" — the new one is very close to the true spoken sentence. Let's look at another one: "elements of function frame decision", "eight field guns were captured in position... position". Note that there are gestures, a reasonable amount of head movement, and other factors at play, and the algorithm still does amazingly well.
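To make that per-speaker training setup concrete, here is a minimal sketch of how silent mouth-crop video windows could be paired with aligned mel-spectrogram targets for one speaker. This is an assumption for illustration, not the authors' pipeline; tensor shapes, window sizes, and the frames-to-mel ratio are made up.

```python
import torch
from torch.utils.data import Dataset

class LipToSpeechPairs(Dataset):
    """Pairs of (mouth-crop video window, aligned mel-spectrogram window) for ONE speaker."""

    def __init__(self, clips, video_window=90, mel_per_frame=4):
        # clips: list of (video, mel) tensors for a single speaker, where
        #   video: (T, H, W) grayscale mouth crops at 25 fps
        #   mel:   (T * mel_per_frame, 80) mel-spectrogram of the aligned audio
        self.clips = clips
        self.video_window = video_window
        self.mel_per_frame = mel_per_frame

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, idx):
        video, mel = self.clips[idx]
        # Pick a random aligned window so hours of lecture video yield many examples.
        max_start = max(1, video.shape[0] - self.video_window)
        start = torch.randint(0, max_start, (1,)).item()
        v = video[start:start + self.video_window]
        m = mel[start * self.mel_per_frame:(start + self.video_window) * self.mel_per_frame]
        return v.float() / 255.0, m.float()
```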

Potential applications

Potential applications of this could be video conferencing in zones where we have to be silent, giving a voice to people with the inability to speak due to aphonia or other conditions, or potentially fixing a piece of video footage where parts of the speech signal are corrupted. In these cases, the gaps could be filled with such a technique: "Let's look at a cell potential of 0.5 volts for an oxidation of bromide by permanganate. The question I have is, what pH would cause this voltage? Would it be a pH..." Now let's have a look under the hood.

Under the hood

If we visualize the activations within this neural network, we see that it found out that it mainly looks at the mouth of the speaker. That is, of course, not surprising. However, what is surprising is that other regions, for instance around the forehead and eyebrows, are also important to the attention mechanism. Perhaps this could mean that it also looks at the gestures of the speaker and uses that information for the speech synthesis. I find this aspect of the work very intriguing and would love to see some additional analysis on that. There is so much more in the paper. For instance, I mentioned giving a voice to people with aphonia, which should not be possible, because we are training these neural networks for a specific speaker. But with an additional speaker embedding step, it is possible to pair up any speaker with any voice. This is another amazing work that makes me feel like we are living in a science fiction world. I can only imagine what we will be able to do with this technique two more papers down the line. If you have any ideas, feel free to speculate in the comment section below. What a time to be alive!
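As a rough illustration of that speaker embedding idea, the speech decoder can simply be conditioned on a learned per-speaker vector, so that one person's lip movements can be decoded with another person's voice embedding. This is a sketch under assumed shapes and layer choices, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class LipToSpeech(nn.Module):
    def __init__(self, n_speakers: int, emb_dim: int = 256, mel_bins: int = 80):
        super().__init__()
        # 3D convolutions summarize short windows of mouth-region frames over time.
        self.video_encoder = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(5, 5, 5), stride=(1, 2, 2), padding=2),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),   # keep the time axis, pool space away
        )
        self.speaker_emb = nn.Embedding(n_speakers, emb_dim)
        self.decoder = nn.GRU(64 + emb_dim, 256, batch_first=True)
        self.to_mel = nn.Linear(256, mel_bins)

    def forward(self, frames: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # frames: (B, 1, T, H, W) silent mouth crops; speaker_id: (B,) voice to render.
        feats = self.video_encoder(frames).squeeze(-1).squeeze(-1)   # (B, 64, T)
        feats = feats.transpose(1, 2)                                # (B, T, 64)
        spk = self.speaker_emb(speaker_id).unsqueeze(1).expand(-1, feats.size(1), -1)
        hidden, _ = self.decoder(torch.cat([feats, spk], dim=-1))
        return self.to_mel(hidden)                                   # (B, T, mel_bins)

# Pairing "any speaker with any voice" then amounts to feeding speaker A's video
# together with speaker B's embedding id:
# model = LipToSpeech(n_speakers=5)
# mel = model(torch.randn(2, 1, 75, 64, 64), torch.tensor([0, 3]))
```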

Outro

This episode has been supported by Snap Inc. What you see here is Snap ML, a framework that helps you bring your own machine learning models to Snapchat's AR lenses. You can build augmented reality experiences for Snapchat's hundreds of millions of users and help them see the world through a different lens. You can also apply to Snap's AR Creator Residency Program with a proposal of how you would use Lens Studio for a creative project. If selected, you could receive a grant between one and five thousand dollars and work with Snap's technical and creative teams to bring your ideas to life. It doesn't get any better than that. Make sure to go to the link in the video description and apply for the residency program, and try Snap ML today. Our thanks to Snap Inc. for helping us make better videos for you. Thanks for watching and for your generous support, and I'll see you next time!
