Microsoft’s New AI Clones Your Voice In 3 Seconds!

9:15

Two Minute Papers · 9 February 2023 · 252,860 views · 9,084 likes

Video description

❤️ Check out Lambda here and sign up for their GPU Cloud: https://lambdalabs.com/papers

📝 The paper "VALL-E Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers" is available here: https://valle-demo.github.io/

My latest paper on simulations that look almost like reality is available for free here: https://rdcu.be/cWPfD
Or this is the original Nature Physics link with clickable citations: https://www.nature.com/articles/s41567-022-01788-5

🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible: Aleksandr Mashrabov, Alex Balfanz, Alex Haro, Andrew Melnychuk, Benji Rabhan, Bryan Learn, B Shang, Christian Ahlin, Edward Unthank, Eric Martel, Geronimo Moralez, Gordon Child, Jace O'Brien, Jack Lukic, John Le, Jonas, Jonathan, Kenneth Davis, Klaus Busse, Kyle Davis, Lorin Atzberger, Lukas Biewald, Matthew Allen Fisher, Matthew Valle, Michael Albrecht, Michael Tedder, Nevin Spoljaric, Nikhil Velpanur, Owen Campbell-Moore, Owen Skarpness, Rajarshi Nigam, Ramsey Elbasheer, Richard Sundvall, Steef, Taras Bobrovytsky, Ted Johnson, Thomas Krcmar, Timothy Sum Hon Mun, Torsten Reil, Tybie Fitzhugh, Ueli Gallizzi.
If you wish to appear here or pick up other perks, click here: https://www.patreon.com/TwoMinutePapers

Thumbnail background design: Felícia Zsolnai-Fehér - http://felicia.hu

Károly Zsolnai-Fehér's links:
Twitter: https://twitter.com/twominutepapers
Web: https://cg.tuwien.ac.at/~zsolnai/

Contents (2 segments)

Segment 1 (00:00 - 05:00)

Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. Today I will show you a research paper that I can hardly believe exists. It is an amazing voice cloning paper from Microsoft Research. What does that mean? Well, voice cloning means that an AI listens to us speaking, we write a piece of text, and it says it in our voice.

To see what that looks like, this is a previous work from NVIDIA that was able to do that. Let's listen to Jamil recording these voice snippets: "Further details are expected later." Okay, so how much of this do we need to train this earlier AI? Well, not the entire live recordings of the test subject, but much less: only 30 minutes of these voice samples. The technique asks us to say these sentences and analyzes the timbre, prosody, and rhythm of our voice, which is quite a task. And what can it do afterwards? Well, it creates an AI Jamil that can now say a scholarly message for you that I wrote: "This is a voice line generated by an AI. Wonder if you Fellow Scholars are going to notice."

The First Law of Papers says that research is a process. Do not look at where we are, look at where we will be two more papers down the line. Now hold on to your papers, because Microsoft has a cracking new paper with an AI they call VALL-E that is able to do the same, but not in 30 minutes, not even in three minutes. It can do it in three seconds. Yes, that's right: all it needs is a three-second snippet of our voice, and it promises that it can clone it from that. And it gets crazier, much crazier. I will show you how in a moment.

First, in goes a three-second voice sample of us saying something. This does not need to match the text prompt. Here: "He told his visitors as he lighted a pipe." And this is going to be used for the learning. Now that it has learned this voice, here is what a previous technique could do in terms of cloning: "Instead of shoes, the old man wore boots with turnover tops, and his blue coat had white cuffs of gold braid." And here is the new method: "Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid." Wow, the phrasing and timing are so much better.

But actually, how good is it? How do we even know if this is really better? Well, easy. We can ask this person to read the prompt in their voice, hide it from the AI, and compare side by side. Listen. The AI goes first: "Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid." A real person second: "Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid." This is absolutely incredible. This truly feels like we are seeing history in the making.

Here's another example. A speaker prompt for learning: "milked cow contains the fight". And the new technique: "And lay me down in my coal bed and would leave my shining lot." I love this. It even has a little personality in there. And the true human voice: "And lay me down in thy cold bed and leave my shining lot." I have to say, this is even better, but the AI was also out of this world compared to what we were able to do before: "Then lay me down and I called bed and leave my shining lot." Here is another one: "I was in daily contact." "We've made a couple of albums."

And then we proceed to my favorite part of the paper: three advanced features. So good! First, variety. This new technique can generate several variants of speech for the same prompt, so we can listen to them and choose the one that we like best. Listen how it puts the emphasis on different words: "Because we do not need it." Loving it.

Two, get this: it can also listen to the three-second example of our voice and preserve its emotions too. Here is an angry one: "Her face was against his breast." "We have to reduce the number of plastic bags." And here is a sleepy one:

Segment 2 (05:00 - 09:00)

"We have to reduce the number of plastic bags." I really don't know what to say. This is absolutely incredible.

And three, it can also maintain not only the emotions, but the ambience and acoustic environment our sample has been recorded in. My favorite is this sample, which sounds like an old crackly phone conversation, like this: "right what'd you like about One Flew Over the cookies business". So can it clone this kind of sample too? Let's listen together. My, I did not think that this was possible at all, let alone from a tiny three-second example.

And just imagine the possibilities. This could bring back people who are not with us anymore and have them read us books and bedtime stories. Amazing, isn't it? Maybe we will have the incredible Isaac Asimov reading his own robot books to us soon. What a time to be alive!

And remember, NVIDIA's previous technique needed 30 minutes for this, and now, just one more paper down the line, 600 times less information is enough to create voice samples of this quality. 600 times less! And just imagine what we will be able to do two more papers down the line. My goodness.

Now, of course, this is a great research paper, so what does that mean? Of course, that means that it has a thorough evaluation section, so let's pop the hood and have a look. Whoa! I will try to explain this. We first compare against previous techniques in terms of the word error rate metric. The two variants of the new technique come out the absolute best. However, correctness by itself is not enough. We also wish to know if the new samples are not only correct, but also similar to the speaker in the input sample. So, is it? Wow, the new technique comes out the absolute best on the two things at the same time.

However, wait. These two are some ancient techniques from long ago, right? Well, nope. YourTTS is from the same year as this paper, and so is AudioLM. This is so much progress in research in so little time. Now that truly feels like history in the making. So welcome to Two Minute Papers, land of the Fellow Scholars, where we look at tables and make happy noises.

So what do you think? Who else would you like to have read to you? Morgan Freeman all the things? Or maybe Károly all the things? Let me know in the comments below. And if you wish to see more papers like this, please consider subscribing and, even better, hitting the bell icon.

If you're looking for inexpensive cloud GPUs for AI, Lambda now offers the best prices in the world for GPU cloud compute. No commitments or negotiation required. Just sign up and launch an instance. And hold on to your papers, because with Lambda GPU Cloud you can get on-demand A100 instances for $1.10 per hour versus $4.10 per hour with AWS. That's 73% savings. Did I mention they also offer persistent storage? So join researchers at organizations like Apple, MIT, and Caltech in using Lambda Cloud instances, workstations, or servers. Make sure to go to lambdalabs.com/papers to sign up for one of their amazing GPU instances today.

Thanks for watching and for your generous support, and I'll see you next time!
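The evaluation described in the video rests on two measurements: word error rate (is the generated speech saying the right words?) and similarity to the speaker in the input sample. The following is a minimal sketch of how such numbers can be computed, not the paper's evaluation code. It assumes that the speech has already been transcribed to text by some recognizer and that each audio clip has already been turned into a speaker embedding vector by some model; both of those steps, the function names, and the example inputs are assumptions made here for illustration.

```python
from math import sqrt


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two (hypothetical) speaker embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Illustrative inputs: the text prompt from the video and a made-up
# transcription that gets one word wrong ("white" instead of "wide").
prompt = ("instead of shoes the old man wore boots with turnover tops "
          "and his blue coat had wide cuffs of gold braid")
transcribed = prompt.replace("wide", "white")

print(round(word_error_rate(prompt, transcribed), 3))  # 1 error in 21 words -> 0.048
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))       # identical direction -> 1.0
```

A lower word error rate means the synthesized speech is more intelligible, while a higher similarity score means it sounds more like the speaker in the prompt. A method has to do well on both at once, which is the point made about the results table.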
