❤️ Check out Weights & Biases and sign up for a free demo here: https://www.wandb.com/papers
❤️ Their mentioned post is available here: https://wandb.ai/ayush-thakur/face-vid2vid/reports/Overview-of-One-Shot-Free-View-Neural-Talking-Head-Synthesis-for-Video-Conferencing--Vmlldzo1MzU4ODc
📝 The paper "One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing" is available here:
https://nvlabs.github.io/face-vid2vid/
🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible:
Aleksandr Mashrabov, Alex Haro, Alex Serban, Alex Paden, Andrew Melnychuk, Angelos Evripiotis, Benji Rabhan, Bruno Mikuš, Bryan Learn, Christian Ahlin, Eric Haddad, Eric Martel, Gordon Child, Haris Husic, Ivo Galic, Jace O'Brien, Javier Bustamante, John Le, Jonas, Joshua Goller, Kenneth Davis, Lorin Atzberger, Lukas Biewald, Matthew Allen Fisher, Mark Oates, Michael Albrecht, Nikhil Velpanur, Owen Campbell-Moore, Owen Skarpness, Ramsey Elbasheer, Robin Graham, Steef, Taras Bobrovytsky, Thomas Krcmar, Torsten Reil, Tybie Fitzhugh.
If you wish to appear here or pick up other perks, click here: https://www.patreon.com/TwoMinutePapers
Thumbnail background image credit: https://pixabay.com/images/id-820315/
Károly Zsolnai-Fehér's links:
Instagram: https://www.instagram.com/twominutepapers/
Twitter: https://twitter.com/twominutepapers
Web: https://cg.tuwien.ac.at/~zsolnai/
Table of contents (2 segments)
Segment 1 (00:00 - 05:00)
Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. This paper is really something else. Scientists at NVIDIA just came up with an absolutely insane idea for video conferencing. Their idea is not to do what everyone else is doing, which is transmitting our video to the person on the other end. No, of course not, that would be too easy! What they do in this work is take only the first image from the video and throw away the entire video afterwards! But before doing so, the method stores a tiny bit of information from it: how our head moves over time, and how our expressions change. That is an absolutely outrageous idea… and of course, we like those around here, so, does this work? Well, let's have a look.

This is the input video. Note that it is not transmitted; only the first image and some additional information are sent, and the rest of the video is discarded. And hold on to your papers, because this is the output of the algorithm compared to the input video. No, this is not some kind of misunderstanding, and nobody has copy-pasted the results there. This is a near-perfect reconstruction of the input, except that the amount of information we need to transmit through the network is significantly less than with previous compression techniques.

How much less? Well, you know what's coming, so let's try it out! Here is the output of the new technique, and here is the comparison against H.264, a powerful and commonly used video compression standard. Well, to our disappointment, the two seem close; the new technique appears better, especially around the glasses, but the rest is similar. And if you have been holding on to your papers so far, now squeeze that paper, because this is not a reasonable comparison. And that is because the previous method was allowed to transmit 6 to 12 times more information. Look, as we further decrease the data allowance of the previous method, it can still transmit more than twice as much information, and at this point, there is no contest. This bitrate would be unusable for any kind of video conferencing, while the new method uses less than half as much information and still transmits a sharp and perfectly fine video. Overall, the authors report that their new method is ten times more efficient. That is unreal.

This is an excellent video reconstruction technique, that much is clear. And if that were all it did, it would be a great paper. But this is not just a great paper, this is an absolutely amazing paper, so it does even more. Much, much more! For instance, it can also rotate our head to produce a frontal video, fix potential framing issues by translating our head, and transfer all of our gestures to a different person. And it is also evaluated well, so all of these new features are tested in isolation. Look at these two previous methods trying to frontalize the input video. One would think that it's not even possible to perform properly, given how much these techniques are struggling with the task… until we look at the new method. My goodness. There is some jumpiness in the neck movement in the output video here, and some warping issues here, but otherwise, very impressive results. And if you have been holding on to your papers so far, now squeeze that paper, because these previous methods are not ancient papers that were published a long time ago. Not at all! Both of them were published within the same year as the new paper. How amazing is that. Wow.
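To make the core idea a bit more concrete, here is a minimal, purely illustrative Python sketch of the transmission scheme described above. None of this is the authors' implementation: `extract_keypoints` and `synthesize_frame` are hypothetical stand-ins for the learned neural networks in the real system, the payload arithmetic uses made-up numbers, and the `psnr` helper is simply the standard peak signal-to-noise-ratio metric of the kind used in such comparisons.

```python
import numpy as np

# --- Hypothetical stand-ins for the learned components (NOT the authors' code) ---
def extract_keypoints(frame: np.ndarray, num_keypoints: int = 20) -> np.ndarray:
    """Placeholder for the learned extractor of keypoints, head pose and expression.
    In the real system this is a neural network; here it returns a dummy vector."""
    return np.zeros((num_keypoints, 3), dtype=np.float32)

def synthesize_frame(reference: np.ndarray, keypoints: np.ndarray) -> np.ndarray:
    """Placeholder for the learned generator that warps the reference image
    according to the received keypoints; here it just returns the reference."""
    return reference

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two frames: higher is better."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# --- Sender: transmit one reference image, then only a tiny packet per frame ---
video = [np.zeros((512, 512, 3), dtype=np.uint8) for _ in range(100)]  # dummy input video
reference = video[0]                                 # the only image that is sent
packets = [extract_keypoints(f) for f in video]      # small per-frame keypoint payloads

# --- Receiver: rebuild every frame from the reference image plus the keypoints ---
reconstruction = [synthesize_frame(reference, kp) for kp in packets]

# Rough, purely illustrative payload arithmetic: a handful of floats per frame,
# versus the much larger per-frame budget of a conventional compressed video stream.
per_frame_bytes = packets[0].size * packets[0].itemsize
print(f"keypoint payload per frame: {per_frame_bytes} bytes")
print(f"PSNR of frame 0 reconstruction: {psnr(video[0], reconstruction[0]):.1f} dB")
```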
I really liked this page from the paper, which showcases both the images and the mathematical measurements against previous methods side by side. There are many ways to measure how close two videos are to each other; the up and down arrows tell us whether the given quality metric should be minimized or maximized. For instance, pixelwise errors are typically minimized, so less is better, while the peak signal-to-noise ratio is to be maximized. And the cool thing is that none of this matters too much as soon as we bring in the new technique, which really outpaces all of these. And we are still not done yet! So, we said that the technique takes the first image, reads the evolution of expressions and the head pose from the input video, and then discards the entirety of the video
Segment 2 (05:00 - 07:00)
save for the first image. The cool thing about this was that we could pretend to rotate the head pose information, and as a result, the head appears rotated in the output image. That was great. But what if we take the source image from one person, and take this data, the driving keypoint sequence, from someone else? Well, what we get is motion transfer. Look! We only need one image of the target person, and we can transfer all of our gestures to them in a way that is significantly better than with most previous methods. Now, of course, not even this technique is perfect; it still struggles a great deal in the presence of occluding objects. But still, just the fact that this is possible feels like something straight out of a science fiction movie. What a time to be alive! Thanks for watching and for your generous support, and I'll see you next time!
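As a closing aside, here is a similarly hedged sketch of the motion-transfer and free-view ideas from this segment. As before, `extract_keypoints` and `synthesize_frame` are hypothetical placeholders for the learned networks, and the `frontalize` pose edit is purely schematic, not the paper's actual head-pose parameterization.

```python
import numpy as np

# Same placeholder components as in the earlier sketch (hypothetical stand-ins,
# not the authors' implementation): a keypoint extractor and a frame generator.
def extract_keypoints(frame: np.ndarray) -> np.ndarray:
    return np.zeros((20, 3), dtype=np.float32)          # dummy keypoints

def synthesize_frame(reference: np.ndarray, keypoints: np.ndarray) -> np.ndarray:
    return reference                                     # dummy generator

# Motion transfer: one image of the target person, keypoints driven by someone else.
target_image = np.zeros((512, 512, 3), dtype=np.uint8)  # single photo of person A
driving_video = [np.zeros((512, 512, 3), dtype=np.uint8) for _ in range(50)]  # video of person B

driving_keypoints = [extract_keypoints(f) for f in driving_video]
transferred = [synthesize_frame(target_image, kp) for kp in driving_keypoints]

# Frontalization / free-view synthesis: because the head pose is just part of the
# transmitted data, it can be edited before synthesis, e.g. zeroed out so the head
# faces the camera. The "pose slice" below is purely illustrative.
def frontalize(keypoints: np.ndarray) -> np.ndarray:
    edited = keypoints.copy()
    edited[:, :2] = 0.0        # pretend the first two columns encode pose offsets
    return edited

frontal_video = [synthesize_frame(target_image, frontalize(kp)) for kp in driving_keypoints]
print(f"synthesized {len(transferred)} transferred and {len(frontal_video)} frontalized frames")
```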