❤️ Check out Weights & Biases and sign up for a free demo here: https://www.wandb.com/papers
❤️ Their mentioned post is available here: https://wandb.ai/wandb/egocentric-video-conferencing/reports/Overview-Egocentric-Videoconferencing--VmlldzozMTY1NTA
📝 The paper "Egocentric Videoconferencing" is available here:
http://gvv.mpi-inf.mpg.de/projects/EgoChat/
🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible:
Aleksandr Mashrabov, Alex Haro, Alex Paden, Andrew Melnychuk, Angelos Evripiotis, Benji Rabhan, Bruno Mikuš, Bryan Learn, Christian Ahlin, Eric Haddad, Eric Lau, Eric Martel, Gordon Child, Haris Husic, Javier Bustamante, Joshua Goller, Lorin Atzberger, Lukas Biewald, Matthew Allen Fisher, Michael Albrecht, Nikhil Velpanur, Owen Campbell-Moore, Owen Skarpness, Ramsey Elbasheer, Robin Graham, Steef, Taras Bobrovytsky, Thomas Krcmar, Torsten Reil, Tybie Fitzhugh.
If you wish to support the series, click here: https://www.patreon.com/TwoMinutePapers
Thumbnail background image credit: https://pixabay.com/images/id-820390/
Károly Zsolnai-Fehér's links:
Instagram: https://www.instagram.com/twominutepapers/
Twitter: https://twitter.com/twominutepapers
Web: https://cg.tuwien.ac.at/~zsolnai/
Chapters (5 segments)
<Untitled Chapter 1>
Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. Today we are going to have a look at the state of egocentric videoconferencing. Now, this doesn't mean that only we get to speak during a meeting; it means that we are wearing a camera, which looks like this, and the goal is to use a learning algorithm to synthesize this frontal view of us. Note that what you see here is the recorded reference footage, this is reality, and this would need to be somehow synthesized by the algorithm. If we could pull that off, we could add a low-cost egocentric camera to smart glasses, and it could pretend to see us from the front, which would be amazing for hands-free videoconferencing. That would be insanity.

But wait a second. How is this even possible? For us to even have a fighting chance, there are four major problems to overcome here. One, this camera lens is very close to us, which means that it doesn't see the entirety of the face. That sounds extremely challenging. And if that wasn't bad enough, two, we also have tons of distortion in the images; in other words, things don't look the way they do in reality, and we would have to account for that too. Three, it would also have to take into account our current expression, gaze, blinking, and more. Oh boy. And finally, four, the output needs to be not only photorealistic but, even better, video-realistic. Remember, we don't just need one image, but a continuously moving video output.

So the problem is, once again: input, egocentric view; output, synthesized frontal view. This is the reference footage, reality if you will, and now, let's see how this learning-based algorithm is able to reconstruct it. Um…hello? Is this a mistake? They look identical, as if they were just copied here. No, you will see in a moment that it's not a mistake; this means that the AI is giving us a nearly perfect reconstruction of the remainder of the human face. That is absolutely amazing.

Now, it is still not perfect; there are some differences. So how do we get a good feel for where the inaccuracies are? The answer is a difference image. Look: regions with warmer colors indicate where the reconstruction is inaccurate compared to the real reference footage. For instance, with an earlier method by the name of pix2pix, the hair and the beard are doing fine, while we have quite a bit of reconstruction error on the remainder of the face. So, did the new method do better than this? Let's have a look together. Oh yeah! It does much better across the entirety of the face. It still has some trouble with the cable and the glasses, but otherwise, this is a clean, clean image. Bravo!

Now, we talked about the challenge of reconstructing expressions correctly. Being able to read the other person is of utmost importance during a video conference. So how good is it at gestures? Well, let's put it through an intense stress test! This is as intense as it gets without having access to Jim Carrey as a test subject, I suppose, and I bet there was a lot of fun to be had in the lab on this day. And the results are outstanding, especially if we compare them again to the pix2pix technique from 2017.
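As an aside, a difference image of the kind shown here is simple to compute: take the per-pixel error between the synthesized frame and the reference frame and render it with a "warm = large error" colormap. Below is a minimal sketch using OpenCV; the function and file names are illustrative, not from the paper.

```python
import cv2
import numpy as np

def difference_heatmap(reference_bgr, synthesized_bgr):
    # Per-pixel absolute error, averaged over the color channels.
    err = np.abs(reference_bgr.astype(np.float32) - synthesized_bgr.astype(np.float32))
    err = err.mean(axis=2)
    # Normalize to 0..255 and apply a colormap so warmer colors mean larger error.
    err = cv2.normalize(err, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.applyColorMap(err, cv2.COLORMAP_JET)

reference = cv2.imread("reference_frame.png")      # ground-truth frontal frame
synthesized = cv2.imread("synthesized_frame.png")  # network output
cv2.imwrite("difference.png", difference_heatmap(reference, synthesized))
```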
Comparison to Isola et al. 2017
I love this idea, because if we can overcome the huge shortcomings of the egocentric camera, in return, we get an excellent view of subtle facial expressions and can deal with the tiniest eye movements, twitches, tongue movements, and more. And it really shows in the results. Now, please note that this technique needs to be trained on each of these test subjects. About four minutes of video footage is enough, and this calibration process only needs to be done once. So, once again, the technique knows these people and has seen them before. But in return, it can do even more: if all of this is synthesized, we have a lot of control over this data, and the AI understands what much of this data means. So with all that extra knowledge, what else can we do with this footage?
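To make the per-subject calibration a little more concrete, here is a hedged, hypothetical sketch of what training on a few minutes of paired (egocentric, frontal) footage could look like. The model, the data loader, and the plain L1 reconstruction loss are all placeholder assumptions; the paper's actual architecture and training objective differ.

```python
import torch
import torch.nn as nn

def calibrate(model, paired_loader, epochs=10, lr=2e-4, device="cuda"):
    # Person-specific fine-tuning on paired (egocentric, frontal) frames.
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    l1 = nn.L1Loss()  # simple reconstruction loss as a stand-in
    for _ in range(epochs):
        for ego_frame, frontal_frame in paired_loader:
            ego_frame = ego_frame.to(device)
            frontal_frame = frontal_frame.to(device)
            pred = model(ego_frame)          # synthesize a frontal view
            loss = l1(pred, frontal_frame)   # match the recorded reference
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```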
Synthesis at arbitrary head pose
For instance, we can not only reconstruct, but also create arbitrary head movement. We can guess what the real head movement is because we have a view of the background: we can simply remove it, or, from the movement of the background, we can infer what kind of head movement is taking place. And what's even better, we can not only get control over the head movement and change it, but even remove the movement from the footage altogether. And we can also remove the glasses and pretend to have dressed properly for an occasion. How cool is that?
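The idea that background motion reveals head motion can be illustrated with a standard computer-vision recipe: track background features between consecutive egocentric frames and fit a global transform. Since the background is static, the fitted motion approximates the inverse of the camera's (and therefore the head's) motion. The sketch below uses off-the-shelf OpenCV ORB features with RANSAC as stand-ins; it illustrates the general principle, not the paper's actual pipeline.

```python
import cv2
import numpy as np

def estimate_background_motion(prev_gray, curr_gray):
    # Detect and match ORB features between two grayscale frames.
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:200]

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC rejects foreground (face) points that move differently from
    # the dominant background motion.
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H  # background motion; head motion is roughly its inverse
```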
Comparison to Deep Video Portraits++
Now make no mistake, the paper contains a ton of comparisons against a variety of other works as well; here are some, but make sure to check them all out in the video description.
Limitations
Now, of course, even this new method isn't perfect; for instance, it does not work all that well in low-light situations. But of course, let's leave something to improve for the next paper down the line. And hopefully, in the near future, we will be able to seamlessly get in contact with our loved ones through smart glasses and egocentric cameras. What a time to be alive! Thanks for watching and for your generous support, and I'll see you next time!