# Remove This! ✂️ AI-Based Video Completion is Amazing!

## Metadata

- **Channel:** Two Minute Papers
- **YouTube:** https://www.youtube.com/watch?v=86QU7_SF16Q
- **Date:** 13.10.2020
- **Duration:** 6:33
- **Views:** 342,916
- **Source:** https://ekstraktznaniy.ru/video/14055

## Description

❤️ Check out Lambda here and sign up for their GPU Cloud: https://lambdalabs.com/papers

📝 The paper "Flow-edge Guided Video Completion" is available here:
http://chengao.vision/FGVC/

🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible:
Aleksandr Mashrabov, Alex Haro, Alex Paden, Andrew Melnychuk, Angelos Evripiotis, Benji Rabhan, Bruno Mikuš, Bryan Learn, Christian Ahlin, Eric Haddad, Eric Lau, Eric Martel, Gordon Child, Haris Husic, Javier Bustamante, Joshua Goller, Lorin Atzberger, Lukas Biewald, Michael Albrecht, Nikhil Velpanur, Owen Campbell-Moore, Owen Skarpness, Ramsey Elbasheer, Robin Graham, Steef, Sunil Kim, Taras Bobrovytsky, Thomas Krcmar, Torsten Reil, Tybie Fitzhugh.
If you wish to support the series, click here: https://www.patreon.com/TwoMinutePapers

Károly Zsolnai-Fehér's links:
Instagram: https://www.instagram.com/twominutepapers/
Twitter: https://twitter.com/twominutepapers
Web: https://cg.tuwien.ac.at/~zsolnai/

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. Have you ever had a moment where you took the perfect photo, but upon closer inspection, there was this one annoying thing that ruined the whole picture? Well, why not just take a learning algorithm to erase those cracks in the facade of a building, or a photobombing sheep? Or even to reimagine ourselves with different eye colors? We can try one of the many research works that are capable of something that we call image inpainting.

What you see here is the legendary PatchMatch algorithm at work, which, believe it or not, is a handcrafted technique from more than 10 years ago. Later, scientists at NVIDIA published a more modern inpainter that uses a learning-based algorithm to do this more reliably, and for a greater variety of images. These all work really well, but the common denominator for these techniques is that they all work on inpainting still images.

Would this be a possibility for video? Like, removing a moving object or person from a video? Is this possible, or is it science fiction? Let’s see if these learning-based techniques can really do more. And now, hold on to your papers, because this new work can really perform proper inpainting for video. Let’s give it a try by highlighting this human. And pro tip: also highlight the shadowy region for inpainting to make sure that not only the human, but its silhouette also disappears from the footage. And, look! Wow.

Let’s look at some other examples. Now that’s really something, because video is much more difficult due to the requirement of temporal coherence, which means that it’s not nearly enough if the images are inpainted really well individually; they also have to look good when we weave them together into a video. You will hear and see more about this in a moment.
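To make the idea of temporal coherence a little more concrete, here is a minimal sketch of a flicker score: the mean absolute change between consecutive frames, measured only inside the inpainted region. This is an illustration, not the evaluation used in the paper. The function name `temporal_flicker`, the grayscale nested-list frame format, and the fixed (non-moving) mask are all assumptions made for the example.

```python
def temporal_flicker(frames, mask):
    """Mean absolute change between consecutive frames inside the mask.

    frames: list of 2-D lists of grayscale intensities, all the same size
            (hypothetical toy format chosen for this sketch).
    mask:   2-D list of booleans, True where the region was inpainted.
            Assumed static here; in real footage the mask moves per frame.
    """
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        for y, row in enumerate(mask):
            for x, inside in enumerate(row):
                if inside:
                    total += abs(curr[y][x] - prev[y][x])
                    count += 1
    return total / count if count else 0.0


# Two toy 3-frame clips over a 2x2 image; only the top row is "inpainted".
mask = [[True, True], [False, False]]
steady = [[[10, 10], [0, 0]], [[10, 10], [0, 0]], [[10, 10], [0, 0]]]
flickery = [[[10, 10], [0, 0]], [[50, 50], [0, 0]], [[10, 10], [0, 0]]]

print(temporal_flicker(steady, mask))    # 0.0
print(temporal_flicker(flickery, mask))  # 40.0
```

Each frame of the flickery clip could look plausible on its own, yet the filled region jumps in brightness from frame to frame, which is what the score picks up. With a moving camera this simple difference would also penalize legitimate motion, so a more careful measure would first align consecutive frames, for example with optical flow.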
Not only that, but if we highlight a person, this person not only needs to be inpainted, but we also have to track the boundaries of this person throughout the footage and then inpaint a moving region. We get some help with that, which I will also talk about in a moment. Now, as you see here, these all work extremely well, and believe it or not, you have seen nothing yet, because so far, another common denominator in these examples was that we highlighted regions inside the video. But that’s not all.

If you have been holding on to your papers so far, now squeeze that paper, because we can also go outside, and expand our video spatially with even more content. This one is very short so I will keep looping it. Are you ready? Let’s go. Wow! My goodness! The information from inside the video frames is reused to infer what should be around the video frame, and all this in a temporally coherent manner.

Now, of course, this is not the first technique to perform this, so let’s see how it compares to the competition by erasing this bear from the video footage. The remnants of the bear are visible with a wide selection of previously published techniques from the last few years. This is true even for these four methods from last year. And, let’s see how this new method did on the same case. Yup, very good, not perfect, we still see some flickering. This is the temporal coherence example, or the lack thereof, that I promised earlier.

But now, let’s look at this example with the BMX rider. We see similar performance with the previous techniques, and now, let’s have a look at the new one. Now that’s what I’m talking about! Not a trace left of this person; the only clue that we get in reconstructing what went down here is the camera movement. It truly feels like we are living in a science fiction world. What a time to be alive! Now these were the qualitative results, and now, let’s have a look at the quantitative results.
In other words, we saw the videos, now let’s see what the numbers say. We could talk all day about peak signal-to-noise ratios, structural similarity, or other ways to measure how good these techniques are, but you will see in a moment that it is completely unnecessary. Why is that? Well, you see here that the second best results are underscored and highlighted with blue. As you see, there is plenty of competition, as the blues are all over the place. But there is no competition at all for the first place, because this new method smokes the competition in every category. This was measured on a dataset called Densely Annotated Video Segmentation, DAVIS for short. It contains 150 video sequences and it is annotated, which means that many

### Segment 2 (05:00 - 06:00) [5:00]

of the objects are highlighted throughout this video, so for the cases in this dataset, we don’t have to deal with the tracking ourselves. I am truly out of ideas as to what I should wish for two more papers down the line. Maybe not only removing the tennis player, but putting myself in there as a proxy? We can already grab a controller and play as if we were real characters in real broadcast footage, so who really knows. Anything is possible. Let me know in the comments what you have in mind for potential applications and what you would be excited to see two more papers down the line! Thanks for watching and for your generous support, and I'll see you next time!
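As a footnote to the quantitative results mentioned earlier, the peak signal-to-noise ratio compares a result against a ground-truth image as 10 · log10(MAX² / MSE), where MSE is the mean squared pixel error and MAX is the largest possible pixel value. The sketch below is a plain-Python illustration for a single grayscale frame and is not the paper's evaluation code. The function name `psnr`, the nested-list image format, and the default of 255 for 8-bit images are assumptions made for the example.

```python
import math


def psnr(reference, result, max_value=255.0):
    """Peak signal-to-noise ratio in decibels between two same-sized images.

    reference, result: 2-D lists of grayscale intensities (hypothetical
    toy format chosen for this sketch). Higher values mean the result is
    closer to the reference; identical images give infinity.
    """
    flat_ref = [p for row in reference for p in row]
    flat_res = [p for row in result for p in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_res)) / len(flat_ref)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


ground_truth = [[100, 100], [100, 100]]
inpainted = [[100, 110], [100, 90]]

print(round(psnr(ground_truth, inpainted), 2))  # 31.14
print(psnr(ground_truth, ground_truth))         # inf
```

For a video, one common choice is to compute the score per frame and average it, which is why a high PSNR alone says little about flickering between frames.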