# NVIDIA’s New AI: Wow, 30X Faster Than Stable Diffusion!

## Metadata

- **Channel:** Two Minute Papers
- **YouTube:** https://www.youtube.com/watch?v=qnHbGXmGJCM
- **Date:** 06.03.2023
- **Duration:** 6:47
- **Views:** 119,324
- **Source:** https://ekstraktznaniy.ru/video/13262

## Description

❤️ Check out Anyscale and try it for free here: https://www.anyscale.com/papers

📝 The paper "StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis" is available here:
https://sites.google.com/view/stylegan-t/

📝 My material synthesis paper is available here:
https://users.cg.tuwien.ac.at/zsolnai/gfx/gaussian-material-synthesis/

My latest paper on simulations that look almost like reality is available for free here:
https://rdcu.be/cWPfD 

Or here is the original Nature Physics link with clickable citations:
https://www.nature.com/articles/s41567-022-01788-5

🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible:
Aleksandr Mashrabov, Alex Balfanz, Alex Haro, Andrew Melnychuk, Benji Rabhan, Bryan Learn, B Shang, Christian Ahlin, Edward Unthank, Eric Martel, Geronimo Moralez, Gordon Child, Jace O'Brien, Jack Lukic, John Le, Jonas, Jonathan, Kenneth Davis, Klaus Busse, Kyle Davis, Lorin Atzberger, Lukas Biewald, Matthew A

## Transcript

### Segment 1 (00:00 - 05:00)

Today we are going to look at NVIDIA’s incredible new AI that can create images, and more. Now, wait a second. Stop right there. Every Fellow Scholar knows that today, there are plenty of text-to-image AIs out there, where in goes a piece of text, and out comes an image. They come in all kinds of flavors these days. Everyone knows. So our question today is why publish this paper? Do we really need more of these?

Well, this new paper is called StyleGAN-T. Keep your eyes on this part, because this means that this is a GAN-based technique. A GAN is a Generative Adversarial Network. This roughly means that we have two neural networks competing against each other, and as they compete, they get better together. Okay, that all sounds great, but I am still not convinced. What does this give us? Why would we even use this?
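To make the “two competing networks” idea concrete, here is a minimal sketch of one adversarial training step in PyTorch. This is a toy fully connected GAN, not StyleGAN-T itself (which is a far larger, text-conditioned architecture); the layer sizes and the `train_step` helper are purely illustrative.

```python
# A minimal GAN training step (illustrative toy, not StyleGAN-T itself).
# The discriminator D learns to tell real images from generated ones,
# the generator G learns to fool D, and the two improve together.
import torch
import torch.nn as nn

latent_dim, img_dim = 64, 28 * 28  # toy sizes, just for illustration

G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_images):
    n = real_images.size(0)
    fake_images = G(torch.randn(n, latent_dim))

    # Discriminator step: push real images toward label 1, fakes toward 0.
    d_loss = (bce(D(real_images), torch.ones(n, 1)) +
              bce(D(fake_images.detach()), torch.zeros(n, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make D label the generated images as real.
    g_loss = bce(D(fake_images), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Example: one step on a batch of random "images" in [-1, 1].
train_step(torch.rand(8, img_dim) * 2 - 1)
```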
Well, there are two excellent reasons. Reason number one, GANs are excellent at latent-space interpolation. What does that mean? It means that we can create these interesting 2D spaces, choose a point on this plane, which in this case corresponds to a font. And the points nearby hide other fonts that are similar to this one. So as we start exploring nearby, we get a beautiful, smooth morphing animation between these fonts. In our earlier paper, we did something similar with photorealistic material models, so artists can find, or even better, adjust a material so that it fits their virtual worlds best.

So this new technique supposedly can do proper latent-space exploration for text-to-image. Supposedly. Now let’s see if it is true in practice here too. Here is a previous technique, the crowd favorite, Stable Diffusion. This can make an interesting video, but as you see, the results are quite jumpy. It doesn’t feel like one result morphs into the next one. And now, let’s see the new technique. Oh yes, now that’s what I am talking about! With this, we get more continuous results and can explore these latent spaces as much as we desire, and that is going to be super useful. (A code sketch of this interpolation trick follows at the end of this segment.)

You see, what we can do with this is that we write a prompt, for instance, “A corgi’s head depicted as an explosion of a nebula.” And, we don’t just get an image anymore. No-no, due to its amazing interpolation capabilities, we get an opportunity to not only witness the birth of the universe, but to choose the good boy that we find to be the most adorable. I choose this one. Right before it morphs into a cat. Yes, this one will do. Which one is your favorite? Let me know in the comments below. So its latent-space exploration capability is not only an afterthought here, it is one of the new technique’s key features. Now, remember, I mentioned that this is reason number one of why we should use it. So what is reason number two?

Well, two, it is fast. Real fast. But to know how fast exactly, let’s pop the hood and have a look. Now hold on to your papers, Fellow Scholars, and… what? 0.1 seconds per image? Is that really possible? Wow. These animations can be made practically in real time! The age of real-time AI image, and even video synthesis is here. My goodness! It did not take decades, it didn’t even take years. Less than a year after OpenAI’s DALL-E 2, which asked for approximately 10-15 seconds per image, we are here. Real time. I can’t believe it. Wow! This is truly incredible. However, not even this technique is perfect. Let’s see a failure case. A sign that says deep learning. Come on, this one again? Remember our moment with DALL-E 2? It had the same issue. There are techniques out there that do much better on text; for instance, Imagen Video is better at this. However, it is not nearly as fast. Yes, that one is about a hundred times slower per image.
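As promised above, here is the latent-space interpolation idea as a short sketch. Spherical interpolation (slerp) is a common way to walk between two GAN latent codes; whether StyleGAN-T uses slerp specifically is not stated in the video, and the `model.generate` call in the closing comment is a hypothetical stand-in for a real text-conditioned generator API.

```python
# Spherical interpolation (slerp) between two latent codes. Walking
# along this path and generating one image per point yields a smooth
# morph, unlike the jumpy transitions shown for Stable Diffusion.
import torch

def slerp(z0, z1, t):
    # Interpolate along the great circle between z0 and z1, which keeps
    # intermediate points at latent norms similar to Gaussian samples.
    z0_u, z1_u = z0 / z0.norm(), z1 / z1.norm()
    omega = torch.arccos((z0_u * z1_u).sum().clamp(-1.0, 1.0))
    return (torch.sin((1 - t) * omega) * z0 +
            torch.sin(t * omega) * z1) / torch.sin(omega)

z_a, z_b = torch.randn(512), torch.randn(512)  # two random latent codes
path = [slerp(z_a, z_b, t) for t in torch.linspace(0, 1, steps=60)]

# With a real text-conditioned generator, each latent would become a frame:
#   frames = [model.generate(z, "A corgi's head depicted as an explosion of a nebula")
#             for z in path]
# where `model.generate` is a hypothetical stand-in for the actual API.
```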

### Segment 2 (05:00 - 06:00)

So, the perfect text-to-image AI still doesn’t exist; every technique offers its own little tradeoff. But man, are they all getting better and better at an insane pace. Amazing new papers are popping up every week.

So, what do you think? What would you use this for? Let me know in the comments below!

Thanks for watching and for your generous support, and I'll see you next time!
