# Google's New Text To Video BEATS EVERYTHING (LUMIERE)

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=DN-krCcwnhQ
- **Date:** 25.01.2024
- **Duration:** 18:27
- **Views:** 100,703

## Description

💬 Access GPT-4, Claude-2 and more - chat.forefront.ai/?ref=theaigrid
🎤 Use the best AI Voice Creator - elevenlabs.io/?from=partnerscott3908
✉️ Join Our Weekly Newsletter - https://mailchi.mp/6cff54ad7e2e/theaigrid
🐤 Follow us on Twitter https://twitter.com/TheAiGrid
🌐 Checkout Our website - https://theaigrid.com/

https://lumiere-video.github.io/

Welcome to our channel where we bring you the latest breakthroughs in AI. From deep learning to robotics, we cover it all. Our videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on our latest videos.

Was there anything we missed?

(For Business Enquiries)  contact@theaigrid.com

#LLM #Largelanguagemodel #chatgpt
#AI
#ArtificialIntelligence
#MachineLearning
#DeepLearning
#NeuralNetworks
#Robotics
#DataScience

## Contents

### [0:00](https://www.youtube.com/watch?v=DN-krCcwnhQ) Segment 1 (00:00 - 05:00)

So, Google Research recently released a stunning paper in which they show off a very, very state-of-the-art text-to-video generator. By far, this is likely the best text-to-video generator you've seen. I want you guys to take a look at the video demo that they've shown us, because it's fascinating, and then after that, I'll dive into why this is state-of-the-art and just how good it really is.

So now, one of the most shocking things from Lumiere, and I'm not sure if that's exactly how you pronounce it, but one of the very most shocking things that we did see was, of course, the consistency in the videos and how certain things are rendered. Now, there are a bunch more examples that they didn't actually showcase in this small video, so I will be showing you those from the actual web page, but it is far better than anything we've seen before. And some studies that they did confirm this. For example, in their user study, what they found was that their method was preferred by users in both text-to-video and image-to-video generation. In one of the benchmarks, I'm not sure what the quality score was, but you can see that theirs, which is of course Lumiere, actually performed a lot better than Imagen, Pika Labs, ZeroScope, and Gen-2. Gen-2, if you don't know, is Runway's video model, and Runway actually did recently launch a bunch of stuff. If we look at text alignment as well, we can see that across all the different video models, this is the winner. And then on image-to-video quality, you can see that it wins a lot of the time against Pika and against Stable Video Diffusion, I'm pretty sure that's what that is. Then for image-to-video, you can see it wins against Pika Labs and against Gen-2. So overall, we do know that right now this is the gold standard in text-to-video, which is a very good benchmark, because many people have been discussing how 2024 is likely to be the year for text-to-video.

Now, what I do want to talk about before I dive into some of the crazier examples is, of course, the new architecture, because what exactly is making this so good? As you know, it looks fascinating in terms of everything it can do, and when I show you more examples, you're going to see exactly why this is even better than you thought. So, essentially, the first thing they do is utilize the Space-Time U-Net architecture. Unlike traditional video generation models that create keyframes and then fill in the gaps, Lumiere generates the entire temporal duration of the video in one go, and this is achieved through a unique Space-Time U-Net architecture which efficiently handles both the spatial and temporal aspects of the video data. What they also do is temporal downsampling and upsampling: Lumiere incorporates both spatial and temporal downsampling and upsampling in its architecture. This approach allows the model to process and generate full-frame-rate videos much more effectively, leading to more coherent and realistic motion in the generated content. Now, of course, what they also did was leverage pre-trained text-to-image diffusion models.
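To make the space-time idea a bit more concrete, here is a minimal, hypothetical PyTorch sketch of a factorized space-time block with temporal downsampling. This is only an illustration of the general technique the paper describes; the layer names and sizes are made up, and none of it is Lumiere's actual code.

```python
# Minimal sketch of a factorized space-time block, in the spirit of a
# Space-Time U-Net (STUNet). Hypothetical layers/sizes, not Lumiere's code.
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    """Applies a spatial conv per frame, then a temporal conv per pixel."""
    def __init__(self, channels: int):
        super().__init__()
        # A (1, 3, 3) kernel touches only spatial dims; (3, 1, 1) only time.
        self.spatial = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, frames, height, width)
        x = self.act(self.spatial(x))
        return self.act(self.temporal(x))

class TemporalDownsample(nn.Module):
    """Halves the frame count, so deeper layers see a compressed clip."""
    def __init__(self, channels: int):
        super().__init__()
        self.down = nn.Conv3d(channels, channels, (3, 1, 1),
                              stride=(2, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(x)

if __name__ == "__main__":
    clip = torch.randn(1, 64, 16, 32, 32)  # 16 frames of 32x32 features
    h = SpaceTimeBlock(64)(clip)
    h = TemporalDownsample(64)(h)
    print(h.shape)  # torch.Size([1, 64, 8, 32, 32]) -- time axis halved
```

The point of downsampling the time axis along with space is that the deepest layers operate on a temporally compressed version of the whole clip, which is what lets the model reason about the full duration in one pass instead of generating keyframes and interpolating between them.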

### [5:00](https://www.youtube.com/watch?v=DN-krCcwnhQ&t=300s) Segment 2 (05:00 - 10:00)

The researchers built upon existing text-to-image diffusion models, adapting them for video generation, and this allows the model to benefit from the strong generative capabilities of these pre-trained models while extending them to handle the complexities of video data. Now, one of the significant challenges in video generation is, of course, maintaining global temporal consistency, and Lumiere's architecture and training approach are specifically designed to address this, ensuring that the generated videos exhibit coherent and realistic motion throughout their duration.

Now, this is Lumiere's GitHub page, and this is by far one of the very best things I've ever seen, because I want to show you guys some of these examples to show you how advanced this really is. One of the clips I want you to pay attention to, and I'm going to zoom in here, is this Lamborghini, because it actually shows us how crazy this technology is. We can see that the Lamborghini is driving, and then as it rotates, we can see that not only is the wheel moving, but we can also see the other angles of the Lamborghini too. I would say that if we compare it to some of the other video models, one of the things they struggle with is motion and rotation, but seemingly they've managed to solve this with the new architecture, and things like the Lamborghini and rotations, which are a real struggle for video, aren't going to be a problem.

Now, another one of my favorite examples was, of course, beer being poured into a glass. If we take a look at this, it's absolutely incredible, because we can see the glass just being filled up, and it looks so good and realistic. I mean, we have the foam, we have the beer actually moving up, we have the bubbles, and everything just looks really realistic. If someone were to say this is just a low-FPS video of me pouring liquid into a glass, I would honestly believe them. And even if you don't think it's realistic, I think we can all agree that this is very good for text-to-video. And if you just hover over it, you can see the input prompt.

Now, some of these are just really, really good showcases of how good it is at rotations, because I've seen some of the other video models, and this is something that we've only recently, like literally yesterday I saw a preview, managed to solve a little bit. If we take a look at the bottom left, we can see that the sushi is rotating, and to me it doesn't look as AI-generated as many other videos. The one issue that AI-generated videos do suffer from is, of course, low resolution and low frames per second, but I think that is going to be solved very soon. And with what we have here, if we look at the confident teddy bear surfer riding waves in the tropics, and how the water ripples every single time the surfboard makes impact with the water, I think we can say it does look very realistic. Then, of course, we have the chocolate muffin video clip. This one also looks super temporally consistent; just the way it rotates looks like nothing we've ever seen before. And of course this wolf one, a silhouette of a wolf against a twilight sky, also looks very, very accurate and very good.
So, I mean, these text-to-video demos, I would say, are just absolutely outstanding. This one right here, the fireworks that we're looking at, is definitely something that I've seen done by other models before, but it does go to show how good it is. And this one, a camera moving through dry grass on an autumn morning, also shows just how good it is. Now, with regards to walking and legs and stuff like that, there is still a bit of a small issue there. And there are some other things I want to discuss about this entire project, because I'm pretty sure it's a collaboration of some other AI projects that Google has done before, and I can't wait to see if Google manages to finally release this.

Some of the other demos that are my favorites: of course, the chocolate syrup pouring on vanilla ice cream, which looks really good, and then this walking clip, which doesn't look too bad. And I think that when we take a look at certain videos that are very subtle in nature, for example the blooming cherry tree in the garden, that looks pretty subtle, and then of course the Aurora Borealis, that one looks pretty subtle too. So a lot of these videos, I personally think, are just absolutely the best.

And of course, we do need to take a look at stylized generation, because this is something that is really, really important for generating certain styles of videos, and Google's Lumiere does it really, really well. Another thing that I did also notice, because I stay up to date with pretty much all of Google's AI research, is that this stylized generation right here is definitely taking the research from another Google paper that was called StyleDrop, and I'll show you guys that in a moment. But I think it just goes to show that when Google combines all of their stuff, they're probably building some very comprehensive video system, and whenever they do release it, it's going to be absolutely incredible. Because if we look at this, it's just one reference image, and then all of the generated videos follow that style.
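As an aside on how this kind of stylization is often wired up: the common pattern is to fine-tune the text-to-image backbone on the reference style (à la StyleDrop) and then blend the fine-tuned weights with the original ones. The sketch below shows that generic weight-interpolation trick with made-up names; it is an assumption about the mechanism, not code from Lumiere or StyleDrop.

```python
# Generic weight-interpolation sketch for stylized generation.
# Hypothetical: `base_state` / `style_state` are state dicts of the same
# model before and after a StyleDrop-style fine-tune on one reference image.
import torch

def blend_style_weights(base_state: dict, style_state: dict, alpha: float) -> dict:
    """Interpolate two state dicts; alpha=0.0 is the base model, 1.0 is full style."""
    assert base_state.keys() == style_state.keys()
    return {
        name: torch.lerp(base_state[name].float(), style_state[name].float(), alpha)
        for name in base_state
    }

# Usage sketch (hypothetical model object):
#   model.load_state_dict(blend_style_weights(base, styled, alpha=0.7))
```

A nice property of this approach is that `alpha` becomes a dial: lower values keep more of the base model's general knowledge, while higher values push harder toward the reference style.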

### [10:00](https://www.youtube.com/watch?v=DN-krCcwnhQ&t=600s) Segment 3 (10:00 - 15:00)

These kinds of videos are going to be very useful for people who are trying to create certain styles for certain things. And of course, we can see that this is like some kind of 3D-animation style, and the videos from that actually look very good too. So this is what I'm talking about when I say StyleDrop, and I'm going to show you guys that page now. Google previously released this research paper, sometime last year, and you can see that it was essentially based on similar style references. Now, I'm not sure how much they've changed the architecture, but you can see that it's a text-to-image maker, and essentially, when it generates images, it uses the reference image as a style. You can see just how good that stuff looks. I mean, if we take a look at this Vincent van Gogh style, and then at the other images, they just look absolutely incredible. And of course, we have the same exact one here from the StyleDrop paper, as videos.

I think this is really important, because it looks like Google has managed to combine everything from their previous research, like MAGVIT and VideoPoet, all into one unique thing, and I think this is going to be super effective. People are wondering, and one of the questions has been: why no code, why no model, why no open-source weights? I think the reason Google has chosen not to release this model, its weights, or its code is that I'm pretty sure they are going to be building on it to release it into perhaps Gemini or a later version of another Google system. Now, I could be completely wrong; Google has been known in the past to just build things and sit on them. But given how competitive things are, and the fact that this is state-of-the-art, there aren't any other models that seem to be competing in this area, so this is an area that Google could easily dominate. And since Google did lose before, to ChatGPT, in the AI race, I'm sure they would try to stay ahead now that they seemingly have the lead. So, I don't know. They may do that, they may not. Google has previously just sat on things before, but I do think they might polish the model and then release it. I think it would be really cool if they did, and I really hope they do, because it would make other things even more competitive.

The key thing here as well was the video stylization, and I don't think you understand just how good this is. Like the made-of-flowers one right here is just absolutely incredible. I mean, look at that. That honestly looks like CGI; if I saw that, I would be like, "Wow, that's some really cool CGI." Other styles aren't as aesthetic or as good, but for some reason the Lego one, for example, if we take a look at this Lego car, doesn't look AI-generated in the sense that it was just from AI; it actually looks like a Lego car. And the same goes for the flowers one. I'm not sure why; I think it's because of the way AI generates these images, they're kind of fine-grained, and flowers just look very fine, detailed, and intricate anyway, so that one doesn't look bad at all. But it does look really cool.
So, yeah, I think what we've seen here in terms of the video stylization shows us just how good of a model this is.

Now, with the cinemagraphs, I do think this is another fascinating piece of the paper, because this is where the model is able to animate the content of an image within a specific, user-provided region, and I do think this is really effective. What was fascinating was that a couple of days ago, Runway actually released their ability to do this. If you haven't seen it before, I'm going to show it to you now: essentially, Runway has a brush where you can select specific parts of an image, adjust the movement of those brushed regions, and then animate a specific character. Now, I know this isn't a Runway video, but it goes to show that this is a new feature being rolled out to video models across different companies. I think that in the future, since video models aren't always the best at animating certain things, we're going to have a lot more customization, and that's what we're seeing here with Lumiere. Of course, the fire looks really, really good, the butterfly here also looks really cool, the water looks like it's moving realistically, and this smoke train also looks very, very effective. There weren't that many demos of this, but it was enough to show us that it was really good.

Now, video inpainting was something we did look at before. I think it was either VideoPoet or MAGVIT that showed us this, but at the time it honestly wasn't as good as this. I mean, it was decent, but this is different, like a completely different level. Imagine having just half of a video and then being able to just say, fill in the rest. If you don't know what this is, it's basically generative fill for video, and I think that having this is just pretty crazy, because you can just say, okay, fill it in, with a text prompt. I mean, just look at the way the chocolate falls on this one. It's definitely really, really effective at doing that.
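To give a rough sense of how mask-conditioned generation like the cinemagraphs and inpainting typically works, here is a hedged sketch of one inpainting-style denoising step. Everything here is a stand-in: `denoise` represents one reverse step of some video diffusion model, and the masking logic is the generic keep-the-user's-pixels trick, not Lumiere's actual conditioning scheme.

```python
# Generic mask-conditioned generation step (cinemagraph / inpainting style).
# Hypothetical names throughout; assumes a model where a noisy sample is
# x_t = x_0 + sigma_t * noise. Not taken from the Lumiere paper.
import torch

def masked_step(x_t, source, mask, denoise, sigma_t):
    """One step: generate inside `mask`, pin everything else to `source`.

    x_t:    (B, C, T, H, W) current noisy video state
    source: (B, C, T, H, W) user-provided frames to keep outside the mask
    mask:   (B, 1, T, H, W) 1.0 where the model is free to generate
    """
    # The model proposes a denoised video everywhere ...
    proposal = denoise(x_t, sigma_t)
    # ... but outside the mask we substitute the original content, noised
    # to the current level so both regions share the same noise scale.
    pinned = source + sigma_t * torch.randn_like(source)
    return mask * proposal + (1.0 - mask) * pinned
```

Run at every step of the sampler, this keeps the untouched region faithful to the input while the model fills in the masked region, which matches the user experience the video describes: brush a region, and only that region animates.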

### [15:00](https://www.youtube.com/watch?v=DN-krCcwnhQ&t=900s) Segment 4 (15:00 - 18:00)

So, I think this one is definitely going to have some wide-scale uses. And of course, this next one is probably going to have the most, because you can change different things. You can literally just say: wearing a red scarf, wearing a purple tie, sitting on a stool, wearing boots, wearing a bathrobe. I think a lot of this stuff is most certainly fascinating.

Another thing we also haven't taken a look at yet was, of course, the image-to-video. With image-to-video, I think this is really good as well, because some of the models don't always generate the best images, and if you want to generate certain images yourself, you're going to want to be able to animate those specifically. So I think the image-to-video section of the model is rather effective too. And I always find it very funny and hilarious that for some reason all of these video models decide to use a teddy bear running in New York as some kind of benchmark, but this one definitely does look better than previous iterations. I do think that for some reason the text-to-video model is better than the image-to-video model, just simply based on how things are done. But, for example, things like the ocean waves, or the way the giraffe is eating grass: I know they definitely trained this on a huge amount of data, because if you've ever seen giraffes eating grass, they eat it exactly like that. It's not a weird AI-generated mouth. Also, if you look at waves, waves look exactly like that, and fire moves exactly like that too. So there is a real, big level of understanding, like a huge level of understanding, of what's being done here.

And even if we look at a happy elephant like this one right here, a happy elephant wearing a hat under the sea, when you hover over it, you can see the original image. So this is what the original image looks like, and this is what the generated video looks like. We can see that it's kicking up the water as it's moving underwater, which is, I don't know, kind of weird, but it also looks pretty cool if you ask me. And then there's that notable image of soldiers raising the United States flag on a windy day, and we can see that it is moving. And of course, we've got this very famous painting, and even more waves. I think in certain scenarios, for example with liquids, it seems to work pretty well with water, and fireworks and, for some reason, rotating objects now work really well.

But I think the main question coming away from this is: is Google going to release this? Are they going to build it into a bigger project, or are they waiting until it's more polished? Currently it is state-of-the-art, so I guess we're going to have to wait and hear from Google themselves. But I do know that one thing that is a bit different for larger companies is that there's a difference between getting AI research done and just having it out there and releasing it, versus actually having a product that people are going to use. It's all well and good being able to do something that is fascinating, astounding, and really good, but translating that into a product that people can actually use effectively is another issue.
So, I don't know if they're going to do that soon, but I will be looking out for it, because I do want to be able to use this and test it to see just how well it does against certain prompts and against things like Runway, Pika Labs, and of course, Stable Video Diffusion. So, what do you think about this? Let me know what your favorite feature is going to be. My favorite feature is, of course, just the text-to-video, because I'm just going to use that once it comes out, if it ever comes out. But other than that, I think this is an exciting project. I think there are a lot more things to be done in this space, and if things continue to move at this pace, I really do wonder where we will be at the end of the year.

---
*Source: https://ekstraktznaniy.ru/video/14559*