📈 Grab my AI Toolkit - https://academy.jeffsu.org/ai-toolkit?utm_source=youtube&utm_medium=video&utm_campaign=190
Forget the hype about #AI replacing Hollywood overnight. The reality is that although AI video tools are incredibly powerful, there's no magic button that creates entire videos with one prompt. The real challenge is consistency: getting your characters, voices, and scenes to stay the same across multiple clips.
In this video, I walk through the exact 4-step workflow I use to create multi-scene AI videos with consistent characters. We'll use Google Whisk, Flow, a custom Gemini Gem, and ElevenLabs to build a complete skit from scratch.
No BS, just the actual process.
*TIMESTAMPS*
00:00 The Reality of the AI Video Landscape
00:25 Bottom Line Up Front
04:15 Step 1
06:34 Step 2
08:53 Step 3
12:43 Step 4
14:56 Final Thoughts
16:09 Sora 2 Rollout and Implications
*RESOURCES MENTIONED*
Prompts & Templates: https://www.jeffsu.org/the-reality-of-ai-video-a-no-bullshit-guide/
Other free resources: https://www.jeffsu.org/links/?utm_source=youtube&utm_medium=video&utm_campaign=190
*BUILD A POWERFUL WORKFLOW*
📈 The Workspace Academy - https://academy.jeffsu.org/workspace-academy?utm_source=youtube&utm_medium=video&utm_campaign=190
✍️ My Notion Command Center - https://www.pressplay.cc/link/s/DE1C4C50
*BE MY FRIEND:*
📧 Subscribe to my newsletter - https://www.jeffsu.org/newsletter/?utm_source=youtube&utm_medium=video&utm_campaign=description
📸 Instagram - https://instagram.com/j.sushie
🤝 LinkedIn - https://www.linkedin.com/in/jsu05/
*MY FAVORITE GEAR*
🎬 My YouTube Gear - https://www.jeffsu.org/yt-gear/
🎒 Everyday Carry - https://www.jeffsu.org/my-edc/
#googlegemini #googleveo
Hey friends, today's video is a no-bullshit guide on the current state of AI video generation, because if you believe the headlines, the entire Hollywood movie industry is about to be replaced by AI in the next few minutes. In reality, though, we're not even close. So let's look beyond the flashy demos that do a phenomenal job of maximizing shareholder value but little else, and go over what's actually possible right now. Let's get started. Instead of boring you
with technical jargon, let me just show you what's going on, starting off with a simple analogy using ChatGPT. Let's say I ask it to write the opening scene of a TV show. I'll let it run, and in a couple of seconds it'll spit out a script with the setting, the characters, and all the good stuff. Simple enough, right? Now, what happens when I ask ChatGPT in the same chat to write the next scene? Let's run it. As you'll see in the result, it "remembers" what happened in the opening and continues the same narrative. The characters are consistent, the setting is consistent, the story is consistent, and in a nutshell, that consistency is the single biggest roadblock when it comes to generating AI videos. So keep that keyword, consistency, in mind for this next part. Moving over to Google's Flow app, one of the best AI video generation tools right now. Here, I have recreated a scene starring Darth Vader. Let's play this back with audio. It's only 8 seconds. — I am your mother. — Okay, first of all, notice how detailed and realistic this is. Darth Vader is walking towards us with all the right sound effects. Sparks are flying out behind him because he just cooked someone. Not sure why he was cooked. And his voice — I am your mother. — his voice is pretty damn good. And guess what? As long as you're willing to pay a bit of money for the Flow app and use this prompt, which I'll share down below, anyone can create this clip in five minutes. So the point I'm making here is that AI video models are insanely powerful. But if that's the case, what exactly is holding us back from producing Hollywood-grade movies and high-production YouTube videos? Here's the problem. Watch what happens when I try to extend this scene like I did with ChatGPT by using this prompt: Next, Darth Vader raises his other arm with a red lightsaber and says, "Get ready for a spanking." Which is something, you know, Mr. Vader would say.
And we're going to fast-forward this next part. — Get ready for a spanking. — Okay, yeah, that didn't work at all. The lightsaber was already in the scene, and it's in the wrong hand. Darth Vader doesn't even look the same between scenes, the voices are different, and the background completely changed. This is a perfect example of character inconsistency. Editor Jeff here. Quick heads up: OpenAI announced Sora 2 right after I made this video, and they've launched a few features targeting the consistency problem. I'll explain what these features actually do at the end, but the bottom line is they do not replace the need for the workflow I'm about to show you. And with that, let's dive back into the video. Put simply, video models do not remember any details about the scenes they just generated. Even if I used the exact same prompt to describe Darth Vader again, the model would still generate a slightly different character, breaking the consistency across scenes. But don't worry, there is a way to overcome this. Let's take a look at these two skits I created from scratch. Here's the first one. — Google Gemini, can you find that email from yesterday? No, but I can show you an ad that looks like an email. You're welcome. — All right. And the second one. — Hey Gemini, can you play that YouTube video from Jeff Su? Absolutely. But first, please enjoy this unskippable 30-minute ad. — Now, was that perfect? No. Given more time, I could have made them much more polished. But the key is that the appearance and voice of the Gemini mascot stayed the same, or consistent, across scenes. And to achieve that level of consistency, we just need to follow four simple steps. Step one, generating
an image of our character. That's right. Even though this video is all about AI video, our very first step is to use an image generation tool to create a static image of our character. Normally I use Midjourney for this, but since Midjourney is a paid app, we're going to use Google's free image generation tool, Whisk, for this tutorial. And at this point, I want to be very clear: the tools I mention in this video matter a lot less than the workflow and underlying logic. Okay, back in Whisk, I'm just going to paste the prompt that will generate the Gemini mascot character. Don't worry, I'll share all the prompts I use in this video down below. And under settings, this is very important: I'm going to disable precise reference for now, because I want the AI to have more creative freedom. I'll send this, and let's fast-forward this next part. Okay, so the first two results were already great. I just ran it another time to show you that if you aren't happy with the first batch, you can generate a few more batches until you find something you like. For me, I'm going to go with this one right here. I like the fact that it's bigger, and it's a full frontal photo of the character, which might make the future steps easier. Pro tip: if you have an image that you mostly like, but you want to change one specific thing, you can do this. Simply click the refine button and, under settings, make sure precise reference is enabled. Then describe the change you want to make. For example: change the color of the fur to white with pastel orange gradients. Then click generate, and we'll fast-forward this next part. All right, this looks really good. The reason this works so well is that by enabling precise reference, we're telling Whisk to use Google's nano banana image generation model, which is fantastic at maintaining character consistency in still images.
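By the way, if you'd rather script this step than click through Whisk, the same kind of image generation is exposed through the Gemini API's generateContent endpoint. Here's a minimal sketch of what that request looks like; the model name is an assumption on my part, so check Google's current model list before using it:

```python
import json

# Hypothetical sketch: generating the character image via the Gemini API
# instead of the Whisk UI. The model name below is an assumption -- verify
# it against Google's current documentation.
GEMINI_ENDPOINT = "https://generativelanguage.googleapis.com/v1beta/models"

def build_image_request(prompt: str, model: str = "gemini-2.5-flash-image") -> tuple[str, dict]:
    """Return the (url, payload) for a text-to-image generateContent call."""
    url = f"{GEMINI_ENDPOINT}/{model}:generateContent"
    payload = {"contents": [{"parts": [{"text": prompt}]}]}
    return url, payload

url, payload = build_image_request(
    "A friendly blue star-shaped mascot, full frontal view, studio lighting"
)
# To actually run it, POST the payload as JSON with your API key, e.g.:
#   requests.post(url, headers={"x-goog-api-key": API_KEY}, json=payload)
print(url)
print(json.dumps(payload, indent=2))
```

Again, this is just the shape of the call, not a replacement for the workflow; the point of Whisk here is that you get the batch-and-pick loop for free.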
If you don't believe me, you can upload the original image into the Google Gemini web app with image editing enabled, or even Google's AI Studio. Use the exact same prompt, and you'll see that only the fur color changes. The character stays the same. Yes, all three of these methods are free. And no, Google is not sponsoring this video. Although I really wish they would. Maybe I'm just not PC enough. Anyway, once you're happy with the image, just click here to download it, and we're now ready for step two. By the way, I have a free AI toolkit that cuts through the noise and helps you master essential AI tools and
workflows. I'll leave a link to that down below. Step two: create the starting frame. All right, now that we have our main character, it's time to place him into a scene that we will eventually turn into a video clip. Staying right here in Whisk, we're going to expand the sidebar, and we can either upload or simply drag the image from step one into this character box. By doing this, we're basically telling Whisk: hey, see this character right here? I want you to include this exact character in the next scene we generate. Because that's the entire point, right? We want the mascot to have the same appearance in every single scene. After making sure the subject is selected here, we're going to go back into settings and make sure that precise reference is enabled. Then we're going to use this prompt, which again I'll share, to generate the still image of our starting scene, which depicts the mascot talking to a female worker in an office setting. Just like before, the first batch was fine, these two, but then I ran it one more time, and I'm glad I did, because I actually like this one a bit more. The entire mascot is in frame, versus this first option. And for this one, the mouth is all messed up. So I'm going to go ahead and download this image, and this is going to be the first frame of our first video clip. Now, just to prove to you how critical some of these settings are, I'm going to deselect the mascot from the subject, turn off precise reference, and use the exact same prompt. I'm going to speed this up here. All right, these look pretty terrible, because as you can see, without a reference image, Whisk basically generated a mascot from scratch. And these two don't even look the same within the same batch. All right, since we're creating two separate clips for our skit, we just need to repeat the exact same process to create the starting frame for our second video.
I'll keep the mascot selected as the subject, make sure the precise reference feature is toggled on, and simply use a different prompt. This time we're going to have the Gemini mascot interact with a male co-worker. All right, perfect. I think I like this one the most, so I'm going to download this. And now that we have our two starting frames, with our mascot looking perfectly consistent in both, we are finally ready to generate some video footage. Step three, actually creating
the videos. To do this, we're going to head on over to Google's Flow app. Quick heads up: I am using the paid version of Flow, so I have access to the Veo 3 quality model. But I actually tested this with Veo 3 Fast, the model that free users have access to, and it works the exact same way. First, I'm going to select the frames to video option and upload our first starting frame, the one with the female worker in it. Click crop to save. This tells the AI we're giving it a still image that we want to turn into an animated video. Once that is uploaded, I'm going to paste in this prompt, which tells Flow exactly how I want the scene to play out, from the dialogue to the action. Don't worry, we'll talk more about how to write effective text-to-video prompts in a little bit. Under settings, I want this to be in landscape, and I actually want four outputs per prompt. Yes, this eats up my credits a lot faster, but it gives me a much higher chance that at least one of the outputs will be usable. You'll see what I mean. And we're going to hit generate. Okay, the videos are done. Let's go over an example of a bad output. — Google Gemini, can you find that email from yesterday? — No, but I can show you an ad that looks like an email. You're welcome. — Okay, so obviously that doesn't work, but because we had four outputs, at least one of these should be good. Luckily for me, all three of the other ones were fine, but I like this one the most. Let's play it back. — Google Gemini, can you find that email ... like an email. You're welcome. — So I'm going to favorite this and then download it for the next step. Usually I would go with upscaled if I were uploading to, say, YouTube. But since we're just going over an example here, I'm going to choose the original size. Now we just rinse and repeat for our second scene. We're going to keep frames to video selected,
then upload and select the starting frame for our second clip. I'm going to paste in this prompt, hit generate, and we'll fast-forward to the next part. All right, the videos are done, and looking through this batch, I like this one the most. — Hey Gemini, can you play that YouTube video from Jeff Su? Absolutely. But first, please enjoy this unskippable 30-minute ad. — All right, I'm going to favorite it and download it in original size. Let's actually play both clips back to back. — You're welcome. — Hey Gemini, can you play that YouTube video from Jeff Su? — Absolutely. — Again, definitely not perfect, but the important thing is the Gemini mascot looks the same across both clips. But there's another issue: the voice of the Gemini mascot is completely different in the two scenes. Don't worry, we're going to fix that in the next step. But before we do, let me share my process for writing text-to-video prompts. In a nutshell, I created a Gemini Gem that takes user input and outputs an optimized video prompt. I've also uploaded video prompting best practices as knowledge files. After starting a new chat in the Gem, I first upload the starting frame image and a screenshot of the Flow app to give it additional context. Then I just describe the scene I want and give it a script, which I obviously had to come up with, and the Gemini Gem will then write a detailed prompt optimized for Google's Veo model. I'll link this below for you to try for free, but let me know in the comments if you want a full video on how to create powerful Gemini Gems and custom GPTs, because it's actually a lot of work to create something really good. Step four. All right, our final step is to give our mascot a consistent voice across both
scenes. For this, we're going to use a tool called ElevenLabs. Once you're signed in, navigate to the voice changer option on the left here, and we're going to upload the video file for scene one that we downloaded from Flow. Then we're going to choose a voice we want to change our character's voice to, and I decided on Malvorex, the monster voice, which sounds about right. Then I'm going to click generate speech and wait a little bit. Okay, let's play this back. — Google Gemini, can you find that email from yesterday? No, but I can show you an ad that looks like an email. You're welcome. — Okay, you'll notice that the mascot's voice and the female professional's voice have both been changed, but that's okay, it's part of the plan. We can now download this new audio. Next, we're going to do the exact same thing for scene two. We're going to upload the video, and the most important part is that we select the exact same voice, of course, the monster, because the whole point is to keep the mascot's voice consistent. And we're going to click generate speech. Okay, I played it back, and it sounds good, so we're going to download this to use in the next and final step. Now, for the final bit of magic, we need to bring the original video clips, the ones with inconsistent audio from Flow, and the new audio files from ElevenLabs into a video editing tool like Final Cut Pro. First, we're going to detach the original inconsistent audio from both clips. Then we're going to bring in the two new audio files we just generated with ElevenLabs. And here's a key step: I'm going to manually replace only the mascot's lines with the new consistent monster voice. This way, the human actors keep their original voices, but our mascot now sounds exactly the same across both scenes. And as a final touch, to really sell the scene, we can layer in some subtle ambient office sound effects in the background. — No, but I can show you an ad that looks like an email. You're welcome.
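Quick aside: if you don't have Final Cut Pro, swapping a clip's audio can also be done from the command line with ffmpeg. This sketch only replaces the whole track at once; splicing in just the mascot's lines, like I did above, still calls for a real editor, and the file names here are placeholders:

```python
# Sketch: swapping a clip's audio track with ffmpeg instead of a video editor.
# This replaces the ENTIRE track; per-line splicing still needs an editor.

def build_replace_audio_cmd(video_in: str, new_audio: str, video_out: str) -> list[str]:
    """Build the ffmpeg argv that keeps the video stream and muxes in new audio."""
    return [
        "ffmpeg",
        "-i", video_in,      # original clip from Flow (inconsistent voice)
        "-i", new_audio,     # converted audio downloaded from ElevenLabs
        "-map", "0:v:0",     # take the video stream from the first input
        "-map", "1:a:0",     # take the audio stream from the second input
        "-c:v", "copy",      # don't re-encode the video
        "-shortest",         # stop at the shorter of the two streams
        video_out,
    ]

cmd = build_replace_audio_cmd("scene1_flow.mp4", "scene1_elevenlabs.mp3", "scene1_final.mp4")
print(" ".join(cmd))
# Run it with: subprocess.run(cmd, check=True)
```

The `-c:v copy` flag matters: it keeps the Veo-generated frames untouched instead of re-encoding them, so you only pay for the audio mux.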
Hey Gemini, can you play that YouTube video from Jeff Su? — Absolutely. But first, please enjoy this unskippable 30-minute ad. — And with that, we've successfully created a multi-scene skit with an AI character that is both visually and
audibly consistent. A few things I want to leave you with. First, it's totally possible to have two or more consistent characters across scenes. Simply upload two or more subjects into Whisk, describe the scene, and use that as your starting frame. The principle is the same. Second, let's talk about third-party tools. There are capable AI video tools out there, like OpenArt, Hailuo, and Kling, that market themselves as all-in-one solutions. These tools do make the video generation process easier, but in order to produce polished videos, there's still a ton of manual work involved, like generating the initial character and fixing the audio. Not to mention, those tools aren't exactly easy to use for the average person. So here's the bottom line. Again, video models have gotten extremely powerful, but AI video tools are just that: tools. We need to learn what each tool is good for and build a workflow that combines the strengths of each one. Just think about what we did today. First, we used Whisk to generate our character. Then we used Whisk again to create the starting frame. Then we used a custom Gemini Gem to write a text-to-video prompt. We used Flow to actually generate the video. Then we used ElevenLabs to generate consistent audio. And after all of that, we still had to use a video editor to piece it
all together. All right, so about Sora 2. They announced two features. The first one is called Cameo, which uses a recording of your actual face and voice to keep your likeness consistent across scenes. The issue here is that Cameo only works with real people and pets, so it's very limited in the characters we can actually create. The second feature is called Recut, which lets you load the last few seconds of a clip into your next prompt to maintain continuity. If this works as intended, it's a big deal, but it's just one step in the workflow. We still need to generate the character, write robust video prompts, fix the audio, and so on. So yeah, these seem like awesome features, but they're just that: features that need to be integrated within a broader workflow. Let me know if you want a full tutorial on Sora 2. See you in the next video, and in the meantime, have a great one.
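One last thing, circling back to the Gemini Gem from step three: under the hood, all it really does is turn a scene description plus a script into one structured text-to-video prompt. Here's a toy sketch of that idea in plain Python; the section labels are my own convention, not an official Veo prompt format:

```python
# Toy sketch of what the step-three Gemini Gem does: combine a scene
# description and dialogue into one structured text-to-video prompt.
# The section labels are my own convention, not an official Veo schema.

def build_video_prompt(subject: str, setting: str, action: str,
                       dialogue: list[tuple[str, str]],
                       camera: str = "static medium shot") -> str:
    """Assemble a labeled, multi-section prompt string for a video model."""
    lines = "\n".join(f'{speaker}: "{line}"' for speaker, line in dialogue)
    return (
        f"Subject: {subject}\n"
        f"Setting: {setting}\n"
        f"Camera: {camera}\n"
        f"Action: {action}\n"
        f"Dialogue:\n{lines}"
    )

prompt = build_video_prompt(
    subject="the Gemini mascot from the uploaded starting frame",
    setting="a modern open-plan office, soft daylight",
    action="the mascot turns to the worker and answers cheerfully",
    dialogue=[
        ("Worker", "Google Gemini, can you find that email from yesterday?"),
        ("Mascot", "No, but I can show you an ad that looks like an email. You're welcome."),
    ],
)
print(prompt)
```

The Gem does this with an LLM instead of string formatting, of course, which is what lets it expand a one-line description into camera moves and sound cues, but the input-to-structured-prompt shape is the same.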