I have been doing this job for 20 years. And every single time I think I have seen it all, then something like this lands on my desk. Unbelievable. So, I am going to ask you one more time. Where were you on the night of the fourteenth?

So, this was DramaBox. The model has just been released, and in this video, we are going to install it locally and test it out. Plus, I will be telling you all about its features, its parameters, and how it was trained.

DramaBox is a fine-tune of Lightricks' LTX-2.3. It is a 3.3 billion parameter audio-only diffusion transformer using flow matching, conditioned on Gemma 3 12B text embeddings. The architecture is quite interesting. It combines a diffusion transformer backbone with an audio variational autoencoder, which means it can bring the voice from latent space to the space where you can listen to it quite easily. It also uses a vocoder for pauses and other stuff, and we are going to test it out. I will talk more about it, but for now, let's get it installed.

By the way, this is Fahad Mirza, and I welcome you to the channel. Please also follow me on X if you're looking for AI updates.

I'm going to use this Ubuntu system. I have one GPU card, an Nvidia RTX 6000 with 48 GB of VRAM. If you're looking to rent a GPU at a very good price, you can find the link to Massed Compute in the video's description with a 50% discount coupon code for a range of GPUs.

Now, let me git clone the DramaBox repo, and I will drop the link to it in the video's description. And now, let's install all the requirements. Everything is installed. Let me now launch the Gradio demo. The first time you run this, it downloads the model. The model is downloaded, and the demo is running on our localhost at port 7860, as you can see here. Let me show you the VRAM consumption. So, it is consuming just over 16 GB of VRAM.

Okay, let's test it out. For the first test, I'm going to give it the scene prompt, and I'm not going to give it any audio.
I just want to see how exactly it does expression. So, I'm just going to click on generate. While it generates, you can see what we are asking it: a delusional, overconfident man speaks with complete sincerity, "I have been thinking about this very carefully, and I believe I'm ready to start dating a supermodel." He nods firmly, "She should be stunning, wealthy, Harvard-educated, penthouse, yacht, the full package." He pauses thoughtfully, "Now, people keep telling me I need to work on myself first." His voice fills with genuine indignation. So, you see there are a lot of expressions, and a lot of things are happening here. Let's see what our model does with it. Okay, it has already finished. Let me play this for you.

I have been thinking about this very carefully, and I believe I am ready to start dating a supermodel. She should be stunning, wealthy, Harvard-educated, penthouse, yacht, the full package. Now, people keep telling me I need to work on myself first. Work on myself? I showered in 2023. What more do they want? Self-improvement is for insecure people. I am perfectly comfortable exactly as I am. [sighs] She will just uh have to appreciate me for my raw, unfiltered authenticity. I give this plan a very high chance of success.

So, what do you think? About the plan, of course, but also about the quality of the model. Is it expressive in a human-like way? I think it is still a bit robotic and plasticky, but as far as expressions are concerned, there is a lot of improvement, no doubt about that.

Now I'm going to do a test in a female voice. I have already given it a reference audio, and I'm generating it. You can see that the scene here is about durian, which is a Southeast Asian fruit. Some people love it, some just hate it. It is a very, very polarizing fruit. And this is the situation. It has already generated the audio. Let
Segment 2 (05:00 - 09:00)
me play the reference audio first.

Happiness is a fleeting feeling that can be found in life's simplest moments. A warm conversation with a loved one, a beautiful sunset.

So, now you know the reference voice. Let's play the resulting one.

Okay, I have heard so much about this durian thing, and today I am finally trying it. Oh. Oh, no.

So, you see, I don't think the voice cloning is as good as it should be, but the expressions are there.

Oh. Oh, no. That is not a fruit smell. That is a crime scene. Okay. Okay, I'm doing this. I am a grown woman. It is actually Wait. Why does it taste like custard? Why does something that smells like a gym locker taste like custard? I do not understand what is happening to my brain right now. Oh, no. Oh, no, no. I think I like it.

The expressions are pretty good. No doubt about that. Let's do another scene, where this bewildered Icelandic woman speaks with exhausted resignation, "I'm a 45-year-old divorced woman. I have one succulent and I ask for nothing else from this life." And then there's a situation. There's also a female voice which I'm going to clone. So, I will just click on generate, and while it generates, let's talk a bit more about this model and what exactly it is.

So, what this model is doing differently is treating the prompt itself as a full performance script, as you just saw. Dialogue goes inside double quotes and the model speaks it literally. Everything outside the quotes is a stage direction, things like his voice rises with fury or he clears his throat, which shape the delivery without being spoken. So, you're not just writing text, you're directing a performance. That is why it is named DramaBox. You can layer in emotional transitions mid-sentence, shift from a shout to a whisper, or drop in a laugh or a sigh, all purely through how you write the prompt.
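To make that quoting rule concrete, here is a minimal Python sketch that splits a prompt into spoken dialogue and stage directions. It is only an illustration of the convention described above: `split_script` is a hypothetical helper, not part of DramaBox, and nothing here says how the model itself reads the prompt.

```python
import re

# Hypothetical helper, NOT part of DramaBox. It only illustrates the
# convention "text in double quotes is spoken, the rest is direction".
QUOTED = re.compile(r'"([^"]*)"')


def split_script(prompt):
    """Return (kind, text) pairs, where kind is 'dialogue' or 'direction'."""
    parts = []
    cursor = 0
    for match in QUOTED.finditer(prompt):
        # Anything between the previous quote and this one is a direction.
        direction = prompt[cursor:match.start()].strip()
        if direction:
            parts.append(("direction", direction))
        spoken = match.group(1).strip()
        if spoken:
            parts.append(("dialogue", spoken))
        cursor = match.end()
    # Trailing text after the last closing quote is also a direction.
    tail = prompt[cursor:].strip()
    if tail:
        parts.append(("direction", tail))
    return parts


prompt = (
    'A delusional, overconfident man speaks with complete sincerity, '
    '"I believe I\'m ready to start dating a supermodel." '
    'He nods firmly, "The full package." '
    'His voice fills with genuine indignation.'
)

for kind, text in split_script(prompt):
    print(f"{kind:9} | {text}")
```

Running a prompt through something like this before pasting it into the demo is a quick way to check that every line you want spoken really sits inside double quotes.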
You can also drop in an optional 10-second voice reference clip, which we have just done, and the model clones that timbre, but it is not really good enough, as we just saw in various examples, and I'm doing another one right now. So, hopefully in a future version, they can offer the same expressive control on top of any voice you feed it. Okay, let me first play the reference voice.

Like the video. Subscribe to Fahad's channel. And become a member. As that helps a lot.

You guys don't listen to me. At least listen to that voice. Okay, let me play the resulting one.

I am a 45-year-old divorced woman. I ask for nothing else from this life. [sighs and gasps] My Australian neighbor, he is an AI YouTuber in Iceland. in great in minus 8 degrees. Last Tuesday he flung a rose petal at me across an Arctic gale and at 10:00 no. seat at 10:00 director at 8 degrees. Last rose petal at 10:00 again.

So, you can see that it is hallucinating. It has uttered some of the lines correctly. The expressions are there, but many of the lines it is just speeding through; maybe it couldn't really read them, or there is some issue with its pipeline. Some of it, it was able to do properly. Or maybe it is just the length of the prompt. Anyway, there are a lot of areas for improvement, but I think it's a good start. Things are moving quite nicely in every direction in AI at the moment. That's it. Let me know what you think about this model. Please become a member of the channel and follow me on X. Thank you for all the support.