# Open Source AI Just *Exploded* (Audio, Video & 3D)

## Metadata

- **Channel:** MattVidPro
- **YouTube:** https://www.youtube.com/watch?v=vsYkYGbv3QM
- **Date:** 23.01.2026
- **Duration:** 17:58
- **Views:** 13,449

## Description

In this video, I bring you an exciting roundup of the latest developments in the open-source AI space as of early 2026. We cover the progress of Runway ML in AI video generation, including its artistic capabilities and upcoming audio support. I take on a quiz to distinguish between real and AI-generated videos and share insights on tools like Nano Banana Pro for enhanced workflows.

Huge thanks to Box for Sponsoring today's video! Check out Box Extract: https://www.box.com/extract?utm_source=youtube&utm_medium=paidinfluencer&utm_theme=icm&utm_campaign=FY26MattVidPro_BoxExtract

▼ Link(s) From Today’s Video:
Gen 4.5: https://x.com/iamneubert/status/2014090746530333084?s=46
Side by side: https://x.com/runwayml/status/2014339182009758173
ProperPrompter use case: https://x.com/ProperPrompter/status/2014103790434263493
ViduQ2 ComfyUI: https://x.com/ComfyUI/status/2014359977671176315
LTX-2 Comparison: https://x.com/AngryTomtweets/status/2013293340385767463
Audio to Video LTX-2: https://x.com/elevenlabsio/status/2013651232267604028
Video Arena: https://x.com/arena/status/2014035528979747135
Agent-like discourse, Google Research: https://x.com/ns123abc/status/2014351614480429300
Chroma 1.0: https://x.com/ModelScope2022/status/2014006971855466640
https://modelscope.cn/models/FlashLabs/Chroma-4B
PersonaPlex: https://x.com/DataChaz/status/2013892316105417082
https://huggingface.co/nvidia/personaplex-7b-v1
https://research.nvidia.com/labs/adlr/personaplex/
VibeVoice: https://x.com/LiorOnAI/status/2013220214217879931 https://github.com/microsoft/VibeVoice https://huggingface.co/collections/microsoft/vibevoice
Qwen3-TTS: https://x.com/Alibaba_Qwen/status/2014326211913343303
Qwen 3 TTS Demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS
3D Nano Banana: https://x.com/DeemosTech/status/2014754093919830526
Ernie 5.0: https://x.com/Baidu_Inc/status/2014252300018254054
► MattVidPro Discord: https://discord.gg/mattvidpro

► Follow Me on Twitter: https://twitter.com/MattVidPro

► Buy me a Coffee! https://buymeacoffee.com/mattvidpro
-------------------------------------------------

▼ Extra Links of Interest:

General AI Playlist: https://www.youtube.com/playlist?list=PLrfI66qWYbW3acrBQ4qltDBsjxaoGSl3I

AI I use to edit videos: https://www.descript.com/?lmref=nA4fDg

Instagram: instagram.com/mattvidpro

Tiktok: tiktok.com/@mattvidpro
Gaming & Extras Channel: https://www.youtube.com/@MattVidProGaming

Let's work together!
- For brand & sponsorship inquiries: https://tally.so/r/3xdz4E
- For all other business inquiries: mattvidpro@smoothmedia.co

Thanks for watching Matt Video Productions! I make all sorts of videos here on YouTube: technology, tutorials, and reviews! Enjoy your stay here, and subscribe!

All Suggestions, Thoughts And Comments Are Greatly Appreciated… Because I Actually Read Them.

00:00 Introduction and Overview
00:26 Runway ML: Advancements in AI Video
01:50 Runway ML's AI Video Quiz Challenge
04:02 Vidu Q2 and LTX-2: New AI Video Models
06:12 Box Extract: Intelligent Content Management
07:50 LM Arena: Comparing AI Video Models
09:04 Advanced Reasoning Models by Google
11:08 Open Source AI Speech Models
16:08 Ernie 5.0: Baidu's Multimodal Model
17:35 Conclusion and Community Engagement

## Contents

### [0:00](https://www.youtube.com/watch?v=vsYkYGbv3QM) Introduction and Overview

What's going on, everybody? Welcome back to another video here on the MattVidPro AI YouTube channel. I've got a news roundup for you all today, and it's a big one. The open-source AI space has been blooming lately; early 2026 has been characterized by it so far, and I love it. Music models, speech models, and that's not to mention that our closed-source conglomerates have still been dropping things. Without further ado, let's dive

### [0:26](https://www.youtube.com/watch?v=vsYkYGbv3QM&t=26s) Runway ML: Advancements in AI Video

right in. Runway ML has been a competitor in the video space for a long time now. They actually had a bit of a head start in higher-fidelity AI video initially, and here they are, still pushing forward. The one thing about this model is that it has no audio generation, whereas a lot of other competitors do; they're more focused on the artistic, more cinematic workflows that are possible. Often a workflow like this will have audio done entirely separately on the side, at least right now. So it's all about realistic movement, camera-following capabilities, prompt adherence, and maintaining detail over crazy scenes with a lot going on. This feeling of realism, like it's playing out a simulation that could have actually happened, gives the model another level of depth, I suppose, and a feeling of grit and reality that is typically a little more mellowed out and softened in the Veo 3.1s and Soras of the video space. If this style is your thing, it's available now through Runway ML's paid plan. This model does really, really well with image references; from what I've seen thus far, Nano Banana Pro is a perfect pairing for it, especially if you want consistency. Runway ML is also running a test: they have this quiz you can take

### [1:50](https://www.youtube.com/watch?v=vsYkYGbv3QM&t=110s) Runway ML's AI Video Quiz Challenge

to tell the difference between a real video and an AI-generated video made with Gen 4.5. Okay, I am up for the challenge, folks. All right, I'm going to say right off the bat, the right one is AI. What? I already got this wrong. Are you serious? Okay, hang on. That's much slower. That's got to be AI. But you know what? The footage might all be slowed down. Okay, the right one was AI that time. Oh, this one. Okay, I've got to say the left is AI. Yes, called it right. Okay, this one's easy. Left is AI. Easy. You can tell just by the smoothing on all the dirt. Okay, the ring's changing, I think, on the right one. Yes. Okay, I'm good at this, dude. Okay, this is the real one. Yeah, I could tell just by the way the cats' heads were moving. Okay, because of... oh, I don't know. The hair is very realistic on this one. I'll say right is AI. Oh, that's real. Okay, I was wrong. Okay, this is pretty cool. Yeah, try this out for yourselves to see if you can tell the difference. It's a little more difficult than you'd think; they're using some good examples. Anyways, yeah, Gen 4.5 is pretty awesome. I'm going to leave you guys with this workflow from ProperPrompter: use Nano Banana Pro to make a 3x3 grid scene and upload it with instructions into Gen 4.5, and it will actually listen pretty closely to what you're asking for, shot by shot, as laid out in that 3x3 grid. Pretty impressive. ProperPrompter, someone I consider to have great taste in the community, considers this model pretty incredible. What's really nice, confirmed by Runway's CEO themselves, is that this model is going to be getting audio support in the future, which will put it right up there with the other leading cutting-edge closed-source AI video generators.
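Since this workflow hinges on one mechanical step, packing nine shots into a single image, here's a minimal sketch of that grid-assembly step in Python. It assumes nine pre-generated frames saved as `shot_0.png` through `shot_8.png`; the file names and tile size are placeholders of mine, not anything from the video.

```python
# Minimal sketch: stitch nine frames into a 3x3 storyboard grid for
# upload into an image-conditioned video model. File names and the
# tile size are placeholders, not part of any official workflow.
from PIL import Image

TILE = 512  # assumed per-tile resolution
frames = [
    Image.open(f"shot_{i}.png").convert("RGB").resize((TILE, TILE))
    for i in range(9)
]

grid = Image.new("RGB", (TILE * 3, TILE * 3))
for i, frame in enumerate(frames):
    row, col = divmod(i, 3)
    grid.paste(frame, (col * TILE, row * TILE))

# Upload this single image alongside shot-by-shot text instructions.
grid.save("storyboard_3x3.png")
```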

### [4:02](https://www.youtube.com/watch?v=vsYkYGbv3QM&t=242s) Vidu Q2 and LTX-2: New AI Video Models

All right, we're going to stick with the theme of AI video for now. Vidu Q2 is available in ComfyUI, with support for up to seven reference subjects right out of the gate in a single workflow. The examples they show off here definitely integrate all of the different assets very well into a single video, and as you can see, it works especially well when all of the characters match the same style or aesthetic. In terms of coherence and resistance to morphing, though, it's near state-of-the-art, but I think a few other open-source models are just a little bit better. And this one's actually API-only, not open. The open-source LTX-2 audio and video AI boom continues. People are generating 20-second clips at 4K resolution. Yeah, you'll need a beefy GPU for that, but even consumer GPUs are generating reasonable-quality video at 10 to 15 seconds. LTX-2 also got another upgrade: audio-to-video, which means you can make an audio clip (record yourself speaking, add in some sound effects, whatever it might be) and extrapolate video from that audio. So the lip syncing will be accurate, sound effects will hit at exactly the right time, and your exact uploaded audio will ship with the video. This is pretty cool and caught my eye: ElevenLabs partnered with LTX Studio to bring the audio-to-video model to ElevenLabs. This lets you create all of the audio inside of ElevenLabs, whether it be music, voice, or sound effects, and then directly generate a video from it through ElevenLabs. Definitely pretty cool, but with the model being open source, I'm honestly more excited to see how creatives are going to leverage consistent audio-to-video inside of their workflows to up the game as a whole for AI video creation.
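To make the audio-to-video idea concrete, here's a rough sketch of what an audio-conditioned request could look like. The request structure and function below are hypothetical stand-ins of mine, not the actual LTX-2 API; check the official LTX repo for the real entry point.

```python
# Hypothetical shape of an audio-to-video call (NOT the real LTX-2 API;
# see the official LTX repo for actual usage). The key idea: generation
# is conditioned on your exact audio track, so lip sync and sound
# effects land exactly where they occur in the source recording.
from dataclasses import dataclass

@dataclass
class AudioToVideoRequest:  # illustrative request structure
    prompt: str             # visual description of the scene
    audio_path: str         # recorded speech/SFX that drives the timing
    num_frames: int = 241   # ~10 s at 24 fps
    height: int = 704
    width: int = 1216

def generate_video(req: AudioToVideoRequest) -> str:
    """Stand-in for the real LTX-2 entry point; returns an output path."""
    raise NotImplementedError("wire this up to the actual LTX-2 pipeline")

req = AudioToVideoRequest(
    prompt="a person talking to camera in a sunlit kitchen",
    audio_path="narration_with_sfx.wav",
)
```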

### [6:12](https://www.youtube.com/watch?v=vsYkYGbv3QM&t=372s) Box Extract: Intelligent Content Management

[Singing and music: "Sunbeam... colors start to gleam..."] — Before we check out our next piece of AI news, I've got a quick word from today's sponsor. We talk about AI generating content all the time, but what about that massive gold mine of data already sitting in your files? You know: contracts, product specifications, policy documents, and charts. It's all unstructured data that usually takes manual human hours to unlock. That is, until now. This is Box Extract, an agentic AI solution designed to securely pull valuable data out of your content at scale. And for the AI nerds watching: no, this isn't basic OCR. Box Extract is powered by leading models from our pals at Google, Anthropic, and OpenAI. It uses LLMs to make sure the documents are actually understood, whether it's a complex contract or a handwritten form. Think multimodality. Data gets extracted into structured fields you can actually use to automate workflows. This is what they call intelligent content management: instead of just storing files, Box is now an active engine. Imagine feeding it a thousand contracts and having it automatically extract the totals, dates, and vendors without you having to lift a finger. That is the power of AI agents working on your enterprise data. If you want to see your content transformed into intelligent, usable data, Box Extract is available right now. Go ahead and check the link down in the description below to learn more and see it in action. Huge thanks to Box for sponsoring today's video.
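As a generic illustration of that schema-driven extraction idea, here's a hand-rolled sketch using a plain LLM call. This is not Box Extract's actual API; the model name, prompt, and field names are placeholders of mine.

```python
# Generic illustration of schema-driven extraction with an LLM
# (NOT Box Extract's API; model and field names are placeholders).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SCHEMA_PROMPT = """Extract the following fields from the contract below
and reply with JSON only: total_amount, effective_date, vendor_name."""

def extract_fields(contract_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user",
                   "content": f"{SCHEMA_PROMPT}\n\n{contract_text}"}],
        response_format={"type": "json_object"},  # force structured output
    )
    return json.loads(response.choices[0].message.content)

# Looping this over a thousand contracts is the "no manual hours" pitch.
print(extract_fields("Agreement between Acme Corp and ... Total: $12,000"))
```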

### [7:50](https://www.youtube.com/watch?v=vsYkYGbv3QM&t=470s) LM Arena: Comparing AI Video Models

Now, back to your regularly scheduled content. Welcome back, folks. LM Arena has launched Video Arena live on the web. This lets you compare all of the leading-edge AI video models directly against each other in a blind scenario, meaning you don't know which model made which video until you pick one, similar to the Runway quiz we did earlier. But what's cool about this is that you can use your own prompts directly in LM Arena, which means you can potentially get some real use out of this thing and figure out which model works best for your use case. But yeah, here's a cool comparison between Kling 2.6 Pro and Sora 2. It's just going to eat the picture immediately; it's all supposed to be made out of Jell-O inside the house. It's not an easy prompt. — Turns out everything's sweeter on the inside. — It could be two entirely different ways a model attacks a prompt: much more cinematic, like Seedance here, or realistic, like Veo 3.1.

### [9:04](https://www.youtube.com/watch?v=vsYkYGbv3QM&t=544s) Advanced Reasoning Models by Google

All right, let's switch gears and talk about large language models and the things these AI companies are developing that we can't directly see. I released a video recently titled "They have better AI than they're shipping." That referenced Google, and they're back at it again: researchers are finding that advanced reasoning models achieve superior intelligence by spontaneously simulating internal multi-agent-like interactions, rather than merely relying on longer computation or increased scale. These are already reasoning models; they're achieving superior intelligence by simulating internal reasoning. But if I'm understanding this correctly, simulating multiple agents conversing and working together is achieving superior intelligence results right now. Apparently, these models develop an internal social structure where diverse simulated personas debate and reconcile ideas to solve complex problems. Very interesting idea. This reminds me of something I frequently do when I'm trying to accomplish something with AI technology: if I'm trying to refine a plan to build a project with code or something like that, I'll send it to multiple AIs for an opinionated, excruciatingly verbose analysis, then have all of those cross-compared to produce a more refined V2 plan or structure. Instead, it looks like what they're describing is simulating all of that internally within one model, instead of doing it separately like I do with four or five actual different models. Regardless, the most important takeaway for me has got to be what's going on internally with this model: when you're forcing it to have this social structure where diverse simulated personas debate against each other, you're basically forcing the model to sus out its own weights in order to complete the problem. I don't know. Let me know what you think about that in the comments below.
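The manual cross-model workflow described above is easy to sketch in code. `ask_model` below is a hypothetical stand-in for whatever client each provider actually requires, and the model names are placeholders.

```python
# Sketch of a manual multi-model "debate": collect verbose critiques of
# a plan from several models, then have one model synthesize a refined
# V2. ask_model() is a hypothetical stand-in for real provider clients.

def ask_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up each provider's API client here")

REVIEWERS = ["model-a", "model-b", "model-c"]  # placeholder model names

def refine_plan(plan: str) -> str:
    # Fan out: one opinionated, verbose critique per reviewer model.
    critiques = [
        ask_model(m, "Give an opinionated, exhaustively verbose critique "
                     f"of this project plan:\n\n{plan}")
        for m in REVIEWERS
    ]
    # Fan in: reconcile all critiques into a single refined plan.
    merged = "\n\n---\n\n".join(critiques)
    return ask_model(
        REVIEWERS[0],
        f"Original plan:\n{plan}\n\nCritiques from several reviewers:\n"
        f"{merged}\n\nReconcile the critiques and produce a refined V2 plan.",
    )
```

The research result, if it holds up, amounts to the model running this fan-out/fan-in loop internally as simulated personas instead of across separate models.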

### [11:08](https://www.youtube.com/watch?v=vsYkYGbv3QM&t=668s) Open Source AI Speech Models

All right, prepare yourselves: it's about to get difficult to keep track of all of these open-source drops. First up, Chroma 1.0 is here. This is from FlashLabs.ai: the world's first open-source, end-to-end, real-time speech-to-speech dialogue model with personalized voice cloning. This is not the only voice cloning model we're going to look at today, and it's not the only real-time speech-to-speech model either. Regardless, we've got claims of strong reasoning at only around a 4-billion-parameter size, with fully open weights and code. They've got their own API to let you deploy autonomous voice agents, but they are not the only ones interested. Nvidia just released PersonaPlex-7B. This is also open-source, but this is a full-duplex conversational model. So right out of the box, this thing is built to go back and forth like a person; it is totally separate from the traditional pipelines that we're used to. This model feels raw, and feels very human in an eerie way. It's small, so it sounds robotic, but you can tell the bones are very good. — Do you want to hear a funny joke? — Yes. — Okay. I haven't even said it yet, but — Yeah. Go ahead. — Yeah. It's going to be really funny when I actually say the joke. — Okay. — So, why did the picture go to prison? — To get away from the mirror. — No, because it was framed. — Oh. — That's a good one. Yeah. — Yeah. — Do you have any more? — I've got one more for you. — Okay. Go ahead. — Okay. What do you call a fake noodle? — A fake noodle? I don't know. — Nice. I like that one. Yeah. — Yeah, that's a good one. Yeah. — Very natural sounding; it sounds like a person on the phone. PersonaPlex is on GitHub right now under the MIT license. They have a paper, even a demo, and the weights are right here on Hugging Face. You can see this one's already got a lot of downloads: 17,000. Hopefully some interesting fine-tunes will come out of this; we'll see what people get up to with this model. What's crazy is that we have yet another speech agent, or real-time speech model, called VibeVoice, and this one is also open source, from Microsoft. Low latency, sub-300 milliseconds, and this thing does long multi-speaker speech: up to 90 minutes of audio, which is pretty insane (I don't think I've seen any other audio model go that long), with up to four distinct speakers. Audio compresses into semantic and acoustic tokens. You can see on Hugging Face we've got a real-time half-billion-parameter model as well as ASR. But yeah, it's all available for download right on Hugging Face. And you thought it stopped there, but guess what? It doesn't. Qwen3-TTS is here. Also open source, this is actually a much larger release: a total of five different models. We've got freeform voice design and cloning, support for 10 different languages, a state-of-the-art 12 Hz tokenizer for high compression, and full fine-tuning support right out of the box, all with state-of-the-art performance. No joke: open weights, open code. This is pretty incredible. I might actually do a deeper dive on some of the highlights from today's open-source bunch. I just talked about an open-source AI music generator yesterday, but this one, oh, I really might have to try getting this one up and running locally. You guys want to try some quick voice cloning? "Hello everybody, my name is MattVidPro AI. Welcome to the MattVidPro YouTube channel. In today's video, I'm going to be showing you guys how to peel a lemon perfectly." About 20 seconds of audio. Okay, let's see how this goes. Oh, wow. That was fast, too. "Hey, man. What are you doing inside my walls? I was just trying to hang a picture frame when suddenly I see you and my... There's your living room inside of my walls." Okay, it's not too bad. We're not really spending a lot of time trying to perfect this, just messing around with it out of the box. That's really impressive for voice cloning.
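If you want to poke at the same demo programmatically, the Hugging Face Space linked in the description can be driven with `gradio_client`. The endpoint name and argument order below are assumptions of mine, so inspect `client.view_api()` for the Space's real signature first.

```python
# Sketch of driving the Qwen3-TTS demo Space with gradio_client.
# The api_name and argument order are ASSUMPTIONS; run client.view_api()
# to see the Space's actual endpoints before relying on this.
from gradio_client import Client, handle_file

client = Client("Qwen/Qwen3-TTS")  # Space linked in the description
client.view_api()                  # prints the real endpoint signatures

result = client.predict(
    "In today's video, I'm going to show you how to peel a lemon.",
    handle_file("my_20s_reference.wav"),  # ~20 s of reference audio to clone
    api_name="/generate",                 # hypothetical endpoint name
)
print(result)  # typically a path to the synthesized audio file
```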
Deemos just launched an AI-powered 3D model editor, and it looks pretty insane. They're calling it the Nano Banana of 3D models: simply upload your model and then say something like "add glasses," and it adds glasses right onto your 3D character. Then they try something a little crazier: they make him lift up his hand and throw up a peace sign, and it totally works. They upload a 3D model of an off-roading vehicle, turn the front into a sports car, and bam, it's actually able to do it: a more German, Porsche-like front-end design versus the previous Jeep-looking one. You can basically upload any 3D model and modify it from there. Pretty insane. An API is also apparently coming soon for this. And finally, before we head out, Ernie

### [16:08](https://www.youtube.com/watch?v=vsYkYGbv3QM&t=968s) Ernie 5.0: Baidu's Multimodal Model

5.0 is here from Baidu. This is a native omni-multimodal model; Ernie has always pushed in this direction, and it's an end-to-end architecture that enables unified multimodal understanding and generation. It's a whopping 2.4 trillion parameters in size with a mixture-of-experts architecture, with under about 3% of parameters active per inference (3% of 2.4 trillion works out to roughly 70 billion active parameters). This model is an attempt to balance strong reasoning and generation with efficient inference. It is not open source; it's on the official ERNIE Bot website and Baidu AI Cloud. All the replies here are complaining that the benchmarks are hard to read, so someone from my Discord server used Ernie to straight up fix this. So, here are your Ernie benchmarks. You can see all of the compared models are very much cutting edge: we've got Ernie, obviously, GPT-5 High, Gemini 3 Pro, Gemini 2.5 Pro, and then DeepSeek 3.2. DeepSeek 3.2 does very well on a lot of these benchmarks, but it seems that Ernie does its best in knowledge, math, coding, and, whether you like it or not, safety. For things like long context, agentic workflows, reasoning, and instruction following, it is very, very much serviceable. Like I said, all of these top-of-the-line LLMs have gotten very, very capable, but I think this one being multimodal is going to make for a fantastic everyday agent.
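To make that under-3%-active point concrete, here's a generic sketch of top-k expert routing, the mixture-of-experts mechanism that keeps most weights idle on any given token. The sizes are illustrative, not Ernie 5.0's actual configuration.

```python
# Generic top-k mixture-of-experts routing sketch (illustrative sizes,
# NOT Ernie 5.0's actual configuration). Only k experts run per token,
# so active parameters are a small fraction of the total.
import numpy as np

NUM_EXPERTS, TOP_K, D = 64, 2, 1024
rng = np.random.default_rng(0)

router_w = rng.standard_normal((D, NUM_EXPERTS))           # router weights
experts = rng.standard_normal((NUM_EXPERTS, D, D)) * 0.02  # expert FFNs

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w
    top = np.argsort(logits)[-TOP_K:]                        # pick k experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalize
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

out = moe_layer(rng.standard_normal(D))
print(f"active expert fraction: {TOP_K / NUM_EXPERTS:.1%}")  # 3.1% here
```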

### [17:35](https://www.youtube.com/watch?v=vsYkYGbv3QM&t=1055s) Conclusion and Community Engagement

Okay guys, that was a lot to get through, but thanks for sticking with me. You know, things are moving at lightning speed in the AI community right now, especially with all of these different open-source releases; there are even still a few that I missed or skipped over. Join my Discord server, linked down in the description below, if you want to stay the absolute most up-to-date. Thanks so much for watching. I'll see you in the next one. Goodbye!

---
*Source: https://ekstraktznaniy.ru/video/11369*