Open-Source Alternative to ElevenLabs! (Fully FREE)
10:58

Universe of AI | 07.01.2026 | 5,705 views | 174 likes | updated 18.02.2026
Video description
An open-source text-to-speech model is starting to compete with ElevenLabs. In this video, I break down Chatterbox, why local voice AI matters, and where it still falls short. For hands-on demos, tools, workflows, and dev-focused content, check out World of AI, our channel dedicated to building with these models: @intheworldofai

🔗 Relevant Links
Chatterbox TTS Repo: https://github.com/resemble-ai/chatterbox/tree/master

🔗 My Links:
📩 Sponsor a Video or Feature Your Product: intheuniverseofaiz@gmail.com
🔥 Become a Patron (Private Discord): /worldofai
🧠 Follow me on Twitter: https://x.com/UniverseofAIz
🌐 Website: https://www.worldzofai.com
🚨 Subscribe To The FREE AI Newsletter For Regular AI Updates: https://intheworldofai.com/

open source tts, open source voice ai, open source elevenlabs alternative, elevenlabs alternative, elevenlabs vs chatterbox, chatterbox tts, chatterbox turbo, resemble ai chatterbox, local tts, local text to speech, run tts locally, offline text to speech, real time tts, low latency tts, streaming tts, ai voice generator, voice cloning, zero shot voice cloning, tts for agents, ai agents voice, voice ai, text to speech, python tts, pytorch tts, self hosted tts, mit licensed tts, voice ai models, ai voice synthesis, amazon polly alternative, google cloud tts alternative

#OpenSourceAI #VoiceAI #TextToSpeech #ElevenLabs #LocalAI #AIModels #AIAgents

0:00 - Intro
0:31 - What is Chatterbox
4:51 - Cloning my Voice!
7:30 - Running it Locally!
9:20 - Outro

Contents (5 segments)

  1. 0:00 Intro 95 words
  2. 0:31 What is Chatterbox 710 words
  3. 4:51 Cloning my Voice! 467 words
  4. 7:30 Running it Locally! 399 words
  5. 9:20 Outro 296 words
0:00

Intro

Right now, AI can think, reason, plan, and write better than ever before. But there's one part of AI that still feels behind. It's voice. Most AI systems can generate incredible text, but the moment they speak, something feels off and robotic. There's a delay. There's friction, and it doesn't feel alive. And that's because almost all AI voices still depend on the cloud. So, what happens when AI can speak instantly, locally, and without an API call? That's what Chatterbox is all about. So, let's get into it. If you've ever paid just to
0:31

What is Chatterbox

prototype a voice agent, you're probably already feeling where this is going. Most high-quality TTS today is cloud-based. That means latency, usage limits, and pricing that adds up fast, especially if you're experimenting. Local text-to-speech flips that model. Instead of sending text to an API and waiting, the model runs directly on your machine. No cloud call, no rate limits, and no per-character pricing. The problem is that until recently, local TTS was either too slow or not good enough. Chatterbox is one of the first open-source projects that actually changes that trade-off. At a high level, text-to-speech is just turning text into audio. But modern TTS isn't just reading words. It's handling pacing, pauses, emphasis, and emotion. Most of the quality problem has already been solved by neural models. The harder problem is latency. If a voice takes too long to respond, your brain immediately notices. And for agents, that delay breaks immersion. That's the core issue Chatterbox is trying to solve. Chatterbox actually ships with a few different variants, and they're designed for different jobs. The most important one is Turbo. Turbo is English-only and is aggressively optimized for speed. This is the model you use for agents or interactive systems. There's also a multilingual version that supports 23 languages and includes zero-shot voice cloning. And then there's the original model, which focuses more on expressiveness. Across the models, you get things like short-reference voice cloning, watermarking for traceability, and a very permissive MIT license. But speed is the real headline here. This is where most open TTS projects fall apart. They sound decent, but they feel slow. With Chatterbox Turbo, once the model is loaded, text-to-audio can happen in well under a couple hundred milliseconds on a GPU. And that matters because it crosses a threshold. You paste text, you hit run, and the audio starts before you consciously register a delay.
That's the difference between a system that feels like software and one that feels conversational. Another thing Chatterbox does well is expressive control. You can include inline tokens for things like pauses, laughter, or emphasis. And you also get knobs like exaggeration and CFG weight to control how expressive or literal the output is. Those small details add up. Human speech isn't perfectly smooth. It hesitates. It reacts. When those elements are present, the voice stops sounding generated and starts sounding present. The multilingual model takes the same idea and applies it across languages. You can provide a short reference clip, pick a language, and generate speech that keeps the same voice identity. This is especially interesting for accessibility tools, global products, and games. It also drops cleanly into Python workflows, which is why people are already integrating it into agent pipelines and audio systems. Here's where this really matters for AI. When voice is local and fast, it stops being a bottleneck. You can experiment freely without worrying about API costs. You can build agents that speak instantly. You can prototype interactions that would be too expensive to test otherwise. In blind tests, Chatterbox has been competitive with commercial systems like ElevenLabs while generating content faster. And this isn't a brand-new research project. Chatterbox comes from years of production use inside Resemble AI. That history shows in how stable it feels. So, why don't we take a look at some of the examples that Chatterbox has and demo it ourselves. Okay, so this is the Chatterbox Turbo demo, and they've actually put this on Hugging Face, and I'll put the link for this in the description as well so you can test it out.
So you can see that you can put your text here, and the max characters it takes at the moment is 300 for the demo. You put in your prompt over here, and then you can add these things, which are clear throat, sigh, shush, cough, groan, sniff, gasp, chuckle, and laugh, which are pretty common things you might expect in natural speech. So what it does is it takes a reference audio file. So we have this reference audio file over here. Let's just take a quick listen to what it sounds like.
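The prompt format described above (plain text with inline event markers, capped at 300 characters in the hosted demo) can be sketched as a small helper. This is only an illustration: the event names are the ones read out in the video, the square-bracket spelling is an assumption, and `tag` and `build_prompt` are hypothetical names, not part of Chatterbox.

```python
# Event names as listed in the demo; the bracketed spelling is an assumption.
DEMO_EVENTS = ("clear throat", "sigh", "shush", "cough", "groan",
               "sniff", "gasp", "chuckle", "laugh")
DEMO_MAX_CHARS = 300  # character limit mentioned for the hosted demo


def tag(event: str) -> str:
    """Wrap a known event name in square brackets, e.g. 'sigh' -> '[sigh]'."""
    if event not in DEMO_EVENTS:
        raise ValueError(f"unknown event: {event!r}")
    return f"[{event}]"


def build_prompt(*parts: str) -> str:
    """Join text fragments and tags with single spaces and enforce the limit."""
    prompt = " ".join(p.strip() for p in parts if p.strip())
    if len(prompt) > DEMO_MAX_CHARS:
        raise ValueError(
            f"prompt is {len(prompt)} chars, demo limit is {DEMO_MAX_CHARS}"
        )
    return prompt


prompt = build_prompt(
    "Hey everyone, welcome back to the channel.",
    tag("clear throat"),
    "Today we're diving into what's new in AI, why it matters, "
    "and what could be coming next.",
    tag("sigh"),
    "If you enjoy clear, no-nonsense updates,",
    tag("chuckle"),
    "you're in the right place. Let's get into it.",
)
print(prompt)
```

Checking the length before submitting avoids finding out about the limit only after the demo rejects or truncates the text.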
4:51

Cloning my Voice!

- Got it. Thank you for sharing that. So, to give you more info about lower monthly payment options, could you tell me how much you owe on all your credit cards?
- And now you have that as a reference audio file, and it's supposed to clone that voice and apply it to this text prompt. So, let's generate the audio. You can see it's processing over here. Now, let's take a listen.
- Oh, that's hilarious. Um, anyway, we do have a new model in store. It's the Skynet T800 series and it's got basically everything including AI integration with ChatGPT and all that jazz. Would you like me to get some prices for you?
- Oh, that's hilar.
- Okay, not going to lie, that laugh sounded a little bit creepy and not natural, but let's try it out with our own custom prompt. So, I'm going to put it in here, and it's going to be my YouTube prompt. Maybe like, "Hey everyone, welcome back to the channel." Why don't we insert a clear throat here just to see what it sounds like? "Today we're diving into what's new in AI, why it matters, and what could be coming next." Let's insert a, what should we do? Maybe a sigh. I'm just doing some random stuff. "If you enjoy clear, no-nonsense updates," I'll put a laugh here. Actually, no. Let me do a chuckle. Wait, what did I do? Did I do a chuckle and a laugh? Let me take away the chuckle from here and put it in over this. "You're in the right place. Let's get into it." All right, let's generate.
- Hey everyone, welcome back to the channel. [clears throat] Today we're diving into what's new in AI, why it matters, and what could be coming next. If you enjoy clear, no-nonsense updates, you're in the right place.
- Okay. Not bad at all. All right. I've uploaded my test file. You can listen to it right now.
- Hello, Universe of AI. This is my reference file for me to test out Chatterbox AI. I hope this works well. Let's see how it sounds. Wish me good luck. Thank you.
- Now, let's generate this.
- Today, we're diving into what's new in AI, why it matters, and what could be coming next. Uh, if you enjoy clear, no-nonsense updates, huh, you're in the right place.
- Okay, it does not sound 100% like me, but it's almost there. At least it recognized that I was a male, and it changed the other things like that. If you wanted to create something simple like that, you can use this demo to do that. This is not bad. It's not perfect, but it's pretty good. Another cool thing about
7:30

Running it Locally!

this is that you don't actually have to use the Hugging Face website to run this demo. You can actually run this locally, generate any type of file you want, and use these models, and we have this example from here. All you have to do is clone the GitHub repository for Chatterbox, and you can run this example file over here. One thing you will have to note is that YOUR_FILE.wav is an example reference file. So I put in one reference file here, and it's going to use that voice to speak the text that we see here. Right now it has this example demo that I already had, which was like, "Today is the day. I want to move like a titan at dawn," blah blah, and you can see it adds in a bunch of text here, and if you look at the periods, they're for spacing, but we're just going to make it much simpler and do something more YouTube-like. So I've given it this text: "Hello everyone. Welcome to Chatterbox TTS on Mac. I am from Universe of AI and this is an example of text-to-speech synthesis." So, it's going to run this code, and you can run this by yourself, and it'll save a file here and name it test-2.wav. I'm just going to call it test-3 here because I already have a test-2.wav file. So, now let's press run. All right, looks like our test-3 file is ready, and we can take a listen to it right now.
- Hello everyone, welcome to Chatterbox TTS on Mac. I am from the Universe of AI and this is an example of text-to-speech synthesis. Say well.
- It's kind of funny. I turned my guy into a little bit Italian, a mad Italian, I guess, and it's obviously using a pre-trained voice to do that. Also, this is kind of cool, because you can run this on your end. And you can also do other demos over here. You can run the Turbo model. You need a token from Hugging Face, so make sure you have that set up. But this is really cool. And I'll put this in the description so you can use this to test it out for yourself as well. Obviously, this isn't perfect
9:20

Outro

and it's important to be upfront about that. Some outputs can sound overacted, especially with longer text, and you can get trailing artifacts like breathing or silence that need trimming or a little more adjusting. CPU performance is slow, so realistically, you want GPU acceleration if you're going to run this. And voice cloning always comes with ethical risk, even with watermarking in place. This is powerful technology, but it needs constraints if you're going to use it in a real product. But when you zoom out, the most important part isn't just one model beating another. It's momentum. Chatterbox has picked up a lot of attention quickly. People are benchmarking it seriously, and it fits into a broader trend where more parts of the AI stack are going open source. This feels similar to where open-source LLMs were not that long ago. Once voice becomes cheap, fast, and local, experimentation accelerates. If you're building anything with voice, Chatterbox is worth testing. It's not perfect, but for a free, local, real-time TTS system, it's genuinely impressive. More importantly, it signals that voice is finally catching up to the rest of AI. And that's when things get interesting. If you enjoyed this video, this is what we do here. Fast, clear updates on the biggest moves in AI. If you want to stay ahead of everything happening in this space, make sure you're subscribed. And if you want the hands-on side, demos, tools, workflows, and everything developers can actually build, well, check out World of AI. We also run a simple, no-noise newsletter that gives you the most important AI tools and updates in just a couple of minutes. Subscribe here. Follow World of AI. Join the newsletter.
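The local run walked through in the "Running it Locally!" segment, together with the GPU caveat above, can be sketched roughly as follows. The calls `ChatterboxTTS.from_pretrained`, `model.generate(..., audio_prompt_path=...)`, `model.sr`, and `torchaudio.save` follow the usage example in the Chatterbox repository as best recalled here, so treat them as assumptions to check against the current README. `pick_device`, `next_free_path`, and `synthesize` are hypothetical helper names, and any extra device handling the repository's own Mac example may do is not shown.

```python
from pathlib import Path
from typing import Optional


def pick_device() -> str:
    """Prefer CUDA, then Apple's MPS backend, then CPU (slow, per the video)."""
    import torch  # lazy import so the path helper works without PyTorch

    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def next_free_path(folder: Path, stem: str = "test", ext: str = ".wav") -> Path:
    """Return folder/<stem>-N<ext> for the smallest N >= 1 not already on disk."""
    n = 1
    while (folder / f"{stem}-{n}{ext}").exists():
        n += 1
    return folder / f"{stem}-{n}{ext}"


def synthesize(text: str, reference_wav: Optional[str] = None,
               out_dir: Path = Path(".")) -> Path:
    """Generate speech for `text`; clone the voice in `reference_wav` if given."""
    # Assumed API, based on the repository's usage example.
    import torchaudio as ta
    from chatterbox.tts import ChatterboxTTS

    model = ChatterboxTTS.from_pretrained(device=pick_device())
    if reference_wav is None:
        wav = model.generate(text)
    else:
        wav = model.generate(text, audio_prompt_path=reference_wav)

    out_path = next_free_path(out_dir)  # avoids overwriting an earlier take
    ta.save(str(out_path), wav, model.sr)
    return out_path
```

A call such as `synthesize("Hello everyone. Welcome to Chatterbox TTS on Mac.", reference_wav="YOUR_FILE.wav")` would then mirror the run in the video, writing to the next unused `test-N.wav` name instead of renaming the output by hand.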

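The trailing silence mentioned in the outro can be trimmed with a simple amplitude threshold. The sketch below is a minimal, library-free version that assumes the audio is already a plain sequence of float samples in the range -1 to 1; converting the model's output into that form is left out, and `trim_trailing_silence`, `threshold`, and `keep` are hypothetical names chosen for illustration. A threshold like this only handles near-silent tails; breathing sounds are louder and would need a different approach.

```python
def trim_trailing_silence(samples, threshold=0.01, keep=0):
    """Drop trailing samples whose absolute value is below `threshold`.

    `keep` samples of the quiet tail are retained so the cut is less abrupt.
    """
    end = len(samples)
    # Walk backwards until a sample at or above the threshold is found.
    while end > 0 and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[:min(len(samples), end + keep)]


audio = [0.0, 0.5, -0.3, 0.0, 0.001, 0.0]
print(trim_trailing_silence(audio))          # tail removed
print(trim_trailing_silence(audio, keep=1))  # one quiet sample kept
```

Leading silence is left untouched here, since the video only mentions artifacts at the end of the clip.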