# KittenTTS - The Nano TTS

## Metadata

- **Channel:** Sam Witteveen
- **YouTube:** https://www.youtube.com/watch?v=YpQWdrfzSzQ

## Contents

### [0:00](https://www.youtube.com/watch?v=YpQWdrfzSzQ) Segment 1 (00:00 - 05:00)

Okay, so there's a new TTS out. And in this video, I'm going to be looking at Kitten ML's Kitten TTS. Great name. Love the name, for both the team or the organization and the actual TTS model. The cool thing about these models is just their actual sizes. So, if we come over and look at their GitHub, we can see that they've got TTS models which are under 25 megabytes. And if we look at the sizes of the models, we can see that they've got three different models plus a compressed version of the smallest one. So the actual models are the mini model, which is 80 million parameters, about 80 MB on disk; the micro model, which is 40 million parameters; and then the nano models, which are 15 million parameters. And they have a full-size version of this but also an 8-bit quantized version. So the 8-bit quantized version is the one that's under 25 MB. Now clearly these are not going to be the best voices that we've ever heard. Recently I did a video about Quent TTS, and those models are absolutely fantastic at producing really high-quality voice. The issue there is that the model I was using for voice cloning is 1.7 billion parameters. Now, that's still a small model in that, yes, we can run it with a GPU, we can even run it with MLX and things like that, but that kind of size really rules out being able to run it in the browser, for example. So, the thing that I find fascinating here is that these models are small enough that we can probably run them in a browser, or certainly on a mobile phone without taking up too much space. You can pretty much run them on any edge device that you want to put them on. The other cool thing that makes these really interesting is that they are CPU optimized. So, you don't need a GPU to actually run this.
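As a quick back-of-envelope on why those parameter counts translate into such small downloads: on-disk size is roughly parameter count times bytes per parameter. This is only a sketch; the real files include graph metadata and the repo doesn't spell out exactly how each checkpoint is stored.

```python
# Back-of-envelope model size: parameters * bytes per parameter.
# Real ONNX files add graph metadata, so treat these as lower bounds.
def approx_size_mb(n_params: int, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1_000_000

# The 15M-parameter nano model at a few common storage precisions:
for precision, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"15M params @ {precision}: ~{approx_size_mb(15_000_000, nbytes):.0f} MB")
```

At one byte per parameter, 15 million parameters is only about 15 MB, which is how an 8-bit quantized nano model can come in under the 25 MB mark mentioned above.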
And just as they say in their GitHub, this is really designed for lightweight deployment and high-quality voice synthesis. And to top all of that off, this is under a full Apache 2.0 license, so you can pretty much do most of what you want with it. Now, currently they say it's in developer preview. My guess is that they're going to release a fuller version. Looking at their Hugging Face, they released some very early versions of this last August and September. Those versions were the 0.1 version and 0.2, and then we can see in the last few days they've released a 0.8 version. So, my guess is that this has probably been trained a lot more; they've run a lot more experiments and so on. I do notice that the team for this is literally one person. So I don't know if this is a one-person operation or if Kitten ML is more than that. But I've got to say this is a fun little project of seeing how small you can actually get a TTS system to be. And I kind of feel that we know really well now that these systems only get better. So, as people work out different ways of structuring these models and keeping them small, my guess is that in the next few months we're going to get better and better quality of voices at some of these smaller sizes. Okay, so I've put together a Google Colab just so that you can try it out yourself. I'm not using any GPU in here; that's really important to know. And in the repo, they've put this together as a pip package, where you can load up the model and then select from a bunch of different pre-made voices. The voices seem very similar in style, in how they were created, to something like Kokoro, which, even though it's small, was still quite a lot bigger than the 15 million parameters here. So jumping in, we can just install the package and load up all three models at once. In fact, I load up all four.
I load up both the quantized version and the non-quantized version of the nano model here. And then I've just got some helper functions for going through this. So let's take a listen to the different sizes of the voices and you can get a sense of how they sound. — Hey, I was just thinking about grabbing coffee later. Want to come with me? — Okay, now that one's the full-size 80 million parameters. And in many ways, this one is not too impressive, right? The voice is not great and the size is not super small, but you'll see as we get smaller that we're not losing that much voice quality. — Want to come with me? — Okay, so that last one was the 15 million parameter model. Now, there was definitely some degradation there, and I'm not sure how well that's going to show up on YouTube, so you should certainly play around with the Colab yourself and try it out. But let's

### [5:00](https://www.youtube.com/watch?v=YpQWdrfzSzQ&t=300s) Segment 2 (05:00 - 09:00)

listen to the 8-bit version of that. — Hey, I was just thinking about grabbing coffee later. Want to come with me? — Okay, so we've got some artifacts starting to come into the audio at that point, but still it's clearly fundamentally the same voice, and in the size of the weights we're now almost a quarter of that first one. If we come and listen to the second voice — and I do think certain voices seem to work better at the low bit rate than others. — Transformer models use self-attention mechanisms to process sequences in parallel, which makes them significantly faster than recurrent — Okay, so that's the 80 million parameter one. Let's listen to the 40 million. — Transformer models use self-attention mechanisms to process sequences in parallel, which makes them significantly faster than — Okay, that was the 40 million. Let's go down to the 8-bit 15 million. — Transformer models use self-attention mechanisms to process sequences in parallel, which makes them significantly faster than recurrent architectures. — So, that one's pretty good. There's not a huge amount of change there. So, for the next ones, I'm just going to play the 8-bit ones so you can listen to them. — The old lighthouse keeper stared out at the storm. The waves crashed against the rocks below, and somewhere in the distance, a ship's horn sounded through the fog. — Okay, we've got some artifacts there, and it's very interesting to look at this. It doesn't seem to handle punctuation well; it tends to just keep going, and maybe because of the size of the model, it hasn't really learned to pause at the end of a sentence before the next one. But the Luna voice is very nice. — In a landmark decision today, the European Parliament voted to establish new regulations governing the development and deployment of artificial intelligence systems across all member states.
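Those artifacts are the usual symptom of post-training quantization: every weight gets snapped to one of 256 levels, and the rounding error surfaces as audible noise. Here's a minimal sketch of a symmetric int8 round-trip to show where that error comes from — illustrative only, since we don't know which quantization scheme KittenML actually used.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by q * scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Stand-in "weights" drawn at random; real model weights behave similarly.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1000).astype(np.float32)

q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).max()
print(f"max abs rounding error: {err:.6f}")
```

The error per weight is bounded by half a quantization step (scale / 2), which is small but nonzero — multiply that across every layer and you get the slight roughness heard in the 8-bit demos.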
So that's the Hugo voice, which is the news-style, more formal voice. That one definitely sounds good. Remember, this is something that could load into your browser very quickly, and you could then build a Chrome extension or something like that around it. — The future of artificial intelligence is incredibly exciting. — So here we haven't yet listened to the Bruno voice. — The future of artificial intelligence is incredibly exciting and I cannot wait to see what we build together. — And now the Rosie voice. Okay, so you can hear from these that they're not the best voices in the world; that's clear when you start playing with them. But the fact that they're so small, and that you could use them in tiny devices with very small amounts of RAM, really shows us the direction of where we're going. So it's really going to be interesting to see what Kitten ML do with this. The models themselves, when you come in and look at them, are basically ONNX models. That's one of the ways they're able to make them so small. And you can see you've just got a NumPy file in here with the different voices. I haven't looked into that, but I presume it's something very similar to the embeddings for voices that we saw with Kokoro. If you haven't seen my video about that, check it out, because I explain in there how the different voices are just different embeddings, and I show how you can actually manipulate those to create new voices, etc. Anyway, I thought this was just an interesting project to look at, to get a sense of how small TTS systems can actually go. Please come over and star their project. If this really is just one person or a couple of people working on it, they've clearly done some nice work in making this and then open-sourcing it with an Apache 2.0 license. So, it's going to be very interesting to see where they go from here.
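If that NumPy file really is a bank of per-voice embedding vectors, as with Kokoro — an assumption, since I haven't inspected the file — then creating a new voice could be as simple as interpolating between two existing rows. A sketch with stand-in embeddings:

```python
import numpy as np

def blend_voices(voice_a: np.ndarray, voice_b: np.ndarray,
                 alpha: float = 0.5) -> np.ndarray:
    """Linear interpolation between two voice embeddings.

    alpha=0.0 returns voice_a unchanged; alpha=1.0 returns voice_b.
    """
    return (1.0 - alpha) * voice_a + alpha * voice_b

# Stand-in embeddings; the real ones would be rows loaded from the model's
# voices .npy file (the 256 dimension here is purely illustrative).
rng = np.random.default_rng(42)
voice_a, voice_b = rng.normal(size=(2, 256)).astype(np.float32)

hybrid = blend_voices(voice_a, voice_b, alpha=0.3)
print(hybrid.shape)
```

This is the same trick shown in the Kokoro video: because a voice is just a vector, arithmetic on vectors gives you new voices for free.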
And it may not be that long until we've got fully local TTS models that are actually very usable, running client-side in your browser and on edge devices, etc. Anyway, as always, let me know what your thoughts are in the comments. I would love to know if anyone's actually used this in a mobile phone app or in something in the browser. And as always, I will talk to you in the next video. Bye for now.

---
*Source: https://ekstraktznaniy.ru/video/22378*