# Tiny Aya - Cohere's Mini Multilingual Models

## Metadata

- **Channel:** Sam Witteveen
- **YouTube:** https://www.youtube.com/watch?v=8i0zxyHKbfk

## Contents

### [0:00](https://www.youtube.com/watch?v=8i0zxyHKbfk) Segment 1 (00:00 - 05:00)

Okay, so one of the questions I get asked a lot is "what model is good for..." and then someone will name a particular language, and usually that language is obviously not going to be English. It won't even be things like French or Italian, because most of the big models now cover at least the common European languages pretty well. That's not the case for other languages around the world, and this is why it's always challenging to recommend a particular model, especially if you don't speak that language.

The challenge can be a whole number of different things. First off, it can be that the model has simply never seen the language, which is very common for low-resource languages. Low-resource languages are basically languages for which we just don't have enough data on the internet to train up big models. And while some of these languages would be really obvious to people, often they're ones you don't necessarily think about. It can be something like a particular culture not really using Wikipedia, so the Wikipedia for that language has almost no pages. A lot of training data in the past benefited when a country with a specific language had a very active Wikipedia that you could then use for training models. So lack of data is the first thing.

The other issue you can run into, which I've covered in past videos, is just bad tokenizers. Looking back quickly, this was the Llama 2 tokenizer. You can see that something like "my name is Sam" in English is just four tokens. Same in French. As soon as we go to Thai, though, we start seeing a lot more tokens, and when we go to Greek, we've got way more tokens going on. This basically means it's much harder for the model to learn a language if the tokens for that language are character by character, and occasionally even partial-character by partial-character. The original Llama tokenizers were a perfect example of this: for a lot of languages you would only need a few tokens, but even for things like Chinese you needed a lot more tokens to cover the same meaning in the same number of words. (I've included a quick sketch below of how to reproduce this kind of comparison.)

Now, that has certainly changed over the years. The days of 32k tokenizers are pretty much long gone, and most models, especially the proprietary ones, have gotten better at least at common languages, though they're often still not very good at low-resource languages. We've also seen projects like Translate Gemma, which builds on the Gemma 3 stack, and the Gemma 3 models were pretty good at multilingual coverage. That relates both to them having better tokenizers, with 250k-plus vocabularies, and also just a lot of multilingual data. And it's not only Gemma: the Qwen models have gotten a lot better recently at multilingual generation, and even going back to Llama 3, it got a lot better as Facebook realized, hey, we want a model that can cover the languages people are speaking on our platform. So suddenly you saw a bump in languages like Chinese and Thai.
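To make that token-count point concrete, here is a minimal sketch of the comparison using Hugging Face tokenizers. The model IDs and sentences are illustrative (the Llama 2 repo is gated, so you'd need access, or you can swap in any older tokenizer you have):

```python
# Minimal sketch: count tokens for the same short sentence in several
# languages with an older and a newer tokenizer. Model IDs are illustrative;
# the Llama 2 repo is gated, so substitute any tokenizer you have access to.
from transformers import AutoTokenizer

sentences = {
    "English": "My name is Sam.",
    "French": "Je m'appelle Sam.",
    "Thai": "ฉันชื่อแซม",
    "Greek": "Με λένε Σαμ.",
}

for model_id in ["meta-llama/Llama-2-7b-hf", "google/gemma-3-4b-it"]:
    tok = AutoTokenizer.from_pretrained(model_id)
    print(model_id)
    for lang, text in sentences.items():
        n_tokens = len(tok(text, add_special_tokens=False)["input_ids"])
        print(f"  {lang:8s}: {n_tokens} tokens")
```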
Now, jumping forward to 2026, we've more recently seen the release of Translate Gemma, a whole suite of new open translation models. The difference with these, though, is that they have a very specific purpose: they're aimed at going from one language to another. In many ways they're not a set of general-purpose multilingual models.

And that brings us to the release last week from Cohere. These are the Tiny Aya models, and they're trying to fill that gap by making small, general multilingual models more accessible to people. While the big models from the Gemma family and the Qwen family have been very good at multilingual tasks, the smaller models are often just not trained with the same amount of data. They're pre-trained on fewer tokens, and the post-training recipes for them are often not really focused on multilingual tasks.

So, jumping into Cohere's launch here: they've launched a suite of very small multilingual models. This is both a research release and a set of useful models that you can probably use in tasks straight away, and you could certainly fine-tune them to get a lot better at the languages you're interested in covering. If we look at the models themselves, they're around about 3.3 billion parameters each. And the first one, which is really good to see, is that they've actually released the base model.

### [5:00](https://www.youtube.com/watch?v=8i0zxyHKbfk&t=300s) Segment 2 (05:00 - 10:00)

This is the core model that's been pre-trained on 70-plus languages, including data from many low-resource languages around the world. After that, we've got four post-trained models built off this pre-trained base model.

The first one is Tiny Global, and the idea is that this is your general multilingual model. It supports the majority of the languages from pre-training, and it's been instruction-tuned and balanced to work across a wide variety of languages. You can think of it this way: if you want one model that covers as many languages as possible, you go for the Global model. After this is a family of specialized fine-tuned models, which I'll go through in a second. They've also released multilingual training datasets and benchmarks that you can use to make your own fine-tunes of these.

Okay, so what they did was take all the different languages and categorize them into regional groups. The idea is that these groupings reflect both which languages are related to each other in some way and which are geographically close to each other, and from that they've created three specialized post-training recipes, which are mixes of the different languages by region.

First up is Tiny Earth. This is a merge of West Asian, African, and some of the European languages, so it has things like Arabic, Turkish, and Hebrew, then ten different languages from Africa, as well as 31 languages from Europe. The second one is the Tiny Fire model, which is basically the South Asian model. This one is a little different from the others: it tends to focus mostly on its own languages because the script is so different. So here you've got Hindi, Bengali, Tamil, and Nepali. Most of them will also have English, just because people code-switch, going back and forth between perhaps one language, then some words in English, then back to another language. That's quite common. The third model is the Tiny Water model, and this covers the Asia-Pacific languages: things like Tagalog, Bahasa, Vietnamese, Thai, and Chinese, but even some really low-resource languages like Khmer and Burmese. It also mixes in the West Asian and European languages as well.

If we look at the paper, we can see the data mix they put together, and the way they've done it is really quite interesting. Even if you didn't want to build a multilingual model, the way they've done post-training here is worth studying. They trained region-specific SFT models for each of the regions, but that's not what they've actually released. They've then gone and merged them, as you can see here. So, just like we talked about before, the Tiny Water model is actually a merge of the Europe SFT model, the West Asia one, and the Asia-Pacific one. I do think it would have been nice if we'd actually gotten releases of the SFTs for each specific region, and maybe if asked they will make those available in the future. It's certainly interesting to look at.
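Since the per-region SFT checkpoints weren't released, here is only a rough sketch of what that kind of merge might look like, assuming a simple uniform weight average; the checkpoint names are hypothetical, and the actual recipe may well weight the regions differently:

```python
# Rough sketch of merging region-specific SFT checkpoints by averaging
# their weights. Checkpoint names are hypothetical (Cohere released only
# the merged models), and the real recipe may not be a uniform average.
import torch
from transformers import AutoModelForCausalLM

region_ckpts = ["sft-europe", "sft-west-asia", "sft-asia-pacific"]  # hypothetical
models = [
    AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float32)
    for ckpt in region_ckpts
]

# Reuse the first model as the container for the averaged weights.
merged = models[0]
state_dicts = [m.state_dict() for m in models]
with torch.no_grad():
    for name, tensor in merged.state_dict().items():
        if not tensor.is_floating_point():
            continue  # skip integer buffers, if any
        tensor.copy_(torch.stack([sd[name] for sd in state_dicts]).mean(dim=0))

merged.save_pretrained("tiny-water-style-merge")  # illustrative output path
```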
Now, they haven't taken an off-the-shelf tokenizer; they've trained up their own tokenizer for this. We can see the efficiency of that tokenizer across the different languages, compared against the Gemma 3 tokenizer and the Qwen 3 tokenizer. Certainly for some languages they're beating the Gemma 3 tokenizer, with theirs being more efficient, and for other languages Gemma 3 is maybe slightly better. For the sake of time, if you're really interested, I would suggest you go through the paper and look at some of the results to see how well this is going to work for your particular language.

But you can just come in here and start with any of these models. You can see we've got the base model, the Global model, then Earth, Fire, and Water. I must say I'm not a super big fan of those names. Along with those models, we've also got quantized versions that you could use straight away in Ollama or something like that if you wanted to. And remember, these models are very small, only around 3B in size, so you could run them on a phone if you were looking to make something like a mobile app in a language where a lot of the other models are just not working out.

So I've put together a Colab where you can try this out yourself. I really don't think it's that useful for me to go through the different ones one by one, so in this particular Colab I've gone for the Global model and just tested it out with a few different things.
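If you'd rather script it yourself than use the Colab, a minimal transformers sketch looks something like this; the repo ID is an assumption, so check Cohere's Hugging Face page for the exact model names (the quantized GGUF versions would instead go through llama.cpp or Ollama):

```python
# Minimal sketch of chatting with one of the small models via transformers.
# The repo ID is an assumption; check Cohere's Hugging Face page for the
# exact names of the global/earth/fire/water releases.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereLabs/tiny-aya-global"  # hypothetical repo ID
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Try a prompt in the language you care about, e.g. Thai here.
messages = [{"role": "user", "content": "ช่วยแนะนำอาหารไทยง่าย ๆ สักสามอย่าง"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=200)
print(tok.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```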

### [10:00](https://www.youtube.com/watch?v=8i0zxyHKbfk&t=600s) Segment 3 (10:00 - 11:00)

You should certainly work out whatever language you're interested in, see which model is actually best for that language, and try that as well as the Global model. I do think that anyone who's trying to build some kind of assistant for a mobile app, for a country that's just not supported by the big models, will find these really useful.

So overall, if you are doing multilingual work, this is definitely worth checking out. The Translate Gemma models are worth checking out too. And we're already starting to see with the Qwen 3.5 models that the new era of those models seems to be a lot better at multilinguality. Obviously, at the moment, that's a 400B model, so it's not something you can run on mobile, but it'll be interesting to see if that carries through to the smaller versions of those models. And perhaps with the Gemma 4 models, are we going to see models that are even better at multilingual tasks than something like this? For now, though, if you are looking for a model to support a language that's not covered by the main models, these are definitely worth checking out.

Please have a play with the Colab I've put in there; you can swap out the models quite easily. And let me know in the comments how these models actually perform for the particular language you're interested in. This is something I get asked a lot, and I don't always have a good answer, so it's nice to know if, say, this is particularly good at West Asian languages or particularly good at certain languages that aren't working well in other models. I would love to hear from all of you.

Anyway, as always, thank you for listening through to the end. I know this is probably not going to be a big, popular video, and it's perhaps not the sexiest topic for a lot of people, but I do know that finding good, strong multilingual models is something a lot of people are always looking for. And on that note, I will talk to you in the next video. Bye for now.

---
*Source: https://ekstraktznaniy.ru/video/22377*