Granite 4.1 - The Fastest ASR?

Sam Witteveen · 07.05.2026 · 4,036 views · 234 likes

Video description

In this video, I dive into IBM's newly released Granite Speech 4.1 models and explore what makes them interesting, particularly the three 2B variants they've dropped and how each one makes a different trade-off between accuracy, richness, and throughput that you'll actually care about for real applications.

🔗 Links:
IBM Research Blog → https://research.ibm.com/blog/granite-4-1-ai-foundation-models
Twitter: https://x.com/Sam_Witteveen

🕵️ Interested in building LLM Agents? Fill out the form below
Building LLM Agents Form: https://drp.li/dIMes

👨‍💻Github: https://github.com/samwit/llm-tutorials

⏱️Time Stamps:
00:00 Intro
00:20 IBM Granite Collection
00:27 Granite Docling
00:46 Granite Speech 4.1
01:16 Granite 4.1 Blog
01:38 Granite Speech 4.1 2B
04:02 Granite Speech 4.1 2B Plus
06:15 Granite Speech 4.1 2B NAR
07:30 NLE: Non-autoregressive LLM-based ASR by Transcript Editing Paper
07:45 Architecture
09:45 Code Time
12:00 Granite Speech Model Github

#DellProPrecision #DellProMax #Delltech #localai #NVIDIA

Contents (12 segments)

Intro

Okay, so IBM has been quietly building out one of the more interesting open model families on the market with their Granite series of models. And honestly, they're probably not getting enough credit for this, because it does seem in some ways that they've kind of replaced some of the things that Microsoft was doing with the Phi models and stuff like that. But the interesting thing is

IBM Granite Collection

that they've got a whole suite of language, vision, speech, and embedding models going on here. Now, one of the

Granite Docling

areas that has stood out well has been their Granite Vision for document understanding. And that leads nicely into the Docling models. So, if you're doing anything with OCR, a lot of people are finding Docling to be one of the key models for pulling structured data out of PDFs, etc. So

Granite Speech 4.1

while all of these models are actually quite interesting, in this video, I want to focus on their speech models, specifically the ASR side of this release. Because this is not just one model. This is actually a suite of three separate ASR models, each with their own strengths and differences, and they really allow people to do a lot of the things that they haven't been able to do with Whisper or even with some of the other models out there like Parakeet, etc. Okay, so let's break down what's

Granite 4.1 Blog

actually in this speech release. There are three models. All of them are roughly 2 billion parameters in size. They're built for sort of edge deployment here. And the interesting thing is that IBM is kind of framing this as pick the variant based on what your bottleneck actually is. So the

Granite Speech 4.1 2B

first one up is Granite Speech 4.1 2B, the base model. This is the one that currently leads the Open ASR Leaderboard on Hugging Face. It sits at the top of the leaderboard with a word error rate of 5.33, meaning that it's getting roughly 95% of the words right across a variety of different tasks and data sets. While you'll often see really low error rates on things like LibriSpeech, they often don't translate into real-world use, so this average word error rate is actually a better guide to how the model is probably going to perform in the wild. Now, the other thing that we can take away from the Open ASR Leaderboard is the RTFx score. This is basically the real-time factor: for every 1 second of compute, how many seconds of audio can actually be processed? And you'll see that for this particular model, they're getting almost 4 minutes of audio for every 1 second of processing. That basically means you can transcribe an hour of audio in 16 seconds or so, which is pretty amazing. Now, this model is multilingual, but it's only supporting seven languages. So, you've got English, French, German, Spanish, Portuguese, Japanese for transcription. On top of that, it does bidirectional speech translation, so you can go from any of those languages into English and from English out to a number of languages as well. The model itself can handle punctuation and truecasing, and it also has keyword biasing built in. What that means is that you can pass in a list of names, acronyms, or technical terms in the prompt, and the model will weight towards recognizing those correctly. So if you've got a special spelling or something like that, keyword biasing lets you steer the model and get a better result. And honestly, that alone can be worth it if you're transcribing a lot of domain-specific content.
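To make the leaderboard numbers above concrete, here is a small worked calculation. It is only a sketch: the function names are made up for illustration, the RTFx of 231 is the figure quoted for the base model later in the video, and reading 100 minus WER as "accuracy" is a rough simplification, since insertion errors mean WER is not strictly a per-word miss rate.

```python
def seconds_to_transcribe(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock seconds needed for `audio_seconds` of audio at a given RTFx.

    RTFx is seconds of audio processed per second of compute, so higher is faster.
    """
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


def approx_word_accuracy(wer_percent: float) -> float:
    """Rough 'words right' percentage implied by a WER given in percent."""
    return 100.0 - wer_percent


ONE_HOUR = 60 * 60  # seconds of audio

# Base model figures as quoted in the video (RTFx ~231, WER 5.33).
print(round(seconds_to_transcribe(ONE_HOUR, 231), 1))  # 15.6 seconds per hour of audio
print(round(231 / 60, 2))                              # 3.85 minutes of audio per compute second
print(round(approx_word_accuracy(5.33), 2))            # 94.67
```

These match the "almost 4 minutes per second", "an hour in 16 seconds or so", and "around 95%" figures given in the talk.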
Now, the architecture for this one is pretty standard. It's basically just an autoregressive model, and it really rounds out this model as being a very good workhorse with a low word error rate. And I think that alone would justify this. But there are still two models to come. So, the second model is

Granite Speech 4.1 2B Plus

the Plus model. And the key thing here is that this model allows you to add a form of diarization where you've got speaker labels, so that it can actually tell you that this is speaker one speaking, this is speaker two speaking, etc. This is called speaker-attributed ASR, or diarization. And this is a huge thing that people often want. For example, if you're going to transcribe a podcast, you want to know who's saying what so that you can attribute that. Now, the model itself may not actually give you the names. It will give you speaker one, speaker two, but it shouldn't be that difficult to swap those out for real names in the final transcript. The second feature is word-level timestamps. This is where every word gets a tag with the end time noted in it. They're reporting timestamp accuracy that beats a lot of the models out there, including customized versions of Whisper that were built to do this specific task. So if you've been using something like WhisperX to get word-level timing, this model is probably something you want to have a serious look at. There are also some nice features in here for incremental decoding. You can pass in previously transcribed text as a prefix and the model just picks up from there. So, for example, if we've got a really, really long audio recording and we cut it up into chunks, we can have overlap in the chunks, pass in the transcribed text, and have the model take over from that and continue generating the text. This is super useful for long-form audio where you want to keep things like speaker numbering consistent across chunks. Now, not everything's ideal about the Plus model. You've got some trade-offs here. The number of languages goes down to five. They drop Japanese entirely. They also drop the ability to do the translation stuff. And the word error rate on this one is a bit higher as well.
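The overlapping-chunk, prefix-passing idea described above can be sketched roughly as follows. This is an assumption-heavy illustration rather than IBM's code: `transcribe_chunk` is a hypothetical callable standing in for whatever actually runs the model on a time span, and it is assumed to return only the new text that continues after the prefix. The chunk length, overlap, and prefix size are arbitrary placeholder values.

```python
from typing import Callable, List, Tuple


def chunk_spans(total_seconds: float, chunk_seconds: float,
                overlap_seconds: float) -> List[Tuple[float, float]]:
    """Split [0, total_seconds) into (start, end) spans that overlap their neighbours."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, chunk_seconds)")
    spans = []
    start = 0.0
    step = chunk_seconds - overlap_seconds
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return spans


def transcribe_long(total_seconds: float,
                    transcribe_chunk: Callable[[float, float, str], str],
                    chunk_seconds: float = 30.0,
                    overlap_seconds: float = 5.0,
                    prefix_words: int = 50) -> str:
    """Walk the chunks in order, feeding the tail of the transcript so far as a prefix."""
    words: List[str] = []
    for start, end in chunk_spans(total_seconds, chunk_seconds, overlap_seconds):
        prefix = " ".join(words[-prefix_words:])
        # Hypothetical model call: returns only the text that continues the prefix.
        new_text = transcribe_chunk(start, end, prefix)
        words.extend(new_text.split())
    return " ".join(words)


print(chunk_spans(70, 30, 5))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70)]
```

Carrying the tail of the running transcript forward is what would let speaker numbering stay consistent from one chunk to the next.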
But certainly here, if you're building any kind of meeting recorder, podcast tool, anything where the structure of the transcript matters, the plus model is the one that you want. And that brings us to the

Granite Speech 4.1 2B NAR

third one. This is Granite Speech 4.1 2B NAR, and this model is all about throughput. So this is a non-autoregressive model, where the other models have been predicting tokens as they go along in an autoregressive fashion. This one has the NAR because it stands for non-autoregressive. And to understand why that matters, you have to understand how things like Whisper, Parakeet, Canary, basically most of the other transformer-based ASR models on the market today, actually work. They're autoregressive. They generate one token at a time, and each token is conditioned on the previous ones. The big takeaway is that decoding is sequential: your GPU has to do a tiny forward pass one token at a time and then wait before it does the next one. Now, obviously, the goal for fixing this has always been to just predict the whole sequence in parallel, and people have tried that for many years. Usually it hasn't worked very well, because predicting an entire transcript from scratch in one shot is pretty hard. You lose the ability to condition on what you've already written. Now what IBM has done instead

NLE: Non-autoregressive LLM-based ASR by Transcript Editing Paper

though is that they've got this technique called NLE, which stands for non-autoregressive LLM-based editing. And what this is doing is, instead of generating a transcript, they're kind of editing one. So here's how it

Architecture

works. Step one is a frozen encoder that runs over the audio and produces a draft transcript. This kind of CTC encoder is pretty cheap and fast to run, and it gets you most of the way there. In fact, if you look at the drafts, you'll see that they're usually pretty correct. But where this is able to perform better than previous models that tried to do it all in one shot is that it can make use of bidirectional attention. That allows it to edit the original transcript as it goes through: copy, insert, delete, or replace. So having this re-editing step over the original draft transcript allows it to improve the accuracy of what you're getting out. Now, the base model runs at a real-time factor of about 231 on the Open ASR Leaderboard, and remember, that's already pretty fast, right? That's allowing you to do an hour of audio in 16 seconds. The model card for the non-autoregressive model claims that if you're running this with batches on an H100, so a reasonably powerful GPU, you can get a real-time factor of 1820. And that itself is kind of insane. That literally means that you can be transcribing an hour of audio in 2 seconds on that hardware. And the amazing thing here is that you're not even taking a huge hit on the word error rate. Now, of course, there are some trade-offs. Not only do you have no translation, you also have no keyword biasing, so you can't influence the transcript in the same way, and you don't have speaker attribution or timestamps. But if you just need the raw transcripts of hundreds of hours of audio, this is going to let you process all of that in an insanely short amount of time. So I think the best thing is let's jump in and actually have a look at how we could run this on a

Code Time

local machine. Okay, so you can see here that I am running this on a Dell Pro Max Tower T2. And the cool thing about this is I have an RTX Pro 6000 Blackwell GPU in here. So, a big shout out to the people at Dell for providing the compute today. And as you can see, I'm easily able to run this plus multiple other models that I've got going in here and serving at the same time. Now, that said, because these models are so small, you can run them on lots of different GPUs. It doesn't need to be a super big one. One thing that you will need to be aware of is that to do some of the fancier things for the non-autoregressive model, you're probably going to need flash attention installed. And on my system, because I've been using CUDA 13, I've compiled my own version of flash attention to make sure that everything would fit. So, if you're trying this out in something like Colab, and you've got one of the older GPUs like a T4, you may run into issues trying to install flash attention and get that working. But as long as you've got your PyTorch lined up with your CUDA and with your flash attention, the code for getting this to work is actually pretty simple. You can see in here that it's using the transformers library. It's loading the model up, it's using AutoProcessor, etc. And then if you want to do things like the diarization or speaker-attributed ASR, you can see that you need to change the prompt, but then you can still use the transcribe function that they've got up here. Just make sure that you've got the right model loaded. The cool thing here is they've also got code examples for the incremental decoding. This is basically where you pass in the previous chunk. If you're going to be doing long audio, you're going to need some way to chunk it up.
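Since most of the setup pain mentioned above comes from PyTorch, CUDA, and flash attention not lining up, a quick environment check before loading any model can save time. The sketch below is not from the video. It only reports what is installed and does not decide whether a given combination is compatible. The package names `torch`, `transformers`, and `flash_attn` are the usual import names, but treat the list as an assumption about your setup.

```python
import importlib.metadata
import importlib.util


def package_report(packages=("torch", "transformers", "flash_attn")):
    """Map each package name to its installed version, or None if it cannot be imported."""
    report = {}
    for name in packages:
        if importlib.util.find_spec(name) is None:
            report[name] = None  # not importable in this environment
            continue
        try:
            report[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under this exact name.
            report[name] = "installed (version unknown)"
    return report


def cuda_summary():
    """Return the CUDA version PyTorch was built against, or None without PyTorch."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    return {
        "torch_cuda_build": torch.version.cuda,        # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for name, version in package_report().items():
        print(f"{name}: {version or 'missing'}")
    print("cuda:", cuda_summary())
```

If `flash_attn` shows up as missing on an older GPU such as a T4, that is consistent with the installation issues described above.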
And honestly, I'm not seeing the higher speed that they're claiming, most likely because I'm not using an H100. But I think it's also got to do with the chunking and the batching of this. That said, on the GPU that I've been using, I've been able to get very nice outputs from this. Just coming into the GitHub here, they've got a bunch of nice things, like how you could do speculative decoding with this, but also their notebook from the previous version for doing

Granite Speech Model Github

fine-tuning, which you should be able to use to fine-tune a model, perhaps for your specific voice if you've got an accent, or for a specific kind of use case. A good example could be something like court transcripts, or a specific podcast where you can already get transcripts for certain episodes and use them as training data to get a fine-tuned version of this model that will work for the particular host that you're trying to transcribe. So, for me personally, I've started to put together my own repo of all the things that I wanted to have in there and set up some simple scripts. I'll probably end up writing some agent skills to go along with this, so that I can have this running on my machine and then have agents reach out to it to do the transcriptions fully locally without needing anything in the cloud. And just showing you something from some of my own tests: one of the things that is really cool here is the whole idea of timestamps. Now, I probably need to do a lot more analysis to check this, but so far it's looking pretty good at giving us timestamps for each word that we're transcribing. One of the issues I'm still experimenting with is the best way to do really long-form audio. So, if you've got something like a 4-hour podcast, you want to be able to pass in the keyword biasing stuff. And you can see here how that is actually done, right? You've got a specific prompt that you pass in where you've got keywords, and then you just pass in each of those keywords. And definitely one of the things I'm still experimenting with is the best chunk size for long-form audio, and the best prefix-filling strategies for stitching together really long transcripts and handling multiple speakers for the diarization, etc.
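As one possible starting point for the stitching question raised above, the sketch below joins two chunk transcripts by looking for the longest run of words that ends the first and begins the second. It is a generic text-overlap heuristic, not something shown in the video or taken from IBM's code, and `stitch` and its `max_overlap_words` parameter are names invented for this illustration.

```python
def stitch(left: str, right: str, max_overlap_words: int = 20) -> str:
    """Join two transcripts, dropping words duplicated across the chunk overlap.

    Tries the longest candidate overlap first and falls back to plain
    concatenation when no shared run of words is found.
    """
    left_words = left.split()
    right_words = right.split()
    limit = min(max_overlap_words, len(left_words), len(right_words))
    for size in range(limit, 0, -1):
        tail = [w.lower() for w in left_words[-size:]]
        head = [w.lower() for w in right_words[:size]]
        if tail == head:
            return " ".join(left_words + right_words[size:])
    return " ".join(left_words + right_words)


print(stitch("the quick brown fox", "brown fox jumps over"))
# the quick brown fox jumps over
```

Exact word matching is brittle: differing punctuation or casing decisions at a chunk boundary will hide a real overlap, so word-level timestamps may be a more reliable key for alignment where they are available.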
Overall though, just to finish up, I would say that while for language models there are perhaps better things out there than the Granite stuff, the speech stuff is really super interesting. Also, the embedding stuff and the Guardian models probably deserve some more investigation on their own. But altogether, this is a really interesting release from IBM. And I hope they keep it up, right? I hope this doesn't become like the Microsoft models, where they've kind of scaled it back and aren't really doing it anymore. It's great to see that IBM realizes that a lot of people want these smaller kinds of models, whether they're smaller language models, vision models, speech models, etc. So anyway, let me know what you think in the comments. As always, if you found the video useful, please click like and subscribe, and I will talk to you in the next video. Bye for now.
