# OpenAI Whisper Just Got Realtime!!!

## Metadata

- **Channel:** 1littlecoder
- **YouTube:** https://www.youtube.com/watch?v=4LUyfCcF-cM
- **Date:** 10.05.2026
- **Duration:** 7:47
- **Views:** 3,214
- **Source:** https://ekstraktznaniy.ru/video/52965

## Description

🔗 Links 🔗

https://developers.openai.com/api/docs/models/gpt-realtime-whisper

https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/

Related Videos - 

https://www.youtube.com/watch?v=FPp7u8F6E9Y

Code for the real-time voice transcription demo used in the video:

https://github.com/amrrs/gpt-realtime-whisper-demo


❤️ If you want to support the channel ❤️
Support here:
Patreon - https://www.patreon.com/1littlecoder/
Ko-Fi - https://ko-fi.com/1littlecoder

🧭 Follow me on 🧭
Twitter - https://twitter.com/1littlecoder

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

One of my favorite audio transcription models, Whisper, is now available as a real-time streaming endpoint. In this video, we're going to explore what this model is, how to use it, and what kind of things you can get out of it. I've got a real-time demo that we're going to see, but first of all, if you're not familiar with Whisper: Whisper is OpenAI's open-source, and I mean literally open-source, audio transcription model. It's an ASR model, so if you send it an audio stream, the model transcribes it. It also has a translation element to it, but let's stick to transcribing. It can detect multiple languages and transcribe them; it's a multilingual transcription model. So, let's say you're watching a YouTube video and you want it to be transcribed in English, then this model can do the job for you.

But what is special about the new model is that it is a real-time streaming endpoint. GPT Realtime Whisper has been launched as part of their recent real-time endpoints. If you have seen the video that I published a couple of days back, they launched three different real-time endpoints: GPT Realtime 2, which is a model based on GPT 5.5; GPT Realtime Translate, which is a voice model that can also do translation and then send it back to you; and the third one, GPT Realtime Whisper, which is the model that we're going to explore today.

Before we get into the model specifics, I would like to quickly show you a demo of how this works, and then we can jump into the model itself. So, I've got a YouTube video. You can see here it is Andrej Karpathy, who speaks pretty fast. Generally, when I watch videos on YouTube, I put them on 1.25x or 1.5x, but when I watch Andrej Karpathy, I probably have to slow it down because he's extremely fast. Now, what I'm going to do is start the server here.
And once I start the server, it is going to start listening, and then I'm going to show it so you can see what he's saying. Okay? So, let me start. So, the session has been started, as you can see, and I'm going to play Andrej Karpathy. Also, one thing that I like about Whisper: generally, a lot of speech-to-text audio transcription systems fail on Indian languages, but I've seen great success with Whisper, even for my own voice. So, I'm going to play this now.

"Ah, yeah, a mixture of both, for sure. Well, first of all, um, I guess like many of you, I've been using agentic tools like Claude Code and adjacent things for a while, maybe over the last year as it came out, and it was very good at, you know, chunks of code, and sometimes it would mess up and you have to edit them, and it was kind of helpful. And then I would say December was this clear point where, for me, I was on a break, so..."

As you can see here, it has been extremely real-time: as Andrej is speaking, the model is doing a pretty good job. So, I'm now going to share a video in a different language, let's say a Hindi interview. I'm going to get a Hindi interview for you, and then we are going to see if it can actually transcribe it. So, I'm going to start the server again. And as you can see here, the moment the audio was Hindi, it transcribed in Hindi. Obviously, it's very difficult for me to read this, but the moment it is in English, it gives it in English. For example, now I'm going to speak a different language called Tamil, which is my mother tongue, and you will probably notice that the text changes. [speaking in Tamil] Now I need to call in the end of the day and I get the real so night biryani something but call it in a minute or more than 20 minutes again. So, as you can see here, whatever I'm speaking, it is transcribing in real time, which is very surprising, as if it is a sci-fi movie. And you might be thinking, "Hey, it's just transcription.
Why are you giving so much buildup for it?" I'm going to stop it so that I don't waste my tokens. See, transcription is one of the most useful business applications that you can ever think of, because humans have been speaking for a very long time. There's a huge amount of data available on YouTube and a lot of other platforms like it. Even if you're working for a company, say you are part of a Zoom or Microsoft Teams meeting, you may want these things transcribed, summarized, turned into action items, maybe have it call some tools or whatever it is. Now, this model gives you the capability to transcribe in real time. It could be a speech by a politician, an interview, or a podcast. By the end of the podcast recording, you can have a summary. You can even get timestamps with this. So, the fact that this is real-time and it exists gives you huge potential.

The way it works is that the code is very simple. You can probably ask Claude Code or Cursor. But if you want to see it here, we just literally call GPT Realtime Whisper. Even though I'm calling it with the language set to English, it understands different languages. And you just, like,

### Segment 2 (05:00 - 07:00) [5:00]

it's a WebSocket connection, and then all you have to do is get it done. Very simple.

Now, about this model itself. This is a new model that they launched just a couple of days back. Like I said, if you have seen my other video, we covered GPT Realtime 2 there. As part of the same model series, they've launched GPT Realtime Whisper. This is a new streaming transcription model built for low-latency speech-to-text, so it transcribes audio as people speak. We saw in live action how fast this model is. The model makes live speech usable inside business workflows as it happens, so it helps in meetings, classrooms, broadcast events, and a bunch of other things.

Now, about how much it costs. This is a very, very fast model with great performance. It can take audio and text as input, but we're going to primarily deal with audio. Unlike other OpenAI models that come with per-token pricing, because this is a speech model, a voice transcription model, it comes with per-minute pricing. It costs you about 1.7 cents, approximately 2 cents, let's say, for every minute. So, for every minute of audio that you send through it, you're going to be charged about 2 cents. It's a pretty cost-effective model, to be honest; you can transcribe huge volumes of audio. And like I said, it's a multilingual model.

One thing OpenAI did not specify: the open-source Whisper model comes in different sizes. It's got tiny, small, medium, large, large-v2. But OpenAI did not explicitly specify which version of the model they're using here. They just said that the model has been specifically designed for streaming and low-latency use cases. That is all we know.
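
To make the per-minute pricing mentioned above concrete, here is a minimal sketch of the arithmetic. It assumes the roughly 1.7 cents per minute quoted in the video and assumes billing scales linearly with audio duration; the actual rate and any rounding rules should be checked against OpenAI's pricing page. The function name `estimate_cost_usd` is hypothetical, chosen only for this illustration.

```python
from decimal import Decimal

# Rate quoted in the video: about 1.7 cents per minute of audio.
# This is an assumption; confirm it against OpenAI's pricing page.
USD_PER_MINUTE = Decimal("0.017")


def estimate_cost_usd(audio_seconds: float, rate: Decimal = USD_PER_MINUTE) -> Decimal:
    """Estimate transcription cost in USD, assuming linear per-minute billing."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    minutes = Decimal(str(audio_seconds)) / Decimal(60)
    # Round to a hundredth of a cent for display purposes.
    return (minutes * rate).quantize(Decimal("0.0001"))


if __name__ == "__main__":
    examples = [
        ("1-minute clip", 60),
        ("1-hour meeting", 3600),
        ("10 hours of podcasts", 36000),
    ]
    for label, seconds in examples:
        print(f"{label}: ${estimate_cost_usd(seconds)}")
```

Under these assumptions, an hour of audio comes to about one dollar, which is what makes large transcription volumes practical.
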
What I'm going to do at the end of this video is share this repo on my GitHub and link it in the YouTube description, so that you can go play with this model. All you have to do is add your OpenAI API key, and you should be able to play with the model. Let me know in the comment section what you feel about this model. I was pretty happy to see the performance that it has got.

I'm going to just do it one last time, so I'm going to start. And I don't know what you're actually saying in this. Um, it thinks [clears throat] I'm speaking Korean. No, I'm not speaking Korean, I'm speaking Tamil. And I don't know what you're actually saying in this in the model. So, as you can see here, I'm saying that I'm very surprised to see that the model can understand everything that I'm saying. And you can also see it even when I pause, um, um... So, you can see that it even captures that very, very well. I am a big fan of Whisper; you can see a lot of Whisper videos on this channel. So, I'm happy to see GPT Realtime Whisper. If you have got any use case for this model, let me know; otherwise, see you in another video. Happy prompting.
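
For the "add your OpenAI API key" step mentioned above, a minimal sketch of the client-side plumbing might look like the following. It is not the code from the linked repo. It assumes the key is read from the conventional `OPENAI_API_KEY` environment variable and sent as a Bearer token, and it assumes audio chunks are sent as base64 inside an `input_audio_buffer.append` event, as in OpenAI's existing Realtime API. Whether GPT Realtime Whisper uses exactly this event shape is an assumption to verify against the model docs linked in the description. The helper names are hypothetical, and the sketch deliberately stops short of opening the WebSocket or configuring the session.

```python
import base64
import json
import os
from typing import Dict, Optional


def auth_headers(api_key: Optional[str] = None) -> Dict[str, str]:
    """Build the Bearer auth header from an explicit key or OPENAI_API_KEY."""
    key = api_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before starting the demo")
    return {"Authorization": f"Bearer {key}"}


def audio_append_event(audio_chunk: bytes) -> str:
    """Wrap one raw audio chunk as a JSON text frame.

    The event name follows the existing Realtime API and is an assumption
    for this model; check the model documentation before relying on it.
    """
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(audio_chunk).decode("ascii"),
    })


if __name__ == "__main__":
    # Eight bytes of placeholder audio, just to show the frame shape.
    print(audio_append_event(b"\x00\x01" * 4))
```

Keeping the key in an environment variable rather than in the source file means the repo can be shared without leaking credentials.
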