# Real-time Speech Recognition in 15 minutes with AssemblyAI

## Metadata

- **Channel:** AssemblyAI
- **YouTube:** https://www.youtube.com/watch?v=5LJFK7eOC20
- **Date:** 12.11.2021
- **Duration:** 19:21
- **Views:** 309,729

## Description

Get your free speech-to-text API token 👇
https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=yt_mis_5

Transcribing in real time is a super skill only court reporters can brag about. But luckily, we don't need to learn how to type fast to get transcriptions of audio quickly. Thanks to AssemblyAI's Streaming Speech-to-Text model (previously Real-Time Speech Recognition), it is very simple to set up a Python script that can listen for audio and turn it into text.

In this video, we will see how to create this script in Python with the help of PyAudio, WebSockets, and asynchronous functions. The app will listen to audio input through a microphone and display the transcription in real time. We will then integrate this code into a simple Streamlit application to showcase real-time speech recognition with a touch of interactivity.

If you’d like to follow along, don’t forget to get your own AssemblyAI API token for free at assemblyai.com

You can find the code from this tutorial in this GitHub repository: https://github.com/misraturp/Real-time-transcription-from-microphone

Find the written form of this tutorial here: https://www.assemblyai.com/blog/real-time-speech-recognition-with-python/

AssemblyAI Streaming STT docs: https://www.assemblyai.com/docs/speech-to-text/streaming/?utm_source=youtube&utm_medium=referral&utm_campaign=yt_mis_5

## Contents

### [0:00](https://www.youtube.com/watch?v=5LJFK7eOC20) Introduction & Account Setup

Transcribing audio in real time can be really hard, especially if people are speaking very fast or very slowly, if they're using a lot of filler words, or if there is noise in the background. But fear not, because AssemblyAI has its own real-time transcriber. In this video, I'll show you how to use AssemblyAI's real-time transcription endpoint, and also use it in a Streamlit application, just for fun. To follow along, go to the link in the description and create your own AssemblyAI account.

Okay, let's get started. The first thing I want to set up is the AssemblyAI side of things: an API token that will work for this project. Very simple: just go to assemblyai.com, or use the link in the description, and create an account. If you already have an account, you can sign in. Immediately after you create an account, you get a free API token from AssemblyAI, and you'll be able to see it on your profile, too. Once you've done that, for this project specifically you'll have to upgrade your account to be able to use real-time transcription. For that, simply go to Billing and click Upgrade. I've already done that, so that option isn't visible to me here, but if you go to Billing you'll be able to upgrade your account. When you go back, you'll see that your API key now specifies that it is on a Pro plan. That's all we need to do on the AssemblyAI side. Next, what we want to do is

### [1:35](https://www.youtube.com/watch?v=5LJFK7eOC20&t=95s) Install the Dependencies

to install the dependencies for this project. We have two main dependencies: one of them is PyAudio, which will help us get input from the microphone as a stream, and the other is the websockets library, so we can talk to AssemblyAI's API endpoint. Very easily, you just run `pip install pyaudio`. I already had it on my computer, so this might take a bit longer for you. One note here: you might get an error that says PortAudio cannot be found. To solve that problem, on macOS all you have to do is `brew install portaudio`. And lastly, we just `pip install websockets`. That's it.

Next, I'm going to create a project folder; I'll call it "real-time audio transcription with AssemblyAI". I'll create my first file here and call it real_time_audio_transcription.py. I'm also going to create a configure file, and I'm going to paste my authentication key there. Now that's ready, let's build our application step by step.
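The configure file mentioned above can be as small as one line. A minimal sketch (the variable name `auth_key` follows the tutorial's companion GitHub repo; the placeholder token is for you to replace with your own):

```python
# configure.py
# Holds the AssemblyAI API token so it stays out of the main script.
# Replace the placeholder below with the token from your AssemblyAI dashboard.
auth_key = "YOUR_ASSEMBLYAI_API_TOKEN"
```

The main script can then do `from configure import auth_key`, keeping the secret out of version control if you gitignore this file.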

### [3:12](https://www.youtube.com/watch?v=5LJFK7eOC20&t=192s) Set Up the Microphone Stream

The first thing is to set up the input stream from the microphone. For that we are using PyAudio, as I mentioned, and we just need to set up some constants: how many frames per buffer we're going to read, the sample rate, how many channels, and a few other settings needed to create the stream. Once this is done, we want to create a connection to AssemblyAI. The endpoint for AssemblyAI's real-time transcription also specifies the sample rate; it's basically api.assemblyai.com, version 2, realtime. The actual transcription part of this program is going to be a little trickier than the one we did last time, which was plain audio transcription: just sending a file to AssemblyAI and getting a transcription back. Because we are doing it in real time, we're going to have to use an asynchronous function, or rather a group of asynchronous functions, to do the job. We're going to have two functions: one of them constantly sending the input that goes into the microphone, and one of them constantly listening for the transcription coming back. I will paste the whole function here, and then we'll go through it step by step to understand what it does. The main goal of this function is to constantly send what is picked up by the microphone and constantly listen for the transcription that comes in. But for this communication to happen, first you need to create a connection.
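The microphone setup described above can be sketched like this. The parameter values (3200 frames per buffer, 16 kHz, mono) follow the tutorial's companion blog post; the function name `open_microphone_stream` is my own, and the block assumes the `pyaudio` package is installed:

```python
# Constants for the microphone input stream, as described in the video.
FRAMES_PER_BUFFER = 3200   # how many frames we read from the mic per chunk
RATE = 16000               # sample rate in Hz
CHANNELS = 1               # mono input

# AssemblyAI's v2 real-time endpoint, parameterized by the sample rate.
URL = f"wss://api.assemblyai.com/v2/realtime/ws?sample_rate={RATE}"


def open_microphone_stream():
    """Open a 16 kHz, mono, 16-bit input stream (requires pyaudio)."""
    import pyaudio  # imported here so the constants above stay dependency-free

    p = pyaudio.PyAudio()
    return p.open(
        format=pyaudio.paInt16,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        frames_per_buffer=FRAMES_PER_BUFFER,
    )
```

Each call to `stream.read(FRAMES_PER_BUFFER)` then yields one chunk of raw PCM audio to send over the websocket.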

### [5:00](https://www.youtube.com/watch?v=5LJFK7eOC20&t=300s) Create a Connection

To create that connection, we use the endpoint from the AssemblyAI API. Using the websockets library, the authentication key, and a few other headers, together with the endpoint URL, we create a connection, and the connection is called `_ws`. Before we forget, we also need to bring in some other dependencies: for example, the authentication key from the configure file, the json module, and Python's asyncio and websockets libraries. After that, we basically create the session, the connection between our application and AssemblyAI, and we wait for a response from AssemblyAI to make sure that the connection is solid and has been formed. Next, we have two more asynchronous
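A sketch of the connection step described above. It assumes the `websockets` package's legacy client, where the header parameter is named `extra_headers` (newer releases call it `additional_headers`); the function name `connect_to_assemblyai` is mine:

```python
import json


async def connect_to_assemblyai(auth_key: str):
    """Open a websocket session to the real-time endpoint and wait for the
    initial handshake message before streaming any audio."""
    import websockets  # third-party: pip install websockets

    url = "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000"
    _ws = await websockets.connect(
        url,
        extra_headers=(("Authorization", auth_key),),  # API token header
        ping_interval=5,
        ping_timeout=20,
    )
    # AssemblyAI sends a session-begins message first; waiting for it
    # confirms the connection is solid before we start sending audio.
    session_begins = await _ws.recv()
    print(json.loads(session_begins))
    return _ws
```

The returned `_ws` object is what the send and receive functions in the next section write to and read from.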

### [6:02](https://www.youtube.com/watch?v=5LJFK7eOC20&t=362s) Send Function

functions. One of them is a send function and the other is a receive function. In the send function, we get the audio from the stream, the one we set up with PyAudio, which is the input from the microphone. We then base64-encode it and send it to the websocket we created; if you remember, `_ws` is the websocket that connects to the AssemblyAI API. We are also catching some exceptions in case there is an error with the connection.

In the receive function, we get the response from the AssemblyAI websocket. Based on what we get, there could be a connection error; if not, we print the response. Here we only print the text, but there is other information in what AssemblyAI returns. You can find all of it in AssemblyAI's documentation, but basically you get either partial results or final results. What that means is that while I'm talking, AssemblyAI will constantly be returning the transcription of what I'm saying, even in the middle of a sentence, more or less for every word. The moment you stop, it analyzes the whole sentence, adds punctuation, does casing (for example, making the relevant letters uppercase), and returns it: that is the final result. Before that, you are only getting word-by-word partial results. You also get the audio start time and other information, such as the confidence; if it's very loud or noisy in the background, the confidence might be lower. For now, we're only going to use the text field and print it to the terminal.

Inside the bigger asynchronous wrapper, we call these send and receive functions so that they run repeatedly, and we'll always be listening. Finally, the last thing I need to do to actually run this function is to call it in a loop; I'll simply say `while True` for now. That's it: this is an application that will already work, so let's run it and see what happens.

All right, I've navigated into my folder, and now I'm going to run the file I created. It's actually listening to me now, and as you can see, at first it doesn't add any punctuation; it just returns what I'm saying word by word. After I'm done speaking (I'll stop it for a second), it capitalizes some of the letters and adds punctuation like commas and periods; for example, it capitalizes the "I" here too, to make it more of a final result. But that only happens after you finish saying a sentence. This is a very nice result and definitely usable, but I want to experiment with showing only the final results. If you remember, we saw that we can filter the messages based on whether the message type is FinalTranscript or not, so I can just write an if condition that checks whether the message type is FinalTranscript. This should work: now I'm only expecting to see full sentences, not partial words. That's perfect, exactly what I wanted.

As you can see, it was quite simple to make a terminal application, but you might want to go a little fancier. If you're interested, I'm going to show you how to turn this into a Streamlit application where you again get input from the microphone and then show the transcript live on the screen. It's actually not even that hard. First, I need to import streamlit as st, and then I want to add a title, just to see how everything works and whether there are any problems. Of course, I have to run this separately, with `streamlit run real_time_audio_transcription.py`.
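The send/receive logic described in this section can be sketched as follows. The helper names (`encode_audio_chunk`, `extract_final_text`) and the exact structure are my own simplification of what the video pastes in; the payload shape (`audio_data` with base64 content) and the `FinalTranscript` message type follow AssemblyAI's v2 real-time API as described above:

```python
import asyncio
import base64
import json


def encode_audio_chunk(raw_audio: bytes) -> str:
    """Wrap one raw PCM chunk in the JSON payload the endpoint expects."""
    return json.dumps(
        {"audio_data": base64.b64encode(raw_audio).decode("utf-8")}
    )


def extract_final_text(message: str):
    """Return transcript text only for FinalTranscript messages, else None,
    implementing the if-condition filter from the video."""
    result = json.loads(message)
    if result.get("message_type") == "FinalTranscript":
        return result.get("text")
    return None


async def send_receive(stream, _ws, frames_per_buffer: int = 3200):
    """Run the sender and receiver concurrently over one websocket session."""

    async def send():
        while True:  # the video later replaces this with a session-state check
            data = stream.read(frames_per_buffer, exception_on_overflow=False)
            await _ws.send(encode_audio_chunk(data))
            await asyncio.sleep(0.01)

    async def receive():
        while True:
            message = await _ws.recv()
            text = extract_final_text(message)
            if text:  # only print finished, punctuated sentences
                print(text)

    await asyncio.gather(send(), receive())
```

Dropping the `extract_final_text` filter and printing every message instead reproduces the word-by-word partial results shown first in the video.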

### [11:27](https://www.youtube.com/watch?v=5LJFK7eOC20&t=687s) Real Time Audio Transcription

Okay, so it is going crazy: it's printing "None" over and over, and we're not printing any transcripts yet. But let's see... okay, now we're getting full sentences again, so that's good. There are a couple of things I want to do. First, of course, I want to stop it from going crazy, so I'm going to stop the application. Let's go one by one. The first thing is to deal with all those "None"s, and the reason that happens is the asynchronous functions we have. We have a bunch of awaits here: `await asyncio.sleep`, and awaits on send and receive. All of these expressions actually return None, and with Streamlit, if an expression returns something that isn't captured in any way, it just prints it on the screen. So I'm going to create a throwaway variable to capture the return values of these awaits. Let's see if I'm missing anything... there is one there... all right, it looks like I have all of them covered, so let's run the application again and see what happens. This looks good; I'm just going to fix the title.

The next thing I want to do is display whatever is being transcribed on my screen. It's actually quite simple: instead of just `print`, we can use `st.markdown` to show the text. I'm going to check the terminal to see if the sentences are appearing; they are, and they are also appearing here, which is really nice. But there is one problem: this is just going to keep listening to me endlessly (yes, this turned out really poetic: "endlessly"). For example, if I stop this one from running, my asynchronous functions are still running, so it keeps listening to me (yes, it agrees). So I'm going to stop this one; I want a way of starting and stopping listening, a way to control when the application listens to me and creates transcriptions, and when it stops.

For that, we need a way to stop these asynchronous functions from running. As you can see, what happens here is that I'm saying "while True, keep running this part". But I don't want that; I want to control when this runs. One thing we can do is use Streamlit session state to control when this condition should be true and when it shouldn't. So I'm quickly going to create a session state entry and initially set it to False, because I don't want the application to start listening the moment it's first run. The next thing is to add two buttons, side by side, so I'm going to create columns. First we'll have a start button that starts the listening process, and then a stop button. Of course, just having these buttons doesn't mean they start and stop the listening; they need to do something. For that, I'll create callback functions that run and change the session state, and the session state will decide when these loops run. Let me do that: I can copy this one; this one sets it to True, and stop listening sets it back to False, and I need to call them from the buttons. The last thing is to make sure the loops run based on the session state, so I'll change `while True` to a condition on the session state. There is also one other thing: because Streamlit always reruns the application from top to bottom, we don't need the outer loop here anymore. I'll save this, and let's see the change.

All right, I have my buttons here. It doesn't look like it is running yet; it's halting and waiting for us to start listening. Then I click "Start listening"... ah, it started listening already; I can see it here, too. This is very nice. The sentences, by the way, can be quite long, so I'd like to read a bunch of sentences and show you how long they can be. This is the Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow book by O'Reilly: "Neural networks seem to have entered a virtuous circle of funding and progress. Amazing products based on neural networks regularly make the headline news, which pulls more and more attention and funding toward them, resulting in more and more progress and even more amazing products." I guess I left too big a gap here, so it didn't realize those two sentences belong to the same paragraph; basically, it divides the sentences based on the pauses you leave between them, to decide which ones belong together and which don't.

Well, I guess I already spoke a lot here, so now I can stop listening, and it cleans up the workspace for me. If I look here again, I see that the connection is created again but is halted, and if I want to, I can even start listening again. Nice. So now we have an application in our hands that is a little more controlled: it doesn't uncontrollably keep listening to the user, and we can even start it again if we want to. Thanks to AssemblyAI, this was very easy to make; we only needed one endpoint to send the audio we get from the microphone. That was much easier than expected, right? I hope everything was clear, but if you have any questions, do not hesitate to leave a comment and let us know. Apart from that, I hope you enjoyed this video; if you did, maybe give us a thumbs up and subscribe, so that you'll be one of the first to know when we come up with a new video. And before you leave, don't forget to get your free API token from AssemblyAI using the link in the description. Have a nice day, and I'll see you around.
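The start/stop control described in this section can be sketched like this. To keep the toggle logic testable, the callbacks below take the state mapping as an argument; in the actual app you would pass Streamlit's dict-like `st.session_state` (or mutate it directly inside the callbacks). The key name `"run"` and the wiring shown in the comments are my assumptions about the video's code:

```python
def start_listening(state) -> None:
    """Callback for the Start button: lets the send/receive loop run."""
    state["run"] = True


def stop_listening(state) -> None:
    """Callback for the Stop button: halts the send/receive loop."""
    state["run"] = False


# In the Streamlit app, these would be wired up roughly like this
# (st.session_state is dict-like, so the same functions work on it):
#
#   import streamlit as st
#   if "run" not in st.session_state:
#       st.session_state["run"] = False   # don't listen on first load
#   start_col, stop_col = st.columns(2)
#   start_col.button("Start listening", on_click=start_listening,
#                    args=(st.session_state,))
#   stop_col.button("Stop listening", on_click=stop_listening,
#                   args=(st.session_state,))
#
# The transcription loops then check st.session_state["run"]
# instead of `while True`, as described above.
```

Because Streamlit reruns the whole script on every interaction, the callbacks only need to flip the flag; the rerun picks it up and starts or halts the loops.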

---
*Source: https://ekstraktznaniy.ru/video/13329*