# Build your own real-time voice command recognition model with TensorFlow

## Metadata

- **Channel:** AssemblyAI
- **YouTube:** https://www.youtube.com/watch?v=m-JzldXm9bQ
- **Date:** 23.07.2022
- **Duration:** 19:21
- **Views:** 70,677
- **Source:** https://ekstraktznaniy.ru/video/13056

## Description

In this TensorFlow Tutorial we build our own real-time voice command recognition model that can then control a game.

Tutorial + Colab: https://www.tensorflow.org/tutorials/audio/simple_audio
Code: https://github.com/AssemblyAI-Examples/realtime-voice-command-recognition

Get your Free Token for AssemblyAI Speech-To-Text API 👇 https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=yt_pat_49

▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬▬

🖥️ Website: https://www.assemblyai.com
🐦 Twitter: https://twitter.com/AssemblyAI
🦾 Discord: https://discord.gg/Cd8MyVJAXd
▶️  Subscribe: https://www.youtube.com/c/AssemblyAI?sub_confirmation=1
🔥 We're hiring! Check our open roles: https://www.assemblyai.com/careers

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

Timeline:
00:00 Intro
00:52 Build Model: Google Colab Walkthrough
05:09 Save & Download model
08:15 Add our preprocessing code
12:39 Code the final project with microphone input
18:44 Final project testing!!!

Microphone icon created by Freepik

## Transcript

### Intro [0:00]

Welcome, everyone. In today's video we create a speech recognition model with TensorFlow that can recognize keywords, and then we turn this into an actual project that listens to real-time data from your microphone and classifies it. You could use this, for example, for a home automation project or whatever you want. In our case we build a simple demonstration that can control a game.

Let me show you the demo. If I run this, you will see in the output that it classifies the keywords, and I can also move and control the turtle in Python. So I can say: up, down, right, go, left, up, go, stop. You see this worked pretty well, so this is super awesome. Let's get into it. All right, so the code is largely based

### Build Model: Google Colab Walkthrough [0:52]

on this official TensorFlow guide, "Simple audio recognition: Recognizing keywords". You can read through it by yourself, and they also provide a link to the Colab, so we click on this and use it. You should be able to follow me here pretty easily, but most importantly, later you will learn how to turn this into an actual project on our machine.

Let me close the table of contents. The first thing to do is check the runtime type, which should be set to GPU, and now we can click on "Run all". You can do the same, and while this is running, let's go over the notebook very briefly.

The code is based on the Speech Commands dataset, which is publicly available, and it has these commands: down, go, left, no, right, stop, up, and yes. The first thing they do is the imports, of course. Then they download the dataset, check the commands, and access the file names. By the way, if you have a look at the data, you see it is organized like so: we have data/mini_speech_commands, each label is in a separate folder, and inside each folder there are the different WAV files. Then they extract the file names and split them into training, validation, and testing sets.

Then they prepare an example so we can have a look at the audio, and they decode the audio. This step is important later. Here they read the audio from a WAV file, but later, of course, we want to get it directly from the microphone input, so we have to remember what this decode_wav is doing. For example, it says it normalizes the data to be between -1 and 1.
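The folder layout described above means a clip's label can be recovered from its path alone. The following is a minimal sketch of that idea using only the Python standard library, not the notebook's own code. The helper names, the placeholder files, and the 80/10/10 split ratio are assumptions made for illustration.

```python
import random
import tempfile
from pathlib import Path


def label_of(path):
    """The label is the name of the folder that contains the WAV file."""
    return Path(path).parent.name


def list_wav_files(data_dir):
    """Collect every file stored as <data_dir>/<label>/<name>.wav."""
    return sorted(Path(data_dir).glob("*/*.wav"))


def split_files(files, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle once, then cut into train, validation, and test lists.

    The 80/10/10 ratio is an assumption for this sketch.
    """
    files = list(files)
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train_frac)
    n_val = int(len(files) * val_frac)
    return (
        files[:n_train],
        files[n_train:n_train + n_val],
        files[n_train + n_val:],
    )


def make_demo_dataset(root, labels=("go", "stop"), per_label=10):
    """Create empty placeholder .wav files so the sketch runs without a download."""
    for label in labels:
        folder = Path(root) / label
        folder.mkdir(parents=True)
        for i in range(per_label):
            (folder / f"clip_{i}.wav").touch()


with tempfile.TemporaryDirectory() as tmp:
    make_demo_dataset(tmp)
    wav_files = list_wav_files(tmp)
    train_files, val_files, test_files = split_files(wav_files)
    labels_found = sorted({label_of(f) for f in wav_files})
    print(len(train_files), len(val_files), len(test_files), labels_found)
```

With the real dataset you would point list_wav_files at the extracted mini_speech_commands folder instead of a temporary directory.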
Then they get the label. The first thing they do now is turn the WAV file into a waveform and create a dataset from this, and here they print the waveform. The second step is to convert the waveform to a spectrogram, so here they have this helper function get_spectrogram. Then they display this, so we can listen to one sample: "no, no, no".

Then they plot the spectrogram. Here we can see this is the waveform and this is the spectrogram, so basically we have an image now and can use a convolutional neural net to classify it. This is what we do in a moment. Then they plot some things again, and now they build the model.

Here we have this preprocess_dataset function. As we see, it creates a TensorFlow dataset, and the first thing they do is get the waveform and then get the spectrogram, so it applies these two steps for preprocessing. Then we get this for the training, validation, and testing sets.

Then they build the model. They build a Sequential model with an input layer, then they downsample the input, then they have a normalization layer. Like I said, we have an image now, so we can apply convolutions: Conv2D layers mixed with max pooling. In the end we flatten this and apply linear layers, so Dense layers for the classification, and at the very end we have a Dense layer with num_labels outputs.

Then we get the summary, then we say model.compile and model.fit, and now this is training. It's already done by now. Then we can plot the history and evaluate the model, and it gets 85 percent on the test set. They also plot the confusion matrix, and then they run inference on an audio file, which is also important for us. In the end it looks like this, and you see what the prediction looks like for one sample. So now let's build on top of this.
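To make the "waveform becomes an image" step concrete, here is a rough sketch of a magnitude spectrogram computed with a short-time Fourier transform. It uses plain NumPy rather than the notebook's TensorFlow helper, so it is a stand-in and not the get_spectrogram function itself. The frame length of 255, the step of 128, and the symmetric Hann window are assumptions, so the numbers will not match TensorFlow's output exactly.

```python
import numpy as np

FRAME_LENGTH = 255  # assumed window size in samples
FRAME_STEP = 128    # assumed hop between windows in samples
FFT_LENGTH = 256    # next power of two above FRAME_LENGTH


def spectrogram_sketch(waveform, num_samples=16000):
    """Return a (frames, frequency_bins, 1) magnitude spectrogram."""
    waveform = np.asarray(waveform, dtype=np.float32)[:num_samples]

    # Zero-pad short clips so every clip has the same length.
    padded = np.zeros(num_samples, dtype=np.float32)
    padded[:len(waveform)] = waveform

    # Slice the signal into overlapping frames.
    num_frames = 1 + (num_samples - FRAME_LENGTH) // FRAME_STEP
    starts = FRAME_STEP * np.arange(num_frames)
    frames = np.stack([padded[s:s + FRAME_LENGTH] for s in starts])

    # Window each frame, take the FFT, and keep only the magnitude.
    window = np.hanning(FRAME_LENGTH)
    spectrum = np.fft.rfft(frames * window, n=FFT_LENGTH, axis=-1)
    magnitude = np.abs(spectrum).astype(np.float32)

    # Add a trailing channel axis so the result looks like a grayscale image.
    return magnitude[..., np.newaxis]


# One second of a 440 Hz tone sampled at 16 kHz.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000).astype(np.float32)
spec = spectrogram_sketch(tone)
print(spec.shape)  # (124, 129, 1)
```

The trailing channel axis is what lets convolution layers treat the spectrogram like a one-channel image.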

### Save & Download model [5:09]

Until now we have used the original Colab, but now we have to add a few things. First we want to save the model and then download it from the Colab to our machine. Then we also have to change the pipeline slightly, because we no longer load the audio from a WAV file but want to use our microphone input directly.

First let's show how to save and download the model. For this I also want to print the label prediction here, to verify later that the loaded model gives the same prediction. So we say label_pred = np.argmax(prediction, axis=1), and to print it we have to index commands with label_pred. If we run the cell again, you see the prediction is "no".

Now let's save the model. For this we add a new cell, and this is super simple with TensorFlow and the Keras API. We only say model.save and give it a name, saved_model. If we execute this cell and click on "Files", we see that this created a whole folder. This is what we have to download, but we can't download a folder like this, and that's why we have to convert it to a zip file.

So we add another cell, click on "Code", and run a shell command: an exclamation mark, then zip -r, then the name saved_model.zip, and we zip the ./saved_model folder. If we run this, it should add a zip file. If we click on refresh, we have the zip file here, and now we can click on it and download it to our machine. This is how we can extract data from the Colab. You should now see saved_model.zip in your downloads folder.

Next I want to add another cell and verify that the loaded model works. We can say loaded_model = models.load_model, and the name was saved_model, so this should now be the loaded model. Now we copy this part, I think we only need this, add another cell, click on "Code", and here we use the loaded model. If we run this, we should get the very same result: "no", and also the same plot. So saving and loading works. Now comes the tricky part.
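The zip-and-extract round trip can also be done from Python instead of the shell. This is only a sketch: shutil.make_archive stands in for the zip -r command used in the video, and the folder contents here are dummy placeholder files, not a real saved model. The helper names and the colab/project directories are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def zip_folder(folder, archive_stem):
    """Zip `folder` (keeping its own name as the top-level entry)."""
    folder = Path(folder)
    archive = shutil.make_archive(
        str(archive_stem), "zip",
        root_dir=str(folder.parent), base_dir=folder.name,
    )
    return Path(archive)


def unzip_to(archive, target_dir):
    """Extract the archive into `target_dir`."""
    shutil.unpack_archive(str(archive), str(target_dir))
    return Path(target_dir)


with tempfile.TemporaryDirectory() as tmp:
    colab = Path(tmp) / "colab"      # stands in for the Colab file system
    project = Path(tmp) / "project"  # stands in for the local project folder

    # Placeholder content only; a real run would have model.save output here.
    model_dir = colab / "saved_model"
    (model_dir / "variables").mkdir(parents=True)
    (model_dir / "placeholder.bin").write_bytes(b"not a real model")
    (model_dir / "variables" / "placeholder.bin").write_bytes(b"dummy")
    project.mkdir()

    archive = zip_folder(model_dir, colab / "saved_model")
    unzip_to(archive, project)

    archive_name = archive.name
    extracted = sorted(
        p.relative_to(project).as_posix()
        for p in project.rglob("*") if p.is_file()
    )
    print(archive_name, extracted)
```

After extraction the project folder contains a saved_model directory again, which is the name passed to models.load_model in the video.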

### Add our preprocessing code [8:15]

We no longer want to use the built-in TensorFlow method to load the WAV files; instead we want to work directly with a NumPy array that we get from the microphone input. Inside the Colab we cannot use the microphone, so here I want to simulate this and get the NumPy array differently.

I add another cell, and then we use the built-in wave module to load the frames. We open the file, get the number of frames, read the frames, and then turn this into a NumPy array by using np.frombuffer on the wave data. If we print, for example, the shape, we see we have 16,000 samples in here.

The next step is to verify the pipeline with this array instead of using the built-in decode method. So let's add another cell, and here we want to get the waveform. For this, let's have a look at the preprocess_dataset function again. They do two steps: get_waveform_and_label and then get_spectrogram_and_label_id. We don't care about the label here because we only do inference, but let's have a look at what get_waveform_and_label is doing. Here they decode the audio, and they use the built-in TensorFlow tf.audio.decode_wav method. If we hover over this, there is the documentation, and this is very important: it says that signed 16-bit values will be scaled to the range -1 to 1.
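Reading a WAV file into a NumPy array with the standard wave module, as described above, might look like the following sketch. Since no dataset file is assumed to be on disk, it first writes a one-second test tone; the write_test_tone helper and the 440 Hz tone are illustrative assumptions, while the read path follows the steps from the video.

```python
import tempfile
import wave
from pathlib import Path

import numpy as np


def write_test_tone(path, rate=16000, seconds=1.0, freq=440.0):
    """Write a mono 16-bit PCM sine tone, standing in for a dataset clip."""
    t = np.arange(int(rate * seconds)) / rate
    samples = (0.5 * 32767 * np.sin(2 * np.pi * freq * t)).astype("<i2")
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes per sample = 16 bit
        wf.setframerate(rate)
        wf.writeframes(samples.tobytes())


def read_wav_as_int16(path):
    """Read all frames and view the raw bytes as 16-bit integers."""
    with wave.open(str(path), "rb") as wf:
        n_frames = wf.getnframes()
        raw = wf.readframes(n_frames)
    return np.frombuffer(raw, dtype=np.int16)


with tempfile.TemporaryDirectory() as tmp:
    wav_path = Path(tmp) / "tone.wav"
    write_test_tone(wav_path)
    signal_array = read_wav_as_int16(wav_path)

print(signal_array.shape, signal_array.dtype)  # (16000,) int16
```

The values are still raw integers at this point, which is why the scaling step matters before the array goes anywhere near the model.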
So we have to remember the maximum value here, 32,768, and we have to normalize our array in the same way, otherwise the results will be very much off. Let's go down again to our cell. We can do this very easily: we only have to say signal_array divided by this value. Then we want to convert this into a tensor, which we can do with tf.convert_to_tensor, passing the waveform and a data type, let's use tf.float32. Now we should have this as a tensor and in the correct format, so the values should be between -1 and 1.

The second step was to get the spectrogram. So we can say spec = get_spectrogram(waveform), using the get_spectrogram helper from the Colab. Now we have to be careful: we have to expand the dimensions and add one dimension for the batch. We can do this by saying tf.expand_dims(spec, 0), and now this is in the correct shape.

Now we can get the prediction by simply calling the loaded model again with the spec. If we print the prediction, this should be the very same values that we've seen before. We didn't print the prediction earlier, but if you do, the values should be the same, so you can check this again. Let's also copy this part, because we want to get the label prediction, print it, and plot it. Now let's run this cell, and you see the plot looks the same and we get "no". So this now works the same way with a NumPy array as input.

You see these steps here are essential. We basically need to apply this code on our machine, and we also need to copy the helper function get_spectrogram from this Colab. So let's do this. I already prepared the project.
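The three small steps around the model call (scale by 32768, add a batch dimension, pick the most likely command) can be sketched in plain NumPy. This is a stand-in so the example runs without TensorFlow; in the video the same steps use tf.convert_to_tensor and tf.expand_dims. The spectrogram shape, the fake model output, and the command order below are assumptions; use the order printed in your own Colab.

```python
import numpy as np

INT16_SCALE = 32768.0  # magnitude of the most negative 16-bit sample

# Assumed order for illustration; copy the real order from your notebook.
COMMANDS = ["down", "go", "left", "no", "right", "stop", "up", "yes"]


def normalize_int16(signal_array):
    """Scale 16-bit integer samples to float32 values in [-1, 1)."""
    return np.asarray(signal_array, dtype=np.float32) / INT16_SCALE


def add_batch_dim(spec):
    """Turn one spectrogram into a batch that contains one spectrogram."""
    return np.expand_dims(spec, axis=0)


def decode_prediction(prediction, commands=COMMANDS):
    """Pick the command with the highest score for the first batch item."""
    label_pred = np.argmax(prediction, axis=1)
    return commands[label_pred[0]]


samples = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
waveform = normalize_int16(samples)

# Placeholder spectrogram with an assumed (time, frequency, channel) shape.
fake_spec = np.zeros((124, 129, 1), dtype=np.float32)
batch = add_batch_dim(fake_spec)

# Made-up scores standing in for the output of loaded_model(batch).
fake_prediction = np.array([[0.1, 0.2, 0.0, 3.5, 0.3, 0.1, 0.0, 0.4]])
command = decode_prediction(fake_prediction)
print(waveform, batch.shape, command)
```

Skipping the division is the easy mistake here: the model would still return a prediction, just a meaningless one, because it was trained on values in the -1 to 1 range.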

### Code the final project with microphone input [12:39]

What you have to do is copy the saved_model zip file into the project directory and extract it there. I also recommend creating a virtual environment, and we have to install TensorFlow and PyAudio. If you're not familiar with PyAudio, we have another tutorial here on our channel that talks about it in more detail.

Then I created some helper files, so let's go over them very briefly. I also put them on GitHub, and the link will be in the description. I created one helper function, preprocess_audiobuffer, and this is doing exactly what we just did in the last step: we normalize the audio buffer, convert it to a tensor, call get_spectrogram, and return the result. We also have to copy the get_spectrogram function from the Colab, so this is the exact same code. And just in case, I also set the seed value.

Then we have a recording helper that uses PyAudio. Here I create one function, record_audio. I talked about this in several other tutorials already, so I'm not going into more detail here. What's important is that we record for one second, and if we do the math with these frames per buffer and this rate, we end up with 16,000 samples, exactly like our training samples. In the end we again use np.frombuffer on the joined frames, so now we have this as a NumPy array. I also have a helper function to terminate PyAudio again.

Now let's go over to main, and here we import everything. We import numpy as np, and from tensorflow.keras we import models. Then from the recording helper we import record_audio and the terminate function, and from the TensorFlow helper we import preprocess_audiobuffer.

Then we get the commands by saying commands equals, and now we get this list from the Colab. This is important: make sure that it is in the correct order. Let's copy and paste this, and don't forget to put the commas in. If you run the notebook yourself, then due to the random element there might be a different order, so make sure to use the same order that you have in your Colab.

Now let's load the model, so we say loaded_model = models.load_model, and the name is saved_model. Then let's create a helper function predict_mic, and here we call everything. We say audio = record_audio(), then spec = preprocess_audiobuffer(audio), then we call the model and get the prediction by calling loaded_model with the spec. Then we get the label prediction with label_pred = np.argmax(prediction, axis=1), and we get the command by saying command = commands[label_pred[0]]. Then let's print "Predicted label:" and the command, so here we need a comma, and we also want to return the command from this function.

Then we say if __name__ == "__main__", and we want to run this as a while True loop. So we say while True, then command = predict_mic(), and if command equals "stop", we call the terminate function to close PyAudio and then we break. Hopefully this will work, so now we can say python main.py. Make sure to activate your virtual environment. Up, up, down, up, right, left, left, go, go, right, down, stop. All right, so this worked.

The last thing to do is to add a turtle. For this, again, I have a helper file, turtle_helper, that initializes the turtle and sets some settings. Then I have helper functions to go right, up, left, or down, depending on the current direction we are facing, so here we simply turn the turtle. And then we have a move_turtle function that puts everything together: here we call go up, down, left, or right, and if we say "go", we move the turtle forward. If we say "stop", we simply stop, and in main we then actually break. So here we can import this: we say from turtle_helper import move_turtle, and then we only call move_turtle with the command. This is all we need.
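The control flow of the main loop can be tried out without a microphone or a graphics window. The sketch below replaces predict_mic with a scripted list of commands and the real turtle with a tiny headless stand-in. HeadlessTurtle, run_loop, and the one-unit step size are hypothetical names and choices for this illustration, not the project's turtle_helper.

```python
# Headings in degrees, following the turtle module's standard mode
# (0 = east/right, 90 = north/up).
HEADINGS = {"right": 0, "up": 90, "left": 180, "down": 270}
STEPS = {0: (1, 0), 90: (0, 1), 180: (-1, 0), 270: (0, -1)}


class HeadlessTurtle:
    """Tracks heading and position without opening a window."""

    def __init__(self):
        self.heading = 0
        self.position = (0, 0)

    def move(self, command):
        if command in HEADINGS:
            # Direction words only turn the turtle.
            self.heading = HEADINGS[command]
        elif command == "go":
            # "go" moves one unit along the current heading.
            dx, dy = STEPS[self.heading]
            x, y = self.position
            self.position = (x + dx, y + dy)
        # Any other word ("stop", "yes", "no") leaves the turtle unchanged.


def run_loop(predict, move, on_stop=lambda: None):
    """Keep predicting and moving until the command "stop" is heard."""
    history = []
    while True:
        command = predict()
        history.append(command)
        move(command)
        if command == "stop":
            on_stop()  # the place to close the audio stream
            break
    return history


# Scripted commands stand in for live microphone predictions.
scripted = iter(["up", "go", "go", "right", "go", "stop"])
turtle_state = HeadlessTurtle()
history = run_loop(lambda: next(scripted), turtle_state.move)
print(history, turtle_state.position)
```

Passing the predict and move functions in as arguments keeps the loop itself testable; in the real project these would be predict_mic and move_turtle.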

### Final project testing!!! [18:44]

Now we can run our turtle, so let's again call python main.py. Up, left, right, down, go, right, go, up, go, stop. All right, this worked perfectly.

I hope you enjoyed this project. If you did, then please drop us a like and consider subscribing to our channel. Again, the resources will be in the description below, and I hope to see you in the next one. Bye!