Coding Challenge 188: Voice Chatbot

39:27

The Coding Train · 27.04.2026 · 15,601 views · 734 likes

Video description

In this coding challenge, I build a conversational voice chatbot entirely in the browser with p5.js. I combine three pieces: speech-to-text with OpenAI's Whisper model, text-to-speech with Kokoro TTS, and a "brain" for the bot. I also explore the transformers.js pipeline API and the Web Audio API. For the bot's brain, I start with a simple ELIZA-style therapist, then incorporate a RiveScript number-guessing game, and finally a local LLM. Code: https://thecodingtrain.com/challenges/188-voice-chatbot 🚀 Watch this video ad-free on Nebula https://nebula.tv/videos/codingtrain-coding-challenge-188-voice-chatbot p5.js Web Editor Sketches: 🕹️ LLM Chatbot: https://editor.p5js.org/codingtrain/sketches/RHhT9I4Nm 🕹️ Number Guessing Bot: https://editor.p5js.org/codingtrain/sketches/AJw7zMN9q 🕹️ Therapy Bot: https://editor.p5js.org/codingtrain/sketches/37LFEPUVV 🕹️ Model Loading Bars: https://editor.p5js.org/codingtrain/sketches/E9Ob3x8eJ 🕹️ Waveform of Recording: https://editor.p5js.org/codingtrain/sketches/cck49wDub 🕹️ Real Time Waveform: https://editor.p5js.org/codingtrain/sketches/aaRIT-x6a 🎥 Previous: https://youtu.be/g3-PXyF8U70?list=PLRqwX-V7Uu6ZiZxtDDRCi6uhfTH4FilpH 🎥 All: https://www.youtube.com/playlist?list=PLRqwX-V7Uu6ZiZxtDDRCi6uhfTH4FilpH References: 📓 p5.2 Reference: https://beta.p5js.org 📓 Introducing Whisper: https://cdn.openai.com/papers/whisper.pdf 📓 Model Cards for Model Reporting: https://arxiv.org/abs/1810.03993 📓 Open Neural Network Exchange: https://onnx.ai 📓 Onnx-community Whisper-tiny.en model: https://huggingface.co/onnx-community/whisper-tiny.en 📓 Xenova: https://github.com/xenova 📓 Transformers.js: https://huggingface.co/docs/transformers.js/installation 📓 Announcing the new p5.sound.js library!: https://medium.com/processing-foundation/announcing-the-new-p5-sound-js-library-42efc154bed0 📓 getUserMedia() documentation: https://developer.mozilla.org/en-US/docs/Web/API/MediaDevices/getUserMedia 📓 MediaRecorder() documentation: 
https://developer.mozilla.org/en-US/docs/Web/API/MediaRecorder 📓 Kokoro Repo: https://github.com/hexgrad/kokoro 📓 KokoroTTS Model: https://huggingface.co/hexgrad/Kokoro-82M 📓 ELIZA: https://en.wikipedia.org/wiki/ELIZA 📓 Rivescript: https://www.rivescript.com 📓 SmolLM3: https://huggingface.co/HuggingFaceTB/SmolLM3-3B 📓 Running models on WebGPU: https://huggingface.co/docs/transformers.js/guides/webgpu 📓 Using quantized models (dtypes): https://huggingface.co/docs/transformers.js/v3.8.1/guides/dtypes Videos: 🚂 https://youtu.be/0Ad5Frf8NBM 🚂 https://youtu.be/KR61bXsPlLU Live Stream Archives: 🔴 https://www.youtube.com/watch?v=KRDJAHArqaw Related Coding Challenges: 🚂 https://youtu.be/eGFJ8vugIWA 🚂 https://youtu.be/8Z9FRiW2Jlc 🚂 https://youtu.be/iFTgphKCP9U Timestamps: 0:00:00 Hello! 0:00:35 Mapping out the pieces: speech-to-text, text-to-speech, and the brain 0:01:07 Thoughts on AI and creative exploration 0:02:44 Choosing the tools: Whisper and Kokoro TTS 0:04:06 Building a push-to-talk UI in p5.js 0:04:51 Finding models on Hugging Face with Transformers.js 0:05:36 About the Whisper model and model cards 0:06:55 Loading the Whisper pipeline in p5.js 0:09:04 Accessing the microphone with getUserMedia 0:10:44 Capturing audio with MediaRecorder 0:12:05 Processing audio chunks into a waveform 0:15:55 Speech-to-text working! 0:16:36 Building the chatbot brain (ELIZA-style therapist) 0:18:50 Setting up Kokoro TTS for text-to-speech 0:21:07 Playing synthesized audio with AudioBufferSource 0:23:41 Text-to-speech working! 0:25:32 Handling playback events 0:26:56 Swapping in a RiveScript number-guessing brain 0:31:22 Adding a language model (SmolLM2) as the brain 0:38:33 Final demo: the random number chatbot 0:39:03 Goodbye! Editing by Mathieu Blanchette Animations by Jason Heglund Music from Epidemic Sound 🚂 Website: https://thecodingtrain.com/ 👾 Share Your Creation! 
https://thecodingtrain.com/guides/passenger-showcase-guide 🚩 Suggest Topics: https://github.com/CodingTrain/Suggestion-Box 💡 GitHub: https://github.com/CodingTrain 💬 Discord: https://thecodingtrain.com/discord 💖 Membership: http://youtube.com/thecodingtrain/join 🛒 Store: https://standard.tv/codingtrain 🖋️ Twitter: https://twitter.com/thecodingtrain 📸 Instagram: https://www.instagram.com/the.coding.train/ 🎥 https://www.youtube.com/playlist?list=PLRqwX-V7Uu6ZiZxtDDRCi6uhfTH4FilpH 🎥 https://www.youtube.com/playlist?list=PLRqwX-V7Uu6Zy51Q-x9tMWIv9cueOFTFA 🔗 p5.js: https://p5js.org 🔗 p5.js Web Editor: https://editor.p5js.org/ 🔗 Processing: https://processing.org 📄 Code of Conduct: https://github.com/CodingTrain/Code-of-Conduct This description was auto-generated. If you see a problem, please open an issue: https://github.com/CodingTrain/thecodingtrain.com/issues/new #texttospeech #speechtotext #chatbot #rivescript #llms #agents #ai #transformersjs #webaudioapi #javascript #p5js

Contents (21 segments)

Hello!

"How many blueberries should I eat for dessert?" "The random number for eating blueberries is 13." Now, this is a chatbot I can get behind. Hi everybody. I am here to make a coding challenge and today I'm going to attempt to make a conversational voice chatbot right here in a p5.js sketch. So, I am asking the question to you. If you could talk to your sketch and your sketch could talk back to you, what kinds of projects would you make? Let me map out for you the pieces of this project as I envision them. We need some system for

Mapping out the pieces: speech-to-text, text-to-speech, and the brain

speech to text. By speech to text, I mean I speak, the computer listens and converts it into text. Then we need a system for text to speech. By that, I mean I have a string of text and I want it to be spoken by an automated voice. Then, I need a brain, I'll say, a brain for the bot. How is the bot going to process the inputs that it gets from speech to text and generate the outputs that it sends to text to speech?

Thoughts on AI and creative exploration

So, this is where we have to have a talk about AI. I want to put some more cogent thoughts maybe into an entire video just about my approach to working with AI and machine learning models on this channel. I've been doing it for years already. I have many videos about what a neural network is. I have videos about the ml5.js library and I plan to make many more of those. I'm trying to find my way, carve out a place for myself in this area, dip my toe in, but not jump all the way in and drown. At present, my goal is, number one, to demystify the technology. I want to look at how things work and be able to put the technology into the hands of viewers like you to be able to have agency and understanding, which is needed, I think, in this world that we're living in where AI models seem to be just popping up everywhere. The other thing is I am personally curious to investigate what's possible for the individual to do on consumer hardware with open-source models, with your own data. For me, I think the best way to learn about something is to learn by doing and I think also the best way to examine and be critical of something is through creative play, art, expression. So, that's my question here. If we look at a machine learning model for speech to text, text to speech, maybe even a language model, putting all those things together in a p5.js sketch, what creative possibilities does that unlock for you, the viewer? Okay, that's just the way I'm thinking of navigating it today, but I'm going to put that to the side and start building this project. So, that

Choosing the tools: Whisper and Kokoro TTS

begs the question, what am I going to use for all of these things? For speech to text, I'm going to use a particular model called Whisper. This model was developed by OpenAI, but maybe from a different time. It is an open-source model that can be run locally on your computer, even in the browser. You'll see that. For text to speech, I'm going to use something that's come out pretty recently. Thank you to Xenova from Hugging Face, the creator of transformers.js, who pointed me to this model. It's called Kokoro TTS. Then, the brain is the big question. To me, this is almost the least important thing for me to do. This is where I ask you to be creative. You don't need a large language model to be the brain of your chatbot. On my channel, I've looked at Markov chains, context-free grammars, pattern matching systems like RiveScript. So many possibilities there. So, this is where I want you to think about it and be as creative as you can if you build anything on top of what I'm going to attempt to do in this challenge today. Okay, I don't know that I needed the whiteboard, but I have this new whiteboard, so I had to use it. The other thing I should mention is that I am using p5.js 2.0 and I'll be making use of features of p5.js 2.0 like support for async and await. So, if that's all new to you, check out my video about that. Let's start by building a very crude interface. I'm going to make this a

Building a push-to-talk UI in p5.js

push-to-talk system. So, I need mouse pressed and mouse released. Should it be red when you're pushing and then green when you release? I'm not sure, but I'll just put something in. Okay, the idea here is that when I press the mouse, hello, chatbot thing, and then when I release, it's going to transcribe what I just said. So, that's got to be step one and I have just, again, a very crude interface. Color it red for mouse pressed, color it green at mouse released. Another prerequisite for this video is my introduction to
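A minimal sketch of the crude push-to-talk interface described above. The canvas size and exact color values are assumptions; the transcript only says red while the mouse is pressed and green once it is released.

```javascript
// Crude push-to-talk interface for a p5.js sketch.
// Canvas size and RGB values are arbitrary choices for illustration.
function setup() {
  createCanvas(400, 400);
  background(0, 200, 0); // green: idle
}

function mousePressed() {
  background(200, 0, 0); // red: listening while the mouse is held down
}

function mouseReleased() {
  background(0, 200, 0); // green: released, time to transcribe
}
```

There is no draw() loop here, so each background() call simply stays on screen until the next mouse event.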

Finding models on Hugging Face with Transformers.js

transformers.js, where I looked at just how the ecosystem of Hugging Face models and transformers.js, the JavaScript library, works. But I'm just going to jump right into that. So, on the Hugging Face website, I'm going to look for the model I want to use. Let's go under models. I'm going to look under tasks and look at this, text to speech. No, no, I want speech to text. It's actually here under automatic speech recognition, so let me click that. Then, I need to make sure whatever model I find is compatible with transformers.js, so I'm going to go to libraries, transformers.js, and ooh, now we have a lot of options. First, let me just say a couple words about Whisper. Again, Whisper, released

About the Whisper model and model cards

by OpenAI back in September 2022. This is a fully open model with the paper, the code, and something called a model card, which anytime you're using a machine learning model is something you absolutely want to read and review first. Beyond even reading the model card, might I suggest you read the paper "Model Cards for Model Reporting." This paper proposes a framework that encourages transparent model reporting. Here's the Whisper model card and an important thing to note here is it comes in various sizes. So, for me, I'm looking for the smallest one, tiny. The size has to do with the number of connections essentially inside of the neural network architecture. Beyond the scope of this video, but I'll point you to additional resources to understand more about that in this video's description. So, I'm poking around here to find the one that I want to use. I think I'm going to go look under ONNX Community. Actually, let's filter. So, ONNX stands for Open Neural Network Exchange. It's a standardized format for storing the weights of a neural network and it just so happens to be compatible with JavaScript and transformers.js, the library in particular. So, I'm going to use whisper-tiny.en because I'm only going to speak to it

Loading the Whisper pipeline in p5.js

right now in English and that'll get me the smallest model for doing this today in this coding challenge. Copy that model path and come back to p5. Okay, I've got to import the transformers.js library and the way I'm going to do that is by making setup an async function and then I'm going to get the transformers.js pipeline from the library. So, I'm going to call a special function in JavaScript called import, which is a fancy new way to bring in libraries and I need the path to the library. Okay, found it on the installation page for transformers.js. So, now I've imported the library, or at least the most important part of it, into this object, which is actually a function called pipeline, and I'm going to use that to create a machine learning pipeline. Pipeline being the word because the data is coming into it and then the results of processing that data flow out of it. And I want to do speech to text, so let me make up a variable, I'll call it transcriber, and I'm going to await the creation of that pipeline. And the two key things I need for any given pipeline are the name of the model that I want to use and the task. Or I should have said it in the other order. I need the task and then the specific model to execute that task. So, the task is automatic speech recognition, I think that's what it was called, and the model is this whisper-tiny.en. I also want to specify the device to run the model on and in this case, WebGPU, the browser's connection to the graphics hardware, is the device that the model will run best on. It's really important for me to note, even though I'm going to turn on my microphone and have this p5 sketch listen to me, my audio is never going to leave my computer. It's all going to be processed locally on this computer inside the browser. The model is being pulled from the cloud, but the model runs inside the browser locally. So, now we have to have a talk about how to get
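The loading step just described might look roughly like this. The CDN URL is an assumption (the transformers.js installation page lists the exact versioned path), and the `importer` parameter is a hypothetical addition so the function can be tried without a network connection; the task name, model id, and `device` option are the ones named in the transcript.

```javascript
// Assumed CDN path; check the transformers.js installation page for the
// exact, versioned URL.
const TRANSFORMERS_URL = "https://cdn.jsdelivr.net/npm/@huggingface/transformers";

let transcriber; // global, so the recording callbacks can reach it later

// `importer` is a hypothetical parameter; by default it is a dynamic import().
async function loadTranscriber(importer = (url) => import(url)) {
  const { pipeline } = await importer(TRANSFORMERS_URL);
  // Task first, then the model that performs it, then options.
  transcriber = await pipeline(
    "automatic-speech-recognition",
    "onnx-community/whisper-tiny.en",
    { device: "webgpu" }
  );
  return transcriber;
}

// In p5.js 2.0, setup() can be async and await the model load.
async function setup() {
  createCanvas(400, 400);
  await loadTranscriber();
}
```

Only the model files are downloaded; once loaded, the pipeline runs in the browser and the recorded audio stays on the local machine.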

Accessing the microphone with getUserMedia

access to the microphone. So, there is a p5.sound library, incredible library. There's actually a new version of the library fully compatible and designed for p5.js 2.0. I would really like to make some video tutorials about it and somebody in the future who's watched this video should bug me and remind me about that. For now, what I think I'm going to do, because I've got to figure out a bunch of things in order to process the audio into the right format to go into the Whisper model, is use the native Web Audio API, and then hopefully, by the time you're watching this, there will be a version of whatever example I make here working with the p5.sound library as well. So, there should be two code examples in this video's description. I don't know, you tell me. You're watching this now, presumably, in the future. So, I need to look up, it's called navigator.mediaDevices. I think the function's called getUserMedia. This, I believe, is the code to open up a connection to the microphone. So, I'm going to grab this and let's make a global variable called mic and right here I'm going to say mic equals await navigator.mediaDevices.getUserMedia with the constraints. So, what I want I believe is audio true. Think that's the property. Let me just run this sketch. Did it ask me for the microphone? Ah, well, it didn't ask me. I've already given it permission. But, we can see the microphone is in use. Now, the Whisper model is designed to process an audio file essentially. This is something that we can talk about later. There's probably a way we could finesse this to have it always be listening and detect pauses. But, again, I'm going to build a push-to-talk. Listen to me, stop listening to me. So, to capture audio between the start and stop listening, I need a media recorder.

Capturing audio with MediaRecorder

And again, this is all stuff that the p5.sound library wraps, but it's good for us in this video to learn about how it works underneath the hood. So, I think I could just make another global variable called recorder. And then I can say recorder equals new MediaRecorder mic. Now, I need to look for the events. Start, stop. Okay, those I need and then I need data available. This is what I need. As there's data available, I need to collect all the audio data into an array because the array is what I'm going to pass to the Whisper model. This might be the hardest part of the whole thing, to be honest. So, let's grab this. And then let's make another global variable called chunks, like the audio chunks. I feel like that's a standard variable name people use, audio chunks. Then, as data comes in, push that data into the audio chunks array. And when I press the mouse, that's when I just call recorder.start. And then when I do mouse released, recorder.stop. And then let's look at the audio chunks just to make sure this is working. Hello. I am speaking in chunks of audio. Great, we got a big blob of audio. Perfect. I think I could just process
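Putting the microphone and recorder steps together, a minimal sketch could look like the following. `getUserMedia`, `MediaRecorder`, and the `dataavailable` event are standard browser APIs; the function name `setupRecorder` is a hypothetical wrapper (in the video this code sits directly in setup).

```javascript
let mic;               // MediaStream from the microphone
let recorder;          // MediaRecorder wrapping that stream
let audioChunks = [];  // Blobs collected while recording

// Hypothetical helper; call it from an async setup().
async function setupRecorder() {
  // Ask for an audio-only stream (the browser prompts for permission once).
  mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  recorder = new MediaRecorder(mic);
  // Each "dataavailable" event carries a Blob of encoded audio.
  recorder.ondataavailable = (event) => audioChunks.push(event.data);
}

function mousePressed() {
  recorder.start(); // begin capturing while the mouse is held
}

function mouseReleased() {
  recorder.stop(); // stop capturing; the data arrives via ondataavailable
}
```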

Processing audio chunks into a waveform

the audio here, but recorder.stop, it might actually not be done instantaneously. Yeah, I want the stop event. And that's when I want to process the audio. So, recorder.onstop. Now, I'm kind of mixing different JavaScript styles. I'm using this arrow syntax and anonymous function right there for ondataavailable, but I think I'm going to have to do more stuff in onstop. So, let's just set that equal to a function that I'm going to call process audio. So, let's look at what I have in the chunks again. Hello. It's an array with a blob in it. I want this array buffer. I have to turn it into a waveform. That's what I remember from Xenova's examples that he shared with me before I started this coding challenge. So, I think I can call this array buffer function. But, what if the audio chunks has more than one blob in it? That's my issue. I think I can make a blob out of however many things are in there. Let's try this. Hello. Now, I took that array, which had a blob in it, and made it a blob. I think that's good. I think that'll protect me better. And now I could call array buffer. Does that need await? Let's find out. Hello. Yeah, it's a promise, meaning I've got to await it and async it. Hello. So, I need to convert it into a waveform. And to do that, I need the audio context because there's a decode audio data function. See, you can take an array buffer and decode it into something. The raw audio data is just numbers, but we've got to reformat it into a way that the model's going to understand. But, where do I call this function? Yeah, audio context. So, I need to get the audio context first. So, let's make a variable. Let's call it ctx. And at the beginning, or right before I do this, let's make a new audio context for this whole system. And something that I know also from Xenova's examples is that I need to set the sample rate. I believe that this particular Whisper model requires the audio to have a 16 kilohertz sample rate.
16,000 is 16 kilohertz. So, now I should be able to get the decoded audio from the context: decode audio data with the array buffer. And I have to await that. Let's look at the decoded audio. Cuz I think the waveform is in there. Hello. Okay, I've got an audio buffer with a sample rate. It was a very short duration. It's got one channel. Get channel data. That's it. It's only got one channel, so I need to get the channel data from channel one or I think it's channel zero. Let waveform equals await decodedAudio.getChannelData zero. Console log. Let's look at the waveform. Ahoy. Yeah. Look, here's all the raw audio. We could graph it. I bet you we could graph this and have a nice little wave. Ooh, we could have cool visualizations of me talking. Okay, but that's not the point of this video. I'm not going to do that. I think I could just do this now. Let text equals await.
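The chain from recorded chunks to a waveform can be summarized as below. `Blob`, `arrayBuffer()`, `decodeAudioData()`, and `getChannelData()` are standard Web APIs; the function name `chunksToWaveform` and passing `ctx` as a parameter are assumptions made so the example stands alone (the video uses global variables).

```javascript
// 16,000 samples per second (16 kHz), the rate mentioned for this Whisper model.
const SAMPLE_RATE = 16000;

// `ctx` is expected to be an AudioContext created once, for example:
//   const ctx = new AudioContext({ sampleRate: SAMPLE_RATE });
async function chunksToWaveform(audioChunks, ctx) {
  // Merge however many recorded Blobs there are into a single Blob.
  const blob = new Blob(audioChunks);
  // arrayBuffer() returns a promise, so it has to be awaited.
  const arrayBuffer = await blob.arrayBuffer();
  // Decode the encoded recording into raw samples at the context's sample rate.
  const decodedAudio = await ctx.decodeAudioData(arrayBuffer);
  // Mono recording: channel 0 holds the Float32Array waveform.
  return decodedAudio.getChannelData(0);
}
```

Hooked up to the recorder, this would run inside the `onstop` handler, since the final chunk is only guaranteed to have arrived once the stop event fires.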

Speech-to-text working!

Guess what? The transcriber just needs the waveform. Console log the text. Okay, ready for this? I think we've got it. Freshen your drink, guv'nor. Oh, had an error. That was so sad. Transcriber is not defined. Ah, that needs to be a global variable the way that I've written this code. Again, maybe not best practice, but let's try this again. Choo choo. Speech to text. Done. Okay. Ah, actually we have to do this out of order. Now, we need the brain to process it.

Building the chatbot brain (ELIZA-style therapist)

One of the most famous early chatbots was a chatbot named ELIZA. ELIZA is an early natural language processing computer program developed from 1964 to 1967 at MIT by Joseph Weizenbaum. I've talked about this before in previous videos. There's something called the Artificial Intelligence Markup Language, which was created in '95. Aha. ALICE was an extended ELIZA, which stood for Artificial Linguistic Internet Computer Entity. These systems were pattern matching systems, looking for a particular pattern in what a user was saying and then having a predefined response for that pattern. And those predefined responses could include variables and conditionals and all the traditional tools we have in programming. RiveScript is a simple scripting language for chatbots that you can continue to use now. I highly recommend it. I have videos about it. Maybe I'll bring a RiveScript example into this one in a moment. But, let's make a crude version of ELIZA. Essentially, what I want to say once I have that text is I want to process the text. So, I'll write a function called process and we'll get a response from it. And I'm just going to return "How does text make you feel?" So, this is my chatbot. This is my therapist chatbot. Whatever you say to it, it says back to you, how does whatever you said make you feel? Oh, and I've got to switch this to response. Bananas. Oh, "How does object make you feel?" Oh, object makes me feel so sad. I forgot, the text is not the raw text. Bananas. Ah, response.text. Or text. Let's call this the transcription. transcription.text. Okay, last try. Blueberries. "How does blueberries make you feel?" Delicious. Great. We're almost done, people. Now, we need text-to-speech.
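A minimal sketch of this "therapist" brain, including the fix for the "How does object make you feel?" bug: the speech recognition pipeline returns an object, and the words live in its `text` property. The video names the brain function process; it is called `processText` here as a naming choice, and `respondTo` and the `trim()` call are hypothetical additions for illustration.

```javascript
// The whole "therapist": echo the user's words back as a question.
function processText(text) {
  return `How does ${text} make you feel?`;
}

// `transcriber` is the speech recognition pipeline; it resolves to an object
// with a `text` property, so pass transcription.text, not the object itself.
async function respondTo(waveform, transcriber) {
  const transcription = await transcriber(waveform);
  // trim() is an added nicety that removes stray surrounding whitespace.
  return processText(transcription.text.trim());
}
```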

Setting up Kokoro TTS for text-to-speech

We've made a simple brain. I think at the end of this video, I'll plug in other more sophisticated brains to our chatbot, but that's good enough for now. Now, let's talk about Kokoro TTS. This is the GitHub repo for the Kokoro model. And over here on Hugging Face, we can read the model card. We can find out things about what the actual architecture of the model is. You can read this paper on StyleTTS 2. And you can see here more about the training data, which includes public domain audio and other synthetic audio as well. Xenova is going to save me about 10 minutes in this coding challenge cuz there's a really quick getting started page right here where I can see how to import and load the model as well as quickly generate some audio from it. But, remember, we're working in client-side JavaScript only. So, I've got to adapt this code to use the import loading function and the full URL path of the Kokoro JavaScript library. So, right up here, this is my speech-to-text. Now, let's load text to speech. So, I'm going to load Kokoro. Kokoro TTS. So, I'm on the npm page for Kokoro. I can click on jsDelivr, and then here, here's the path. I actually need to include the full path for this to work. So, that's dist, and then the name of the library file, which I have over here: kokoro.web.js. Now, even though I'm going to load the model from Hugging Face, and this looks just like awaiting the pipeline, the Kokoro model is not part of the transformers.js pipeline framework just yet. So, at this moment in time, I need to say KokoroTTS.from_pretrained, which is just a fancy way of saying load this pre-trained model. Let's grab all of this. Okay, let's look at what this audio is. I don't think I can just play it. Okay. I got some kind of object that has audio data in it with a sampling rate, but there's no play function. I think I'm going to have to convert it into something that I can play. So, let's take all this to another function. Let's write a function called speak.
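A hedged sketch of loading Kokoro in the browser. The CDN URL is assembled from the pieces named in the transcript (jsDelivr, dist, kokoro.web.js) and the model id is an assumption, so check both against the Kokoro getting-started page. The `dtype` and `device` options are the ones added later in the video, and `importer` is a hypothetical parameter that defaults to a dynamic import().

```javascript
// Assumed CDN path for the browser build of the Kokoro JavaScript library.
const KOKORO_URL = "https://cdn.jsdelivr.net/npm/kokoro-js/dist/kokoro.web.js";
// Assumed model id; the getting-started page lists the current one.
const KOKORO_MODEL = "onnx-community/Kokoro-82M-v1.0-ONNX";

let tts; // global, so the speak function can reach it later

// `importer` is a hypothetical parameter; by default it is a dynamic import().
async function loadTTS(importer = (url) => import(url)) {
  const { KokoroTTS } = await importer(KOKORO_URL);
  // Kokoro is loaded with from_pretrained rather than through pipeline().
  tts = await KokoroTTS.from_pretrained(KOKORO_MODEL, {
    dtype: "fp32",
    device: "webgpu",
  });
  return tts;
}
```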

Playing synthesized audio with AudioBufferSource

And let's write that here. It's an async function that's going to speak some text. Okay, the first thing I need to look up, I think, is something called create buffer source. I think this should be with the audio context. Yes. Create buffer source can be used to play audio data contained within an audio buffer object. So, I already have the audio context. Let's try this. So, let's call this a buffer. And my context was just ctx. I have one channel. Frame count is how much? Hold on, let's look at this audio object again. I have an audio object inside of the audio object. So, the raw audio is just this giant raw audio thing. So, I just need the length of that. Okay. Boy, that console did not like console logging that. Let's clear that. So, let's call this something besides audio. Audio data or something. I don't know what to call this, but we're going to create this buffer with one channel, audioData.length, and then a sample rate. Okay, so we've made a buffer out of this audio data. That's good. Let's call this result. Also, it's result.audio.length. Oh, this buffer I made is empty. I need to put all the stuff into it. Aha, audio buffer, copy to channel. So, now buffer, copy to channel, the result audio. I think. I guess I need the channel number. Now, I need that... What was it called again? The buffer context? Buffer source. Source connect. source.buffer is the audio buffer. So, I think this is all I need here, right? It looks like I create the buffer source, I put the data in the buffer, and I connect it to the audio context speaker, and then I play it. So, this should basically work. But it might need different variable names. So, I called it source, and I just called this buffer, and source connect to the audio context destination, and source.start. Oh, TTS. This needs to be a global variable. Okay, let's try this again. Hello p5.js.
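The playback steps worked out above, gathered into one function. `createBuffer`, `copyToChannel`, `createBufferSource`, `connect`, and `start` are standard Web Audio API calls, and `result.audio` plus a sampling rate are what the transcript reports seeing in the generated object. Passing `tts` and `ctx` as parameters, the property name `sampling_rate`, and the voice id are assumptions for this sketch.

```javascript
// `tts` is the loaded Kokoro model and `ctx` the shared AudioContext.
// The default voice id is an assumption; the model ships a fixed voice list.
async function speak(text, tts, ctx, voice = "af_heart") {
  const result = await tts.generate(text, { voice });
  // One channel, as many frames as there are samples, at the model's rate.
  const buffer = ctx.createBuffer(1, result.audio.length, result.sampling_rate);
  // A freshly created buffer is silent; copy the samples into channel 0.
  buffer.copyToChannel(result.audio, 0);
  // The buffer source node is what actually plays the buffer.
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination); // route to the speakers
  source.start();
  return source;
}
```

Returning the source node leaves room to attach an "ended" listener to it, which is one way to find out when the bot has finished talking.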

Text-to-speech working!

Okay. We did it. Let's look at all the voices. We can call the TTS object's list voices function. So, one thing I'll say is this particular model does not support voice cloning. It's a whole other topic, which I don't know, at some point maybe I might get into, but here we just have a fixed set of predefined voices to work from. I'm being told by the chat that there was a Daniel in there. Okay, let's try Daniel. Hello p5.js. Sounds just like me. So, let's remove speak from here. And then, once we have the response from my process function, my brain, then we should be able to speak that response. Be nice to spell response correctly. I'm hungry for lunch. "How does I'm hungry for lunch make you feel?" It makes me feel hungry. "How does I'm hungry for lunch, it makes me feel hungry make you feel?" Wait, why did it get all of that? Oh! I just realized something. I have this audio chunks array that I'm putting all the audio data in, but I never clear it. So, it's re-transcribing everything every time. So, after it's transcribed the waveform, I need to clear out audio chunks. Blueberries. "How does blueberries make you feel?" Delicious. "How does delicious make you feel?" Okay. So, not the most sophisticated chatbot, but it's working. I think it would be nice to know when the speaking has stopped. So, I think I should be able to
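The bug described here comes from the chunks array surviving between recordings. One way to guard against it, shown as a hypothetical helper named `takeRecording` (the video simply clears the array after transcribing), is to reset the array at the moment the chunks are bundled:

```javascript
let audioChunks = []; // filled by the recorder's dataavailable handler

// Hypothetical helper: bundle the recorded chunks, then reset the array.
function takeRecording() {
  const blob = new Blob(audioChunks);
  // Without this reset, the next recording would still contain the old audio
  // and everything said so far would be transcribed again.
  audioChunks = [];
  return blob;
}
```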

Handling playback events

get an event for when this playback has stopped. So, to do that, I think I can do an addEventListener, and the event is ended, and then that should just be a function. And let's draw the background. Let's do the start screen. Let's just make this a function. Push to talk. And then, so we first draw push to talk when the sketch starts, and then this should be the event for when it's done. It'll draw push to talk again. Let's just see if that works. Frankenstein. "How does Frankenstein make you feel?" Perfect. Okay, so I've actually previously done coding challenges like this. Voice chatbot with p5.speech from 2017. I guess this is kind of an updated version of that. And in that coding challenge, I used RiveScript to create a number guessing chatbot. Let's look at this real quick. "Guess a number between 1 and 10." "Pick a higher number." "Pick a lower number." "You got it." Let's see if we can bring this brain into my chatbot. I need a

Swapping in a RiveScript number-guessing brain

plain text file to load the RiveScript code, essentially. All of this was covered in my RiveScript video. Then, this is the code to load that brain. And I'm going to just load it right at the beginning of setup. And then, here is the reply that comes from the RiveScript bot. Oh, I guess I have to say new RiveScript there. Let's make a global variable called bot. So, now I've set up RiveScript, I've loaded the RiveScript itself. This is going to be a problem because I wonder if it's going to transcribe me guessing a number into the word rather than the number, but we'll see. And here's the code for essentially the pipeline, the RiveScript pipeline, taking the text and generating a reply. So, I have this very simple process function. Now, I just add this, and I return the reply, and I add an async to it. I also then add an await here. Okay, so now we should have a number guessing chatbot. Oh, wait. Got some errors. RiveScript is not defined. I need to import the RiveScript library, which I can get from here. This library is imported the traditional way. Ah, hold on. Breaking news. I forgot to set the device for Kokoro TTS. Hopefully, I've been editing out how long I've had to wait for the speech to come back every time, but let's make it faster right now. dtype: floating point 32. device: WebGPU. Thank you, Xenova. The other thing I want to do is I want to console log what's happening just in case, just to know what's going on here. I haven't put anything into this sketch to track the loading of the models. So, right now I'm just going to put a console log at the very end of setup. In my video about transformers.js, I showed how you can add a progress callback to these model loading functions, and then you could draw an animation or something while the models are loading. All the models are loaded. Hello. "Guess a number between 1 and 10." Five. "Pick a lower number." Four. Three. Ah.
So, it failed because my system is not smart enough to know that that's the number three. I'm going to fix it. Okay. So, I should be able to set the text to lowercase, and then trim it. Could probably use a regular expression to just match it anywhere, and then the value should equal numbers with the key. Okay, so first I should trim it. Okay, how do I test a regular expression? This is the worst code I've ever written. I insist on fixing this. The test function operates on the regular expression. So if it's not a number, switch it to the corresponding word. Oh, no, no, no, no, no. Oh, yeah, that'll work. No, if it's not a digit, it's the word for a number. So look it up here to get the actual digit, put it in val. Oh, this is ridiculous, but I think this will work. Hello. Six. "Guess a number between 1 and 10." Undefined. That has to be val. I have to do the test on val. Remove the dot. So that'll take off the dot. Ready everybody? It's going to happen. Good morning. 14. "Pick a lower number." 7. "Pick a lower number." 4. "Pick a lower number." 2. "You got it." Okay, since we're here and we're already working in transformers.js, let's look at how we might add a
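The number clean-up described above (lowercase, trim, drop the trailing dot, then swap a spelled-out number for its digit) can be written as a small pure function. This is a tidied sketch, not the code from the video; the function name `normalizeGuess` and the exact lookup table are assumptions.

```javascript
// Lookup from spelled-out numbers to digits, covering the 1 to 10 game range.
const numbers = {
  one: "1", two: "2", three: "3", four: "4", five: "5",
  six: "6", seven: "7", eight: "8", nine: "9", ten: "10",
};

function normalizeGuess(text) {
  // Lowercase, trim, and strip punctuation such as a trailing period.
  let val = text.toLowerCase().trim().replace(/[.,!?]/g, "");
  // If val is not already made of digits, look the word up in the table.
  if (!/^\d+$/.test(val) && Object.hasOwn(numbers, val)) {
    val = numbers[val];
  }
  return val;
}
```

For example, `normalizeGuess("Three.")` gives `"3"`, while text that is neither a digit string nor a known number word is passed through unchanged apart from the clean-up.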

Adding a language model (SmolLM2) as the brain

language model to the brain. We're going to go back to Hugging Face to models to text generation to libraries to transformers.js. I'm going to search for SmolLM and let's try this one. There is so much to say about language models. I would like to use one following what I talked about at the beginning of this video where I can point to a model card that has a lot of information about how the model was trained and in particular where I can examine and know about exactly what data was used to train this model. So Hugging Face has actually released its own family of lightweight language models and they're called SmolLM, and there's SmolLM and SmolLM2, and SmolLM3 seems to be the latest. What size you pick, whether it's a base model versus an instruct model, means all sorts of different things. I've got to come back and do a whole separate video just about that. For now, let's go to this particular model. I'm going to copy its path and I'm going to come here. Let's remove all the RiveScript. Right after I load the pipeline for speech recognition I'm going to make a variable called LLM. My task is text generation. My model is the SmolLM2 360-million-parameter instruct model and let's run it on WebGPU. And let's take a look at how the code is designed. So in this case, I actually need to create an array that constitutes the history of the conversation. So I'm going to start with an array of messages. Let's have that be another global variable. Let's only put the system message in it right now and let's say "You are a frog and you only say ribbit no matter what anyone else ever says." So the system prompt is essentially the predefined instructions for how this model should behave. A lot more to say and think about there, but now this process function, which previously had my insane RiveScript parsing code, now can take that text. I can format it like this. The content is the text.
I can add it to the conversation, the messages array, and then the response is await the LLM, which receives the messages array. And let's add this max_new_tokens. That is going to constrain the length of this particular model's reply, and I'm just going to grab both of these. And the generated... so there's an output object, which we maybe need to examine to see what's in there, but based off of the example code I should just look for the first element, generated_text, at -1. So it's probably the whole messages array, and I want the last one, which is coming back from the model, and its content. So let's just see what happens. Hello there. Oh, I don't think all the models loaded yet. So I should be adding some error protection. Like, I should not be able to push to talk until all of the models load. All the models loaded. Hello there. We got an error. messages.push(message)... oh, I redeclared the variable in setup. So I need that to be a global variable. Hello. Cannot read property... generator is not defined. Oh, I forgot I had copy-pasted some example code in there. So this is actually LLM. And this is messages. And there we go. Okay, here we go. Good morning. Well, it looks like my system prompt wasn't very well followed. The assistant said back, "Good morning." Let's add a couple of things that are important here. Language models predict the next tokens in a sequence based on a set of probabilities, and we should be adding this property called do_sample to make sure it's making full use of those probabilities. I made a whole video all about this. So that's in another video. You have too many videos to watch, but it's there. Let's try this again. Hello. Looks like your character is a frog not a person. Well, the last person to speak is probably past, didn't they? Ribbit, ribbit, ribbit. There we go.
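As a rough sketch of the generation step just described: the options `max_new_tokens` and `do_sample` and the `output[0].generated_text.at(-1).content` lookup follow the transformers.js text-generation examples, while the value `128` and the helper names `lastReply` and `generateReply` are placeholders chosen for illustration, not values from the video.

```javascript
// Cap the reply length and sample from the predicted token probabilities
// instead of always taking the single most likely token.
// 128 is a placeholder; the video does not state the value used.
const GENERATION_OPTIONS = { max_new_tokens: 128, do_sample: true };

// For chat-style input the pipeline is expected to return
// [{ generated_text: [...messages, assistantReply] }],
// so the model's reply is the last message of the first result.
function lastReply(output) {
  return output[0].generated_text.at(-1).content;
}

// `llm` is the loaded text-generation pipeline; `messages` is the chat history.
async function generateReply(llm, messages) {
  const output = await llm(messages, GENERATION_OPTIONS);
  return lastReply(output);
}
```

Separating `lastReply` from the model call makes it easy to log the raw output object once and confirm its shape before relying on it.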
Now that that's working, if I want this conversation to continue, then I need to make sure that this response is put into the messages array, so that the messages array will accumulate over time with the entire conversation history and be sent into the model each time. So let's make a variable called reply, which is equal to the parsed-out content of the reply. I'm realizing I might be reformatting something that's already formatted, but that's fine. And then I'm going to say messages.push with role assistant and content reply, and I think it would be simpler for me to just put this right here and return that reply to speak. So essentially what this is doing is it's getting the text that I've said, putting it into the messages array, passing the messages array to the language model, getting the reply, making sure that reply is stored in the messages array, speaking it, and then it's ready for the next step. "You love random numbers and only ever talk about random numbers regardless of whatever the user says to you." Okay, let's try that. Okay, ready? Here it is. The grand finale, our random number chatbot. Tell me about yourself. I'm an artificial intelligence designed to delight users with a multitude of intriguing stories, each one unique and unexpected. My existence is rooted in a vast database of human knowledge. I just want to try this 1.7 billion parameter model before I wrap this up. Okay, so it's in there. This model requires a specific data type, which I'm not going to explain right now. You can ask in the comments. I will happily answer you. And let's try a new system prompt. "Only ever respond with random numbers. No, don't ever say any words ever, anytime, no matter what anyone says to you. Reply in number form only."
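The conversation loop summarized above (store the user's text, call the model with the full history, store the reply, return it to be spoken) might be sketched like this. The function name `processText` and the token limit are assumptions for illustration; the video's own function is simply called the process function.

```javascript
// One conversational turn. `llm` is the text-generation pipeline and
// `messages` is the shared history array that grows with every turn.
async function processText(llm, messages, text) {
  // 1. Record what the user said.
  messages.push({ role: "user", content: text });

  // 2. Send the whole history to the model (placeholder token limit).
  const output = await llm(messages, { max_new_tokens: 128, do_sample: true });

  // 3. Pull out the assistant's reply, the last message in the result.
  const reply = output[0].generated_text.at(-1).content;

  // 4. Store the reply so the next turn sees the full conversation.
  messages.push({ role: "assistant", content: reply });

  // 5. Hand the reply back to the caller to be spoken.
  return reply;
}
```

Because the history is sent on every call, it grows by two messages per turn; a long conversation would eventually need trimming, which this sketch does not attempt.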

Final demo: the random number chatbot

Okay, here we go. How many blueberries should I eat for dessert? The number of blueberries you should eat depends on your personal dietary needs. — Didn't you just tell me a random number? The random number for eating blueberries is 13. You've made me so happy. You have no idea. The random number for smiling is 1747. Now this is a chatbot I can get behind. The random number for a positive mood is 28. All right, everybody. Thanks for watching this. I'll see you next time.

Goodbye!

Please share the conversational chatbots you make with me on the Coding Train website in the Passenger Showcase, and have a good day.
