Building AI Voice Agents with Vapi & AssemblyAI
23:44


AssemblyAI · 27.02.2025 · 3,203 views · 56 likes


Video description
AI Voice Agents are revolutionizing customer support, automation, and real-time AI interactions, but how do you actually build one? In this livestream, Smitha Kolan, Sr. Developer Advocate at AssemblyAI, and Jordan Dearsley, CEO & Founder of Vapi, dive into:

✅ How AI Voice Agents work and where they're being used today
✅ The biggest challenges: handling interruptions, background noise, and latency
✅ Live demos of Vapi's AI Voice Agent workflows & AssemblyAI's Streaming Speech-to-Text API
✅ The future of AI Voice Agents: telehealth, multilingual AI, and beyond!

📌 What You'll Learn:
- How to build and deploy an AI Voice Agent
- How real-time speech-to-text enables low-latency AI conversations
- How Vapi's workflows simplify voice automation

🔗 Try AssemblyAI's Streaming Speech-to-Text API: https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=yt_smit_29
🔗 Explore Vapi's AI Voice Agent platform: https://www.vapi.ai

🖥️ Website: https://www.assemblyai.com
🐦 Twitter: https://twitter.com/AssemblyAI
🦾 Discord: https://discord.gg/Cd8MyVJAXd
▶️ Subscribe: https://www.youtube.com/c/AssemblyAI?sub_confirmation=1
🔥 We're hiring! Check our open roles: https://www.assemblyai.com/careers

#MachineLearning #DeepLearning

Table of contents (5 segments)

Segment 1 (00:00 - 05:00)

Hey everyone, welcome to the AssemblyAI YouTube channel. I'm Smitha, a developer advocate here at AssemblyAI, and today we've got an exciting session all about building AI voice agents, which are AI-powered assistants that can listen, understand, and respond in real time. To dive into this, I'm joined by Jordan, the founder and CEO at Vapi. Jordan, thanks for coming on.

Yeah, thanks so much for having me on. For those who may not be familiar, Vapi is a platform that makes it easier than ever for users to build, test, and deploy voice agents quickly.

Today we're going to talk about why AI voice agents are taking off, the biggest challenges in the space, and of course we'll do some live demos of Vapi's workflow platform as well as AssemblyAI's streaming API. But first, let's talk about why AI voice agents are such a big deal in 2025. Jordan, what are you seeing in the industry? Why do you think more companies are investing in AI-driven voice interactions?

Yeah, it's really because of the models. We saw this trend start maybe a year and a half or two years ago, where the models across the board, the transcription, LLM, and text-to-speech stack, were all getting faster, cheaper, and more performant over time. When we started the company we extrapolated and thought: if this keeps happening and all three of these numbers keep moving in these directions, eventually we'll have models that achieve human performance when orchestrated together. That was our bet early on, and we knew that if people can talk to stuff like it's human, of course they will choose to talk to stuff like it's human. We're now in a place where models can talk like humans, and that's really the reason there's been such a wave: it's reached the point where it can pass that human Turing test across the entire model stack. That's why we're seeing so much demand right now.

That's actually super interesting. I've also seen a lot of companies using AI voice agents for things like appointment scheduling, hands-free control in smart devices, even real-time language translation. Are there any use cases that you found surprising in the last couple of years?

Surprising, yes. When we started this company we never expected customer service and phone calls to be an area where there would be a ton of value to unlock, but lo and behold, there are billions of phone calls every year in the US alone. That is a lot of time and money spent, and a lot of suffering for people who have to stay on hold for millions of hours. So that's one area where we've seen a lot of interest. The more niche, interesting stuff has actually been training and coaching people: think roleplay training for call center agents to get them ready to be on the phones, or for salespeople to get them ready to sell, or even more entertainment-style applications, like talking to your favorite manga character. So the whole consumer and training side was quite unexpected, but there's obviously a much more obvious value to unlock in that whole enterprise customer service angle, so we serve a bit of both.

Nice, that definitely makes a lot of sense. Another big thing is that for an AI voice agent to be effective, it needs to not just respond but also listen and process speech in real time, so I'm really looking forward to the demo you have today. We're also going to be demoing AssemblyAI's streaming speech-to-text API: it transcribes live audio with high accuracy and low latency, usually within a few hundred milliseconds, so that's a game changer. Let's jump into your demo, and then we can have a lot more questions later on.

Happy to. Let me quickly share my screen and I'll walk through the platform a bit. Forgive me, our dashboard is going to have demo stuff all over the place, but you'll get the point. This is the Vapi platform. The way to think about us is that we are everything in between all these models, like the new AssemblyAI streaming models, and actually turning them into voice agents, getting them into production, and then seeing how they're actually performing. I made this assembly assistant using the AssemblyAI streaming API, and for this one specifically you'll see how much it costs across the stack: how much the transcription from AssemblyAI costs, how much the model costs, how much the text-to-speech model costs. Same thing with our latency budget. In a real-time conversational application you want that latency to be as tight as physically possible, so that people don't have time to think after they finish their statement and it doesn't break the fluidity of the conversation. In this case, across the stack, we're looking at roughly 1,400 milliseconds.
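For readers who want to try the listening layer discussed here, the sketch below shows one way to stream microphone audio and print live transcripts. It is a minimal example assuming the `assemblyai` Python SDK's real-time transcriber interface; the API key and settings are placeholders, and this is not the exact code used in the livestream.

```python
# Minimal sketch of live transcription, assuming the `assemblyai` Python SDK's
# real-time interface (pip install "assemblyai[extras]"). Callbacks and
# settings are illustrative, not the demo's actual code.
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"  # placeholder

def on_data(transcript: aai.RealtimeTranscript):
    # Partial transcripts arrive within a few hundred milliseconds; final
    # transcripts are what a voice agent would hand to the LLM step.
    if not transcript.text:
        return
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print("final:", transcript.text)
    else:
        print("partial:", transcript.text)

def on_error(error: aai.RealtimeError):
    print("error:", error)

transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    on_data=on_data,
    on_error=on_error,
)
transcriber.connect()
# Stream raw microphone audio until the process is interrupted.
transcriber.stream(aai.extras.MicrophoneStream(sample_rate=16_000))
transcriber.close()
```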

Segment 2 (05:00 - 10:00)

This might be a little off, but roughly: for transcription we're looking at anywhere from 100 to 300 milliseconds; for the model piece, in this case GPT-4o mini, about 300 milliseconds; and text to speech from a company like ElevenLabs might take another 300 milliseconds. So across the stack we're trying to shoot for 1,200 to 1,500 milliseconds.

I put together a prompt for this agent where it's going to roleplay being an assistant for a dental office. It's going to first ask for my full name, say something funny maybe, request a date and time, and then push all those details from the conversation live to a spreadsheet that I have open. Can you see my screen okay?

Yeah, you can. Perfect.

All right. I can configure the transcriber: like I said, I'm using AssemblyAI under the hood, this is the streaming model, and the language is English. For the voice, ElevenLabs, and I can pick any model from them as well. I can also configure tools. A tool that I've set up for this is the book-appointment tool, which hits an automation that then hits my spreadsheet. Here's the spreadsheet I have open; it's essentially just going to collect my first name, last name, date, time, and the reason for the visit. So I'll call that now and we can try following along. Any questions from your end before I call the thing?

Oh, this looks super awesome. And the fact that you can select different text-to-speech providers is great; can you also select different large language models? I'm not sure if you covered that.

Yes, for sure. We support every LLM from any underlying provider. For example, if we go to Anthropic, we have all the Claude 3.5 models; soon we'll have the new 3.7 model on there. For OpenAI we have all of theirs, even the realtime API, so you can talk with the native speech-to-speech models.

And for users who are looking to use this, would they have to get the individual APIs for each of these providers, or is that something they can configure within Vapi?

Essentially they can either just use our accounts with the underlying providers, like with ElevenLabs or whatever, where we have preferred pricing, or they can bring their own API keys. And if they have a fine-tuned model or a custom voice with one of these, they can bring their own accounts and then use those voices or models.

Awesome.

Yeah, for sure. Let me just give it a call. I'll follow along on the prompt as I'm walking through the phone call. Let me make sure I have the right number here. Hey, would you like to book an appointment? Yeah, I would. Oh hey, are you there? Oh, I may have hooked up the wrong model, give me one sec, sorry about that. No worries. Would you like to book an appointment? Yes, I would. There you go; could you go over your full name? Jordan Dearsley. Glad you're back. Oh, can you hear me? Sorry about that. Oh hey Jordan, good to have you back, so what's the purpose of your appointment? I'm feeling a bit sick in my stomach. Ah, gotcha, so maybe just a check-up to figure things out. When were you thinking for the appointment? Could we do maybe tomorrow at 4 PM? Sweet, let me just get that locked in for you, one sec. Great, thank you.

Wow, that was incredible, especially the fact that you could actually interrupt it.

Yeah, so it goes back and forth, and you can see it actually pushed that data to the spreadsheet now. It misunderstood my last name, unfortunately, but you get the point: it can actually interact with tools, and I can talk to it more naturally.
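To make the configuration Jordan clicks through above more concrete (streaming transcriber, LLM, and voice), here is a rough sketch of what creating such an assistant over Vapi's HTTP API could look like. The field names and provider values are illustrative assumptions based on the conversation, not a copy of Vapi's documented schema; check the Vapi docs for the real payload, and note the tool/webhook wiring for the spreadsheet is omitted.

```python
# Rough sketch of creating a Vapi-style assistant over HTTP. Field names and
# provider values are illustrative assumptions; consult the Vapi docs.
import requests

VAPI_API_KEY = "YOUR_VAPI_API_KEY"  # placeholder

assistant = {
    "name": "Assembly assistant",
    # Streaming speech-to-text layer (AssemblyAI, English).
    "transcriber": {"provider": "assembly-ai", "language": "en"},
    # LLM layer; a small, low-latency model keeps this step around ~300 ms.
    "model": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "messages": [{
            "role": "system",
            "content": "You are a friendly receptionist for a dental office. "
                       "Ask for the caller's full name, the reason for the visit, "
                       "and a preferred date and time, then book the appointment.",
        }],
    },
    # Text-to-speech layer (an ElevenLabs voice, a clone in the demo).
    "voice": {"provider": "11labs", "voiceId": "YOUR_VOICE_ID"},
}

resp = requests.post(
    "https://api.vapi.ai/assistant",
    headers={"Authorization": f"Bearer {VAPI_API_KEY}"},
    json=assistant,
    timeout=30,
)
resp.raise_for_status()
print("created assistant:", resp.json().get("id"))
```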
And that was actually a clone of my voice that I made in ElevenLabs; I don't know if you could notice that.

Oh, nice.

This demo in particular maybe took me a few minutes to put together. But there's one thing I do want to highlight about this demo specifically: it just had the prompt and was running the whole conversation based off of that prompt. The problem with prompts, as they get longer and a lot more complex, is that we usually see them start to go off the rails, because these models tend to hallucinate, especially if you want to use smaller, low-latency models. So what we've been investing a lot of time in is this idea of workflows. Instead of having a prompt run the conversation, you can now have these step-by-step conversation flows. For example, this is that same prompt, but now modeled out in a step-by-step fashion, so it's guaranteed to first confirm that this is Cross Dental, then gather this information, then make that API request, then confirm the details, whereas with a tiny model like GPT-3.5 or whatever, it would usually go off the rails. So it's a much more intuitive and secure way of allowing an agent to actually run business logic.
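One way to picture the step-by-step flow Jordan describes (confirm the office, gather details, call the booking API, then confirm or hand off) is as explicit blocks rather than one long prompt. The sketch below is an illustrative data structure in Python, not Vapi's workflow format; the block types, fields, and endpoint are assumptions.

```python
# Illustrative sketch of a step-by-step voice workflow as plain data. This is
# not Vapi's workflow schema; block types and fields are assumptions meant to
# show why explicit steps stay on the rails better than one long prompt.
workflow = [
    {"type": "say", "text": "Thanks for calling Cross Dental. Would you like to book an appointment?"},
    {"type": "gather", "fields": ["first_name", "last_name", "reason", "date", "time"]},
    {"type": "api_request",                              # push details to the spreadsheet
     "method": "POST",
     "url": "https://example.com/book-appointment",      # hypothetical endpoint
     "body_from": ["first_name", "last_name", "reason", "date", "time"]},
    {"type": "condition",
     "if": "booking_succeeded",
     "then": {"type": "say", "text": "You're booked. See you then!"},
     "else": {"type": "transfer", "to": "front-desk"}},  # hand off to a human
]

def run(workflow, state=None):
    """Toy runner: each block only sees the fields it declares, so a small,
    low-latency model has far less room to wander off-script."""
    state = state or {}
    for block in workflow:
        print("step:", block["type"], {k: state.get(k) for k in block.get("fields", [])})
        # A real engine would speak, listen, call the tool, and branch here.
    return state

run(workflow)
```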

Segment 3 (10:00 - 15:00)

Does that make sense?

Yeah. And can users actually build these workflows themselves, or is it a template that gets built after that initial prompting and model selection?

Users can build them themselves; we just launched this maybe last week. All they need to do is create a workflow, attach it to an assistant, and then design their own from scratch using the different block types. For example, it can say something; it can gather information, like in this case it gathered first name, last name, and so on; it can make an API request, like pushing to the actual spreadsheet; it can transfer the call to a human and end the call; or it can use a condition: if this, then go to this block, otherwise go to that block. Those are the primitives we're starting with, but we're planning on coming out with a whole suite of different kinds of blocks and different integrations that are native in the platform.

This is awesome. So how can users actually start testing this out? Is it available for free, or what type of pricing is available?

Totally. We're actually making some big changes to our pricing soon, which I will not reveal now, but it will mean that a lot more developers get to use it for longer without paying money; that's the short version. But users can just log on right now. There's a free credit on the account, so they can jump in and make hundreds of calls without even having to put down a credit card.

Awesome. And do you have any documentation developers need to get started?

Great question. The way to think about the product is that this dashboard is maybe 20% of it; the rest is the incredibly complex platform that we built under the hood. Our dashboard is just one app built on top of the Vapi API, and the Vapi API has maybe hundreds of points of config, which is what lets entire products like the Vapi dashboard be built on it. That's how we see people building products: entire platforms for home service professionals to accept inbound calls, or for collections teams to do outbound collections calls, that kind of thing. So it's not limited to one API call to send a phone call; it's an entire API platform to build voice agent products on top of.

Is this also going to be easy to scale for bigger customers who are handling a lot more volume?

Yes. There's this idea of concurrency, and concurrency just means how many calls can I take on at the same time. I think AssemblyAI probably has a similar notion of how many streams you have live; for us it's how many agents can be live, talking to real humans, at once. At the moment I believe it's limited to maybe 100 or so, but we're working on some stuff to allow for unlimited concurrency, hopefully sometime soon, so we'll be able to scale up to whatever the actual customer demand is without any limits or blocks.

Awesome, thank you so much for giving this demo. I'd like to ask a couple more questions on your thoughts on voice agents. Based on what you've seen with Vapi and the type of users you have, who are some of your biggest customers in terms of use cases? Who is using and building with Vapi today?

It's a mixture. There are kind of two buckets; the way I like to think about it is new call volume and existing call volume. Existing call volume means companies that currently accept their own phone calls for whatever reason, so think insurance companies, healthcare companies, travel companies, anywhere you have a slightly older population to serve as well as a younger one, so people are used to speaking on the phone natively. Outside of that, in the other realm, which is powering a bunch of voice products, it's all over the map. You'd be surprised at the kind of big-name companies right now that are planning on launching voice agent products. I can't name them, unfortunately, because it's all under NDA, but there are some very big platforms, including public companies, that are working on deploying voice agent products for their actual underlying users as well. And it's a mixture of, like I mentioned, platforms for home service professionals, or platforms for customer support inside your software app, where a voice agent walks you through how to use the app, pointing and clicking like an onboarding guide, that kind of thing. So it's a mix right now.

It seems like there's a really varied set of use cases that companies are building with this.

Yes, definitely. In my experience with developers, one of the trickiest things is when people interrupt the bot mid-sentence as it's talking back to them; if it doesn't handle that well, it can lead to a really frustrating user experience. So how does Vapi actually solve that, and how did your team navigate it?

Yeah, so there are two components.

Segment 4 (15:00 - 20:00)

One is you need something called a VAD, a voice activity detection model. It looks at the audio coming in and asks, does this look like speech? If it looks like speech for long enough, we decide it's time to back off a little bit, so we might lower the volume slightly. Then, if you have a transcription model that's fast enough to rely on, like the new AssemblyAI streaming model, within, what's the latency on that, 300 or so?

Yeah, just under a couple hundred milliseconds.

A couple hundred, that's great. So within a couple hundred milliseconds, if I can confirm that the sound is a word, we'll actually back off: okay, the user's talking now. And because we have the actual transcription, we can even do super quick additional analysis to see whether the user is going "uh-huh, right", which means you should continue, or "wait, wait", which means you should actually back off and stop. So super-fast transcription is actually critical for interruptions.

Interesting, and I definitely see where AssemblyAI plays a huge part in that. Another challenge I can imagine is keeping conversations contextual. If a user references something they said earlier in the conversation, how do you make sure the agent doesn't forget what's happening? Is that something that can be built on top of the LLM?

That is tricky. In the prompt-based assistant I showed you first, it's a long-running conversation, so the context is always in the conversation; it's just that after maybe 5 or 10 minutes there are so many tokens that the model might get confused and miss things. But with something like workflows, which I showed, we're now working on the ability to have global state, or global memory, so even though a specific step doesn't have any context on what was talked about before, it has the ability to save things to its memory and then pull them up as quick snapshot context later in the conversation.

Nice, that's super cool. As I'm building with voice agents, and we've created a lot of tutorials here at AssemblyAI on building voice agents on our YouTube channel, another big challenge I hear a lot about is handling noisy environments. AI voice agents often have to deal with noisy backgrounds: maybe there are multiple speakers, or music playing, and even as a human that's hard to decipher. So how do you see AI voice agents tackling that?

I would hope that AssemblyAI tackles it for us so we don't have to. Essentially, what we found is that background noise cancellation is a very solved problem; people have had 20 years on this, and every iPhone has an awesome background noise cancellation model inside of it. The problem arises when you want background voice cancellation, because these transcription models are all tuned to listen in on every single thing that sounds like a voice and transcribe it, including the kid in the background or the TV in the background. That's why we actually had to deploy a custom background voice cancellation model that works somewhat well, but obviously you can't catch everything, because it's somewhat indeterminate what is background speech and what isn't; it's hard to tell. So we try to put a filter on the audio before it goes to AssemblyAI for transcription, and that tends to help, but ultimately we do need smarter, more promptable transcription models that can be told, "hey, there might be background noise in this clip, please ignore it." We need a bit more intelligence on the transcription side.
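The turn-taking recipe described in this segment (a VAD to spot candidate speech, fast partial transcripts to confirm it is a real word, and a quick check for backchannels like "uh-huh" versus genuine interruptions) can be sketched roughly as below. This is illustrative logic in Python, not Vapi's implementation; `vad_score`, `latest_partial_transcript`, and the agent controls are hypothetical helpers.

```python
# Illustrative turn-taking loop, not Vapi's implementation. The helpers
# (vad_score, latest_partial_transcript, agent controls) are hypothetical.
BACKCHANNELS = {"uh-huh", "mhm", "yeah", "right", "okay"}

def handle_incoming_audio(frame, agent):
    # 1. VAD: does this frame look like speech at all?
    if vad_score(frame) < 0.5:
        return

    # 2. The caller might be talking: soften the agent's output while we confirm.
    agent.duck_output_volume()

    # 3. Fast streaming STT (a few hundred ms) confirms whether it is a word.
    text = latest_partial_transcript()
    if not text:
        agent.restore_output_volume()
        return

    # 4. Backchannels ("uh-huh", "right") mean keep talking; anything else is
    #    treated as a real interruption, so stop speaking and listen.
    if text.strip().lower() in BACKCHANNELS:
        agent.restore_output_volume()
    else:
        agent.stop_speaking()
        agent.listen()
```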
Yeah, that's really interesting, because real-world environments are rarely quiet, and if an AI voice agent can successfully filter out those distractions and focus on the main speaker, that's a huge game changer. Next, I also want to talk about use cases and adoption. We've talked a lot about the types of use cases Vapi customers are building, but have you also seen companies adopting AI voice agents and then blending them with live human agents?

We do; that's actually more common than the rip-and-replace model. For the most part, companies, especially enterprises, are not comfortable today deploying voice agents to replace their entire voice operations. It's more like: let's use a voice agent to replace the IVR system that sits in front of the human agents, or that one thing the human agents hate doing, the very transactional call they have to handle 10,000 times a day. Maybe we can replace that and put them on higher-value work, higher-value calls, or escalations. So it's more of a pairing with the voice agents than not, and that's why within our platform we have many different ways to let voice agents escalate to humans, transfer to humans, or even do warm transfers to humans.

Segment 5 (20:00 - 23:00)

A warm transfer means that as the call is transferring, it quickly whispers in the ear of the human to tell them: hey, by the way, here's who is calling and a quick summary; are you ready to take the transfer? They say yes, and then it transfers the call. So we've invested in this handoff mechanism to make it smooth.

I can see that being super useful for customer service teams especially.

Exactly, it's pretty critical. But like I said, these companies do tend to move slowly; over time you can capture more and more workflows, and that's why a big focus for us is workflows. We want to own as much of the business logic as we can over time and eventually have that whole call tree represented as a Vapi workflow.

Awesome. We've also seen AI voice agents being used in a lot of meeting transcription, scheduling, and customer service. Do you think we'll see mass adoption in more critical applications, for example telehealth, in settings where what's being said is much more sensitive and privacy-focused? Do you see AI voice agents playing a bigger role there?

Definitely. Costs are usually higher to staff phones in these more regulated industries, but there are also much bigger concerns around customer data, and specifically patient data if you think about healthcare examples. They all want guarantees: is this data going to be persisted anywhere, is it going to be trained on anywhere? So we try to provide guarantees, contractually and in how we actually build our technology, that no data is stored or trained on, so that we can serve even the most sensitive applications. Those very sensitive use cases are such high ROI for these companies that we want to put in the effort to be secure enough to serve them.

Very nice. I'm going to wrap up with a final question about the future: what excites you most about the future of AI voice agents and Vapi? Where do you see this technology going in the next three to five years?

Three to five years is a very long time horizon with how fast everything is moving right now. I would say in the next year I'm excited to see speech-to-speech models pick up. We've been waiting a long time for this new model architecture where, instead of having three separate, disparate models that all have to play telephone with each other and miss each other's context, we move to one model that can hear audio natively and produce audio natively. It'll cut down on latency across the board, and it'll make it so the system can actually hear that a customer is frustrated and produce a sympathetic response end to end, instead of each model guessing how it should sound or what it should do. So that's exciting. Obviously, progress on that model architecture has been slower than we've wanted, but what we're looking forward to, I think by the end of this year, is more calls being served by speech-to-speech models. Beyond that, I have no idea: AGI, and then we'll all go to heaven, I guess.

That's definitely a lot happening in the next few years. This has been an awesome conversation, Jordan; thank you for coming on and showcasing workflows. Before we wrap up, where can developers go to learn more about Vapi and get started with workflows?

They can just go straight to vapi.ai; that's our website, Vapi, short for "voice API," so it's super easy to remember. They just log in, make an account, and they can spin up a voice agent like the one I called on my phone in probably 30 seconds.

Thank you. And for those of you who want to learn about building AI voice agents with real-time transcription, check out AssemblyAI's streaming API. We have a playground where you can test it out and great documentation to get started. You can also use AssemblyAI's streaming API directly in Vapi's workflows.
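As a closing illustration, the warm-transfer handoff Jordan describes in this segment (whisper a summary to the human agent, confirm they are ready, then connect the caller) could be sketched like this. It is a hypothetical flow in Python, not Vapi's transfer API; every helper name here is an assumption.

```python
# Hypothetical warm-transfer flow, not Vapi's transfer API. All helpers
# (summarize_call, dial_agent, whisper, bridge, etc.) are illustrative.
def warm_transfer(call, human_agent_number):
    summary = summarize_call(call)  # e.g. "Caller wants a check-up tomorrow at 4 PM"

    # Dial the human agent on a side leg the caller cannot hear.
    agent_leg = dial_agent(human_agent_number)
    agent_leg.whisper(f"Incoming transfer. Quick summary: {summary}. Ready to take it?")

    if agent_leg.wait_for_confirmation(timeout_s=15):
        call.bridge(agent_leg)      # connect caller and human agent
    else:
        call.say("Everyone is busy right now; someone will call you back shortly.")
        call.hangup()
```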
