# The Real State of Voice Agents: Lessons from Founders Who've Deployed Millions of Calls

## Metadata

- **Channel:** AssemblyAI
- **YouTube:** https://www.youtube.com/watch?v=HJdQiw0tXhY
- **Date:** 19.02.2026
- **Duration:** 45:26
- **Views:** 233
- **Source:** https://ekstraktznaniy.ru/video/11970

## Description

At our New York office, we hosted a live panel on what it really takes to build voice agents in production.

Joined by Blesson from Aviary AI (outbound voice for financial services), Craig from Trellis (inbound agents and parallel dialers), and Luca from AssemblyAI (voice AI platform), we dug into the hard stuff: tech stack redundancy, voicemail detection, interruption handling, measuring call quality, and why speech-to-speech still isn't ready for prime time.

▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬▬

🖥️ Website: https://www.assemblyai.com
🐦 Twitter: https://twitter.com/AssemblyAI
🦾 Discord: https://discord.gg/Cd8MyVJAXd
▶️  Subscribe: https://www.youtube.com/c/AssemblyAI?sub_confirmation=1
🔥 We're hiring! Check our open roles: https://www.assemblyai.com/careers

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

#MachineLearning #DeepLearning #VoiceAI #VoiceAgents #AIAgents #LLM #SpeechToText #AssemblyAI #ConversationalAI

## Transcript

### Segment 1 (00:00 - 05:00)

All right, awesome. Well, great to see the turnout tonight. My name is Ryan. I'm here from AssemblyAI. If you have not heard of AssemblyAI, we build voice AI infrastructure for developers and builders like yourself. Tonight we have an exciting panel here to talk about voice agents. You may have seen some posts from our team on LinkedIn and social last week: we released a State of Voice Agents report. There are QR codes for the report over there if you want to go grab a copy online. This discussion, though, will be focused more on what people are actually building in production today around voice agents. We're going to spend about 25 minutes doing a little fireside chat with the group, and then we will open it up for questions from the audience. So, with that, let's go ahead and do some introductions from the panel. Maybe Blesson, you want to start us off? — Yeah. Evening, everyone. I'm from Chicago, by the way, so I brought the snow, but I'm not bringing the cold yet; I'm going back to the cold tomorrow. Great to be here. I'm the CEO and co-founder of Aviary AI. What we do is provide voice agents for the financial services space: credit unions, banks, and life insurance companies. We're specifically focused on outbound right now; we have not gotten into inbound. — I guess that's my cue. My name is Craig Bedo. I'm the co-founder of Trellis. We're a YC Winter '22 company. We're really a voice company: we've done an outbound parallel dialer for many years, and we recently started introducing voice agents as well, primarily inbound, as well as some web-based experiences for folks who wanted to practice or do other non-telephony applications.
— Hey, I'm Luca, head of real-time here. I lead research, engineering, and product at AssemblyAI. That's all; I'll keep it short. — And just to give some context on myself, I head up the customer-facing teams at AssemblyAI. I'm based in San Francisco, and it's quite cold here, so I had to bring some specific clothes just to get here this week. I'm going to do a slightly cheesy thing and give some nods to this voice agent report as we go, just as intros to the questions, to set the stage on what we've been seeing in the market. In that report, something really interesting is that 87% of respondents said they've actually deployed a voice agent to production. But just as interesting, 75% of respondents said they're not satisfied with the voice agent they have, which leaves us with something like 12% of people who actually have a voice agent they're happy and proud of. That number is pretty low, right? I think we're all here tonight to get that number much higher over time. So maybe to kick things off, I'd love to hear what lessons you've learned about deploying a real production voice agent that would be worth sharing with the audience. Maybe Blesson, you want to start us? — Yeah. To give you a little bit of the metrics behind the industry we service: only 18% of them make outbound calls with human beings today. So the opportunity for us is that they're doing nothing today. From us introducing outbound voice agents for these banks and credit unions: think about, you're a brand-new customer, we're doing a welcome call so that you get your card set up the right way; you have gone inactive on your account, we're reaching out and making sure you reactivate it; we're doing collections calls. Those are the series of use cases we end up doing.
So for us, what's worked has been, one, again, we're going from zero to one for them, because they're not doing it today anyway. We're teaching them, really introducing a new function for them beyond just AI: the idea of making outbound calls at all. But the second thing we've really established is: hey, what does success actually mean to you when we make these calls? Some of them are informational. We get a great amount of alignment during onboarding on what success actually means, so that we can measure it during these calls. We can do post-call actions; we can do grading and auditing to determine whether or not the call actually ended the way we intended. That's how we've been able to start to see success: most of our clients start off with one to two use cases, and now they're starting to expand into seven or eight different call types per client. That's what results in thousands of calls going out per month. So we're building up confidence with them by, number one, getting agreement on what success looks like; two, doing something brand new for them that they've never done before; and they're seeing real ROI as quickly as possible. That's why we're starting to see the growth we're seeing in our space. — Yeah, there are sort of two things I would add to that. There's always a business perspective of what's a good application for this: is your customer bought into actually doing this at a scale where it's going to be material? I think the best signal is really whether they're serious about doing this beforehand. You know, we have some

### Segment 2 (05:00 - 10:00)

customers who come to us and maybe they want to do something, but they're not currently doing it at a scale which would make sense to automate; that deployment is probably not going to go very well. But if somebody's got a call center full of 50 people doing exactly this thing and they're pretty motivated to uplift it with AI, that's much more likely to succeed. To what Blesson said, that also tends to align with folks who have more precise metrics and are being metric-driven in their decision-making. So that's my main flag on the business side. On the technical side, I think the industry as a whole has gotten a lot better; for folks who maybe tried this in the past, it's probably easier now. Beyond that, I would say we work pretty hard on redundancy. There are so many failure points across these stacks, and what we've seen, at least, is that you're going to hit one of them: every one of your components, at some point, at some part of your load, will go down, or will just be a couple of seconds latent. So we do a lot of work to make sure that we have a fallback, ideally one that's running at the same time and is almost immediately available for those situations. — I think that's super interesting to dig into a little bit more. Can you give the audience some more detail: how does it actually work technically, and what does it look like end to end? — Yeah, sure. Some folks here may be running voice-to-voice; we're running a staged pipeline. So you're receiving audio from your telephony provider, you're doing transcription, you're doing something that's going to ultimately produce text, then you're generating some speech, and then you're dispatching that to your telephony vendor, right?
I think that, in general, everything before the bytes go over your outgoing websocket to your telephony vendor, you have full control over, right? You can run that in parallel. So if you take those stages: transcription, Assembly is great, I'm happy to be here, I love these guys. There are also other great transcription vendors. You should totally use Assembly, but maybe also use some of those other guys in addition. — It's okay. It's not a sales pitch. You can be honest. — They paid me in pizza. That's what you get for one slice of pizza. — Little hurtful, but — But so you can, right? Obviously there's a question of what you want to do if you think one of your vendors is latent, or if they disagree. But fundamentally, if one of them is latent or just gives you an error status code, it's pretty obvious that you want the other one. For us, in higher-volume applications we see a lot more scripting. When folks are naively enthusiastic, they're like, oh yeah, anything it says is awesome, just give me that. But those people are generally not very serious. The people you want to do business with have really strong opinions on every single word you say, because it matters to their business and they've been doing this for a decade with 50 people. So when you look at that, we try to drive out the margin for creativity, so to speak. As much as possible, we prefer to script things out for them and play exactly the words they said. So if you do that, if you basically have a way of realizing what you need to say, then you can get redundancy prior to that, and then you're just left with the speech generation part.
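The "fallback running at the same time" pattern Craig describes can be sketched as racing the same audio against multiple transcription vendors and taking the first clean result. This is a minimal illustration, not any vendor's API: each `vendor` here is a hypothetical async callable that takes audio bytes and returns a transcript.

```python
import asyncio

async def transcribe_redundant(audio_chunk, vendors, timeout_s=2.0):
    """Fan the same audio out to every STT vendor at once and return the
    first successful transcript; latent or erroring vendors are ignored."""
    tasks = {asyncio.ensure_future(v(audio_chunk)) for v in vendors}
    try:
        while tasks:
            done, tasks = await asyncio.wait(
                tasks, timeout=timeout_s, return_when=asyncio.FIRST_COMPLETED)
            if not done:  # every remaining vendor is too latent
                break
            for task in done:
                if task.exception() is None:
                    return task.result()  # first clean result wins
        raise RuntimeError("no transcription vendor returned in time")
    finally:
        for t in tasks:  # cancel whatever is still in flight
            t.cancel()
```

In a real stack the same idea applies per streaming chunk, and "disagreement" between vendors needs its own policy; the sketch only covers the latency/error fallback case the panel mentions.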
Then, again, if you knew what you were saying ahead of time, you can generate that offline, and then you don't have to be there at the last minute scrambling, like, why is my text-to-speech vendor latent right now? Obviously, if you're doing things live, you're kind of out of luck, but at least some of the time you could hopefully hit a cache. And then the telephony vendor: you're kind of single-threaded on that, but the telephony vendors are pretty good. The likelihood that they were able to deliver you an incoming call, or take your outgoing call, and then drop your websocket doesn't seem too high to me. When we had the big AWS outage, it was kind of like everything with your telephony vendor broke. So I don't have any redundancy tricks up my sleeve for that one, but I can't say you need one. — Yeah, I'll just echo what Craig said. He described almost our entire tech stack right there. We do the same. Yeah, copyright's right there; that's our moat. So we do the same kind of flow. I would say two things that we've learned through this entire process. One is that redundancy really matters, because concurrency is an issue vendor to vendor, especially when you have high-volume calls all happening at one time. We've run into issues with Deepgram where concurrency was a problem and we had to scramble in the early days to go find somebody else. Fortunately for Assembly, here's the sales plug: we don't have that concurrency issue, right? So that's a great thing. But overall, we try to have redundancy across the board, whether it's using Twilio and Telnyx, whether it's using Cartesia, ElevenLabs, even Deepgram for voice, and using Deepgram and Assembly on the transcription side of the world too.
The second thing I would say, going over to speech, and I know we have a topic around that: I don't know about everybody else here, but we've really struggled with speech-to-speech models. One, they're kind of dumb; they have really dumb responses, and task adherence is really poor. As much as I would love to shift over to speech-to-speech, especially when we're doing outbound calls, the fact that it doesn't follow instructions is a pretty big deal for the industry we're in. For us it's pretty simple: none of our customers care about our voice agents

### Segment 3 (10:00 - 15:00)

until it gets deployed. When it gets deployed and it's calling their customers, they give a damn. That's when they care. They care about the quality of what's actually happening. So to back up what Craig is saying: they all think they're unique and they all need different things, but one of the big wins for us in getting consistency has actually been caching responses too, right? So that we're not having to go back to the LLM each time. Because if we're calling you about card activation, we have probably 70% confidence that we know what the responses are going to be. So why wouldn't we cache those responses and determine whether a cached one is a good enough answer, rather than going to the LLM each time? That's how we've been building it. But Craig already described our tech. — I'm curious, maybe pulling on that thread a little more. One of the things that we talk about is the guardrails you might have to put in place as you're building these voice agents. You described the full stack. How many guardrails are actually in between each of those steps? And how often are you changing them? Is this an ongoing whack-a-mole battle? Describe it to us. — I would just say that I try to get my more serious customers to script more. I think that anytime you let an LLM decide what to say, you're at risk. Most of the time it's going to be fine, but we had a customer's caller start talking about suicide. I didn't want them to do that, but then of course the LLM is going to respond about suicide, and then the client's like, why are you talking to this person about suicide? So I have no interest in playing whack-a-mole. I think that, categorically, the way to do this is: if you can find your application in a truly high-volume, standardized regime, you're going to be set up for success.
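Blesson's caching idea can be sketched as a cache-first answer path: match the caller's utterance against the known questions for the use case, and only fall through to the LLM for the long tail. The word-overlap matcher and the 0.8 threshold below are illustrative choices for the sketch, not a description of Aviary's implementation; production systems would likely use embedding similarity instead.

```python
import re

def _normalize(utterance):
    """Lowercase and strip punctuation so near-identical phrasings match."""
    return re.sub(r"[^a-z0-9 ]", "", utterance.lower()).split()

def cached_or_llm(utterance, faq_cache, call_llm, min_overlap=0.8):
    """Answer from the per-use-case cache when the caller's question closely
    matches a known one; fall back to the LLM only for the long tail."""
    words = set(_normalize(utterance))
    best_answer, best_score = None, 0.0
    for question, answer in faq_cache.items():
        qwords = set(_normalize(question))
        if not qwords or not words:
            continue
        score = len(words & qwords) / len(words | qwords)  # Jaccard overlap
        if score > best_score:
            best_answer, best_score = answer, score
    if best_score >= min_overlap:
        return best_answer          # cache hit: no LLM round-trip, no drift
    return call_llm(utterance)      # long tail: go to the model
```

This gives the "good enough answer" gate the panel describes: the threshold decides when a cached response is acceptable versus when the call pays the LLM's latency.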
And if you need to speak to everyone about everything, you're going to be fighting an uphill battle. — Yeah. Guardrails are a little different for us, because we don't do as much of the scripting. Obviously, the opening line is scripted, the ending line is scripted, voicemails are scripted. Part of what we do is, one, when you're making outbound calls, there's a very pointed reason why you're calling. So we provide a knowledge base for the agent, and when I say knowledge base, it's not going and doing a tool call to pull from a knowledge base, because you don't want it casting too wide a net, and, sorry, latency obviously matters in this. The big part of it is feeding in the FAQs; it's really about prompting at that point, making sure that it doesn't go outside the rails of whatever you've given it within the context window. That's how we've been keeping it working. The second thing we've invested a decent amount into, going back to my earlier statement about when customers actually care about these voice agents, which is when the calls are being made: they care about monitoring, right? So we've decided as a team, again, it's a legacy-based industry we're servicing, that we're not trying to make deployment self-service. We'll handle that; we'll take care of that, because it's the 80/20 rule. But what we push back onto the client to manage is monitoring and QA. We build up as much on that front as possible so that they can do natural-language querying against their data to see, hey, what was said during calls? What were the tendencies of calls? They can do all of that; it's freely available. They can monitor it and point out pieces to us.
They do call reviews too, independently, to tell us if a call went great: did it meet their standards, did it not? We haven't had anybody, knock on wood, complaining about hallucination or anything. So we try to put that back onto the customers more and more. — Now, Luca, we heard a little bit about reliability, latency, redundancy. I'm curious about any perspectives from your side around how some of these things feed into the way you're training and thinking about bringing these models to market at Assembly. — Tough question. So, for a start, from an R&D perspective it's very easy to get things wrong. We still get some things wrong, and that's why we have a very close partnership with you guys, to iterate on them really quickly. But I feel like the biggest role is played by just spending a lot of time with customers, understanding how your models perform, going back to first principles, and then designing systems. The thing is, sometimes we really want some very elegant and cool solution. Hey, why don't we do speech-to-speech right away? We wanted to do that in the beginning because it's pretty cool. But at the end of the day it always comes down to: what is the simplest way I can answer this problem? We have published a few papers. Some of the technology we use is not very novel, actually; it dates from around 2015 to the early 2020s, because those methods are very reliable, and it's easy to make sure

### Segment 4 (15:00 - 20:00)

that, hey, we know the performance of the systems, but we also know all the bad things they can do, and that set is very small, right? So it's a very iterative approach we have to take, and unfortunately that sometimes means saying no to cool things. That's the way we can ensure reliability from our perspective. — Yeah. — I'll switch gears a little bit. One of the things from the voice agent report was around what people are prioritizing when they're actually building voice agents and the things they're evaluating: speech-to-text accuracy, conversational understanding, latency, integration capabilities, background noise detection, human-sounding voices, accents and dialects. The list could keep going on and on. But what's interesting is that one of you has an outbound use case and one an inbound use case. Maybe you could talk to the audience a little bit about how, for your particular use case, some of those characteristics matter more for your type of customer. — So for outbound, there are really two things our customers care about from a voice perspective: voice quality and latency. Just for some history, we were previously a consumer app. We actually did a voice agent for consumers to call collectors back, in March of 2023, when we came out of YC, and it had a seven-second latency. It was actually kind of great, because collectors were getting pissed off; they're like, "Hello?" But whatever, we were a consumer app, right? As we progressed, when we pivoted and launched in B2B for banks and credit unions, we had about a three-and-a-half-second latency, and customers were pissed. They were angry because we were at three and a half seconds. Now we're sub-1.6 seconds all in, including Twilio getting it all the way through, not just from a model-execution perspective. And they love it, right? So they care about latency, one. Within that, they care about time to first response: when is your bot actually going to reply back? They care about that first initial pickup. The second thing they care about is voice quality. They want it to sound very conversational, so I think it does matter who you pick as a voice provider. Some have performed better than others. I will say Deepgram has a voice that has done really well for us, and we give our customers options on who they want to utilize. We are starting to leverage Rime a little more too, just to give you an idea of the vendors we're looking at and working with. But really, those are the two key things they care about: how quickly the voice agent replies back, especially at the first part of the conversation, and how good the voice is, so that it actually sounds conversational. Then there's all the other stuff too, right? We were having a side conversation about background noise and how it's been impacting some of our results lately. Those are things that clients don't even think about during calls; they're added things that we care about that clients don't even realize or know about. — Yeah, I agree with all that. I think that we tend to show our clients a transcript. They can always go listen to the call, but most of them don't. Again, when you're very early on and your client tests your agent for the first time, they're going to be hypersensitive to latency and to how the voice sounds.
But when you're thousands, tens of thousands, hundreds of thousands of calls in, they're not going to be listening to those calls. They're going to look at the transcript and ask, sorry about this echo, did this meet my business objective? Did it have the right conversation for my business? So, as both of these guys said, if anything is really wrong, like if your latency is too long, the conversation is going to be off the rails, because the person's always going to be like, "Hello? Hello? Are you there or not?" If the voice is really bad, they're going to say, "You sound like a robot." But if they're not saying those things, then we keep our focus on conforming to their requirements and having the conversation they would want a human to have. — Maybe diving a little deeper on this idea of human QA: what measures success? I'm sure your customers define this differently than you do internally. How are you using different qualitative and quantitative metrics to measure the success of these calls and improve your voice agents over time? — Yeah, it's actually really interesting when you're going into a market where, and we were just having a conversation about this before the panel, it's the CFO of a bank that's never done outbound calls, and he certainly wants all calls to be 100% perfect, and you can't even get that with humans, right? So it's a bit of an uphill battle: he doesn't care about the nuances of quality; he cares about the fact that there was double talk, or that the voice agent didn't reply because it couldn't hear the person over poor cell service. Those are the things he fixates on.
So one part goes back to what I said earlier: it's really important to set the baseline of what you define as success. What is the threshold

### Segment 5 (20:00 - 25:00)

of what you consider poor? For us, we try to tell them: if you think a call is poor because it didn't sound the way you wanted it to sound, it shouldn't be considered a poor call. A poor call is one that is a reputational risk for your business. Having those definitions matters so that you can set expectations with them, especially if they've never done these before, and I'm not even talking about AI calls, just outbound calls in general. That's one part of it. A lot of the quantitative work we've started to do now has been measuring how many technical issues we're dealing with on calls. Now, my dev team, including Julian, who's in the back, wants to kill me because of that. But we want to measure it so that we can see, of all of our connected calls, how many have issues we can actually address versus not, so that for every dev cycle we do, we can say, here's how we're going to attack this. Because quality matters now more than ever; the space is still shiny, and a lot of people still don't know enough about it, but we focus on how many of these are actually quality calls. The other big measure that we have internally, and we share these with clients and they love this, is how many of these calls end in a natural goodbye. So regardless of whether there was double talk or a long delay, what you can show them is: hey, this call ended in a natural goodbye, like a human being had that same conversation. From a quality perspective, it met the bar. We've been measuring that, and it's been a big proof point that these calls are actually going well. — That's pretty smart. — I wish I could take credit for it, but it was Julian and Jay.
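The "natural goodbye" metric described above can be approximated straight from call transcripts: check whether the final turns contain a closing phrase. The marker list here is an illustrative assumption, not the panelists' actual heuristic.

```python
# Illustrative closing phrases; a production list would be tuned per use case.
GOODBYE_MARKERS = ("goodbye", "bye", "have a great day", "take care", "thank you")

def natural_goodbye_rate(transcripts):
    """Share of connected calls whose final turns contain a closing phrase,
    i.e. the call wound down the way a human conversation would."""
    if not transcripts:
        return 0.0

    def ended_naturally(turns):
        tail = " ".join(t.lower() for t in turns[-2:])  # last two turns
        return any(marker in tail for marker in GOODBYE_MARKERS)

    return sum(ended_naturally(t) for t in transcripts) / len(transcripts)
```

Tracked per call type, a rate like this gives clients the proof point the panel mentions without anyone having to listen to every recording.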
— One of the coolest ones I ever saw: we had some calls where the customer would thank us at the end, and I was reading through one of those calls thinking, this is horrible, we've got to fix this bot. And then they're like, "Oh, thank you so much." I'm like, "Really?" So I totally echo that as a good one. I have a terrible non-answer here, which is that honestly, I look at revenue. If you're getting all of a customer's calls and they're paying you more money, they're happy. If they're paying you less money, or you're not getting all their calls, they're not happy. All these other things are more sophisticated answers and can help, and you should totally go look at those, but you also kind of have to start from that truth. — That's true. — All right, I'll end with one last one before we open up to the audience for questions. I'll start with you on this one, Luca. What are you most excited for in 2026 around voice AI? — Making our customers happy, obviously. Well, honestly, we are really close to solving a lot of the foundational issues we have identified so far, and it takes a lot of time and a lot of resources to do so. Again, we're really thankful for all of our partners, because they're working with us and helping us identify those things. I would say the most exciting part is definitely having a system that is really tailored for specific use cases: voice agent use cases, for example, or what we call conversational intelligence, which is more notetakers, medical transcription, ambient devices, and so on. We're putting a lot of resources in that direction, and we are working on some cool stuff as well. Hopefully it works; if it doesn't, you know, we're going to make existing systems work better. I know that didn't sound extremely exciting when I look back on it, but everybody's really excited on that team.
— I mean, the reality is there are so many of these use cases, right? Building models that are purpose-built for specific use cases means they're more context-aware, better accuracy, all that fun stuff at the end of the day. — Exactly. And unfortunately, this field, as you know, comes with a lot of good things but also some hard parts: if you want to improve on one thing, you kind of have to trade off something. An important question we have is: are we okay with trading something off, and what is the thing we want to trade off to perform really well on a specific task? — Have you guys all seen the meme by Andrej Karpathy about how technology has historically been embraced? The internet was started by the government, then B2B, then finally consumers adopted it, and it's the reverse order with generative AI and LLMs. So I think what's really cool about voice is, number one, whether people like it or not, there's this undercurrent being driven by consumers who are embracing voice more and more. Think about every commercial now where Android is talking directly to Gemini,

### Segment 6 (25:00 - 30:00)

right? Apple, whatever they're doing with Siri, whatever is going to happen with it. It was crazy for me, three months ago, walking into our New York office and seeing one of our founding engineers talking while coding. Come on, I wasn't expecting that, right? I fundamentally believe, and it's a biased belief, that voice is going to become more and more a part of how consumers, especially in the banking world, interact. There's no reason the traditional mobile banking app UI should be how you communicate with your bank or try to get an answer; it's going to be through voice. So even beyond 2026, voice is just going to be more and more a part of regular consumers' lives. Look at the Alexa commercials now, with Pete Davidson. They're pushing this down consumers' throats, so consumers are going to demand that this is the way businesses interact back. I'm just really excited about all the different interaction points that are going to happen as a result of voice, with AI included in it. — I want to add a quick thought on that. You know, probably everybody has an annoying friend who always sends voice notes, and you're like, just text me. I feel like everybody wants to be like that when it comes to interacting with computers, including myself. It's just simpler; we're lazy to type. — That's very true. — I mean, Mark Zuckerberg said this to, I forget which CEO, I think it was the Databricks CEO, at their conference last year, LlamaCon. I think he said 97% of all interactions right now, when they're communications, are done via typing, and he said that's not the way we normally communicate with one another. Obviously texting still happens, but when it's important, you call, you talk to people. So they're making a fundamental bet that voice is going to be a bigger factor in people's daily lives.
So that's what I mean: the undercurrent is that consumers are going to force businesses to change the way they're able to interact. — That was really deep and insightful. — I've been thinking about this answer for a long time. — I don't got that. Sorry. — What I do think is pretty cool is that two years ago, only the visionaries would have been in this room, right? None of the rest of us were making voice agents, and now there's like a hundred people here. That's really neat. I hope we continue to build cool things, I hope you all build things, and I think it's going to be really fun in the next year, to bring it back to your question, Ryan, to see what everyone in this room and elsewhere is able to build. It's all still in its infancy. — Awesome. Questions from the audience? — What approaches do you take to minimize interruptions from the agent, and to adapt to different kinds of callers? — So, if the caller interrupts, there are a couple of things I would say here. The ones that get you in trouble are the false starts, the utterances where you probably shouldn't have talked. And the trade-off there is really that you have to wait longer, right? If your latency is bad, if it had actually been long enough that you should have known the person had stopped talking, but your transcription was late, then you're in trouble. But if you're just being too aggressive in how quickly you talk after they talk, the only option is really to wait longer before you talk. The converse of that, if they interrupt you, that one's pretty easy: you just shut up and let them go. — Yeah. I mean, most models are pretty good at the "hey, if they start talking, interruption takes place" part.
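The two behaviors the panel describes, cutting the agent's speech the instant the caller barges in, and waiting out a configurable end-of-thought silence before replying, can be sketched as a tiny state machine. This is an illustrative toy, not any vendor's API: the 700 ms default and the `on_caller_audio` / `may_respond` names are assumptions made up for this sketch.

```python
class TurnTaker:
    """Toy endpointing logic for a voice agent's turn-taking.

    The agent replies only after the caller has been silent for
    `endpoint_ms`, and stops speaking the moment the caller barges in.
    Thresholds are illustrative, not vendor defaults.
    """

    def __init__(self, endpoint_ms=700):
        # Tune per audience, e.g. a longer threshold for slower talkers.
        self.endpoint_ms = endpoint_ms
        self.last_speech_ms = None
        self.agent_speaking = False

    def on_caller_audio(self, is_speech, now_ms):
        """Feed VAD results in; returns 'stop_tts' on barge-in."""
        if is_speech:
            self.last_speech_ms = now_ms
            if self.agent_speaking:
                self.agent_speaking = False  # barge-in: shut up immediately
                return "stop_tts"
        return None

    def may_respond(self, now_ms):
        # Only speak once the caller has been quiet long enough
        # to call it end of thought.
        return (self.last_speech_ms is not None
                and now_ms - self.last_speech_ms >= self.endpoint_ms)
```

Raising `endpoint_ms` trades fewer false starts for slower responses, which is exactly the "500 milliseconds versus a full second" tuning discussed below.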
To Craig's point, it's really more the end of thought, the "um"s. I think a lot of models have gotten better at embracing it. What Craig said is spot on: you've got to find the right timing, whether it's, I'm making it up, 500 milliseconds versus a full second, before you actually allow the voice agent, if it thinks the end of thought has actually taken place, to come back with a response. But it really is a moving target. We've seen it vary from client to client: we've got life insurance companies we work with, that's an older crowd, and they talk very, very slowly. So you've got to adjust the end-of-thought timing so that it doesn't interrupt and confuse the individual. It's a moving target. — I think it often could be slower than you think, right? If you look at natural human speech over the phone, when people say very quick things, like if you ask me a question and I say "no," you're probably going to start talking pretty fast. But after a fully coherent thought, a second, even up to two seconds, is not uncommon for live human beings. So I think we all contort ourselves sometimes trying so hard to be fast, and people who are piloting are probably very sensitive to this, but I feel like I've never read a

### Segment 7 (30:00 - 35:00) [30:00]

conversation where the person churned out of there early because a response took two seconds. So I think you almost have room to chill out, and you've just got to take it. — I think the other thing, too, is that it's easy to over-index on the idea that you have to have it perfect. Ultimately, for every single call, whether it's inbound or outbound, "did you complete the task you were meant to do?" is the ultimate measure, right? That's one thing to remember as you're building this piece out: we all want perfection, that's what we strive for, but level-set the expectation. Did it accomplish what you intended it to do? Maybe there was some redundancy in the conversation. That's okay, as long as it accomplished what you needed it to do. It is a moving target; I can tell you we're still playing around with it on a constant basis. — Hi, I'm ULO. We're building an in-store salesperson. One of our hypotheses is that we can use different voices to drive conversion, so we want to test different voices against conversion results. We're close to getting into the stores; we're launching our pilot sometime next month. I'm just curious if you have experience with, or have done, A/B/C/D testing of different voices on driving your business result, and also personalizing the voice. Basically, it's not one right voice for all of your customers, but different voices for different customers: male, female, someone who sounds younger, someone who sounds older, all that stuff. And also related is the vocabulary you use in your conversation. — Oh, I do. It's an idea I keep floating to our dev team, but nobody will pick it up, so thank you, because he's right there, for stating this. I do think it's a big piece, because we have clients that are specifically in the South, right?
And there's a Southern drawl they may have, or in the Midwest, I'm a fast talker, right? Versus somebody who's here in New York. There are different ways that we communicate. So one of the things I want to accomplish is exactly what you said: using some baseline demographic data to determine, based even on age, the voice that's actually going out there. We haven't tried it yet, but our customers have said this is something they would certainly want to adopt. From a baseline perspective, we've tested male versus female, just that very basic level, and female voices have been performing way better for us than male voices, just quality-wise, response-wise. — For all of your clients? — For all of our clients, yeah. And not just the quality of the voice: even the length of the conversation tends to be better with the female voices we have. Again, no real science behind it. We've done a few different A/B tests of voice types, but beyond that, we haven't really gotten too far with it yet. — Hello. My question is, you mentioned that right now you're working on outbound calls, right? What does the inbound future look like, if you can map it out, or if there's already a map? — Yeah, there is. That's actually the thing we want to focus on this year, because, funny enough, Craig and I were just talking about this. In our space at least, there's about 35% saturation in the banking and credit union space for traditional IVR, AI-based solutions focused on inbound. We took the purpose-built approach of saying we're going to do outbound, because nobody's doing outbound. So we're laying that groundwork. The beauty for us is we're showing ROI quickly, so there's a belief system being built by us.
Now we're getting asked by our clients, "Why aren't you going into inbound?" So that's kind of a beautiful thing. The door is open right now; they're asking us, "Hey, why aren't you doing this for us?" — That is the best question. We all want that. — Yeah. So it's been great, because the results are leading us towards the inbound side. I do think one of the fundamental approaches we've taken, though, is to say that a lot of the reason traditional IVRs have failed, beyond the technology piece, is that there's been no solid knowledge base, no knowledge center, no knowledge management. Obviously, with Gen AI now, it gets much better. And it's funny, because Julian and I were just talking about this on the train ride up here: one of the issues is that you can use Gen AI for knowledge management for your inbound agent to reference, but if the document is out of date, if the information is not relevant, or if it doesn't tie back to the answer that the consumer

### Segment 8 (35:00 - 40:00) [35:00]

wants, how do you know that, and how do you fix it? So, the reason we haven't full-fledged jumped into inbound, and this is a long-winded answer, is that we've now introduced a knowledge base for their internal teams to use in their contact centers, so we can see how solid their documentation is before we introduce inbound. Then we train off of that. — So for outbound, how do you determine if it's a voicemail or a human? And how good are you at it? — This was our entire conversation over there, right? First off, voicemail detection sucks right now. Everywhere, every vendor has been bad at this. So we've struggled with it; we have false positives all the time. Honestly, I wish I had a good answer for you on this. I really don't. I don't know, Craig, you guys have done the automated side. — We run this as a business. We do three million calls a month, mostly outbound, and mostly hitting voicemails. So I think if you see enough voicemails, you can tell a voicemail. But yeah, absolutely, as was said, the out-of-the-box AMDs are pretty naive. They generally just wait for n seconds, and if the other side has spoken that long, declare it a voicemail. That's about it. — I will say, the way we see it is more that it thinks it's a human, so it does its regular conversation. The only scripting we do is on voicemails, so we don't really run into the issue, at least from a QC, quality-check, perspective, where it thinks it's a voicemail and leaves a voicemail message; it's the other way around, where it thinks it's a human, and you can hear it: "Hey, you know, Jake, blah blah blah, how are you doing?" It does that rather than leaving the voicemail. So I would say the false positive is that it thinks it's a human rather than a voicemail, which you would probably rather have than the other way around.
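The naive answering-machine detection Craig describes, wait n seconds, and if the far side has spoken that long, call it a voicemail, fits in a few lines. The function name and the 3-second threshold are illustrative assumptions, and the sketch exhibits exactly the failure modes discussed above.

```python
def naive_amd(first_utterance_duration_s, threshold_s=3.0):
    """Naive answering-machine detection (AMD), as described on the panel.

    A human typically answers with a short "Hello?", while a voicemail
    greeting talks for several seconds straight. Failure modes follow
    directly: a receptionist's long business greeting looks like a
    voicemail, and a terse "Craig Bedo" before the beep looks human.
    Threshold is an illustrative assumption, not a vendor default.
    """
    return "voicemail" if first_utterance_duration_s >= threshold_s else "human"
```

In practice the panelists layer transcription and LLM classification on top of a duration heuristic like this, precisely because the heuristic alone produces the false positives they describe.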
— You'll get that if you call into a business, because you'll hear, "Hello, this is Craig with AssemblyAI, how can I direct your call?" and then the thing will say it's a voicemail. The thing you get for human false positives is when the greeting is just "Craig Bedo," beep. Your transcription vendor doesn't tell you about the beep, so you see two words and think, oh, that sounds pretty human to me. — It's pretty funny, actually. I remember, I think it was you, Ryan, so I'll let Ryan elaborate a little more, but he was showing us a demo a few days ago, and it was like, oh, actually a lot of people have this problem of classifying "is this a voicemail or not." And he orchestrated our transcription with a product we have, LLM gateway, where you can hit a bunch of LLMs with a lot of requests, to get pretty decent accuracy on the classification. But obviously, when you've seen three million of them, you know what's a voicemail and what's not. — Is it based on just the text that you transcribed, or do you do actual sound processing? — Yeah, we've done a little bit of both. I think you can do okay with both sides. It depends: if you want to be fast, you want the temporal information, and then, let's say charitably, I would try to align that with the transcription timestamps. — Right, so on the technical side, how do you manage the accuracy of the conversation? Let's say a call dropped in between. How do you know, on the technical side, what percent of calls went through, or how do you check the overall accuracy? — For calls dropping specifically, or which failure mode are you thinking of? — So let's say a transcript is generated. Do you do analysis of that and check the accuracy, or do you just look at the calls as well? — Do we do transcript-level accuracy checking?
I think that, spot-checking-wise, these folks are probably 90 to 95% accurate, I would say just off the cuff. It's pretty good. Honestly, when I listen to these calls, sometimes people have weird accents and I have no idea what they're saying either. Everyone's doing their best; it's a noisy environment. In general, I have never seen it go so haywire that I feel like I need QA. It's also not so reliable that you should build your system assuming it's perfect, but I haven't found value in quantifying that. So yeah, that's my point: I don't really do much robust QA there. — Yeah, we're not doing any transcription-accuracy checks either. They've been really good with it. Actually, just today we were doing some testing, and Julian was purposely speaking fast, and I couldn't even understand what he was saying, but you guys were able to pick it up. So kudos to you on that. — Can a call go off the rails, right? Let's say a call is going on; how do you check

### Segment 9 (40:00 - 45:00) [40:00]

that it's going well and not going somewhere else? — Yeah, during the live call there's really nothing. I mean, we have alerts to know if a volume of calls is outside of our averages, to tell the team, "Hey, this needs to be looked at," or to alert the customers. We do post-call grading and auditing of every single call by bringing in the transcriptions to review a few different things. One: we're in a highly regulated space, so we check against certain regulations to see whether or not it went off-kilter. The second thing, which I mentioned, is technical issues: based on the transcriptions we receive, we run them through another LLM that does the auditing and the grading, so that we can provide defined reports back to our clients. And the third thing is we open it up to our clients: we've got a natural-language querying tool on our dashboard where they can ask a question like, "How many calls ended with the customer starting to swear?" If it's 15 calls, here's what it is, here are all the calls being pulled out. So we do post-call grading. Nothing live, beyond the alerts that go out to the team if something's going haywire. — One thing I've got to add: in general, when you have non-terminal conversations, it's kind of the customer's fault; they probably were engaging the bot in a weird way. It does totally happen. If your customer is concerned about this from almost a reputational perspective, like, "What are you guys talking about? You were not supposed to be having this conversation," then generally our preference, what we honestly do, is what we sell to our customers so they can sell it to their customers, right?
We kind of run this as middleware, but it's really to try to be more structured or directed, so that this thing is not just responding with anything, ever. It's: I'm going to try up to twice to get this information, and if I don't get it, I'm just going to move on. Because at the end of the day, you don't really want to engage in free-flowing conversation. Often, to Blesson's point, you have a very concrete business goal you're trying to accomplish, and you want to stay on the guardrails of that. Being explicit about that, in what you ask the LLM to do and even what you allow the LLM to do, can be helpful in reducing that variability. — Time for one more? Anyone? Going once, going twice. All right, wait, there's one in the back. One more, last one, and then we'll end. — So one of my questions is: it seems like a lot of what you're implementing is more of a linear approach, but have you looked into using a secondary model in real time for more high-stakes environments? Maybe you have additional filters or additional methods in place that might have slightly higher latency, and then you can block the audio from the base model. Is that an approach you've looked into? Because we're looking into speech-to-speech for the lower-stakes parts of the call, but routing up front to a model with more of a modular architecture. There's a risk in it going to the speech-to-speech model, because there are certain issues with buffering audio in speech-to-speech cases. And the other piece was using multiple ASR models in conjunction. — Sorry, are you saying transcription models? — Yeah. So in real time, one of the things we're looking into is having a secondary model as almost a supervisor agent.
Oftentimes there might be certain limitations of the very small model we can run. The thought here is there are certain things, like compliance infractions, that we'd fine-tune the secondary model on, that we don't want the base model to handle. — So, I certainly have not built a system like that. I mean, we have a lot of redundancy for things like transcription, but in general, in my experience, the answer has been to use the best model with the best guardrails, and if I'm too worried about something going haywire, I will try to just not admit that as a possible thing that could happen, rather than trying to orchestrate a second independent system on top of it. But I have not tried it. — Yeah. The way we do the fixing is not during the call itself, or with a fallback, or any of those pieces, beyond what we've talked about from a parallel-redundancy perspective. It goes back to that post-call auditing and grading: providing that feedback and providing input back into the original voice agent is the loop we've started to build. The secondary thing we do is provide a "coach," that's what we call it internally; I don't know what we'll call it externally yet, but it's going to be a bird name, it's an aviary, right? It basically provides coaching on what the voice agent can do better during those calls. So for us, it's less important to fix it during the call itself; it's more to identify it and then say, what are the plans to fix it later on. I don't know if that

### Segment 10 (45:00 - 45:00) [45:00]

helps. — Well, appreciate all the great insights. Thank you to our panel; we'll give them a round of applause, everybody. We have pizza and drinks in the back. We'll be around till about 9:00, so network, enjoy. Thank you again for coming out.
