Voice agents don’t just transcribe anymore — they think, talk, and call tools in real time.
This Build Hour demos speech-to-speech agents built with the Realtime API and Agents SDK that can handle conversations natively in audio, reason about context, and call tools while streaming speech back to the user.
Brian Fioca and Prashant Mital (Applied AI) cover:
- Why voice agents now: APIs to the real world, expressive + accessible interactions
- Architectures: chained speech-to-text vs. end-to-end speech-to-speech models
- Live demo: building a voice-powered workspace manager + designer agent with handoffs
- Best practices: evals, guardrails, and delegation
- Live Q&A
👉 Follow along with the code repo: https://github.com/openai/build-hours
👉 Check out the voice agents guide: https://platform.openai.com/docs/guides/voice-agents
👉 Sign up for upcoming live Build Hours: https://webinar.openai.com/buildhours
Table of contents (11 segments)
Segment 1 (00:00 - 05:00)
Welcome back to Build Hours. I'm Christine on the startup marketing team, and today's topic is all about voice agents. We have a lot to cover today, so I'm actually joined by not one but two of our solutions architects. — Hi, I'm Brian. — And I'm Prashant. — For anyone new joining this series, we always like to start the session with the goal of Build Hours, which is to help you scale your company with OpenAI APIs and models. Be sure to check out our homepage below for any additional resources. You can also catch up on any of our previous Build Hours and sign up for upcoming topics; we're always adding new ones, especially in line with new releases. Speaking of new releases, we actually released four new updates for building voice agents just this month, to complement our March 2025 audio model releases. What this means is we'll hopefully be having a lot fewer of these frustrating situations, because voice agents are going to sound a lot more like a real representative and a lot less like an automated bot. So over the next hour, we'll give you a whirlwind tour of OpenAI's voice APIs, share some tools and patterns that make them click in production, and then we're going to live-add a voice interface to a regular web app. And as always, we'll end with a Q&A session, so be sure to drop your questions into the Q&A. We have our team in the room with us, as always, to answer as many of these as we can during the session, but we'll also save some for the live session at the end. So without further ado, I'll pass it off to you. — Thanks, Christine. Today we're going to be using the term "agent" a lot, so let's quickly review what we mean by it. Our definition of an agent is any application that is composed of an AI model, instructions to steer that model's behavior, and tools to augment the system's capabilities.
The model, prompt, and tools are all encapsulated in an execution environment whose life cycle is dynamic and can be controlled by the system itself. Therefore, the agent can decide when it has met its objective and stop executing. So with that, let's move on to the topic of today's discussion. I'd like to share first why we at OpenAI are so bullish on voice AI. We really believe that voice agents are at an inflection point, with both voice models and the tools for integrating those models into applications improving at a rapid clip. Another tailwind is increasing user awareness, as evidenced by soaring adoption of voice features in applications like ChatGPT and Perplexity. More users are having that wow moment with voice AI each day, and we believe it's not long before users come to expect voice interactivity in their favorite applications. So what makes this latest generation of voice agents so compelling? We believe it's three things. First, it's the flexibility of these agents. Compared to the older generation of voice agents that were more deterministic, the newer breed can handle a much wider set of intents and deal with more ambiguous situations. Second is their accessibility over text. Just look at how many stories you might have heard about folks using advanced voice mode on their commutes or while walking their dog; I know I'm certainly guilty of this. And finally, it's the level of personalization that voice agents can offer. This is because not only are they highly expressive, but they can also pick up on vocal cues that transcription models drop, such as tone and cadence. Overall, we think of voice agents as APIs to the real world, offering a completely novel way for builders to solve last-mile integration problems. So let's look at the two primary approaches we see out in the wild today for building voice applications.
The first approach is what can be thought of as a chained approach, where you take a speech-to-text model that understands what the user says and turns it into a text transcript. That transcript is then processed by a text-only LLM like GPT-4.1 to produce an appropriate response based on the instructions in its prompt. The response is then passed to a text-to-speech model to produce audio that can be played back to the user. Developers really like this chained approach because it lets you plug and play different models for each part of the pipeline, so you can choose models of appropriate fidelity for each stage.
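The chained architecture can be sketched as a small composition. This is an illustrative sketch, not SDK code: the three stage functions are injected, which is exactly what makes the architecture plug-and-play — each stage can be any model of the fidelity you want.

```typescript
// Chained voice pipeline: speech-to-text -> text-only LLM -> text-to-speech.
// Each stage is injected, so you can swap in different models per step.
type Stage<In, Out> = (input: In) => Promise<Out>;

async function chainedVoicePipeline(
  audioIn: Uint8Array,
  speechToText: Stage<Uint8Array, string>,  // e.g. a transcription model
  respond: Stage<string, string>,           // e.g. a text-only LLM like GPT-4.1
  textToSpeech: Stage<string, Uint8Array>,  // e.g. a TTS model
): Promise<Uint8Array> {
  const transcript = await speechToText(audioIn); // lossy: tone and cadence are dropped
  const reply = await respond(transcript);        // reasoning happens on text only
  return textToSpeech(reply);                     // audio played back to the user
}
```

This also shows why an existing text agent converts so easily: `respond` can be your current agent, unchanged, with the two speech stages bolted on either side.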
Segment 2 (05:00 - 10:00)
And secondly, because you can reuse your existing pipelines. For example, you can convert your existing text agent into a voice agent by putting a speech-to-text model and a text-to-speech model on either side of it. — Feels like a lot gets lost in translation during that, though. — That's precisely right, which is why we are seeing wide adoption of a relatively newer and more novel approach to building voice agents: using speech-to-speech models. These are models that are capable of understanding audio natively, reasoning over what has been said in that audio, and producing audio output tokens that can be played back to the user. These models are super fast, and they're what powers advanced voice mode in ChatGPT as well as our Realtime API. In addition to being fast, these models are emotionally intelligent. This is exactly what you were talking about, Brian. It's because they do not rely on transcription, which is intrinsically lossy and does not preserve the nuances of speech like tone and emotion. Later in this session, we'll delve into some techniques for overcoming known limitations of speech-to-speech models, such as their limited reasoning ability. We'll do so by demonstrating how speech-to-speech models can now delegate hard and high-stakes tasks to smarter models like o3. Moving on, I'd like to shed some light on some of our recent launches that Christine teased at the start. These launches have removed more friction than ever from the process of integrating real-time models. First, we've launched a TypeScript version of our Agents SDK. This new SDK has feature parity with the popular Python Agents SDK, with the added benefit of first-class support for the Realtime API. We'll double-click on this in a moment. Second, we've brought real-time models into the Traces tab in our platform dashboard.
This means that if your voice app is instrumented with the Agents SDK, all input and output audio is automatically logged to the OpenAI platform. This is a huge unlock because of how much it simplifies the process of debugging real-time applications. Remember that speech-to-speech models rely on audio tokens instead of text. Therefore, debugging a bad completion from a speech-to-speech model actually requires you to have access to the conversation's audio, which our platform now enables natively. And finally, we've landed our best model snapshot yet for the Realtime API. Early adopters of our June 3rd snapshot, which include teams like Intercom and Perplexity, report significant improvements in instruction following and tool calling accuracy. We've also added a nifty speed parameter that allows you to more granularly control the pace at which the AI speaks. So let's dig a little deeper into what the new Agents SDK integration really means; we'll be using this a lot in the demo that Brian's going to get into in a moment. Our TypeScript SDK supports all of the same primitives as the Python version, including handoffs. Handoffs are a really key primitive that we'll also double-click on in the next slide. Support for the Realtime API in the Agents SDK means that developers can now turn any agent into a real-time agent with a single line of code. In the code sample on the right, we are initializing a speech-to-speech agent using the RealtimeAgent constructor on line 17. A one-line code change to use a different constructor would allow us to initialize the same agent with a text-only language model. The SDK automatically handles details like using WebRTC when it's running inside a browser or WebSockets when it's running on the server. And let's also take a moment to jog our memory about handoffs. Handoffs are a primitive that we introduced back when we launched the Agents SDK.
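A rough sketch of both ideas on this slide — the RealtimeAgent constructor and handoffs — assuming the TypeScript Agents SDK (`@openai/agents`); the agent names and instructions here are illustrative, not the ones from the slide:

```typescript
import { RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';

const mathTutorAgent = new RealtimeAgent({
  name: 'Math Tutor',
  handoffDescription: 'Specialist agent for math questions.',
  instructions: 'Help the user work through math problems step by step.',
});

// Swapping `new RealtimeAgent(...)` for `new Agent(...)` from
// '@openai/agents' is the one-line change back to a text-only agent.
const greeterAgent = new RealtimeAgent({
  name: 'Greeter',
  instructions: 'Greet the user, and hand off math questions to the tutor.',
  handoffs: [mathTutorAgent],
});

// The SDK picks the transport for you: WebRTC when running in a browser,
// WebSockets when running on a server.
const session = new RealtimeSession(greeterAgent);
// await session.connect({ apiKey: '<ephemeral client key>' });
```

The handoff itself is a tool call under the hood, which the next slide walks through.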
Fundamentally, handoffs allow you to let one agent delegate control to another in a conversation flow. This allows chaining or routing across specialized agents in a multi-agent network. It's super useful for building systems with domain-specific or language-specific behaviors, such as routing between a support agent and a sales agent in a voice application, or between English and Spanish language agents in a translation application. In the code example on the right, we can see a greeter agent that is configured to hand off
Segment 3 (10:00 - 15:00)
control to a math tutor agent. The greeter will actually hand over control by making use of a tool call under the hood when it determines that the user wants help with a math question. So with that, we have all of the foundational pieces we need to get into the super exciting part of today's session, which is the demo. So Brian, it's over to you. — Yeah, I'm super excited about this. We get to make real-time agents in a situation where the audio is being looped in real time for everybody. Okay, so let me switch over to the demo. For the demo today, we're going to be incrementally building a suite of agents that will help me with a real project that I actually have coming up in my life: home remodeling. — Exciting. — Yeah. A few months ago, I moved into my new house, and I love it, but there are some things I'd kind of like to fix. So let's start from the beginning here. — This is also the moment where we are crossing our fingers for the demo gods. — Everything will be fine, no worries. — So, we've built a workspace like you'd see in a typical notetaker app, something like Apple Notes or Google Docs, with tabs. Let's try it out. Okay, I'm going to build a workspace for my remodel overview. Let's do some inspiration: I can go over here and type "inspiration ideas go here." And this is the point in the video where you see: this is the old way. It's really slow, and I have to type a lot. So, you know, we have AI to actually help with this. Let's switch over to the next version and see what we can do with a little bit of an agent here. — Ooh, I see a lot of new UI elements that have just popped up. Do you want to walk us through some of what we're seeing that's additional now on this workspace?
— Yeah, so these are all part of the OpenAI realtime-agents open source repo that this project is forked from. It has a lot more patterns and demos than just this, and this UI is actually really great for trying out real-time agents. You can do things like change the codec to make it sound like you're talking to it on the phone, and there are some controls for how to interact with it. So right here, we're just going to start. I've defined some tools for this agent; let me pull them up. We have a workspace agent, so let's go down here. In the realtime SDK, we've defined a workspace manager agent that has a basic set of instructions on how to set up a workspace and what a conversation flow would look like, and it has tools. The tools are connected to the UI using a context, and the agent can call these tools to add tabs, set the selected tab, and do all the things you'd want to do while building out a workspace. — Sounds like all of the actions that you were taking manually before are now tools. Is that right? — Yep, that's right. Okay. So if I go back here and I say: set up a workspace for a small kitchen remodel. The agent will make use of its tools to fill things out, and you can see that it's already going a lot faster. If I were to type this in myself, it would take forever; this is what AI is for. It's generating all of this for us. Okay. So, this is still pretty slow, and typing to it is a little annoying. Wouldn't it be better if we could just talk to it? — That's how I ideate. — Totally. So, let's try that out. Hi. — Beep beep. Greetings, human. How may I assist you today? — Okay, cute little agent. Yeah, hi. Let's set up a workspace for a small kitchen remodel. I want tabs for inspiration, project plan, and budget. — Beep beep. Workspace tabs initialized. Where shall we begin, human? — Yep. So, that's great.
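UI-connected tools like the ones being described might look roughly like this, assuming the SDK's `tool()` helper with zod parameters; the tool names and the plain `workspace` object standing in for the real UI context are illustrative, not the actual repo code:

```typescript
import { tool } from '@openai/agents/realtime';
import { z } from 'zod';

// Stand-in for the context object that backs the real UI.
const workspace = { tabs: [] as string[], selectedTab: null as string | null };

// Tools the workspace manager agent can call to drive the UI.
const addTab = tool({
  name: 'add_tab',
  description: 'Add a new tab to the workspace.',
  parameters: z.object({ title: z.string() }),
  execute: async ({ title }) => {
    workspace.tabs.push(title);
    return `Added tab "${title}".`;
  },
});

const setSelectedTab = tool({
  name: 'set_selected_tab',
  description: 'Switch the workspace to an existing tab.',
  parameters: z.object({ title: z.string() }),
  execute: async ({ title }) => {
    workspace.selectedTab = title;
    return `Selected "${title}".`;
  },
});
```

The return values matter: they're what the model sees after each call, so it can narrate what it just did.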
Cute little agent. So, it's still better, but it's not very smart, and it's not really clear what's happening when it's deciding to call the tools. A better user experience would be if the agent were to actually tell
Segment 4 (15:00 - 20:00)
me what it's doing while it's running. — I mean, the robot voice is kind of cute, but I'm not sure I'll be able to ideate all that much with that thing distracting me. — User preferences. All right, I guess we are making it smarter. So, sure. Okay, so I just switched over the prompt in the background in a couple of places to tell the voice agent to use filler phrases before tool calls to let the user know what it's about to do. One tip about this: if you have a function that's going to run particularly long, you can, in the function description, tell it to let the user know it's going to take a bit and to hang on a sec. So let's try this one out. — Hi there. What can I help you with today? Thinking about setting up a workspace for a project, or something else? — Sounds better. Hi. Yeah, I want to set up a workspace for an interior design project for my kitchen. — Great. Let's get started on that. I'll set up a workspace tab for the project overview first. Let me update the workspace. One second, please. We've got a space to outline the — Actually, can you make some more tabs, too? Like a schedule tab and, I don't know, a task list tab. — So, you can actually interrupt this while it's talking. — Let me add a schedule tab and a task list tab. — One moment while I update those. — A lot faster to iterate. — Did you have to do anything special to make the interruption handling work here, or is that something you get out of the box? — It's built into the SDK. I'm not demoing that during this live stream because of the loopback for the stream, but if I were to turn off push-to-talk, I could sit here and just iterate on it, interrupt it, tell it to change the tab names in real time, and all of that. It's actually a really fun flow; I wish I could show it off better. — Sounds like a really great way to come up with new ideas and, you know, take your kitchen to the next level. — Yeah.
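The two prompting tweaks just mentioned — filler phrases before tool calls, and a slow-tool warning in the function description — might look something like this. Both strings are made-up examples, not the actual prompts from the demo:

```typescript
// Illustrative instruction fragment: narrate before every tool call.
const instructions = `
Before calling any tool, say a short filler phrase such as
"One moment while I update the workspace." so the user knows what is coming.
`;

// For a long-running tool, put the hint in the tool description itself,
// so the model warns the user before making the call.
const searchDescription =
  'Searches the web for inspiration. This can take several seconds; ' +
  'before calling it, tell the user it will take a bit and to hang on.';
```

Putting the warning in the description scopes it to that one tool, so quick tools stay snappy while slow ones get a heads-up.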
So we have this workspace builder, and I can make a workspace that happens to be for my kitchen redesign. Let's make it more useful. I've gone ahead and set up a second agent. — What does this one do? — This next agent we're going to show you is a designer. I prompted it to be an expert in interior design topics. This lets the AI focus on a single task area, which will definitely improve the quality of the results it gives us. — That's really interesting. Before this, you mentioned editing the prompt of an existing agent to modify its behavior, and it started narrating what it was doing; that was really useful. How did you decide that the designer needed to be a separate agent? Why not just add a bunch of design-specific instructions to the existing agent and call it a day? — That's a super good question. As a best practice, it's definitely a good idea to break up agents along well-defined roles. We have a workspace agent that's really good at making a workspace, but we know this wants to be a design workspace. So we can make a designer agent that focuses on its design task by narrowing down its latent space and its autocomplete potential, so that the designer agent will only talk about design things, which gives us a much richer result for what we're actually trying to build. — I love that. It's also always great when we follow our own best practices, like we have up on the slide here: start small with one agent that does a specific task, then incrementally add complexity, which sounds like exactly what you're doing, Brian. — Yeah, that's totally right. So, let's take a look at the code for this. All right, we have a designer agent now that I've added to our code. It has a different prompt: it's an expert interior designer. It has a different conversation flow.
It has some indications for letting the user know that it's calling tools, but it also has a different set of tools. This agent has a make-workspace-changes tool, which totally encapsulates all of the tools that the workspace manager itself has, so it only has to think about changing the workspace all at once instead of managing the individual tools. I've also given it a search-the-web tool, which it can use to search for inspiration ideas, images, and things like that. And the way these work is that we've wrapped them in a GPT-4.1 text model. So this voice agent can make a call back to the server using the Responses API and basically call an agent, or a model, that knows how to do more complicated things than we want this agent to even care about. Right? We want the designer to talk about design things and then say, okay, I'm going to jot this down, and the workspace gets
Segment 5 (20:00 - 25:00)
updated. — That's really cool. So it sounds like we're freeing up more of the design agent's head space to just think about design, and really simplifying how it has to edit the workspace, add tabs, and add content there. Is that a good way of thinking about it? — Yeah, that's totally right. Okay, so let's go back to the demo. Let's reset the workspace. — Hi there. What kind of workspace are you looking to set up today? — You guessed it. Hi. Let's set up a workspace for my new kitchen remodel project. — Got it. For a kitchen remodel project, we'll need tabs to organize inspiration, budget, timeline, and materials. Let me set that up for you. We've got dedicated tabs now for inspiration. — Okay, so you can see we started with the workspace agent like before. It did a pretty good job of setting up our workspace, it cleaned everything up, and then it knew to hand off to the designer so it can run through the design process we actually want. So let's test out this designer's tools. What are some of the latest trends in contemporary kitchen design? — Let me search the web. One moment, please. — Yeah, so it's going to go off and search the web, which takes a little while. The 4.1 model in the background knows to look for results and format them back in a way that the designer — Some of the latest trends for contemporary kitchens in 2025 are... — That's great. Can you add some of those to the inspiration tab for me, please? — Let me update the workspace. One second, please. — So now this agent is just calling make workspace changes, and the model behind the scenes knows to translate that call into all of the things it needs to do to actually update the workspace itself. — That's really interesting. Can you tell us what the arguments are, or what we're passing into the make workspace changes function? — I've added those trends.
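The delegation pattern being demoed — one coarse tool on the voice agent, a smarter text model behind it deciding the concrete edits — can be sketched as below. The model call and edit commands are injected stubs here so the sketch stays self-contained; in the real app the call would go to GPT-4.1 via the server-side Responses API:

```typescript
// Delegation pattern: the realtime voice agent exposes one coarse tool,
// and a smarter text model behind it decides the concrete workspace edits.
type ModelCall = (prompt: string) => Promise<string[]>; // returns edit commands

async function makeWorkspaceChanges(
  request: string,                   // e.g. "add the trends to the inspiration tab"
  callTextModel: ModelCall,          // in production: a Responses API call to GPT-4.1
  applyEdit: (edit: string) => void, // the workspace manager's own tool calls
): Promise<string> {
  const edits = await callTextModel(
    `Translate this workspace request into edit commands: ${request}`,
  );
  for (const edit of edits) applyEdit(edit); // e.g. set tab content, add a tab
  return `Applied ${edits.length} change(s).`; // short result for the voice agent to narrate
}
```

The voice agent only ever sees the one tool and its short result string, which is what keeps its job simple.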
— We can see here that the designer passes it what tab to change and roughly what it wants to do. Then the actual workspace manager makes its own function calls; in this case, it just sets the tab content and hands it back. But there's more complexity behind the scenes for the second agent, so the first one can basically just hand it off. What's also worth noting is that the context of the conversation can be passed along. So if the real-time agent gets a bit lost in just handing off to this other agent, you can tell the other agent to read back over the conversation and make sure it's actually doing what the real-time agent said it was going to do, instead of just what it was told. — That's really cool. So this is one of the ways we can co-mingle a real-time model, which is really conversational and expressive, with a smarter model, which may be slower, so it's not suited for all of the conversation, but you probably need it for high-stakes tasks. — Definitely. And you know, if we had even more complicated tasks to do, we could hand them off to o3 or something a lot smarter. And those agents could actually hand off in the background too, if we wanted. Okay. So that's pretty simple; we have a pretty basic designer. Let's improve it. This is a very open-ended conversation; there's nothing here that screams "good redesign process" to me. So let's actually upgrade our designer agent. I'll show this tab. Let's do a split here. Our old designer prompt was pretty basic, and this new version is actually built using a metaprompt, the voice agent metaprompt. This is a prompt designed by our co-worker Noah, who's awesome. Hi, Noah. It's included in this repo, and also in the OpenAI realtime-agents repo.
You can run it in the playground, filling in some details about what agent you want to build, and it will actually build out this prompt that I'm going to show you here. It builds out the agent's identity and its tasks, demeanor, tone; you can really personalize this thing. And I've found, working with this workspace agent and this designer agent, that if you make it fun to work with, you get really hooked on working with your agent. It's almost like working with a fun co-worker that you really like to be around. So now what we're going to do is program in the ideal workflow for this designer agent. So we have conversation states. It starts with a greeting. It goes on to
Segment 6 (25:00 - 30:00)
build out the workspace, gathers inspiration, and then sets up requirements. One thing that I've built into this agent is a handoff to another agent. In a design process, you know, you have the designer collecting inspiration and coming up with a plan, but at the end you have to figure out budget and scheduling. We want to isolate that. So I've actually gone ahead and made an estimator agent that the designer can hand off to. We might not go into that much detail on this in this short little demo today, but I've left it as a stub for people to use; the estimator has its own tools, like calculating and those sorts of things. — That's really a great callout, because I think what's happening here is that by adding the handoff to an estimator agent, you're also telling the designer agent what not to do. In addition to handing off to this other agent, it knows not to get into estimation even if the user asks. — That's a super good point. I don't think I'd even thought about that. Okay, so let's try out this new agent. Set up my workspace here. — Oh, let's switch back to the right tab. — What workspace are we building? — Hello there. I'm super excited to help you set up an awesome workspace. What kind of project or workspace are you looking to build today? — Hi. I bet you're really excited to set up a design workspace for my kitchen remodel project. — Absolutely. I am totally pumped to help with that kitchen remodel project. Let's get this workspace set up so it's super organized and ready to — Yeah, those sound good. Go with those tabs for me, thank you. — Okay, so it's going to set it up. — That's interesting. So, even though it hadn't said everything out loud, we could see it in the interface. — That's a really good point also. — So, for your kitchen remodel, can you tell me — So, I'll talk over that for a second. Yes.
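The conversation-state workflow being described — greeting, build workspace, gather inspiration, set requirements, hand off to the estimator — is typically embedded in the prompt as structured data. A sketch of what the metaprompt might produce, with illustrative state names and descriptions:

```typescript
// Conversation states: an ordered workflow the agent is instructed to
// follow, embedded in its prompt as JSON by the voice agent metaprompt.
interface ConversationState {
  id: string;
  description: string;
  transitions: string[]; // ids of states this one may move to
}

const designerStates: ConversationState[] = [
  { id: 'greeting', description: 'Welcome the user and learn about the project.', transitions: ['build_workspace'] },
  { id: 'build_workspace', description: 'Set up the workspace tabs.', transitions: ['gather_inspiration'] },
  { id: 'gather_inspiration', description: 'Search for and record inspiration.', transitions: ['set_requirements'] },
  { id: 'set_requirements', description: 'Capture concrete design requirements.', transitions: ['handoff_estimator'] },
  { id: 'handoff_estimator', description: 'Hand budget and scheduling off to the estimator agent.', transitions: [] },
];

const statePrompt =
  `Follow these conversation states in order:\n${JSON.stringify(designerStates, null, 2)}`;
```

Making the last state an explicit handoff is what tells the designer agent to stay out of estimation, even if the user asks.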
It takes a little while for the real-time voice agent to finish talking. You can set the speed in the API, but this is a really good chance for your UI, if you're streaming, to update the text so the user can read ahead and decide to interact back and forth. The agents don't mind if you interrupt them; they're totally happy with that. It makes for a good collaborative experience, and you can go lightning fast that way. — I love that, because I feel like I can read a lot faster than I have patience for listening. Maybe I need to work on that, actually. — Oh, maybe. Okay, so let's run through an actual interior design workspace creation. I'm going to throw it over to you two: do you have something that you want to redesign? — Well, I don't have it yet, but I'd love to have a redesigned balcony overlooking the bay. — That sounds good. What about you, Christine? — Yeah, let's actually check the Q&A and see if any of our listeners tuning in have any suggestions. — Okay. Well, let's start with the balcony design part, and then we can come up with other inspiration on top of it. — Yeah. — Sound good? — Let's do it. — Okay. Actually, what I want to do is upgrade and redesign my deck overlooking the San Francisco Bay aspiration. — Oh, we saw a little bit of a transcription bug there. Let's start this over again, because we're actually running into a sound issue. Here we go. — To help you out today, what kind of workspace? — Yeah. Set up a workspace for a redesign of a balcony overlooking the San Francisco Bay. Oops. All right, let's just type it out for now. Okay, so it's going to set up the workspace for us. I'm really excited to see what this looks like. Whenever we involve the microphone during the stream, this is what we get. Okay. So, let's talk about inspiration.
— So it's actually walking through the workflow that we talked about before. So, is "California coastal" a thing? — Uh, it can be. We can make it a thing. — Okay. — I also really like the fact that I don't think you did anything additional to add this textual interface. We kind of get this for free, right, with the speech-to-speech model? It can just take audio and text in, and produce audio and text on the other side.
Segment 7 (30:00 - 35:00)
— That's totally right. You can call send message, and it will send it to the agent, and it doesn't really make a difference. You lose a little bit of the, obviously, the EQ of the tone you're trying to get across. — Let's see. Colors. Do we have any feedback? — We have no color requests, but someone did request redesigning the garage to be a gym with a sauna and a cold plunge. — I actually kind of like that better. — Yeah. — Let's see if we can get the voice to work again. — Oops. Hi. Yeah, so let's actually change gears. We want to redesign the garage to be a gym with a sauna and a cold plunge. — That sounds kind of good. We've lost total audio fidelity in this loopback. Okay, let's move on; you kind of get the idea. The way this would actually work is it'll finish the script that we have, and then it would hand off to the estimator, and the estimator has its own tools that it can use to, you know, calculate the budget. Now, one of the things I've done with the estimator is given it access to Code Interpreter. It has a calculator function that you can hand a schedule and a budget to, and it will actually write a Python script to run through all those calculations for you and then hand the results back so you can have this conversation. — That's really cool. I can even imagine we could upload a bill of materials or something, or if we had a supplier, we could probably get some cost sheets and get a really accurate estimate out of that. — Yeah, totally. Cool. So, what else can we do here to make sure the agent workflow is stable and bug-free? — Super glad you asked. Let me switch over to our final demo branch. I've written some code. So, you know, "evals are all you need" — that's what we like to say around here.
Evals can be a little daunting to write, especially for a TypeScript project like this. So I've gone ahead and written an integration test. This integration test stubs in the tools that the agent has, the workspace tab tools, and then it calls our workspace manager agent with a script that's basically pretty predictable: make three tabs. You can then run this as a Jest test to make sure that the agent is actually creating the tabs that it should. Now, this will run 100% of the time. If you want to run more complicated workflows, you could build this out more and test your designer agent with a model-graded test, having the model look through the conversation. So you could run through mock conversations with your designer agent and then grade it based on whether the designer agent followed the workflow. We highly recommend switching to this process once you get to a stable point in your codebase. — So, let me switch over to here. That makes a lot of sense for evaluating the thinker model, if you will. This was a way that we can evaluate the 4.1 model that's actually making changes to our workspace, right? — Mhm. — Really cool. And what we have on the slide here is one of the success stories we've seen with a customer, Lemonade. The Lemonade team actually invested a lot in evaluating the agent's performance early on, and this allowed them to go to production with a lot of confidence. What you see on the right is a custom interface they built for capturing audio on their platform and then running that through evaluations with human review as well as some automated scorers. So we teased at the beginning of the session, Brian, that we have some newly added support for this in the platform. Do you want to talk us through what that looks like? — Yeah.
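The core of that integration-test idea can be sketched without a live model: stub the workspace tools, record what gets called, and assert on the calls. These helper names are invented for illustration; in the real test, the recorded calls would come from running the workspace manager agent against a scripted utterance.

```typescript
// Stub the workspace tools and record every call the agent makes.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

function makeToolRecorder() {
  const calls: ToolCall[] = [];
  const addTab = (title: string) => {
    calls.push({ name: 'add_tab', args: { title } });
  };
  return { calls, addTab };
}

// Deterministic check: did the run create exactly the expected tabs?
function createdTabs(calls: ToolCall[]): string[] {
  return calls.filter((c) => c.name === 'add_tab').map((c) => String(c.args.title));
}
```

In Jest this becomes `expect(createdTabs(recorder.calls)).toEqual([...])`; for less predictable flows, a model-graded check over the whole transcript replaces the exact-match assertion.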
So let's go over to the Traces tab. While I've been running this, if you look at the platform, let's go into the dashboard and pull up Traces. You can see all of the real-time sessions that have been running during our demo. And this is one where you can see that the audio in is recorded here, as well as the audio out. So you can listen to what was said and play back the audio that the agent produced. You can also see all the tool calls that were made; in this case, we added a workspace tab for materials and finishes. And you can debug this and trace this all the way out. This is a great start for debugging. That's what we have right
Segment 8 (35:00 - 40:00)
now. And soon we're going to have the ability to turn these traces into evals on the platform. So as you go, you can run through trace logs in production, find ones that are either really good examples or ones you don't like, and turn those into evals to help reinforce the flywheel of building out these real-time agents. — That's really cool. So you don't really need any custom tooling anymore to capture the audio, right? Can we actually listen to the audio? We don't have to do it now, but I'm wondering if the platform supports it. — Yeah. So let's see. You can — Oh, cool. So this is actually something which a lot of — I really want to redesign my kitchen. Can you set it? — So that was me. — Nice. Yeah, this is something that a lot of teams who were early adopters of this API had to build out themselves. So it's really cool that we're bringing it to the platform, making it easy and accessible for human review now, and for evals in the near future. — Yeah. What are some other things we could do for maintaining stability in our voice applications? — Yeah. So, I mean, launching this in production is a little bit scarier than giving it to you guys to test out. So how do we make sure that our agent is safe to release into the wild and doesn't go off script? One thing the Agents SDK gives us is the ability to run output guardrails. In this example, you can set an output guardrail that tells the agent not to talk about certain things. The way it works is, the guardrail runs on the transcript as it's being generated. And the transcript comes in faster than the agent's voice will actually finish. So while the agent is speaking, the transcript runs ahead, like we saw before, and the guardrail runs on the transcript and checks to make sure it's within the moderation constraints you set up.
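The streaming guardrail check just described can be sketched like this. A real implementation would use the Agents SDK's output guardrail support with a moderation model; here the classifier is a trivial keyword check standing in for that model, and the topic list is invented for the demo's interior-design scenario.

```typescript
// Hypothetical sketch of an output guardrail running on the streaming
// transcript. The keyword classifier is a stand-in for a real
// moderation check; OFF_TOPIC is illustrative only.

type GuardrailResult = { tripped: boolean; reason?: string };

// The "moderation constraint": the designer agent should only talk
// about interior design.
const OFF_TOPIC = ["manufacturing", "formulation", "medical advice"];

function runGuardrail(transcriptSoFar: string): GuardrailResult {
  const lower = transcriptSoFar.toLowerCase();
  for (const word of OFF_TOPIC) {
    if (lower.includes(word)) {
      return { tripped: true, reason: `off-brand topic: ${word}` };
    }
  }
  return { tripped: false };
}

// The transcript streams in ahead of the spoken audio, so the check
// can interrupt playback before the agent finishes saying the line.
function onTranscriptDeltas(
  chunks: string[],
  interrupt: (reason: string) => void
): boolean {
  let transcript = "";
  for (const chunk of chunks) {
    transcript += chunk;
    const result = runGuardrail(transcript);
    if (result.tripped) {
      interrupt(result.reason ?? "guardrail tripped");
      return true; // stop streaming audio
    }
  }
  return false;
}

// Example: the protein-bar request from the demo trips the check.
const tripped = onTranscriptDeltas(
  ["Sure, let's plan the ", "formulation and manufacturing steps"],
  (reason) => console.log("interrupted:", reason)
);
console.log("tripped:", tripped);
```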
— That's really interesting. So what if the moderation constraints are tripped? Do we just end the conversation, or do we have other options for how to handle that situation? — If you set it up right and you give the guardrail feedback about why it interrupted, or why the moderation event triggered, it will send a message back to the real-time agent about what happened. — Which will then interrupt the agent and let it correct itself. Typically it says, you know, "I'm sorry, I can't talk about that." — Nice. So it's like a feedback mechanism for the agent. — Absolutely. So we have a demo for that. Let's pull that up. Turn off that, switch back to this tab. Set. All right. So I worked on this really cool agentic application in the past called zuka.ai. And what I've done here is I've actually set up a guardrail so that our workspace redesigner, or designer agent, will only talk about interior design. — So let's see if I can get the guardrail to trip here. — Should I risk it and try to push the topic? — Let's do it. Let's do it one more time. — Hi. Can you set me up a workspace for formulating and manufacturing a new turmeric and chocolate protein bar? — Oh, it certainly didn't. — We still don't have it with us. That's okay. — So let's just copy and paste this in. — Seems like a really interesting venture, Brian. Do another build hour just on this. — I totally would love to. Okay, here we go. So we have our guardrail tripped. I'll stop it so I can show you what's happening. It started talking about formulation and manufacturing, the guardrail tripped, failed it, and flagged it as off-brand. — And then it basically apologized for saying the wrong thing and proceeded with the conversation. — Really cool.
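The feedback loop just demonstrated, where the guardrail interrupts the agent and tells it why so it can apologize and steer back, might look roughly like this. The session object and message shapes are invented for illustration; the Agents SDK wires this up for you.

```typescript
// Hypothetical sketch of the guardrail feedback loop: on a trip, the
// in-flight response is interrupted and a corrective message is
// injected so the agent can acknowledge and redirect. All names here
// are invented stand-ins for SDK internals.

type SessionMessage = { role: "system" | "assistant"; content: string };

class MockRealtimeSession {
  messages: SessionMessage[] = [];
  speaking = true;

  // Stand-in for cancelling the in-flight audio response.
  interrupt(): void {
    this.speaking = false;
  }

  // Stand-in for injecting a message into the conversation.
  send(message: SessionMessage): void {
    this.messages.push(message);
  }
}

function handleGuardrailTrip(
  session: MockRealtimeSession,
  reason: string
): void {
  // 1. Stop the agent mid-sentence.
  session.interrupt();
  // 2. Tell it why, so the next turn apologizes and steers back.
  session.send({
    role: "system",
    content:
      `Your last response was blocked: ${reason}. ` +
      `Apologize briefly and steer back to interior design.`,
  });
}

const session = new MockRealtimeSession();
handleGuardrailTrip(session, "off-brand topic: manufacturing");
console.log(session.speaking, session.messages.length);
```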
And I can easily imagine adding some sort of checkpointing to this product, where we could roll back all the changes we actually made in the workspace, right? We probably don't want the user to see everything we're seeing on the workspace right now. Probably some easy solves for that. — Yep. — Really cool. — All right. — So that's it. There are ways we could have extended this if we wanted to keep going. We could have added image gen for the redesigner to actually come up with inspiration or mood boards. — I thought you said you were going to do that, but that's okay. — We could add
Segment 9 (40:00 - 45:00)
in deeper domain functionality with file storage or MCP tools. So our redesign agent could connect to pricing and material supply information for estimation. And then you could also, I really wanted to do this, set up Twilio so you could actually call up your agent on the phone on your way to Home Depot and figure out what supplies you could get on the way. — Sounds like you have a full backlog, Brian. I'm not sure whether you'd get to this first or the home remodel. — Oh, totally. — Okay, awesome. Thank you, guys. We have some resources for you. I saw some questions in the Q&A that we're now going to move into and answer live, and we have a good amount of time for that. All the links you see on the screen, we'll follow up via email and send them to you, so don't worry too much about copying these down. So let's get into the Q&A. We have a few questions we selected. — Perfect. Which ones did you — Okay, so the question is: the Realtime API has many configuration options. Which ones did you use for the demo, and how did you pick the parameter values? — Right. So some of the configuration options you can choose from are which real-time model you want to use. For the demo I used the most recent one, from June 3rd, which is a lot better at instruction following. You can choose which transcription model you want to use; we used Whisper-1 for this one. And you can choose the audio codec that you send. In this case we just went with the highest-resolution one because we were doing it live, although maybe if I were using the phone it would have worked through that loopback. And then some of the other parameters are things like the VAD. Do you want to talk about the VAD settings? — Yeah. So we have a couple of different types of voice activity detection that we support on the Realtime API.
Previously, we had a sort of naive system which just listened for audio, or lack of audio, and that's how it detected the ends of turns. I believe that's called the default; I'm not sure. And then there's a newer voice activity detection system called semantic VAD, where we actually take into account the content of what's being said. So if you say something along the lines of "my name is" and then give a long pause, semantic VAD should understand that you're not done speaking yet, and it'll hold back the model long enough for you to complete your sentence. There are also some other values we can talk about. Temperature is one that comes to mind. How do we think about temperature for this model? — So I think about temperature in terms of how creative I want it to be with its responses versus how strictly I want it to follow the instructions and best practices I set out. — So if you want it to stick to a script pretty tightly, I would keep the temperature pretty low. But if you want it to be better at ideation and creativity, I would set it a little higher than that. — Yeah. And for the real-time models we do recommend using a temperature between, I believe, 0.8 and 1.1, or something like that; it's in the documentation. We don't support the full temperature range for the real-time models simply because it causes problems with the audio tokens. Next question. — Yep. — I should refresh. Okay: what's the best way to implement a speech-to-speech voice agent for a mobile app? — Yeah, this is really interesting. I think the fact that we have WebRTC support is the biggest unlock here. Do you want to pull up the system architecture diagram, maybe, and we can talk about how all the different pieces communicate with each other?
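Pulling together the options just discussed, a session configuration might look roughly like this. The field names follow the Realtime API's session config in spirit, but treat the exact keys and values as illustrative rather than a copy of the real schema.

```typescript
// Hypothetical sketch of a realtime session config combining the
// options discussed above. Keys and values are illustrative, not
// the authoritative schema.

const sessionConfig = {
  // Most recent realtime model at the time of the demo (June 3rd).
  model: "gpt-4o-realtime-preview-2025-06-03",
  // Transcription model used in the demo.
  input_audio_transcription: { model: "whisper-1" },
  // Highest-resolution audio format, since the demo ran live in a browser.
  input_audio_format: "pcm16",
  // Semantic VAD weighs the content of speech, not just silence,
  // when deciding whether the user's turn has ended.
  turn_detection: { type: "semantic_vad" },
  // Realtime models support only a restricted temperature range;
  // keep it toward the low end to stick closely to the script.
  temperature: 0.8,
};

console.log("options set:", Object.keys(sessionConfig).length);
```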
— So yeah, using WebRTC to connect your client device, which in this case will be the mobile phone, directly to the inference server is probably going to be your best bet. This will eliminate a network hop, so you don't have to send traffic through your server. Another thing it'll allow you to do is significantly speed up the integration time. I think we're not sharing the screen, maybe. — Okay. — So yeah, we'll pull up a quick diagram which shows how the demo we saw today works, and hopefully that can help you map onto how it might work in a mobile setting. In this diagram we can see the user on the left, and then our application is everything else. What runs on the client is really the manager agent as well as the designer agent, which Brian designed, and there's also some workspace state that lives on the device, or on the
Segment 10 (45:00 - 50:00)
client. This workspace state can of course be saved to a database, etc.; we're not really representing that in this diagram, but there are a bunch of components you can add to the server as well. In our example, we have the workspace editor agent, which I think we exposed as a function called make_workspace_changes. — And this is really the 4.1 model, which is a smarter model. It's being delegated to by the designer agent to make changes in the workspace, and it streams those tool calls back to our client, where we apply them. The interesting thing to note here is that the manager agent and designer agent are using the Realtime API, and they directly connect the client, which is our browser in this case, to the Realtime API on the OpenAI inference servers. So there's no server in between the client and the AI model when we're just interacting with the voice, which is why it's so fast. — Yeah. So what that means is, if you don't have anything too complicated, like we were showing before, you can hand off between different agents, the workspace agent and the designer agent, directly on the client. You can tell the designer, hey, when you want to make workspace changes, hand off to the manager agent; it can stay on the client, do its thing, and then pass back to the designer. So you can keep those functional separations intact. One other thing is that the client gets an ephemeral token from the Realtime API, which is secure, right? You can set a TTL on it so it won't last too long and people can't hijack it. And if you want to, you don't even have to build a server for the mobile app; you can have it run directly through the client. — Nice. Is the reason we have the workspace editor on the server that we don't really have that support for ephemeral tokens for these text models?
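The ephemeral-token split just mentioned (a short-lived client secret for the browser's realtime connection, the real API key staying server-side) can be sketched like this. In production the server would call OpenAI's realtime sessions endpoint to mint the client secret; here `mintEphemeralToken` is a local stand-in so the flow is testable.

```typescript
// Hypothetical sketch of the ephemeral-token flow. mintEphemeralToken
// is an invented stand-in for the server-side call that exchanges the
// real API key for a short-lived client secret.

type EphemeralToken = { value: string; expiresAt: number };

// Server side: the real API key never ships to the browser. The TTL
// keeps a leaked token from being useful for long.
function mintEphemeralToken(ttlSeconds: number): EphemeralToken {
  return {
    value: `ek_${Math.random().toString(36).slice(2)}`,
    expiresAt: Date.now() + ttlSeconds * 1000,
  };
}

// Client side: checks the token is still valid before opening the
// WebRTC connection straight to the inference server (no extra hop).
function canConnect(token: EphemeralToken): boolean {
  return Date.now() < token.expiresAt;
}

const token = mintEphemeralToken(60);
console.log("usable:", canConnect(token));
```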
Right, yeah. So on the server you would set up your OpenAI API key, and then you could run it protected, just like you normally would, and give it a lot more control over what it can do. And you get full use of the Responses API that way too, where you can plug into image generation, MCP, and all sorts of things. — Really cool. Okay, we can go to the next question. — Great. So the question we have is: are speech-to-speech models useful in scenarios where you'd mostly like to do speech-to-tool-calls? — Let's see, are speech-to-speech models useful for just tool calls? I think so. You could write a really simple phone-based application where you call a phone number on Twilio and have the real-time model make a whole bunch of calls in the background, right, and just tell you what it's doing. One of the ideas I just had on the spot is you could set up an agent to manage your home automation. So you could call up your agent and say, "Hey, can you set my thermostat to, like, 72 degrees, because I'm heading home." And that way, the agent, using a responder/thinker pattern, can just talk to you and hand off all of the complicated logic to the tools to do the complicated things. — Yeah, and I would add that this is perhaps not the only thing they're useful for, right? There are a lot of interesting use cases which I don't think were really possible pre-speech-to-speech models. The one that comes to mind most prominently for me is language coaching. One of our customers, called Speak, actually has this role-playing agent that they've built using the Realtime API, where they actually give you feedback on how you're pronouncing words.
And it just helps you become better at a second or third language. This is something you just couldn't do with the older class of models, because you would lose that pronunciation in the transcription phase, right? — Yeah. — So I'd say speech-to-speech models are also useful where these softer aspects of vocal communication are really important, where you really want to be highly attuned to the user's emotion and state of mind. — Yeah. — And tools are just a way to give them more capabilities. — And that's why I picked the designer scenario for this: it's a very emotional thing. I want my agent to pick up on what I really care about while I'm redesigning my kitchen. Like, okay, I don't care about this, but I do care about that. So it'll actually spend more thought cycles
Segment 11 (50:00 - 54:00)
actually doing a better job based on how I'm talking. All of our products should have voice mode. — It also sort of mirrored your energy, you know; when you got super excited, it responded in a more excitable way. And that's also something that only speech-to-speech models can do. — I said it before, but I really got hooked on working with my little workspace agent, because it was just so fun. I just wanted to come back and have it make more workspaces for me. — We'll have to see photos of your remodeling. — And, sort of related: how important are the prompts when designing the agent? — I'll admit it's a big part of it, right? You have to be thoughtful, thinking about the tone and the patterns you want it to use. You can put your own branding guidelines into it. So if you want your agent to represent your brand, you can have a voice prompt derived from what your writing tone is, but in voice mode. And then also, like I was showing you, running the conversation flow. What's important to realize about real-time voice models is that they're generating tokens just like text models, but they're audio tokens, and there are a lot of them. So one of the reasons it's a little harder for voice agents to follow instructions right now is that they're just doing a lot more than text models. You can get a lot out of real-time voice agents with the right prompting. Starting with that metaprompt is a really great place, and then building in these state machines, making it easy for the agent to keep track of where it is and to offload a lot of the thinking to something else if it can. — Yeah, I did notice your prompts are, like, super long.
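The "state machine in the prompt" idea mentioned above can be sketched like this: track where the conversation is, and inject per-state guidance into the agent's instructions each turn. The states and prompts here are invented for a design-consultation flow like the demo's.

```typescript
// Hypothetical sketch of a conversation-flow state machine used to
// keep a voice agent on script. States, transitions, and prompts are
// invented for illustration.

type State = "greet" | "gather_requirements" | "propose" | "done";

const TRANSITIONS: Record<State, State> = {
  greet: "gather_requirements",
  gather_requirements: "propose",
  propose: "done",
  done: "done",
};

// Per-state guidance that would be appended to the realtime agent's
// instructions each turn, so it always knows where it is.
const STATE_PROMPTS: Record<State, string> = {
  greet: "Welcome the user and ask what room they are redesigning.",
  gather_requirements: "Ask about style, budget, and must-haves.",
  propose: "Summarize a design direction and confirm next steps.",
  done: "Wrap up warmly and offer to save the workspace.",
};

function nextInstructions(current: State): { state: State; prompt: string } {
  const state = TRANSITIONS[current];
  return { state, prompt: STATE_PROMPTS[state] };
}

// Walk two turns forward from the greeting.
let state: State = "greet";
for (let turn = 0; turn < 2; turn++) {
  state = nextInstructions(state).state;
}
console.log("state after two turns:", state);
```

Keeping the transition table outside the prompt, and surfacing only the current state's guidance, is one way to offload "where am I?" bookkeeping from the audio-token-generating model.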
Often when I go into the playground and I'm playing around with the Realtime API, I feel like I just pop in a sentence or two, and that's how I experiment. But I believe in practice what our teams really recommend is prompts that are hundreds of tokens long, especially when you want to give the agent a lot of personality. You really want to be hyper-detailed and experiment quite a bit. — Yeah. And a lot of one-shot and few-shot examples, too, for certain types of conversations that you're running. It'll help give it a guideline for how to run certain scenarios if it gets lost. — So the next question: what language was used to build this agent? This was written in TypeScript; the platform is built on Next.js. You can run a fully client-side real-time agent, but then call back to server APIs on the back end to do more of the complicated responder/thinker pattern implementations that we're doing. And what's really great is that the Agents SDK has been out for a little while for Python. I don't use Python as much as maybe I should; I'm kind of a JavaScript, Next.js person. So as soon as this came out last week, I immediately switched over to it, and it's been a blast. — It's also really great that, if we think about WebRTC, it's really a technology that's built for the web and built for browsers, so it's got this seamless integration with JavaScript, right? So it makes so much sense that we have a real-time Agents SDK for TypeScript. — Totally. — Next: will it be able to pick up the tone of the caller? — Totally. That's the whole thing about real-time voice agents: they can tell if you're excited or angry. In fact, that's actually really useful; we didn't go into customer support use cases here.
The real-time agents codebase has a lot of customer support examples in it; I highly recommend checking those out. But yeah, you could give a function to your agent to call, say, if the user is angry. You can detect that the user is angry and have it say, "I'm sorry you're having a hard time," and escalate to a person, right? So then it could hand off to a human in the loop, or maybe start over with a different workflow to make the user feel better about the interaction. — Makes sense. — Awesome. So I think that's all the time we have for questions in this session, but we really appreciate all the questions. After every build hour we'll send out a survey form, so if you have additional questions or any suggestions for upcoming topics, feel free to put those in there. Our next build hour is going to be July 16th, so be sure to use the link and sign up, and join us next time. Thank you again, and we'll see you July 16th.