Prompt Engineering Workshop: Universal-3 Pro
1:00:26


AssemblyAI · 19.02.2026 · 547 views · 20 likes

Video description
AssemblyAI Applied AI Engineers hosted a live jam session where they walked through how prompting works in Universal-3 Pro. Get the TL;DR on:
- How prompting transcription works
- Live examples and comparison between transcripts
- Tips for improving accuracy across messy, real-world audio and domain-specific vocab

Ready to try prompting to improve audio transcripts?
📄 Universal-3 Pro Docs: https://www.assemblyai.com/docs/getting-started/universal-3-pro?utm_source=youtube&utm_medium=referral&utm_campaign=workshop&utm_content=prompt_engineering_u3p
🔑 Grab your free API Key: https://www.assemblyai.com/dashboard/signup?utm_source=youtube&utm_medium=referral&utm_campaign=workshop&utm_content=prompt_engineering_u3p

Live Q&A including PII Redaction techniques, prompting best practices, Diarization updates and more.

▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬▬
🖥️ Website: https://www.assemblyai.com
🐦 Twitter: https://twitter.com/AssemblyAI
🦾 Discord: https://discord.gg/Cd8MyVJAXd
▶️ Subscribe: https://www.youtube.com/c/AssemblyAI?sub_confirmation=1
🔥 We're hiring! Check our open roles: https://www.assemblyai.com/careers

#promptengineering #speechtotext

Table of contents (13 segments)

Segment 1 (00:00 - 05:00)

Hello. Hello everybody. Uh, thank you all for joining. Um, I think everyone is actually entering from the waiting room now. We're going to give everybody a minute or two here to join and get set up. While we're doing that, just some quick introductions. My name is Ryan. I head up all the customer-facing teams here at AssemblyAI. I'm joined by Zach and Griffin from our applied AI engineering team. Um, and we're excited to just go over all things Universal 3 prompting with you all. Some quick housekeeping: this is a Zoom webinar, so you all should be able to submit questions while we go. So please send those in. We will try to answer those in text. One of us will be talking, the other two won't, so we'll tag-team some of those responses. Send them through and we will answer them. Um, and hi, nice to see you again too. And as we answer those, if we can actually answer one live and show a quick demo, we will. But of course we'll answer them in text, and if there are any at the end that we want to go back to, hopefully we're going to leave some time for that. Our goal is to leave at least 15 minutes for Q&A. And this session is being recorded, and we will send it out to all of the participants afterwards. Zach, Griffin, grade me. How'd I do? What did we miss? — Great work, Ryan. We've all been prompting absolutely non-stop with this model and we're really excited to show it to all of you. I'm Zach, by the way. Nice to meet you all. — So far so good. And I'm Griffin. — Yeah. Cool. I think we've got about 50 people. And so instead of just having you wait here awkwardly, let's get started. People can join late and they can catch up as we go.
So, for those of you who are unfamiliar with our new model, Universal 3 Pro: it's a promptable speech-to-text model, which means, for the first time, next to your speech-to-text requests you can also include a natural language prompt to customize the results of the transcript that you're ultimately getting back from AssemblyAI. Instead of doing a blog, slides, marketing pitch, we're going to jump right into the demo, and we can send you some of those materials afterwards. What we're going to use for this particular demo is this tool right here that we've built, which is going to allow us to do quick comparisons of AssemblyAI speech-to-text models and customize some of the results that we see. I'm going to send this to you as well if you want to load it up and play around with it while we're talking. On the left we're going to have our current production, uh, sorry, our prior production model, Universal 2. This model really leads the market in terms of price-performance. You might find models that perform better, but not at a cheaper price. So this is kind of our old state-of-the-art model. And on the right, we're going to do Universal 3 Pro for the purposes of this particular session. I'm going to go ahead and pick some files that I have stored locally that we want to use. I'm going to start with a meeting that's actually from GitLab. You can find this file on YouTube if you want to as well, and we'll send it in the show notes afterwards. Effectively, this is a YouTube recording of a meeting similar to a webinar like this, but some folks chatting internally at GitLab. They have a bunch of these meetings available online that you can go and check out.
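For reference, a raw request with a prompt might look like the sketch below. The `speech_model` value `"universal_3_pro"` and the top-level `prompt` field are assumptions inferred from this session, not verified parameter names, so check the Universal-3 Pro docs linked above; the `/v2/transcript` endpoint and header shape follow AssemblyAI's public REST API, and the audio URL is a hypothetical placeholder.

```python
import json
import urllib.request

API_BASE = "https://api.assemblyai.com/v2"

def build_transcript_request(audio_url, prompt=None):
    """Build the JSON body for a transcription request.

    NOTE: "universal_3_pro" and the "prompt" field are assumed names
    based on this session; check the Universal-3 Pro docs for the
    exact parameter names before relying on this.
    """
    body = {"audio_url": audio_url, "speech_model": "universal_3_pro"}
    if prompt:
        body["prompt"] = prompt
    return body

def submit(body, api_key):
    """POST the request to the /v2/transcript endpoint (sketch; not run here)."""
    req = urllib.request.Request(
        API_BASE + "/transcript",
        data=json.dumps(body).encode("utf-8"),
        headers={"authorization": api_key, "content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

request_body = build_transcript_request(
    "https://example.com/gitlab-meeting.mp3",  # hypothetical file URL
    prompt="Mandatory: preserve linguistic speech patterns including disfluencies.",
)
print(json.dumps(request_body, indent=2))
```

Polling the returned transcript `id` until its status is completed is omitted here for brevity.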
And so, to start, we're going to compare Universal 2 to Universal 3 Pro without a prompt, just so you can quickly see what the model looked like before and what it looks like after, before we introduce prompts, and start to draw some comparisons to establish a baseline. You'll see that all the API requests are logged as well. So if you're using this tool, you'll actually see what it's doing: it's uploading the file, then it's going to do a transcription request for both Universal 2 as well as Universal 3 Pro, and it's going to output those into the UI so that we can start to do our comparison here. I also debated pulling all these up beforehand. Uh, but I feel like if you don't do it live, people think there's some dark magic happening behind the scenes and you're replacing or changing things that are actually showing. So if we are having challenges, we're just going to have to debug them and fix them together. This had worked up until right before this demo. And of course, now it feels like it's slow. [snorts] All right, while we're doing that, let's do some advertisements. So, there's a prompt engineering guide. If you haven't seen this, I'd highly recommend checking it out. We will share this in the chat right now. This is a great resource

Segment 2 (05:00 - 10:00)

for you to understand not just how prompting works, but a lot of the different things that you can do around prompting. Some of these capabilities, like verbatim transcripts, getting better entity accuracy, doing code switching and multilingual, we're going to demo today, but we're not going to be able to demo all of them. So it's definitely worth taking a look at that guide and seeing what you want to do. Once this audio is done, we actually do an AI evaluation with Claude Opus 4.5, and it's evaluating Universal 2 on the left versus Universal 3 Pro with the prompt on the right, and it's trying to pull out some of the differences in this audio file so that you can actually see which of these is better. Now, in this case it's picked Universal 3 Pro. That's great. There are some insertions, deletions, substitutions, no hallucinations. But I think where it gets really interesting is where you start to notice some of the nuances around some of the names that are in here, some of the actual proper nouns, etc. I'm going to play the first little bit of this file just so we can all look at it together, and you can see it play back, hear some of that live, and draw your own conclusions. Hopefully you all will be able to hear this. If not, definitely just let me know. "So it is the SEC, meaning secure and govern growth, and data science, meaning applied ML, MLOps, and anti-abuse team [snorts] meeting. That's a big mouthful. We might get a better name over time. Um, and that's our meeting for September 14th or 15th in APAC. And hi Alan, glad you're here. Why are you here when it's midnight? We could talk. Uh, glad you're here." So, if you were listening closely, the first thing you'll notice is that the second word was actually wrong in Universal 2. He does say "it is."
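The insertion, deletion, and substitution counts that the judge reports can also be computed locally with a standard word-level edit distance. A minimal sketch, using nothing beyond textbook Levenshtein distance:

```python
def edit_distance(ref_words, hyp_words):
    """Minimum number of substitutions, insertions, and deletions needed
    to turn the reference word list into the hypothesis (Levenshtein DP)."""
    R, H = len(ref_words), len(hyp_words)
    # d[i][j] = min edits to turn ref[:i] into hyp[:j]
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i              # i deletions
    for j in range(H + 1):
        d[0][j] = j              # j insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # match / substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[R][H]

def wer(ref, hyp):
    """Word error rate: edits divided by reference length."""
    ref_words = ref.lower().split()
    hyp_words = hyp.lower().split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)

# The "could talk" vs. "can talk" difference from the demo: one substitution.
print(round(wer("why are you here when it's midnight we could talk",
                "why are you here when it's midnight we can talk"), 3))  # → 0.1
```

An LLM judge adds qualitative context on top of this, but the raw counts are worth computing yourself so the comparison is reproducible.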
As the speaker continues to talk, you probably heard some stutters and hesitations in the way they were talking. You don't see those in either of these transcripts right now, right? You're not seeing stutters, disfluencies, etc. I will say Universal 3 Pro actually punctuated here. It sounded like maybe Universal 2 might have been right on this "and hi." But one that's really critical and interesting in this example is "why are you here when it's midnight? We could talk." And in this example, Universal 2 has "why are you here? When it's midnight we can talk." It's a completely different meaning in our prior model than in our new model. And so you can start to see really quickly how, contextually, one of these models versus the other changes the meaning of some of these sentences. And this is with no prompting, out of the box. We're starting to correct some of these errors. We're able to pick up some of these context clues and fix the meaning of sentences. So now let's actually talk through how we're going to build a prompt for Universal 3 Pro. That chart starts to tease out some of those different things that were missed, that weren't perfect on the first pass. So I'm going to pick Universal 3 Pro for each of these. And what I'm going to do is start prompting on the right. So the left is kind of our baseline, which is the Universal 3 Pro we just looked at. And on the right, let's start to add in some of these prompt instructions to see how this is going to change what the model is going to transcribe. So the first prompt that we've added in here, and we'll zoom in because maybe it's a little easier, is "Mandatory: preserve linguistic speech patterns including disfluencies." What this is going to do is cause the model to look over the audio file, and whenever it hears a disfluency, it's going to try to enumerate it.
That might be things like "um," "uh," "like," "you know." But what's interesting is the model itself actually seems to characterize different types of speech into what it thinks are patterns. And so disfluency is just one of these patterns that you might go and look for. And so you'll be able to see what it looks like in the results up there. So if I scroll down and we look at what we get coming out the other side, you'll very quickly see: okay, there's "it may," there's a comma, there are some changes in punctuation down here. We've added "like" to "the meeting, this." But it's not really enumerating as many of these things as I might want to, right? You can see that it's added some more verbatim to the context, but it's not getting all of the ums and uhs and stutters that actually exist there. And what we found in testing and playing around with this model is that the key is how the model interprets these patterns that you're looking for within the actual transcript. And in this case, if we just write "disfluencies," it's actually not enough. It's trying to determine what that means, but it's not really sure: what is a disfluency? Is it an "um," an "uh"? Is it adding in "this" or "to the"? It's trying to interpret that, but we're not being specific enough about what we want to see. And so it's kind of shy.

Segment 3 (10:00 - 15:00)

It's hiding back some of the capabilities because it's not exactly sure what to do in this case. — Yeah, Ryan, just to add on to this piece. One of the key skills that this Universal 3 Pro model has, which is kind of, you know, pretty amazing, is that it can contextualize the audio in a way that previous transcription models just can't. This filler-words-and-disfluencies piece that Ryan's showing now, and you'll see it as we show more prompts and examples here, is that the model's capable of interpreting an audio event and, based on the context that we provided up front with the prompt, determining how to represent that information within the transcript, right? So one piece is, you know, it might not transcribe ums if you don't tell it to, but there are other pieces like that as well that we'll show later on. So that's a little trailer for what's to come. [snorts] — Yeah. And so we'll quickly add on top of this, right? Maybe "disfluencies" is the wrong word. The model's thinking that "um" and "uh" are actually filler words in this case, right? Maybe it thinks that they're hesitations. These are, again, some of those different features you're going to start to tease out. And the more specific you can be with these, the better the model is going to be able to look at all that context and be like, "Oh, I should be transcribing um and uh, this is clearly what I'm looking for," rather than a scenario where it's unsure: is this a thing? As we do that, I did see in the chat there were some questions. I can't actually see them on my end, so let me just take a... Are they being answered somewhere else? All right, here we go. Let me open this one, too. Okay, cool. Um, Bill. Yeah, cool. Yeah, this is live today.
So, with the Q&A that we had... by the way, we'll go through at the end of this chat how you can actually run your own evals, whether that's on your own data set and you want to get the best prompt, or a scenario where you actually want to take your files, for example, and run them against a bunch of different models that are out there. But I do think what you're going to see is that out of the box, with no prompt, this model is going to be much better than the previous generation of models. And then if you start to prompt it, if you have a data set that's very unique and specific to your context, maybe something like medical, finance, legal, any sort of specific context, you can actually use this evaluation setup to go and find the best prompt for your specific use case. And I feel like, of course, when you're doing the live demo, something's excessively slow, which is great. But we do have a question we can answer live. I don't know, Griffin or Zach, if you want to take it. — Yeah. So Adam asked: can it interpret and include in the transcript non-verbal but audible signals like coughs, sniffles, throat clearing, etc.? Yes, it does have audio tagging capabilities. It kind of depends what we have seen most with these audio tags in the training data, but it does have the ability to pull out things like laughter, silence, noise, coughs, etc. We'll see later in this demo that we're actually going to insert certain tags on top of this that aren't even just speech events, like whether the audio is unclear or not. You can actually change the output so that it notes that it's unclear rather than guessing. But yeah, it does have those capabilities out of the box. — I've been staging a comparison of some of these while we've been talking. So this will be the next one that we go to, and I'm going to stage it here so you can actually see it. Cool. So this one's back.
What you'll actually see in this example, again, is that we're looking at disfluencies, filler words, hesitations within our prompt. And so with this, it's going to try to pull out more of these different features in the audio, right? And now you're seeing very quickly that "um" and "uh" have shown up in this transcript. "It may," which is probably some sort of speech hesitation happening there, right? More ums, more uhs, just a lot more enumeration of all the little nuances that are occurring in the transcript. And we're teasing those out by continuing to add more and more of these different linguistic patterns into the actual prompt itself. If we go and see this next one, what we're going to do here is include even more. So previously we kind of stopped this prompt at hesitations, right? What if we add in repetitions, stutters, false starts, and

Segment 4 (15:00 - 20:00)

colloquialisms, which is, like, most people say "gonna" versus "going to," but a lot of times when you look at a transcript, it says "going to," right? It's trying to guess the words. And so you can continue to layer on these prompts, and you'll see that you keep getting different results around all these different speech patterns that you have within your audio. And what we think is super cool about this is, you know, because you have the control with prompting, for different use cases this may be good and this may be bad, right? If I was evaluating my performance on this webinar, was it good or bad, you probably want to hear the ums and uhs to understand whether I was speaking clearly and well and professionally and whatever else you want to judge that on, right? In other contexts, maybe you actually prefer the original one, where it's very human-readable and it's gotten rid of all these ums, and you want it to read a little bit more like a transcript or a book or a novel, right? You can actually control this, and so depending on your use case, you can change the output that you ultimately want to see, which is really exciting. — A key piece, too, of what Ryan was touching on with colloquialisms, right, "going to" versus "gonna," is that because you can prompt Universal 3 Pro to be very verbatim and show exactly what was going on in the audio file, if you're using a traditional word-error-rate-based evaluation, you should really pay attention to specific insertions. What we've found a lot of the time is that Universal 3 Pro will do a better job than a human transcriber on some of this audio data, particularly if it's very difficult. So something that we've been doing is: if we see insertions in Universal 3 Pro's output, where that word isn't in the truth file, go back and listen to the audio file.
I know in an AI world it can be tough to want to go back and listen to the audio, but you can actually hear a lot of things that Universal 3 Pro will pull out that a human transcriber might have missed when they were initially transcribing the audio. And I think that's a good point to transition to our next prompt here. So far we've been looking at teasing out these different linguistic patterns, right? But let's talk now about what Zach mentioned, which is how we get the model to start guessing, and specifically how we control how it's going to guess what word should come next. What we've done for this, in addition to the prompt that we already have, which I'll just read out so everyone has it in case it's small: "Mandatory: preserve linguistic speech patterns including disfluencies, filler words, hesitations, repetitions, stutters, false starts, and colloquialisms in the spoken language." We're adding this additional one now: "Always transcribe speech with your best guess based on context in all possible scenarios where speech is present in the audio." And so the model, as you might imagine, is generating predictions for any of the words that it's ultimately going to output into the transcript. And we've now added in a rule where we're telling it: hey, take your best guess based on context. It may or may not be right, but take your best guess. I want you to guess and try to figure out what should be here. And this latter part is "in all possible scenarios where speech is present." This says: make a guess. And if you're unsure whether it's even speech, make a guess on that, too. So it's trying to get the model to go in and take a guess at what might be happening in the audio, versus looking at some of the predictions that it has and saying, "Ah, actually that's out of domain, or it's too low of a prediction, I should ignore putting that within the result set."

And so if we start to look at this particular file, it's a little bit hard to see all the differences compared to the original. But this particular sentence, right, if you remember it: "Why are you here when it's midnight? We could talk," kind of a pause, and then, "Glad you're here." That's actually what the speaker said in that scenario. And that's where this idea of disfluencies and such comes into play. It really makes a difference when you're able to have this prompt pull all those different linguistic patterns out, because it really captures the meaning of what the person said. And I'm gonna play it back really quickly just so you can hear that specific segment, because I think it's really cool to see how it comes out in this audio. — "And hi Alan, glad you're here. Why are you here when it's midnight? We could talk. Uh, glad you're here. Don't make it a habit to come to this meeting since it's really late for you." — And if you were doing some sort of analysis, or putting this in an LLM, or kicking off a workflow, it's really important to figure out how that person spoke, not just what they said. And that's what you're starting to see when you get it to start guessing some of these different things, and see that within the results that you ultimately have.
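The prompt progression walked through above can be kept as plain strings and layered. A small sketch; the wording is taken from this session, but treat the exact phrasing as something to tune against your own audio:

```python
# Prompt variants from the session, from least to most specific.
BASE = "Mandatory: preserve linguistic speech patterns including disfluencies"

VARIANTS = [
    BASE + ".",
    BASE + ", filler words, and hesitations.",
    BASE + (", filler words, hesitations, repetitions, stutters, "
            "false starts, and colloquialisms in the spoken language."),
]

# The guessing instruction is layered on as a separate sentence.
GUESS = ("Always transcribe speech with your best guess based on context "
         "in all possible scenarios where speech is present in the audio.")

def full_prompt(variant_index, guess=False):
    """Compose one of the layered variants, optionally with the
    best-guess instruction appended."""
    prompt = VARIANTS[variant_index]
    return prompt + " " + GUESS if guess else prompt

print(full_prompt(2, guess=True))
```

Running each variant against the same file, as in the demo, makes it easy to see which added term actually changes the output.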

Segment 5 (20:00 - 25:00)

So with that, I think what we've done now is we have a pretty good prompt for this specific audio file. If you go and listen to this, yes, there are probably some nuances of things that are different or wrong, but it's definitely headed in the right direction. What's challenging, though, is that if we just keep making this prompt better and better, it's going to overfit to this file versus representing the diversity of audio that we might have from our user base. And so instead of continuing to dive deeper into this file, I'm going to pivot to a completely different one. Instead of a meeting, we're going to take a file from the Miami Corpus, which is a Spanglish data set that's available, recording folks in the Miami area actually talking in mixed languages. And so we're going to do the same prompt as before, just to see how it affects a file where there's actually code switching involved and you have users who are speaking English, Spanish, and Spanglish, all potentially within some of the same sentences. Just to clarify, Mark Andre, your question is: for asynchronous transcription of a conversation, is there a way to preserve the overall context between the different parts? I presume you're talking about chunking the file when you actually send it through for transcription. There wouldn't be a way today for the model to know that these five chunks are all part of the same file. However, what we found in testing and doing these prompts is that the actual context of the call is less important than the instruction that you give it.
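That point, instruction over context, can be made concrete with the doctor-patient example discussed in this session. Both strings below are illustrative, not tested prompts, and the chunking helper just reflects the advice that each chunk's request should carry the instruction itself:

```python
# Illustrative prompts only: the first gives context, the second gives the
# model an instruction it can actually act on while transcribing.
CONTEXT_ONLY = "This is a doctor-patient visit."

WITH_INSTRUCTION = (
    "This is a doctor-patient visit. Prioritize accurately transcribing "
    "medications and diseases wherever possible."
)

def chunked_prompts(num_chunks, prompt=WITH_INSTRUCTION):
    """Chunks of one file can't share context with each other, so repeat
    the instruction-style prompt on every chunk's request."""
    return [prompt] * num_chunks

print(chunked_prompts(3)[0])
```

The context-only version tells the model what the audio is; the instruction version tells it what to do with that fact when it makes guesses.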
And what I mean by that is: if you were thinking about medical, for example, let's take medical. If you wrote, "Hey, this is a doctor-patient visit," and that was your prompt, that actually doesn't tell the model how to control what it should do in that scenario. It's like, cool, but how should I guess this word? The way to write that prompt would actually be something like what we have here, which is: this is a doctor-patient visit; you should prioritize accurately transcribing medications and diseases wherever possible. And that would tell it, hey, I should actually really be thinking about medications and diseases as I make guesses, rather than just "this is a doctor-patient visit." So I'll mark that as answered live. Hopefully that helped. If you want to follow up, feel free. We will get to the next question in a second as we go through this example. Uh, did I just delete everything? Okay. Um, I deleted the prompt. So, uh, I'm sorry that I can't... Oh, here we go. Re-added it. So, same prompt as before. If we scroll down and look at what we have here, this actually looks pretty good, right? It looks like we're adding in some hesitations. It looks like there's some speech hesitation here with "gauge," ums and uhs. So it looks like we've actually improved this file pretty well. But what happens is, although it looks good (and this is why using the LLM as a judge can sometimes be misleading, though it's still helpful to see what patterns it shows), you can actually end up in a scenario where something looks great, but the reality is it's not actually what the user said. And so I want to get a specific example from this file and allow us to listen to it together. So let me zoom out one more time so we can get right on the spot where we want to be. Okay. Cool. So, let's play this part of the audio right now.
Um, and look at it together. — "No, installing it. We have this guy or George." — "And he's going to school now." — And if you listen to that, they're actually talking in Spanish right around this particular segment. There's some Spanish used before it says "with George," and some Spanish afterwards. And so while this looks right if you're using an LLM as a judge, it actually translated this to English. It very clearly put this as English in the ultimate result set that you're going to see. And so what we want to do here is start to tell the model how it should handle these scenarios where different languages are spoken. So I'm gonna go ahead and move the current prompt we have to the left, and I'm gonna add a new prompt on the right. And sorry, this is the next one, so let's not do that one yet. Let's do this

Segment 6 (25:00 - 30:00)

one. Um, and this prompt is the same, with all the same linguistic elements, always transcribe speech, but we're going to add one additional instruction to the model, which is: "Preserve the original languages and script as spoken, including code switching and mixed language phrases." What this tells the model is: if you hear different languages in this file, transcribe them as spoken, right? Transcribe them as the user said them, not as some sort of translation. And as we're talking, I think this question is particularly relevant: what are the different languages it supports, and will more be added soon? Universal 3 Pro today supports English, Spanish, French, German, Italian, and Portuguese. We're well aware that many more languages would be awesome, especially for this prompting capability. And so our research team is actively working on Universal 4, which would include all of these different feature sets. So, six languages today, but you are able to use our API for 99 languages, and you could use the prior model that we had on the left for those other languages as well. So let's go and look at the results, since they're here. We've added in this code switching instruction, right? As we scroll down and look at this, you can actually see now, oh wow, there's legit Spanish being spoken right around that particular utterance. And by telling the model not to translate, we've actually caused it to fix it. Now, there are pros and cons, right? I haven't even listened to this part of the audio, but is he saying "gauge" or "date"? I don't know. We should probably listen to that, too. But you start to see how having these different prompts is going to change the way that it's transcribing.
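Both sentences here are quoted from the session; appending the language instruction to the existing base prompt, rather than replacing it, is the pattern the demo follows, and that composition is sketched below:

```python
# Base prompt built up earlier in the session (disfluencies + best guess).
BASE_PROMPT = (
    "Mandatory: preserve linguistic speech patterns including disfluencies, "
    "filler words, hesitations, repetitions, stutters, false starts, and "
    "colloquialisms in the spoken language. Always transcribe speech with "
    "your best guess based on context in all possible scenarios where "
    "speech is present in the audio."
)

CODE_SWITCH = (
    "Preserve the original languages and script as spoken, including "
    "code switching and mixed language phrases."
)

def with_code_switching(base=BASE_PROMPT):
    """Append the language-preservation instruction to the base prompt."""
    return base + " " + CODE_SWITCH

print(with_code_switching())
```

Keeping each instruction as its own sentence makes it easy to swap one behavior (for example, translation handling) without disturbing the verbatim instructions.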
And it's picking up all these small Spanish words within the audio file that previously the model did not tease out. And so this allows you to really get specific with code switching, especially if you have mixed-language files, and pick up some of these nuances of the different speakers saying different things across them. So we'll do one more here, and then we'll switch to talking about evaluations and answer any of the questions that we feel would be good to do live. So I'm going to again move the current prompt to the left just so we can see it, and then I'm going to do a new prompt on the right. And I will run this. We've been talking a lot about how the model is ultimately going to guess, right? And we're trying to guide it in how to make these guesses. In the current prompt, we're telling it to take its best guess. What if you don't want it to guess? What if you want to be very specific, and you only want it to write a word if it has very high confidence? You can change the guessing methodology by writing different instructions. And so in this case, let's do the same thing: "Transcribe speech with your best guess when speech is heard. Mark unclear when audio segments are unknown." So instead of trying to guess, we're now instructing it: if you're not sure, be not sure, and go ahead and write that as your answer. I think, Adish, you actually asked this, and I see another reply on top of that. In any case, let's scroll down and look at what this looks like. What you'll actually see now is that those specific Spanish segments we just teased out are actually where the model is most unsure about the predictions it's making. And those are the things that you're going to start to see in these unclear tags that are coming up in the audio. It's also actually picking up on potential background noise.
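Once the model is marking low-confidence regions, those tags are easy to pull out programmatically, for example to route just the doubtful spans to a human reviewer. The literal bracketed `[unclear]` form below is an assumption based on how the tags appear in this demo:

```python
import re

# Assumes the model emits literal "[unclear]" tags, as in the demo.
UNCLEAR = re.compile(r"\[unclear\]", re.IGNORECASE)

def review_slices(transcript, window=30):
    """Return a snippet of surrounding text for each [unclear] tag so a
    human reviewer can jump straight to the doubtful regions."""
    slices = []
    for m in UNCLEAR.finditer(transcript):
        start = max(0, m.start() - window)
        end = min(len(transcript), m.end() + window)
        slices.append(transcript[start:end])
    return slices

text = "and he's going to [unclear] now, with [UNCLEAR] background noise"
print(len(review_slices(text)))  # → 2
```

This is the workflow sketched in the session: treat the best-guess transcript as a pseudo-label and spend human review time only on the unclear spans.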
And this is where we could start adding in audio tagging, for example, in other use cases. But you could use this very quickly to figure out: okay, the model predicted the left; it was its best try. The model was unclear about the right. And you could use both of these transcriptions to kick off all sorts of workflows. You could imagine the left as almost a pseudo-labeled human transcription, and on the right you would just have a human look at the parts that are unclear, and that's how you could get a great human-labeled file, for example. Zach, Griffin, anything you would like to add? — Yeah, sorry, I was diving into the Q&A here. — Yeah, I saw, so that's bringing it back. — Yeah. So what's really cool about this ability to prompt for unclear language within the audio is that we found it gives the most accurate representation of what's in the audio, right? Also, if you're doing word-error-rate-based evaluations and using a normalizer like the Whisper

Segment 7 (30:00 - 35:00)

normalizer, it's going to pull out those brackets. So, basically, if there's audio that a human wouldn't catch either, we're not forcing the model to make a guess, and that's important for that piece of it. What I've also found is that this "unclear" tag is great, but we've also experimented with using a "masked" tag, in brackets. I've found that using this tag in particular has led to tagging more areas of the audio where there's unclear language. However, this "masked" tag is also commonly used to tag profanity within a transcript. So it also has the negative effect of potentially removing profanity from a transcript. That's why "unclear"... when we've tested a lot, you know, Ryan, Griffin, and I have basically become mad scientists with this model. We've done tons of experiments, day and night. And there are certain things that we feel are just probably the best for these types of use cases, and we've landed on "unclear" as probably the best way to represent that data. Of course, if for your use case you don't care about curse words being covered up, then I'd probably recommend you go with "masked" for this specific use case. — Yeah. And just to touch on that, if you have a style guide that you're using for human-level transcriptions, which is common, you know that your word-error-rate evaluations are stripping those things out. That's something this is great for: kind of pseudo-labeling near-human-level transcription. You can just prompt based on your style guide and tailor to that. So, whether you want "masked" or "unclear" or any of those different tags, that's something that's definitely possible with this prompting. So, I'm just going to do this live. We'll see if it works. This is actually what we do a lot of the time when we get questions like this. So, I see this one: is PII redaction any better in Universal 3 Pro?
And then Adam actually put in some things that you might go and try to put into a prompt to pull this out, and that's exactly the type of thought process we'd end up on here. This is probably not going to be the best prompt on try one; we're just trying to one-shot this, so it won't be perfect. But the idea is: if you give the model instructions like "look for PII and personal information, and tag it as private," it's going to try its best to do that. Now, the nuance is, I don't actually know whether the model knows the term PII, or what it calls personal information, and this is where the prompt engineering comes into play. It looks like George is still here, so that didn't really help anything in this case, but it's now tagging private [foreign]. It's hard to control this if we don't give it very specific instructions. So you'd actually want to enumerate exactly what was in here: put all of these specific things as private. That gives you a much better result, because now we're saying "always transcribe," and when it sees names, addresses, and contact information it goes, "Oh, I can figure out what that is. Is this a name, an address, or contact information? Let's mark that as private." And that's the kind of back-and-forth you'd ultimately get into when trying to do this within a file. — Yeah. And outside of prompting for that information and redaction, we also have our speech understanding PII redaction suite that many of you are probably used to using at this point. Via prompting, this model will have better entity detection, and that's ultimately what the PII redaction model is based on.
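For comparison, the built-in redaction path mentioned above is configured with request parameters rather than an open prompt. This is a hedged sketch: the `redact_pii*` parameter names and policy values follow AssemblyAI's published PII redaction docs as I understand them, so verify them against the current reference before use.

```python
# Sketch: built-in PII redaction via request parameters, as an
# alternative to prompt-based "[private]" tagging.
payload = {
    "audio_url": "https://example.com/support-call.mp3",  # hypothetical
    "redact_pii": True,
    # Enumerate only the entity types you need, mirroring the advice
    # above about enumerating specific things in a prompt:
    "redact_pii_policies": ["person_name", "email_address", "phone_number"],
    "redact_pii_sub": "entity_name",  # e.g. replace a name with [PERSON_NAME]
}
print(sorted(payload["redact_pii_policies"]))
```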
So even if you're not using prompting here, or you're using prompting for entities rather than PII redaction, our existing PII redaction feature should see increased performance from this as well. — Ryan, I just sent over in the chat this prompt: transcribe the audio verbatim, include speaker change markers. What's really cool about this model is that, obviously, we're focusing here on probably the most-used use cases and we want to get the prompts perfect for those, right? But there's so much potential and so many possibilities in what you can transcribe with this model. And this prompt in particular is really interesting. It's one we've been experimenting with a little bit internally: because the model can understand the audio up front, it does a really good job of tagging the end

Segment 8 (35:00 - 40:00)

of speech utterances between speakers, and you can actually see the events there. So there's a lot of stuff with the audio tagging that I think will be really fun to experiment with, and I definitely encourage everyone to try it. But yeah, you can see that every time the speaker changes, it gets a different speaker tag. — And something we're working on, which is pretty cool as an experimental feature... well, let me do this first, hold on. So you can do speaker change markers, plus "try to guess the speaker's name or role based on context." This would actually address the ask in here from, sorry, not Joseph, from Paul, which is: can you actually label the speakers in the output? The answer is yes, but right now I'd call it experimental. The reason is that behind the scenes, when we process this audio, it's not processed all at once as one 30-minute block; there's some amount of chunking going on in our processing pipeline. We're working on how we persist these speaker changes across those different chunks. And the reason speaker labels might look strange is that in this case the model doesn't even know who these people are; it's just two people talking. It actually picked up some additional changes down here. But in the case of a doctor-patient visit, it could try to guess the doctor and the patient; yet if the call is long enough and a certain chunk doesn't contain the doctor or the patient, it's not sure whom to tag. So you might see the first chunk look good, the middle look bad, and the last part look good. All these things are things we're learning together with you and experimenting with.
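For stable speaker labels today, the native diarization feature is the safer path than these prompt-emitted tags. A minimal sketch, assuming the documented `speaker_labels` request parameter and an utterance-level response with `speaker` and `text` fields (treat both as assumptions to verify against the docs):

```python
# Sketch: request native diarization instead of prompt-based speaker tags.
payload = {
    "audio_url": "https://example.com/two-speaker-call.mp3",  # hypothetical
    "speaker_labels": True,
}

def format_utterances(utterances):
    """Render utterance dicts ({'speaker': ..., 'text': ...}) as lines."""
    return [f"Speaker {u['speaker']}: {u['text']}" for u in utterances]

# Stand-in for a real API response's utterance list:
demo = [
    {"speaker": "A", "text": "How are you feeling today?"},
    {"speaker": "B", "text": "Much better, thanks."},
]
print("\n".join(format_utterances(demo)))
```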
And ultimately, we want to take the best of both worlds and combine these speaker tags you're seeing the model emit with our native speaker diarization feature. You'll get much better overall diarization performance, because it won't just be based on the embeddings of the speakers and the timestamps of the audio; it will also include this information from Universal 3 Pro, which is based on the audio embedding itself, to understand whether the speaker changed. So that's more of a coming-soon teaser of what you'll see in the future. Let me just look. I do see a question about whether authoritative language is mandatory or always required, and also, for medical accuracy, whether we recommend prompting versus key terms prompting. Maybe we can address both, and I'll go to the prompt guide here to add to this. What we've seen is that when you use authoritative language, the model is much more likely to follow your instructions than if you say something soft. If you say, "Take a guess," the model goes, "Yeah, whatever, that's not that important," versus, "You should always take a guess based on any audio that you see," where it goes, "Oh, I guess any time I see audio, I should try to take a guess here." It's not necessarily that those exact words change the outcome; it's that you're being very specific and pushing the model to do something, rather than generally telling it, "Ah, try this thing." It doesn't respond well to instructions that are soft and carry no authority about where you want to push it. So that's that. And on the medical accuracy piece, it's a good point to bring up that there are really two features within Universal 3 Pro: we have prompting, which we've been discussing, and we also have key terms prompting.
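Side by side, the two features might be sketched as request options. The parameter names (`prompt`, `keyterms_prompt`) are assumptions drawn from this discussion; as noted later in the session, the two are mutually exclusive at the parameter level, so one workaround is folding key terms into the open prompt:

```python
# Sketch: key terms prompting vs. open prompting. Parameter names assumed.
key_terms = ["Ryan Seams", "AssemblyAI", "Universal-3 Pro"]

# Option 1: you know the context up front, so boost specific terms.
payload_keyterms = {
    "audio_url": "https://example.com/webinar.mp3",  # hypothetical
    "keyterms_prompt": key_terms,
}

# Option 2: context is unknown, so give general instructions -- and,
# if needed, fold the key terms into the open prompt as a workaround:
payload_prompt = {
    "audio_url": "https://example.com/webinar.mp3",
    "prompt": (
        "Transcribe the audio verbatim and infer domain terms from context. "
        "Pay close attention to these terms: " + ", ".join(key_terms) + "."
    ),
}
print(payload_prompt["prompt"])
```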
I know we brought it up earlier, but this allows you to, for instance, specify terms that you want to boost within the audio beforehand. I think it really depends on your use case. If we had a virtual meeting, like a webinar like this one, we actually know that I'm Ryan Seams from AssemblyAI and that rseams at assemblyai.com is my email address. We could put all those things into the key terms prompt because we think they might come up, and that would help for sure. But where prompting is really useful is when you don't know the context of the audio file: it's going to be really hard for you to pick the right key terms to boost, right? So ultimately, the key terms prompt is great if you know the context. If you don't, prompting is going to get you much further than key terms prompting, because it allows you to be very general with the guidance you give and lets the model infer the context, rather than you providing it in the key terms prompt. Cool. Adam, I see you were asking about the emotion model and emotional labeling. Maybe we should go back and quickly try to tag some of these things. So let me do this instead: include all audio tags for non-speech

Segment 9 (40:00 - 45:00)

wherever encountered. Again, this is an experimental feature that, similar to speaker labels, we want to improve over time and make a native feature of the API. We're really excited that the model is exhibiting these capabilities, but the types of responses you get may differ a lot from one file to another. So I'm going to run this and we'll see what we get. And I see that the next question here, from Turner, is actually pretty similar: what about tagging when media is played, when audio or video is played by the participants in the recording being transcribed? These are kind of similar. Some of these audio tags are things like noises, but also things like sighs, or silence, or happiness, or whatever. These audio tags are really the model's attempt at any non-speech markers in the audio: what does it think that audio actually is? It's going to try to tag that specifically. Now, Griffin, Zach, have you tested some prompts with this? I've seen it go crazy; I don't know what your experience is. — Yeah, it can be overeager sometimes, but I've seen it be really accurate as well. I think giving it as many specifics as possible, in terms of which audio tags you want it to guess, acts as a guardrail and gives the best performance I've seen. — And Zach, I think I cut you off. — No, I was just going to say I saw Adam's question about emotion detection and being stuck with the situation that you're in. Emotion detection is definitely something the team has talked a lot about internally, and in the future we will definitely be powering those use cases. — There's a great question from Dan, and I think, Ryan, you touched a bit on this.
Yes, right now key terms prompting and open-field prompting are mutually exclusive. However, because the prompt parameter is open, you can include key terms in that open prompt if you'd like, and that way you can combine the two. At the parameter level, yes, they're mutually exclusive, but we've seen that if you say, "Hey, here are the key terms I want you to look out for and boost," that's a way to get around the restriction. Anything to add there, guys? — No, we're going to edit the docs right after this meeting so you have the latest and greatest, and that'll be in the materials we share as well. I do want to come back to the audio tagging and emotion tagging. We did run this prompt; just to be clear, it was "include all audio tags for non-speech wherever encountered." And what you see here is the part of Spanish we heard earlier, tagged as foreign language. But you're also seeing other things further down in this audio file: there's some sort of noise before he says "corrosive," and it sounds like there's music later on. This file is actually really interesting, because later on there's a child singing at really low volume. And you see all these different things starting to come out, around music and applause. So again, I'd call this an experimental feature, in that you'll get different performance on different files. What we're trying to do is make the model more robust to these scenarios and then turn this into an actual API-level capability with consistent performance, rather than you providing this prompt every time. So that one's done. Yes, Dane: prompting currently costs an additional five cents an hour on top of the list price.
That being said, we're more than happy to talk pricing or anything like that; we're pretty easy and friction-free, so if you want to talk to us afterwards, great. But that's what it is out of the box: 21 cents an hour list, and then five cents an hour on top of that. And then Turner... yeah, go ahead, Zach, you nominated yourself, you're taking it. — Yeah, I was about to type an answer, so I figure it's easier to just talk it through. What we've found with prompting the model is that providing some context up front on the kind of audio data it's going to see does help, but the model is really instruction-driven. You can see that a lot of the prompts we gave it were instructions on how to transcribe the audio: transcribe a word this way, attach names, key-terms prompt, stuff like that. When it comes to attaching domain-level context up front, the thing is that because it's a contextually aware model, it's going to find out very quickly, just by listening to the audio, that it's, for example, a conversation about politics. A minute into the

Segment 10 (45:00 - 50:00)

audio, it's going to be aware of that. So attaching that domain-level understanding up front doesn't necessarily have a huge impact on accuracy. Now, if there are, say, local politicians in the audio file you're transcribing and you can attach their names up front, obviously that's going to improve accuracy. Or if there are specific sounds you want transcribed within the audio, like shouting or claps in a speech, that will improve outputs as well. But just attaching that kind of context up front may have some effect, not a major one. So I think we're all just reading Chris's question: "Generally speaking, a lot of the features (speaker diarization, speech understanding, etc.) seem to be limited with the streaming method versus the pre-recorded method. Is the answer to have my developers record audio as well and send it to post-process the audio recording?" Interesting question; I'll let Ryan take this one. — Yeah, I'll hop into that in a second. I just put this up here to do the prior example about context. This was Turner's, right? On the left, the prompt was: "This is a conversation about a family in Miami hanging out, discussing their day-to-day." And on the right: "Hey, transcribe mixed-language phrases wherever you see them; pay close attention to transcribing wherever you can; this is an ambient mic." I even had a typo in "low quality." And just to show you the difference that comes out: you're already getting more of the disfluencies and the grammar of what's said, and at least some of the Spanish, in what they're saying further down. And again, the reason is that when the model sees the prompt on the left, it goes, "Okay, cool. What do you want me to change?
" Like, "I'm transcribing this file. Like, what should I do? " Like, "This doesn't tell me what to do. " Versus the right here is like very specific like, "Okay, well, if we have a Spanish call in a family in Miami, like what do we want to pay attention to? " And that's kind of like the difference of like how you would prompt this model ultimately. Uh Chris, I will go to your question now. Answer live. — Yeah, I mean — yeah, I'll just I'll Yeah, so two things. One, we are launching speaker diization for streaming. That's coming very soon. Um and the other piece is that we are going to be launching a streaming version of this universal 3 pro model. Um that is fully promptable. So uh be on the lookout for that. It's coming very soon. [snorts] — Yeah. So I think speaker diization we actually tested like an alpha um whatever. We're live. I don't care. It's in the API. You You can try and find it yourself. Uh I won't guarantee the performance, but Chris, we'll reach out to you oneonone afterwards. If anyone else wants to try it, we can send you the information as well. Um it's actually in the AP live API, just like, you know, we're still testing and verifying it. On the speech understanding piece, I totally hear you there. Uh we need to make it easier to do things like redact PII out of the box in streaming. And so we are going to find ways to go and do that. Uh, I think on the teasing part that Zach was getting to, um, if you've paid attention, we've been using this the whole time, there is a streaming button actually in the top right corner. And, um, this will actually allow you to play with Universal 3 Pro streaming as it exists today. Uh, what this model does again is like you can add context and prompts, but I'm just going to record like a quick clip to try to show you some of the nuances of like where it performs versus what we have out there today. Let's hope the app actually works. All right. Hello. Awesome. Uh, this is Ryan Seams. I'm here from AssemblyAI. 
My email address is rseams at assemblyai.com. My phone number is 555112288. Super excited to be here on this prompt engineering workshop. Yes, all those things I just did are the hardest things to do in voice agents, right? Specifically: we got my email right, head on. The phone number was correct; we didn't skip a number. And if you heard me at the end, I was whispering, and it still started to pick up on those things. So all of this, the model being context-aware and producing great live transcription, is coming to streaming as well. And of course you can start to prompt that too, and say things like, "The audio quality is terrible; always try to pull out anything someone says." Or, Zach, you just got off a call about this: you have a customer that always says very specific terms, right? And they're

Segment 11 (50:00 - 55:00)

like, "We just want you to really get these five words, because they're 50% of what people say to us. How can we prompt that out of the model?" — Yeah. So obviously the core of this was the async part, but we're very excited about the streaming piece as well. We're building out custom term detection for this model in particular. And the promptability piece is really amazing, because if you have a voice agent, you can dynamically prompt it, right? You can increase the accuracy of your voice agent on the fly. For example, the voice agent can ask a specific question, and then you can say, "These are the types of answers that are coming," and boost the transcription accuracy for those. But yeah, I know we're primarily focused on the async piece, so let's jump into those questions. You can all go test it. — I'll actually jump to Emil's question about iterative transcription with prompts: how do you actually get to the right prompt? That's a great segue, because we have eight minutes, we're way too excited to answer your questions, and we should talk about how you do this yourself at scale. So I have a couple of GitHub repos here that you can go explore. This one is from our head of product and technology, Alex. It's a command-line tool that lets you pull in datasets from Hugging Face, run different prompt simulations on those datasets, and see how they compare to the ground-truth files. Using this, you could pull any public dataset (we'll share these in the post-show materials), compare the WER there, and just keep iterating through prompts to find the best one for that dataset.
Now, I think this is really useful for quickly trying some of these simulations. Me personally, I've been much deeper in what you were talking about: how do we find the best prompt for a certain scenario? The reason is that, as we just discussed with the streaming model, we want the best system prompt possible for it. So we've been running simulations on our streaming model over all these different datasets we have, and pushing them into different eval sets. This particular repo is the one we've been using; it's public as well, and we can share it afterwards. Basically, it uses an optimization technique that defines different prompt components, tries extreme versions of each (positive, negative, middle of the road), and runs a bunch of trials to figure out which types of components really influence the output, and then, within each component, which style works. When I say component: we talked earlier about a disfluency component, and within that component there was the really short version we did and the medium-long version. Those are the features we start to test, and by running these optimizations you can quickly converge on the best WER for a particular dataset. One additional thing I'll add: this particular repo uses traditional normalized WER, which could be great in your case. In this other repo we're doing something different, called semantic WER. Semantic WER is more nuanced: you define a bunch of rules around word error rate up front, and have an LLM interpret the judging.
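The iterate-and-score loop described above can be sketched with a plain word error rate: transcribe the same file under each candidate prompt, then keep the prompt whose output scores lowest against the ground truth. The transcripts below are hypothetical stand-ins for real API outputs, and the metric here is a straightforward word-level edit distance, not the semantic WER just mentioned.

```python
# Sketch of the prompt-iteration loop: score each candidate prompt's
# transcript against a ground-truth reference with plain WER.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1] / max(len(ref), 1)

# Hypothetical outputs from two different prompts on the same file:
truth = "the patient reported mild chest pain"
candidates = {
    "prompt_a": "the patient reported mild chest pain",
    "prompt_b": "the patient report smiled just pain",
}
best = min(candidates, key=lambda p: wer(truth, candidates[p]))
print(best, wer(truth, candidates[best]))  # -> prompt_a 0.0
```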
And what we're finding with this model is that most human-labeled datasets are actually wrong. Humans have been making transcription errors for a really long time, and it's only with this new model that you're starting to see those errors come out the other side. Zach, I know you've been working with a customer; you had one file you were working on with them. You give the example. — Sorry, I was thinking about the custom formatting piece and coming up with conversation points on that. What were you saying? — I was saying you have a customer where you actually looked at their human-labeled files and they were just straight-up wrong. — Oh yeah. So it was interesting, because when we ran large-scale evals (and I hinted at this earlier), you might see, if you have human-labeled data, that the word error rate from Universal 3 Pro is actually higher than Universal 2, and you go, "What? Why is that the case? This is the state-of-the-art model." So for one of our customers, I went through, one by one, every single insertion, every difference between Universal 3 Pro and the human-labeled truth file, and identified the differences. Pretty much all of them, I think 95%, came back showing that Universal 3 Pro was right and the human-labeled truth file was

Segment 12 (55:00 - 60:00)

actually incorrect. So as you evaluate the model, this is a crazy thing we've been thinking about a lot internally: how do we explain this to customers? It honestly breaks a lot of the traditional human-truth-file-based word error rate evaluations. The better your truth file is at transcribing all the audio data available within it, the more accurate your word-error-rate-based evaluation is going to be. That's also why, when you're doing these word error rate benchmarks and evaluations, including something in the prompt like "label unclear or inaudible audio data as [unclear] or [masked]" actually improves the word error rate: the model becomes less likely to transcribe audio data that it really does hear but that a human is simply not going to transcribe at all. So as you're evaluating, feel free to reach out to our team; we can help with these types of evaluations and explain what might be going on. But definitely something to look out for: we're hitting a pretty crazy point with these AI models, where they're doing better than a human at the task you're evaluating them on. And I don't know if you follow Artificial Analysis; they benchmark tons of models, and I thought this was super interesting. They had to create their own proprietary dataset that providers don't train on. They went through Earnings-22 and VoxPopuli and manually corrected all of them, because the ground truths were wrong. You should watch out for that when you're picking datasets off the shelf: you need good labels to do good optimization. You can't use something like this for statistical analysis if your labels are bad. And they even removed datasets that were simply wrong.
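The bracket-stripping normalization mentioned earlier (the normalizer pulls tags like [unclear] out of the transcript before scoring) can be sketched in a few lines. This is an illustrative normalizer written for this document, not AssemblyAI's actual implementation.

```python
import re
import string

def normalize(text: str) -> str:
    """Illustrative WER normalizer: drop bracketed tags such as [unclear]
    or [music], lowercase, and strip punctuation before scoring."""
    text = re.sub(r"\[[^\]]*\]", " ", text)   # remove [unclear], [music], ...
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())             # collapse whitespace

print(normalize("Hello, [unclear] world! [Music]"))  # -> "hello world"
```

With a normalizer like this in the eval pipeline, prompting the model to emit `[unclear]` for inaudible audio never counts against it in the WER comparison.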
And this last thing: this improved normalization is getting closer and closer to semantic WER, so we're starting to see the market converge on those things. I did want to address one quick thing for the folks in here trying out speaker labels in Universal 3. Speaker labels via prompting is, for sure, an experimental feature right now. If you're going to play around with it, your mileage may vary; it's not something you should put in production. We have a speaker diarization feature, and that's what you should be using on your API request if you want stable, correct speaker labels today. We're going to fold the power of the model and the acoustics we showed earlier into this API, so that'll be coming soon. But if you want consistent results, use speaker diarization, not the speaker tags from prompting, for now. That's just something to note. Go ahead. — And on top of that, it says it right there in the doc: you can use speaker identification on top of diarization, an additional feature that, based on the context of the call, uses the speaker labels to assign actual names. So while this model is really good, via prompting, at identifying speaker boundaries, it's not quite there yet for full speaker labels. Definitely try speaker diarization first with this model at the current point. And that addresses some of the comments around the diarization being wrong, as well as the speaker label hallucinations: you won't see those in the diarization and identification features; those are just in Universal 3 Pro prompting for the time being. — Ryan, we were mentioning how to iterate on prompts here, and I think this would be a good time to mention the prompt repair wizard, a new tool on the dashboard.
We've been doing a lot of this testing ourselves, iterating on prompts, and it becomes hard at scale. These repos show one way to do that, but this is another tool now available on the dashboard: you put in your prompt, describe the issues you're seeing in the output, and, based on what we know about prompting best practices and a scan of our docs, it will output ways to improve your prompt. — So we'll run this. Just for everybody, I know we're at the top of the hour, and I don't think we got to every single question in here. We're more than happy to jam on this with you: if you want a follow-up session, one-on-one, come join us in Slack (we'll send a way for you to jump in afterwards), or reach us by email or live chat, whatever you want. Again, this model is very new. Geez, we're only two weeks in, but it feels like two months. We are so excited about the capabilities; you can see some of them in this prompting. As we learn more and more, we're going to bring these capabilities to market as full-fledged features in our API. So I appreciate everyone's time today. We're here if you have questions, and thanks for coming out. And there we go, Griffin, right on time. You can actually see all the different things it's

Segment 13 (60:00 - 60:00)

recommending for you to go try to get a better prompt. And so definitely try this out. Play around. And we can't wait to hear your feedback. Like share feedback, let us know how it goes. Let us know what you find. Uh we're constantly learning with you and uh yeah, excited to see what you prompt. So, thank you all. — Bye everybody. Great questions.
