AI Show LIVE | Live Interpreter API & Seth live coding
1:05:58


Seth Juarez · 30.09.2025 · 17 views


Video description
Discover how Live Interpreter breaks language barriers with real-time speech-to-speech translation across 76 languages. It preserves voice, tone, and style—making global meetings, classrooms, and live streams seamless and inclusive.

Contents (13 segments)

Segment 1 (00:00 - 05:00)

Hey, hey. Cut. Jack, got it. Cut. Hello and welcome to this episode of the AI Show. Seth here. How exciting, another week. We're so glad you can spend it with us this morning, or this evening or this afternoon, wherever you may be. We've got some cool stuff today, and if you're watching this later on in a recording, hello to you as well. All right, where's everybody coming from? Out of the side of my eye I'll take a look at where everyone's coming from. This is the place where we do all things AI. We've been doing the AI Show for quite some time, actually. Believe it or not, we started a while ago, then we stopped for a second, and then we came back. I like talking about AI stuff, and I'm pretty excited about that. Okay, what are we doing this week? Let me share my screen here. Boom. Today we are going to be doing Live Interpreter with Jackie Kane, an awesome dude and a good guy. Hopefully he can come. Oh, it looks like his network is not good today, so he won't join live, but I'll show you some stuff that we recorded just the other day. It's great stuff. Maybe we'll go look at some of the docs and play with it. Second half, I actually have some work to get done, so I thought, why not get it done together? For number two, I have a project called Prompty. I did a little bit on this last week, but I was supposed to get the internals done this weekend, and I just did not. In the new Prompty spec there's a lot of what is called

Segment 2 (05:00 - 10:00)

polymorphism: a lot of polymorphism based on attributes in a YAML file. I started looking at this because I wanted the C# side done; one of the issues is that Prompty did not have a good C# implementation. Python was great because it's easier, right? It's a dynamic language. So maybe we'll work on some polymorphic "kind" YAML instantiation. How's that for a $5 phrase? A $5 phrase here from the AI Show. So we'll do a bit of that today, and then we'll do a little review of what Prompty is, just for kicks and giggles. And then this week, by the way, tell us where you're coming from; I'll take a look out of the side of my eye. Actually, maybe I should do a show on my setup, but what I'm looking at right now is a screen that has all of the stuff in it, so when I say "side of the eye," sometimes I mean like that. Looks like we have Richard here with us. Hello, Richard. How are you, buddy? Party time indeed. Oh, I did not change this. Let me change it to this one; it's better. There we go. Now I can say party time. Yes, welcome. Okay, so that's what we're going to be doing today with these two things, plus any of your questions or comments. I'll spitball with y'all: I want to get to a place where people can go to a site and record an idea or a thought or whatever, and then we play it here as part of the show, like deep AI thoughts from the community, and then we talk about them. I want to do that too. To be fair, I actually started vibe coding it a little and got pretty far, but I didn't finish. Oh, look at this: Richard is from British Columbia, Canada. I will be in Toronto starting tomorrow; I leave tomorrow. There's an AI Tour there, and then I'll be in Ottawa on Friday for a different AI Tour, so if you're around, come say hi to me at the AI Tour. Okay. Yeah, I know, deep thoughts. It's a great idea. We'll make it deep AI thoughts. I think that would be fantastic; I'm not going to lie, I think that would be amazing. We should totally do it: deep AI thoughts from the community. I bet we'd get some bangers, I'm pretty sure. But like I said, I started vibe coding it; I just never finished. We could open the show with deep AI thoughts, and we could even categorize them: the saltiest takes. There's a lot of salty takes; I would say that half of them are right. Gosh, I need to get some better sleep or something. It's like looking in a mirror. So we'll cut it into three: salty takes, pragmatic takes, and wildly optimistic takes, and then we'll see who wins in the week. That would be fantastic. We've got to write this down somewhere or record it. We'll record it, like, right now. All right, let's get started. Jackie was going to try to be here, but he couldn't, internet issues being as they are, so let's turn it over to Jackie Kang and Live Interpreter. Take a look. You're not going to want to miss this episode of the AI Show.
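For reference, here is a minimal sketch of what that "polymorphic kind YAML instantiation" can look like in C# with YamlDotNet: read the document once to find a `kind` discriminator, then deserialize into the concrete type it names. The class names and kind values are illustrative, not the actual Prompty spec.

```csharp
// Sketch: discriminator-based ("kind") polymorphic YAML instantiation.
// The Connection hierarchy and the kind map are illustrative placeholders.
using System;
using System.Collections.Generic;
using YamlDotNet.Serialization;
using YamlDotNet.Serialization.NamingConventions;

public abstract class Connection { }
public sealed class AzureConnection : Connection  { public string Endpoint { get; set; } = ""; }
public sealed class OpenAIConnection : Connection { public string ApiKey { get; set; } = ""; }

public static class KindYaml
{
    private static readonly Dictionary<string, Type> Kinds = new()
    {
        ["azure"]  = typeof(AzureConnection),
        ["openai"] = typeof(OpenAIConnection),
    };

    public static Connection Load(string yaml)
    {
        var deserializer = new DeserializerBuilder()
            .WithNamingConvention(CamelCaseNamingConvention.Instance)
            .IgnoreUnmatchedProperties() // the 'kind' key has no matching C# property
            .Build();

        // Pass 1: read the document generically to find the discriminator.
        var node = deserializer.Deserialize<Dictionary<string, object>>(yaml);
        if (!node.TryGetValue("kind", out var kind) || kind is null)
            throw new InvalidOperationException("document is missing 'kind'");

        // Pass 2: deserialize again into the concrete type the 'kind' names.
        return (Connection)deserializer.Deserialize(yaml, Kinds[kind.ToString()!]);
    }
}
```

Calling `KindYaml.Load("kind: azure\nendpoint: https://example")` would hand back an `AzureConnection`; in the real spec the discriminator and type map would come from Prompty's own definitions.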
We talk all about the live interpreter API with my friend Jackie King. Make sure you tune in. Hello, welcome to this episode of the AI show. We're talking all about the live

Segment 3 (10:00 - 15:00)

interpreter API with my friend Jackie King. Thank you so much for being with us, Jackie. — Hi, Seth. Nice to meet you. — Nice to meet you, too. Okay, so, first time on the show. Why don't you tell us who you are and what you do? — Yeah. Hello everyone. My name is Jackie K. I'm the PM lead for Azure Speech modeling and quality. My focus is on improving Microsoft's speech products and services and their overall quality. — That's amazing. So why don't we start with what the Live Interpreter API actually does, so people get a sense for it. I'm going to bring up a slide that you brought for us. — Yeah, I'm so excited to have the chance to talk about our new Live Interpreter API. This is a breakthrough that makes cross-language communication more seamless and natural than ever before. Originally we had a real-time speech translation API. It has been GA for quite a while, and it already powers translation applications and devices like Microsoft Translator, Teams, and other customer products. But there are some important gaps when it comes to seamless, human-like interpretation. Imagine a global conference where a speaker talks but the audience hears a delayed, robotic translation; or a meeting or phone call where the conversation feels choppy, with mismatched timing and no personal tone. One challenge is latency: the traditional system usually takes around seven seconds to finish speech-to-speech translation, which is much slower than a human interpreter. Another is voice quality: if you use a standard platform voice, it won't carry your tone, your volume, or your speaking style; the nuance and emotion of the message are lost. And in a lot of apps or devices you have to manually set the source language, which is inconvenient when you're in a conversation with people speaking multiple languages. So the new live interpreter, as you can see in the slides, brings several great things. First, you never need to set an input language: you just use your device or app and talk to anybody speaking any language, and it translates. It already supports 76 languages with 143 locales, covering the majority of the locales in the world. Second, we have cut speech-to-speech translation from 7 seconds to 5 seconds; I'll tell you how we did that later. Because it's much shorter and closer to human parity, the user experience is much better. Another amazing thing is that we introduced personal voice, which really simulates your tone and your voice, so the translation sounds like you speaking another language. — Hold on, as I'm looking at this. This is impressive. I don't know if people know this, and Jackie maybe doesn't: my research was in computational linguistics and machine translation. We used to need a whole bunch of huge models, so the fact that we can do live translation of voice in about five seconds, preserving intonation, across 76 languages and 143 locales, is actually really impressive. — Yeah, that's amazing. We're so excited that we can bring this to the customers and people who really want it. — Yeah. And as I'm looking at this, you have a couple of scenarios there.
Can you tell us about what kind of scenarios people can use this for? I mean, personally, I watched Star Trek when I was a kid and thought, "Wow, it would be great to have a universal translator." This is starting to get close to that for human language, which is really cool. — Yeah. We have several targeted scenarios. The first: think about a global company conference, or a national conference. The speaker speaks one language, but the audience members speak different languages, and they want to listen in their own language to understand better. Using this technology, what you say is seamlessly translated into different languages in your own voice, and because the latency is so short, it's as if you're speaking multiple languages in parallel and everybody can hear you. Imagine that; it's an amazing thing. Another scenario we've already heard from customers: they travel abroad and need to call a government office or a hospital, but they cannot speak the local

Segment 4 (15:00 - 20:00)

language. On that call, they want this technology embedded into their translator. Think about it: you can talk in your native language with local people directly, and they understand your content, your volume, your tone. Whether you're happy, angry, or in a hurry, that information is delivered directly. That's amazing. — And this is bidirectional, right? So they're talking and I hear it in my language; I'm talking and they're hearing it in their language. — Exactly. Both sides are just talking with people in their own language, a natural conversation. — This is amazing. — I really want to introduce how this works. — Yeah, can we move on to how this actually works? There's another slide. Can you explain what we're looking at here? — Yeah, I'm excited to tell you what's going on. The first diagram is the current traditional speech translation, what we call the cascade solution. It detects your language, transcribes the speech to text, does text-to-text translation, and then text-to-speech. Everything is whole-sentence based; that's where the latency comes from. In our latest technology we introduced what we call streamlined modeling. There is one model that automatically translates the audio into the target language: just one model, so you don't have to call multiple models. Within this model it does multilingual understanding; it automatically understands which language you are speaking and does the translation without having to detect it first. Meanwhile, your audio segments bypass to another component we call personal voice, which consumes those audio segments and simulates your voice with the translated content. All together, it's less than five seconds. And you can get this with just three lines of code, as in the code I shared: you set the target language you want, enable personal voice, and tell the system you want auto-detect. Boom, that's done; you get it. — This is cool, because now I'm starting to get a sense for why it's faster. In the first diagram, the language detection, speech-to-text, text-to-text, and then text-to-speech are separate steps; you've made a single model do that whole job, and then you've added another model for personalization that works at roughly the same time, so these things happen in parallel, so to speak. Is that right? — Yes, you get the point. It's exactly right: it's one model, and the personal-voice simulation happens in parallel at the same time, so we can seamlessly stream out the simulated voice in the target language. — So I'm looking at the code here; it's really cool how you set it up. Are you passing an audio stream into the service, and an audio stream comes back? How does that actually work? — Yes, you pass the audio stream in, and a few seconds later the service passes the translated audio stream back, so you can hear your voice in another language.
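The "three lines of code" on the slide aren't reproduced in the transcript, so here is a hedged sketch of that shape using the Azure Speech SDK for C#. The personal-voice switch is deliberately left as a placeholder comment, since the real property name comes from the live interpreter quick start, and the open-range auto-detect call reflects the SDK's language-identification docs rather than this video.

```csharp
// Sketch of the slide's three steps: target language, personal voice,
// auto-detect. SPEECH_KEY/SPEECH_REGION are assumed environment variables.
using System;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Translation;

var config = SpeechTranslationConfig.FromSubscription(
    Environment.GetEnvironmentVariable("SPEECH_KEY"),
    Environment.GetEnvironmentVariable("SPEECH_REGION"));

config.AddTargetLanguage("ja");         // 1. the language you want to hear
// config.SetProperty("...", "true");   // 2. enable personal voice (real name: see quick start)
var autoDetect = AutoDetectSourceLanguageConfig.FromOpenRange(); // 3. no fixed source language

using var mic = AudioConfig.FromDefaultMicrophoneInput();
using var recognizer = new TranslationRecognizer(config, autoDetect, mic);
await recognizer.StartContinuousRecognitionAsync();
```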
— Okay. I know you brought an example video. The one we're going to see next is where you passed audio through the service, and that's happening live. Is that right? — Yes. — Okay, let me put that up. It's basically happening live, but I heard it as if it was the same person. — Exactly. We pass the English audio stream (the original language is English) into the service and ask the service to generate three languages in parallel. You can hear that there's just a very short latency, and it sounds like the person is speaking three languages in parallel, with very natural voice quality. — And the other example you have is it happening in Teams, so I'm going to bring that up so we can take a look. Let's do that. — Hi everyone. Thank you for joining the call today. We're here to explore how AI technology is reshaping our meetings. Reggie, could you start by sharing the developments in China? No problem. AI real-time translation technology has greatly improved the efficiency and understanding of meetings, allowing us to communicate

Segment 5 (20:00 - 25:00)

more smoothly. Are there any new tools or trends in Japan that have an impact on meetings? Real-time translation by AI has evolved and removes barriers in meetings. It's very convenient, isn't it? What's the situation in Korea? It's giving us real strength: thanks to this technology, meetings are much more accurate and efficient, and everyone participates more actively. — Great conversation. Let me schedule a meeting tomorrow to discuss this further. I'll send out a meeting invite later. Thank you all. — Yeah, as you can see, that's exactly the live demo we've already published. It has already rolled out to every ring, for every Teams subscriber, so if you're using Teams you can use this tech right now. In a meeting you have people speaking different languages; if you enable this AI interpreter, it automatically translates, so you can understand them seamlessly, and it speaks in a voice just as if they were speaking the different languages themselves. — That's really cool. Sometimes I go to other countries and speak; it would be great if I could speak through a little device that outputs the audio to a microphone in the language of the place I'm at. That would be really cool. — Yeah, that's exactly our target scenario. We really want to help people who travel globally and need to speak with people who speak a different language. We want to bridge the gap between languages. — All right, so where can people go to find out more? I know there's a blog here; tell us about this. — Yes, this is the public preview announcement. It introduces exactly what we released and how you can use it. — And then obviously there's a Learn page where you can go learn a little more if you want. — Yes, it shares sample code and exactly how to use it and develop with it, just as I showed in the slides; it's just several lines of code. — Well, thank you so much for spending some time with us, Jackie. — Thank you, Seth. Thank you for having me. — And thank you so much for watching. We're learning all about the amazing live interpreter API with my friend Jackie King. Hopefully we'll see you next time. Take care. Alrighty, Jackie's here. Let's get him on. Okay, here we go. Jackie, let's see if it works. Jackie, can you hear us? Can you hear us, buddy? No. Oh, he can't hear. All right. First of all, a couple of people said things like this from Mike: "That's really cool." And Austin said, "Super cool." Here are a couple of questions. Is this using the OpenAI realtime models? Jackie... — Yes. Can you hear me? — Yes. — Okay. This is not using the realtime model. This is using a Microsoft internal transducer model, what we call a sequence-to-sequence model. But the realtime model is also one of the candidates. The realtime model still needs one step, making it stream out, because the realtime model takes streaming in but doesn't stream out. Once we do the streaming-out part, we can use the latest realtime model; we are doing that in the next version.
— This is cool, and that goes to the next question: it seems really fast. There's another question from Mike: how long do you have to sample before the translation can proceed? You said there's a little bit of lead time; how much is there to get the live translation out? — Yeah, it's a streaming input. The end-to-end output is around 5 seconds, but around two seconds of that is the text-to-voice step, which needs some time to process. If you only need text out, like if you're broadcasting your video in a different language and only showing the caption, it's around three seconds

Segment 6 (25:00 - 30:00)

or a little more; that's the end to end. So with streaming input, after around 3 seconds you see the translation text, and after 2 more seconds you hear the audio. — So I did not realize, though it was said in the video: this can also do speech-to-text live as well. Is that right? — Yes. — I did not know that. — Yeah. You can see in the diagram that we translate and then use personal voice to output speech in your own voice. And that's an average: some languages are much shorter, and some, like Japanese, take a little longer because key parts of the grammar come at the end of the sentence. So 5 seconds is the average time from speech in to speech out. And right now we're working on some amazing technology, still confidential, to bring it even shorter: like two to three seconds end-to-end speech-to-speech, and text around two seconds. That's where we're going. — Dude, me and Robert are both like, nice, this is cool stuff. The thing that's impressive to me is how you capture the user's voice and their intonation. — Yeah. The technology, as you can see in the diagram, is audio segment sampling: very short, every 200 milliseconds. So it catches your tone, your volume, your emotion, your speed, and carries that over to the target-language translation, where it simulates your voice. If you're very happy or very excited talking about something, all of that carries over, but in a different language. — Yeah. So here's Mike; I don't know if I understand the question all the way, so we might need to ask it a different way: "Do you channelize the translated audio to the listener's audio leg?" Let me ask it differently: is it just a stream of audio bits coming out of the translation API, or is it sending an event with each new audio chunk? How does that work? — When you send audio streaming into our API, the first step is that we sample those audio segments for the component we call personal voice. Meanwhile, the audio is also sent to the translation module, which translates it to text. The personal voice module gets your audio's volume, tone, emotion, all of that information, and when it receives the translated text it combines them to simulate your voice in that language. — Like I said, this is impressive. And on the other end, you can get either text streaming or audio bits streaming back. Is that right? — Yeah, you can get both. — Oh, I see. And Mike, as you're clarifying, I think what it means is that the API sends events with both the text and the audio bits; the service returns those two things, and then you can do whatever you want with them. Is that correct? — Yeah. — Which is cool. All right, Jackie, anything else you want to add that we didn't cover in the video? — Yeah, I will say we're talking about different scenarios. This one we call the real-time speech translator; it focuses on reducing latency and making everything very quick. So that's our goal, of course.
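A sketch of that "you can get both back" point, using the Speech SDK's public event model, which is the closest documented shape: translated text arrives through `Recognized` events, and synthesized audio arrives as byte chunks through `Synthesizing` events. The voice name and the `HandleAudioChunk` sink are illustrative.

```csharp
// Sketch: text stream and audio stream from one TranslationRecognizer.
using System;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Translation;

var config = SpeechTranslationConfig.FromSubscription(
    Environment.GetEnvironmentVariable("SPEECH_KEY"),
    Environment.GetEnvironmentVariable("SPEECH_REGION"));
config.SpeechRecognitionLanguage = "en-US";
config.AddTargetLanguage("fr");
config.VoiceName = "fr-FR-DeniseNeural"; // illustrative synthesis voice

using var mic = AudioConfig.FromDefaultMicrophoneInput();
using var recognizer = new TranslationRecognizer(config, mic);

recognizer.Recognized += (_, e) =>
{
    // Text stream: one entry per target language.
    foreach (var (lang, text) in e.Result.Translations)
        Console.WriteLine($"[{lang}] {text}");
};

recognizer.Synthesizing += (_, e) =>
{
    // Audio stream: raw bytes of the translated voice, chunk by chunk.
    byte[] chunk = e.Result.GetAudio();
    if (chunk.Length > 0)
        HandleAudioChunk(chunk); // hypothetical sink: play or buffer the bytes
};

await recognizer.StartContinuousRecognitionAsync();

static void HandleAudioChunk(byte[] bytes) { /* play or buffer */ }
```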
We also released a new API focused on offline use, which means you send an audio file and translate it into a different language, but in your own voice. It's offline: like this broadcast, you could have Jackie, or Seth, talking in Japanese. You can get that done offline; that one will be coming very soon. — Wow. Yeah, look, I know that in the future we all kind of wanted flying cars, but to me this kind of technology, where I can communicate with anybody in almost any language near-instantaneously (we're talking 3 to 5 seconds, 5 on average, longer for a language whose grammar makes you wait) is absolutely

Segment 7 (30:00 - 35:00)

impressive to me. So this is amazing. — Yeah, thank you for having me. And thank you, Microsoft, for giving this team the chance to release this product and bring it to our customers. I love it. — That's really exciting stuff. Thank you so much for your time, bud. We'll see you soon. — Thank you. — All right. That was Jackie. That is really cool. Just a little backstory: we were trying to get him on, and he was in a different building at Microsoft with bad internet, and I had to go to a meeting, so we got him on later. There's certain technology where you're just like, wow, and there's other technology where you're like, hey, that's really cool, golf clap for Jackie. It was literally a great demo for sure. "Where can we find information about the API usage and cost?" Great question; in fact, I have it left over. Here's the blog. Actually, let's go to it. Let me get the blog up, share my screen, and we can take a look. Share screen, entire screen. I don't know why I make the sound effect; something might be wrong with me. Here's the blog. By the way, Deb also has a channel that's all translation, all the time, so make sure you look her up; Deb is awesome, just FYI. Here are all the things: full language coverage, personal voice that preserves style and tone, translation delivered in a voice that sounds like the speaker. Yes. Amazing. Get started today. Here's the quick start guide. Continue. Here we go. Let's see about the cost, because that's what someone was asking. In this article there's C#, reference documentation. Here we go. Sensitive data, environment variables, yes. Create a speech translation configuration: you can see it right here, and holy cow, that's it, this is the config. Speech recognition language, speech translation language. Oh wow, it looks like you can target multiple languages at the same time: "with each call to add a target, a new target translation is specified; in other words, when speech is recognized from the source, each target translation is available." Initialize a translation recognizer; we did that. "Learn how to get the device ID for your audio." There's a lot of stuff here. Oh, and my face is gone; maybe that's better, you know? Maybe it's better that my face is gone. Let me go back to the chat. There it is: the translation, looks like two languages, and then translate async. Oh, but this is offline. No, wait, this is not offline, this is the real-time one: using a translation recognizer with the translation config and an audio config; looks like it's using a wave file, and there you go, there are the results happening live. Translate speech async, and for each language in the recognizer, here's the translation. Event-based translation, yeah; intermediate translation is not available when you're... yeah, that makes sense. And you can see all of it here. This is cool. Synthesizing here. Oh, here's Jackie's point: multiple voices are offline only; online supports one voice out, which makes sense. Online voice, very nice. Look at this. Oh, and if you don't know about languages, there's a language kind and then a sub-kind of it. Oh, no, no, that's not it; that's a voice name.
That's the person. So that's the kind of voice, which is nice.
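As a companion to the docs page being skimmed here, a minimal C# sketch of the multi-target pattern it describes: every `AddTargetLanguage` call adds another entry to the result's `Translations` map. Recognizing once from a WAV file mirrors the docs sample; the file name and environment variables are illustrative.

```csharp
// Sketch: one recognition, several target translations in the result.
using System;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Translation;

var config = SpeechTranslationConfig.FromSubscription(
    Environment.GetEnvironmentVariable("SPEECH_KEY"),
    Environment.GetEnvironmentVariable("SPEECH_REGION"));
config.SpeechRecognitionLanguage = "en-US";
config.AddTargetLanguage("de");
config.AddTargetLanguage("it"); // each added target shows up in the result

using var audio = AudioConfig.FromWavFileInput("sample.wav"); // illustrative file
using var recognizer = new TranslationRecognizer(config, audio);

var result = await recognizer.RecognizeOnceAsync();
if (result.Reason == ResultReason.TranslatedSpeech)
{
    Console.WriteLine($"Recognized: {result.Text}");
    foreach (var (lang, text) in result.Translations)
        Console.WriteLine($"{lang}: {text}");
}
```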

Segment 8 (35:00 - 40:00)

Manual synthesis: you can translate, and so on. This is all very cool. These are the voices. Man, this is so good. "In many scenarios, you might not know which input languages to specify; using language identification you can detect up to 10 possible input languages." Insane. We're living in the future here. This is cool. Oh, and here's a websocket. This is the live one: using Live Interpreter for real time. Yes, of course it'd be a websocket. Set your test file here; there's your test file. And if it's a websocket, that means you could probably grab a handle to your audio device and just shuttle bits into it, which is amazing. Yeah, here's the event. This is so cool. Using custom speech translation. I've got to put this in; I've got to write a little helper, a Windows service or something. Well, no, because with captions you can already do this in PowerPoint, but only with text; you can't do it with voice. It would be cool to have a little app that real-time translates my stuff. Okay, there it is; that's the bit we were talking about. So let me look at the chat. Oh, there's another thing Richard was saying: event-based should be used in tool execution too. Well, here's the thing, and now let's move on; we're moving to the coding part of this whole thing. Let's talk about tool execution. Let me turn this on again. "To pan, press mouse wheel or right mouse button." Oh, there we go: right mouse button. Nice. Okay, cool. So I'll briefly go over this, because this is what Prompty is supposed to be sorting out. Let me get this out here. There we go. Okay. To Richard's point, the way I think about tool calling (and I don't know if this is the orthodox way of doing it; I am more of a heterodox when it comes to this stuff): LLMs don't do tool calling. They don't execute tools. That's not what they do. Richard says: tool execution is based on the response from the LLM to the client, which has tools registered as a source to access. Yes, that's exactly right. I'll say it a different way: the LLM itself doesn't execute anything. I don't know where this idea came from or why people believe it. People say MCP servers are great, and they are, but it's not like the LLM is negotiating with an MCP server to execute stuff. That's not what's happening. Effectively, some text comes in, or, if you're using the realtime API, voice, and the LLM just produces a response: text, or with the realtime API, voice. That's all it's doing. There's nothing else. I feel like I need to say this in an emphatic way, right?

Segment 9 (40:00 - 45:00)

LLMs do not execute code; more at 11. I don't know what else to say: that's not what LLMs do. And there's this misguided apprehension that, hey, this model is going to execute things. No: it's the people making the service that uses an LLM who call stuff. If anything, don't blame the LLM; it's an innocent bystander in all of this. It's almost as if LLMs are given a bad rap, blamed for the ills of AI, but they don't do anything. They just return text. The thing about this text, though, is that you can tell the LLM to return JSON, or structured text. And what some enterprising person did is say: hey, we can format this response (we'll call it the response format) to look like a function call, with the name of the function, the parameters, and the values. That's all a function call is. So some enterprising youth, or youths, we're all youths, decided we could make it look like a function call. This is great, because now the LLM can respond in a way that lets the person using the LLM execute a function. But again, the LLM does not execute a function at all. Here's another one from Richard: but the response back from the LLM is to please ask for the tool to be executed, and then stuff the result back in the chat. Yes, that's what it's expecting. But you don't have to do that. The LLM is like, "mayhaps you execute this function and tell me what the response is" (mayhaps; this is how I imagine LLMs talk). You don't have to use an LLM like that if you don't want to. You could just say: tell me what function to run, and then take that and do something else. There's no rule that says that if you use an LLM, you need to execute the function and put the result back. It's not required. So that's what this does. And MCP servers? All they do is advertise response shapes. That's it. They advertise response shapes that look like function calls, and then there's magic stuff happening behind them. And this is the thing I do not like; maybe it's how all of this is obfuscated behind services. Now, I work at Microsoft, and we have services that are kind of like that, but at the lowest level, which is where I work, Azure AI Foundry, we do not obfuscate that. If there are tool calls that are service-side, we have to call those on your behalf, because they're service-based; but for about six months now our agents have given you a way to register a local function definition, and, as Richard says, the service calls your code back and says, "please execute this and put it back on the thread." And you don't have to; you can say no. That's how our service works, and it's going to get even more pointed in that direction; I can't say more. Here's another one: "Yeah, absolutely. LLMs don't run tools. They ask for them. MCP is what actually executes them and sends results back." Yeah, but you don't need MCP to do that. And if you already have existing APIs with your own way of doing security and so on, you don't need MCP.
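To make the point concrete, here's roughly what a function-call-shaped response looks like and how the client, not the model, executes it. The payload shape and the `get_weather` stub are illustrative, not any particular vendor's wire format.

```csharp
// Sketch: the model only *returns* a name plus JSON-encoded arguments;
// whether anything runs is entirely the client's decision.
using System;
using System.Text.Json;

const string modelOutput = """
{ "name": "get_weather", "arguments": "{\"city\": \"Toronto\"}" }
""";

var call = JsonDocument.Parse(modelOutput).RootElement;
string name = call.GetProperty("name").GetString()!;
var args = JsonDocument.Parse(call.GetProperty("arguments").GetString()!).RootElement;

// The "execution" is plain client code; the LLM never runs anything.
string result = name switch
{
    "get_weather" => GetWeather(args.GetProperty("city").GetString()!),
    _ => throw new InvalidOperationException($"model asked for unknown tool: {name}")
};

// Feed `result` back into the conversation, or don't. Nothing forces you to.
Console.WriteLine(result);

static string GetWeather(string city) => $"Sunny in {city}"; // illustrative stub
```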
You could just call those. MCP makes it nice to expose these things, but I don't know; I worry about the

Segment 10 (45:00 - 50:00)

exposure. Yeah, here it is: for example, if you ask it to write a SQL query, the model will just output a structured request like this. Yes. And I'll also say I don't know how good these things are at writing good SQL queries; someone will be able to tell us about that. So this is how I think about this stuff: the LLM has become a programming primitive. Maybe we should have a shirt, you know, with the background all green, that says "programming primitive to the rescue." Here's another comment: that LLM response is a mix between what the LLM is trying to say as conversation and a request for tool execution. Yes, it is. But you don't have to use the LLM that way if you don't want to; I think that's what I'm trying to say. Just because someone decided that chat is the way to use these things doesn't mean you have to use them that way. Now, this is where you and I agree, Richard: you can use that protocol to do stuff, but you can do whatever you want. You don't have to be the conformist. Oh man, "programming primitive," this is funny. So that's what this is. Okay. "But you need that information." No. Why? You don't need the information from the tool; that's only if you're assuming you're doing a chat thing. Let me be more specific: maybe you just want something to come in to the LLM, you want the LLM to say "you should run this function," and then you pass it off to another system that runs the function and does something else. You don't have to pass the result back to the LLM at all. This doesn't even have to be a chat; the input could come from another function. There's no rule that says that if you don't put the result back on the chat, the LLM explodes. You can do whatever you want. Here's the fun bit: we're going to need extra power for this one; I'm going to dial up some more powerful internet, because this is an interesting topic. I think what's important is that this stuff came out not last year but the year before; in internet years it might as well have been a hundred years ago. I moved on from using it like a chat thing pretty quickly. I used it to do chain-of-thought reasoning and other stuff, but I controlled the inputs and outputs of the LLM. I just use the LLM as a programming primitive to go from unstructured to more structured, so that I can drive control flow. Here's one: "I had a question, 'How many chapters are in the book?', that drove me into this rabbit hole. An LLM with a RAG of chapters could not answer the question." Yes. First, Richard: excellent point. The first thing I do is ask myself, do I even need an LLM for that kind of query? The answer is absolutely not; I would never use an LLM for this kind of question. Having said that, you can see why it couldn't answer it.
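Here is a hedged sketch of that "programming primitive" idea: unstructured text in, a fixed JSON shape out, ordinary control flow after, and no chat loop anywhere. `CallModelAsync` and the workflow methods are hypothetical stand-ins for your own model client and systems.

```csharp
// Sketch: the LLM is just the unstructured-to-structured step.
using System;
using System.Text.Json;
using System.Text.Json.Serialization;
using System.Threading.Tasks;

public enum Route { Refund, Billing, Human }

public sealed record Triage(Route Route, string Reason);

public static class PrimitiveExample
{
    private static readonly JsonSerializerOptions Options = new()
    {
        PropertyNameCaseInsensitive = true,
        Converters = { new JsonStringEnumConverter() }, // "Billing" -> Route.Billing
    };

    public static async Task HandleAsync(string rawEmail)
    {
        // Unstructured text in; a constrained JSON shape out.
        string json = await CallModelAsync(
            "Classify this email as Refund, Billing, or Human. " +
            "Reply with JSON {\"route\": ..., \"reason\": ...}.\n\n" + rawEmail);

        var triage = JsonSerializer.Deserialize<Triage>(json, Options)!;

        // From here on it's ordinary control flow driven by the parsed result.
        switch (triage.Route)
        {
            case Route.Refund:  await StartRefundWorkflow(rawEmail); break;
            case Route.Billing: await OpenBillingTicket(rawEmail);   break;
            default:            await EscalateToHuman(rawEmail);     break;
        }
    }

    // Hypothetical stand-ins; wire these to a real model client and systems.
    private static Task<string> CallModelAsync(string prompt) =>
        Task.FromResult("{\"route\":\"Billing\",\"reason\":\"invoice question\"}");
    private static Task StartRefundWorkflow(string email) => Task.CompletedTask;
    private static Task OpenBillingTicket(string email)   => Task.CompletedTask;
    private static Task EscalateToHuman(string email)     => Task.CompletedTask;
}
```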
But there are certain questions where you can see the appeal. And like I said, what year are we in? 2025. These things came out, what, 2023-ish? Was it 2023? Let's ask the internet: when did the first version of ChatGPT come out? Okay, cool: 2022. So we are now three years on. 2022, 23, 24, 25,

Segment 11 (50:00 - 55:00)

right? It's October, so we're a month away from three years. And I kind of lost my train of thought there. Oh yeah: three years on from this. It's been a while, and everyone has just sort of pigeonholed this technology into being about chat: "I need to make a chatbot." You do not. You could do a claims-processing thing: use one of our services to get information out of a PDF, then do stuff and fill out forms for you. That's not a chat thing at all, and no one talks to it. You can make these things go offline and do long-running tasks. But yeah, you're right, if you're in the thread of the conversation... yes, but it doesn't have to be a thread of people at all. It could be: another LLM did this, another LLM did that. Do you see what I'm saying? This is where people lose sight of why this tech is so cool. We pigeonholed this thing for three years into chat, and anytime I see a chat on a thing and I know I'm talking to a robot, I don't want to do it. And this is coming from the guy who made a demo with this stuff in, like, January 2023. But I use these LLMs as programming primitives to go from unstructured to structured, and once you think of it like that, chat is just one of the billion things you can do. So that's what I think about that. And wow, that rant took a long time; we're at like 9:19. But I'm curious what people think about this, because I work with a lot of people and I say things dogmatically, not because I believe them 100%, but so that there's clarity in the thinking. That's the important part: clarity in the thinking. But then people will say, "hey, you're wrong," and I'll update my mental model if there's a compelling argument for saying, "wow, Seth, you're completely off your rocker." Seth is off his rocker, news at 11. So that's how I think of these things. It makes it so that you can actually build sensible software with this stuff, and then the engineering trade-off comes not in which LLM to use but in how much to make it do in a single call. Yes, and I think the assumption you're making, Richard, is that this interaction is with a person. It doesn't have to be; this LLM interaction right here does not have to be with a person, if that makes sense. But yes, keep pushing, because (oh, my earpiece just fell out, sorry about that) I love these kinds of things where someone pushes me and I go, hm, maybe I'm wrong. I had a guy who worked in our design and computational information group; one time in a meeting he kept leaning on a point, and I thought, that can't be right, so I private-messaged him: "hey, you were really leaning on this; can you maybe spend a half hour helping me understand why?" He went through it, and I learned something new. It was great. So the conversation does not have to be between the LLM and a person. The conversation can be between an LLM and another computer system. And it's not a conversation in the sense that they're talking to each other. Here, let me show you a crazy thing.

Segment 12 (55:00 - 60:00)

I need to share with audio. Oh, half my face is gone; maybe that's a good thing. Let me see if I can share my screen here. Share with audio, entire screen. Okay, let me show you something. This is a demo I do. Nice, turn my debug on. It's not going to work all the way because it's using my camera; it can't take my camera. Let's see if you can hear it. Tell me this. Hey. What? What can you do? People are telling me you can't hear. One second, let me see if I can fix this. Okay, so Rihanna in the back can't hear the thing talking; she'll type at me. No. Dang it. Let me stop sharing here and try sharing the entire screen with system audio. — All right, you still there? — Yes, I'm still here. How can I assist you today, Seth? — Can you tell us what you do in Japanese, please? Hey, can you make a Zava Corp shirt for me? In the design, on the front of the shirt, I want you to put a picture of a cup, because we are the Cup Kickers. Can you make a Zava Corp shirt like that for me? — Absolutely. I'd be happy to help design a Zava Corp shirt for you with a Cup Kickers theme. Could you please provide a little more detail for the design? — No, you take all the liberty you want. I just need a cup in the center and "Cup Kickers" on the thing; that's the only thing that matters to me. Can you do that? — Got it. I'll focus on creating a Zava Corp shirt with a prominent image of a cup in the center and the text "Cup Kickers." I'll take care of the rest of the design details to make it stylish and fitting for Zava. I'll get started on that right away. Your Cup Kickers t-shirt design is ready. It features a prominent cup image in the center with the text "Cup Kickers," all styled with Zava's modern aesthetic. I hope it captures your vision perfectly. — Hey, thanks. That's amazing. — You're welcome, Seth. I'm glad you like it. If you need any more designs or have any other requests, just let me know. Enjoy showing off your new Cup Kickers shirt. — Okay, so I want to show you something. And this is the part, Richard, that makes more sense once you see what's actually happening. This agent is GPT-4o realtime; it's the 4o realtime API, and I'm talking to it. It doesn't know how to make things like this. So what I did is give it access to tool calls, and the realtime API asks for tool calls. Let me show you how it asks to execute a function. You can see right here, it's like "call a function." By the way, the interaction I just had was real; it happened right now in front of everybody. I didn't fake it with a video. Take a look at how it called the other agent

Segment 13 (60:00 - 65:00)

with its own language. Notice that the interaction between the voice controller and the other agents in the system had its own thread. Here, I'll show you another thing. This is Azure AI Foundry; let's go to Agents, just so you can see. Right now it's 9:28 a.m. Here's the long thread that we just had. You can see right here, this is the agent calling; this is a different agent calling this agent. This was not me; I did not say this. Notice that in this case, the LLM used here did not talk with a human at all. It talked with that voice thing. It asked for this stuff; this is what it asked for. This is the full conversation we just had, right? It remembered; then it talked Japanese, there it is; and then "Corp" (it's Corp, not corpse, gosh). You can see it's talking here; here's the agent saying, "hey, I need to call this other thing, so I'll just put this on the thread." But notice that the sub-agent was not executed by a person. And then we get this notion of a global thread, of thread-local scope, of scoping these threads so the LLM can do its work. All a thread is, is context for the LLM to do its work; it basically goes into the prompt. Oh crud, we are almost done; walk-off music. Hopefully that makes sense. This was an example of me working with a separate LLM, but that LLM never heard anything from a human. Yes, Richard: the response from my voice agent was used to start a conversation with another agent, and then I marshalled the output. But that other agent could have called another agent, which had another thread, and so on. So effectively, thread scoping: a thread is the conversation history, or the work; I just think of it as notes for its work. There could be a chain of agents, each with a thread that has nothing to do with a human. No chat. And this is where this stuff starts to get awesome, because hopefully when you watch this you're like, "Holy cow, Seth is talking to an AI thing that's talking to other AI things in English." Now, I had to do it this way because of certain limitations; I just wanted to show you the extreme. If I didn't have to, I would literally have it call a function that does the right thing. But dude, it worked. By the way, this image is brought to you by Flux Context Pro; that's why the text is fantastic. So again, I appreciate everyone coming today. Thank you so much for spending your time with us. I can't believe we ran out of time so quickly. Let me remove that. Hello again, friends. Thank you so much for spending some time with us. Next week on the AI Show: "Unlocking AI potential with Foundry Local: integration with VS Code and GitHub" with Manav Dalal, and "New agent framework for next-gen multi-agent solutions" with Elijah Straight. Two amazing things, by the way. I'm excited for Foundry Local; imagine being able to use models locally to do the stuff we just talked about. Thank you so much for spending some time with us. We know your time is valuable, and that you spent it with us is a privilege and an honor. Thank you so much for being with us.
And by the way, we have good stuff coming up next week, so make sure you tune in. Thank you so much for watching and hopefully we'll see you next time. Take care, friends.
