Build Hour: AgentKit
46:19


OpenAI · 29.10.2025


Video description
Introducing AgentKit: build, deploy, and optimize agentic workflows with a complete set of tools. This Build Hour demos how to design workflows visually and embed agentic UIs faster to create multi-step tool-calling agents. Samarth Madduru (Solutions Engineering), Tasia Potasinski (Product Marketing), and Henry Scott-Green (Product, Platform) cover:

• Build with Agent Builder: a visual, canvas-based orchestration tool
• Deploy with ChatKit: an embeddable, customizable chat UI
• Optimize with new Evals capabilities: datasets, trace grading, auto-prompt optimization
• Real-world examples from startups to Fortune 500 companies like Ramp, Rippling, HubSpot, Carlyle, and Bain
• Live Q&A

👉 AgentKit Docs: https://platform.openai.com/docs/guides/agents/agent-builder
👉 AgentKit Cookbook: https://cookbook.openai.com/examples/agentkit/agentkit_walkthrough
👉 ChatKit Studio: https://chatkit.studio/playground
👉 Sign up for upcoming live Build Hours: https://webinar.openai.com/buildhours/

00:00 Introduction
04:50 Agent Builder
21:27 ChatKit
24:53 Evals
35:17 Real World Examples

Table of contents (5 segments)

Introduction

All right. Hi everyone. Welcome to OpenAI Build Hours. I'm Tasha, product marketing manager on the platform team. Really excited to introduce our speakers for today. So, myself kicking things off, uh, Samarth from our applied AI team on the startup side, and Henry, who runs product for the platform team. Awesome. So, as a reminder, our goal here with Build Hours is to empower you builders with the best practices, tools, and AI expertise to scale your company, your products, and your vision with OpenAI's APIs and models. Um, you can see the schedule down here at the link below, openai.com/buildhours.

Awesome. Our agenda for today. So, I will quickly go over AgentKit, which we launched just a couple weeks ago at DevDay. Um, then hand it off to Samarth for an AgentKit demo. Henry will then run us through evals, which really help bring those agents to life and let us trust them at scale. Um, if we have time, we'll go over a couple real-world examples, and then definitely leaving time for Q&A at the end. So feel free to add your questions as we go through.

Awesome. So let's do a quick snapshot of what building with agents was like for the last, really, like, several months or even a year. Uh, it used to be super complex. Uh, orchestration was hard. You had to write it all in code. Uh, if you wanted to update the version, it would sometimes introduce breaking changes. Uh, if you wanted to connect tools securely, you had to write custom code to do so. Um, and then running evals required you to manually extract data from one system into a separate eval platform, daisy-chaining all of these separate systems together to make sure you could actually trust those agents at scale. Um, prompt optimization was slow uh and manual. And then on top of all of that, you had to build UI to bring those agents to life. And that takes another several weeks or months to build. So basically it was in massive need of a huge upgrade, which is what we're doing here.

Um, so with AgentKit we hope that we've made some incremental improvements to how you can build agents. Um, now workflows can be built visually with a visual workflow builder. It's versioned, um, so no breaking changes are introduced. Um, there's an admin center uh called the Connector Registry where you can safely uh connect data and tools, and we have evals built into the platform, which even include third-party model support. Um, as Samarth will show us in a bit, there's an automated prompt optimization tool as well, uh, which makes it really easy to perfect those prompts uh automatically rather than trial-and-error yourself manually. Um, and then finally we have ChatKit, which is a customizable UI.

Cool. So, bringing it all together, this is sort of the AgentKit tech stack. At the bottom we have Agent Builder, uh, where you can choose which models to deploy the agents with. Connect tools, um, write and automate and optimize those prompts. Add guardrails so that the agents perform as you would expect them to, even when they get um unexpected queries. Uh, deploy that to ChatKit, which you can host yourself or with OpenAI, and then optimize those agents at scale in the real world, with real-world data from real humans, by observing uh and optimizing how they perform uh through our eval platform.

Cool. So we're already seeing a bunch of startups and Fortune 500s and everything in between using agents to build a breadth of use cases.
Some of the more popular ones that we're seeing are things like customer support agents to triage and answer chat-based customer support tickets; sales assistants, similar to the one that we'll actually demo today; um, internal productivity tools, like the ones that we use at OpenAI to help teams across the board um work smarter and faster and reduce duplicate work; uh, knowledge assistants; and even doing research, like document research or general research. And the screenshot here on the right is just a few uh templates that we have in the Agent Builder that show some of the major use cases that we're already powering.

Okay, so um let's make this all real with a real-world example. Uh, a common challenge that businesses face is driving and increasing revenue. Let's say that your sales team is too busy outbounding to prospects, building relationships, meeting with customers. We want to build a go-to-market assistant to help save sales time and increase revenue. And with that, I will kick it over to Samarth to show us how to do it.

— Great. One of the biggest questions that we get at OpenAI is: how do we use OpenAI within OpenAI? Um, and hopefully this kind of pulls the curtain back a little so you can take a peek at how we actually build some of our go-to-market assistants. Um, we'll cover a few different topics today, like uh agents that are capable of uh doing data analysis, lead qualification, as well as outbound email generation. Um, so what I'll do here is move over and share.

Agent Builder

Great. So, we're actually on our Atlas browser. Um, feel free to download that. I had a fantastic time using it these past few weeks, and um I think it saved me hours if not uh, you know, days' worth of time doing some things sometimes, and uh, um, I'm a big fan.

Uh, okay, so we'll get started, and when we get into the Agent Builder platform, the first thing that we really see um is a start node and the agent node. Um, you can think of the agent as the atomic particle within, you know, the workflow that you go in and construct, and behind it is the Agents SDK, which actually powers the entirety of Agent Builder. Whenever we build these Agent Builder workflows, um, they don't have to live within the OpenAI platform. Uh, you can copy this code, host it on your own, and you might want to even, you know, take this beyond traditional chat applications and do things uh like being able to trigger these via webhooks.

So for this example, um, we have three agents in mind that we're looking to build out: the data analysis one, where we'll pull from Databricks; a lead qualification one, where we'll scour the internet for additional details; and outbound email generation, um, where we want to maybe qualify an email with things on a product or a marketing campaign that we're launching. Sound good? — That sounds great. I'm on board. — Okay, great.

So, we'll get started by building our first agent here. Since we have uh three different types of use cases in mind for what we're actually trying to build, um, what we want to do is use a very traditional architectural pattern: a triage agent. So the way that we think about this is that agents are really good at doing specialized tasks. So if we break down the question and route it to, um, you know, the proper sub-agent, we might be able to get better responses. So for this first agent, let's call this a question classifier. Typing is hard. I'll copy over the prompt that we've put in here. I'll just take a quick peek at what this looks like. And really what we're doing here is asking the model to qualify, or classify, a question as either a qualification, a data, or an email type of question. Really, the idea here is that we can then route this query depending on what the model selected as its output. Um, and rather than having a traditional text output, what we want to do here is actually force the model to output in a schema that we recognize and can use for the rest of the workflow. So let's call the variable that the model will output category, and select the type as enum. What this means is the model will only output a selection uh from the list that we provide here. So, um, from my prompt, I had the email agent, the data agent, and the qualification agent. — Great. — And real quick, uh, how did you write the prompt? Did you write that all yourself? I know the importance of prompts in steering the agent. How did you come up with that? — I think writing prompts is one of the most cumbersome things that we can do. Um, there's a lot of time spent spinning wheels on what actually matters when you're capturing that initial prompt. And I think um one of the key ways that I write prompts myself is to use ChatGPT and GPT-5 to create my v0 of the prompts. Um, within Agent Builder itself, you can actually go in and, uh, edit the prompt or create prompts from scratch to use as, uh, the bare bones for what you might, you know, spin on in the future for your agent workflows.
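For reference, here is a rough sketch of what this classifier node looks like if exported to code, using the Python Agents SDK (the `openai-agents` package). The prompt is paraphrased from the demo; the rest is illustrative:

```python
from enum import Enum

from agents import Agent, Runner
from pydantic import BaseModel


class Category(str, Enum):
    EMAIL = "email"
    DATA = "data"
    QUALIFICATION = "qualification"


class Classification(BaseModel):
    # Enum-typed structured output: the model must answer with exactly
    # one of the three categories, like the enum field built in the demo.
    category: Category


classifier = Agent(
    name="Question classifier",
    instructions=(
        "Classify the user's question as an email, data, or qualification "
        "request. Output only the category."  # paraphrase of the demo prompt
    ),
    output_type=Classification,
)

result = Runner.run_sync(classifier, "Show me the top 10 accounts")
print(result.final_output.category)  # expected: Category.DATA
```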
Um, for now, we'll leave it as the one that, uh, we pasted in here, but in the rest of this workflow, we'll take a peek at what using that actually looks like. Great. Um, so now that we've actually got the output, um, Agent Builder actually allows us to make this very stateful. So, for example, I have um a set-state icon here. Sorry, just again, drag-and-dropping also can be difficult. Um, so what we want to do here is take that output value from the previous stage and assign it to a new variable, such that the rest of this workflow is able to reference it. Um, we'll call this category again. Um, and assign no default value for now. Um, using that same value, I can now conditionally branch to either the data analysis agent or the rest of my workflow, to handle maybe additional steps I want to do prior to executing the email use case, or the data use case, or the customer qualification use case.

Um, what we'll do here is drag this agent in, and we'll set the conditional statement here to say, um, if the state category is equal to data. Let's see. Oh, it looks like I spelled it wrong. — Debugging. [clears throat] Great. — As you can see, there are helpful hints where we were actually able to see um what went wrong and really quickly go back and debug that. So here, in this case, if we see that it's a data question, we'll route to that separate agent, and if it's not, we'll probably use um additional logic to go in and scour the internet for those um, you know, inbound leads that we want to qualify, or write an email.

Um, let's stick with the data analysis agent for now and go over what it's like to actually go in and connect to external sources within Agent Builder and, largely, the Agents SDK. Um, what I want to do here is actually instruct the model on how to use Databricks and create queries that it can use um in concert with an MCP server. So what we've done here is uh added a tool for the model to be able to go and access this MCP server and query Databricks however it sees fit. Um, if my query is really hard and might require, you know, joins, Databricks and GPT-5 would be able to work together to then create a concise query. Um, so since I've built my own server for now, um, I'll add it here. I'll add my URL first. Um, I'll call this the Databricks MCP server. Um, and what I'll do here is actually choose the authentication pattern. You can also select no authentication. Um, but for things that are protected resources or might live within authenticated platforms, you might want to use something like a personal access token to do that last mile of federation. So, in this case, I'll use um a personal access token I created within my Databricks instance and hit create here. Let's give it a second to pull up the tools. And we can see that a fetch tool has actually surfaced here. Um, what this allows us to do is select a subset of the functions that are actually available on the MCP server, um, to really allow the model to not get overwhelmed with the choices of potential actions that it can take. So, I'll add that tool there. Oops.
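The MCP configuration above maps roughly onto the Agents SDK's hosted MCP tool. A sketch with a placeholder server URL and a token read from an environment variable; the `tool_config` fields follow the Responses API MCP tool as I understand it, and may differ from an exact Agent Builder export:

```python
import os

from agents import Agent, HostedMCPTool

databricks_mcp = HostedMCPTool(
    tool_config={
        "type": "mcp",
        "server_label": "databricks",
        # Placeholder URL; the demo used a self-hosted Databricks MCP server.
        "server_url": "https://example.com/databricks-mcp",
        # Last-mile auth with a personal access token, as in the demo.
        "headers": {"Authorization": f"Bearer {os.environ['DATABRICKS_PAT']}"},
        # Ask for user consent before tool calls; "never" would auto-approve.
        "require_approval": "always",
    }
)

data_agent = Agent(
    name="Data analysis agent",
    instructions=(
        "Answer account questions by composing SQL against Databricks "
        "through the MCP server, then summarize results in natural language."
    ),
    tools=[databricks_mcp],
)
```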
Um, and I'll also, um, I'll go back. One thing I might have missed here is actually setting the model. For the classifier, what I want to do is make it really snappy, and so what I can do is choose a non-reasoning model there. But for this one, I really want the model to iterate on these queries and react to the way that the results were actually perceived by um the agent.

And so uh what we'll do here is do a quick test query to make sure the piping works. So maybe I'll say, um, show me the top 10 accounts. That should be good enough. Um, and what we can see is the model actually stepping through the individual stages of this workflow. So in the beginning, you can see that it classified this question as a data question, saved that state, and then routed. Um, we can see that when it reached that agent and decided to use that tool, it actually asked us for consent to go and take that action. You can configure that logic on the front end to handle how to actually show to the user, hey, the model actually wants to go and uh take an action there. Um, with MCP, you're able to do both read and write actions. And we have a few of these MCP servers out of the box. Think like Gmail. Um, we have a ton more uh out of the box that you're able to connect to. — SharePoint. — Totally. Um, and so here we can see that the model is actually, you know, thinking about how to construct that query. And we can see a response here. We didn't ask for the model to really format this result for us, but we can actually really quickly do that with this agent itself, by just asking the model to say, um, I would like the results to be in natural language. And just by, you know, spinning on um the generate button within Agent Builder itself, you're able to make these inline changes depending on the results that you see in real time. — Super cool. — Cool. Um, so the next thing I want to do is actually create another agent to do some of that research that we were mentioning, which might be useful for something like generating an email or uh qualifying a lead. So, we'll call this the information gathering agent. Looks like it's stuck here. I might have to give it a quick refresh in a moment. See, platform's a bit buggy. Great. Um, cool.

So, we're at this information gathering agent, and what we want to do is tell the model uh how to actually go and search the internet for the leads that we want. Particularly, we're looking for a subset of the information that might be publicly available for a company. So, think about things like the company legal name, the number of employees they have, the company description, maybe their annual revenue, as well as their geography. Um, and what we want to do here, again, is use a structured output to define what our output should look like when the model goes and um searches the internet for this. This gives us a good mapping, lets the model itself know what to look for when it's writing these queries, and lets us then, uh, you know, instruct the model in terms of the way that it should search across the internet. Great. Um, what we want to do here is also change the output format to uh the schema that we want. Maybe we want to put the fields that we previously just showed into a structured output format. You can also add descriptions um in the properties, but for now we're going to leave those blank. Great. So now, when the model routes to this information gathering agent, it will hit this uh agent, search the internet, and output in the format that we're looking for. Cool.
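A sketch of that information gathering agent in the same SDK: web search plus a structured output whose fields mirror the schema built in the demo (the field types are assumptions, since descriptions were left blank there too):

```python
from agents import Agent, Runner, WebSearchTool
from pydantic import BaseModel


class CompanyProfile(BaseModel):
    # Fields mirror the schema from the demo; types are guesses.
    legal_name: str
    employee_count: int
    description: str
    annual_revenue: str
    geography: str


info_agent = Agent(
    name="Information gathering agent",
    instructions=(
        "Search the web for publicly available facts about the company the "
        "user names, and fill in every field of the schema."
    ),
    tools=[WebSearchTool()],
    output_type=CompanyProfile,
)

profile = Runner.run_sync(info_agent, "OpenAI").final_output
print(profile.model_dump_json(indent=2))
```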
Um, since we saved the state of the query routing in the beginning, we can go ahead and reference it again um when we're going to um route again, to the email agent or to the lead generation and lead enhancement agent. So what we'll do here is set this equal to email, and then otherwise we'll just route to the other agent. — Awesome. Yeah. And the sub-agent architecture is great because it means that you get better quality results a bit faster than you would just using one general-purpose agent, which is helpful for actually having impact and helping the sales team be more productive.

— Um, what we'll do here is paste in a prompt for this email agent. Um, but really the highlight for the email agent is that we're looking to generate emails built not just from, you know, information from the query or from the internet; we also want to upload files that map to the way that we're actually thinking about building uh emails in general for marketing campaigns. So, what you may have in this case is something like PDFs that contain information on what the campaign is. Maybe you have other PDFs that contain information on how you should write emails. Um, all of this is really useful information for the model in order to spec out what that email should actually look like. Um, so what we'll do here is add a tool to actually go and search these files. You can attach vector stores that you may already have um to the workflow and use those out of the box. You're also able to add these via API. Um, but for now, what we'll do is just drag in a couple files that we have. Um, we have one that's a standard operating procedure for how to write emails. And we have another document on a potential promotion that this sample company has. Um, and what we've done is allowed the model to then go in and search the vector store for this type of information in order to actually go and generate that email.

Um, on the lead enhancement agent, instead of writing a prompt ourselves, let's pretend like we have, uh, like, a general segmentation of the market that we want to assign various account executives to. So in this case, what we want to do is essentially um output a quick schematic of how we're going to do that assigning process, depending on the information that was gathered from the internet. And without writing a prompt, um, Agent Builder will be able to output an entire, um, you know, version of that prompt as a starting point. — Super cool. — Great.
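In code, attaching file search to the email agent looks roughly like this; the vector store ID is a placeholder for the store holding the SOP and promotion PDFs:

```python
from agents import Agent, FileSearchTool

email_agent = Agent(
    name="Email agent",
    instructions=(
        "Draft an outbound email. First search the attached files for the "
        "email-writing SOP and the current promotion before writing."
    ),
    tools=[
        FileSearchTool(
            # Placeholder ID of the store holding the SOP and promo PDFs.
            vector_store_ids=["vs_..."],
            max_num_results=5,
        )
    ],
)
```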
Um, before I move away from Agent Builder and show this working end to end, what I wanted to show is that Agent Builder doesn't just support text and structured output formats; we also support really rich widgets. So, what this looks like in practice is that uh we can, instead of outputting text or JSON, upload a widget. And I'll show you in a little bit what it looks like to actually create a widget and use a widget. But we can actually go in and upload a widget file itself. Um, so I'll drag this in here. Or maybe I have to... Great. So we can see a quick preview of what this widget looks like. Rather than just outputting in text, where maybe, you know, traditionally, like ChatGPT, what you see is a markdown-formatted result, we want to maybe render something richer, such that if you do host this on your own website, you're able to have that multimodal um component as well. So what we'll do here is create this component.

And now if I say, draft an email to... should we use OpenAI? — Great, so we can see that it went to the information gathering agent, um, since we've given access to the web search tool. For that reason... did we do that? Let me make sure that I did that. — May have skipped. — Might have skipped that step. — There we go. Great. — So again, sorry. — I was just gonna say, I love that you can test the workflow live here and debug it, like we're doing, before going to production. — Totally. And the really nice thing is, as you run questions through this workflow, we save the traces of exactly how the model has executed, um, you know, various queries, and then, more holistically, the way that the workflow has orchestrated. So this is really rich information as you're continuing to iterate on your workflow. And Henry will touch on this a ton, but the ability to really peel back the curtain and see how the model is thinking about this, and then assign graders, I think really allows you to scale out this process of evaluations as well. — Yeah. — So, great. Looks like here it's searching for Lumaf Fleet. Um, we'll let this run for a little bit and see what happens at the end. Um, okay. Looks like it might take a little bit to do that. We'll get back to that one.

Um, end to end. So, what we've built here is essentially an agent that allows you to do three different things. The first one is that it allows you to go and query Databricks, to pull in um, you know, information that might live behind some form of an information wall and bring that within the agent workflow itself. Um, and then alternatively being able to write uh emails, um, and then also qualify inbound leads that you might get from customers. Um, all of this lives within a workflow that you can then host within um ChatKit, which we'll cover, or you can take it out and use it in your own codebase to handle um what those chat workflows actually look like.

— Super cool. Um, one of the questions I was wondering was, uh, what's the difference between pulling a tool from the left-hand sidebar and, like, drag-and-dropping that in as a node, as opposed to adding that tool into the um agent node specifically? — Totally, great question. So, um, when I added, like, the search tool to the information gathering agent, I've allowed the model to determine if it should actually go in and use that tool. Sometimes I always want the tool to run prior to an agent actually getting that information. So, I can add one of these nodes to ensure that the tool action actually happens before the agent receives that information. — Makes a ton of sense. Yeah. So, AgentKit, then, I feel like, is a good combination of deterministic and, somewhat, if you want, non-deterministic outcomes. — Yeah. — Cool. — Great.

Um, I want to pivot: so, we built this amazing workflow. Now we want to go in and deploy it. Um, I think one of the most fantastic things that we released at our most recent DevDay was the ability to go in and host these workflows that you've built. So, using the uh workflow ID that we've gone in and built, we're able to actually power these chat interfaces that uh

ChatKit

might require a ton of engineering otherwise to support things like reasoning models, as well as being able to support um, you know, complex agent architectures and the handoffs that you might want to show to users. Um, what this looks like in production is that you're able to match the entirety of your uh your brand guidelines to the actual chat interface that you're building, and we'll take a peek at how some of our real customers are using this today. Um, but really, I wanted to highlight the fact that you can, you know, entirely customize this: you know, the color scheme, the font families, as well as the starter prompts that your users might go in and use. Um, say, for example, we have a workflow that looks at our utility bills, um, where we might want it to go and connect to an MCP server, pull up your billing history, analyze those past bills, uh, and then be able to, uh, show a really rich widget to the user. The entirety of that process, and the customization of what the user sees, is entirely uh configurable through ChatKit. So here, on the question "How's my energy usage?", rather than just showing a traditional text response, we see a really rich graph that allows you to visualize the output.

— This is super cool. Yeah. And I think for our use case uh example, just to drive it home: one of the widgets that we have available, that maybe you'll show us shortly, is uh an email widget. So, if you wanted the agent to actually draft an email to OpenAI, which I think it's still researching information for because there's so much public information out there, um, then sales can just click that button uh to have that email sent to the customer. — Totally. Yeah. Let's take a look at a few of what those widgets could be. So, we've released a gallery where you can take a peek at some of the ones that we think are really cool. You can also click into these and see what the code is to actually build them. But what I think is really cool is being able to generate these through natural language. Like, for example, if I wanted to mock up an email uh component or widget that um contains some specific brand guidelines, or um formats that widget in a way that really appealed to my brand, um, I'm totally able to do that via natural language. Um, and so using this, you can then export that into Agent Builder and then show that UI when uh Agent Builder invokes that widget in ChatKit. — Amazing.

— Great. Um, before moving it to Henry, I wanted to show an example of what this looks like in real life. Um, we have a website here that renders a globe, a picture of the earth. And what we want to do is be able to control this globe that we have uh via natural language. So, where should we go today, Tasha? — Well, I think our next DevDay Exchange is in Bangalore. So I'm going to say India. — Let's go to India. So what we should see here is another Agent Builder-powered workflow. But we can see how not only did um a widget populate on the right side, we actually were able to control the JavaScript that was rendered on the actual website itself. So being able to have this customizability and portability into the websites and browsers that you use every day is something that um we find really fascinating with ChatKit. — That was the fastest trip to India I've ever taken. — Totally. Um, awesome. So, we covered um the build side as well as the deploy side with ChatKit.
Um, what really is the most important part, and, you know, the hardest part of a lot of building agents, is the evaluate part. — Yep. That's how we know that we can trust the agents uh in real-world scenarios, in production, at scale, with all of the glorious uh and weird edge cases that come up. So with that, I'd love to hand it over to our friend in the UK, uh, Henry, who can walk us through

Evals

an evals demo. Thank you so much, Tasha and Samarth. And hi everyone. I'm Henry. I'm one of the product managers who worked on AgentKit. Um, and so today I want to talk a little bit about how, once you've built that agent, once you've got that workflow, um, and you've defined it in the visual builder, you can test it. I want to talk first about how you can test an individual node and get confident that that specific agent, or that specific node, is going to perform as you want it to. Because ultimately your agent is only as good as its weakest link; like, you need every single component to be dialed in and performing how you want it to. Once you've got every one of those nodes in a place that you're comfortable with, you then want to be able to assess the end-to-end performance. And for that, you can look at traces, but traces are hard to interpret. And so now we have a trace grading experience, too, that allows you to take those traces and evaluate them at scale. So, let me pull up my screen and start talking you through a bit of a demo and show how we can uh how we can do this.

So, here you can see an agent that I built. This is based on a real example from one of our financial services customers. It takes an input of a company name, it assesses whether this is a public or a private company, and it completes a series of analyses on that company before ultimately writing a report for one of the professional investors of that company to review. So, as I mentioned, you have a whole bunch of agents here, and every single one of these agents needs to perform well and needs to perform as you want it to. And so how do you get confident in that performance? How do you get visibility and, um, and kind of transparency into how it's going to perform?

So, when you're defining this agent and you're looking into one of these nodes, you can see there's an evaluate button here in the bottom right. So we click that evaluate button, and that's going to take that agent node, which has a prompt, tools, and a model assigned, and it's going to open it in a data set. So here you can see this data sets UI, and this allows you to visually build a simple eval. And so I'm going to now attach um just a couple of rows of data into this eval. You can see a company name, and then you can see some ground-truth revenue and income figures as well. So I've imported that to this data set, and that's going to allow us to run this eval. So here you can see everything that was passed through from the visual builder. You've got the model, you've got the tool of web search, you've got the system prompt and the user message that we had assigned. And then you can additionally see this data that I uploaded. So this is just three rows: a couple of company names, and then some ground-truth values for the revenue and income figures that our web search tool should return for those um those companies.

So, what I can do now is run the generation. This is obviously the first stage of any eval, to run generation, and then once you've completed the generation, you complete the evaluation stage. So while that generation is running, I want to show how we can attach columns. And so here we can add new columns for, let's say, ratings, where we can attach a thumbs-up and thumbs-down rating, and then let's additionally add columns for free-text feedback. So this is where I can attach kind of a free-text annotation.
Uh, maybe I'm happy with something, maybe I want to attach some kind of longer-form feedback on that data as well. And so what you can see now is that this output is coming through. And if I click into this, I can tab through these generations that have been completed. So you can see here I asked it to complete some analysis of Amazon, of Apple, and then of Meta, which is still running. And I can scroll through that and I can see the generation that was completed. So what I can then do is attach these free-text labels, or attach these annotations, sorry, that I just created. So I can say this one's good, I can say maybe this one's bad. Good. And then I can attach feedback; I can maybe say this is too long, for example.

Now, once I've done those annotations, I can also add graders. So, let me add a grader here. And I'm going to just create a simple grader that's going to evaluate a financial analysis. And it's going to require that this financial analysis contains upside and downside arguments, that it considers competitors, and that it ends with a buy, sell, or hold rating. So, I'm going to save that and I'm going to run it. And this is now going to run through... uh, in fact, let me just change that. Okay, let's just leave that. So that's now going to run through and complete those kind of grader uh ratings. That's going to take a little while to run through, because we've got a lot of data in there. So I'm going to tab over to a data set that I created earlier, where you can see these graders have now completed. If I click into these, I can see the rationale. I can see why the grader has given the result that it has. So here you can see, for example, uh, this grader has failed because there's no explicit recommendation and there's no competitor comparison.

So, what could we do at this point? And maybe let's just recap where we are: we've got those generations that have been completed, we've got all those annotations, and we've got all these grader outputs. What do you do at this point? How do you make your agent better? So one thing you can do is just do some manual prompt engineering: try to find patterns in that data and then try to rewrite your prompt. That obviously takes a long time and requires you to find those patterns and to spend a bunch of time, you know, trying to solve them. What we see as a better solution is automated prompt optimization. So you can see here there's this new optimize button. So if I click that, it's going to open a new prompt tab in this data set, and that's where we're going to automate the rewriting of the prompt. And this is how you save yourself having to do that manual prompt engineering every time. So this is where we're taking those annotations, we're taking those grader outputs, and we're taking the uh the prompt itself, and we're using all of that to suggest a new prompt. And again, this will take a minute or two to run through, so I'm going to tab here to one that I made earlier. And you can see here the rewritten prompt, which completes a fundamental financial analysis but is much more thorough and complete than the initial, kind of pretty scrappy and rough, prompt that I had written.

So that's an overview of how you can take that single node from the agent builder and robustly evaluate that single agent. But we're not building a single agent here. This is a multi-agent system, and we want to test every one of the nodes individually. But ultimately what we care about is the end-to-end performance.
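Conceptually, the grader defined here is just an LLM call with a rubric. A minimal stand-alone sketch using the OpenAI Python SDK; the model choice and JSON shape are assumptions, not the platform's internal grader format:

```python
import json

from openai import OpenAI

client = OpenAI()

# Rubric mirroring the demo's grader: upside and downside arguments,
# competitor analysis, and an explicit buy/sell/hold rating.
RUBRIC = (
    "Grade the financial analysis. Pass only if it (1) contains both upside "
    "and downside arguments, (2) considers competitors, and (3) ends with an "
    "explicit buy, sell, or hold rating. "
    'Reply as JSON: {"passed": true, "rationale": "..."}'
)


def grade(analysis: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-5",  # model choice assumed; any strong grading model works
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": analysis},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


print(grade("Apple: services growth is an upside, but ... Rating: hold."))
```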
So, how do we get confident in that? How do we test that? As Samarth mentioned, these agents emit traces, and here you can see some example traces from when I've previously run this agent. So, clicking through this, I can see every span. I can click into every span, and I can start to identify, uh, you know, what happened when this agent ran. Now, as I'm clicking through this, I might start to notice problems. For example, here you can see there's a bunch of sources that have been pulled by the web search tool, for example CNBC and Barron's. Maybe we don't want these third-party sources to be cited. Maybe we want only first-party, authoritative sources. So, we should say: web search sources should be first-party only. Let's just run that with GPT-5 Nano so it's nice and fast. And then as I click through more of these, I might find additional problems. Let's say we identify another pattern: that the end result doesn't contain a buy, sell, or hold rating. So we say: the end result needs to contain a clear buy, sell, or hold rating. And again, I'm building up these requirements that I can then run over specific traces.

And now this set of requirements you can think of as like a grader rubric. And this grader rubric is built up from a series of criteria that define a good agent. And then once I've got that set of criteria built up and I've tested it on a couple of traces, I can then click this grade-all button at the top here. And this is going to export the set of traces that I've scoped this to, so in this example, just these five traces, and it's going to take the set of graders that I've defined on the right, and it's going to open that in a new eval. And this allows you to assess a very large number of traces at scale, because clicking through every one of these traces and trying to find problems doesn't work that well. It takes a lot of time. It doesn't scale well. But instead, you can run these trace graders over a very large number of traces. And that will help you identify just the spans that are problematic and just the traces that you want to dive into.
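Scaled up, trace grading is the same idea in a loop: each requirement becomes a grader, and every exported trace is checked against all of them. A minimal sketch (the trace export format and model choice are assumptions; the two requirements are the ones from the demo):

```python
from openai import OpenAI

client = OpenAI()

# The requirements built up while clicking through traces become graders.
REQUIREMENTS = [
    "Web search sources should be first-party only.",
    "The end result must contain a clear buy, sell, or hold rating.",
]


def grade_trace(trace_text: str) -> list[str]:
    """Return the requirements this trace fails, per the grading model."""
    failures = []
    for requirement in REQUIREMENTS:
        response = client.chat.completions.create(
            model="gpt-5-nano",  # a small model keeps batch grading fast
            messages=[
                {
                    "role": "system",
                    "content": f"Requirement: {requirement}\n"
                    "Does the trace satisfy it? Answer PASS or FAIL only.",
                },
                {"role": "user", "content": trace_text},
            ],
        )
        if "FAIL" in response.choices[0].message.content.upper():
            failures.append(requirement)
    return failures


# Traces would be exported from the platform; placeholders here.
for trace in ["<exported trace 1>", "<exported trace 2>"]:
    print(grade_trace(trace))
```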
So that was an overview of how we have this kind of embedded eval experience that is tightly integrated with the agent builder. Um, I also just wanted to flash a couple of best practices that we've seen from working with a large number of customers now uh on this platform, and a couple of lessons that we've learned.

Um, first, starting simple. You know, don't overcomplicate things, but do start early. Have a handful of inputs and a simple grader that you define right at the start of the project, instead of leaving evals right to the last minute as, like, "I'm just about to ship this thing, I'd better do some testing", which I know some people do. It's much better to start early, embed evals, and do kind of eval-driven development, where you're rigorously testing your prototypes, finding problems in the prototypes, and then quantitatively measuring your improvement as you hill-climb against your eval. Much better way to build a product, and likely to result in higher performance.

Secondly, using human data. It's really hard just coming up with hypothetical inputs or using LLMs to generate synthetic inputs. You'll probably get much better performance if you get real user data, real inputs from real end users, because that captures a lot of the messiness of the real world.

And then finally, make sure you invest a bunch of time annotating generations and aligning your LLM graders, because this is how you make sure that your subject-matter expertise is really encoded into the system, so that your graders are actually representing what you want your product to do.

So that was a high-level overview um of our product. This is all GA, so we'd love for you to give it a spin, and please let us know uh any feedback at all. And with that, I'll pass back over to Tasha and uh Sam. — Awesome. Thanks, Henry. I feel like we could do a whole hour session on evals.

Real World Examples

That was awesome. Um, one quick question for you actually before uh you step out: um, how large of an eval data set do you recommend? We got this from chat. Is it uh 100, a thousand, 10? How do you know what the right uh data set size is to get the results you want? — Yeah. So, the best thing to do is to get started early. And so even, like, 10 to 20 examples goes a long way. And having um having that set of data in there to just test your application against is really helpful. So even just, you know, 10, 20, a couple of dozen uh rows is helpful. And then as you get closer to production, clearly, more is better. But really, you know, I wouldn't think of it as a question of just how many rows, because there's kind of a quality-times-quantity uh multiplier that you have to um have to consider here. Having, you know, 50 rows of really high-quality inputs that are very representative of a large set of user problems, and then having graders that are really aligned with the behavior you want to see, that can perform phenomenally. Whereas if you use an LLM to generate a thousand rows of synthetic inputs, it's not going to be that helpful. So I'd say the quality is almost more important than just the quantity. — Yeah, that makes a lot of sense. — Yeah. And just to add on top of that: like, one of the questions that we get a ton is how do we create a diverse data set to run evals from, especially if you haven't put a lot of this tooling into production already. Um, when we're building our go-to-market assistant, our engineering team that actually supports those workflows sits right next to our go-to-market team to understand what subject-matter experts are actually asking or curious about. This allows us to build a good, diverse set of questions that we continue to optimize on every iteration. Um, we're capturing the nuances and the real queries that people are actually interacting with. — Super cool. Um, awesome. Well, thanks, Henry.

So, with that, I'd love to cover a couple real-world examples, and then we'll leave some time for Q&A at the end. So, um, our first one here is a short video of a procurement agent that Ramp built. Um, so they used ChatKit to actually visualize uh this UI to the person requesting software. They used Agent Builder on the back end to actually orchestrate the agent flow. Um, and they used evals to make sure that it would work uh at scale in production. So while this isn't live on their platform yet, um, we hope that it will be in the near future. And that was a quick run-through of um what they actually built and the prototype.

Um, awesome. So, uh, Ramp, with the AgentKit stack, uh, was able to build this prototype 70% faster, which I think is pretty amazing, uh, equivalent to, like, two engineering sprints instead of two quarters. Um, Rippling, I actually think you worked on this project a little bit. Do you want to maybe share what they built and how it went? — Yeah, totally. We were initially thinking about, like, how we could spec this out through the Agents SDK, and um one of the hard challenges was, like, getting that alignment between subject-matter experts as well as, you know, the ability to build workflows that were logically sound. And so we really sat with them to understand what their real go-to-market use cases were, and worked backwards from there. Um, chatting with their team, I think it was a pleasure for them to use a tool like Agent Builder, and we got a ton of really uh good feedback on next versions that we're looking to roll out.
— That's awesome. Um, similarly, HubSpot, who has been doing a lot of amazing uh work in the AI space: they used uh ChatKit to enhance their uh Breeze AI assistant. If you want to actually advance... um, awesome, thanks, all good. Uh, so yeah, they saved weeks of front-end time. Like we mentioned at the start, building agents from start to finish is super time-consuming because of each of the complex steps involved. So if we can even help with just one of those um numerous steps, the UI uh aspect in this case, that's um that's a useful lift. Uh, and then finally, Carlyle and Bain, which were two uh amazing evals customers of ours. So um they were able to see a 25% efficiency gain um in their eval data set, which is fantastic.

Um, cool. Okay, so maybe to round it out before we go over to Q&A: um, when we launched AgentKit, these are some of our early um customers who built on the product. And you'll see that AgentKit's currently powering tech stacks at startups, Fortune 500s, everything in between. Um, these are the different types of agents. There's a bunch of, uh, a breadth of use cases here, from uh work assistants to a procurement agent, policy agents. Um, Albertsons, the large grocery retailer, has a merchandising intelligence agent. Um, Bain, code modernization. So, really cool to see just the wide range of use cases here. Awesome. With that, we can go to Q&A from the chat. Uh, maybe do you want to go to the next slide? — Cool.

Okay. So, how can I add a for-loop block? Samarth, you want to take that one? — Yeah, good question. So we don't have a for loop, but uh we do have a while loop that's available within Agent Builder: you're able to um conditionally, continuously run different agent workflows depending on whether a completion criterion has been met. Um, obviously, with the Agents SDK you can take it out into a codebase and then orchestrate that on your own; maybe use, like, our interpretation of that as, like, a v0. Uh, so instead of a for loop, we do support while loops, such that you're able to actually iterate um throughout the workflow until that uh end criterion has been met. — Hopefully that helps.
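Taken out into your own codebase, the while-loop pattern is plain control flow around the Agents SDK runner. A rough sketch, assuming the `openai-agents` Python package; the reviser agent and its DONE completion signal are made up for illustration:

```python
from agents import Agent, Runner

# Hypothetical reviser agent; replying "DONE" as the completion signal is
# an illustration, not an Agent Builder convention.
reviser = Agent(
    name="Draft reviser",
    instructions=(
        "Improve the outbound email draft you are given. If no further "
        "changes are needed, reply with exactly DONE."
    ),
)

draft = "Hi there, quick note about our new promotion..."
attempts = 0
# While-loop orchestration: iterate until the completion criterion is met,
# with a cap so the loop always terminates.
while attempts < 5:
    result = Runner.run_sync(reviser, f"Current draft:\n{draft}")
    if result.final_output.strip() == "DONE":
        break
    draft = result.final_output
    attempts += 1

print(draft)
```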
Um, what else have we got? How does AgentKit compare to the Agents SDK? — Um, I would say that AgentKit so far is... uh, well, I'll back up a bit. AgentKit is a suite of products where we've tried to be opinionated about the most useful tools that we at OpenAI find uh from our day-to-day as we build agents. Um, the Agents SDK powers the entirety of AgentKit, and most of everything that you can do within AgentKit you're also able to do within the Agents SDK, or it's available via an API. Um, so far, uh, we're continuing to roll out a ton of these changes to bring that parity a little bit closer. Um, but we imagine in the future that AgentKit will also contain um, you know, some features that extend the ability to host these workflows um in the cloud. And so rather than using, like, traditional ChatKit implementations, uh, you could also trigger these workflows via an API as well. Um, this essentially allows you to host the Agents SDK in the cloud. Yeah. — Very cool. Yeah. And I would say um Agent Builder is, like, the equivalent of the Agents SDK functionally, but it's the canvas, visual-based way to actually orchestrate those agents, whereas the Agents SDK is, like, the jump straight into the code version of it. Um, so yeah, very cool.

Uh, how do you decide between out-of-the-box MCP servers versus building your own? — Yeah, totally. So we have a few MCP servers. We support uh remote MCP servers, which means that the MCP servers have to be hosted in the cloud, or um hosted on the publicly available internet to some degree. Uh, when we're building our own MCP servers, a lot of the considerations that we have around authentication require us to build our own. That said, a lot of the providers that you use every day, think Gmail, your calendar, etc., those all likely have out-of-the-box connections where you're able to just paste in an API key and get started with all the tools that we support. Some of these, I think, um, you know, we don't have full capabilities to do things like write. So, for example, if you want to write an email via the Gmail API, I don't believe that is currently supported; you might want to spin up your own MCP server there. Um, the thing I really like about MCP is that it allows for that authentication and black-boxes what that flow actually looks like. So whether you want to bring your own personal access token, or go through something like OAuth and then pass in that last token that you get to uh the MCP server, both are totally great options for authenticating to secured sources. — Cool. Do we have any more questions? — Yes. — When do you recommend a classifier agent with branching logic to different agents? — Yeah, I think this is a great question. It's one that we get a ton, because um, as you add more tooling and instructions to a model, what we've seen is that the performance generally deteriorates. Um, imagine a world where you had 100 tools, right? Allowing the model to select which one of those 100 tools to use becomes increasingly difficult. Um, more realistically, you might not have 100 tools, but you might have 20. And each agent, or each use case for an agent, might use those tools in entirely different ways. So one way that I like to think about agents is to stratify the logic: what is a core competency for this agent? What is the set of tools that I want this agent to use, and only in that specific type of way? The moment I start confusing the model in how to invoke these tools, how to interpret the instruction within the context of those tools, I like to branch off to a different agent. So in the case that we had, um, where, uh, you know, we were looking at three different uh GTM use cases, maybe the email agent that we're building, you know, that outputs a widget, is not the best one to also do lead qualification. So, um, for those use cases where you're maybe using even the same tools, but uh you want to structure the outputs a little bit differently, you want the model to interpret the outputs a little differently, um, it's good to branch out to different agents.

— Cool. Alrighty. Uh, can we use AgentKit for a multimodal use case, especially for analyzing images and files? — Totally. So, um, this is a great use case for AgentKit. We do support file inputs within that preview section that we covered. You're able to even play around in the playground with uploading files. Um, what I find really interesting is that, like, ChatKit propagates that behavior to Agent Builder as well, where if you upload files within ChatKit, that is also passed into hosted Agent Builder back ends. — Oh, super cool. Yeah.

— Okay. So, we are at the end here. We would love to leave you with a few resources if you're interested in exploring more. Um, the AgentKit docs, a super helpful place to get started.
Um, we also released a cookbook the other week um that walks you through a very similar use case to the one that we showed today, um, in a bit more detail even. Uh, ChatKit Studio, if you want to play around with ChatKit and see how you can customize it. Um, and then finally, uh, to learn more about upcoming build hours and past build hours, the Build Hours repo on GitHub. Awesome. Uh, and with that, I think we're at a close. If you want to, um... Right. Okay. Upcoming build hours: we have two. Uh, agent RFT, so building on what we talked about today: how do you actually customize models for tool calling, with custom graders and things like that? Um, that will be November 5th. So, really excited to build on today's session um with that next session. And then on December 3rd, agent memory patterns. Um, so hope to see you at both of those. You can uh get more information about registering at this link. — Awesome. Well, that's it. Thank you so much for putting this awesome demo together. It was super fun. Um, yeah. Thank you all for watching, and I hope you have fun building agents.
