# OpenAI DevDay 2024 | Structured outputs for reliable applications

## Metadata

- **Channel:** OpenAI
- **YouTube:** https://www.youtube.com/watch?v=kE4BkATIl9c
- **Date:** 17.12.2024
- **Duration:** 40:39
- **Views:** 96,070

## Description

Learn how to increase reliability with precise JSON schema adherence

## Contents

### [0:00](https://www.youtube.com/watch?v=kE4BkATIl9c) Segment 1 (00:00 - 05:00)

Hi there. Hello, everyone. We're here to talk about structured outputs, an exciting new feature we shipped in the OpenAI API in August this year. It's been a huge unlock for developers working with LLMs, and in just a few short weeks, hundreds of thousands of developers have integrated it into their applications. My name is Atty and I lead API design at OpenAI. And I'm Michelle Pokrass, and I'm the tech lead here for the API. Today we're going to talk about three things. First, we're going to tell you why you need structured outputs. Then we'll talk about how the feature works. And finally, we'll tell you how we built it under the hood. So let's get right into it.

Let's start at the beginning. The year is 2020, and OpenAI has just launched GPT-3. GPT-3 can write emails, draft blog posts, and generate movie scripts. It was great for text generation. Quickly, developers (all of you) started to find some exciting new applications for this technology, from generating in-game scripts like AI Dungeon to drafting marketing materials like Copy.ai. Fast forward three years: in 2023 we launched GPT-4, a new breakthrough in LLM intelligence. For the first time, models were capable of advanced reasoning, following complex instructions, extracting information from long documents, and taking action on behalf of users. Developers took it to the next level, building AI-powered productivity tools like Cursor, customer service agents like Klarna, and language learning apps like Duolingo. All of these products had one thing in common: they were connecting LLMs to the outside world, from your codebase to external APIs or on-device actions. To do this, they needed outputs to be structured, typically as JSON.

Here's an example. Let's say you're building a personal assistant and you want to convert the user's message into an API call. The user might say, "Play The Beatles." Here's what you expect: a JSON object that says "play music" for the API and "The Beatles" for the artist. But too often, here's what you get: the model begins with a preamble saying, "Sure, here's the JSON to call the play music API." This isn't particularly helpful for developers, since we need just the JSON, not the text surrounding it. This is a problem: LLM outputs are not always reliable, which makes them hard to integrate into your apps.

As developers, we've tried all kinds of tricks to solve this. Some of you may have written prompts like this, or even this. Sometimes it works, and sometimes you have to go all the way and write a custom Markdown parser. That's obviously not how it should work. So to solve this problem, in June last year we launched function calling, a native way to define tools the model can call using JSON Schema. You define the schema of the function, and the model outputs JSON adhering to it. Unfortunately, function calling wasn't perfect. Sometimes the model would output invalid JSON, such as by adding a trailing comma; we've all done that before. So in November at DevDay last year, we launched JSON mode, which ensures valid JSON outputs so you never see another parsing error. But this still wasn't enough: the model could output the wrong type, like a string instead of a float, as you see here, or hallucinate a parameter entirely. This cat-and-mouse game to get reliable outputs can be really frustrating. Fundamentally, AI applications need reliable outputs. So to solve these problems once and for all, we introduced structured outputs to the API in August. Here's Michelle to tell you more.

So structured outputs is a new feature designed to ensure that model-generated outputs will exactly match JSON schemas supplied by you, the developers. You can think of structured outputs as the difference between suggesting that the model use your schema and constraining it to the schema. You don't have to ask nicely anymore; you can just supply the JSON schema that you need. And you might be wondering, if this is the right solution that finally solves the problem, why did it take us so long to get here? The answer is twofold.
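The desired output for "Play The Beatles" might look like the following sketch. The exact field names are not shown in the talk, so the keys here are illustrative:

```python
import json

# Hypothetical structured output for the request "Play The Beatles".
# The field names are illustrative; the talk only describes them verbally.
expected = {"api": "play_music", "artist": "The Beatles"}

# What we want from the model: just the JSON, with no surrounding preamble text.
print(json.dumps(expected))
```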

### [5:00](https://www.youtube.com/watch?v=kE4BkATIl9c&t=300s) Segment 2 (05:00 - 10:00)

First, we still believe that function calling is often the right abstraction for this functionality. And second, it actually took us a little while to build a solution that is performant for inference at scale while constraining your outputs. We're going to talk more later about the engineering and research we undertook to make this work, but for now, let's talk about how you can use it.

Structured outputs is available in two modes in the API. The first is function calling, as Atty just showed you, and you're probably already familiar with it. This feature allows our models to generate parameters for your tool calls, and it allows developers to connect LLMs with functionality in their applications. The second mode is the response format parameter, which is useful when the model is responding directly to a user rather than emitting a function call.

Let's start with function calling. If you've used it before, you're probably pretty familiar with the format of the request you see here. We're supplying a JSON schema in the tools section, with parameters for the function; this tells the model how to call our tool. In our case, the function is called get_weather and it has two parameters, location and unit. Location is a string type, and so is unit, but unit is limited to just a few values with the enum keyword. When you supply a function like this in our API, our systems show the model this spec, and it uses that information to generate tool calls when appropriate. Enabling structured outputs here is really easy: with just one line of code, I can set strict to true. This will use the supplied schema and always make sure the model's response follows it. So that's structured outputs in function calling.

Now let's look at a quick example in the playground. Before we get into it, I'll tell you a little bit about a startup I'm building: an AI glasses product. They're pretty hot right now, they're really futuristic, and they're built on the OpenAI API. These glasses have a speaker in the stem to read answers out from the assistant, and there's a little AR screen in the lenses. I want to make an internal admin dashboard to help my team answer questions about the orders we've received and their shipping status. I already have a database with order information, and I want to connect this assistant to it. So let's get started in the playground.

I've already created a query function to tell the assistant how to query my SQL database. Let's take a quick look. Here's my function. It's called query, and it has the properties you'd expect. The first is the table name, and we only have one table, orders; that's all we support so far. Then we have all of the columns of my table, straight from my database. And the meat of this is the conditions we support for querying. You'll notice we have some operators: my database supports equals, greater than, less than, and not equals. And then we have an order-by parameter. Just the stuff you'd expect for a database.

When I use this assistant, I'll set the system message to tell it today's date so it can be useful. Then a user comes in asking to find all of the orders shipped September 1st or later. Let's give this a quick go. All right, we've got a function call back, and it looks pretty good. The logic is right: we're checking that shipped-at is greater than or equal to September 1st. But the discerning developer will notice that this will likely cause your application to blow up, because we're using the greater-than-or-equal-to operator, and our enum only supports those four operators, so greater-than-or-equal-to will just not work. With structured outputs, we can set strict to true, and this will constrain the model to only use what you've provided. Let's save that and retry. Awesome. This time we've actually used the greater-than operator, and the model has done the logic to determine that we need to check that we're greater than the last day of August, not greater than or equal to the first day of September. Through this you can see how structured outputs eliminates a whole class of errors for your application.

Now let's get back to response formats. In the past, when you wanted the model to respond using JSON, you would use JSON mode. So, back to my AI glasses startup.
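The query tool and the strict switch described above can be sketched as a tools payload like this. The column names and exact structure are assumptions; the talk only names the orders table, the four operators, and the one-line strict flag:

```python
# Sketch of a function-calling tools payload with structured outputs enabled.
# Column names and nesting are illustrative, not the demo's actual code.
tools = [{
    "type": "function",
    "function": {
        "name": "query",
        "strict": True,  # the one line that turns on structured outputs
        "parameters": {
            "type": "object",
            "properties": {
                "table_name": {"type": "string", "enum": ["orders"]},
                "columns": {
                    "type": "array",
                    "items": {"type": "string",
                              "enum": ["id", "status", "shipped_at"]},
                },
                "conditions": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "column": {"type": "string"},
                            # Only these four operators exist; ">=" is NOT one
                            # of them, which is exactly the bug strict mode
                            # prevents the model from producing.
                            "operator": {"type": "string",
                                         "enum": ["=", ">", "<", "!="]},
                            "value": {"type": "string"},
                        },
                        "required": ["column", "operator", "value"],
                        "additionalProperties": False,
                    },
                },
            },
            "required": ["table_name", "columns", "conditions"],
            "additionalProperties": False,
        },
    },
}]
```

With strict set to true, the model can only emit operators from the enum, so the greater-than-or-equal-to failure mode from the demo cannot occur.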

### [10:00](https://www.youtube.com/watch?v=kE4BkATIl9c&t=600s) Segment 3 (10:00 - 15:00)

Since they've got a speaker in the stem, I want them to read some of the assistant's response out loud, and I want to show a summarized version in the lenses. Before structured outputs, I would just put these instructions into the system message and use JSON mode. This is pretty nice: I would always get back JSON. But sometimes, if the user asked for something specific, I would get back an extra key, or the model would use the wrong type. With structured outputs, I can move these instructions into the response format parameter, like you see here. I'm using the descriptions to explain to the model how to respond with the voiceover and display parameters. This way the model will always follow the format and use these two keys.

Let's give this a quick go as well. I have another tab here where I've already created this schema, so let's take a quick look. Voiceover is our first string property, and we're telling the assistant that this will be read aloud via TTS and to write out numbers and acronyms fully. Then we have the display property; we don't have a ton of room on the glasses, so we're going to keep this tight, just five words. I'm going to put this into the playground. You can see on the right here we have the response format option. I'll paste in my schema, and you'll see that strict is set to true, so structured outputs is on. Now I'm going to test out the glasses with a typical query. I might ask something like, "How tall is a giraffe?" When I run this, you see that I get the output in the form I've asked for. We have the voiceover parameter, which has spelled out the words to make it easier for our TTS models to read them, and we have the display field, which is just four words, which should fit right on the glasses. So there you have structured outputs with response formats. This was just a really quick demo, but Atty is going to get into something a little more interesting.

Okay, so that showed us the feature, but let's get into some more interesting demos and how you might use structured outputs in your applications. Let's imagine we're building a fictitious company called Convex, an AI-powered recruiting tool. It lets recruiters create job postings, submit referrals, and schedule interviews. Behind the scenes, Convex is a Node and React app that uses response formats to extract information from resumes and uses function calling to perform queries over the candidate data. Let's see it in action.

I've gone ahead and created a job posting here for an ML engineer role that we're hiring for at OpenAI, and you can see the job description, the hiring manager, and some of the candidates who have already applied. Before I came on stage, a promising candidate reached out to me and shared his resume. Let's take a look. Okay, Greg seems to have worked at Stripe and has a bit of experience working in AI. He has a whole bunch of skills, including coding and Python. Seems like a promising candidate, so let's put in a referral. I'm going to click the add candidate button here and select Greg's resume, and you can see that we're outputting the fields from the resume in real time. The model is using structured outputs to extract this information from the text within the PDF. Let's take a look at the model's response behind the scenes. We see a JSON object with name, title, location, contact info, and work experience; all of this has been extracted from the resume.

Now let's look at the code to see how this works. Behind the scenes, we're using a response format called resume. It has the fields we just saw, like name, title, location, and so on. In particular, for work experience, you'll notice that I'm using an array. Structured outputs supports a wide subset of JSON Schema, including arrays, and that allows the model to output more than one work experience. Some of the JavaScript developers in the room might also notice that I'm using the library Zod to define my schema. Zod is great: it lets me define my schemas in code and get runtime type safety. The OpenAI Node SDK has native support for Zod. You can define your schemas in it, and when the model starts responding, we parse the response back into the Zod object as well, and it supports streaming. Very similarly, the OpenAI Python SDK supports Pydantic natively. Okay, let's add one more field to the format here: the GitHub username. So I'm going to go save that.
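The glasses response format described above can be sketched as a raw response_format payload like this. The schema name is an assumption, and the descriptions paraphrase what the talk describes:

```python
# Sketch of a response_format using a strict JSON schema. The field
# descriptions carry the instructions that previously lived in the
# system message under JSON mode.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "glasses_response",  # name is an assumption
        "strict": True,              # structured outputs on
        "schema": {
            "type": "object",
            "properties": {
                "voiceover": {
                    "type": "string",
                    "description": "Read aloud via TTS; write out numbers "
                                   "and acronyms fully.",
                },
                "display": {
                    "type": "string",
                    "description": "Shown on the AR lenses; keep it to "
                                   "about five words.",
                },
            },
            "required": ["voiceover", "display"],
            "additionalProperties": False,
        },
    },
}
```

With strict true, every response contains exactly these two keys, so extra keys and wrong types are ruled out rather than merely discouraged.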

### [15:00](https://www.youtube.com/watch?v=kE4BkATIl9c&t=900s) Segment 4 (15:00 - 20:00)

Then I'll refresh my page, and let's add a referral one more time. Great, the model is responding, and you can see that the GitHub username was also extracted. Okay, so that was the first feature I wanted to demo: extracting information from unstructured data using response formats. Next, let's see how we can use structured outputs in function calling.

I'm going to click View All here to open my candidate analysis screen, where I have a helpful AI assistant I can ask questions to analyze my data. We'll see that candidates have applied from all over the country: San Francisco, New York, and Austin. But for this role we're hiring in the SF office, so let's filter the candidates down. I'm going to say, "Filter the candidates to those based in San Francisco." Let's press enter, and we see that the model called the find candidates function with a criteria field: the field is location and the value is San Francisco. This hit our backend API and returned a list of candidates who are based in SF. We also notice that the UI on the left has been updated to reflect these candidates. In this way, you can use function calling to control the UI of your application. Let's look at this behind the scenes. In our code, I've defined a tool, or a function, called find candidates. It has a schema that includes a list of criteria, and each criterion has a field, which can be title, location, and so on, and a value, which is the value to filter on. So that was a quick 101 example of using structured outputs with function calling.

Let's try something harder. Let's say, "Graph these candidates by years of experience." You'll notice that the model is generating UI: a card with a header, a bar graph that shows all the candidates and their years of experience, and a table with the rows as well. This is not a pre-built UI; this is actually the model composing a set of React components dynamically. We can look at the schema here as well. The top-level property is component, and it says card, and card has a list of children that includes a header, a bar chart, and so on. Here's how the schema is defined behind the scenes. We have a tool called generate UI, and it has one property called component, which is an anyOf of card, header, bar chart, and so on, and each of these schemas is defined later on in the $defs section. This is a really interesting feature of structured outputs: you can use $defs to define your schemas in one place and use them multiple times, and you can use recursive schema definitions as well. You'll notice that the card schema here has a children property, which then references component again. In this way, a component can have a list of children that are themselves components, and structured outputs handles this with no problem at all.

Great. Now that we have candidates sorted by years of experience, let's go ahead and schedule some interviews. I'm going to say, "Schedule interviews for the top three candidates by years of experience, with Michelle and Olivia." What this is going to do is first check the availability of Michelle and Olivia on their calendars. Once it picks out a few good slots, it's going to schedule those interviews, and finally it's going to email the candidates that their interviews have been booked. Let's press enter. Awesome. We see that the model is calling the fetch availability API and got some data back. It's now calling the schedule interviews API, and that succeeded. And finally it's calling send emails, with custom emails for each candidate, and that succeeded as well. So that's an example of a multi-step workflow using function calling, where each step benefits from structured outputs. Before this feature, if any one of these steps failed, the whole workflow would fail; in production, if each of these had, say, a 1% error rate, the workflow would have approximately a 3% error rate (1 - 0.99^3 is about 2.97%). You can see how important structured outputs is for the reliability of agentic workflows.

Okay, so that's a little demo of how you can use structured outputs in real-world applications. You can use response formats to extract information from unstructured data, you can use function calling to generate UI, and you can build agentic workflows with 100% reliability. To tell you how all of this works under the hood, here's Michelle. Thanks, Atty. Super cool, and can't wait to interview Greg
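The recursive generate UI schema with $defs and $ref, as described above, can be sketched like this. The component names and fields are illustrative, not the demo's actual code:

```python
# Sketch of a recursive component schema. "component" is an anyOf over the
# concrete components, and "card" refers back to "component" through its
# children, allowing arbitrarily deep nesting.
ui_schema = {
    "type": "object",
    "properties": {"component": {"$ref": "#/$defs/component"}},
    "required": ["component"],
    "additionalProperties": False,
    "$defs": {
        "component": {
            "anyOf": [
                {"$ref": "#/$defs/card"},
                {"$ref": "#/$defs/header"},
                {"$ref": "#/$defs/bar_chart"},
            ]
        },
        "card": {
            "type": "object",
            "properties": {
                "type": {"type": "string", "enum": ["card"]},
                # children reference component again, so cards can nest
                "children": {"type": "array",
                             "items": {"$ref": "#/$defs/component"}},
            },
            "required": ["type", "children"],
            "additionalProperties": False,
        },
        "header": {
            "type": "object",
            "properties": {"type": {"type": "string", "enum": ["header"]},
                           "text": {"type": "string"}},
            "required": ["type", "text"],
            "additionalProperties": False,
        },
        "bar_chart": {
            "type": "object",
            "properties": {"type": {"type": "string", "enum": ["bar_chart"]},
                           "values": {"type": "array",
                                      "items": {"type": "number"}}},
            "required": ["type", "values"],
            "additionalProperties": False,
        },
    },
}
```

The recursion lives entirely in the two $ref lines: card's children point back at component, and component's anyOf includes card.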

### [20:00](https://www.youtube.com/watch?v=kE4BkATIl9c&t=1200s) Segment 5 (20:00 - 25:00)

later. Let's get into how structured outputs works under the hood. We're going to talk about three things: the engineering implementation, the research we undertook to make our models better at format following, and finally some of the interesting API design decisions we made to ship this feature. To ship this, we decided to take an approach that combined both research and engineering to create a holistic solution. Doing just one of these would make a pretty good product, but together they are greater than the sum of their parts. There's actually more to structured outputs than just prompting our models differently, and it creates some interesting trade-offs for developers, so we thought it'd be useful to take you under the hood of those choices.

On the engineering side, we decided to take an approach known as constrained decoding to ensure that our models would always follow schemas supplied by developers. There are three components of constrained decoding that we're about to talk about: first, LLM inference and how it works; then token masking; and finally the subset of JSON Schema that we support, and the grammars behind it.

Let's start with LLM inference. LLMs operate on tokens, which form the vocabulary of a large language model: all of the labels that the model can output or produce. Let's get into a simple example of a model with a small vocabulary. This is the classic example of an AI model: a digit recognition model. I want to train a model to recognize this handwritten three and determine what digit it is. The vocabulary of this model is just going to be ten labels, the digits 0 through 9, and this is what a model like that would look like. First, we convert our input into some sort of computer representation; in this case, we take every pixel from my handwritten three and feed it into the model. Then we have the inner layers of the model, and finally, at the end, the model produces ten values: the predictions for our labels from 0 to 9. You can see here that the digit three has a 97% probability, so this digit is probably a three.

This is also how large language models work for inference, in pretty broad strokes. Rather than predicting the digits 0 through 9, large language models predict language. To do this, they use labels that represent language, and one of the best ways we've found for doing this is to use tokens, which are like words or word fragments. Here we have some English text broken up into its tokens; this is actually the text from our blog post. You can see that the token boundaries vary: sometimes a token is an entire word, like "the", and sometimes it's a word fragment, like "struct". These tokens make up the vocabulary of large language models, and at any point the model can predict any of them; they're all valid. This is unconstrained decoding, where there are no constraints on what can be produced. What you see on screen are some of the real tokens in the GPT-4 tokenizer: all kinds of parts of speech, from individual characters like the exclamation point to full words like "conveyor" at the bottom here.

Now let's say we're using structured outputs to generate an output with this schema. The schema has a single property, value, which is a number type, and we're partway through our generation already: we've produced a curly brace, some whitespace, and then "value". By default, when we sample, any token from the vocabulary can be chosen. That could be a string or a boolean, but if we did produce a boolean, it would not match the schema, while a number like 42 would be valid. This tells us that we need to limit which tokens can be produced next; we can't just use the entire vocabulary of the model. To do this, we use a technique known as token masking, which constrains the tokens that can be picked at the very end of sampling.

Back to our digit example: here's an example of token masking. Let's say I have some side information about these numbers: I know that for some reason they're all prime, so I wouldn't want to decode 0, 1, 4, 6, or any of the non-prime numbers. Here's what I would do: I would mask out any of those predictions. After we generate the probabilities with a forward pass, we remove any labels that are not allowed, so that we will only sample a valid token. This is known as masking, and we're effectively ignoring any values that we don't want
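The prime-digit masking example above can be sketched in a few lines. This is a toy model: the probabilities are hard-coded, and we take the argmax instead of sampling from the distribution:

```python
# A minimal sketch of token masking at sampling time: compute probabilities,
# zero out disallowed labels, renormalize, then pick the most likely label.
def masked_sample(probs: dict[str, float], allowed: set[str]) -> str:
    # Zero out everything outside the allowed set (the "mask").
    masked = {tok: (p if tok in allowed else 0.0) for tok, p in probs.items()}
    total = sum(masked.values())
    if total == 0:
        raise ValueError("mask removed every candidate token")
    normalized = {tok: p / total for tok, p in masked.items()}
    # Greedy choice for simplicity; real systems sample from the distribution.
    return max(normalized, key=normalized.get)

# Digit-recognition example from the talk: keep only the primes.
probs = {"0": 0.01, "1": 0.01, "2": 0.02, "3": 0.90, "4": 0.02,
         "5": 0.01, "6": 0.01, "7": 0.01, "8": 0.005, "9": 0.005}
primes = {"2", "3", "5", "7"}
print(masked_sample(probs, primes))  # "3"
```

Masking happens after the forward pass, so the model's probabilities are unchanged; disallowed labels simply can never be chosen.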

### [25:00](https://www.youtube.com/watch?v=kE4BkATIl9c&t=1500s) Segment 6 (25:00 - 30:00)

to be produced. Large language models are autoregressive, which means they produce output one token at a time. Each time we sample a token, that output is fed back in as input for the next inference step. This means that when we're sampling JSON for your schemas, we need to update the token masks at every step of inference; we can't just do it once at the beginning of the request. If that's not making sense, look at this example. You can see that first we allow the curly brace token, and then, once we've sampled it, in step two you no longer see the curly brace in the mask. From step to step, we need to update these masks.

Because this is happening at every inference step, we need this operation to be super fast. At the scale that some of your apps run, we need to keep inference as fast as possible. You probably already know this, but token sampling happens on a GPU, and we do it one batch at a time. For each batch, we calculate probabilities for everything in the batch, then apply any token masks, and then sample from that final probability distribution; that sampling is right here in the orange segments. As you can see, we can calculate the probabilities and determine the masks in parallel, and this frees up precious GPU resources, because we can do the mask determination on CPU. This is very helpful for keeping things fast. You'll also notice that the masking needs to take about the same amount of time as calculating the probabilities. This is also known as the time between tokens, and it varies based on the size of the model, so we need to keep it under even something like 10 milliseconds. A naive solution would involve determining the valid tokens from the current state at every step of inference, but like we said, we have 10 milliseconds, so we're not likely to meet that budget. So we wanted to precompute as much work as possible and then reuse it when sampling, to make mask computation more like a lookup and less like a lot of work.

Just like you can build an index in your SQL database to speed up queries, we can build an index for the specific JSON schema that you supply, to make fetching these masks very fast during inference. There's a lot of work that goes into producing this index from the JSON schema. First, we convert the JSON schema into a grammar; a grammar is a formal specification of a language. Once we have the grammar, we can create a parser, which is a program that can check strings and see if they're part of this language. You can think of the parser as a program that takes a JSON blob and tells you whether it matches the schema or not. Finally, once we have a parser, we can iterate over all of the tokens in our vocabulary and all of the states that are possible in the parser, and this determines which tokens are valid. This work lets us build an index. Our index is actually a trie, a prefix-based data structure that allows for O(1) lookups. During inference, we run our parser, building up some state, and at every step we traverse this trie to find the leaf node that tells us what the mask is for the next step. Generating this index is pretty computationally expensive, since we have to go over all of the possible states, so we do it just once and then cache it for fast lookups later. This is why the first query with structured outputs can take a little time, usually under 10 seconds, and the following queries are just as fast as you'd normally expect.

Finally, let's talk about the kinds of grammars we support with structured outputs. A common approach in the open-source community is using regular expressions to determine token masks, and this approach works quite well for simple or depth-limited schemas. For example, you can imagine a regular expression for our value schema from before; it's actually pretty tractable to implement with a regular expression, which we have here. This regular expression has about what you'd expect: first we have the curly brace, then some whitespace, then "value", and finally a regular expression for number types. However, you can't implement all of the expressiveness of JSON Schema with regular expressions; they're missing, basically, the memory needed to store information about past lookups. So let's talk about why that's
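The regular expression for the value schema, as described, might look like this sketch (whitespace handling simplified):

```python
import re

# A sketch of the regex described in the talk for the {"value": <number>}
# schema: brace, optional whitespace, the key, a colon, then a JSON number.
VALUE_RE = re.compile(
    r'^\{\s*"value"\s*:\s*-?\d+(\.\d+)?([eE][+-]?\d+)?\s*\}$'
)

print(bool(VALUE_RE.match('{"value": 42}')))      # True
print(bool(VALUE_RE.match('{"value": "oops"}')))  # False: wrong type
```

This works because the schema has a fixed, flat shape; as the talk explains next, schemas with recursion cannot be captured this way.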

### [30:00](https://www.youtube.com/watch?v=kE4BkATIl9c&t=1800s) Segment 7 (30:00 - 35:00)

important. Here's an example of a recursive schema from our blog post. This schema is useful for generative UI, and it's kind of like what Atty showed you just now: each component can have a list of children, and those children must themselves also match the schema at the top level. You can see the schema here uses the $ref keyword to reference the parent, and there's actually no way to implement a regular expression to validate this, because of the potential for recursive nesting. This is a fundamental limitation of regular expressions: they don't have arbitrary memory to encode information about past outputs.

To make this easier to grok, here's a really simple example of a language. Our language is defined as all of the strings that have matching open and close parentheses, and you can see we have a quick example here where all parentheses are matched. This is something that's so easy to explain, yet it's not possible to implement with a regular expression. To show you why, we have two attempts here to do so. The first one is very simple: we just allow arbitrary open and close parentheses. But it's pretty clear why this won't work: you can have one open and two close parentheses, so it doesn't validate our language. We're missing the memory of what happened before. Our other attempt is much more complicated, and it works, but only up to three layers of nesting; if we go further than that, the regular expression doesn't behave properly. This is just a toy example, but we believe that recursive and deeply nested schemas are often critical for developers, so we really wanted to find a way to support them. We can do so by adding in a stack, and this gives us the memory we need to encode information about past outputs. The stack lets us keep track of things like how many open parentheses we've seen before. This approach is known as the CFG, or context-free grammar, approach. It's a little trickier to implement, but it allows for far more expressiveness. This is the approach that we went with, and it's why it takes a little bit of time to build up this trie for inference; it's also why we support recursion and deep nesting.

So that's the engineering side of structured outputs. We just talked about how LLM inference works, why token masking is a useful building block, and how this all comes together in the extensive subset of JSON Schema that we support. We believe that this implementation results in the best set of trade-offs for developers: we've kept inference as fast as possible, we've supported a very wide subset of JSON Schema, and we believe the trade-off of waiting a little longer on that first request while you're developing is worth it for most developers. So now let's get into how we've actually improved our models with research to work with structured outputs.

Awesome, thank you, Michelle. That was a little bit of the engineering behind structured outputs; next, let's talk about the research side. On the research side, we wanted to ship a model that was much better at following formats as specified. It's not enough to just constrain the model to a valid output, because if those outputs are out of distribution or low probability for the model, the model will often behave erratically. Some of you may have seen this before as the infinite-newline problem. This happened with some of our older models that weren't trained on response formats: the model's natural inclination was to generate text that wasn't JSON, but because it was being forced to output JSON, the only valid and sufficiently probable token was the newline character. So the model would just repeat the newline token, one after another, all the way until it hit max tokens. We wanted to avoid this problem, so we specifically trained our models to understand JSON Schema, and in particular complex schemas, better.

It's also not enough for the model to just understand the schema; it needs to know what the quality of each field should be. There's a semantic meaning behind the keys. For example, consider this action item schema. Each action item has a description, a due date, and an owner. To produce a good output, it's not enough to know that a description is a string and the owner is a string; the model needs to know what kind of string goes into a description and an owner. To ensure the models are good at this, we trained them with a whole bunch of complex schemas, including nested schemas. Here are the results from our training process. This graph shows our models on the x-axis and their accuracy on one of our evals on the y-axis. The first three bars from the left are GPT-4 Turbo and the original GPT-4o.
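The matching-parentheses language from the grammar discussion above can be recognized with the memory a stack provides; for a single bracket type, the stack degenerates to a counter. A minimal sketch:

```python
# Recognizer for the balanced-parentheses language from the talk. Regular
# expressions cannot do this at arbitrary depth; a stack (here a counter,
# since there is only one bracket type) handles any nesting depth.
def balanced(s: str) -> bool:
    depth = 0  # number of currently open parentheses
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a close with no matching open
                return False
    return depth == 0      # every open was eventually closed

print(balanced("(()())"))      # True
print(balanced("(()))"))       # False: one open too few
print(balanced("((((()))))"))  # True even past three levels of nesting
```

This is the essence of the context-free grammar approach: a small amount of state about past output unlocks languages that no finite regular expression can describe.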

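The action-item example from this segment, written out as a schema, shows why type checking alone is not enough: `description` and `owner` are both plain strings, so the schema cannot distinguish a good completion from one with the values swapped; that distinction has to come from training on the semantic meaning behind the keys. The exact field definitions below are my reconstruction, not taken from the talk's slides.

```python
# Reconstruction of the action-item schema discussed in this segment.
action_item_schema = {
    "type": "object",
    "properties": {
        "description": {"type": "string"},
        "due_date": {"type": "string"},
        "owner": {"type": "string"},
    },
    "required": ["description", "due_date", "owner"],
    "additionalProperties": False,
}

def types_ok(schema, obj):
    """Check only what the schema can express: every field is a string."""
    return all(isinstance(obj.get(k), str) for k in schema["properties"])

good = {"description": "Draft the launch blog post",
        "due_date": "2024-09-30",
        "owner": "Michelle"}
swapped = {"description": "Michelle",               # semantically wrong:
           "due_date": "2024-09-30",                # owner and description
           "owner": "Draft the launch blog post"}   # are exchanged

# Both pass the type check; the schema alone can't say which *kind*
# of string belongs in which field.
assert types_ok(action_item_schema, good)
assert types_ok(action_item_schema, swapped)
```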
### [35:00](https://www.youtube.com/watch?v=kE4BkATIl9c&t=2100s) Segment 8 (35:00 - 40:00)

you can see that accuracy improved from about 36% to 86% over the last year. Now at the very end here you see the results for our latest model. The orange bar shows the model's results with just prompting, and it has an accuracy of about 85%. When you add in the newly trained response format, we see with the yellow bar that the accuracy improves to 93%. This is much better, but still not 100%. So to get to 100%, we can enable constrained decoding, as Michelle shared earlier, giving us a perfect score on the green bar. In this way, combining research to improve the model with engineering to implement constrained decoding brings out the best possible results.

Okay, finally let's get into some interesting design trade-offs we made when creating this feature. We'll talk about three things: additional properties, required properties, and the order of properties. Let's start with additional properties. One of the controversial API design decisions we had to make was deciding what to do with properties that were not defined in the schema. By default, JSON Schema allows extra properties, so all additional properties are always allowed. Now this is generally not the behavior we as developers expect. You can imagine errors like this, where we have a function get_weather that accepts two arguments: if the LLM produces an extra argument, we get a runtime error. So we decided to disallow additional properties by default in the OpenAI API. That said, this meant the default in our API was different than the default in JSON Schema, and that's not great, especially if a developer already has a predefined schema from elsewhere in their application. In general, one of our API design principles is that we prefer to be explicit rather than implicit, so we decided to require that developers pass additionalProperties: false in their schemas. This makes the API a little bit harder to use, since you have to set this property every time, but it sets expectations better with developers. Okay, let's talk about required
properties. By default in JSON Schema, all properties are optional. This is again not what we as developers expect. To go back to our earlier example of the get_weather function: if the LLM decided to skip one of the parameters, we would again get a runtime error. So to make this feature more intuitive, we decided that all properties are required by default, and once again, to set expectations, we require developers to pass this in the required directive. Now we do have a workaround for optional parameters, which is to make them nullable. This gives us the best of both worlds: you get optionality without the performance trade-offs. Okay, finally let's talk about the order of properties. By default, JSON Schema has no ordering constraints, which means the LLM can produce properties in any order. But in the context of LLMs, order really matters. Having a strict ordering of properties can be really useful: for example, you can use an earlier field in your schema to condition the value of a later field, such as adding a chain-of-thought field to your schema. The model will first generate the chain of thought to explain its thinking, and then generate the answer. This often improves the quality of the answer significantly. So to support this use case, we decided to generate fields in the same order that you define them in the schema. So that's API design: we wanted to make sure the API had good defaults while making the constraints transparent to developers. Okay, so to bring it all home, here's Michelle.

Awesome. So you can see that the engineering and research halves of this project result in a meaningful improvement, but only when combined do they offer the best possible results. This is actually our ethos behind the work on the OpenAI API: we want to do the engineering and research work on our end to make the easiest-to-use API for developers. We want to solve problems like structured outputs for you, so you can spend time working on what matters most: your application. We think structured outputs
was the final puzzle piece for unlocking the full power of AI applications. Data extraction is now reliable, function calls have the required parameters, and finally agentic flows can work 100% of the time. Since we launched structured outputs in August, developers like you, from big companies to small startups, have been building amazing apps. We've seen customers like Shopify use structured outputs to reduce hallucinations and improve the reliability of their applications. We've also heard the excitement from all of you; it's been so great to see structured outputs solve so

### [40:00](https://www.youtube.com/watch?v=kE4BkATIl9c&t=2400s) Segment 9 (40:00 - 40:00)

many problems for developers, and this is why we do what we do. Our mission at OpenAI is to build safe AGI for everyone. We build the API because we think working with developers, with all of you, is just critical to achieving that mission. You see the future before anyone else, and together we can spread this technology to the furthest reaches of the world. As fellow engineers, we feel so lucky to serve you. So thank you for building with us, and we're so excited to see what you do. Thank you.
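As a small appendix to the API-design segment above: the three decisions, explicit `additionalProperties: false`, every property listed in `required` (with nullable types standing in for optional), and fields generated in definition order so a chain-of-thought field precedes the answer, can all be seen in one sketch. The field names and the validator are illustrative stand-ins, not OpenAI's code.

```python
# Toy sketch of the three schema design decisions from the talk.
schema = {
    "type": "object",
    "properties": {
        # Order matters: fields are generated in definition order, so a
        # chain-of-thought field placed first is produced before the
        # answer it conditions.
        "chain_of_thought": {"type": "string"},
        "answer": {"type": "string"},
        # "Optional" parameters are expressed as nullable rather than
        # left out of "required": the key always appears, but its value
        # may be null.
        "note": {"type": ["string", "null"]},
    },
    # Every declared property must be listed in "required".
    "required": ["chain_of_thought", "answer", "note"],
    # Extra keys are rejected, and you must say so explicitly.
    "additionalProperties": False,
}

def check(schema, obj):
    """Toy validator for the two object-level rules discussed above."""
    props = set(schema["properties"])
    if not schema.get("additionalProperties", True) and set(obj) - props:
        return False  # undeclared extra key -> reject
    if set(schema["required"]) - set(obj):
        return False  # a required key is missing -> reject
    return True

ok = check(schema, {"chain_of_thought": "Beatles is an artist, so play music.",
                    "answer": "play_music", "note": None})
missing = check(schema, {"answer": "play_music"})       # skips required keys
extra = check(schema, {"chain_of_thought": "x", "answer": "y",
                       "note": None, "mood": "sunny"})  # undeclared key
print(ok, missing, extra)  # True False False
```

Python dicts preserve insertion order, which is what lets the `properties` dict here mirror the generation order the talk describes.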

---
*Source: https://ekstraktznaniy.ru/video/11382*