# OpenAI DevDay 2024 | Balancing accuracy, latency, and cost at scale

## Metadata

- **Channel:** OpenAI
- **YouTube:** https://www.youtube.com/watch?v=Bx6sUDRMx-8
- **Date:** 17.12.2024
- **Duration:** 33:36
- **Views:** 16,019

## Description

Scale AI applications by balancing accuracy, latency, and cost

## Contents

### [0:00](https://www.youtube.com/watch?v=Bx6sUDRMx-8) Intro

You built an app and its user base is expanding quickly, doubling or more, and the growth isn't slowing down. The success is exciting, but the challenge is scaling sustainably: what worked for a thousand users often doesn't extend to one million. You'll need to make critical decisions about which LLMs to use, how to predict and manage costs, and how to keep response times fast. I'm Colin Jarvis, I lead the AI Solutions team here at OpenAI. And I'm Jeff Harris, one of the product leads on the API platform team. Today we're going to cover common pitfalls and techniques for scaling your apps on our platform.

To start, I wish I could tell you that there was one playbook that would just work, one perfect way to optimize an LLM app. Of course there isn't. There are lots of techniques and many trade-offs to be had, but we're hoping this talk gives you a set of approaches and best practices to consider, and that you walk away with some optimization ideas that end up working really well for what you're building.

The first thing I'll say, before we talk about how you should optimize, is that we think optimizing your apps is central to what we do. We push out models that are more intelligent, we make models that are faster (GPT-4o is twice as fast as GPT-4 Turbo), and we push relentlessly on cost. I love this chart: text-davinci-003, which came out in 2022, was a less capable model than GPT-4o mini, and our cost per token has decreased since then by about 99%. That's a really good trend to be on the back of. Another example: with the 32k version of GPT-4, a million tokens cost $120, so quite expensive. Today, if you do that with GPT-4o mini, it costs about 60 cents, a roughly 200x improvement in about a year and a half. These lower costs, combined with fine-tuned smaller models like GPT-4o mini, have unlocked a lot of new use cases, and we see this in our data: since we released GPT-4o mini in July, token consumption on our platform has more than doubled.
So there are a ton of new ways of using our API, but more models bring more questions. When do you use GPT-4o mini? When do you use GPT-4o? When is reasoning required? We've worked with lots of builders like you, across many different sizes and use cases, and we've developed a pretty good mental model to help you work through those kinds of decisions. Colin is going to talk about how to improve accuracy, the quality of the model's outputs, and then I'll be back in a few minutes to talk about how to optimize latency and cost. Thank you.

### [2:46](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=166s) Scaling AI Apps

Cool. Scaling AI apps involves balancing accuracy, latency, and cost in a way that maintains accuracy at the lowest cost and speed. It's important to keep that in mind as we go, because clearly none of this is a silver bullet; we're just trying to share some of the trade-offs that we've seen work well in this area.

To accomplish this, we've worked out an approach that generalizes pretty well across most of our customers. Start by optimizing for accuracy: use the most intelligent model you have until you hit your accuracy target. That target is typically something that has a meaning in business terms, like "90% of our customer service tickets are going to be routed correctly on the first attempt." Once you've hit your accuracy target, you then optimize for latency and cost, with the aim of maintaining that accuracy with the cheapest, fastest model possible.

I'm going to begin with how to ensure your solution is going to be accurate in production. Optimizing for accuracy is all about starting with an accuracy target that means something. First we need to build the right evaluations so that we can measure whether our model is performing as expected; I know you folks know that, so we're just going to recap it and then share some best practices that we've seen from folks building in the real world. Then we want to establish a minimum accuracy target to deliver ROI. This is a step a lot of folks skip: teams start arguing about what accuracy is good enough for production, get stuck, and can't actually take the final leap, so we want to share some ways we've seen customers communicate that and get past it. Lastly, once you have that accuracy target, you get to the actual optimization bit: choosing your prompt engineering, RAG, and fine-tuning techniques to actually reach that target. So the first
step is to develop your baseline evals. It's been almost two years since ChatGPT came out, but a lot of people still skip this step, so I want to start here and quickly recap the basics to make sure we're on the same page. We encourage our customers to embrace eval-driven development: with an LLM, you can never consider a component built until you've got an eval that tests it runs end to end and actually produces the results you're intending. Evals come in a lot of flavors, but the two key ways we frame them are component evals, simple evals that function like a unit test, usually pretty deterministic, usually testing a single component to make sure it's working, and end-to-end evals, which are more like black-box tests: you put in an input at the start of the process and measure what comes out at the very end, possibly with a multi-step network working through it in between. It probably sounds like a lot of work, but we've found effective ways of scaling these that I want to share with you, which we've been working through with some of our customers.
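To make the component-eval idea concrete, here is a minimal sketch in the unit-test spirit described above. The `route_intent` keyword router and the test cases are hypothetical stand-ins (in a real app this step might itself be an LLM call); only the pattern — fixed input, expected output, deterministic pass/fail — comes from the talk.

```python
# Minimal component-eval sketch: deterministic, unit-test-style checks
# for a single step (intent routing). `route_intent` is a hypothetical
# stand-in for your real routing component.

def route_intent(message: str) -> str:
    """Toy keyword router, used here purely as the component under test."""
    text = message.lower()
    if "refund" in text or "money back" in text:
        return "refund"
    if "cancel" in text:
        return "cancel_subscription"
    return "general_question"

# Each case is (input, expected intent): a simple true/false check.
CASES = [
    ("I want my money back for order 1234", "refund"),
    ("Please cancel my subscription", "cancel_subscription"),
    ("What are your opening hours?", "general_question"),
]

def run_component_eval(cases):
    results = [(msg, expected, route_intent(msg) == expected)
               for msg, expected in cases]
    accuracy = sum(ok for *_, ok in results) / len(results)
    return results, accuracy

if __name__ == "__main__":
    results, accuracy = run_component_eval(CASES)
    for msg, expected, ok in results:
        print(f"{'PASS' if ok else 'FAIL'}: {msg!r} -> expected {expected}")
    print(f"accuracy: {accuracy:.0%}")
```

The same harness extends naturally to the other component checks mentioned later (did it call the right tools, did the customer achieve their objective).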

### [5:35](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=335s) Customer Service Example

I want to go through an example for customer service, because this is a pretty common use case that we run into with our customers, and it's reasonably complex: you've usually got a network of multiple assistants with instructions and tools they're trying to use. A couple of examples of simple component evals: did this question get routed to the right intent (a simple true/false check)? Did it call the right tools for this intent? And then, did the customer achieve their objective? They came in looking to get a return for $23; did they actually get that?

The best practice we've found here is customers using LLMs to pretend to be customers and red-team their solution. Effectively, we'll spin up a bunch of customer LLMs and run them through our network, and those act as an automated test that tracks divergence: if we change a prompt here, we want to check that everything is still being routed and completed correctly. So I'm going to talk through this network and share how it works. There are a lot of details here, but the main thing to take away is that this is a fairly typical customer service network: you've got a bunch of intents, and each intent routes to an instruction and some tools. The way we're approaching this with all of our customer service customers now is to start by mining historic conversations, and for every one we set an objective the customer was trying to achieve: get a refund for $12, book a return, whatever it might be. Then we give those to an LLM, get it to pretend to be that customer and run through the network, and mark whether all the evals pass. In this case, this customer tried to purchase Plan B: they went through, got the "upgrade plan" intent, and successfully did that. Then we run
through another customer, and they fail to get routed correctly, so they fail. And then we run through another one, and they get routed correctly but then fail to follow the instructions. This won't be 100% accurate, but we've seen customers use this to scale their customer service networks, because it almost acts like an automated test suite: every time you raise a PR, you rerun these automated tests and figure out whether your customer service network has regressed. The reason I wanted to share this, and it may seem pretty basic, is that a lot of the time the biggest problem with customer service networks is divergence: we changed something way over here in the network, so how does it affect the whole network? And the reason I'm sharing this particular approach: we did this with a customer that started off with 50 routines covered, and we eventually scaled to over 400 using this method. The rollout of new routines got faster and faster because this was our base, and we ended up with over a thousand evals that we ran each time we made a material change to the network.
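The simulated-customer loop can be sketched roughly as follows, with both the agent network and the LLM customer stubbed out as plain functions. In a real setup both would be model calls and the objectives would be mined from historic conversations, as described above; everything named here is an illustrative stand-in.

```python
# End-to-end eval sketch: replay mined customer objectives through the
# network and record which stage (routing, objective completion) each
# simulated customer fails at. `agent_network` is a hypothetical
# stand-in for a real multi-step assistant network.

from dataclasses import dataclass

@dataclass
class Objective:
    description: str
    expected_intent: str

def agent_network(objective: Objective) -> dict:
    """Toy stand-in: routes correctly except for one known gap."""
    routed = "refund" if "refund" in objective.description else "upgrade_plan"
    return {"intent": routed, "objective_met": routed == objective.expected_intent}

def run_e2e_evals(objectives):
    report = []
    for obj in objectives:
        outcome = agent_network(obj)
        routed_ok = outcome["intent"] == obj.expected_intent
        report.append({
            "objective": obj.description,
            "routed_ok": routed_ok,
            "objective_met": routed_ok and outcome["objective_met"],
        })
    return report

objectives = [
    Objective("get a refund for $12", "refund"),
    Objective("book a return for order 987", "book_return"),  # exposes the routing gap
]
```

Rerunning `run_e2e_evals` on every PR is what turns this into the regression suite described above: the report shows exactly which objectives stopped being routed or completed after a change.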

### [8:23](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=503s) Customer Service Accuracy

So once you have your evals, the next step is probably the biggest area of friction for customers deploying to production: deciding when it's good enough to ship. How accurate is accurate enough? You're never going to get to 100% with LLMs, so how much is good enough? I want to take an example from one of our customers. We had done two pilots for their customer service application, and accuracy was sitting between 80 and 85%. The business wasn't comfortable shipping, so we needed to agree on a target to aim for. What we did was build a cost model to help them work through the different scenarios and decide where to set that accuracy target. We took 100,000 cases and set a bunch of assumptions. The first was that we save $20 for every case where the AI successfully triages on the first attempt. For the ones that get escalated, we lose $40, because each of those needs a human to work on it for several minutes. And of the cases that get escalated, 5% of customers are going to get so annoyed that they churn, and we lose $1,000 for each of those customers. What we ended up with was a break-even point of 81.5% accuracy. Then we agreed with management: let's go for 90%, and that's a level we're actually comfortable shipping at.

The interesting addendum to this story is that people often have a higher expectation of accuracy for LLMs than they do for people; that's probably not a surprise to folks in the room. In this case, when we set this 90% accuracy marker and met it, they also wanted to run an A/B test against humans. So we did a parallel check of human agents: we took a few of the fully human tickets and tested how they performed, and they ended up at 66% accuracy. So in the end it was a fairly simple decision to ship at 90%. The key thing here is that once you've got your evals and your accuracy target, you know how much to optimize. At this stage we've got the business on board, we've got evals that measure it, and we know that 90% is the target, so now we need to hill-climb against that 90% and actually get there.

Here I want to revisit our optimization techniques. This four-box model is a pretty common asset that we use within OpenAI. Typically you start in the bottom-left corner with prompt engineering, the best place to start: begin with a prompt with explicit instructions, make an eval, figure out where your evals are failing, and then decide where to optimize from there. If your model is failing because it needs new context, information it hasn't been trained on, then you need retrieval-augmented generation, or RAG. If it's failing because it follows instructions inconsistently, or it needs more examples to learn the task or style you're asking for, you need fine-tuning. We have a lot of other resources that go into more detail on this, so for now I just want to call out a couple of key principles we've seen over the last six months or so, where the meta is changing slightly for each of these techniques.

The first is prompt engineering. We get asked a lot whether long context replaces RAG, and the answer is that it doesn't, for now, but it allows you to scale prompt engineering much more effectively. We see long context as a really good way of figuring out how far you can push the context window: you might not need as sophisticated a RAG application if, for example, you can stuff the prompt with 60k tokens rather than 5k tokens and still maintain performance against your evals. The other thing we're seeing is automated prompt optimization. There's a concept called meta-prompting, where you effectively make a wrapping prompt that looks at your application: you give it your prompt and the eval results, and it iterates on your prompt for you. You basically run an almost grid-search-like loop: you run evals, it iterates your prompt, you feed the results back into the wrapping prompt, and it improves your prompt again. Everything we've talked about so far involves a lot of manual iteration, and the real difference with meta-prompting is that people are starting to lean on models to accelerate that. One of the use cases that o1 is actually best at is meta-prompting: we did this with a customer to generate customer service routines and then optimize them, and it compressed what would have been a couple of weeks of work into about a day or two. So it's definitely something I'd encourage you to try, and we have some cookbooks coming as well to show you how to do it.

Once you've pushed prompt engineering, there's RAG. I know everybody in the room knows about RAG, so I just want to call out a couple of things that have been most key over the last few months to making RAG applications work. The first is that not every problem needs a semantic search. Probably fairly obvious, but one of the simplest and most effective optimizations people make is to put a classification step at the start, asking: based on what this customer has asked, does this require a semantic search, a keyword search, or an analytical answer to the question? We're seeing a lot more folks choose databases where you can have vectors as well as keyword search, as well as writing SQL to answer certain questions. It's a pretty simple optimization, but one you get a ton of mileage out of, and one I'd suggest people try. The other is extending your evals to cover retrieval. Once you add retrieval, you have a new axis on your evals; frameworks like Ragas formalize this, but the important thing is extending every eval example to show the context that was retrieved, then checking: are we retrieving the right context, and could the model actually answer with the content it was given? Again, fairly straightforward, but something a lot of folks don't do when working with RAG applications. So there are a couple of things there.
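Stepping back to the cost model from the start of this section, the break-even arithmetic can be written out directly. With the stated assumptions ($20 saved per successful first-attempt triage, $40 lost per escalation, 5% of escalated customers churning at $1,000 each), the break-even accuracy comes out close to the figure quoted in the talk; the talk's exact inputs may differ slightly.

```python
# Break-even accuracy sketch for the customer-service cost model.
# Assumptions from the talk: $20 saved per successful first-attempt
# triage, $40 lost per escalation, 5% of escalated customers churning
# at $1,000 each.

SAVE_PER_SUCCESS = 20.0
COST_PER_ESCALATION = 40.0
CHURN_RATE = 0.05
CHURN_COST = 1000.0

def net_value(accuracy: float, cases: int = 100_000) -> float:
    """Expected net dollars across `cases` tickets at a given accuracy."""
    successes = accuracy * cases
    escalations = (1 - accuracy) * cases
    return (successes * SAVE_PER_SUCCESS
            - escalations * COST_PER_ESCALATION
            - escalations * CHURN_RATE * CHURN_COST)

def break_even_accuracy() -> float:
    # Solve save * a = (escalation cost + expected churn cost) * (1 - a).
    loss_per_escalation = COST_PER_ESCALATION + CHURN_RATE * CHURN_COST
    return loss_per_escalation / (SAVE_PER_SUCCESS + loss_per_escalation)

if __name__ == "__main__":
    print(f"break-even accuracy: {break_even_accuracy():.1%}")  # ~81.8% here
    print(f"net value at 90% accuracy: ${net_value(0.90):,.0f}")
```

Running the same model at the agreed 90% target shows a comfortably positive net value, which is the kind of number that gets the business on board.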

### [14:10](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=850s) Fine Tuning

The last one is fine-tuning, and I don't want to belabor it, because I know you've heard a lot about distillation today. The main thing with fine-tuning is to start small and build up: you'd be surprised how few examples fine-tuning needs to perform well; typically 50-plus examples is enough to start with. Then make a feedback loop: a way to log positive and negative responses and feed those back in. Again fairly straightforward, but a useful aspect here. The last thing is considering distillation: just like meta-prompting, having a way to scale up and let the system learn, feeding those positive and negative examples back in to retrain whatever model you've got, is a fairly key aspect of fine-tuning.

To bring this to life and show how we've seen it work in practice, I want to share a customer example. We had a customer with two different domains they were using in a RAG pipeline, and fairly nuanced questions they were trying to answer. The baseline we were working from was 45%, which is pretty bad; that was standard retrieval with cosine similarity. They also worked in a regulated industry. So we got our baseline evals (45%, not great) and set our accuracy target. We decided to optimize for false negatives rather than false positives: we'd be happier if the model says it doesn't know than if it hallucinates, and we used that to set our tolerance. Then we targeted 95% accuracy on an eval set and had at it. The route we took to optimization is shown here; the reason I included the ticks and crosses with each of these steps is to show that we tried a ton of stuff and not everything worked. The key things that improved performance were, first of all, chunking and embedding: doing a grid search across different chunking strategies to figure out the right settings. We then added that classification step I talked about, so if somebody types in one word you do a keyword search rather than a semantic search; pretty straightforward, but again a 20-percentage-point boost, along with a reranking step: taking all the retrieved context and training a domain-specific reranker, so depending on which domain the question was for, we had a specific reranker that would rerank the chunks. That got us to 85%. And the last thing, which took us from 85 to 95, was that a lot of the time people would ask analytical questions of this RAG system, and RAG systems will often just fetch the top 10 documents and give you the wrong answer for those questions. So we added tools so the model could write SQL over the content, and we also added query expansion, where we fine-tuned a query-expansion model to infer what the customer wanted. That's what we eventually shipped with.

To bring together everything we've talked about for optimizing for accuracy: you start by building evals to understand how the app is performing; you then set an accuracy target that delivers ROI and keeps your application safe; and then you optimize using prompt engineering, RAG, and fine-tuning until you hit your target. At that point you have an accurate app, but the problem is it might be slow and expensive, and to solve those problems I'll pass over to Jeff. Thank you all very much.
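The classification step described in this section — deciding between keyword, semantic, and analytical handling before doing any retrieval — can be sketched as a small router. The categories and heuristic rules here are hypothetical; in practice this step would typically be a cheap LLM classification call rather than hand-written rules.

```python
# Sketch of a pre-retrieval routing step: classify the query, then pick
# keyword search, semantic search, or an analytical (SQL-style) path.
# The rules below are illustrative stand-ins for an LLM classifier.

def classify_query(query: str) -> str:
    words = query.strip().split()
    if len(words) == 1:
        return "keyword"          # one-word queries: exact lookup, not embeddings
    analytical = ("how many", "average", "total", "count", "sum")
    if any(phrase in query.lower() for phrase in analytical):
        return "analytical"       # aggregate questions: answer with SQL instead
    return "semantic"             # default: embedding-based retrieval

def route(query: str) -> str:
    handlers = {
        "keyword": "keyword_search",
        "semantic": "vector_search",
        "analytical": "sql_over_content",
    }
    return handlers[classify_query(query)]
```

This is the kind of step that pairs well with a database supporting vectors, keyword search, and SQL side by side, as the talk suggests.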

### [17:19](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1039s) Improving Latency

Fantastic. So first you focus on accuracy: you need to build a product that works. But once you've achieved that desired accuracy, it's time to improve latency and cost, both for the obvious reasons of saving your users time and yourselves money, but also, more profoundly, because if you can reduce the latency and cost of each of your requests, then you can do more inference within the same dollar and time budget, which is just another way of applying more intelligence to the same requests, and another pretty central way of improving accuracy. So let's start by talking about techniques to improve latency, and then we'll get to cost.

The first thing to understand about latency is that an LLM is not a database. You should not be fixating on total request latency; that's not the right way to think about latency for an LLM. Instead, you want to break it down into the three subcomponents that make up total request latency: network latency, which is how much time it takes for a request, once it's entered our system, to land on the GPUs and then get back to you; input latency, which we commonly call time to first token, the amount of time it takes to process the prompt; and output latency: for every token that's generated, you pay a pretty much fixed latency cost per generated token. Combine those together and that's total request latency. What we see for many customers is that the vast majority of their time, usually 90% plus, is spent on output latency, generating tokens, so that's probably the first thing you'd think to optimize. But there are cases, like building a classifier on long documents, where input token speed is actually the thing that dominates, so it really depends on your use

### [19:02](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1142s) Total Request Latency

case. So, we broke down total request latency, and I'll say first: while I'm telling you not to fixate on total request latency, most customers will be paying attention to how long it takes for the LLM to complete. Unless you're building a streaming chat application, the thing they care about is when the answer is done. But even with that frame, even knowing that's the final thing customers care about, as developers you want to be a bit more focused on the details. So we talked about this network

### [19:30](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1170s) Network Latency TFT

latency; TTFT, the prompt latency or time to first token; and then the output latency, which we call TBT, or time between tokens. Take the time between tokens, multiply it by the number of tokens you generate, and that's the output latency. So the basic formula to keep in the back of your mind when you're wondering why your requests are taking so long, or how to make them faster, is: total latency = network latency + time to first token + (time between tokens × output tokens).
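That formula is easy to turn into a back-of-the-envelope estimator. The numbers plugged in below (roughly 200 ms of network overhead, and TTFT/TBT values in the ranges mentioned later in this talk) are rough illustrations, not service guarantees.

```python
# Back-of-the-envelope total-latency estimator:
#   total = network + time-to-first-token + time-between-tokens * output_tokens
# All figures are rough illustrations from the talk, not guarantees.

def total_latency_ms(network_ms: float, ttft_ms: float,
                     tbt_ms: float, output_tokens: int) -> float:
    return network_ms + ttft_ms + tbt_ms * output_tokens

# Example: ~200 ms network overhead, ~200 ms TTFT, and a model generating
# ~22 tokens/sec (~45 ms between tokens) producing 500 output tokens.
estimate = total_latency_ms(network_ms=200, ttft_ms=200,
                            tbt_ms=1000 / 22, output_tokens=500)
# Output latency dominates: 500 tokens at 22 tok/s is ~22.7 s of the total.
```

Plugging in your own measurements makes it obvious which of the three components to attack first; for most workloads it is the `tbt_ms * output_tokens` term.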

### [19:56](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1196s) Network Latency Overview

Now let's break these up one at a time, and talk about each component: how it works and what you can do. First, network latency. Unfortunately, network latency is on us: it's the time from when the request enters our system to when it lands on GPUs, and from when the request finishes to when it gets back to you. It adds about 200 milliseconds to your total request time just to be routed through our services. One of the really nice pieces of news is that this is a central thing we've been focusing on for the last six months or so. Historically, our system has been pretty dumb: all requests were routed through a few central choke points in the US, then landed on GPUs, then came back to the US, then got back to you, which means that for a standard request you were often hopping across the ocean multiple times. Not ideal for really lowering network

### [20:45](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1245s) Regionalization

latency. But we're actually rolling out regionalization right now, so if you pay attention to your metrics, you should see network latency go down over these last few weeks. First I'll say I'm not allowed to tell you where our actual data centers are, so this is a little bit illustrative, but you can see that instead of having just centralized choke points, we now take requests, find the data centers closest to them, and try to process them completely locally. So that's one really meaningful way of reducing network latency. All right, on to stuff that you can

### [21:14](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1274s) Latency

affect really easily: time to first token, the latency it takes to process prompts. There are a few important techniques that really help

### [21:26](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1286s) Improving Prompt Latency

optimize here. The first, and the obvious one, is just to use shorter prompts. I know Colin just said you want to put more context and more examples in your prompts, and I wish I could tell you that doesn't come at a latency cost; it does come at a minor one. The longer your prompts, the longer it takes to process them. Usually prompt processing is about 20 to 50 times faster per token than output generation, depending on the model, so there's a trade-off to make between the amount of data in your prompt and how fast you want it processed. One of the really nice things we just released today is prompt generation in our Playground, where you can tell a model that we've tuned for prompt engineering, "I want the prompt to be able to do these things, and I want it to be concise and brief," and the model will help form a prompt that meets those criteria and strikes the right balance between verbosity and brevity. So that's one technique.

The second technique for improving prompt latency, and we'll talk about this a lot, is choosing a model of the right size. Depending on which model you choose, the time to first token varies quite meaningfully: o1 and GPT-4o both have a time to first token of about 200 milliseconds; o1-mini is about 50 milliseconds, so super fast; and GPT-4o mini is somewhere in between, at about 140 milliseconds, which is the standard time to first token we see across many requests in our system. The third way to improve prompt latency, and I'll just dangle it for now, is what we call prompt caching. We're just releasing this today; it's a way to speed up your requests if you have repeated prompts, and we'll talk about it a bit more in the cost section in a couple of minutes.

So that's time to first token, the prompt latency. The final component, probably where you're spending the most time, is time between tokens, or output latency. Time between tokens takes most of the time, and that's true for our classic models like GPT-4o and GPT-4o mini, but it's really true for our reasoning models like o1-preview and o1-mini, where each of those chain-of-thought tokens is an output token. Each token the model is thinking inside its head adds a fixed cost to the latency, and there can be a lot of

### [23:36](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1416s) Supply and Demand

tokens there. So what can you do? The first thing to understand about time between tokens is that one of the biggest determiners of this speed is simply supply and demand: how many tokens our system is processing versus how much capacity the system is provisioned with. What you can see here is a pretty typical week for us: weekends tend to be the fastest time, when we have the least demand, so tokens are generated the fastest, while weekdays, typically at specific times in the morning, are when we have the most demand and the models are at their slowest over the course of the week. And the way that we

### [24:10](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1450s) Latency targets

optimize this internally is with latency targets that we set on a per-model basis. The units here are tokens per second, so higher numbers mean faster, and what these numbers mean is the slowest we ever want the model to be. So at 8 a.m. on a Monday, which is typically one of our slowest times, we want GPT-4o to be at least 22 tokens per second, if not faster. You can see here that GPT-4o and o1 generate tokens at about the same speed, GPT-4o mini is meaningfully faster, and o1-mini is hugely faster, an extremely fast model, but it's also generating a lot of chain-of-thought

### [24:50](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1490s) Reducing output tokens

tokens. So you can pick smaller models if that's possible. You can't really move all of your traffic to weekends, although if you can, that's a very straightforward way to make things faster. But one of the other things you can do to improve time-between-tokens latency is reduce the number of output tokens. This can make a huge difference in total request time: if you have one request generating 100 output tokens and another generating a thousand output tokens, that second request is literally going to take about 10 times as long; it's going to be substantially slower. So you really want to think about how to prompt the model to be concise in its output and give you just the bare minimum amount of information you need to build the experiences you

### [25:34](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1534s) Prompt length

want. I should also mention one thing that's a little bit subtle in time-between-token latency: the length of the prompt actually makes a difference too. If you have one request with a thousand prompt tokens and another with 100,000 prompt tokens, the second request is going to be a little bit slower for each token it generates, because for each token it has to attend to that whole 100,000-token context. So shortening prompts has a small effect on generation speed as well. And the final way to improve time-between-token latency is to choose smaller models: if you can get away with GPT-4o mini, that's a very straightforward way to make your application faster. All right, so we've broken down latency: network latency, prompt latency, and output latency. Let's close out by talking about cost and the different ways you can make more requests with less

### [26:30](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1590s) Cost optimization

money. The first thing to know is that many of the techniques we've already touched on improve both latency and cost; a lot of the ways you make requests faster will also save you money. As a very straightforward example, we bill by token, and tokens also cost a fixed amount of time, so if you use fewer tokens it's obviously going to go meaningfully faster and cost less. But there are some optimizations that apply just to cost. The first thing I wanted to plug is that we have great usage limits in our developer console, where you can see how much you're spending on a per-project basis and get alerts when your spend is more than you're expecting. This is a really straightforward way to manage costs and make sure that, if there are different efforts happening inside your company, you're aware of how much each of them is spending and you're not getting surprised by some sudden surge in usage over the weekend. So it's always good to use those tools, set up projects in a granular way to control your costs, and keep that good

### [27:29](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1649s) Prompt caching

visibility. One of the things you can do to actually reduce costs is prompt caching, a feature we're launching today. The way it works: if a prompt lands on an engine that has already seen that prompt and has it in cache, it won't recompute the activations for that prompt (those are the internal model states associated with it), and instead it can go straight to the generation steps. So this is a way both to speed up prompt processing time and to save money. One of the key things to understand about prompt caching is that it works by prefix matching. You can see in this example: you have a first request on the left, and if you send the exact same request plus a couple of things at the end, it's all a prefix match; but if you change just one character at the very beginning of the prompt, it's a complete miss, and none of the earlier activations will help your prompt speed at all. What that means as you're building applications is that you really want to put the static stuff at the beginning of the prompt: your instructions for how your agent should work, your few-shot examples, your function calls. All of that belongs at the beginning of the prompt, and at the end you put the things that are more variable, like information about a specific user or what's been said previously in the conversation. Typically our system will keep prompt caches alive for about 5 to 10 minutes; this all happens automatically, without you needing to be involved at all. A second way to improve your prompt cache hit rate is to keep a steady cadence of requests: we will always clear the prompt cache within an hour, but as long as you keep hitting the same prompt prefix within about 5 to 10 minutes, you'll keep those prefixes warm the whole time.
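Since caching is a prefix match, the practical rule is: static content first, variable content last. Here is a sketch of that message ordering, plus a small helper that illustrates the prefix-match idea; the instruction text and helper names are illustrative, and in production the matching happens server-side without any code on your part.

```python
# Prompt-caching-friendly request layout: static prefix first (system
# instructions, few-shot examples), variable content last, so repeated
# requests share the longest possible cacheable prefix.

STATIC_PREFIX = [
    {"role": "system", "content": "You are a support agent. Follow the routines below."},
    {"role": "user", "content": "Example ticket: item arrived damaged."},      # few-shot
    {"role": "assistant", "content": "Example resolution: offer replacement."},
]

def build_messages(user_message: str, history: list) -> list:
    """Keep the static prefix identical across requests; append variable parts."""
    return STATIC_PREFIX + history + [{"role": "user", "content": user_message}]

def shared_prefix_len(a: list, b: list) -> int:
    """Rough illustration of prefix matching: leading messages in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

req1 = build_messages("Where is my order?", history=[])
req2 = build_messages("I want a refund.", history=[])
# Both requests share the entire static prefix, so the cache can reuse it.
```

If the system message changed per user (say, by embedding the user's name at the top), `shared_prefix_len` would be zero and every request would be a cache miss, which is exactly the one-character-at-the-start pitfall described above.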

### [29:10](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1750s) Money saved

manifest in terms of money saved? We're announcing this today: for all of our most recent models, you can immediately start saving 50% on any cached token. One of the really nice things about this implementation is that there's no extra work required, so hopefully your bills just go down as of today. If you have highly cacheable prompts, you don't need to pay extra to use this feature; you just save for making your traffic more
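A back-of-the-envelope sketch of the 50% cached-token discount described above; the price constant is a placeholder for illustration, not a real rate:

```python
INPUT_PRICE_PER_1M = 1.0  # placeholder $ per 1M input tokens, not a real price
CACHED_DISCOUNT = 0.5     # cached input tokens billed at half price, per the talk

def input_cost(total_input_tokens: int, cached_tokens: int) -> float:
    """Input cost in dollars: uncached tokens at full price, cached at 50%."""
    uncached = total_input_tokens - cached_tokens
    return (uncached + cached_tokens * CACHED_DISCOUNT) * INPUT_PRICE_PER_1M / 1_000_000
```

So a fully cached million-token prompt costs half what a cold one does, which is why putting the static content first pays off directly on the bill.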

### [29:38](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1778s) Batch API

efficient. The last thing I wanted to cover in terms of saving cost is our Batch API, which I think is a bit of a sleeper hit. It gives you 50% off both prompt and output tokens by running requests asynchronously. Instead of hitting the model and having it reply as quickly as possible, you create a batch file, which is a sequence of a really large number of requests, and submit it to the Batch API, which will complete the job within 24 hours (often much faster if you submit outside peak times). One of the really nice things about this service is that anything you pass to the Batch API doesn't count against your regular rate limits, so it's a completely separate pool of capacity that you can use at half price.

What we have generally found is that most developers have at least a few cases that really benefit from batch: content generation (say you're making a bunch of sites), running evals, ingesting and categorizing a really big backlog of data, indexing data for embedding-based retrieval, or doing big translation jobs. Those are all things that don't need every token generated within 40 milliseconds, so they're all really good use cases for batch. To the extent that you can offload work there, you're saving half the spend, and you're also not using any of your other rate limits, so you can scale more for the services that need to be synchronous.

To give you one example of how people have used this: one of our partners, Echo AI, categorizes customer calls. After transcribing a call, they summarize it, classify it, and pull out takeaways and follow-ups, and the Batch API saves them 50% of their costs because they don't need to be purely synchronous. By reducing their costs, they're actually able to pass lower prices on to their customers. The way they've built this is a near-real-time system that processes every call that comes in, creates batch jobs (sometimes really small ones of just a few requests), and then tracks all the batch jobs in the system, so that as soon as one completes they can notify the customer. This won't necessarily give sub-minute response times, but it's way cheaper than running requests synchronously, and it lets them scale a lot more and ask more questions about their calls.

We released the Batch API in the spring, so they've been using it for a few months, and so far it's saved them tens of thousands of dollars, which is pretty good for early days. Today about 16% of their traffic has moved over to the Batch API; for their particular use case they expect they can get to about 75%, and they're aiming to save about a million dollars over the next six months. So it's a really substantial amount you can save if you can bifurcate your workloads into the ones that need to be synchronous and the ones that don't.
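The batch file mentioned above is a JSONL file with one request per line. Here is a minimal sketch of building one; the field shapes follow the Batch API's documented input format, but the function name, prompts, and model string are illustrative, and the actual upload/submit step (a file upload followed by creating a batch with a 24-hour completion window) is omitted since it needs an API key:

```python
import json

def build_batch_file(prompts: list[str], model: str = "gpt-4o-mini") -> str:
    """Build the JSONL body for a Batch API input file.

    Each line is a self-contained request with a unique custom_id, which is
    how you match results back to inputs once the batch job completes.
    """
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines)
```

A pipeline like Echo AI's would write this string to a file, upload it, create the batch job, and poll job status until completion; the half-price billing applies to every request in the file.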

### [32:25](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1945s) Recap

All right, we have hit you with many different ideas and shared a lot of things. The good news is that many of the techniques we have for optimizing cost and latency are highly overlapping: if you optimize one, you tend to optimize the other, and if you can apply a good smattering of these best practices, you're going to build experiences that are state of the art in balancing intelligence, speed, and affordability. I'll also say that I'm sure there are many techniques we didn't mention here, and probably lots we don't know about, so if there's something we missed that you've found effective, we'd love to hear about it after the

### [33:03](https://www.youtube.com/watch?v=Bx6sUDRMx-8&t=1983s) Closing

talk. To close, I'll just say once more that I wish there were one playbook for balancing accuracy, latency, and cost. There isn't: this is actually the central art of building LLM applications, making the right trade-offs among these three constraints. I hope we've given you some ideas to think about, and on behalf of myself, Colin, and the team, we're very excited to see what you build. Thank you.

---
*Source: https://ekstraktznaniy.ru/video/11424*