The New Stack and Ops for AI
Duration: 34:09


OpenAI · November 13, 2023 · 83,343 views · 1,835 likes


Video description
A new framework to navigate the unique considerations for scaling non-deterministic apps from prototype to production. Speakers: Shyamal Hitesh Anadkat, Applied AI Engineer at @OpenAI, and Sherwin Wu, Head of Engineering, Developer Platform at @OpenAI.

Contents (7 segments)

Segment 1 (00:00 - 05:00)

Hi everyone, and welcome to The New Stack and Ops for AI: going from prototype to production. My name is Sherwin, and I lead the engineering team for the OpenAI developer platform, the team that builds and maintains the APIs that over two million developers, including hopefully many of you, have used to build products on top of our models. And I'm Shyamal. I'm part of the Applied AI team, where I work with hundreds of startups and enterprises to help them build great products and experiences on our platform. Today we're really excited to talk to you all about the process of taking your applications from the prototype stage into production.

But first, I wanted to put things into perspective a little bit. While it might seem like it's been a very long time since ChatGPT entered our lives and transformed the world, it actually hasn't even been a full calendar year since it was launched. ChatGPT launched in late November 2022, so it hasn't even been a full 12 months. Similarly, GPT-4 was only launched in March 2023, so it hasn't even been eight months since people first experienced our flagship model and tried to use it in their products. In this time, GPT has gone from a toy for us to play around with and share on social media, to a tool for our day-to-day lives and workplaces, and now to a capability that enterprises, startups, and developers everywhere are trying to bake into their own products.

Often the first step is to build a prototype, and as many of you probably know, it's quite simple and easy to set up a really cool prototype using one of our models and show a demo to all of our friends. However, there's often a really big gap in going from there to production, and a large part of this is due to the non-deterministic nature of these models. Scaling non-deterministic apps from prototype into production can often feel quite difficult without a guiding framework. You might feel something like this: there are a lot of tools out there for you to use, the field is moving very quickly, there are a lot of different possibilities, but you don't really know where to go or what to start with.

So for this talk, we wanted to give you all a framework to help guide you in moving your apps from prototype into production. This framework takes the form of a stack diagram, influenced by many of the challenges our customers have brought to us in scaling their apps. We'll talk about how to build a delightful user experience on top of these LLMs; about handling model inconsistency by grounding the model with knowledge stores and tools; about how to iterate on your applications in confidence using evaluations; and finally about how to manage scale for your applications, thinking about cost and latency, using orchestration. For each of these, we'll cover a couple of strategies that hopefully you can bring back and use in your own products.

Often, at first, we just have a simple prototype, and at that point there isn't a whole stack like the one I just showed. There's usually just a very simple setup: you have your application, and it talks directly to our API. While this works great initially, very quickly you'll realize it's not enough.

So let's talk about the first layer of this framework. Technology is only as useful as the user experience surrounding it, and while the goal is to build a trustworthy, defensible, and delightful user experience, AI-assisted copilots and assistants present a different set of human-computer interaction and UX challenges. The unique considerations of scaling applications built with our models make it even more important to drive better and safer outcomes for users. We're going to talk about two strategies here to navigate some of the challenges that come with building apps on top of our models, which are inherently probabilistic in nature: controlling for uncertainty, and building guardrails for stability and safety.

Controlling for uncertainty refers to proactively optimizing the user experience by managing how the model interacts and responds to users. Until now, a lot of products have been deterministic, where interactions happen in repeatable and precise ways, but this has been challenged by the shift toward building language user interfaces. It has become important to design for human-centricity, having the AI enhance and augment human capabilities rather than replace human judgment. When designing ChatGPT, for example, we baked in a few UX elements to help guide users and control for the inherent uncertainty that comes with building apps powered by these models.

The first strategy, depending on the use case, is to keep a human in the loop, and to understand that the first artifact created with generative AI might not be the final artifact the user wants. Giving users an opportunity to iterate and improve quality over time is important for navigating uncertainty and building a robust UX.
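The "simple setup" of that prototype stage, the application talking directly to the chat completions API, can be sketched in a few lines. This is a sketch, not the talk's own code: the model call is injected as a callable so that either the real client (OpenAI().chat.completions.create from the OpenAI Python library) or a stub can be plugged in, and the stub reply here is purely illustrative.

```python
# The whole "prototype stack": one function where the application talks
# directly to the model. The completion call is injected so any backend
# (or a test stub) can play that role.

def answer(user_message, complete):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_message},
    ]
    return complete(messages)

# In production, `complete` might wrap the real API, e.g.:
#   client = OpenAI()
#   complete = lambda msgs: client.chat.completions.create(
#       model="gpt-4-1106-preview", messages=msgs
#   ).choices[0].message.content

# Stub for demonstration:
complete = lambda msgs: f"(model reply to: {msgs[-1]['content']})"
print(answer("Hello!", complete))
```

The point of the talk is that everything after this, from guardrails to evals to orchestration, gets layered around this one call.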

Segment 2 (05:00 - 10:00)

Feedback controls, meanwhile, provide affordances for fixing mistakes and are useful signals for building a solid data flywheel. Another important aspect of building a transparent UX is to communicate the system's capabilities and limitations to users, so they understand what the AI can and cannot do. You can take this further by explaining to the user how the AI can make mistakes; in ChatGPT's case this takes the form of an AI notice at the bottom, which sets the right expectations with the user. Finally, a well-designed user interface can guide the user's interaction with the AI toward the most helpful and safest responses and the best out of the interaction. This can take the form of suggested prompts, which in ChatGPT not only help onboard users to the experience but also give them an opportunity to ask better questions, suggest alternative ways of solving a problem, inspire, and probe deeper. All three of these strategies put users at the center and in control of the experience, designing a UX that brings the best out of working with AI products and creating a collaborative, human-centric experience that establishes a foundation of trust.

For you to build more confidence in deploying your GPT-powered applications, it's not only important to build a human-centric UX but also to build guardrails for both stability and safety. You can think of guardrails as constraints or preventative controls that sit between the user experience and the model. They aim to prevent harmful and unwanted content from getting through your application to your users, and they also add stability to the models in production. Some of the best interaction paradigms we've seen developers build have safety and security at the core of the experience. Some of our best models are the ones most aligned with human values, and we believe the most useful and capable UX brings out the best in safety and stability, for better, safer outcomes.

To demonstrate an example of this, let's start with a simple prompt in DALL·E, very timely for Christmas: create an abstract oil painting of a Christmas tree. DALL·E uses the model to enhance the prompt by adding more details and specificity around the hues, the shape of the tree, the colors and brush strokes, and so on. Now, I'm not an artist, so I couldn't have done a better job at this myself; in this case I'm using DALL·E as a partner to bring my ideas to imagination. You might be wondering how this is a safety guardrail. Well, the same prompt enrichment used to create better artifacts also functions as a safety guardrail: if the model detects a problematic prompt that violates the privacy or rights of individuals, it will suggest a different prompt rather than refusing outright. In this case, instead of generating an image of a real person, it captures the essence and creates an image of a fictional person.

So we've shared one example of a guardrail that can help with both stability and safety, but guardrails can take many other forms: compliance guardrails, security guardrails, and guardrails that ensure model outputs are syntactically and semantically correct. Guardrails become especially important when you're building interfaces for highly regulated industries, where there's low tolerance for errors and hallucination and you have to prioritize security and compliance.

So we've built a great user experience with both stability and safety, but our journey doesn't end there. At this point you've built a delightful user experience that can manage around some of the uncertainty of these models, and while this works really well as a prototype, when the types of queries you get from your users are fairly constrained, as you scale into production you'll very quickly start running into consistency issues, because the types of queries and inputs you get will start varying quite a lot.

With this, we want to talk about model consistency, which introduces the second part of our stack: grounding the model with a knowledge store and tools. Two strategies we've seen our customers adopt well here to manage the inherent inconsistency of these models are, one, constraining the model's behavior at the model level itself, and two, grounding the model in real-world knowledge using something like a knowledge store or your own tools. The first of these, constraining the model's behavior, is tricky because it's difficult to manage around the inherent probabilistic nature of LLMs, especially as a customer of an API where you don't have low-level access to the model.
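As a minimal sketch of the guardrail idea described above, here is a layer that sits between the user experience and the model and checks both the input and the output before anything reaches the user. All the names here are illustrative, not from the talk; in a real OpenAI-based app, the is_flagged callable could wrap the moderation endpoint (client.moderations.create, reading results[0].flagged), and complete would wrap a chat completions call.

```python
# A guardrail layer between the UX and the model: check the user's
# input before the model call, and the model's output before the reply
# goes back to the user. Checks and model are injected callables.

def guarded_reply(user_input, is_flagged, complete,
                  refusal="Sorry, I can't help with that."):
    if is_flagged(user_input):   # input guardrail
        return refusal
    reply = complete(user_input)  # model call
    if is_flagged(reply):         # output guardrail
        return refusal
    return reply

# Toy stand-ins for demonstration (a substring blocklist is NOT a real
# safety system; swap in a moderation endpoint in practice):
blocklist = {"forbidden"}
flag = lambda text: any(word in text for word in blocklist)
model = lambda q: "echo: " + q

print(guarded_reply("hello", flag, model))      # passes both checks
print(guarded_reply("forbidden", flag, model))  # blocked at input
```

The same shape extends to the other guardrails mentioned, such as compliance checks or validating that output is syntactically correct, by swapping in different check functions.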

Segment 3 (10:00 - 15:00)

It's really difficult to manage some of this inconsistency from the outside, so today we actually introduced two new model-level features to help you constrain model behavior, and we wanted to talk to you about them. The first is JSON mode, which, if toggled on, constrains the output of the model to the JSON grammar. The second is reproducible outputs, using a new parameter named seed that we're introducing into chat completions.

JSON mode has been a really commonly requested feature, and it allows you to force the model to output within the JSON grammar. This is often really important to developers because you're taking the output of an LLM and feeding it into a downstream software system, and to do that you need a common data format, with JSON being one of the most popular. One big downside of inconsistency here is that when the model outputs invalid JSON, it can break your system and throw an exception, which is not a great experience for your customers. The JSON mode we introduced today should significantly reduce the likelihood of this. It works like this: in chat completions we've added a new argument, response_format, and if you pass in a type of json_object, the output you get back from the API will be constrained to the JSON grammar, so the content field of the response stays within valid JSON. While this doesn't remove 100% of all JSON errors, in the evals we've run internally it significantly reduces the error rate of the JSON output by the model.

The second feature is getting significantly more reproducible outputs via a seed parameter in chat completions. A lot of our models are non-deterministic, but if you look under the hood, there are actually three main contributors to the inconsistent behavior happening behind the scenes. The first is how the model samples its tokens based on the probabilities it computes; that's controlled by the temperature and top_p parameters we already have. The second is the seed parameter, the random number the model uses to start its calculations. And the third is something called the system fingerprint, which describes the state of the engines running in our backend and the code we have deployed on them; as those change, there will be some inherent non-determinism. Until today, we only gave people access to temperature and top_p. Starting today, we're giving developers access to the seed parameter as an input, and visibility into the system fingerprint in the responses from chat completions.

In practice it looks something like this: in chat completions there is now a seed parameter you can pass in, which is an integer. If you pass in a seed like 12345 and control the temperature, setting it to something like zero, your output will be significantly more consistent over time; if you send that particular request to us five times, the output you get back under choices will be significantly more consistent. Additionally, we're giving you access to the system_fingerprint field, which on every response from the model tells you a fingerprint of our engine system under the hood. If you're getting the exact same system fingerprint back on all your responses, and you passed in the same seed with temperature zero, you're almost certainly going to get the same response. So those are model-level behaviors that you can pick up and try today.
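As a sketch of what these two features look like from the Python client side (assuming the OpenAI Python library v1; the commented-out call needs an API key, so only the request is built here), plus a tolerant parse that downstream code should still keep, since JSON mode reduces but does not eliminate invalid JSON:

```python
import json

# Request using both features described above; the parameter names
# (response_format, seed, temperature) match the chat completions API.
request = dict(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "List three colors as JSON."}],
    response_format={"type": "json_object"},  # JSON mode
    seed=12345,                               # reproducible outputs
    temperature=0,
)
# resp = OpenAI().chat.completions.create(**request)
# resp.system_fingerprint  # same fingerprint + same seed + temperature 0
#                          # => near-identical outputs across calls

def parse_or_none(content):
    """Guard the downstream system with a tolerant parse anyway."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None

print(parse_or_none('{"colors": ["red", "green", "blue"]}'))
print(parse_or_none("not json"))  # None
```

Comparing system_fingerprint across responses, as described above, tells you whether any remaining variation came from a backend change rather than from sampling.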
A more involved technique is called grounding the model, which helps reduce the inconsistency of the model's behavior by giving it additional facts to base its answer on. The root issue is that, on its own, a model can often hallucinate information, as you're all aware. A lot of this is due to the fact that we're essentially forcing the model to speak: if it doesn't really know anything, it still has to try to say something, and a lot of the time it will make something up. The idea behind grounding is to give the model a bunch of facts so that it isn't working from nothing. Concretely, in the input context we explicitly give the model some grounded facts to reduce the likelihood of hallucinations, and this is actually quite a broad pattern.

In a system diagram, it might look like this: a query comes in from your user and hits your servers, but instead of passing it straight to our API, you first do a round trip to some kind of grounded fact source, passing the query in. The fact source ideally returns some grounded fact, and you then take that fact, along with the query itself, and pass both to our API. The API then takes that information and synthesizes a response using the grounded fact.

To make this a little more concrete, one way this might be implemented is with RAG and vector databases, which is a very common and popular technique today. In this example, say I'm building a customer service bot, and a user asks, "How do I delete my account?" This might be specific to my own application or product, so the API by itself wouldn't really know the answer.

Segment 4 (15:00 - 20:00)

Let's say I have a retrieval service, like a vector database, that I've used to index a bunch of my internal documents and FAQs about support, so it knows how account deletion works. First, I query the retrieval service with "How do I delete my account?" Say it finds a relevant snippet for me, from the account deletion FAQ: you go to settings, scroll down, click here, whatever. We then pass that snippet, along with the original query, to our API, and the API uses that fact to ground a response back to the user. In this case it would say: to delete your account, go to settings, scroll down, click here.

So that's one implementation, but this pattern can actually be quite broad, and with OpenAI function calling in the API you can use your own services; we've seen our customers use this to great effect. In this case, instead of a vector database, we might use our own API or microservice. Say a customer asks what the current mortgage rates are, which of course even our LLMs don't know off-hand, because rates change all the time. But say we have a microservice running a daily sync job that downloads and keeps track of current mortgage rates. With function calling, we tell our model that it has access to a function called get_mortgage_rates, backed by our microservice. We first send the request over to the API, and it expresses its intent to call the get_mortgage_rates function. We fulfill that intent by calling our microservice; say it returns something like 8% for a 30-year fixed mortgage. The rest looks very similar: we pass that back into the API with the original query, and the model responds with a grounded answer, something like: not great, current 30-year fixed rates are actually at 8% already.

At a broad level, you're using this grounded fact source, in a generic way, to help ground the model and reduce model inconsistency. I just showed two examples, but the grounded fact source can also be other things: a search index like Elasticsearch or some more general search index, a database, even browsing the internet or some other smart mechanism for grabbing additional facts. The main idea is to give the model something to work with. One thing I wanted to call out is that the OpenAI Assistants API we announced today actually offers an out-of-the-box retrieval setup for you to use and build on top of, with retrieval built right in as a first-class experience; I'd recommend checking it out.

Okay, so far we've talked about building a transparent, human-centric user experience, and then about how to deliver that user experience consistently, through some of the model-level features we released today and by grounding the model. Now we're going to talk about how to deliver that experience consistently without regressions, and this is where evaluating the performance of the model becomes really important. We're going to cover two strategies for evaluating the performance of applications built with our models.

The first is to create evaluation suites for your specific use cases. Working with many orgs, we hear time and time again that evaluating the model's performance and testing for regressions is hard, and that it often slows down development velocity. Part of the problem is that developers don't think about a systematic process for evaluating the performance of these models, and they do evaluations too late. Evaluations are really the key to success here: measuring the performance of the models on real product scenarios is essential for preventing regressions and for building confidence as you deploy these models at scale.

You can think of evals as essentially unit tests for large language models. People often think of prompting as a philosophy, but it is more of a science when you pair it with evaluations: you can treat it like software delivery, and evals can really transform ambiguous dialogues into quantifiable experiments. They also make model governance and model upgrades much easier by setting expectations around what good and bad look like. Evaluations and performance really go hand in hand, and they should be the place where you begin your AI engineering journey.

So to build evals, let's say we start simple and have human annotators evaluate the outputs of an application as you test it. A typical approach here is to take your application, with its particular set of prompts, retrieval approaches, and so on, and build a golden test dataset of evals by looking at its responses and manually grading them. As you annotate over time, you end up with a test suite that you can run in an online or offline fashion, or as part of your CI/CD pipelines. Due to the nature of large language models, they can make mistakes, and so do humans, so depending on your use case you might want to consider building evals that test for your specific failure modes.
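The "unit tests for the model" idea above can be sketched as a tiny harness: run the application over a golden test set and track the pass rate per run. The names (run_eval, golden_set) are illustrative, not a real framework, and the stub application and exact-containment grader stand in for an API call and a human or model grader.

```python
# Minimal eval harness: score an app against a golden test dataset.

def run_eval(app, golden_set, grade):
    results = [grade(app(case["input"]), case["expected"])
               for case in golden_set]
    return sum(results) / len(results)  # pass rate for this run

golden_set = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

# Stubs for demonstration; in practice `app` calls the API and `grade`
# may be a human annotator or a model-graded check.
app = lambda q: {"2+2": "4",
                 "capital of France": "Paris is the capital"}.get(q, "")
grade = lambda out, expected: expected.lower() in out.lower()

score = run_eval(app, golden_set, grade)
print(f"pass rate: {score:.0%}")
```

Logging that score for every run, alongside what changed (prompt, retrieval strategy, model snapshot), is exactly the kind of tracking a spreadsheet can handle at first.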

Segment 5 (20:00 - 25:00)

Those failure modes can include things like bad output formatting, hallucinations, agents going off the rails, bad tone, and so on.

Let's talk about how to build an eval. Earlier this year we open-sourced the Evals framework, which has been an inspiration for many developers. The library contains a registry of really challenging evals for different specific use cases and verticals, along with a lot of templates that can come in handy and be a solid starting point for understanding the kinds of evaluations and tests you should build for your specific use cases.

After you've built an eval suite, good practice and hygiene is to log and track your eval runs. In this example, we have five different eval runs, each scored against our golden test dataset, along with the annotation feedback and an audit of changes. The audit of changes could include things like changes to your prompt, your retrieval strategy, your few-shot examples, or even an upgrade to a new model snapshot. You don't need complicated tooling to start tracking something like this; a lot of our customers start with just a spreadsheet. The point is that each run should be stored at a very granular level so you can track it accordingly.

Human feedback and user evals are the highest-signal, highest-quality option, but they're often expensive or not always practical, for example when you cannot use real customer data for evals. This is where automated evals can help developers monitor progress and test for regressions quickly. So let's talk about model-graded evals, or essentially using AI to grade AI. GPT-4 can be a strong evaluator; in fact, on a lot of natural language generation tasks, we've seen GPT-4's evaluations correlate well with human judgment, given some additional prompting methods. The benefit of model-graded evals is that by reducing human involvement in the parts of the evaluation process that language models can handle, humans can focus on addressing the complex edge cases that are needed to refine the evaluation methods.

Let's look at an example of what this could look like in practice. Here we have an input query and a pair of completions: one is the ground truth, and one is sampled from the model. The evaluation is a very simple prompt that asks GPT-4 to compare the factual content of the submitted answer with the expert answer, and this is passed to GPT-4 to grade. In this case, GPT-4's observation is that there's a disparity between the submitted answer and the expert answer. We can take this further by improving our evaluation prompt with additional prompt engineering techniques like chain of thought, and so on.

In the previous example the eval was pretty binary: either the answer matched the ground truth or it did not. But in a lot of cases you'll want to think about eval metrics that are closely correlated with what your users expect, or the outcomes you're trying to drive. For example, going back to Sherwin's example of a customer service assistant, we'd want to eval custom metrics like the relevance of the response, the credibility of the response, and so on, and have the model score against those different metrics or criteria. Here's an example of what that criteria scorecard could look like: we've given GPT-4 criteria for relevance, credibility, and correctness, and then used GPT-4 to score candidate outputs against them. A good tip here is to show rather than tell: including examples of what a score of one or a five could look like really helps the evaluation process, so the model can appreciate the spread of the criteria. GPT-4 has effectively learned an internal model of language quality, which helps it differentiate relevant text from low-quality text, and harnessing this internal scoring mechanism allows us to evaluate new candidate outputs.

But when GPT-4 is too expensive or slow for evals, even after today's price drops, you can fine-tune a 3.5 Turbo model that essentially distills GPT-4's outputs and becomes really good at evaluating your use cases. In practice, this means you use GPT-4 to curate high-quality data for evaluations, fine-tune a 3.5 judge model that gets really good at evaluating those outputs, and then use that fine-tuned model to evaluate the performance of your application. This also helps reduce some of the biases that come with using GPT-4 alone for evaluations.
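A model-graded eval of the kind described above can be sketched as: build a judge prompt with an explicit rubric (the "show rather than tell" tip), send it to a grader model, and parse out a numeric score. The rubric text and function names are illustrative; the judge is injected as a callable, and in practice it would be a chat completions call to GPT-4 or a fine-tuned 3.5 judge.

```python
import re

RUBRIC = """Score the submission 1-5 for factual consistency with the
expert answer. 5 = fully consistent (e.g. same facts, reworded);
1 = contradicts the expert answer. Reply with 'Score: <n>'."""

def model_graded_score(question, submission, expert, judge):
    prompt = (f"{RUBRIC}\n\nQuestion: {question}\n"
              f"Submission: {submission}\nExpert answer: {expert}")
    reply = judge(prompt)
    match = re.search(r"Score:\s*([1-5])", reply)
    return int(match.group(1)) if match else None  # None = unparseable

# Stubbed judge for demonstration:
judge = lambda p: "Score: 4 (consistent, minor omissions)"
print(model_graded_score("When was GPT-4 released?",
                         "March 2023", "March 14, 2023", judge))
```

Returning None on an unparseable reply matters in practice: a judge that fails to follow the rubric format should surface as a harness error, not silently count as a pass or fail.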

Segment 6 (25:00 - 30:00)

The key here is to adopt evaluation-driven development. Good evaluations are the ones that correlate well with the outcomes you're trying to drive, or the user metrics you care about; they have really high end-to-end coverage, in the case of RAG; and they're scalable to compute, which is where automated evaluations really help.

So at this point you've built a delightful user experience, you're able to deliver it consistently to your users, and you're also able to iterate on the product in confidence using evaluations. If you do all of this right, you'll often find yourselves with a product that's blowing up and really popular; if the last year has shown us anything, it's that the consumer appetite, and even the internal employee appetite, for AI is quite insatiable. So you'll now start thinking about how to manage scale, and managing scale often means managing latency and managing cost.

With this, we introduce the final part of our stack, known as orchestration, where you can manage scale by adding a couple of additional mechanisms and forks into your application. Two strategies we've seen for managing cost and latency involve using semantic caching to reduce the number of round trips you take to our API, and routing to cheaper models.

The first of these is semantic caching. What semantic caching looks like in practice, from a systems perspective, is that you add a new layer of logic that sits between your application and us. In this case, if a query comes in asking, "When was GPT-4 released?", you first go to your semantic cache and do a lookup to see if you have anything cached. In this case you don't, so you just pass the request on to our API, and the API responds with something like March 14th, 2023. You then save that response in your semantic cache, which might be a vector database or some other type of store; the main point is that you're saving the "March 14th, 2023" response keyed by the query "When was GPT-4 released?", and then you pass the answer back to your users.

This is fine, but say a week or a month from now another request comes in, where a user asks, "GPT-4 release date?" This isn't the exact same query you had before, but it is semantically very similar and can be answered by the exact same response. So in this case you do a semantic lookup in your cache, realize you already have the answer, and just return March 14th, 2023 to the user. With this setup you've saved latency, because you're no longer doing a round trip to our API, and you've saved cost, because you're no longer hitting and paying for additional tokens.

While this works great, it can be a little difficult to manage, and there are often even more capable ways of managing cost and latency. This is where routing to cheaper models, and orchestration more broadly, really comes into play. When I talk about routing to cheaper models, often the first thing to think about is going from GPT-4 to 3.5 Turbo, which sounds great because GPT-3.5 Turbo is so cheap and so fast. However, it's obviously not nearly as smart as GPT-4, so if you were to just drag and drop 3.5 Turbo into your application, you'd very quickly realize that you're not delivering as great a customer experience. However, the GPT-3.5 Turbo fine-tuning API we released only two months ago has already become a huge hit with our customers, and it's been a really great way for customers to reduce costs by fine-tuning a custom version of GPT-3.5 Turbo for their own particular use case, getting all the benefits of lower latency and lower cost.

There was obviously a full talk about fine-tuning earlier, but in a nutshell, the main idea is to take your own curated dataset, which might be hundreds or even thousands of examples, describing how the model should act in your particular use case. You pass that curated dataset into our fine-tuning API, maybe tweak a parameter or two, and the main output is a custom fine-tuned version of 3.5 Turbo, specific to you and your organization, based on your dataset.

While this is great, there's often a huge activation energy associated with doing it, because generating the curated dataset can be quite expensive: like I mentioned, you might need hundreds, thousands, sometimes even tens of thousands of examples for your use case, and often you'll be creating them manually yourself or hiring contractors to do so. However, one really cool method we've seen a lot of customers adopt is to use GPT-4 to create the training dataset for fine-tuning 3.5 Turbo. This is starting to look very similar to what Shyamal just mentioned around evals: GPT-4 is at an intelligence level where you can just give it a bunch of prompts, it will produce a bunch of outputs, and those outputs can be your training set, with no manual human intervention needed. What you're effectively doing is distilling the outputs of GPT-4 into 3.5 Turbo.
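The distillation step just described can be sketched as assembling the training file: collect the strong model's answers to your prompts as JSONL in the chat format the fine-tuning API expects. The build_training_set name and the teacher stub are illustrative; in practice the teacher would be a GPT-4 chat completions call, and the resulting file would be uploaded with client.files.create(purpose="fine-tune") before starting a gpt-3.5-turbo fine-tuning job.

```python
import json

# Distillation: the teacher model's outputs become the training set
# for fine-tuning a cheaper model.
def build_training_set(prompts, teacher,
                       system="You are a helpful assistant."):
    rows = []
    for p in prompts:
        rows.append({"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": p},
            {"role": "assistant", "content": teacher(p)},  # teacher's answer
        ]})
    # One JSON object per line, the JSONL layout chat fine-tuning expects.
    return "\n".join(json.dumps(row) for row in rows)

# Stub teacher for demonstration:
teacher = lambda p: f"(teacher answer to: {p})"
jsonl = build_training_set(["What are today's 30-year rates?"], teacher)
print(jsonl)
```

This is the same curated-dataset shape described above, just generated by a model instead of by hand or by contractors.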

Segment 7 (30:00 - 34:00)

By feeding GPT-4's outputs into 3.5 Turbo so it can learn from them, you'll often find that, in your specific narrow domain, the fine-tuned version of 3.5 Turbo becomes almost as good as GPT-4. So if you do take the effort to do all of this, the dividends you get down the line are actually quite significant, not only from a latency perspective, because GPT-3.5 Turbo is obviously a lot faster, but also from a cost perspective. To illustrate this a little more concretely: if you look at the table, even after today's GPT-4 price drops, a fine-tuned version of 3.5 Turbo is still 70 to 80% cheaper than GPT-4. While it's not as cheap as vanilla 3.5 Turbo, it's still quite a bit cheaper than GPT-4, so if you switch over to a fine-tuned 3.5 Turbo, you'll be saving a lot on cost.

All right, so we've talked about a framework that can help you navigate the unique considerations and challenges that come with scaling applications built with our models, going from prototype to production. Let's recap. We talked about how to build a useful, delightful, and human-centric user experience by controlling for uncertainty and adding guardrails. Then we talked about how to deliver that experience consistently, through grounding the model and through some of the model-level features. Then we talked about consistently delivering that experience without regressions, by implementing evaluations. And finally, we talked about the considerations that come with scale, which is managing latency and cost.

As we've seen, building with our models has increased the surface area of what's possible, but it has also increased the footprint of challenges. All the strategies we talked about, including the orchestration part of the stack, have been converging into a new discipline called LLMOps, or large language model operations. Just as DevOps emerged in the early 2000s to streamline the software development process, LLMOps has recently emerged in response to the unique challenges posed by building applications with LLMs, and it has become a core component of many enterprise architectures and stacks. You can think of LLMOps as basically the practice, tooling, and infrastructure required for the operational management of LLMs end to end. It's a vast and evolving field, and we're still scratching the surface.

While we won't go into details, here's a preview of what this could look like. LLMOps capabilities help address challenges like monitoring and optimizing performance, helping with security and compliance, managing your data and embeddings, increasing development velocity, and really accelerating the process of reliable testing and evaluation at scale. Observability and tracing become especially important here, to identify and debug failures in your prompt chains and assistants, to handle issues in production faster, and to make collaboration between different teams easier. Gateways, for example, are important for simplifying integrations and can help with centralized management of security, API keys, and so on. LLMOps really enables scaling to thousands of applications and millions of users, and with the right foundations here, organizations can really accelerate their adoption. Rather than one-off tools, the focus should really be on developing these long-term platforms and expertise.

Just like this young explorer standing at the threshold, we have a wide field of opportunities in front of us to build the infrastructure and primitives that stretch beyond the framework we talked about today. We're really excited to help you build the next generation of assistants and the ecosystem for generations to come. There's so much to build and discover, and we can only do it together. Thank you.
