Build Hour: Agent RFT
Duration: 1:02:10


OpenAI · 10.11.2025 · 38,777 views · 666 likes


Video description
Agent RFT enables reasoning models to become even more powerful, tool-using agents by training directly on the workflows they will execute in production. By operating on agent rollouts, reasoning models can call tools, generate intermediate reasoning steps, and receive real-time feedback via customer-provided endpoints. This Build Hour will walk through the preparation, infrastructure, and safety oversight to use Agentic RFT. Theophile Sautory (Applied AI) and William Hang (API Engineering) cover:

• Improving agent performance with optimization and fine-tuning options
• Key differences between Base RFT and Agentic RFT
• New additions and how Agent RFT works
• Task setup and live demos training with tools
• Customer spotlight on Cognition with Sampriti Panda (Research Engineer)
• Success stories featuring Ambience, Genspark, Mako, and Rogo
• Live Q&A

👉 Agent RFT Interest Form: https://tinyurl.com/agentRFT
👉 Follow along with the code repo: https://github.com/openai/build-hours
👉 Sign up for upcoming live Build Hours: https://webinar.openai.com/buildhours/

00:00 Introduction
01:34 Intro to Agent RFT
11:12 Task Setup
14:15 Demos: Training with Tools
31:33 Best Practices
35:15 Customer Spotlight: Cognition
44:58 Success Stories
51:16 Summary
52:33 Q&A

Table of contents (9 segments)

Introduction

Hey everyone, welcome back to another Build Hour. I'm Christine, I'm on the startup marketing team, and today I'm here with Will and Theo. — Hey, I'm Will. I'm on the engineering team building the fine-tuning product. — And I'm Theo, a solutions architect working with startups, and with Will in particular quite a lot. — So today's topic is agent RFT, which is really exciting. If you've been tuning in to our past Build Hours, we did a series on agents all about how to build agents, starting with the Responses API and then working our way up to AgentKit, and now agent RFT. All of these past Build Hours can be found on our YouTube. The purpose of Build Hours is really to help you build on our API and use our tools. With that, I'll give you a quick snapshot of what this next hour will be all about. First we're going to intro you to agent RFT. Then we'll spend some time on the task setup and move on to some live demos. We have a really exciting customer spotlight today with Cognition, so we'll be dialing in, then we'll share some customer stories and end with Q&A. On the right side of your screen you'll have a Q&A box; feel free to toggle over and submit questions throughout the hour. Our team is in the room and joining virtually to help address these questions, and we'll save a few for the end to answer live. So with that, I will pass it off to Will and Theo. — Awesome. So let's kick things off by

Intro to Agent RFT

talking about agents. So you're probably joining us today because you're building an agent for your application or your business and you'd like to improve its performance. What makes an agent different from a regular model is its ability to interact with the outside world to complete a task. It doesn't have to go through you all the time or even talk to you; it just gets things done on its own. Now, in order to get things done, this agent has to have access to tools. If you're building a coding agent, for example, it needs access to a terminal, a code interpreter, or maybe even an entire codebase. Or if you're building a customer service agent, it might need access to internal software to look up customer records, or billing systems to issue refunds, or even the ability to escalate to a human being. So this agent needs a way to interact with your business context and the outside world to get things done through the use of tools. And the way we think about agents here is that all their interactions with the outside world go back into the context window. That means that after looking at what it sent into and then got out of a tool, the agent will reason to itself, call another tool, and then repeat the process. — Yeah, that's super cool. And so how does that tie in with our first-party products, our first-party agents? — Yeah, totally. So we do care a lot about agents here, obviously, and we're building some of the best agents for specific use cases. Here's how OpenAI agents use tools. For example, Codex has access to a wide range of tools to complete coding tasks end to end, like running tests, reading your code files, or making code changes. So Codex might have access to, say, a code planning tool, a terminal tool, or even a tool to apply git patches. Another example of a first-party agent that we've released is deep research, which is now embedded within our agent and GPT-5 products. Deep research has access to a browser, it can look through your files, and it can also run code. So for deep research, this set of tools allows the agent to deliver you the most up-to-date, most accurate research articles. — Yeah, that's super cool. And when we work with customers who are using our models and are interested in optimizing them, they work a lot on prompt engineering. What would you recommend to optimize your agents? — Yeah, prompt engineering is honestly a great way to start. We've seen many different ways to improve the performance of agents so far, so let's go through them. As you said, you can steer model behavior by optimizing the prompt; it's almost like instructing the model to do your task better. But let's say that you've optimized your prompt and you're still not as satisfied as you could be. You can then optimize the task itself. For example, you can simplify the task, you can add better guardrails around the task to improve the agent's chances of getting things right, or you can add or subtract tools, or make the tools better at accomplishing what the agent intended to do. — Yeah. It's interesting when you look at agents, because we've seen customers be successful at improving the performance of the agent just by changing the description of the tools. — Yeah. — And even their naming, just because it makes more sense — it's semantically easier for the model to understand. — Totally. Yeah.
So there's a lot that you can do to improve the performance of the agent before you move to fine-tuning. But let's say you've tried all these approaches and you still want better performance. That's where fine-tuning comes in. Fine-tuning is a way to train the agent end to end on your task to achieve even better performance. And what we're here to talk about today is agent reinforcement fine-tuning, or agent RFT. Agent RFT is the way to do this. Agent RFT changes the weights of the model according to a learning signal that you can specify, to teach the model what good behavior and less-than-good behavior look like. During training, the agent will go and explore many different ways of calling your tools, learning how to do better and better as training progresses. And we wanted to remind everyone that base RFT is already a functionality in the current fine-tuning product, but you cannot use it to fine-tune agents. Agent RFT does allow you to do this: it allows the agent to call tools while it's exploring during the rollout process, so it can learn from all possible ways of using your tools. You can also specify an arbitrary reward signal through an endpoint, which we call to train the model on, so that it gets better and better in ways that matter to you. And so to summarize the benefits of agent RFT: it helps you improve the performance of reasoning models. It improves the agent's ability to use tools and reach the best final answer. It's also quite sample efficient, which can be really important in domains where training data is scarce; we'll talk more about specific examples when we go through the customer stories. And the process itself can result in a model that has lower latency and is better on agentic tasks. — Okay, that's really cool. And when you mention latency, I think that's a very key point. Is it the number of reasoning tokens that drops, or the number of tool calls, or what is it? — Yeah, totally. So let's dive a bit into latency and ML performance and how we can improve both. One of the challenges of making agents work with your business context is that it might be very different from how we at OpenAI train our models. If your tools look and behave the same way as, say, Codex's tools that we've trained on, or deep research's tools, then you're in luck, because the domain of the tools is going to be similar between the base model and your task. But your business context is most likely specific to you. That means your agent might not be used to using your tools in the way that is ideal. It might call a tool too many times; it might call five different tools when calling one tool would have been better for what it was trying to do in a given moment. So using agent RFT you can align these domains. It's possible to train the model to use far fewer tool calls to achieve the same or sometimes even better performance on a given task. That means lower latencies for you and faster experiences for your end users. This process happens naturally because we apply a light penalty to the number of tokens the model uses to reason. — Okay. Yeah. — But perhaps you want to impose a constraint, because instead of this natural process of the model learning how to use fewer tool calls and fewer tokens, sometimes you want to make sure the model stays within a given tool-call budget and doesn't go over that limit.
So given how important tool calls are in affecting latency, this could really reduce the latency of your rollouts. Agent RFT allows you to specify this cutoff and train the model to stay within the given budget while preserving or exceeding the original ML performance. But ultimately you're probably here in the first place to improve the ML performance of your agent. Agent RFT can help you do this by, first, training the model to reason better across tool outputs, and second, training the model to use tools better in the first place. All of this is learned organically during the exploration and rollout process, as the model tries many different ways across the search space to call your tools and then thinks about the outputs from your tools to arrive at a better answer. So hopefully it hill-climbs nicely on your task. — Yeah, that's awesome, and I really want to try it out. I want many people to try it out, actually, and I know you worked so hard to make this work. — We worked hard. — Yeah, the whole team, and your whole team. Under the hood, how does it work? How does it communicate with the tools and so on? — Yeah. So, let's dive in. In order to make all this work, we've introduced several major new updates to the existing RFT product. First is the ability of the model to call tools during training via calls to your endpoints — your tool endpoints. And second is the ability for you to specify a grader in the form of an endpoint that we can call to get your custom reward signal out. These two additions mark the first time we've allowed our models to interact with the outside world during the training process — even our frontier models — through your tools as the model is exploring and doing its rollouts, and through your reward signal when we're ready to update the model. So to dive even deeper into exactly what's happening during the training process: for each agent rollout, we assign a unique identifier to all tool calls and final answers that come out of that rollout. When the agent calls your tools, we attach that unique ID to the tool call so that your system can recognize different tool calls as originating from the same rollout. This can allow you to keep track of rollouts as they happen, which could be important for state management if you choose. You can do this in your own database or in your own backend, so that when we emit the final answer and then call your grader, you can attach all the context from the agent to the final answer through that unique identifier, and then pass all of it into your grader — so you can have a very holistic grading context. — Yeah, that's awesome. I think what I find the most powerful here is really that all the tool calls and the grading happen in your environment. So it can match your production environment exactly, and then your model will just not be surprised when it sees that specific tool and will know how to call it. — And it also gives you so much flexibility in the grading. Currently on our platform we have a couple of ways of grading, but here, because you receive every tool call, you can store them, you can grade them, and really shape the policy that you want for your model. — Yeah, absolutely. So there's a lot of flexibility here, and we do hope that agent RFT helps you teach agents to achieve frontier performance on your tasks.
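To make that rollout-ID flow concrete, here is a minimal sketch of what a combined tool-and-grader service could look like, assuming a simple FastAPI app. The route names, payload fields (rollout_id, name, arguments), in-memory store, and tool-budget penalty are illustrative assumptions, not the actual agent RFT contract.

```python
# Hypothetical sketch: a tool endpoint plus an endpoint grader that correlates
# calls from the same rollout via the unique identifier described above.
# Field names and routes are assumptions for illustration only.
from collections import defaultdict
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
calls_by_rollout: dict[str, list] = defaultdict(list)  # state keyed by rollout id

class ToolRequest(BaseModel):
    rollout_id: str   # assumed: the per-rollout id attached to every tool call
    name: str         # e.g. "search", "list", "cat"
    arguments: dict

class GradeRequest(BaseModel):
    rollout_id: str
    final_answer: str
    reference_answer: str

def run_tool(name: str, arguments: dict):
    ...  # dispatch to your business logic (search, list, cat, ...)

@app.post("/tool")
def call_tool(req: ToolRequest) -> dict:
    calls_by_rollout[req.rollout_id].append((req.name, req.arguments))
    return {"output": run_tool(req.name, req.arguments)}

@app.post("/grade")
def grade(req: GradeRequest) -> dict:
    trajectory = calls_by_rollout.pop(req.rollout_id, [])  # full tool-call context
    reward = 1.0 if req.final_answer.strip() == req.reference_answer.strip() else 0.0
    reward -= 0.05 * max(0, len(trajectory) - 10)          # e.g. penalize exceeding a tool budget
    return {"reward": max(reward, 0.0)}
```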
So enough talk about theoretical things. Let's now illustrate how agent RFT works with a real-world example. We're going to fine-tune an

Task Setup

agent to perform better on FinQA (financial QA), a benchmark that gives a model a financial report and asks it to answer questions that require numerical reasoning. The original benchmark — this is actually a published academic benchmark — includes in its prompts the relevant financial report the model needs to answer the question. But we've decided to make things harder, because we like doing things the most difficult way here at OpenAI. We've modified the benchmark and made it a lot harder by only giving the model the question itself, without the context — no report. And we require it to use tools, like an agent would, to search for the correct report in a pile of 2,800 financial reports to answer the question. And to make the task even more challenging, we require that the model arrive at its answer within 10 tool calls. — Yeah. So that's so much harder, because you have to know where to look. Then, once you've found where to look, you have to reason over it, and all of this within a very constrained budget. — Totally. Yeah. So here are the tools that we've given the agent access to. We have a search tool, which is a semantic search tool. We have a list tool, which goes through all the directories and document paths and tells you what's in the file system. And we have this funnily named cat tool — our engineer brains just naming things the way we understand them — but cat returns a document given a path, so it's kind of like opening a document on your computer. Let's go through an example. Here's a sample question from the benchmark. The agent might call the search tool after seeing this question about, say, Intel's return. It might ask the search tool some query to try to find the relevant documents and information. And the search tool might return something like this, which has a table and text with all the relevant numbers it needs to answer the question. And here's the grader for this task, which is how we generate the reward signal for the agent's final answers. Just to keep things as simple as possible — since we've already overcomplicated things by making the benchmark harder — we used a model grader for this task. We could have used an endpoint grader, which is something we'll cover soon. We also could have used a string grader, which rewards the model for exact string matches to the ground truth, but that is super brittle and can penalize the agent for minor formatting errors, like writing out "32 dollars" instead of using the dollar sign — it penalizes the model in ways that we don't want. In our case we also want to give partial credit to the agent for answers that are really close to the ground truth, like rounding errors — say, 0.999 if the ground truth is actually one. — All right, I'm going to hand it over to Theo to talk about the demo and the training process itself. — Thank you. Well, that was a great setup. So let me dive into some code here; I'm just going to make sure the right screen is being shared. Sorry, that took a moment. — Yep.
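For reference, here's a rough sketch of what a model grader with partial credit along these lines could look like. This isn't the exact configuration from the demo; the prompt wording, template fields, and schema details are assumptions.

```python
# Illustrative model-grader configuration for the FinQA-style task described above.
# The exact prompt and field names used in the demo may differ; this only shows
# the partial-credit idea (1.0 for equivalent answers, 0.5 for close, 0.0 otherwise).
grader = {
    "type": "score_model",
    "name": "finqa_model_grader",
    "model": "gpt-4.1",
    "input": [
        {
            "role": "system",
            "content": (
                "Compare the predicted answer to the reference answer for a financial "
                "QA task. Return 1.0 if they are numerically equivalent (treat 7% and "
                "0.07, or $32 and 32 dollars, as the same). Return 0.5 if the prediction "
                "is close, e.g. off only by rounding. Return 0.0 otherwise."
            ),
        },
        {
            "role": "user",
            "content": "Predicted: {{sample.output_text}}\nReference: {{item.answer}}",
        },
    ],
    "range": [0, 1],
}
```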

Demos: Training with Tools

All right. So here you should be seeing the code. The first thing we're going to do is look at the tool server. We're using Modal to do this, because it was very fast to just set up a FastAPI endpoint, and we have a bit of description on how to set that up. The main idea is that we first set up a base image — this is going to be a Debian image — with FastAPI, pandas, NumPy, and OpenAI. Those are the libraries we need to run our code and to run the tools. We also add the corpus — all the data and the documents — so that the model can actually look into them. Here we define the different tools, and let me just look at the search tool, because as Will mentioned, it's a semantic search. We create some embeddings using an OpenAI embedding model and then compute the cosine similarity — so very similar to RAG, which you're probably all familiar with — and this is how we build the search tool, which is defined here. I'm not going in depth on the other tools, because list and cat are quite straightforward. And then the way we provide these tools to the platform and to the model is just through this list, where we have the JSONs of the tools: a name, which is going to be "search"; a URL, which is the Modal URL I just set up; and then a set of headers, which include an auth token so that only I can access those endpoints. — Oh, if you put your name in the URL, it's great. — Yeah, it's mine. — Yeah, it's yours. No one else's. — No. Yeah. So that's how we set up the tools, and then we can have a look at the grader. As Will mentioned, we're using a model grader, because in the data set the answers don't always have the same consistency in the number of decimals, or whether we put the dollar sign before or write "dollars" after. To prevent this brittleness we just use a model grader, and as Will mentioned, we provide a partial reward of 0.5 if the answer is close but not exact. This also allows us to give a reward of one if you say 7% instead of 0.07 as an answer. This is very important, because we want to make sure we provide the right signal to the agent, or else it's not going to be able to learn what was a correct reasoning path versus what was not. All right. And we're using GPT-4.1 as the grader model, and then the response format — you might be familiar with this from previous Build Hours or RFT engagements. — Right. And I also want to remind everyone that we used a model grader here, but we also have this endpoint grader you can use, where the endpoint grader is basically us calling your endpoint via the public internet so that you can define your custom reward signal. But in this case, for simplicity's sake, we just chose to go with a model grader. — Yeah, totally. Thank you, Will. All right. So now, what do we always do before running our training? You can imagine that we optimize a prompt, etc. What we're going to do is run a baseline to see how GPT-5 performs. And if you remember the reinforcement fine-tuning Build Hour we did with Prashant a couple of months ago, we were very interested in the variance of the model: given a specific sample, what is the variance of scores it gets for that sample? So I'm going to run those plots.
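A condensed sketch of an embedding-plus-cosine-similarity search tool like the one Theo just described is below. The placeholder corpus, embedding model choice, and route name are assumptions; the actual demo code lives in the build-hours repo.

```python
# Condensed sketch of the embedding-based "search" tool served as a FastAPI endpoint.
# The placeholder corpus, embedding model, and route name are assumptions.
import numpy as np
from fastapi import FastAPI
from openai import OpenAI

app = FastAPI()
client = OpenAI()

# Placeholder corpus; in the demo this is the pile of 2,800 financial report chunks
# with precomputed, L2-normalized embeddings baked into the image.
doc_texts = ["Intel 2016 annual report ...", "AMD 2017 annual report ..."]
doc_embeddings = np.random.rand(len(doc_texts), 1536)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

@app.post("/search")
def search(query: str, top_k: int = 5) -> dict:
    # Embed the query and rank chunks by cosine similarity, RAG-style.
    emb = client.embeddings.create(model="text-embedding-3-small", input=query)
    q = np.array(emb.data[0].embedding)
    q /= np.linalg.norm(q)
    top = np.argsort(-(doc_embeddings @ q))[:top_k]
    return {"results": [doc_texts[i] for i in top]}
```

The tools are then registered with the training job as a list of entries along the lines of name, URL, and auth headers, as Theo describes above.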
I actually ran the model three times on each sample from the training and validation sets. I'm showing validation here because it's just 100 samples; training is a thousand samples, which is a bit too large and hard to read. This is the plot you might have seen last time, and I'm going to describe it again. But how does it look to you, Will? — Yeah, I'm going to need some help interpreting this graph, because there's a lot going on here. — Yeah. All right. — Take it away. — For sure. So on the x-axis you have each different sample. For each sample we ran the agent three times, and on the y-axis we have the score. — So if you look at each point, the red cross is the best score it got out of the three runs. — So if you look at this sample here, it got zero every time. — If you look at the sample at the top right, it got one every time. And in the middle, sometimes they get zero, sometimes they get one, and sometimes they get 0.5. — Right. — So overall, the red cross is the best score, the thick light-blue bar is the mean over the three runs, and the thin blue bar is the variance. — And when I see a plot like this, I don't think it's a great plot for reinforcement fine-tuning, because many of the samples do not have variance. — Okay. — But we still have a fraction of them, probably 15%, in the middle that do have variance, and it is this variance that is going to enable the model to learn what a good reasoning path is versus what is not. — Yeah, totally. — And so we expect that all those samples will actually provide some signal to improve the performance of the model. — Right. — And very importantly, this is on the validation set, but you can trust me that the distribution on the train set is kind of similar. — Yeah, totally. And maybe this is a good point to talk about the compute multiplier, because the compute multiplier controls the amount of exploration that the model does. Over three repeats, each data point is being explored three times; there are just not enough samples to hike those zero scores up into the blue-bar region. But if we set the compute multiplier higher, so that the model explores more, it has more chances to get some nonzero reward out of its exploration. So that's where the exploration really matters. — Yeah, totally. All right. So now that we've seen this, we also share a very simple notebook that you can run through, which will let you run the training on our platform using our API. Now let me actually go and find one of the training runs we did. All right. So now we're on the OpenAI platform, which you might remember in some ways, and you can see here the job that we ran — it has a number — and we're going to explore all the hyperparameters we used and then see the curves for rewards, output tokens, tool calls, and so on. Very high level: I ran for three epochs, meaning we go through each sample three times. The batch size was set to 16, and as Will mentioned, there's a compute multiplier, which is a very important number for the amount of variation we will observe during training; here I've set it to one.
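For reference, launching a run with the hyperparameters Theo just listed might look roughly like the sketch below. The reinforcement method and hyperparameters follow the public fine-tuning API, but the agentic pieces (hosted tool endpoints) are part of the limited-access alpha, so those field names, the file IDs, and the model snapshot shown here are assumptions.

```python
# Rough sketch of creating the training job with the hyperparameters above
# (3 epochs, batch size 16, compute multiplier 1, medium reasoning effort).
# The commented "tools" block is part of the limited-access agent RFT alpha,
# so its exact shape may differ; file IDs and model snapshot are placeholders.
from openai import OpenAI

client = OpenAI()

grader = {  # minimal placeholder; see the model-grader sketch earlier on this page
    "type": "score_model",
    "name": "finqa_model_grader",
    "model": "gpt-4.1",
    "input": [{"role": "user",
               "content": "Score 0-1: predicted {{sample.output_text}} vs reference {{item.answer}}"}],
}

job = client.fine_tuning.jobs.create(
    model="o4-mini-2025-04-16",        # or a GPT-5 reasoning snapshot where available
    training_file="file-TRAIN_FINQA",  # placeholder file IDs
    validation_file="file-VAL_FINQA",
    method={
        "type": "reinforcement",
        "reinforcement": {
            "grader": grader,
            "hyperparameters": {
                "n_epochs": 3,
                "batch_size": 16,
                "compute_multiplier": 1,
                "reasoning_effort": "medium",
                "eval_samples": 2,
            },
        },
    },
    # tools=[{"name": "search", "url": "https://<your-endpoint>/search",
    #         "headers": {"Authorization": "Bearer ..."}}],  # assumed alpha-only field
)
print(job.id)
```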
If we want more variation, use more compute, and have more chances of stumbling on good reasoning paths, then you would bump up this compute multiplier. But you also have to remember that you are hosting the endpoints, and during training we're going to hit those endpoints. So if you increase this compute multiplier, you're going to have to increase the robustness of your endpoints as well. — Yeah. — Right. — Totally. — And I'm using reasoning effort medium, and eval samples is the number of times we evaluate each point from the validation data set, to have robust curves during training. All right. So let's have a look at this reward curve. You can see at the very beginning we start at a baseline of around 0.6 validation reward. This purple curve is the score on the validation set — the full validation set, run twice per sample as per eval samples. The green curve is the model's performance on the specific batch you're training on. Here we have a batch size of 16, so this value here at step two, 0.461, means that across all the trajectories we ran over those 16 samples in the batch, we have an average reward of 0.461. It's less representative and less robust than the validation curve, because the validation curve is on the full validation data set. — Right. — And what we can see here is that very rapidly, in just 10 steps, the model improves its performance from 0.59 to 0.63. — Yeah. — That's quite a lot. And it probably directly learned how to use the tools much better. — Yeah. — And if you go on a bit longer, you can see that the reward goes down a little bit, and the assumption is that the model is exploring new solutions to try to push the performance even more — and at the very end it manages to push a little higher. — Yeah. And I always love correlating the reward with the mean reasoning tokens, because here you can see that in the big exploration phase in the middle, where the reward was actually going down, the model was starting to think more and more. — Right. Yeah. — And sometimes it's just not necessary to think more; maybe you just have to learn how to use your tools better, or use different tools. And this is what I love about the UI: we also show the tool calls per rollout, which shows how the distribution of the number of tool calls evolves over the rollouts. You can see at the very beginning we use a total of probably eight or nine tool calls, and then it drops quite significantly, to much lower numbers. So you can see that the performance gain we saw after 10 steps is also correlated with a huge drop in tool calls. — Yeah. And you can assume that the model is just learning to use those tools much more efficiently. — Yeah. — And I think that's really awesome, because it shows how we're closing the distribution shift in just those 10 steps. Of course, 10 steps is a number that worked here; maybe in your case it will be more, maybe less, but it's very interesting. And then you can see this whole region of interest, where the reward was going down and the number of reasoning tokens was going up, is a region where the tool calls were definitely shifting. And if you go to the very end, you see that the model starts doing a lot of list calls. I'm not exactly sure why, but this allows it to reach higher rewards, and it kind of converges to a policy — you can see where list becomes kind of flat and all those lines become more or less flat. — And I think that's very interesting.
As a business, you might have stopped after 10 steps, because you don't want to plateau for too long. — Mhm. — But it's always interesting to see what happens beyond that. — Sure. — All right. So those are the high-level curves I wanted to explore. There are many other curves, such as the number of tokens per tool call, and that will also give you a sense of the speed of the training run: the more output tokens, the longer it will take to train the model. But let's go back to sharing the code, because right now we've done a very high-level analysis. Where is this? All right. Yeah. Sorry. So, we've done a very high-level analysis of what happens. But because we have access to these models, we can just observe the traces in depth and try to understand what is happening under the hood. So I've loaded all the results. I ran the evaluations three times on the validation set for the baseline model and for the step-10 model, where we saw the big increase in reward and the decrease in the number of tool calls. And what I'm going to look at, as Will mentioned, is the performance, but also the latency and the output tokens. So let's do some quick plots here — you'll find the code to do all of these. Here is a very simple plot: you have the average reward over the 3 × 100 samples, and the average latency. What you can see is that, well, technically you want to be at the top left, right? You want higher reward and lower latency. And that's what we get going from the baseline to step 10. We have a five-second reduction, which is approximately 10%, and we have an 11 percentage point increase in reward, which is quite significant. — Yeah. Wow. — And latency — sometimes it's a bit hard to look just at latency. What we can also look at is the number of tokens, because it gives you some information on the time it will take as well. And you can see the mean tokens went from about 2,500 to 1,500. — Wow. Yeah. — A huge reduction. — Yeah. Huge reduction. That's probably from less reasoning and fewer tool calls. — Right. Right. — All right. So now that we saw the high level, let's look into the tool calls per trace. I also ran something here to compute the means, and you can see that for the baseline model we were around 6.9 tool calls per trace, and for the fine-tuned model we're only at 4.2. So that means a smarter, faster model that is closing the distribution shift. — Yeah. — And if we look more in depth, we can actually make a quadrant plot, which is a plot that I really love to make after having run RFT, to understand what's really happening in the model behavior. Let me walk you through it very simply. — You and the plots, man. Yeah, I think it's a great way to analyze it, absolutely, to get an understanding of the policy change. — So here on the y-axis you have the delta reward: we take the reward of step 10 minus that of the baseline for all of our data points. And on the x-axis we have the delta in tool calls, step 10 minus baseline. And the quadrant where you want to be is again the top left, because you want higher reward and a lower number of tool calls. And you can see that we have 29 points in this region, which means that a large fraction of those 100 samples are just faster and higher reward.
Then we have another interesting quadrant, the one where there is no delta in reward but a decrease in the number of tool calls, and that's also quite a big fraction — probably more than 50 samples, because here we count all of them. And then there are some samples where the model starts to lose reward while doing fewer tool calls. — And this kind of highlights that even if we learned a policy that's fairly general, we might not be able to capture all of the data points, because this policy might be a bit too strict for some of them. — Yeah. — So that's the trade-off, but we don't have any point in the bottom-right corner, which is the one we really don't want: more tool calls and lower reward. So I'm quite happy with how the model has trained and how the policy changed. — And we can also skim through some of the traces in even more depth, into all of those tool calls. What's interesting to see is that the model has learned to use each of the tools better. On the first line here, this is the number of calls per specific tool, and you can see that it drops for search, for cat, and for list — so it's really general, which is quite cool. And finally, something more about the model's policy — though this would require even more work — what we can look at is whether the model is being a bit smarter in the way it uses the tools, and how often it repeats exactly the same tool call with slightly different parameters. Here I was looking at the sequences of tool calls — so sometimes it does search and then cat and cat — and you can see that the number of repeats drops significantly, from 1,000 to 500. We just divided it by two. So the model is much better at making the right tool call the first time, so it doesn't have to repeat the exact same call with slightly different parameters. — Right. — And if we look in depth — this is a very cherry-picked example — you can see the baseline model calling the search tool six times in a row before doing list, cat, another search, and cat, whereas the fine-tuned model just follows a very simple policy of search, list, and cat, and then it probably just reasons over the output to provide a final answer. — Yeah, totally. — Yeah. And I just want to add that on this benchmark, for the documents in the train set and validation set, there's no overlap in the documents required to answer the questions. However, the model is still operating on the same pile of documents. So it still has access to basically the entire file structure, but for the questions that are asked, there's basically no overlap in the documents that are required. So in some ways it learns how to use this file system and knows what documents are inside, but ultimately the documents are kept separate. — Yeah. — Yeah. That's pretty cool. — So that would probably match your business use case, where you have an existing corpus that you want the model to learn how to use better. — Yeah. Totally. — All right. Yeah. Cool. So that was a quick demo. You'll find the code if you want to go through it and run it. And at a very high level, we also have some advice on how to be successful with agent RFT. The first one is: you need a well-specified and constrained

Best Practices

task, and this is mainly in the sense that you need consensus from people who have domain knowledge — or aesthetic judgment, for some visual tasks. There should be one real answer, in a way, or people should agree on what the good answer is. This is very important because you want to send a signal to the model that is consistent, and not say in one example "answer A is good" and in the next "answer A is not good," because then the model will get confused and will not learn how to reason better. — Yeah. — The second one is nonzero baseline performance. We saw this initially in our variance plot: you need the model to sometimes be right, or else, if it's just never right after running, say, 100 times on the same sample, it will probably never learn — especially if that's the case across your whole data set. — Right. — And then there's improvement accuracy at k. That's very interesting: if you run multiple times for each sample and then, instead of looking at the average performance, you look at the performance of the best trajectory for each sample, that gives you some information on the variance and on how often the model gets it right. During training, we're going to nudge all the trajectories to match those best trajectories. And technically you can then bootstrap on this, because the model will generalize across other samples, bring in some new reasoning patterns, and probably keep pushing — and you can do that multiple times. — And finally, quality over quantity. In this example we used 1,000 training samples, which is quite a large number. I've done a lot of engagements with far fewer samples, probably 150, and we've been quite successful. Again, the idea is really about how good the data is, and you don't want any mixed signals going to the model. — Right. Yeah. — All right, so that was the performance side, what you do beforehand. Now, on the infrastructure side, really related to our product: what you want to do is mirror production behavior. You have the opportunity to host your tools, so just go for it — make them very similar to production, so that everything you're improving during training will actually translate to your product. The second part is investing in designing your grader. The grader will really affect the way the model behaves, and so the model policy, so it's very important to have it be aligned with your domain knowledge and to make it hard to game and to hack. This is very hard, so as soon as you get something that is even a little bit hard to game, you should go for it and try it. And preferably have some gradient, as Will mentioned: if it's just binary, just a string check, that will be difficult, by the nature of many of these problems, which don't have a first-order-logic yes-or-no answer. — Right, and you want to give the model partial credit, right? — Exactly, yeah. — And you want the model to know that it was going in the right direction. — Yeah, exactly. Here, in this case, we could have thought of adding some reward for reading the right file, so it knows it was reading the right file and maybe just the reasoning was wrong. — Right. It's like teaching a person, kind of. — Yeah, that's a little bit how I think about it.
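As an aside on the "improvement accuracy at k" idea above, here is a small sketch of the check: rerun the baseline agent k times per sample and compare the mean score with the best-of-k score. The data layout is an assumption for illustration.

```python
# Sketch of the mean vs best-of-k check: run the baseline k times per sample,
# then compare average accuracy with the accuracy of the best trajectory per
# sample. A large gap (and per-sample variance) suggests RFT has signal to use.
import pandas as pd

# One row per (sample_id, run), with the grader score of that trajectory.
runs = pd.DataFrame({
    "sample_id": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "score":     [0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
})

per_sample = runs.groupby("sample_id")["score"]
mean_acc = per_sample.mean().mean()            # average over runs, then over samples
best_of_k = per_sample.max().mean()            # "improvement accuracy at k"
has_variance = (per_sample.std() > 0).mean()   # fraction of samples with any variance

print(f"mean: {mean_acc:.2f}  best-of-k: {best_of_k:.2f}  variance fraction: {has_variance:.0%}")
```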
All right. And then, lastly, limit the tool call output length, because otherwise it's going to make your training very slow and it's also going to confuse the model. So if you can work on outputting only what is necessary, that will make it more efficient across the board. And I think that's also very reasonable for any agent — they'll do just fine. — Totally. And it saves you money, too. You don't want to shove tons of useless tokens into the context window. — Yeah. — So, neat. — All right. — Awesome. — Cool. — Thanks so much for that. That was incredible. — No, thank you. — Great analysis and all the plots and charts. — Thank you, Will and Theo, for walking us through that. Really excited about

Customer Spotlight: Cognition

this next segment: our customer spotlight. We're going to hear directly from research engineer Sampriti Panda at Cognition. So, please welcome Sampriti. — Hey everyone. I'll quickly take over the screen sharing. Is that good? — Yeah. — As long as it's coming from your computer. — Yeah, it is. — Sweet. — Yes. Thank you, Theo, Will, and Christine. So hi everyone, I'm Sampriti, I work as a research engineer at Cognition. At Cognition we build Devin and Windsurf. Devin is an autonomous AI engineer that works independently on solving tasks in your codebase. As part of my work at Cognition, I work on improving models to make parts of Devin smarter. So I'm really excited to share what we've been working on with Will and Theo using the agent RFT feature. One of the tasks in Devin: when you give an initial query to Devin, the first thing it does is go into a planning mode to try to figure out what it needs to do to actually solve the task. And from a UX perspective, we don't want the agent to spend too much time in planning mode, because we want Devin to start working and showing edits to the user as soon as possible. So one of the motivations was: can we fine-tune GPT-5 or other frontier models so that they get to this editing stage as quickly as possible, while still maintaining or even improving accuracy? The way we designed this task is: given the initial user prompt, we restrict the set of tools available to Devin — in this case just read file and shell, because we don't need to make any edits at this stage — and we let the agent explore and figure out which files to look at and which files to edit to solve the task. The motivation for the shell tool is so that the model can run commands like grep and find, to search the codebase for certain strings the user might have put in the query, or just to look for certain file names, things like that. And as they mentioned earlier, we obviously need the tool calls, and then the data set and the reward. For the data set, we collected a bunch of real-world repositories, collected user queries from those repositories, and then labeled which files the user actually edited to solve the task. Ideally, we want this sub-agent to return those exact files, so that later on the agent can continue and make the edits to those files. And for the reward, we use the F1 score metric. F1 score balances both precision and recall. This is because if we just used precision or just recall, the model would either be very conservative and only return a few files, or return too many things to try to get everything. Obviously we want the balance, so that the agent that comes along afterwards does not have its context polluted with too much data. So yeah, we can get to the eval results. We started with the GPT-5 base model being somewhat lower than the current frontier. And we ran two experiments: one was GPT-5 with a smaller data set of around 100 samples — so 100 tasks across varying repositories.
And then a larger experiment with around a thousand samples. One thing we tried to maintain was that the set of repositories would be distinct, or disjoint, between the train and eval cases, because we wanted to make sure the model wasn't just learning things about the data set. Ideally, when we use this in real life, the trained model will have never seen certain repositories, because they will be private repositories. As you can see, even with the smaller data set it already beats the base model by quite a lot, and with the larger data set we get an even further boost. The plan action score here is basically the F1 score: we take all the files the model looked at — and at the end the model outputs what it thinks the right files to edit should be — and we compare that with the labeled ground truth. During the experiment, some of the things we noticed are that the model starts learning how to do a lot of parallel tool calls. If you look at the traces, the first action the model takes will kick off something like eight different things — listing the repos, grepping for things — and then, once it gets the results from those tool calls, it will independently explore all of them by again running more parallel tool calls. And usually, because running a tool call such as read file is quite a bit faster than the actual model inference, it helps a lot that these back-and-forths are reduced. In the eval, for example, when we put this in Devin directly, we noticed that originally, on the baseline, getting to the end of the planning mode would take around 8 to 10 back-and-forths with the model, but with the fine-tuned model we would be done in about four back-and-forths. So that cuts the time roughly in half. Obviously, sometimes the model could learn to run a tool call that takes a longer amount of time, so we do try to penalize things like doing too many tool calls, because that takes a lot of time. And also during training we had to penalize the model if it took too long, because we don't want the model to keep exploring and never be satisfied. So yeah, we noticed that with this agent RFT feature, we can push an already-frontier model like GPT-5 even further on a specialized task when we have a clear reward we want to optimize for. For the infrastructure, as mentioned earlier, we run both the tool calls and the grader as remote endpoints. The way the training works is that at every step the platform sends us a bunch of rollout requests. So given a certain sample, the model does the rollout, and there are around 32 copies or so for each. For each rollout we spin up a new VM, run the tool calls in this VM, and the results are given back to the platform. And at the end, when we get the final answer, we call a grader endpoint where we compare the trajectory.
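A minimal sketch of an F1-style reward like the one Sampriti describes is below; the function signature and file paths are assumptions for illustration only.

```python
# Minimal sketch of the F1-style reward described in the Cognition spotlight:
# compare the set of files the agent proposes for editing against the labeled
# ground-truth files. Payload shapes here are assumptions.
def f1_reward(predicted_files: list[str], ground_truth_files: list[str]) -> float:
    pred, truth = set(predicted_files), set(ground_truth_files)
    if not pred or not truth:
        return 0.0
    tp = len(pred & truth)
    precision = tp / len(pred)
    recall = tp / len(truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the agent over-returns one file and misses one.
print(f1_reward(["src/app.py", "src/db.py", "README.md"],
                ["src/app.py", "src/db.py", "src/models.py"]))  # ≈ 0.67
```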
So we look at the list of all the tool calls the model made in that particular rollout, as well as the final answer, and we give it a score based on the labeled ground truth. In this case we decided to go with isolated VMs because, as you remember, we used a shell tool, so the model could decide to do some destructive actions. We didn't want one rollout to affect the other rollouts in case the model goes crazy and runs rm -rf or something like that. And we used VMs because we could reuse the production Devin VM infra, where we give every Devin instance a VM, but I think containers work well for this purpose as well. Some of the interesting things we noticed were that the RL is quite bursty. At the beginning of every step, they would send us something like 500 new rollout requests, so you definitely need to handle that, because that's 500 new VMs starting at the same time. And then the other kind of footgun is that sometimes, let's say there's an infrastructure error and the VMs fail, what ends up happening is the model gets zero reward, because the tool calls fail and the model can't figure out what's going on. And while that's not the model's fault, it does lead to the training kind of collapsing, or the model learning in a bad way, because even if the model did something good, it got a zero reward. So it is good to have a lot of monitoring for tool call failures or abnormal issues with the model. Sometimes it's just the model having formatting issues and not calling the tool correctly, but other times it could be our infra issue. But anyway, thank you to Will and Kathy for helping me debug all of these issues — and all the other people. — Yeah, for sure. — Yeah. It's been really exciting to try this agent RFT feature and be able to push the performance of GPT-5 even further. — Yeah, thank you. — Yeah, it's been incredible. Thank you so much for working with us so closely on this, and really glad to ship an agent to you that's beating the state of the art, it seems. — Do we have time for questions for this part? — Yeah, let's go through some success stories too that we can share, and then we do have quite a few questions, so I want to make sure we have enough time for those. — Sounds good. — Thanks, Sampriti. We will catch you later. — Thanks so much. Thank you. — Bye-bye. — All right. So, yeah, we've seen a great success story with Cognition and Sampriti, and

Success Stories

we just wanted to showcase some others to show the versatility of agent RFT. So let's start with one where I worked really closely with Ambience — closely with all of them, really. — Yeah, that's true. — Ambience is a healthcare company; they're embedded in some hospitals, and one of the tasks they look at is ICD-10 coding, and they have an agent for this. ICD-10 coding is the work you have to do when you want to do billing after a session with a patient: you have to map the topics, the illnesses — the diagnoses, actually — that were discussed to codes, and those codes are very precise, and there are about 70k of them, so it's quite a hard task. What we're looking at here is: we have a transcript between the doctor and the patient, and automatically from that transcript we want to propose the right ICD-10 codes. This requires a lot of nuanced understanding of the discussion, but also a lot of medical reasoning, and that's why Ambience looked at GPT-5 and was using GPT-5. One other aspect is that it has to be quite fast, so they use GPT-5 with low reasoning because of the way doctors use it. If you look at the plot on the right-hand side, we started with GPT-5 low hovering around a 0.5 F1 score. We then built that agent, which has a tool that searches over those ICD-10 codes, and we RFT'd that model, and you can see the jump from 0.52 to 0.57. It might look small, but the actual highest performance you can get is around 0.7 to 0.75, because we're looking at a task that is slightly subjective, and doctors agree or disagree on what the actual codes are. So this is really a significant jump for them. And not only are we seeing this increase in F1; the fact that we are fine-tuning, as we've seen during the whole session, also allowed them to reduce latency, so there's an 18% reduction in average response time, which halves the number of samples that are above their latency threshold in the product. So that was a great use case, and it was great working with Brandon, Patrick, and the team. Then, another use case — very different, no more healthcare — here we're looking at Genspark's slide-creation agent. Genspark has amazing agents, and one of them is an agent that builds slides. The user communicates with the agent, which makes different tool calls, and at the very end those slides are sometimes not aesthetically very pleasing — there's a bit too much text, or they're too long — and therefore they use a reasoning model to try to harmonize the output. This is what we fine-tuned on. What was great working with Flame and the team was that they worked a lot on their model grader, and on different types of graders, to judge both the content and the visual aspect of it. And they were extremely happy with the output. I also find those slides on the right-hand side quite pretty. In terms of numbers, it provided an 88% improvement on bad cases over the existing models, which is a significant number that we're very happy with. — So yeah, Will, do you have other use cases? — Yeah, absolutely, man. We should use Genspark for the next Build Hour slides; these look great. So, moving on, just to show you how diverse the success stories with agent RFT can be: we worked closely with Mako to build a GPU-kernel-writing agent.
So Mako is building agents to write these GPU kernels, which is really difficult for LLMs because there's a scarcity of training data out there compared to other domains — especially for new hardware. If NVIDIA puts out a new accelerator, there just aren't enough examples of performant kernels. But using agent RFT, as few as 100 PyTorch prompts were enough on their own for GPT-5 to learn how to write fast kernels for a new hardware platform, as long as you have a good grader, which is what Mako worked really hard on. And that allowed us to not need any code examples at all to start writing these really performant kernels. The fine-tuned model actually beats the state of the art by 72% in writing correct and performant GPU kernels, which is a huge boost. — And lastly, we worked closely with Rogo. Rogo is building a financial reasoning agent. It's capable of reading financial filings, extracting investment insights, and then supporting human analysts through a question-answering interface. They wanted to fine-tune o4-mini to summarize and present key findings from earlier steps in their finance workflow. Rogo is really interesting: they used a custom LLM-as-a-judge grader that was accessible via an endpoint. So we called their custom grader, which measures the agent's factual accuracy, reasoning, completeness, financial soundness, and clarity of explanation. So you can see how you can fit a lot of your own criteria and your own rubrics into your own custom grader, which is part of the power of the agent RFT platform. The results are fantastic, with a 21% increase in core ML performance, much lower hallucination rates, and fewer missing citations. I also want to call out that Rogo did a ton of work in making their grader unhackable. Earlier runs showed that the model actually started reward hacking. What happens with the RFT process sometimes is that if you have an edge case in your grader, the model is super smart and sneaky and will find ways to exploit that grader. So it's really important to make sure that your grader is pretty watertight, and that's what Rogo did. They made their grader watertight, they detected the hack, and as a result, the true performance that you're trying to optimize the model on just started shooting up. — Yeah, I remember there was one Rogo run where we came back to the platform we showed earlier, and the average reward on validation was just one. — Yeah. — A little... — 100%. A little too good. — That's too good to be true, yeah. — All right. Yeah. So that was it for the customer stories. — So, let's wrap up. Let's wrap up.

Summary

So just to summarize, let's talk about when to turn to agent RFT. The general process we recommend is: first, you want to make sure that you build a really high-quality data set where your training and eval sets closely match your production traffic. You want the agent to not be surprised when you go from fine-tuning it to actually exposing it to showtime. Second, you're probably on this journey of improving your agent's performance, so you want to figure out what the baseline performance looks like, so you know where to improve from. You probably want to run these baseline evals against GPT-5 or whatever models you like using. And then from there, you want to try to optimize the performance without fine-tuning, because that's often one of the easier ways to get better performance. You might want to adjust other parts of the task, like improving your prompts, improving your infra, improving your task harness. And then, after you've squeezed all the juice out of the task and the base model, that's when you turn to agent RFT to further optimize and start changing the weights of the model to be fundamentally better on your task and your domain, in an end-to-end fashion. — Awesome. — Yeah, thank you, Will. — Okay, great. Let's move on to Q&A. We have a ton of questions, so I wanted

Q&A

to make sure we had enough time for them. So maybe let's just go to the next slide and we can tackle them. — Cool. — Yeah. — What kinds of tasks are best suited for agent RFT? — You want to take... okay. All right, I have a take on this. Obviously, we've explored and explained a lot of ways in which you should structure your data set, or make sure that your data points have enough variance in them, so that during exploration the model actually knows what the difference between a good and a bad data point looks like. So fundamentally, you want a train set where, hey, you still haven't squeezed all the juice out of it, but the model, given enough exploration, can figure out what good performance looks like, so that it can hill-climb. That's one thing I would say. Another thing is: there's the task itself, but then there's the way that you're evaluating and grading it. And if the way you're grading it is binary, then it's going to be really hard for the agent to hill-climb, or gradually get better and better, and improve on that task. So you want to make sure that, yes, you have the task itself, but you also think about how you're evaluating it. Those two things generally lead to quite a bit of success. Do you want to talk about domains, or other things? — Yeah, in terms of domains, it was quite surprising — I think we showed it with the customer stories — but it's really widely applicable. So I see agent RFT, as you presented very clearly at the beginning, as something for any type of agent. As soon as you have an agent that uses tools that are out of distribution of what we trained our frontier models on — which will naturally happen, because you have your own tools — then this is really an opportunity for you to dive in on agent RFT. And if there's a lot of reasoning associated with it, that's even better: use GPT-5 or a very strong reasoning model. — Totally. — Yeah. — Totally. All right. What's different about the RFT platform now compared to when it debuted in May? Well, I'll take this one, even if Will is the one who built it — obviously with the whole team. — And a big team. — What I really like about agent RFT now is, well, there are multiple things. In May, we were only able to fine-tune o4-mini with a very specific set of graders. Now you're able to fine-tune o4-mini or GPT-5 with tools and with endpoint graders. So the flexibility is just incomparable, and you can tackle so many new tasks. And what is great here is that with this, you can actually create features in your product that just did not exist before — the model was just not good enough — and now you have a path, other than prompting, to help the model use those tools well and build the product that you actually want. For me that's the most important part. And then there's been a lot of work on the observability of the platform, those different curves, and the stability. All of this has improved a lot, and that's great. — Totally. Yeah. And I also want to emphasize that with agent RFT, we're now in this multi-step RL paradigm. With the original RFT platform, you give the model the prompt, the model thinks for a while and then spits back an answer to you. But now you can actually do this multi-step thing. The model is now in a loop with your world and your environment, and all of that actually gets trained on, end to end, with your grader.
So — yeah, that's awesome. — Christine, do you have any other questions? — Yeah, a few more. Why don't we go to the next slide; I was refreshing during that. If you're able to refresh, otherwise I can read them out loud. — Let's see: someone is curious why RFT is sample efficient. — This one, right? — Yep. — Do we have one after this? — Yeah, there's more. — Okay, we can run a bit over. — All right, sounds good. So, why is RFT sample efficient? This is really a question about why RL in general is sample efficient. We could talk at length about that, but fundamentally the model is generating its own training data through the exploration process. Remember how we talked about the compute multiplier, how the model explores the search space? It's actually generating its own training data through that sampling process. When you give it the grade, the reward at the end, that tells the model how well it did on the trajectory rollout it generated, and that ends up being the thing we train the model on. We do a bunch of reinforcement learning machinery over that trajectory: we take your reward, your grade, and apply it over the rollout in certain ways. But ultimately the model is generating its own data. — I'd also add a note on the fact that we're fine-tuning a frontier model. The prior you're working from is very strong already and probably already has some success on your task. — All of this variance only works because that model is sometimes able to get it right. — Exactly. — So it's really the power of the prior. And as this continues to accelerate, when there's a new, stronger model and you can run RFT on it as well, you can expect it to be sample efficient again, because the prior keeps improving. — Yeah, the model is generating really good data for itself basically because the prior is so good, as Theo said. All right, let's see what the next one is. How does the RL training objective function differ between general RFT and agentic RFT? — I can take this one first. There's a difference between the actual RL loss function and the reward. The reward is what we allow you to specify, whether that's a reward function native to our platform, for example the string-check grader we discussed earlier or the model grader, or your own endpoint grader. Then we do things with that reward; that's the RL, the reinforcement learning loss, which might be what the question is actually about. That part doesn't differ between general RFT and agentic RFT; we don't change the loss. We may try new things as we continue our research and research engineering to deliver even bigger model gains, but for now there's honestly no difference. The main difference is that you can now define much more flexible reward functions. — Yeah, and I think that's super important. When you look at base RFT, you don't have access to the chain of thought; you only really have access to the final output. Whereas with agentic RFT you have a lot of information on the traces, and Sampriti even mentioned grading those tool calls. So you have much more control over the policy you want to see from your model after training. That's one big difference to me. — Totally.
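Since the answer above distinguishes between platform-native graders and an endpoint grader that can see the whole trace, here is a minimal sketch of what such a grading endpoint could look like: a small HTTP service that receives a completed rollout and returns a scalar reward, with partial credit for intermediate tool use. The request and response shapes, route name, and tool names are illustrative assumptions, not the documented Agent RFT contract.

```python
# Hypothetical sketch of a custom grading endpoint. The request/response
# shapes and the "search" tool are illustrative assumptions only.
# Run locally with, for example: uvicorn grader:app --port 8000
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ToolCall(BaseModel):
    name: str
    arguments: dict

class GradeRequest(BaseModel):
    final_answer: str
    tool_calls: list[ToolCall] = []
    reference_answer: str

class GradeResponse(BaseModel):
    reward: float  # scalar reward in [0, 1]

@app.post("/grade", response_model=GradeResponse)
def grade(req: GradeRequest) -> GradeResponse:
    reward = 0.0
    # Credit the trajectory: did the agent call the expected tool at all?
    if any(call.name == "search" for call in req.tool_calls):
        reward += 0.3
    # Credit the outcome: does the final answer match the reference?
    if req.final_answer.strip() == req.reference_answer.strip():
        reward += 0.7
    return GradeResponse(reward=reward)
```

Because a grader like this sees the tool calls as well as the final answer, it can reward intermediate behavior in the way the speakers describe for agentic RFT, rather than only scoring the final output.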
We have a little bit of time left; should we keep going? — Yeah, let's go through them. We can run over if people are free to stay. — Sounds good. — Cool. With RFT, when there is a new response from the model, does the new model learn automatically from the previous ones? Is this about domain shift, for example where you train the model on one set of data points but then evaluate on another? — I'm not exactly sure, but the way I see it, it ties in with the idea of the different trajectories that we run: technically, any new response generated by the model is a new trajectory, and we do leverage this when we compute the objective. So my answer would be yes. — Yes, during training. — If that's the right understanding, then yes, during training. But when you're doing inference, we're not using what you generate during inference to go back and continuously make the model better. Not yet, at least; someday. — All right. — Great question. — Next one: are the alpha endpoints used in the code available to everybody? — I guess it depends on which alpha endpoints we're talking about. — The alpha endpoints for the tool integration? — Okay, so no, they aren't, and that's where the Agent RFT interest form comes in. This is functionality that we're exposing and that is more or less, I don't want to say private beta, but it's the type of functionality where we want to work with you to make sure you get the most success possible. So this is something where we'd love for you to talk to your friendly neighborhood account executive or account director at OpenAI to see how we can work together. — Yeah, and that question is such a good segue. — Exactly. So we can wrap up with some resources on the right here. Feel free to explore these, but if you're interested in learning more about Agent RFT and specifically working with our team, check out the link, the tiny URL, fill out the interest form, and we will be in touch. And with that, I will share one more slide on the upcoming build hours. Join us on December 3rd for agent memory patterns. This will be the last build hour of our agent series, but many more build hours are to come; check out our homepage below and we'll keep adding them. So thanks everyone, and we'll see you next time. — Thank you. — Yeah, thank you. Bye.
