# Build Hour: Reinforcement Fine-Tuning

## Metadata

- **Channel:** OpenAI
- **YouTube:** https://www.youtube.com/watch?v=YWLOo_fc5oA
- **Date:** 03.09.2025
- **Duration:** 59:47
- **Views:** 12,682
- **Source:** https://ekstraktznaniy.ru/video/11272

## Description

Reinforcement fine-tuning (RFT) lets you improve how models reason by training with graders instead of large labeled datasets. This Build Hour shows you how to set up tasks, design grading functions, and run efficient training loops with just a few hundred examples.

Prashant Mital and Theophile Sautory (Applied AI) cover:
- Intro to RFT: optimization, fine-tuning options, RFT benefits
- Task setup: prompts, graders, and training and validation data
- Live demo: building and running RFT for a classification task
- RFT workflow: from dataset selection to evaluating and iterating
- Customer spotlight: Accordance uses RFT models for tax and accounting workflows (https://accordance.com/)
- Live Q&A

👉 Follow along with the code repo: https://github.com/openai/build-hours
👉 RFT Cookbook: https://cookbook.openai.com/examples/reinforcement_fine_tuning
👉 RFT Use Case Guide: https://platform.openai.com/docs/guides/rft-use-cases
👉 Sign up for upcoming live Build Hours: https://webinar.openai.com

## Transcript

### Segment 1 (00:00 - 05:00)

Welcome to OpenAI Build Hours. I'm Christine on the startup marketing team, and today we brought back Prashant and have Theo joining us for the first time. — Hi. — So today's topic is reinforcement fine-tuning, or RFT. But before we begin, we'll start with the goal of Build Hours, which is to empower you with the best practices, tools, and AI expertise to scale your company using OpenAI APIs and models. At the bottom of the screen you have our homepage, and this is where you can find all of our resources as well as watch any past Build Hours on demand. I'm going to give you a snapshot of what you can expect in the next hour. First we're going to give you an intro to RFT. We'll go through all the benefits of RFT, optimization, and then actually setting up a task. And then my favorite part is the demo: we'll actually be live building, and I'll drop the link to the code repo in the chat so you can check it out and follow along. Then we have a really exciting customer spotlight today. We have the co-founder and CEO of Accordance in the room with us, and he's going to be coming on and sharing a real-world application of RFT. And finally, we will end with Q&A. David will answer some questions, and then Theo and Prashant will be here to answer questions as well. Our team is also in the room, so feel free during the hour, if anything comes up, to drop your questions in the Q&A chat. You have to toggle a bit between the chat and the Q&A, but this is where you can submit your questions and we'll answer them during the hour. So Prashant, Theo, feel free to take it away. — Great. Thank you, Christine. All right. So let's start by orienting ourselves around where fine-tuning, and reinforcement fine-tuning more specifically, fits into model customization. If you're trying to improve the performance of your LLM app, there are really two broad levers that you have. The first is to improve what the model knows.
This can be handled with context engineering techniques such as better prompting or retrieval-augmented generation. The second lever is to improve how the model reasons, and this is where fine-tuning comes in. So if you're missing knowledge, start with prompting and RAG. But if the model knows the facts but still struggles to apply them or reason about them accurately, that's where fine-tuning comes in. We remind everyone that fine-tuning is an investment, so you must make sure you've squeezed everything you can from the other levers that I just mentioned, and only then should you reach for fine-tuning. Today we offer three techniques for fine-tuning on our platform. The first one, supervised fine-tuning, allows you to feed prompt-and-answer pairs so the model can learn a fixed pattern. This is great for simple classification tasks or making the model adhere to an output pattern. Structured outputs also does this, but supervised fine-tuning can be useful for things like custom code formats. The next technique is called preference fine-tuning. Here you give the model examples of better and worse outputs, and the model learns to mimic the tone and style of the better outputs over the ones you don't prefer. This is great for marketing copy, chatbots, and places where personality is important. And today we're going to focus on reinforcement fine-tuning. Here, instead of labeled answers, you actually give our system a grader. A grader is basically a rubric or a rule that allows the system to score responses for accuracy. The reinforcement fine-tuning system explores the solution space, grades the different solutions it comes up with, and improves itself. This technique has been found to be super powerful for policy compliance, legal reasoning, and medical workflows: really any domain where reasoning matters. So let's quickly touch on the benefits of reinforcement fine-tuning before digging into the details.
RFT is unique because it's the only method today that can be applied to reasoning models, and reasoning models, we believe, are the future. It's also data-efficient: you only need tens to hundreds of examples to get started with reinforcement fine-tuning, and it's much easier to get signal on whether your task will improve by applying this technique. And finally, it doesn't require manually labeled outputs. As I mentioned, it operates using graders, so you can focus more time on figuring out a grading strategy, which is also really useful for other things like evaluation, rather than sinking a lot of resources into building large labeled data sets, which are quite expensive to curate. Today, teams are using RFT to replace

### Segment 2 (05:00 - 10:00)

complex policy pipelines with a single reasoning agent. They're also using it to improve compliance checks by training on real policy logic, and to boost accuracy in medical coding tasks with expert-verified graders. Next, we'll have Theo show us how it works under the hood. — Yeah. Thank you, Prashant. So now let's have a quick look at what happens under the hood and some aspects that actually make RFT so sample-efficient. When you look at RFT, you start with your task, which is your data set, probably your prompt, and your answer. Then you choose a model; here we'll be focusing on o4-mini, which is available on the platform right now. And then there is the grader that we've mentioned. Under the hood, the way the algorithm runs is that it samples the same example multiple times, each time producing a different reasoning path and a new answer. The model can then compare all those different answers and tell what is good and what is bad. So one example actually provides a lot of information and a lot of insight into which reasoning path is good to follow and which is less so. And by training on tens or hundreds of samples, you actually improve the reasoning, and so the generalization, on that specific task. And that's why we love RFT. — Nice. So this is really different from the other techniques we talked about, because one sample in those techniques is just one training example, and here we're pulling so much more signal out of a single task. — Exactly. We get many trajectories out of it, and that helps us be efficient. — Great. All right, now let's look at the task setup. We're going to present our data set first. The data set that we're using here is for a classification task: given a legal text, we have to predict which EuroVoc level-one classes it belongs to. Those are the highest-level, broadest thematic categories in the EuroVoc thesaurus.
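The sample-and-grade loop described above can be sketched in a few lines of Python. Everything here is an illustrative stand-in: the label list, the fake sampler, and the toy F1 grader are assumptions, meant only to show how several graded rollouts of one example yield a relative training signal.

```python
import random

def grade(predicted, reference):
    """Toy F1 grader over label sets (not the real RFT grader)."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def sample_rollouts(k=4, seed=0):
    """Stand-in for sampling k reasoning paths / answers from the model."""
    rng = random.Random(seed)
    labels = ["employment", "European Union", "trade", "finance"]
    return [set(rng.sample(labels, rng.randint(1, 3))) for _ in range(k)]

reference = {"employment", "European Union"}
scores = [grade(r, reference) for r in sample_rollouts()]
baseline = sum(scores) / len(scores)
# Each rollout's advantage over the group mean is the learning signal:
# above-average reasoning paths get reinforced, below-average ones suppressed.
advantages = [s - baseline for s in scores]
```

This is why one example carries so much more signal than a single supervised pair: the comparison across rollouts, not any single label, drives the update.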
And this is a multilingual controlled vocabulary maintained by the Publications Office of the EU. There are 21 different classes, and each sample can have one, two, up to six or even more classes assigned to it. — Cool. Let's look at one of those samples. — Yeah. Awesome. Let's do that. So here you'll see on the left-hand side there's a sample, a very short sample; it's the shortest we could find in the data set. And on the right-hand side, the reference answer, where you can see there are two classes, which are employment and European Union. If you want to follow along, you can find the data set on Hugging Face. We put the link here, but it's also in the code when we run the data-set exploration. — So now that we have the data set, how are we going to evaluate it? — Yeah, we talked a lot about graders already, so let's preview a little bit of how the grader is going to work by talking about the metrics that are important for a classification task. This is probably not new if you have been doing machine learning for a while, but just to catch everyone up, the two metrics we really care about in a classification task are called precision and recall. Instead of giving the academic definitions, some of which are up here on the screen, I'd really love for us to build some intuition for what each of these means. Precision can be thought of as a measure of, out of the predicted labels, how many are actually correct. The other metric we care about is called recall, and this measures how many of the actual labels we found. Did we return all the labels that we expected for a specific sample, or did we miss some? And then finally, on the last line here, we have an F1 score. An F1 score is a composite mean of these two values, which weights the lower value slightly higher. Right?
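The F1 formula mentioned above, and the related F-beta family, can be written out in a few lines. A minimal sketch (the specific precision/recall numbers are made up for illustration):

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """General F-beta score: beta=1 is the usual F1 (harmonic mean);
    beta=2 (F2) weights recall more heavily than precision."""
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With precision 0.5 and recall 1.0, F1 is pulled toward the lower value,
# while F2 rewards the high recall more.
f1 = f_beta(0.5, 1.0)            # 2/3
f2 = f_beta(0.5, 1.0, beta=2.0)  # 5/6
```

The same function covers both the F1 used as the composite grade and the F2 variant discussed a moment later, just by changing `beta`.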
So we expect to be bottlenecked on a classification task by whichever of precision or recall is lower, and therefore we've chosen this measure. The reason it's important to have the F1 measure here, even though it's less intuitive than the other two, is that the reinforcement fine-tuning system really requires a single grade for each training sample. So while we will measure precision and recall separately, as you'll see in a bit, we need to provide a single composite score, and F1 allows us to do that. That being said, there are valid situations where you might use a different way to combine the precision and recall metrics. — Yeah, totally. If you value recall more than precision, you might use an F2 score, which balances the two terms a bit differently, and you can train with that. — I'm going to have to ask you what that is later. — Okay. — All right. So, let's dive into the demo. We've seen at a very high level what our data set looks like, we have the grader, and we know what model we're going to look at. So because this is a machine learning task, we need to do some data exploration, and in particular

### Segment 3 (10:00 - 15:00)

with reinforcement fine-tuning, the quality of your data is really paramount: it's going to determine whether you can hill-climb on your data set or not, and whether you can generalize beyond the samples you see or not. So I have set up some quick code. We're just loading the data set; you can probably do that on your end as well. And we'll build the items that are expected for the Evals platform and for RFT. This is a very simple example where we have an item, and we give it an ID, a text input, and a reference answer. And sorry about that: just before, when I loaded the first sample, I wanted to show you what it looks like under the hood in Hugging Face. So you have the text, and then you have some categories as well, which I completely messed up, but here are the categories. You have IDs in the categories, and those IDs represent the EuroVoc concepts. They are just identifiers, and that's not ideal for the model because they have no semantic meaning. So what we want to do next, which I skipped over, is actually transform those IDs into names. The reason you want names is that the model is going to be able to reason over them much better than over IDs. Does that make sense? — It does. Yeah. Although I have to say, I might enjoy the challenge of having to guess an ID like 0147 for one of these. — I'll let you do that in the future. Right now we're focused on trade, finance, and all the other classes that exist. So I'm splitting the data set: I'm selecting 150 samples randomly from the data set. The reason I choose 150 samples, 100 for training and 50 for validation, is to show that RFT can really be sample-efficient. With only 100 samples, we can actually get a decent improvement. And the 50 samples in validation are mainly to have a robust indication of how it performs on held-out data. — This data set we're using is quite large, right? How many samples does it have in total?
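The ID-to-name transformation and the item shape described above can be sketched like this. The mapping values and field names are illustrative assumptions, not the real EuroVoc table or the exact item schema from the demo:

```python
# Hypothetical slice of the EuroVoc ID-to-name mapping (real IDs differ).
ID_TO_NAME = {
    "4416": "employment",
    "1005": "European Union",
}

def build_item(idx: int, text: str, concept_ids: list) -> dict:
    """Turn a raw sample (text + opaque concept IDs) into an eval/RFT item
    with semantically meaningful label names the model can reason over."""
    names = sorted(ID_TO_NAME[c] for c in concept_ids if c in ID_TO_NAME)
    return {"id": f"sample-{idx}", "text": text, "reference_answer": names}

item = build_item(0, "Council resolution on employment policy ...", ["4416", "1005"])
```

Swapping names in for IDs is the whole point: the model can connect "employment" to the text, whereas "4416" carries no signal.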
— Yeah, in total I think it has 7,000 samples, but it's across 23 languages, so it starts to scale up. — Nice. So if we see signs of life with these 100 samples, we can presumably make a much larger training run after that, right? — Totally. And probably get even better results. — Nice. — All right. So now that we've created and sampled those 150 samples, we can check out their distribution. I forgot one cell, which is this one. Exactly: we're counting the frequency of all the categories, and we're going to plot their distribution. As you can see here, that's the distribution across all the categories and the number of times we see them in the train split. Does that look good to you, Prashant? — I mean, it's definitely a distribution. What are we looking for here? Would we want to have an equal number of occurrences of each of these labels? — Yeah. So ideally it would be an equal distribution throughout for all of the categories. The reason is that if we train the model on this type of imbalanced data set, it's going to artificially increase the score by only predicting trade or agri-foodstuffs. — I see. — And therefore it will not be able to generalize across all the categories. But we want a more general model. — Is this what they mean by reward hacking? — A version of it. Yeah. — Good to know. So now let's go into a balanced sampling strategy. We cooked up some code quite rapidly that just does balanced retrieval and sampling, and I can show you the output of what this provides: a much cleaner, much more balanced data set. Now we can be quite confident that if the model learns, it will learn general aspects that work for all of the classes. — Nice.
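A balanced sampling strategy along these lines can be sketched as a greedy loop that always tops up the least-represented label. This is a rough reconstruction under assumed item shapes, not the exact code from the demo:

```python
import random
from collections import Counter

def balanced_sample(items, n, seed=42):
    """Greedily pick items so every label gets roughly equal coverage.
    `items` are dicts with a "reference_answer" list of label names."""
    rng = random.Random(seed)
    pool = list(items)
    rng.shuffle(pool)
    counts = Counter()
    chosen = []
    while pool and len(chosen) < n:
        # Labels still available in the pool, least-represented first
        # (ties broken alphabetically for determinism).
        available = sorted({l for it in pool for l in it["reference_answer"]})
        target = min(available, key=lambda l: counts[l])
        pick = next(it for it in pool if target in it["reference_answer"])
        pool.remove(pick)
        chosen.append(pick)
        counts.update(pick["reference_answer"])
    return chosen
```

Because each item can carry several labels, perfect balance over 21 classes isn't always achievable; a greedy pass like this just keeps any one class from dominating the way trade did in the raw split.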
While we're on this topic of the test and training split, maybe one other thing to consider is: would we do this type of balancing only for the training set? On the validation side, I think we have a couple of choices in terms of how we think of that data set. — Yeah, exactly. That's a great question. When you do validation, you can have different metrics that are interesting to you. Either you evaluate across all the categories equally and see how it performs for all of them, or you want to represent how your model will perform in production. In production you'll probably have a distribution closer to what we saw earlier, so you can compute your F1 score there, but you can also compute it on the more balanced data set. So you have two values that are interesting; they each hold different information: one of them, how general is the model, and the other, how would it work in production? You can do both, and I would really push you to do both. — Yeah. So you can have an idealized F1 score and then maybe a more practical, real-world F1 score. — Yeah. They're called macro and micro. — Okay. — Thanks. — All right then. Let's go back. — Let's move on. So we've previewed what the data set looks like

### Segment 4 (15:00 - 20:00)

and we've gone through some initial analysis to balance the data. Let's quickly look at how we will be grading the results of the reinforcement fine-tuning. The first thing we will do is translate the formula we saw a couple of slides ago for precision into some Python code, and this Python code is going to be our precision grader. A grader is essentially just a blob of code which will be executed inline while our training is running. So there's really nothing fancy going on here. The one thing I will say is that on line three we have this if statement, which is accounting for a particular edge case, and the reason we want to have this in here is to make our grader robust. Oftentimes in these systems something can go wrong when you're sampling from the model, and the model may not return exactly what you're expecting. In those scenarios you still want the grader to return a valid score between zero and one, so we're just handling some of those edge cases. The recall grader is quite similar, so I'm just going to breeze through it. And then next we have the prompt that we're going to be using. I know, Theo, you did a lot of work to come up with this prompt, so walk us through the process and maybe break down this prompt for our listeners. — Yeah, for sure. So, now that we have the model we're going to fine-tune, the data, and the grader, it's really important to do some prompt optimization, as we showed on the first slide. The prompt that I've built looks relatively simple. We provide some context, we describe the expected output response format, and then we give the 21 classes to the model so that it has the canonical names from which it can choose. Finally, we give a set of guidelines.
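A defensive grader of the kind described above might look like the following. The payload shapes (`output_json`, `reference_answer`) are assumptions about what the harness passes in; the point is the edge-case handling that keeps every return value a valid score in [0, 1]:

```python
def precision_grader(sample: dict, item: dict) -> float:
    """Precision over predicted label sets, hardened against malformed output."""
    try:
        predicted = set(sample["output_json"]["level_one"])
        reference = set(item["reference_answer"])
    except (KeyError, TypeError):
        # The model returned something unexpected: emit a valid
        # worst-case score instead of crashing the training run.
        return 0.0
    if not predicted:
        return 0.0
    return len(predicted & reference) / len(predicted)
```

A grader that throws mid-run can waste an entire training job, which is why the defensive branch matters more than it looks.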
I've iterated many times on the prompt, and some versions were more complex, having few-shot examples, but nothing really beats this one. This one is also quite short, which I like when I want to run RFT, because first it doesn't bias the model too much, and secondly it will also be faster to run because all the samples will be shorter. — Yeah, this is really important, because we want to use a similar prompt at inference time to what we use in training. So whatever prompt you pick right now is actually quite important, because changing it later may not be the best idea. — Exactly. Totally correct. So let's have a look at how we run the evaluation. — Yeah. — All right. So we'll move back to the code. Here we've built a pipeline notebook that you can follow along with, and you can actually even reuse it for your own data sets if you follow the way we built the items initially. So you could have a look in there and just reuse it. We have some boilerplate code that I'm going to skim through, but now comes the more important part: we want to run the evaluation for the prompt optimization on the training set. So let me just select the training set and then define the response format. We are predicting level-one codes, so I've built a Pydantic-based model that lists level-one codes. As you can see here: level one, then code, and you can see environment or agriculture. I've printed this one so that you can have an understanding of what it looks like. — And just so everyone is clear, we're just using the structured outputs feature of our API here. — Exactly. — Right. And this just ensures there's 100% accuracy on the format of what the model outputs, so we can always parse it and we can always apply these graders reliably. — Exactly. The sampling at inference is constrained, and the output will always follow this structure. — Okay.
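The demo builds the response format from a Pydantic model; since the full model isn't shown, here is a stdlib-only sketch of the equivalent structured-outputs JSON schema. Names like `level_one` and the truncated class list are assumptions:

```python
# Stdlib-only sketch of the structured-output schema the Pydantic model
# would compile down to (field and schema names are assumptions).
LEVEL_ONE_CLASSES = [
    "trade", "finance", "employment", "European Union",
    # ...the rest of the 21 EuroVoc level-one classes
]

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "level_one_prediction",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "level_one": {
                    "type": "array",
                    "items": {"type": "string", "enum": LEVEL_ONE_CLASSES},
                },
            },
            "required": ["level_one"],
            "additionalProperties": False,
        },
    },
}
```

Constrained sampling against a schema like this is what guarantees the graders always receive parseable output, so the defensive branches in the graders become a backstop rather than a common path.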
Another important thing we're doing here is saving the response format, which we'll be able to use directly in the RFT as well. — Oh, that's really interesting. So tell us the importance of using the same graders. It's really great that the platform supports that, but why does it matter? Why should we care about it? — Yeah, totally. So let me show you those graders. You will find exactly the functions that you presented previously. Here we stringify them so that they can be shared through the API, and you can find the precision, the recall, and an F1 grader. Now I create the objects, and those objects are going to be shared between the Evals platform and the RFT platform. So if you eval here with this grader, then the RFT will learn to hill-climb on the exact same evaluation. That's very critical, so that you know you're comparing apples to apples, and you know that if you see an increase in performance on the validation set, it's a valid increase and not due to something different. — Yeah. It's also really great that you engineer this once and then you get to reuse it. It's so much higher leverage for a team to invest time in engineering their grader, because you can reuse it across both of these. — Totally. Here we have very simple graders, but you can use an LLM as a judge and build very complex rubrics. — All right. So let me save those graders. I'm saving them, and I'm also loading the prompt that we created initially. This utility allows you to switch between prompts quite rapidly, so

### Segment 5 (20:00 - 25:00)

that's the main point of it. And now we can build the evals and the eval object. Here is a quite important section where we actually define the model that we want to evaluate. We'll be evaluating o4-mini because it's the model that we'll be fine-tuning, and I use the response format and choose a low reasoning effort: low because I assume the task is not that complex, the input prompt is not very long, and the documents are not that long either, so the model can probably reason very rapidly over it. — Also, this is something you tried, right? Once you had the eval, you could just run it against all three reasoning efforts, and we actually saw that low was actually better. — It was. — And it's cheaper and faster, so why not? — The model was overthinking with medium or high, so let's keep low here. Now I'm just setting up the evals, and let me actually run one. When I run this, we're using the Evals API, and you can see that it sends us a link to the Evals platform. So let us navigate to that link and see what we see over there. All right, it's coming up. So here we are on the Evals platform, and you can see the run I did (log codes, which is what we're doing), with version seven of the prompt, and we have the different testing criteria, which are the graders that we've built. If I click on one, I can see the exact Python code and a pass threshold. This pass threshold is something we use on the Evals platform to say whether the performance of the model on that sample is acceptable or not. Here I set 0.8; it's a bit arbitrary, but you can set whatever you want. — Yeah, maybe we can look at one of these which we ran in the past, just to show folks how it looks, right? — Yeah, totally. So if we go back, you'll see all the past runs that we've done, and let's jump onto this one and inspect the results for variance run D1.
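The "stringify them so that they can be shared through the API" step mentioned above can be sketched by keeping the grader as a source string, one source of truth for both the eval and the RFT job. The surrounding config dict is an assumed shape, not the exact API payload:

```python
# The grader lives as a source string so it can be shipped through the API;
# exec-ing it locally gives us the same callable for local checks.
F1_GRADER_SOURCE = '''
def f1_grader(sample, item):
    """F1 over predicted vs. reference label sets."""
    predicted = set(sample.get("output_json", {}).get("level_one", []))
    reference = set(item.get("reference_answer", []))
    tp = len(predicted & reference)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(reference) if reference else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
'''

namespace = {}
exec(F1_GRADER_SOURCE, namespace)
f1_grader = namespace["f1_grader"]

# The same string is registered with both the eval run and the RFT job
# (the exact payload fields here are assumptions).
grader_config = {"type": "python", "name": "f1", "source": F1_GRADER_SOURCE}
```

Because both systems score against the identical source, an improvement on the validation set is apples-to-apples, exactly the point made in the conversation.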
So here we are on the platform, and you can see that all the examples show up. It's very easy to navigate and have a look at what was my text input, what is the reference answer, and what is the output. Personally, what I often do is drag the reference answer column next to the output, so I can have a quick look through each example very easily. — Yeah. And this is really useful when you're iterating on your prompts and when you're trying to maybe change the output schema, etc. When you're iterating with the system rapidly, this UI is pretty useful, right? — Yeah, totally. And here you can see the pass/fail based on the threshold that we've defined, but you can also see the mean score here on the right-hand side. And if you're really into it, you can use the thresholds here to only see a subsample of your data set that satisfies, or not, one of your conditions. — Great. All right. So we can go back. Unfortunately, it hasn't finished running yet. Often it does; so, that's live demos, I guess. But let me interrupt it. All right, I've interrupted it. — And we luckily ran this before, so we have something to show. — Yes, of course, we have many samples that we ran before. So let me show a bit of an evaluation that we've done. — Whoa. Okay, there's a lot going on in this plot. Do you want to walk us through it? It looks really exciting. — Yes, totally. It looks a bit like magic right now, but we'll walk you through why this is important and what's happening here. So you can see this plot: on the y-axis you have the score, so it goes from zero to one, as Prashant described earlier. On the x-axis you have all of the different samples. — So when I run the evaluations, I always run them more than once; I often run them three times, and in this case we have nine runs. The reason we do this is that o4-mini is inherently stochastic, and the output can vary a lot.
Sometimes it gets everything correct, sometimes everything wrong. So it's very important for us to see what the variance is on each of those samples, and this is what we're looking at here. If you look at the cross, the cross is the best score that the model has reached for that specific sample over the nine runs. — Okay. So on the one we're hovering the cursor over right now, it seems like we actually got a perfect score at least once throughout the runs. — There was one run where the model actually got everything correct, right. — And then the blue bar that you can see there is the mean over those nine runs, and finally the gray bar is the variance. — Okay. — And this is very important for us. — Yeah. So maybe help us build some intuition for why it's important, and what are we looking for here?
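The best / mean / variance decomposition described above is easy to reproduce. A small sketch over per-sample score lists from repeated eval runs (the scores themselves are hypothetical):

```python
from statistics import mean, pvariance

def sample_stats(scores_by_sample):
    """Best, mean, and variance of each sample's scores across repeated runs,
    mirroring the cross / blue bar / gray bar in the variance plot."""
    return {
        sid: {"best": max(s), "mean": mean(s), "variance": pvariance(s)}
        for sid, s in scores_by_sample.items()
    }

# Hypothetical scores from three runs: s1 is high-variance (headroom for RFT
# to pull the mean toward the best), s2 is already saturated.
stats = sample_stats({"s1": [0.0, 0.5, 1.0], "s2": [1.0, 1.0, 1.0]})
headroom = {sid: v["best"] - v["mean"] for sid, v in stats.items()}
```

The gap between best and mean is the quantity RFT is trying to close, which is why samples like s2, already perfect every run, contribute little training signal.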

### Segment 6 (25:00 - 30:00)

— Yeah, for sure. So if you recall how we looked earlier at what happens under the hood during RFT, we saw that the model follows multiple trajectories and then grades them: some of them are graded well, some less well. This variance actually allows the model to learn what is good versus what is less good. And in this example, having this variance also shows that the model can fine-tune its reasoning pattern to fit the better scores and start reasoning better every single time. — Yeah. And we're also seeing a continuous distribution, right, between zero and one. It means that in this set of training data we have all sorts of examples: ones that return 0.3, ones that return 0.6, and all the way up to one. So you have a relatively continuous training signal. Great. So maybe one heuristic, one mental model for this, is to think about RFT as a way to pull the mean of each of these samples up closer to the max, and if that's something we can achieve, it's going to be a successful outcome. — Yeah, totally. — And hopefully we'll see that later today. Okay. All right. So now we've looked at our data set, we've chosen our prompt, we've defined the graders, and we've even checked through this plot that our data set shows signs of life for RFT and that there is headroom to learn. So let's just go on and start the RFT job. — Finally. — All right. So let me run this little boilerplate code. Now we can select our prompt, v7, and we're in the correct project.
Now what I'm going to do is push the training and validation sets to the API platform. This is slightly different from what we do for the evals, because for the evals we share all the examples without the prompt, and then we share one prompt so that we can easily evaluate different prompts. Whereas for RFT, when you share one example, you share the full example, including the prompt. — Right. — So that's why we need to push new files. — Yeah, and what that means is that in a single set of training data you could mix different prompts. — Exactly. — Right. So if you have dynamic parts of the prompt where you're doing RAG ahead of time, you can just pre-fill those and upload that as a training batch. — That's it. And I do think it's a great feature to have more general models, and then you can be in distribution with different prompts. — Right. All right. And now you can see we load the same response format as previously, and we can load the grader that we defined earlier. That's very useful, so that we don't have to build and test it again. And now, if we validate it, well, it gets validated, because we've already built and used it. So we know it's correct, but it's good to check. — This is useful because you don't want your training runs to fail because your grader had some sort of formatting issue. — Exactly. — Right. And that's quite common, because you have to stringify Python, which is not the best thing to look at. — Exactly. So now we're ready. We can choose the hyperparameters; we'll walk through them on the platform. So let me just run this and create a job. And now we have the created job, we have the link showing up, and let's switch to that link. All right. So, Prashant, do you want to walk us through the platform? — Yeah, let's do it. Okay. So there's a lot going on. I'm just going to hide this left sidebar, so we have a little bit less visual noise here.
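The "full example including the prompt" point made above can be sketched as a JSONL writer. The chat-message shape is the familiar one, but treat the exact RFT file schema as an assumption:

```python
import json

SYSTEM_PROMPT = "Classify the legal text into EuroVoc level-one classes."  # abbreviated

def to_rft_line(item: dict, system_prompt: str = SYSTEM_PROMPT) -> dict:
    """One RFT training line carries its own full prompt, so a single file
    can mix different prompts (e.g. with RAG context pre-filled per example)."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": item["text"]},
        ],
        "reference_answer": item["reference_answer"],
    }

def write_jsonl(items, path):
    """Serialize items one JSON object per line, as training-file uploads expect."""
    with open(path, "w") as f:
        for it in items:
            f.write(json.dumps(to_rft_line(it)) + "\n")
```

This contrasts with the Evals upload, where the examples are stored without a prompt and a single prompt is attached separately so that prompt variants can be compared cheaply.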
And let's increase the size a little bit. Okay. So at the top of the right-hand panel we have the current job that we just kicked off. We can see some information about the job, like the ID; the ID is really useful to track what the status of the job is, and so on. And then we see this listing of hyperparameters. We're going to double-click on a couple of these because they're important, but let's skip over them for now. We also have the training and validation files linked here for ease of use. Now, we notice that in this particular job we are actually still validating the files. There are a few steps the jobs go through: you validate the files, then you actually start doing the training, and then there are multiple steps of training, so it takes quite a while. Of course, we could foresee this, so we did some more training ahead of time that we can look at live. As you can see, this training run has succeeded. We also have this additional key here which says output model, and this is really the model slug that we can use to run inference against. You use all the standard APIs; you just pass this wherever you used to pass o4-mini as the model, and now you're talking to your fine-tuned model. So it's really simple. We've used the same hyperparameters for this previous run as for the one we just kicked off, so this is actually a pretty close comparison, because we're also

### Segment 7 (30:00 - 35:00)

using the same seed. This is very close. Yeah. — It should be relatively the same. — Okay. And then we also have these three checkpoints that are emitted by the pipeline. We'll talk a little bit about how these are emitted and what to do with them, but you can think of these as model checkpoints along the training run. These are different sets of model weights that you may want to run your evals against and see how they perform. — Yeah. And in particular, those checkpoints are actually the best scores that you've gotten over time. — Got it. So the top three. — That's the top three. — Okay. — And that's quite great, because sometimes if you have a very good version at an early checkpoint but then the model keeps on training, it would be very frustrating to not be able to access it. So we save it and provide it to you. — Yeah, that's really useful. Okay, let's start walking through some of these charts, because I think that's the most fun part of the Build Hour today: trying to break apart what some of these charts are telling us. So the first one here, let's blow it up a little bit. This is the reward curve through the training process, and we can see two things plotted on this chart. But before we go there, let's look at the axes. On the x-axis we have the training step; each step of training, essentially each time we update the model weights, is a step. On the y-axis we have the raw value of the reward function, so it's the numeric score that the F1 metric returns. And then if we look at each of these lines, the green line is the reward returned by the training set, or the training batch actually, and the purple dotted line is the reward of the validation set.
So one thing you might notice is that the green, or training, reward is actually much spikier, a bit more discontinuous if you will, compared to the purple curve, which is much more continuous and smoother. Just to help you build some intuition for why that is, I'll draw your attention back to this hyperparameter we set called eval samples, which we set to three in this scenario. Eval samples is very similar to how Theo ran each sample nine times for the variance study; in that case we could have said our eval samples was nine, and here it's three. The reason we do this is that we want a fairly robust score on the validation set: each sample is run three times through our grader, then we compute a mean score, and that's what gets plotted for the entire validation set. Whereas for the green curve, what we end up plotting is just the reward returned by the currently training batch. And if we again look at the hyperparameters, our batch size is only 16. So it's a sampling of 16 examples from our training set, and then we just run those once, or some number of times, but essentially there's a lot of variance because we are looking at a much smaller cut of the data. — Exactly. It could be a very easy cut, as you can see at step six of training, with a mean training reward of 0.6 very early, or a very hard one at step 10, which you can see is much lower than the average. So that's why we see all this discrepancy, all those ups and downs, more of a valley — but we can notice that the trend is still correct, and that's really good information on the quality of our data, because the model is able to learn, and on the quality of our whole setup actually. — Yeah. And it's up and to the right, and that's what's good in this scenario, right? — That's all you want to hear. Yeah. — Yeah.
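The intuition about why the training-batch curve is spikier can be reproduced with a toy simulation. This is a hypothetical sketch (the noisy grader and the sample pools are made up, not the demo's data) showing that a mean over a full validation set graded three times per sample varies far less from step to step than a single pass over a random 16-example batch:

```python
import random

random.seed(0)

def grade(difficulty):
    # Stand-in for a stochastic grader score in [0, 1]:
    # harder samples score lower on average, with noise.
    return max(0.0, min(1.0, random.gauss(0.7 - difficulty, 0.15)))

# Hypothetical pools: a fixed validation set and a larger training pool.
validation = [random.uniform(0.0, 0.4) for _ in range(100)]
training = [random.uniform(0.0, 0.4) for _ in range(1000)]

def validation_reward(eval_samples=3):
    # Each validation sample is graded `eval_samples` times and averaged,
    # then averaged over the whole set -> a smooth, low-variance score.
    return sum(
        sum(grade(d) for _ in range(eval_samples)) / eval_samples
        for d in validation
    ) / len(validation)

def batch_reward(batch_size=16):
    # The training reward is the mean over one small random batch,
    # graded once -> much spikier from step to step.
    batch = random.sample(training, batch_size)
    return sum(grade(d) for d in batch) / batch_size

def spread(xs):
    # Standard deviation across simulated training steps.
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

val_scores = [validation_reward() for _ in range(20)]
train_scores = [batch_reward() for _ in range(20)]

print(f"validation reward spread: {spread(val_scores):.3f}")
print(f"training batch spread:    {spread(train_scores):.3f}")
```

Averaging roughly 300 grades per validation point versus 16 per training point is what makes the purple curve smooth and the green one jagged, even though both track the same underlying reward.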
So maybe before we move to the next curve, Theo, you've done way more of these than I have. What are some ways in which this can go bad? What might overfitting look like on this curve? — Yeah, totally. So overfitting here would be: the green curve going up and to the right, increasing, the model learning, but the blue or purple curve staying relatively flat. And if you see this, you know that your training set is perhaps too small or not representative of your validation set. — Got it. That's really helpful. Okay, let's look at one of the other curves for the reward, which actually gives us even more granular detail about how the training is progressing. There's a lot going on here, so it's important to not get overwhelmed. Let's just hover over one of these points so we have the full legend show up. The main difference from the last curve is that instead of just reporting the F1 score, which is a combination of precision and recall, we are reporting precision, recall, and F1 separately. And we're doing it for the training batch as well as the evaluation set, or the validation set rather. So why is this useful, and what are some trends we might observe? — Yeah. I think this is very useful because you want a particular behavior in your model, and perhaps you want it to be more focused on precision rather than recall, and perhaps you don't want to change the grader because you still want to optimize both. And so this actually helps you see which checkpoint favors which of

### Segment 8 (35:00 - 40:00) [35:00]

those two graders, and you can use a multi-grader with many other graders. Here we're comparing recall and precision, but it can be many other things. Another aspect is just understanding the model's behavior. For instance, if we move to step 65, you can see that the precision is extremely high, 0.84, but the recall is relatively low, 0.53, both on the validation set. This shows us that the model will not predict many labels; it will probably predict just a few, but they're nearly always correct. Maybe that's a feature for you; maybe that's a bug. And so based on this, you can select the checkpoint that's best for you, and then even think about changing the reward. — Right. I mean, getting to this point requires a lot of iteration, right? So I think it's important to state that we didn't get here overnight. We didn't just run the one training job and get here. These charts are really useful to inform your strategy as you figure out what experiments to run next, how to iterate and evolve your graders — and the prompt as well. — And the prompt. — Yeah. — Okay. There are a couple more charts; I'm going to breeze through these. The first one here is showing us the number of reasoning tokens, or chain-of-thought tokens, that the model is producing throughout the course of training. We can actually see an increase in thinking: the model decides that maybe a strategy to get a better reward is to just think for longer, right? And it goes down that path for a bit. Before I talk about the implications of that, I'll draw your attention to the lower chart here, which tells us how much time we spend training during each step of the training process.
So one thing we can see here is that as we produce more chain-of-thought tokens, we end up spending much more time during training and evaluation, because the model is just more verbose. It's taking more time to finish its turn, if you will. Anything else you want to point out about these reasoning tokens? — Yeah, I think that's a very important aspect to look at, because as you remember, we set a low reasoning effort at the beginning — and this parameter really drives the duration of your training, but it can also drive its performance. So you can see, of course, it becomes much slower, but what becomes apparent here is that the model actually started reasoning a lot to find a solution, and then perhaps has found a new reasoning path that it's confident in, and it is going back down. So there's a lot to explore, and if you go to medium, your training will be longer, but you might have better results. There's a trade-off, and you have to iterate and experiment. — Yeah. And perhaps one of the reasons you're doing fine-tuning, or reinforcement fine-tuning, is to reduce the number of thinking tokens that you're using, in which case this may not be the right snapshot for you to use. Right? — Exactly. — Either way, you have the insights in the charts and you can pick the appropriate job — or checkpoint. And then finally, we have a couple of charts here which tell us a little bit about the latency of the graders and how many errors are being produced during grading. These are really helpful when something goes wrong and the job fails; we might ask you for these charts if you ask us to debug your run. One final thing before we move away from this page: I want to draw everyone's attention to this messages section at the bottom. This is essentially like a write-ahead log: the system is just giving you updates on what it's doing.
One really cool thing about this is the idea of having the eval for each of the validation runs linked. Do you want to walk us through this? — Yeah, for sure. So as soon as we launch the training, there's an eval that is created, and we can just click on it, as Prashant did, and then we see the same evaluation dashboard as we've seen previously. Here we have scores, because they're coming from the training, and you can see all the different steps and explore all of them. So I can simply inspect results, and then, same as previously, I have all the information on the samples, and I can navigate between different steps to see how the model performed on each of those steps and checkpoints. — Yeah. This is really cool, because I was kind of curious what the model was doing when it was thinking so hard. So I can actually go in here, find that step, and then look at the outputs. — Exactly. And try to understand how the behavior is changing. — Mhm. All right. So now that we've seen the curves, let's go back into our code and have a quick look at how the model improved. Again, let's run some compare-model-metrics code that we've created for you. Here we have the experiment. We'll be comparing GPT-4.1, o4-mini, and then our fine-tuned model. For all of those we've pre-run evaluations, so it's going to run smoothly. And we're not showing o3 or o4-mini with different reasoning efforts, because actually o4-mini with low was the best we got. So this is quite representative. And we select the

### Segment 9 (40:00 - 45:00) [40:00]

validation set, and now we can choose all the graders, and we run some code, and here is our main results plot. Here you can notice multiple things. First, in blue on the left you have GPT-4.1, and you can see that it had relatively low precision and rather high recall. On the other hand, o4-mini had very high precision and lower recall, meaning that it didn't dare, kind of, to predict some classes. And now we can see that the fine-tuned model, because we had an F1 grader that improves both of them, has managed to take both the precision and the recall higher than the two other models, providing a much better F1 score. So this is close to ideal. You won't always get this; sometimes you can improve just the recall, or just the precision, but it gives you a sense of the advantage that you can get by running RFT. — Yeah, I love that F1 progression. Maybe really quickly, because we're running short on time, should we redo the variance study? — Let's do it very quickly. Let's go. I'm running it here, and this is what we get. And the F1 score is — there. So what do you observe here? — Well, one thing I'm observing right off the bat is that a lot more of the means and maxes are coincident here, right? We didn't have so many of those before, and the variance bars are actually shorter. I also think that everything has, on average, a higher score. It's kind of hard to see, but I know that it's true. — Yeah, maybe we can have a quick look at the validation set on this metric and compare to what it was beforehand. So let's go to the pipeline. We can rerun this, but we change the split to validation. And now we have those two plots that we can compare one-on-one. It's a bit hard to see here. Maybe if we focus just on the small values, you can see exactly what you mentioned, Prashant: the mean has definitely gone up.
We have no samples that score a complete zero, whereas previously we had some samples that scored a flat zero. — Right, and that means that the reasoning pattern the model learned to follow is actually quite generalizable. — Really cool. Yeah. So let's go back to the slides now and — yeah, let's quickly recap what we did. Actually, Theo, maybe you want to take this. — Yeah, okay, go for it. So for the RFT workflow, what we've done is: we looked at the dataset. We did the selection and qualification and really made sure we have accurate, good data. Then we implemented a grader to make sure that we're evaluating what we want to evaluate. We did prompt optimization and benchmarking to make sure that we weren't using fine-tuning in a position where we don't need fine-tuning. Then we looked at the variance study, which provides a lot of information on whether there is headroom for us to learn, and then we started an RFT run with just 100 samples and iterated, and we can later push to even more samples if we want even better results. — Yeah. Maybe a couple of important threads to draw out here: we actually employed a few best practices, right? If we think about point one here, which is dataset selection, we did two things. One is we balanced the classes for a classification task; that's really important on the training set. The other thing is we didn't use identifiers; we used the actual class names, which had semantic meaning, because the models are just much better at handling that. On the grader side, the really important thing to call out is that we used a grader that was fairly robust and had a continuous signal. So it wasn't just a true-or-false, boolean type of value. That really meant that (a) the model had a very continuous training signal, and (b) it couldn't really guess the right answer, which was really important. — Yeah. And we also made sure it was never zero.
It was not zero everywhere; the model was sometimes right. If it's never right, it cannot learn to reproduce those reasoning patterns. — Yeah, that's really important. If our models are not performing well at all on your training set, it's very unlikely that RFT is going to get you any performance improvement. — Cool. All right, with that, over to you, Christine. — Thank you. So earlier on, we started this hour by showing some of the fine-tuning options and how RFT is specifically good for domain-specific tasks. So I'm really excited to shine this spotlight on one of our startup customers, Accordance, who is building specifically in the tax and accounting field. David is the co-founder and CEO, and he'll be taking the stage. Awesome. Welcome, David. — Thanks for having me. It's a delight to be here. — Of course. Okay, so let's share the

### Segment 10 (45:00 - 50:00) [45:00]

next slide. Perfect. So yeah, take it away. Tell us what you're building and how you use RFT. — Yeah. So Accordance is an AI platform for accounting and tax professionals. Under the hood, Accordance uses a network of agents to help with a lot of workflows for accounting and tax strategy, optimization, advisory, and compliance, and we use RFT a lot. Actually, we've seen a lot of value in RFT, and so I'm happy to talk about how we use it. I think one of the first things when I think about RFT as a method is that it's very different from supervised fine-tuning or preference fine-tuning, which all have their benefits but maybe have different utilities, and you might use them for tasks that are a little bit different. So I can walk through one example where we used RFT end to end, and I can talk first about the task selection, then the data preparation, then the grader and the evaluation, and the actual training itself. In the task selection, as Prashant and Theo mentioned earlier, there are certain tasks that are very well suited for RFT, because RFT helps improve the reasoning capabilities of the model. It might not necessarily be imbuing additional knowledge into the model, but it helps train the model to think in the way that you would like it to. And so a lot of the value in RFT is selecting a task where you can see some value: the model's performing decently well, but maybe not performing perfectly. That's an area where you can improve the model's thinking abilities. You also want to pick a task that is objective. Oftentimes in preference fine-tuning or supervised fine-tuning, having data that's directionally accurate, or maybe subjective, is okay. In RFT, that's actually very different.
You want to have data that is objectively correct, where experts, or whoever knows the right answers, would agree on what the correct output of the model should be. For us, one of the workflows that we picked is actually on the tax strategy and tax optimization side. You can think of this as a process where you imagine you have a tax fact pattern, a scenario of some sort, and then you want to figure out: how do I actually plan and optimize the taxes so that the position is the most efficient? Traditionally you would go through a series of steps to do that: you would first establish the facts, identify the issues, locate the authorities, then read the different court cases and laws and understand how they work, and then develop a conclusion from that process. This is also traditionally a pretty difficult task, because it requires reasoning across a lot of analytical and mathematical modeling and also on the legal interpretation side, given that things are always changing; there are new regulations always happening in the accounting and tax world. So we chose this as a task because it's both objective and we saw that there's a lot of value to be gained by working on a task like this. Then, when it comes to selecting the data, just echoing what we heard earlier today, the quality of the data matters so much more than the quantity. For us, we have data from our partners and also from our customers who choose to opt in and share data with us.
And while we have a large volume of data, we specifically choose a very small subset of that data. The reason is, as I mentioned before, in supervised fine-tuning, as long as your dataset is directionally good and you have some high-quality data in there, it's okay if you have low-quality data rows. In reinforcement fine-tuning, it makes a big difference even if you have one low-quality data row in your training set. So we chose to select that data carefully, and it might depend on your task; you might have a very different selection of your data, but I would suggest starting with a small amount of data to begin with. You could choose maybe 100, 200, 300 rows of data, and just let that be a high-quality dataset to begin with. Then, when it comes to actually building the evaluations and doing the training, we heard a lot today about how you design a grader and the intuition around that. To give one practical example, and to echo what we've heard earlier: you want to have a grader that is both objective and stratifying. You want a grader that is continuous in nature, and it also needs to be able to stratify what is a good answer and what is a bad answer. In the example I mentioned before, let's say you're trying to do tax optimization. A very naive grader you could pick would be: hey, is my answer right or is my answer wrong? While that might seem like a good way to grade a test if you were taking one, it actually might confuse the model when it's doing the actual RFT. And the reason for that is because

### Segment 11 (50:00 - 55:00) [50:00]

the model might just try to guess an answer, and maybe it'll guess correctly. And if it guesses correctly without actually having the right reasoning steps to get there, then you're essentially confusing it and rewarding the wrong reasoning pathways. Where we really see the value is when the right reasoning pathways are being rewarded; then the model gets smarter and smarter over time. So you want to pick a grader that is both continuous and stratifying. You could also think about: how would I design my task with a grader that's effective? Suppose you picked a continuous function. In the case of tax optimization, it's something like: how much tax can I save? How much do I deviate from the optimal tax saving? That would be an example of a continuous function, and it's also stratifying. But you might also want to make it a multi-part grader, because if you reward just how much tax am I saving, then the model is rewarded for just hallucinating and making up random tax strategies to make that number higher and higher, right? So you might want to add another part to your grader which asks: how viable is this strategy? And maybe you weight that part, or maybe you have it short-circuit the other part of the grader. So choosing a grader is very important to actually doing RFT well, and we've seen that having an intelligent and simple design in your grader goes a long way.
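The multi-part grader David describes, a continuous closeness-to-optimal score gated by a viability check that short-circuits it, might look something like the sketch below. All names and the scoring formula are hypothetical illustrations, not Accordance's actual grader:

```python
def tax_strategy_reward(strategy_viable: bool,
                        predicted_saving: float,
                        optimal_saving: float) -> float:
    """Multi-part reward: a viability check short-circuits a continuous score.

    Rewarding raw savings alone would encourage hallucinated strategies,
    so an invalid strategy scores 0 no matter how large its claimed saving.
    """
    if not strategy_viable or optimal_saving <= 0:
        return 0.0
    # Continuous, stratifying part: 1.0 at the optimum, falling off
    # linearly with relative deviation from it.
    deviation = abs(optimal_saving - predicted_saving) / optimal_saving
    return max(0.0, 1.0 - deviation)

print(tax_strategy_reward(True, 9_000.0, 10_000.0))    # close to optimal
print(tax_strategy_reward(False, 50_000.0, 10_000.0))  # inviable, so zero
```

The short-circuit is the important design choice: the model can only earn the continuous part of the reward once the gating condition is satisfied, which closes off the "inflate the number" reward-hacking path.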
And then finally, going to the training itself: if you're actually going to do the evaluations and the training, there's some intuition that we talked about already around picking the right hyperparameters and how you look at the graphs. As you go through the process, you'll be able to understand how to adjust them. So if I had to summarize that end-to-end process: picking the right tasks and the right graders and going through that process is very important, and you can get really good results. We worked with an evaluation set called TaxBench, which is an industry-standard way of evaluating how well an LLM performs on tax, and that's one where we saw over 40% improvement just using RFT. So that's significantly valuable for both our offering and our users as well. — Awesome, thanks David. I want to keep an eye on time, but just very quickly, flash to the next screen. Can you share some of the other techniques that you tried? — Yeah, I think as Prashant mentioned earlier, RFT is definitely an investment. So don't just go all in at the beginning. Try all the other things: prompt engineering, using RAG. I like to think of it as two orthogonal axes, actually: what the model knows, or factual knowledge as I call it, and how the model thinks, or experiential knowledge. In our task we want to improve both, right? So we want to see how we improve what the model knows in terms of RAG and those techniques, but if you can also see value in how the model thinks, you want to consider RFT as well. — Nice. Did you have a question, Theo? — Yeah, I do have a question.
I'm just wondering: you talked a lot about the different graders, and I would like to know, what was your process for iterating on that grader, and did you observe any reward hacking in the meantime? — Yeah, absolutely. So I think reward hacking is something I briefly mentioned earlier as well: you could pick a binary grader at the very beginning, and the model might just guess. Then you might extend that one step further and say, hey, instead of doing binary, maybe I'll give it a multiple-choice answer; there are 10 options, and they each have different levels of stratification as to how good they are. That's a little better than just binary, because it prevents guessing and reward hacking, but it's still not a very continuous and differentiable function at that point. So ideally you'd like to stratify it as much as possible, and through that process we figured out that a lot of our work is making the reward as stratifying as possible. — Awesome. Yeah. — Yeah, thanks for sharing. — Yeah, that's really cool. — Awesome. So we had a question from the audience. I think this might be relevant for you both. The question is: I have a chatbot app that answers legal questions for humans. Their prompts will be inconsistent at best, and the output is just unstructured text. You mentioned that the training prompt is important for inference. Can I then use a training prompt that produces a structured output for the grader, yet run my model with any query and expect good results for the user? — Yeah, that's a good question. Maybe I can kick off, and then you can add in.
I think the way to think about it at a high level is that reinforcement learning is a little bit different from supervised learning, so how you set up the environment makes much more of a difference, as opposed to the hyperparameters and some of the other details. So setting up the right environment, i.e. how you choose your task and the structure of your grader, the data, and the output itself, makes a big difference. So if you have a lot of

### Segment 12 (55:00 - 59:00) [55:00]

disparate data across all kinds of different formats, and it's noisy and messy, I would actually recommend trying to clean that up and really formalizing what that task and that environment look like. — Yeah, I think that's a great take. — Awesome. Thank you so much, David, for joining us. We're going to move into the Q&A session now and bring Prashant back on stage with us. We did get some feedback from you all that you wanted more time for Q&A. I know we are coming up on the hour, but we do have some questions that we pulled. If you can't see the Q&A tab, you just have to click on the actual Q&A text to see some of the answers. Awesome. So let's get started. I think if you go to the next slide, we pulled some. Perfect. — Okay, so the question is: what kinds of tasks are best suited for RFT? Yeah, we get this question a lot. — Yeah. — Maybe I can start off and then you can chime in. — Okay. I think two key things contribute to a task being suited for RFT. The first is that you must be using a reasoning model. If your task has very comparable performance on non-reasoning and reasoning models, the case is much weaker for using RFT. The second thing is that you must be able to have a formal rubric of some kind to apply to your task and construct the sort of continuous reward that you can offer. I think David put it really well by talking about a differentiable reward. That's the holy grail: if you can have a formal mathematical function that gives you the reward, that's ideal. But even without that, having a very formalized and stratified rubric to grade your task is really important when you're doing RFT. — Yeah. To add on top of that, it really means that there is a correct or preferred answer to your task.
And therefore, if there is such an answer, then you can build a rubric and do a lot of engineering on building that grader, to then feed back in and train the model to reason better. — Yeah. And maybe one thing that's worth saying again here is that the task must have nonzero baseline performance on the model you're trying to train. If you have a task that's too difficult for o4-mini, and it maybe always gets 5% or 10% on your eval, you're probably not going to end up with a very strong RFT run. — Okay, the next question is: my data is pretty noisy and of variable quality. Can I use RFT? Okay, do you want to take this? — I think we've pretty much covered that question; the first part would really be to make sure that you clean up your data, because each sample is very valuable in RFT. — Yeah. An example of this, maybe coming back to the classification task we did: if you had two classification rules which were in conflict with each other, so it wasn't really clear to the RFT system what rules you wanted it to learn, that would confuse the system and maybe result in a poor training run. — Okay, so can you share the considerations and trade-offs between cost, latency, and performance when using RFT? — Yeah, I can take a shot at this. Often a motivation for teams using RFT is to extract frontier, o3-level performance from o4-mini. That's what it looks like today; maybe in the future we'll have different models, but in most cases, when we specialize o4-mini for a domain-specific task, it is actually possible to match or surpass o3-level performance with o4-mini. And that by definition makes it cheaper to run, because o4-mini is cheaper, smaller, faster. So directionally this makes sense, but I will say it's also hard to control exactly the number of reasoning tokens the final fine-tuned model will produce. So it's not a sure bet by any means.
— I do think it's also dependent on the amount of production traffic that you expect, because there is a cost to running RFT, both the experiments and the trainings. And therefore, if your production usage will make up for it, go for it. If your production usage is very low and won't make up for it, then it's a bit harder. — Okay, thanks everyone for your questions. Really quickly, we'll flash this up on the screen, but you will get an email with all of these resources, so don't worry about copying this down. And then I just wanted to share our upcoming build hours. On July 31st we're going to be talking about built-in tools, and on August 26th we're going to be talking about Codex. So thanks again for tuning in with us live, and we'll see you on July 31st.
