Go beyond out-of-the-box models with gpt-oss, OpenAI's newest open model series. Discover how gpt-oss lets you adapt, extend, and fine-tune to your needs while combining seamlessly with GPT-5 for flexible, high-impact builds.
Dominik presents one of the first of NVIDIA's DGX Spark AI computers on stage.
Table of contents (5 segments)
Segment 1 (00:00 - 05:00)
Good afternoon everyone. My name is Dominik Kundel and I work on developer experience here at OpenAI. Before we get started, a quick raise of hands: how many of you are using a combination of open and proprietary models? All right, that's quite a lot. How many of you are using or have used gpt-oss before? All right, that's more than I expected, but hopefully we can get it to 100% by the end of the talk. Over the next 25 minutes, we'll talk about gpt-oss, our latest open model series that we released in August earlier this year, why you might want to use these models, and how they fit into the broader OpenAI ecosystem. But first, what is gpt-oss? gpt-oss is our open model family consisting of two models. gpt-oss-20b is our medium-sized model that can run on higher-end consumer hardware with at least 16 GB of VRAM, so a top-of-the-line consumer graphics card or a recent mid-tier MacBook. And gpt-oss-120b is our larger model that can run on a single 80 GB GPU like an NVIDIA H100 or an AMD MI300X, or even a top-of-the-line MacBook like this 128 GB MacBook Pro. But why did we build these models? Well, first of all, because you all kept asking us to. But joking aside, we know that proprietary hosted models aren't always an option for you. Maybe your data needs to stay on premise for safety or privacy reasons, or you have specific hardware or latency requirements, or your use case has to run completely offline because of flaky or non-existent internet connections. There's a wide range of reasons why you might want to pick an open model, and we know that a lot of you already deal with a mix of proprietary and open models for your use cases. Because of that, we wanted to make sure we offer you the best possible experience. Both of the open models are reasoning models with variable levels of reasoning effort and raw chain-of-thought access.
And they're the only open models that can perform tool calling as part of the chain of thought, including web browsing and Python tool calling, meaning the model can combine a series of tool calls and reason between them to more effectively achieve complex tasks. Both of the models are permissively licensed under the Apache 2.0 license, meaning you can use them in commercial applications or fine-tune them to make them your own, as long as you adhere to the local laws in your region. These models should also instantly feel familiar if you've used other reasoning models like OpenAI's o3, o4-mini, or GPT-5, as they offer the same variable reasoning, chain of thought, function calling, browsing, and Python capabilities. In fact, gpt-oss-120b and gpt-oss-20b have nothing to hide when you compare them to o3 and o4-mini. On Humanity's Last Exam, for example, a benchmark designed to test AI at the frontier of human knowledge, both gpt-oss-20b and gpt-oss-120b perform on par with o4-mini, except in our case gpt-oss-20b can run entirely locally on your laptop. And even with complex math problems like AIME 2025, the models are able to keep up with o3-level performance. To repeat: that is real state-of-the-art intelligence that you can run locally on your computer, right now. These models are designed for agentic use cases, so function calling performance is incredibly important. On Tau-bench retail, for example, which tests a model's ability to use tools to resolve a retail customer service issue across multiple turns, both models perform extremely well, especially given their size. Overall, the models fit very well into the OpenAI ecosystem. You can use them with the Agents SDK or with the Codex CLI, and more and more providers like Groq, Hugging Face, vLLM, NVIDIA, and, as of today, LM Studio have started offering their own Responses APIs.
That way they should work directly within your existing projects and allow you to mix and match them with other open models. When building these models, we tried to really listen to community feedback. Two of the biggest asks we heard were, one, for us to not hold back when it comes to capabilities, especially for agentic use cases, and two, to make the models efficient. We tried to balance those two goals and are pretty happy with the result: in the overall open model space, gpt-oss strikes a great balance between performance and size. The feedback from the community and our partners has been great so far. In total, it's been downloaded over 23 million times on our Hugging Face
Segment 2 (05:00 - 10:00)
organization alone. People love using the model for local use cases, its tool calling capabilities, and its overall cost efficiency. Some of my favorite uses of gpt-oss so far came from our six-week-long hackathon that wrapped up a few weeks ago. You'll see some of the examples later in the developer state of the union, but we've seen people use gpt-oss for controlling robots, discussing sensitive topics in their own personal diaries, fine-tuning it to be a subject matter expert on highly specialized topics or even just to be a better storyteller, using it in offline coding assistants, and even using its coding ability to create a fully on-premise cybersecurity operations center that helps protect systems without disclosing sensitive data, just to name a few. That's why we've invested in gpt-oss: to give developers the flexibility and control to build with the models the way they know best and to run them anywhere. Enough talking, though. Let's actually see gpt-oss in action in a couple of examples. Since gpt-oss is an open model, there's a wide range of ways you can run and host it. On servers, you can use frameworks like vLLM and Transformers, and for local inference you can use projects like llama.cpp, LM Studio, or Ollama. For this demo, we'll build a chat agent that will help me keep track of my finances without revealing my private financial data, by keeping everything entirely local. To power the agent, we'll run gpt-oss-120b, the larger model, completely locally on my MacBook using Ollama, but the same thing should work with other inference solutions. I've already downloaded the model on my laptop, since I didn't want you all to watch me download 70 GB on conference Wi-Fi. But because of that, we can actually turn off our Wi-Fi here, since the model is entirely local. All right, we've turned off the Wi-Fi, and with that, we're completely at the mercy of the model and the demo gods. So, wish me luck.
We can see if the model is running by using the Ollama CLI and sending a friendly greeting, and we can watch the model go through its reasoning process and respond accordingly. Now that we know the model is running, we need a local API to integrate it into our app. Ollama and most inference providers already offer a Chat Completions API, but in our case we want to harness the full power of the model. So instead, we're going to run our own Responses API proxy that we shipped as part of gpt-oss and that is available on GitHub. This proxy exposes the built-in Python and browsing tools on top of sending the token generation to our inference provider, in this case running on this device. Now, let's actually build the agent. For this, I'll be using the Agents SDK for TypeScript to build the finance agent that powers the chat interface you can see on the right. We already have a very basic setup where the chat interface is hooked up to the agent on the left, and we configure the agent to use our locally running Responses API and to use the Python code interpreter as a tool. Right now this agent is pretty generic, but we can see what it can do by asking a question like "What is the square root of some random number?" and, as I mentioned earlier, we should see the model perform tool calls as part of its chain of thought. You can see here that it's starting to think through the steps it has to take. It tries to use the Python code interpreter, realizes it was trying to use some dependencies it didn't have, and corrects automatically. This is the same kind of behavior you would see from models like GPT-5, o3, or o4-mini, except in this case it is completely offline.
That means we can build agents the same way we're used to with these models, but for sensitive data that might have to remain on premise. For my finance agent, for example, I have a bunch of financial files scattered in my directory, and I should probably clean up this situation because there are a couple of files that I don't want to leave the system. But since we're running the model locally, I don't actually have to worry about that. With gpt-oss, we can equip our agent with the necessary tools to handle sensitive data while keeping the data entirely local. So, how do we connect the agent to the file system?
Segment 3 (10:00 - 15:00)
We're actually going to use an MCP server that exposes the necessary tools for the agent to browse and open files. Let's add the MCP server, connect to it, again entirely locally, and provide it to our agent. We can now check if it works by asking how many files are in the top-level directory. You can see it going through its reasoning steps again, using the different MCP tools to browse the file system and get the information, and we get an answer. Now let's try a more complex question, like "Summarize my overall portfolio growth in 2024 in percent." This is where you can really see the power of gpt-oss come into play, using tools as part of the chain of thought: it performs multiple requests to the file system, writes some Python code to interpret the results, and then gives us an answer directly, without having to go back and forth with the user. All of this has been running fully offline, leaving the data entirely on premise. But especially if you're using the smaller gpt-oss-20b model, there might be moments where you need the model to know more, or you're hitting other capability limits. For that, we're going to connect back to the internet, and I'm going to show you two more things. One, I want to give the model the ability to do web browsing. And two, I want to give it access to GPT-5 for some additional tasks. For browsing, the model was trained on a generic browsing tool, meaning you can build your own browser tool on top of your proprietary search provider, run your browsing through your own proxies if you want to apply content filters, or even use a fully offline or on-premise search index. In our case, though, we'll use an example search provider. We have two on GitHub: one is Exa and the other is You.com. Here we're going to use the Exa API, which is already set up in the Responses API proxy that we configured earlier.
So I just have to enable it. How enabling web search works will depend on your inference provider, and the same goes for Python; some inference providers might not support this yet. I also want the agent to be able to create little interfaces for diving into the data I'm working with. While the model is good at coding, it's not our best model for coding. For that, I actually want to give it GPT-5 as an agent that our agent can use as a tool. In this case, I already have an HTML agent here that uses GPT-5 and is specialized in writing interface code in HTML and nothing else. I also have an input guardrail that checks whether anything looks like a Social Security number, to avoid accidentally passing sensitive data to my remote GPT-5 model. You could do the same type of checks with other confidential data that you want to make sure stays on premise. All right, we still need to add this here, so I'm going to pass it in and give it a tool name of generateVisualization. And with that, we can ask for a task like "Create a bar chart of my individual stock gains in 2024." Got to put a comma there. Now we should see the model going through the same steps again: reading the necessary files, using Python to crunch the numbers, and then calling GPT-5 once it has the data. You can see here it was calling the code interpreter to actually process that file. It seems like it messed up something, and this is the benefit of chain-of-thought reasoning: it can go and recover. Every once in a while you might run into these situations. There we go, it got the data. Now it's calling the generateVisualization tool. And you can actually see here that, because my Social Security number check is
Segment 4 (15:00 - 20:00)
relatively rudimentary, every once in a while it runs into cases where that regular expression gets triggered. In this case the model realized that, self-corrected again, and performed another tool call. And now, in a second, we should get back the result. Come on, internet. All right, while we're waiting for that, let's move on. As you saw... there we go. Thank you. So, as you saw, gpt-oss seamlessly fits into the broader OpenAI ecosystem. We were able to write an agent the same way, using the Agents SDK tools that you're already familiar with, run it entirely offline, and have the model behave the same way as our other reasoning models do. And we were able to leverage GPT-5 for tasks that gpt-oss is less equipped to do. One more thing we can do with gpt-oss, though, is make it our own by fine-tuning it. Maybe you're not quite happy with the performance of gpt-oss on a specific topic or task, or maybe you want the model to be a subject matter expert in a specific field or on some internal data. Traditionally, that would mean using a technique called supervised fine-tuning, where, at a high level, we give the model a bunch of input and output examples and train the model on those to get better at similar inputs. However, with gpt-oss's reasoning capabilities, a more interesting way of fine-tuning and personalizing it is reinforcement fine-tuning, where we give the model inputs and a reward function that tells the model how good its output was. To give you an example, we're going to briefly see fine-tuning in action, but since watching a model fine-tune for hours might not be the most interesting thing you can do today, I already fine-tuned a model ahead of time. Before I explain what we did: how many of you know the game 2048? All right, cool.
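Before we get to the game, a quick note on the Social Security number guardrail mentioned a moment ago: a check like that can be as simple as a regular expression over the text sent to the remote model. This is a minimal, deliberately rudimentary sketch (the pattern and function name are illustrative, not the actual guardrail from the demo), which is exactly why it can trigger false positives on similarly formatted data.

```python
import re

# Illustrative pattern: ddd-dd-dddd, the common written form of a US SSN.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def looks_like_ssn(text: str) -> bool:
    """Return True if the text contains something shaped like a US SSN.

    Deliberately rudimentary, like the check in the demo: any
    ddd-dd-dddd sequence is flagged, so false positives on other
    dash-separated identifiers are possible.
    """
    return SSN_PATTERN.search(text) is not None
```

An input guardrail would run this on every payload before forwarding it to the hosted model and block, redact, or retry when it matches.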
In this game, for those of you who don't know, you as the player try to combine adjacent tiles with the same number by swiping up, down, left, or right to merge them and eventually reach 2048. We wanted gpt-oss-20b to play this game by having it write a Python function that encodes a strategy for playing. As you'll probably see here, the base model is decently okay at playing, but it occasionally writes wrong or very basic code and doesn't get far in the game, especially on low reasoning effort. If we test this out here, it seems like this run did okay, but it definitely didn't get very far on the board. So instead, we fine-tuned a dedicated model by giving it a reward function that takes the generated strategy, plays boards with it, and sees how far it gets. We then used a technique called GRPO, a reinforcement fine-tuning technique, and a tool called Unsloth to actually fine-tune the model. This is a non-trivial amount of code if I scroll through the entire notebook, so if you want to check it out, it's available on our gpt-oss GitHub page and you can try it yourself. Let's actually see this in action by having gpt-oss and our fine-tuned version each generate five different strategies. You can see the models starting to kick off all of these different Python strategies. It's a bit hard to read while it's still generating, but every once in a while both models will return some more rudimentary strategies, and it's pretty hard to tell at a glance which one is better. So, in order to figure out whether the fine-tuned model actually did a better job, we're going to have the two models, with these five strategies each, play 100 boards against each other to see which one does better.
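The actual reward function from the notebook isn't reproduced here, but conceptually the reward scores how far a strategy got on the board. A toy stand-in, assuming the board is a 4x4 grid of tile values, might look like this (the weighting is an invented example, not the one used for the fine-tune):

```python
import math


def board_reward(board: list[list[int]]) -> float:
    """Toy reward for a finished 2048 board.

    Favors a high maximum tile (log2 of the max tile dominates), with
    the total tile sum as a gentle tiebreaker. This is an illustrative
    stand-in, not the reward used in the gpt-oss fine-tuning notebook.
    """
    tiles = [t for row in board for t in row if t > 0]
    if not tiles:
        return 0.0
    return math.log2(max(tiles)) + sum(tiles) / 10_000.0


# A board that reached 2048 versus one stuck alternating 2s and 4s.
won = [[2, 4, 8, 16], [32, 64, 128, 256], [512, 1024, 2048, 4], [2, 2, 4, 8]]
stuck = [[2, 4, 2, 4], [4, 2, 4, 2], [2, 4, 2, 4], [4, 2, 4, 2]]
print(board_reward(won), board_reward(stuck))
```

In reinforcement fine-tuning, a scalar like this is all the training loop needs: strategies that earn higher rewards are reinforced relative to the rest of their group.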
We're almost there; some of them are still generating. Let's see how many. One. All right, looks like we're done, so let's run this. And... demo gods... come on. Yes! You can see that model B, which is actually our fine-tuned version of the model, won quite significantly in the games they played against each other and also got a higher total score overall. All right, I do have one confession to make, though. I mentioned earlier that we were running gpt-oss entirely locally. And while that was the case for the first demo, these two models were actually not running on my device. They are still running entirely locally, though, and not on some GPU in the cloud. Instead, I got to run them on a very special piece of hardware.
Segment 5 (20:00 - 22:00)
Years ago, Jensen delivered the very first DGX-1 to OpenAI, and you can see it here on the screen. It was a milestone in compute. And while it would be fun to run these two models on a DGX-1, it doesn't quite fit on the podium, and you all would have probably figured that out by now. Instead, though, I have a great alternative. This is the DGX Spark from NVIDIA. This little beast contains the same amount of compute as the DGX-1, and in fact it is currently running both of these models, and it also fine-tuned the model, all while standing on my desk. You're some of the very first people to see this live outside of a lab. NVIDIA was kind enough to give it to us for this demo; in fact, it's not available at all yet. This is a pre-production system that NVIDIA provided to help get it ready for developers like you. And because it's still pre-production hardware and software, it might not be fully representative of its final capabilities and performance, but it's been a lot of fun to work with. If you want to check out the fine-tuning code, to learn more and try it yourself even if you don't have a DGX Spark yet, you can find it on GitHub. All right, to summarize what we've seen so far: gpt-oss is a great option if you need to run a model locally or on premise, for example for safety, privacy, or low-latency use cases. gpt-oss seamlessly integrates with the OpenAI tools you're already using, including the Agents SDK and the Codex CLI, so whether you're building an agent that has to run locally or you're trying to code offline, gpt-oss can be useful. And gpt-oss and GPT-5 can work hand in hand, allowing for use cases where you need a blend of models to get both state-of-the-art performance and the benefits of open models.
Lastly, gpt-oss can be fine-tuned for your own use cases, giving you the ability to have your own expert models while leveraging the same intelligence and capabilities as OpenAI's reasoning models, all while running fully on your own hardware of choice. If you want to learn more, you can check out all of the resources at openai.com/openmodels. And with that, thank you very much.