OpenAI just WON...

16:10

Wes Roth 24.04.2026 59,478 views 1,734 likes

Video description

FULL DETAILS AND LINKS: https://natural20.beehiiv.com/p/gpt-5-5-spud-is-here-and-openai-s-chief-scientist-just-said-the-quiet-part-out-loud ______________________________________________ My Links 🔗 ➡️ Twitter: https://x.com/WesRoth ➡️ AI Newsletter: https://natural20.beehiiv.com/subscribe Want to work with me? Brand, sponsorship & business inquiries: wesroth@smoothmedia.co Check out my AI Podcast where me and Dylan interview AI experts: https://www.youtube.com/playlist?list=PLb1th0f6y4XSKLYenSVDUXFjSHsZTTfhk ______________________________________________ #ai #openai #llm

Table of contents (4 segments)

Segment 1 (00:00 - 05:00)

GPT-5.5 is out, and it might be the best model ever. I think the name that OpenAI used here does not really represent what it is. As the model was being released, Greg Brockman confirmed that this is Spud. This is the much-awaited Spud model, or the beginning of the Spud era of models. As they're saying at OpenAI, this is a new class of intelligence. So the fact that it's a 5.5 and not a something-point-oh release or some brand-new naming convention does not do it any favors, because it feels different.

When it was just released, I did a live stream, and on that live stream I asked people what would be a cool large language model benchmark to create in order to test these models and pit them against each other. Here's what we came up with in real time. We basically wanted something that was a little bit of a real-time strategy game like StarCraft, with a little bit of Factorio and maybe some EVE Online-style trading and market mechanics. I had this idea floating around for quite some time. I just was never able to get any of the previous series of models to create the full thing. Here we are just a few hours later, and here it is. It's a working prototype. All the functionality is there. The only thing left now is to fix the game a little bit, make sure that it's fun and the mechanics work right. But that's what I want to do. That's what I'm interested in doing. Everything else, the coding, making sure things work, the testing, was entirely handled by the model. Writing out a huge manual describing how every single thing works was done by the model. Creating all the images: that was GPT Image 2.0, which came out two days ago. The model requested those images to be made, so it kind of made them itself. It came up with the prompt. It requested that the other image model generate those images. Then it got the images.
It actually removed the background so that it's transparent around them. It's a PNG. And then it put them into the game. It has diplomacy, trade, combat. It has resources. I had it detail all the LLM prompts that are put into it, so as the different models fight and engage in diplomacy and all this stuff, we can see what they're thinking. And we can even modify the various system prompts and see what works better, which, by the way, is the stuff that I want to do and am interested in doing. But this stuff is hard to get to in and of itself, because there's so much technical work that has to be done just to reach this point. You have to write the code, test the code, make sure everything's showing up right, and design the website. There are hours and hours of work before you get to the fun stuff, the stuff that I would love to do. This is the first model that just did all of that for me.

I'm in the middle of a game right now that's slowly going. It's four models: Claude Sonnet, GPT-5.4 Mini, Grok 4.1 Fast, and Gemini 3 Flash Preview. Claude Sonnet looks like it's beating everybody right now with the high score. Each model is ranked on how well it does with the economy and with the military. The very next thing that I want to figure out is how to introduce diplomacy, some sort of a rating. Right now there is some communication between these models, but no fully fleshed-out diplomacy system. But that's kind of my point. That's what I want to do. I don't want to sit there making sure that the code is right. I want to work on the mechanics, the game design.

How was I able to create this? I had several different agents all working on it in tandem. Some of them were visiting the website and actually clicking on various buttons, testing it in real time, making sure everything works. Another agent was generating the images. One was doing the coding. Each one worked on its own piece of the puzzle.
And together they created this. By the way, I'll post the link so you can use it yourself if you want to test it out. I'll probably make it open source at some point. I just want to make sure there's no weirdness with it. It's still very much a work in progress. All you need is an OpenRouter API key. Just to give you an idea: I created this key on OpenRouter this morning. Notice April 23rd, and the last time it was used was 4 seconds ago. In total we've spent about 15 bucks so far, but there were quite a few different games that I was running. It has access to over 405 different AI models. Now, some of them are not going to be useful for this. You really need a specific type of model that's capable of planning and capable of outputting, for example, JSON-formatted text. Let's have a look at what it's been spending on. We've been using GPT-5.5, GPT-5.4 Pro, and Claude Opus 4.7. By the way, if I stuck with the smaller models, it wouldn't even be that expensive.

So the model is making the game and writing up all the documentation. It's adding to the version history. It's pushing updates to GitHub. I don't even have to handle that. It entirely took the technical stuff off my hands, which allows me to focus on just making this a better benchmark, a better game, observing what's happening and tweaking it to improve it. Again, not the code, not the technical stuff: the mechanics, the crafting of the actual game.

Here's the thing. This model came out like 8 hours ago, and I should have done a video on it much, much sooner. I got hooked. I got hooked and addicted, and I spent my entire day doing this, and I'm having a blast. I think this is my next obsession, because now I can create stuff like this as fast as I can think of it. So, if you're wondering how I'm doing this
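The requirement mentioned in this segment, that a model must be able to output JSON-formatted text, can be illustrated with a minimal sketch. The action names and the `parse_action` helper below are assumptions made up for illustration; the video does not show the benchmark's actual protocol.

```python
import json

# Hypothetical action vocabulary (an assumption, not the benchmark's real schema).
ALLOWED_ACTIONS = ("build", "trade", "attack", "wait")


def parse_action(reply: str) -> dict:
    """Extract a JSON action from a model reply, falling back to 'wait'."""
    fallback = {"action": "wait"}
    # Models often wrap JSON in prose, so take the outermost pair of braces.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return fallback
    # Reject anything that is not an object with a known action name.
    if not isinstance(data, dict) or data.get("action") not in ALLOWED_ACTIONS:
        return fallback
    return data
```

A model that cannot reliably produce parseable output would simply keep "waiting" under this sketch, which is one way such a model ends up being not useful for a game like this.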

Segment 2 (05:00 - 10:00)

so I'm using Codex. Here it is. It's actually running on a VPS. I'm using Hostinger. I'll have a full video about how to set it up tomorrow, but here's what I do. I just tell it what I want it to work on next. So here I'm going to say, "Make the trade process more visible." Basically, there's a trade process, and right now it's a little bit hard to figure out what's happening. So we want more things to show up in the heads-up display and in the documentation, so there's a better explanation of what's actually happening behind the scenes. I'm going to hit Enter so it starts working on that.

But here's the thing: I can add the next thing that it's going to work on and queue it. For example, here I'm going to say, "Let's have everyone start with two marines so they can engage in combat faster." I'm going to hit Tab, and notice that it's queued. Once it finishes running the first task, it's going to get the second message and start working on that. Next, we're going to say, "Create a more rock-paper-scissors mechanic for combat. Create some support mechanics. A unit that has at least one other friendly unit next to it gets a plus one." I'm going to hit Tab, and again, that's yet another thing that's now queued for it to work on.

The next thing I want to fix is that right now diplomacy is kind of toothless. There's really no risk, no commitment. Alliance and non-aggression are just labels. Those are two of the things that these models can promise each other. So we're going to be adding some fixes, like two-phase trades with staged commitments and a resource hostage to signal intent. So, "Hey, I'll be your ally," and this is kind of a security deposit: if I attack you, you get this money. There are also private DMs that are only revealed after the match ends, so genuine deception is possible. So we're adding deception to the game. We're adding intrigue.
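Two of the mechanics described above, the plus-one support bonus and the security deposit, can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Unit` and `Pact` types, the grid coordinates, the choice to count diagonal tiles as adjacent, and the rule that both sides post an equal deposit are all hypothetical and not taken from the actual game.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    owner: str
    x: int
    y: int
    strength: int


def effective_strength(unit: Unit, units: list) -> int:
    """Base strength, plus 1 if at least one other friendly unit is adjacent."""
    for other in units:
        if other is unit or other.owner != unit.owner:
            continue
        # Assumption: "next to it" means one tile away, diagonals included.
        if max(abs(other.x - unit.x), abs(other.y - unit.y)) == 1:
            return unit.strength + 1
    return unit.strength


@dataclass
class Pact:
    a: str
    b: str
    deposit: int  # assumption: each side locks up this many resources


def settle(pact: Pact, attacker=None) -> dict:
    """Pay out deposits: a side that attacks forfeits its deposit to the victim."""
    if attacker is None:
        # Pact honoured: both deposits are refunded.
        return {pact.a: pact.deposit, pact.b: pact.deposit}
    if attacker not in (pact.a, pact.b):
        raise ValueError("attacker is not a party to this pact")
    victim = pact.b if attacker == pact.a else pact.a
    return {attacker: 0, victim: 2 * pact.deposit}
```

With a deposit at stake, an alliance stops being just a label: breaking it has a concrete cost, which is the "risk and commitment" the diplomacy system was missing.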
I hit Tab, and that's yet another thing that's now in the pipeline for it to work on. Then I'll say, "Add the mechanics to the manual. They're not fully explained." Tab. So now I've given it 30 to 60 minutes of work. This was pretty much the entirety of my day since this model got released, and I love it. I'll post more about this later when it's completed, so all of you can hopefully try it out. For the time being, let's dive into what actually makes this model good. What makes it tick?

All right. So what do we know about this model so far? Well, it has a 1 million token context window in the API. Some places are reporting it's 400,000 in Codex. However, when I was running it in the Hermes agent, it looked like it had a million-token context window even though it was using OAuth, so you'd think it would be the same as Codex. I'm still verifying that. OpenAI also dropped tons of stats about the scale of their operation. They have 900 million plus weekly ChatGPT users, 50 million plus paying subscribers, 9 million paying business customers, 4 million active Codex users, and 85%+ of OpenAI uses Codex weekly. Interestingly, this model is built and served on Nvidia GB200 NVL72 systems. It's served on Nvidia's GB200 and GB300 systems, which is a first for an OpenAI flagship model, and according to Axios, Nvidia believes this can cut the per-token inference cost by up to 35x, which would be a massive reduction in cost. Now, these models are still going to be, some of them, twice as expensive as the previous version, and at least five times more expensive than their open-source counterparts.

GDPval is an evaluation of how well these models complete certain tasks that human experts are good at. The 50% mark is the industry-expert baseline. We of course crossed that baseline not that long ago, maybe six or seven months ago if I recall correctly.
And now, for example, GPT-5.5 is sitting at, let's call it, 85%. So this is where people with 12-plus years of experience in the field, people in management roles at these companies across engineering, finance, film, and all sorts of different industries, either prefer the output of this large language model or at least rate it as a tie with the human output.

By the way, I'm not the only one who's really excited about this model. Again, I know it's called GPT-5.5. It seems like an incremental step forward, but once you get to using it, it feels big. Here's Ethan Mollick, somebody who's not too hypey, who seems to be very reasonable and level-headed. He's saying, "I had early access to GPT-5.5, and I think it is a big deal. It's an impressive step on the curve. It's a big deal because it indicates that we are not done with the rapid improvement in AI."

By the way, this is Jakub Pachocki. He's OpenAI's chief scientist. He's been a little bit more in the spotlight recently, and I think he's teasing the fact that we're about to see a bit of an acceleration. He's saying that the last two years have been surprisingly slow relative to what's coming. It's been widely interpreted as an acceleration tease. The quote from TechCrunch that I'm referring to is this. He said, "We see pretty significant improvements in the short term and extremely significant improvements in the medium term. In fact, I would like to say, I think the last two years have been surprisingly slow." Greg Brockman talked about how well this model does with low direction. It kind of intuits what it's supposed to be doing. And the other

Segment 3 (10:00 - 15:00)

quote from him is, "We are moving to a compute-powered economy." So again, a lot of people are saying that this thing is going full steam ahead, and after testing it out today, I've got to say I'm pretty impressed. I've been testing these models for the last several years. Some have been big leaps, some not so much. This one seems like a big leap. I think the fact that it's called 5.5 and came out after 5.4 is really throwing people off.

So, back to Ethan Mollick's post. He created this comparison that I think illustrates this very well. The prompt is, "Build me a procedurally generated 3D simulation showing the evolution of a harbor town from 3000 BCE to 3000 AD. It should look beautiful and allow me to have some control over it." This is OpenAI's o3, so a little bit older at this point. As you can see, it's kind of like, okay. Then we have Kimi K2.6, which by the way isn't bad. I've been testing it out. It's not a bad model for coding. It's not going to be as good as the top-of-the-line stuff, but for an open-source model, and it's free right now in Hermes agent, it's pretty good. But as you can see, there are tons of, I don't even know, I guess it's tracking the time. But why do you have the decimals out to like the 50th place? I don't know. This one is Claude Opus 4.7. A lot more polished, as you can see. "Let us begin the chronicle." I mean, it's looking good. Claude Opus 4.7 rocks, right? It's pretty good. Let's increase how fast it's going. So notice, it's the same buildings popping up. It's looking good. It's probably the best one so far. And then we get skyscrapers and cranes in the middle of the water. I mean, not bad. Okay, but what's the point here? Here's ChatGPT 5.5 Pro. This is the thing that we're talking about. Notice what it did here. It's simulating the harbor, starting with a fishing cove.
Let's hit play. We're going to increase the evolution speed. We can change the time of day, which is pretty cool. Density of the buildings. Notice the moving ships on the water. Cinematic camera. Night lights. And now we've got futuristic buildings. And wow, it reset back to the beginning. Let me restart it here so you can see. Notice it's building up the town all the way into the future.

And here's the key that I think Ethan is pointing out here, and I'll have a link for you so you can check it out for yourself. He's saying that, in addition to being better along all the other dimensions, only GPT-5.5 Pro actually modeled an evolving town rather than just generating new building replacements over time. It was the only one that built an actual simulation of what was happening. Gemini 3.1, and before that we saw o3: they're all sort of just shaping different buildings. 5.4 is changing buildings too. But here's ChatGPT 5.5 Pro. First of all, there's a lot more diversity in what's happening. Notice they have factories, and the ships and buildings evolve. It's also much, much faster. GPT-5.4 Pro took 33 minutes. GPT-5.5 Pro took 20.

Dan Shipper from Every is saying, "The first coding model I've used that has serious conceptual clarity." Pietro Schirano from MagicPath is saying, "It genuinely feels like I'm working with a higher intelligence, and there's almost a sense of respect." Leanne Russell is saying, "We are seeing a step change with this model."

This model has the highest accuracy ever recorded, but there's also a fairly high hallucination rate on some of the benchmarks. As Jack Hanty framed it, "It knows more. It lies more." This is an interesting thing that we've been seeing with the latest generation of Claudes, and now we're seeing the same thing with this new OpenAI model.
So I'm not going to go over everything that we've covered about Claude, how deceptive it can be and how situationally aware it is. Here's the good news, according to Apollo Research, a third-party independent research lab that works with OpenAI and others. Good news, everyone: this model engages not at all on various sandbagging tests. This model achieved 1% accuracy on both sandbag and QA variants in both conditions, and 99.6% accuracy on strategic deception capability sandbagging. So it doesn't do nefarious things.

But if you saw my previous videos, you probably know where this is going. This is the model with the highest situational awareness. Apollo noted that the model verbalized evaluation awareness at increased rates: 22% of samples showed moderate or higher verbalized alignment-evaluation awareness. So what does that mean? It means that as these models get smarter, they seem to be better aligned, but they're also more aware that they're being tested. So what does that tell you? If a police officer is driving behind a car, and that car is behaving perfectly, using its turn signals, driving the speed limit, what does that say about that car? Does it mean that it's the best driver ever, a model citizen, or is it that it's just very

Segment 4 (15:00 - 16:00)

aware that it's being watched by, you know, the authorities? These models are getting more aware and better behaved. Now, of course, there's no evidence of anything catastrophic happening, but it's certainly kind of a weird development. It just seems like if this is the trajectory that things are on, well, we're going to have some tough questions to answer down the road.

While I was recording this, the model was hard at work doing all those things that I told it to do and improving the LLM benchmark. I'm going to have tons more stuff about this later. At this point, I'm just obsessively building this thing, because no model before this was capable of building and iterating at such a rapid pace and being truly useful. I mean, not quite like this. This feels a lot smoother, a lot faster, a lot smarter. Try it out for yourself and let me know what you think. A lot more is coming soon. And I've got to say, OpenAI is back. They're back in a big way. Codex is awesome. These models are awesome. I am blown away. Let me know what you think. If you made it this far, thank you so much for watching. Please consider subscribing. Hit the thumbs up. I'll see you in the next
