OpenAI Tests if GPT-5 Can Automate Your Job - 4 Unexpected Findings

AI Explained · 26.09.2025 · 67,526 views · 2,585 likes


Video description
An OpenAI report released in the last 24 hours is the best look we have at whether 2025 AI can automate your job. I'll go through 4 unexpected findings, from which model is best at what, to practical tips and massive caveats. Plus UFC robots, a radiologist essay, don't trust videos, and the blockers to the singularity.

Gray Swan: https://app.grayswan.ai/ai-explained
AI Insiders ($9!): https://www.patreon.com/AIExplained

Chapters:
00:00 - Introduction
00:55 - OpenAI Report Summary
02:40 - Tipping Point Speed-up
04:11 - Better than Industry Experts?
06:33 - Big Caveat
11:10 - Karpathy and the Radiologist Analogy
13:30 - Outro

GDPval: https://cdn.openai.com/pdf/d5eb7428-c4e9-4a33-bd86-86dd4bcf12ce/GDPval.pdf
GDP Impact: https://fred.stlouisfed.org/release/tables?rid=331&eid=211
Task List: https://www.onetonline.org/link/summary/11-9141.00
Summers Tweet: https://x.com/LHSummers/status/1971252567981146347
Emad: https://x.com/EMostaque/status/1971254153067593739
Robots: https://x.com/cixliv/status/1967663286679478759
Unitree G1: https://x.com/UnitreeRobotics/status/1970039940022239491
Don't Trust Video: https://x.com/AISafetyMemes/status/1970453369446871420
AGI Tweet: https://x.com/hyhieu226/status/1968378785709133915
Blockers to the Singularity: https://www.patreon.com/posts/blockers-to-and-139264812
Framework: https://gemini.google.com/share/f4b9c85a6ae9
METR Study (Dev Slowdown): https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
Karpathy Tweet: https://x.com/karpathy/status/1971220449515516391
Radiology Essay: https://worksinprogress.co/issue/the-algorithm-will-see-you-now/
Non-hype Newsletter: https://signaltonoise.beehiiv.com/
Podcast: https://aiexplainedopodcast.buzzsprout.com/

Table of contents (7 segments)

Introduction

In the last 24 hours, OpenAI have released research on essentially whether current language models can automate your job. The big claim, albeit carefully worded, is that the current best frontier models are approaching industry experts in deliverable quality. But as you'll see from the title, there are plenty of unexpected findings in this research. Before I dive into that, there is one job we seem intent on automating, and that is being a UFC fighter. You can laugh at the lack of performance now, but like me, you might be laughing somewhat nervously. Take a look at this Unitree G1 robot, which maybe hasn't mastered kung fu, but it's getting a bit closer. Quick prediction: do you reckon billionaires will have humanoid robot bodyguards by 2035? Let me know. Back to the paper: they are only focusing on the most important sectors according to

OpenAI Report Summary

their contribution to GDP. What makes things more interesting is that the questions weren't designed by OpenAI. They were designed by industry professionals themselves, with an average of 14 years of industry experience, and they had to meet all sorts of criteria just to design the questions. And here are the headline results, which you may have seen go viral, with Claude Opus 4.1, a model by Anthropic, beating out OpenAI's models and coming quite close to parity with industry experts. This I am obviously going to class as the first surprising finding. Not that Opus is the best model, because Opus 4.1, if you haven't tried it, is indeed an amazing model. So no, that's not the most surprising bit. It's that OpenAI published this result showing Opus beating its own models. I think that's great, honest science, by the way, and I commend OpenAI for publishing this. Now, you might be thinking, no, Philip, the most surprising bit is how close we're getting to parity with industry experts, but I'll come back to that in just a moment. Right now I want to cover the second, you could say somewhat surprising, result, which is that the win rate when compared to humans depended quite heavily on the file type involved. If your workflow involves submitting or producing a PDF, PowerPoint, or Excel spreadsheet, you might well find that Opus 4.1 is a league ahead. All these figures, by the way, are on how often a model beats a human expert output, as judged by a human expert. You may want to pause this one to look across the different sectors, and you may or may not find it surprising that it's in government where we have a model beating the average human expert. Personally, I'm a little bit skeptical that Gemini 2.5 Pro scored so badly across these metrics. I find it a really great model, but then again, Gemini 3 might well be

Tipping Point Speed-up

around the corner. The third potentially unexpected finding is that we seem to have passed a tipping point where models tend to speed up human experts. To briefly summarize this table: if a model is too weak, even if you let it try a task multiple times and then only use its output if you judge it to be satisfactory, that doesn't actually speed you up. Essentially, the time to review its output is just badly spent, and it's not worth it; you might as well just do it yourself alone. However, by the time we get to GPT-5, this does actually speed you up. You may have experienced this yourself, but GPT-5 does a good enough job often enough that, across the board on average in these industries (and I'll get to which ones in a second), you are slightly sped up. There are two fairly critical caveats to this unexpected finding, though. One is: where is Claude 4.1 Opus? Surely that would have had an even greater speed improvement for the human experts. And second, the bar for acceptance for what these models were producing was meeting the humans' quality level. They call it the quality bar, as judged by those humans. But what if those humans can't always spot the subtle errors that the models output? It reminds me of that developer study that METR did, where the experts thought that they were being sped up by 20%, but they were actually being slowed down by, I think, around 10-20%. Now, though, to the biggest finding of all, and the big claim in the paper, as reiterated by Lawrence Summers. He's a famous economist and

Better than Industry Experts?

former president of Harvard, I believe. He said that these are task-specific Turing tests; models can now do many of these tasks as well as or better than humans. If that's generally true, then that would lend support to claims like this one from another OpenAI researcher, which is that their current systems are AGI. For example, one of their unreleased models was able to beat every single human in one particular coding competition. Logically, that makes some initial sense, right? If it can beat these experts at coding competitions and at least match experts across a whole range of domains, why wouldn't that be AGI? The former founder of Stability AI implied, "We're close to a tipping point," the implication of course being that we will then start to see the automation of jobs wholesale. Well, I would say one of the big unexpected findings was how robust human jobs seem to be to automation by current-generation LLMs. The evidence from this paper to me suggests that we will need a further step-change improvement in model performance to start genuinely automating whole swathes of the economy. Why would I say that when they just said in the abstract that the current best frontier models are approaching industry experts in deliverable quality? Like, we're really close, right? Not really, when you dive into the details of the paper. First, the paper admits that if you look at adoption rates, the picture doesn't look so great for AI. And I covered that in a recent video, with many companies dropping their pilot projects. But those are lagging indicators, as is GDP growth. It takes time for people to realize how good these models are, so those metrics will be lagging indicators. Fair enough; it will take time for AI to diffuse. So they're just going to focus on what current-gen AI can actually do. Here are some of the tasks. If you were a manufacturing engineer, for example, then you were asked in this study to design a 3D model of a cable reel stand for an assembly line.
All the other models were given the same task, and then the results were compared, blind-graded. What then is the problem, if these tasks were designed by industry experts and then blind-graded? Surely that shows that the models are almost at human-expert-level industry performance. Even on task length, these tasks required an average of 7 hours of work for the expert, so these are realistic tasks. Well, first, they excluded those occupations whose tasks were not predominantly digital. I had to dig quite a long way through the appendices to work out how they did this, but I want to give you just an

Big Caveat

example of the kind of thing they did. They looked at this table and found only those sectors that contributed at least 5% to US GDP. Then they found five occupations, weighted by salary, whose work was predominantly digital. Take manufacturing: all five of these occupations have predominantly digital work, apparently. But then, of course, if you dig into the data where they got that from, there are countless occupations within that category whose work is not predominantly digital. So for every one or two that made it into the paper, there were of course loads that did not. Okay, but what about those occupations that are predominantly digital? Well, even there, they didn't look at all of what those occupations did. I took just one of the occupations rated as predominantly digital, property manager, and categorized all 27 of the tasks it was listed as doing in the official records. This was from O*NET, which is the same source that OpenAI used. GPT-5 Pro, ironically saving me lots of time, categorized it thusly, with about six or seven of the tasks rated as not being primarily digital: things like overseeing operations and maintenance, coordinating staff, and investigating complaints and violations. The obvious point being that even if we can automate the 19 or 20 tasks that are obviously digital within this predominantly digital occupation, that wouldn't eliminate the job entirely. In fact, that profession might get even better paid, as we're going to see in a moment for radiologists. So: not all sectors, not all occupations within each sector, and not all tasks within each occupation. Fine. But what about the actual tasks themselves? Well, they were super realistic, and you can look at the range of industries involved, from Apple to the US Department of War, Google, and BBC News, for example.
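To make the task-level arithmetic behind that point concrete, here is a minimal sketch. The task labels are illustrative stand-ins, not the actual O*NET records: it just assumes, per the rough categorization above, that about 20 of the 27 property-manager tasks were judged primarily digital.

```python
# Illustrative sketch: what share of an occupation's O*NET task list is
# even in scope for automation? The counts follow the video's rough
# categorization (about 20 of 27 tasks digital); the labels are made up.

property_manager_tasks = (
    [("digital task", True)] * 20        # e.g. preparing budgets and leases
    + [("non-digital task", False)] * 7  # e.g. overseeing maintenance,
)                                        # coordinating staff on site

digital = sum(1 for _, is_digital in property_manager_tasks if is_digital)
share = digital / len(property_manager_tasks)

print(f"{digital}/{len(property_manager_tasks)} tasks digital = {share:.0%}")
# -> 20/27 tasks digital = 74%
```

Even perfect automation of the digital tasks would leave roughly a quarter of the occupation untouched, which is the sense in which the job, as opposed to some of its tasks, survives.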
But first, they were somewhat subjective, with even the human experts having only 70% agreement between themselves about which answer was better, the model answer or the human gold deliverable. Next, sometimes it was obvious which answer was the model output, because OpenAI models, for example, would often use em dashes, and Grok would occasionally randomly introduce itself, apparently. More fundamentally, though, the tasks were one-shot: here's the task, get it done. Of course, in a real job there's much more interactivity, where you ask questions of the person giving you the task to find out its scope and parameters. Also, they had to exclude tasks that relied on too much context, like the use of proprietary software tools. Then there were the catastrophic mistakes. They admit: "One further limitation of this analysis is that it does not capture the cost of catastrophic mistakes, which can be disproportionately expensive in some domains." They give some examples of catastrophic answers, and I'll give one of my own. They said something could go dangerously wrong, like insulting a customer or suggesting things that will cause physical harm. This happened apparently 2.7% of the time. Here's something to consider: if the damage done by those catastrophic failures is 100 times worse than the cost savings you get from the model being better, then, weighted by impact, using "agentic AI" without a human in the loop could cost you more in the long run. Here's my example from a recent bit of coding, where Claude admits, and I saw it do this, that it completely hallucinated a price set for a particular model. "You're absolutely right," it said. "I apologize for making up those credit numbers. That was incredibly irresponsible of me. Let me check the actual values." It thought about it, then said, "Yes, your diagnosis is 100% correct. I apologize again for making up those credit values." You would have to be a pretty irresponsible employee, or a downright fraudster, to make up such critical values without asking anyone. This was Claude 4.1 Opus, by the way. I am open-minded, though. Let me know what you think in terms of whether there will be more real-life human fraudsters, or just complete dotards, making these kinds of mistakes, versus these catastrophic hallucinations from models.

Speaking of catastrophes, by the way, you can help avert certain catastrophes by joining in the Gray Swan Arena; link in the description. Essentially, you're rewarded with real human money for breaking an AI, for jailbreaking LLMs. Several of my own subscribers have joined in these competitions and won prizes. Actually, you can see in the corner that $350,000 worth of rewards have been distributed. And, scrolling down, I can see that there is a competition that is live and in progress as we speak, their Proving Ground one. As I've mentioned before on the channel, I see this as a win-win: you can gain recognition and money, and the AI gets just that bit more secure. One more limitation, and

Karpathy and the Radiologist Analogy

then I'm going to end on a positive. I think Andrej Karpathy, formerly of OpenAI, made a fantastic point in this recent tweet. In 2015-16, Geoffrey Hinton famously predicted that we shouldn't be training new radiologists. And Karpathy linked to this article, which is indeed a great one. It said that there were models released back in 2017 that could detect pneumonia with greater accuracy than a panel of board-certified radiologists. I can just imagine the clickbait that could have been written about that study. So how come, eight years later, radiologists have an average salary of over half a million dollars per year, which is 48% higher than in 2015? Well, some of this is about to sound familiar, but there were issues with training data not covering edge cases. There were, of course, legal hurdles, and just like in the paper we just read, there were also tasks within radiology that didn't involve such automation, like talking to patients. As best I could, I recently tried to delineate each of the blockers to the singularity, as I called it, in my recent Patreon video from the 19th, and I'm going to link in the description this framework that I created. None of these are unsolvable, but understanding each one will help you read beyond the headlines. Now, let's spot some more patterns, because the AI for radiology didn't cover all tasks. It focused on the big ones, like stroke, breast cancer, and lung cancer. What about things like vascular, head and neck, spine, and thyroid? Well, relatively few AI products; think of those tasks not covered in that spreadsheet. Then, if you're a child or an ethnic minority, these AI tools perform worse. And think of the analogy with LLMs: outside of English, they don't do as well. Notice how the study only focused on US GDP. Then there's the fact that OpenAI, for example, keep hiring new people, despite designing a tool that's designed to automate AI research. Likewise, in radiology, headcounts and salaries just continue to rise.
Karpathy's prediction: we will have more software engineers in 5 years than we have now. Just to end, though, I would say don't sleep on this multiplier. You could be sped up by AI even if it can't automate your job. The AI in Descript, for example, can't fully edit my videos, but it does speed up my own editing of videos. Understanding AI and getting familiar with using it is still, I think, one of the best bets you can make in content creation. There is

Outro

one tipping point I think we have reached, which is that, visually at least, we can't fully trust that we are seeing the human we think we are, at least on video. Thank you so much for watching to the end. I didn't cover ChatGPT Pulse, even though I am a Pro subscriber, because it wasn't rolled out to me. I wonder if it's blocked in the UK; I tried everything. Having said that, it does seem to be a replacement for scheduled tasks. Do you remember that from January, where you could ask ChatGPT to perform a task at a certain time? It never worked, kind of flopped, and then everyone forgot about it. But now we have Pulse, so let's see if that does any better. Have a wonderful
