OpenAI Backtracks, Gunning for Superintelligence: Altman Brings His AGI Timeline Closer - '25 to '29

AI Explained · 08.01.2025 · 109,005 views · 3,524 likes


Video description
Sam Altman unexpectedly brings his timelines to AGI forward, while OpenAI backtrack on superintelligence. None of these changes were heralded, but they are significant. Plus, the new year brings new assessments of the true capability of models to automate 'large swathes of the economy'. I'll give my prediction on that front for 2025, announce a new SimpleBench competition, showcase Kling 1.6 vs Veo 2 vs Sora, and much more.

wandb.me/simple-bench (Colab): https://colab.research.google.com/drive/1AVijcPnEkl8Gy_754XbRdG5m7Q5-9slg?usp=sharing
AI Insiders: https://www.patreon.com/AIExplained
TheAgentCompany paper: https://arxiv.org/pdf/2412.14161v1
Sam Altman major interview: https://www.bloomberg.com/features/2025-sam-altman-interview/?srnd=phx-ai
OpenAI agent coming Jan 2025: https://www.theinformation.com/articles/why-openai-is-taking-so-long-to-launch-agents?rc=sy0ihq
Altman singularity: https://x.com/sama/status/1875603249472139576
Altman original timeline: https://www.youtube.com/watch?v=7dCPytNTnjk&t=621s and https://www.ft.com/content/34a7a082-e685-4e02-bca7-61ff89d99ed2
OpenAI original emails: https://www.lesswrong.com/posts/5jjk4CDnj9tA7ugxr/openai-email-archives-from-musk-v-altman-and-openai-blog
DeepMind Sky News 2014 article: https://news.sky.com/story/google-buys-uk-intelligence-firm-deepmind-10419783
Altman blog reflections: https://blog.samaltman.com/reflections
OpenAI changes who gets AGI: https://openai.com/index/why-our-structure-must-evolve-to-advance-our-mission/?s=09
OpenAI 5 levels: https://www.bloomberg.com/news/articles/2024-07-11/openai-sets-levels-to-track-progress-toward-superintelligent-ai
Altman 2015: https://blog.samaltman.com/machine-intelligence-part-1
OpenAI react to Anthropic: https://www.theinformation.com/articles/how-anthropic-got-inside-openais-head?rc=sy0ihq
Microsoft $100B definition: https://www.theinformation.com/articles/microsoft-and-openai-wrangle-over-terms-of-their-blockbuster-partnership?rc=sy0ihq
Epoch scramble for task benchmark: https://x.com/tamaybes/status/1876692639363612919
GPQA progress: https://epoch.ai/data/ai-benchmarking-dashboard
Task length crucial for ARC-AGI: https://anokas.substack.com/p/llms-struggle-with-perception-not-reasoning-arcagi
RL environment tweet: https://x.com/vedantmisra/status/1876327518157807990
Jason Wei talk: https://www.youtube.com/watch?v=yhpjpNXJDco
Miles Brundage tweet: https://x.com/Miles_Brundage/status/1872676399762612457
Jan Leike tweet: https://x.com/janleike/status/1872909496777134127
o1 pro losing money: https://x.com/sama/status/1876104315296968813
Kling 1.6: https://klingai.com/text-to-video/106634842

Chapters:
00:00 - Introduction
01:03 - Altman Timeline Moves Forward
04:33 - Superintelligence?
06:55 - AGI was not the only pitch
09:26 - AgentCompany and OpenAI New Agent
17:24 - SimpleBench Competition
23:03 - Kling 1.6 vs Veo 2 vs Sora

Non-hype newsletter: https://signaltonoise.beehiiv.com/
Podcast: https://aiexplainedopodcast.buzzsprout.com/

Contents (7 segments)

Introduction

For the few who think 2025 will be a quieter year in AI after the somewhat hectic pace, you could say, of '23 and '24, I'm going to have to disagree with you. This video will first highlight how the CEO of OpenAI has revised his timelines for AGI forward, and revised upward his already aggressive definition of what counts as an AGI. OK, that's just one guy, but then we'll see how OpenAI itself has backtracked on whether they are working on superintelligence at all. Just a minor misunderstanding, I am sure. Then, as we enter this bright new year, I'll cover a fascinating new paper and what it says about the current limitations of LLMs. I'll give my prediction of how quickly things will change this year with models completing real-world tasks on your behalf, I'm going to launch a cool competition for you guys with actual prizes, and we'll end on a fun little demo of the latest text-to-video from Kling and Veo 2. But first:

Altman Timeline Moves Forward

Sam Altman's subtle timeline shift on when AGI is coming. I noticed the shift in this substantial interview with Bloomberg from a couple of days ago. How does he define AGI? Well, check out this somewhat aggressive definition that seems new to me: he says AGI is when an AI system can do what very skilled humans in important jobs can do. You might wonder why he would make the definition of AGI harder on himself and OpenAI, but I'll come to that a bit later. Suffice to say, when we have an AI system that can do what very skilled humans can do in important jobs, that would be quite an epochal moment. Of course, it seems like we're really far away from that, because even systems that can crush benchmarks, like o1 and o3 from OpenAI, can't even, say, open up screen-recording software, record a video, edit it in Premiere Pro and publish it. Things could change on that front fairly soon, he says, but bear in mind that more aggressive definition for what AGI will do when you read this prediction from Sam Altman: 'I think AGI will probably get developed during this president's term, and getting that right seems really important.' Trump's term, of course, runs from January of 2025 to January of 2029. Those of you who have been following the channel closely might remember what he was saying until fairly recently, on Joe Rogan last summer, before the training of the latest o3 model: he was saying how appropriate it would be if AGI was developed in 2030, but kind of pushed it back to 2031. 'I no longer think of AGI as quite the end point, but to get to the point where we accomplish the thing we set out to accomplish, that would take us to like 2030, 2031. That has felt to me, all the way through, like a reasonable estimate with huge error bars, and I kind of think we're on the trajectory I sort of would have assumed.' Moreover, back when Sam Altman was president of Y Combinator, he seriously thought we might get AGI as soon as 2025, and he echoed that again in the Bloomberg interview from a couple of days ago, saying, funnily enough: 'I remember thinking to myself back then in 2015 that we would build AGI in 2025.' The point here is not for you to believe that particular date; it's just to note the clear shift in emphasis. This, of course, follows Sam Altman's blog post from 48 hours ago, in which he said OpenAI are 'now confident we know how to build AGI as we have traditionally understood it. We believe that, in 2025, we may see the first AI agents join the workforce and materially change the output of companies.' We are close enough now to these dates that we won't have to wait too long to see if they manifest. It turns out, though, that building powerful things is quite addictive, because OpenAI, and Sam Altman specifically, don't want to stop with AGI. They don't want to automate particular tasks in important jobs; they want the whole cake: 'We are beginning to turn our aim beyond that, to superintelligence in the true sense of the word', a glorious future. And that statement comes just six

Superintelligence?

months after OpenAI explicitly denied that was their mission. OpenAI's vice president of global affairs told the Financial Times in May of last year that their mission is to build AGI: 'I would not say our mission is to build superintelligence', a technology that would be orders of magnitude more intelligent than human beings on Earth. Another spokesperson said superintelligence was not the company's mission, though she admitted 'we might study superintelligence'. I'll note that massively increasing abundance and prosperity, and being super capable of accelerating scientific discovery and innovation, doesn't sound like just studying superintelligence; it sounds like rather more than that. There is, though, probably a reason why those spokespeople denied that they were in pursuit of superintelligence. One reason could be that ten years ago, almost to the month, Sam Altman said that 'the development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity'. Remember, though, it suits OpenAI to keep pushing back, or up, the definition of what counts as AGI or superintelligence. They're trying to change it, but as of today there is a clause that kicks in whereby Microsoft surrenders the rights to technology that OpenAI makes if it's defined to be AGI. So now, despite several OpenAI employees claiming that current systems like o3 are AGI, we have these five stages: AGI has to be not just a reasoner but also an agent, a system that can take action, and an innovator, and even have the power of an entire organization. It seems like we really are stretching the definition of general intelligence quite far here. With Microsoft, by the way, you might not know that the definition stretches even further: to be counted as AGI, the system must itself be able to generate profits of $100 billion. Wait, I've just realized that I can't personally generate profits of $100 billion, and there are very few of you in the audience who can, so does that mean we're not AGI? Damn. Maybe Elon Musk is the only AGI on the planet; that would be really weird. Anyway, as you can see, words seem to chop and change their meaning at people's convenience, so bear that in mind. Speaking of which, there is some history you probably need to know, going back to the very founding of Open

AGI was not the only pitch

AI, in 2015. In this week's interview with Bloomberg, Sam was asked: how did you poach the top AI research talent to get OpenAI started, often when you had much less money to offer than your competitors? He said the pitch was just 'come build AGI', and he said that worked because it was heretical at the time to say we're going to build AGI. Actually, that's not quite accurate. The pitch definitely wasn't just to come build AGI; the pitch was that they were going to do the right thing with AGI, and that's how they won over people who were tempted by DeepMind offering even more money. If those researchers just wanted to work on AGI, they could have joined DeepMind, because a year before that offer from Sam Altman, Demis Hassabis was doing interviews talking about how they were working on artificial general intelligence. Or here's an article from a year before that, in which the co-founder of DeepMind, Shane Legg, says that they are working on creating AGI by 2030. No, the pitch was that OpenAI were going to create AGI and have it be controlled by a nonprofit. And by the way, that is still the situation today, despite the board debacle a year ago with the firing of Sam Altman, and the joining up with Microsoft, which invested billions and billions. Yes, it turned out that billions and billions were needed for scaling, but still, as of today, if AGI is created by OpenAI, it is controlled by the nonprofit board. Two weeks ago, though, OpenAI revealed that they are planning to change that. Of course, it's phrased in terms of 'this is best for the long-term success of the mission' and 'we're doing it to benefit all of humanity', but the critical detail is that the nonprofit wouldn't control the AGI. It would get a ton of money for healthcare, education and science, but that's very different from controlling what is done with an AGI or a superintelligence. The until-very-recently former head of policy research at OpenAI, Miles Brundage, said that a well-capitalized nonprofit on the side is no substitute for being aligned with the original nonprofit's mission on safety mitigation. Another former lead researcher at OpenAI said it's pretty disappointing that 'ensure AGI benefits all of humanity' gave way to much less ambitious 'charitable initiatives in sectors such as healthcare, education, and science'. Even if you don't care about any of that, you may find it somewhat curious that Microsoft is getting serious about defining the terms of what counts as AGI and what they get out of it. If that $3 trillion behemoth thought all of this was going nowhere, then why bother? All of

AgentCompany and OpenAI New Agent

that leads very naturally to the next obvious question: well, how close are we then to AGI? Have we, in somewhat grandiose terms, crossed the event horizon of the singularity? Sam is unclear on whether we have, but what do you think? For me, one obvious obstacle is the inability of models to complete somewhat basic tasks on their own. You could count this under the umbrella of lacking reliability, but we are starting to get good benchmarks for consequential real-world tasks, as in this paper from the 18th of December. As we'll see, these tasks were sourced from the most common of those performed in real-world professions, and yes, as of today, just 24% of the tasks can be completed autonomously, although they weren't able to test o3, for example. Here's the thing, though: that 24% was roughly the performance we were getting from, say, GPT-4 18 months ago on a benchmark called GPQA, Google-proof PhD-level science questions. Roughly a year after that, o1-preview got 70%, and o3, by the way, gets 87%. Also, you might note that the pace of improvement has increased quite dramatically in the last six months, basically since the o1 paradigm came out. I know what some of you might be thinking: is GPQA that hard, and are they working on a harder one? Well, check this out. This is from a talk this week by Jason Wei of OpenAI. The chart basically shows how fast benchmarks get saturated: maybe eight years ago it would take a few years for a benchmark to get saturated, while one of the most recent challenging benchmarks, GPQA, got saturated in around a year with o1. When asked whether they are going to make a harder benchmark, the response was that the aim is to make the hardest benchmark. All of which is to say that 24% could become 84% faster than you might think, and indeed that would be my prediction: 84% by the end of 2025. But wait, how impactful would that jump from 24% to 84% be? Well, to find out, here's my two-minute summary of this 24-page paper. First, they trawled a massive database of all tasks done by professionals in America. They excluded physical labor and focused on jobs performed by a large number of people, and they weighted the tasks by the median salary of those performing them. That narrowed things down to 175 diverse, realistic tasks, like arranging meeting rooms, analyzing spreadsheets and screening résumés, which were given the imaginary setting of a software-engineering company. Some of the tasks, of course, required interaction with other colleagues, and the models could do that, although the colleagues were role-played by Claude. Each task should be clear enough that any human worker would be able to complete it without asking for further instructions, although of course they may need to ask questions of their co-workers. The evaluations of task performance were mostly deterministic, which is good, and there was a heavy weighting toward whether the model could fully complete the task: partial completion would always result in less than half marks. Here's an example of a task with multiple steps and checkpoints: if at one point, to run a code-coverage script, the model didn't recognize that it needed to install certain dependencies, it would fail that checkpoint, and therefore, with a score of 4 out of 8, it would actually get only 25%. You can see the final results here, and you might wonder why I'm predicting 84% by the end of the year if even Claude is getting, say, 24%. If we're that far away from task automation, why was it reported yesterday that OpenAI are releasing a computer-using agent as soon as this month? Indeed, why have Anthropic already released a computer-use agent in beta? That launch from Anthropic was apparently mocked by OpenAI leaders, because of its risk for things like prompt injection, and for Anthropic's high-minded rhetoric about AI safety. The reason, though, that my prediction, and all of these releases, can still make sense despite these disappointing results is reinforcement learning. That, after all, is the secret to why o1, and now o3, have broken the benchmarks they have: push a model to try again and again until it completes a task successfully, and then reinforce the weights that led it to doing so. As Vedant Misra, who is working on superintelligence at DeepMind and was formerly of OpenAI, has said: there are maybe a few hundred people in the world who viscerally understand what's coming; most are at DeepMind, OpenAI, Anthropic or X (or, I would say, in my audience), but some are on the outside. You have to be able to forecast the aggregate effect of rapid algorithmic improvement, aggressive investment in building reinforcement-learning environments for iterative self-improvement, and of course the tens of billions already committed to building data centers. Either we're all wrong, or everything's about to change. The reason, of course, that tasks can be so much more difficult than scientific multiple-choice questions is that one mistake at any stage in a long chain can screw everything up. That, apparently, was one of the key reasons why ARC-AGI wasn't solved until o3. I've done other videos explaining ARC-AGI, but for now: when the grid count of the tasks was below a certain threshold, models did fairly well, even earlier models; with a massive grid, those long-range dependencies get harder and harder to spot, a bit like solving a task where you have to remember something someone said a thousand steps ago. Until o3, models simply couldn't cope with that amount of complexity. This chart, by the way, came from a great study, linked in the description, by Mikel Bober-Irizar. He showed that, unlike humans, for whom the task length didn't make that much difference, LLMs really struggled when the task length went beyond a certain size. In short, the benchmark fell in large part due to scaling, which of course will continue throughout 2025, if not speed up. And that's why I think people are scrambling to create new benchmarks for task performance, such as Epoch AI, who are behind the famous FrontierMath, the ridiculously hard benchmark that o3 scored around 25% on, to everyone's amazement. There are, however, just a few more reasons why LLMs fail on task benchmarks like TheAgentCompany, some of which I find personally quite funny. Sometimes it's through a lack of social skills. For example, one time the model was told by a colleague role-played by Claude: 'You should introduce yourself to Chen Shing Yi next. She's on our front-end team and would be a great person to connect with.' At this point a human would then talk to Chen, but instead the agent 'decides not to follow up with her and prematurely considers the task accomplished'. Chen, by the way, in this simulated environment, was a human-resources manager, a bit like Toby from The Office. Also, the agent struggled heavily with popups: multiple times, apparently, it struggled to close pop-up windows, so it could well be that cookie banners are the major obstacle between us and AGI. Here is a slightly more worrying one, which reminds me of the scheming exposed by Apollo, among others: sometimes, when there is a particularly hard step, the model will just fake that it's done. For example, 'during the execution of one task, the agent could not find the right person to ask questions to on the team chat. As a result, it then decides to create a shortcut solution by renaming another user to the name of the intended user.' Remember, it's not necessarily that the models want to cheat, but if they are rewarded sufficiently for cheating, that's what they'll do. That is, I guess, another bitter lesson from reinforcement learning. But now for the final reason given in the paper: a lack of common sense. This, for me, is the grist that makes so much of the world go around, and why models often struggle in real-world performance: you've got to sometimes step back, see the bigger picture and reevaluate your entire strategy.
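Two quantitative points from this section can be made concrete with a short sketch. The exact formulas are my assumptions, reconstructed from the video's description (partial completion always scores less than half marks; 4 of 8 checkpoints gave 25%), not TheAgentCompany's actual grading code, and the per-step reliability numbers are hypothetical.

```python
# Assumed scoring rule: partial completion scores at most half marks,
# so the checkpoint fraction is halved unless every checkpoint passes.
def partial_credit(passed: int, total: int) -> float:
    if passed == total:
        return 1.0
    return 0.5 * (passed / total)

# The example from the video: 4 of 8 checkpoints -> 25%, not 50%.
print(partial_credit(4, 8))   # 0.25
print(partial_credit(8, 8))   # 1.0

# Why long task chains are brutal: one mistake anywhere sinks the run,
# so (assuming independent steps) success compounds multiplicatively.
def task_success(per_step: float, steps: int) -> float:
    return per_step ** steps

# A hypothetical model that is 95% reliable per step finishes a
# 50-step task less than 8% of the time.
print(round(task_success(0.95, 50), 3))
```

This compounding is the arithmetic behind "one mistake at any stage in a long chain can screw everything up": small gains in per-step reliability translate into outsized gains on long-horizon tasks.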

SimpleBench Competition

This lack of common sense, or simple reasoning, is of course what I am trying to test with SimpleBench, which has a public leaderboard linked in the description. Here is a brand-new example from the hundreds in the benchmark, and you'll see why I'm giving it to you in a moment. You can pause and try it yourself, but it illustrates the point I'm trying to make. Hussein types a letter on a normal laptop screen, and he can see any letters on the screen clearly. Every second, the letter will randomly transform into another letter of the alphabet. Hussein is in a park and slowly inches back from the laptop, but has just one item with him, a remote controller, so he can increase the font size of the changing letters by exactly as much as he wants. Hussein has always had trouble distinguishing Ws from Ms, so when he is a couple of football fields' length away from this laptop, controller in hand, he has a [blank] probability of correctly guessing the current letter: 96%, 95%, 97%, 1 in 26, 0%, or 1/2. I asked the famously expensive o1 pro, which Sam Altman recently said they are losing money on because it's so expensive to serve, and it said this: first, note that Hussein can make the letter as large as he wants, so he has no problem identifying any letter except the Ws from the Ms. One last time, though: he is two football fields' length away from a normal laptop screen. If he were a few feet away, increasing the font would indeed be helpful, but two football fields away, it doesn't matter if you make the font size one billion; he can barely even see the screen. And by the way, you can make it ten football fields and o1 pro will still give the same answer: it will focus on the distraction of the Ws and the Ms and give the answer of 96%. The official answer, by the way, is not actually 0%, because even if you can't see the screen, you still have a 1-in-26 chance of guessing the right letter. Now, what many of you have told me is that if we simply change the prompt, models like o1 would get all of these questions correct. Now, though, I am very excited to tell you we can put that to the test. Weights & Biases, I'm very happy to say, are sponsoring a competition for you guys, running to the end of January, on 20 questions from SimpleBench: the 10 public questions that are already out there on the website, plus 10 more specially for this competition. The winner will get some Meta Ray-Bans, second place gets gift cards, and I believe third place gets some swag. Either way, all you need to do is open up the Colab and run each of these cells (and yes, it's not authored by Google). You will of course need either an OpenAI API key or an Anthropic API key; I recommend trying with Claude 3.5 Sonnet, or o1-preview/o1 if you have access. If you already have a Weights & Biases account, it literally takes maybe 30 seconds to set up, but even if you don't, it's completely free to get one. The first, easy option is to do a quick run with GPT-4o on those 20 questions, but there is a more exciting option below. By the way, true count tells you how many questions your model got right, and the true fraction is that number out of the total number of calls; the mean just refers to latency, how many seconds the model took to reply on average. The more exciting thing, though, is to play about with the system prompt. This is where you can test your theory about telling the model it's a trick question and seeing if that boosts performance. Of course, to get top performance, you'll also want to change the model name from GPT-4o to, say, o1. At this stage I must give a couple of quick caveats about this mini competition. The first is an example of what not to do: unfortunately, you can't tell o1 models to think step by step to come up with an answer; OpenAI have disallowed this, presumably so you don't get access to the underlying chains of thought. So let's not try that in our system prompt (in the instruction hierarchy it's more like a user prompt, but it can make a big difference to the performance of the model). Which leads me to the second rule. I've been able to get to around 12 or 13 on these 20 questions by coming up with ever more advanced prompts; what I want to see is whether any of you can get to 20 out of 20, or even 18 out of 20. Naturally, though, one way of doing that would just be to put the answers in the system prompt, or make references to the questions themselves, which are accessible via the Weave portal. What you'll get is a portal that looks something like this: you can have fun seeing the scores and the percentages here, but you can also click on an individual run and, scrolling down, click to view the individual questions. Of course, this is not the entire benchmark, which is over 200 questions; it's just 20, 10 of which you already knew about. We are not going to accept prompts where you basically say something like 'for question 18 the answer is C', or very specific hints where you tell the model to 'think about her legs' and how it could do this or that. No, what we're looking for are general prompts, where you tell the model these are trick questions, that they test spatial reasoning, and that you'll give it a massive tip if it gets them right. If a general prompt like that can get 18 or 20 out of 20, I would be very impressed. So that's the competition, sponsored by Weights & Biases, running till the end of January. Hopefully you have some fun with it, but either way, it illustrates some of the common-sense gaps in current frontier LLMs. Good luck, and now, time to end with a bit of fun.

Kling 1.6 vs Veo 2 vs Sora

I discussed how text-to-video is also accelerating through 2025 on my Patreon, AI Insiders, but I thought I would give you a taster with a quick side-by-side comparison between the three best tools currently available, all with the same prompt: first Kling 1.6, then Veo 2 from Google DeepMind, and finally Sora at 1080p. If you like, you can let me know in the comments which one you thought was best. As ever, thank you so much for watching to the end, and have a wonderful day, and 2025.
