Did OpenAI FAKE AGI ? (Controversy Explained)
13:30


TheAIGRID · 22.12.2024 · 78,337 views · 1,448 likes


Video description
Learn AI Free for the first 30 days: http://brilliant.org/TheAIGRID
Join my AI Academy: https://www.skool.com/postagiprepardness
🐤 Follow Me on Twitter: https://twitter.com/TheAiGrid
🌐 Check out my website: https://theaigrid.com/

00:00 Initial controversy
01:15 Training details
02:15 Engineer comments
03:19 Benchmark creators
05:52 Sponsored segment
07:02 OpenAI responses
09:08 Training clarification
10:05 Frontier math results
11:05 Benchmark explained
12:05 Expert opinions
13:28 Final thoughts

Links From Today's Video:
https://x.com/rhythmrg/status/1870602244103766258
https://www.youtube.com/watch?v=K-zQPqGAB0g

Welcome to my channel, where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos. Was there anything I missed?

(For Business Enquiries) contact@theaigrid.com

Music Used:
LEMMiNO - Cipher: https://www.youtube.com/watch?v=b0q5PR1xpA0 (CC BY-SA 4.0)
LEMMiNO - Encounters: https://www.youtube.com/watch?v=xdwWCl_5x2s

#LLM #Largelanguagemodel #chatgpt #AI #ArtificialIntelligence #MachineLearning #DeepLearning #NeuralNetworks #Robotics #DataScience

Table of contents (11 segments)

Initial controversy

So yesterday was a monumental day in AI, because many were speculating that OpenAI had basically achieved AGI, and many who work at OpenAI also made this claim. Recently we had a very interesting discussion on Twitter claiming that the AGI benchmark demo we saw isn't what we think it is. It all started with this tweet: "I'm getting GPT-4 vibes from this announcement, when in March 2023 we had some massive progress. How did they get such a jump on ARC-AGI?" Then this person, Niels Raj, responded by saying they did it by training on 75% of the training set, and this refers to something OpenAI shared: that the o3 they tested was trained on 75% of the public training set, and that they have not shared more details. This is basically referring to a note on the ARC-AGI web page, where, next to the o3 results, it says: note on "tuned": OpenAI shared that they trained the o3 we tested on 75% of the public training set; they have not shared more details, and they have not tested the ARC-untrained model to understand how much of the performance is due to the ARC-AGI data.

Training details

Then we also had the notable AI critic Gary Marcus, who said, "Wow, if this is true, this raises serious concerns about yesterday's announcement." This is of course referring to the fact that if the model was, quote unquote, trained on the training set, then it would have seen some of the examples; that was the presumption. Interestingly, someone also said there was an "Altmanian slip" in that o3 presentation: the engineer said that they specifically targeted the ARC benchmark, and Altman immediately corrected him, saying that they didn't do anything special; and Altman had no need or reason to say anything, unless it was a lie, he could also have said nothing. What he's referring to here, and I'm going to show you the clip in a second, is an "Altmanian slip", a play on the term "Freudian slip", which is a verbal or physical error that occurs when an unconscious thought interferes with what someone meant to say. Now take a look at this small clip, because I do think this is a little bit interesting, but as

Engineer comments

this video evolves, you'll start to see that the claims don't hold much ground. In the clip: "It's also a benchmark that we've been targeting, that's been on our mind for a very long time, so we're excited to work with you in the future." "Worth mentioning that we didn't, uh, target, and we think it's an awesome benchmark, but we didn't go do anything specific; this is just, you know, the general o3. But yeah, really appreciate the partnership, and this was a fun one to do." That clip is basically where the engineer said that they targeted the benchmark, but Sam Altman stated that they didn't. Now, one of the co-creators of the ARC-AGI benchmark has actually responded to this. He says: "Raising visibility on this note we added to address the 'ARC-tuned' confusion. It said OpenAI shared that they trained the o3 we tested on 75% of the public training set." And he said this is the explicit purpose of the training set: it is designed to expose a system to the core knowledge priors needed to beat the much harder private evaluation set. The idea is that each training task shows you an isolated, single prior, and the evaluation set requires you to recombine and abstract from those priors on the fly. Broadly, the evaluation tasks require

Benchmark creators

utilizing three to five priors, and the evaluation sets are extremely resistant to simply memorizing the training set, which is why o3 is impressive. He also stated that the other reason he doesn't think this matters is that it's nearly certain ARC's training set, and mind you the evaluation set too, was included in the pre-training for GPT-3, GPT-4, and GPT-4o, because those have been hosted in a public GitHub repo since 2019. And the craziest thing about that is that those other models don't seem to perform well on the ARC-AGI benchmark, which means that if the data was in the pre-training, then o3 would still represent a different kind of breakthrough, because in that case we should have seen those other models, like GPT-4 or GPT-4o, perform well on those benchmarks too. Someone else also asked: "Where's your result without this data set? What's the scientific value of this number if you explicitly call out this limitation? What do we learn from this?" And he said: "We want to do the ablation study; we just didn't have enough time before the announcement, and I still don't think this takes anything away from the fact that o3 is a uniquely capable system." And this is something I do agree with when we take a look at what o3 is capable of, which I will mention later in the video. This is also where François Chollet, the other individual who created the ARC-AGI benchmark, decided to respond. He says: "This is the point of the training set, though: to train your model on it. It would have been more impressive if the model had no prior exposure to the ARC data, but the fact that the model was adapted via the training set absolutely does not invalidate its score." So you can see that the creators of the benchmark also do not think this invalidates its score. Of course, we are going to have notable AI critics like Gary Marcus saying things when it comes to criticizing AI.
Like I've always said, I do think it is important to have an even-handed discourse in AI, because that is how you actually make progress, rather than just sitting in a hype bubble; that is how you look at your flaws and then decide: okay, if this isn't real reasoning, how do we achieve real reasoning and get to true AGI? Of course, there is some more information, because a lot of people really have missed the mark on this one. Speaking of AI: if you've ever wondered how the large language models we talk about actually work under the hood, well, this is where today's sponsor, Brilliant, comes in. Brilliant is

Sponsored segment

where you learn by doing, with thousands of interactive lessons in math, data analysis, programming, and AI. What I personally love about Brilliant is that they break down complex concepts into intuitive, hands-on problems so that you can build real understanding from the ground up. Their courses on AI and programming are particularly fascinating: you'll start building programs on day one with their built-in Python editor, and they have an excellent course that lets you peek under the hood of LLMs to understand the concepts powering today's AI technology. Instead of just watching passive lectures, you're actively solving problems and building intuition, which has been proven to be six times more effective for learning. What I especially appreciate is how Brilliant helps you develop a consistent learning habit: in just a few minutes a day you can level up your programming and AI skills, whether you're commuting, taking a break, or waiting for one of your AI models to train. If you want to try it out, the first 30 days are completely free when you sign up with my link, brilliant.org/TheAIGRID, or alternatively you can scan the QR code on screen right now or just click the link right there. You can see here that roon, an OpenAI member of technical staff, says, "Oh man, they trained on the

OpenAI responses

training set, it's all over now," and he's basically using sarcasm to say that this is not the case. When we take a look at what's going on here, we can look at what this individual, who is conducting research at OpenAI, stated: "The model we used for all our o3 evaluations is fully general. A subset of the ARC-AGI public training set was a tiny fraction of the broader o3 training distribution, and we didn't do any additional domain-specific fine-tuning on the final checkpoint." This is in response to a tweet from someone else who works at OpenAI, who says: "To anyone wondering if the high ARC-AGI score is due to how we prompt the model: I wrote down a prompt format that I thought looked clean, and then we used it. That is the full story." So you can see that the individuals working at OpenAI have clearly stated that they didn't do any additional domain-specific fine-tuning on the final checkpoint. I know a lot of people would have thought, okay, because it says "tuned", that means they fine-tuned the model on that benchmark, but this is not the case. You can also see someone ask: "Was anyone on the team aware of and thinking about ARC and ARC-like problems as a domain to improve at when you were designing and training o3? This is the distinction between succeeding as a random side effect and succeeding with intention." That is potentially referring to the fact that you can target a benchmark, which is sometimes what tends to happen when people are trying to build these models. You can see he says: "No, the team wasn't thinking about ARC when training o3. People internally just see it as one of the many other thoughtfully designed evaluations that are useful for monitoring real progress." Someone also asked what "tuned" means here, and another OpenAI researcher says: "It's a strange way of denoting that we included ARC training examples when we were training o3. It isn't some fine-tuned version of o3; it's just o3." And this is of course a very big difference, because fine-tuning something on certain pieces of data is completely different from just having those pieces in its training data.
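For context on what "ARC training examples" are: the public ARC tasks have been hosted on GitHub since 2019 as small JSON files, each with a few "train" demonstration pairs and "test" pairs whose output grids a solver must predict exactly. The sketch below is illustrative only; the toy task and its `solve` rule (mirror each row) are invented, but the JSON shape and the exact-match scoring follow the public ARC repo's conventions.

```python
import json

# A toy task in the ARC public-repo JSON shape: "train" pairs demonstrate
# a transformation, "test" pairs are what the solver is scored on.
# This invented task's rule: mirror the grid horizontally.
task = json.loads("""
{
  "train": [
    {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
    {"input": [[3, 0, 0]],      "output": [[0, 0, 3]]}
  ],
  "test": [
    {"input": [[0, 5], [0, 6]], "output": [[5, 0], [6, 0]]}
  ]
}
""")

def solve(grid):
    # Hypothetical solver for this one task: flip each row left-to-right.
    return [row[::-1] for row in grid]

# ARC scoring is exact match on the entire output grid; no partial credit.
score = sum(
    solve(pair["input"]) == pair["output"] for pair in task["test"]
) / len(task["test"])
print(score)  # 1.0
```

This is why "training on the training set" is less damning than it sounds: memorizing these demonstration pairs doesn't hand you the private evaluation tasks, which require recombining several such priors on the fly.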

Training clarification

So those are two different things, and that is definitely a strange way of denoting it, which I think caused some confusion. Of course, I'm guessing the team internally wasn't really trying to target the ARC benchmark; it was just an unintended side effect. Whether or not that's true, you're going to have to be the judge. Now, with all of this information, with some people saying yes, they did target the benchmark, and others saying no, they didn't, yada yada, we clearly see that both creators of the benchmark are basically saying this doesn't invalidate its score, and that the evaluation sets are extremely resistant to just memorizing the training set, which is why o3 is impressive. I still think that even if this were just pure memorization, which, like I said, I don't believe, o3 would still be a remarkable system because of this: on the Epoch AI FrontierMath benchmark, the previous state-of-the-art systems managed to get just 2%, and you can see here that o3 manages to get over 25%, which is pretty incredible

Frontier math results

when we start to think about what this means for the system. For those of you who don't realize how big of a jump this really is and how incredible this result is: Terence Tao, who is often considered one of the greatest living mathematicians, a Fields Medalist and a professor at UCLA, said of this benchmark that these problems are extremely challenging and that he thinks they will resist AIs for several years at least. That is what he said when he looked at the benchmark when it first came out, and we can already see that o3 is a system that manages to get 25%. Remember, this is only OpenAI's second iteration of this kind of model, and it's able to do that well on a really challenging benchmark. And the craziest thing about all of this is that this benchmark is really resistant to memorization, because there is not much data out there. Take a look at what they actually said in the video posted to their channel; I think most people missed it, because the video has, I think, only around 200 or 300

Benchmark explained

views. "Existing benchmarks for complex scientific reasoning are close to saturating. We need something new to be able to tell how much progress we're making towards expert-level ability, and that's why we've built FrontierMath. We worked with over 60 mathematicians worldwide, professors, IMO problem writers, and Fields Medalists, to produce hundreds of original, extremely challenging math problems." "These are actually genuinely hard problems. One of the things that will be difficult, I think, is the lack of training material; this is pretty niche and probably not that well documented." "Yeah, so I took a look at the ten problems you sent. I think I could do the analytic number theory ones in principle, and the others I don't know how to do, but I know who to ask." "The problems of FrontierMath range in scope from Olympiad-style puzzles to research-level challenges and span all the major fields of mathematics. They're also beyond the capabilities of current AI: we've tested the most advanced systems available, and each succeeded on less than 2% of these problems." "In the near term, basically, the

Expert opinions

only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert, like a graduate student in a related field, paired with some combination of a modern AI and lots of other packages and things like that." "Although FrontierMath measures advanced mathematical reasoning, we use problems with closed-form answers, like integers, so they can be automatically verified without significant advances in the math libraries of proof assistants. This is a challenging design constraint for the mathematicians writing these problems, whose research is almost entirely conveyed through proving theorems. An AI capable of solving this challenge would have a drastic effect on mathematics." So that right there should show you that, even if there were maybe some minor, and I mean very minor, issues with the ARC-AGI benchmark, a system managing to get 25% on FrontierMath, when we saw o1-mini and o1-preview get around 1%, tells us that this is a drastically different system in terms of capabilities. Whether or not you agree is completely up to you. There are always going to be AI skeptics, which I think there should be, because that allows for criticism and, like I said, it allows us to achieve real progress outside of the bubble of hype. But I wanted to make this video to clear up the fog in the air, because there was a lot of discussion about whether or not this demo was faked and whether the benchmark was real, and this video should provide you with some insight as to what these benchmarks truly measure. With

Final thoughts

that being said if you enjoy
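As an aside on the FrontierMath design point quoted above: because every problem has a closed-form answer (such as an integer), a submission can be graded by exact comparison, with no proof assistant needed. The sketch below illustrates that idea; the problem IDs and answers are invented placeholders, not real FrontierMath data.

```python
# FrontierMath-style auto-grading sketch: closed-form answers make
# verification a simple exact-match comparison.
# Problem IDs and answers here are invented for illustration.
answer_key = {"problem_001": 144, "problem_002": 7}

def grade(submissions):
    # Count exact matches against the key; missing answers count as wrong.
    correct = sum(
        submissions.get(pid) == ans for pid, ans in answer_key.items()
    )
    return correct / len(answer_key)

print(grade({"problem_001": 144, "problem_002": 10}))  # 0.5
```

This is the trade-off the Epoch video describes: grading is trivial to automate, but mathematicians whose real work is proofs must author problems that compress into a single checkable value.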
