OpenAI’s GPT 4.1 - Absolutely Amazing!

Two Minute Papers · 16.04.2025 · 12:01 · 87,479 views · 2,857 likes


Video description
❤️ Check out DeepInfra and run DeepSeek or many other AI projects: https://deepinfra.com/papers

GPT 4.1 (once again, likely API only, not in the ChatGPT app): https://openai.com/index/gpt-4-1/

📝 The paper "Humanity's Last Exam" is available here: https://agi.safe.ai/

Sources:
https://x.com/paulgauthier/status/1911927464844304591?s=46
https://x.com/ficlive/status/1911853409847906626
https://x.com/flavioad/status/1911848067470598608?s=46P
https://x.com/pandeyparul/status/1911958369734107439?s=46
https://x.com/demishassabis/status/1912197180187897985?s=46
https://x.com/emollick/status/1911966088339894669?s=46
https://x.com/aibattle_/status/1911845556885893488?s=46
https://x.com/augmentcode/status/1911933204036243479?s=46
https://x.com/christiancooper/status/1881335734256492605

📝 My paper on simulations that look almost like reality is available for free here: https://rdcu.be/cWPfD
Or this is the orig. Nature Physics link with clickable citations: https://www.nature.com/articles/s41567-022-01788-5

🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible: Benji Rabhan, B Shang, Christian Ahlin, Gordon Child, John Le, Juan Benet, Kyle Davis, Loyal Alchemist, Lukas Biewald, Michael Tedder, Owen Skarpness, Richard Sundvall, Steef, Taras Bobrovytsky, Thomas Krcmar, Tybie Fitzhugh, Ueli Gallizzi. If you wish to appear here or pick up other perks, click here: https://www.patreon.com/TwoMinutePapers

My research: https://cg.tuwien.ac.at/~zsolnai/
X/Twitter: https://twitter.com/twominutepapers
Thumbnail design: Felícia Zsolnai-Fehér - http://felicia.hu

Table of contents (3 segments)

Segment 1 (00:00 - 05:00)

Alright, GPT 4.1. Three new models just appeared: 4.1, mini, and nano. This is a mainly coding-focused AI assistant. Previously, if you wanted to create a flash card app from just one text prompt, I mean, this was okay, it kind of worked: you could create and review your flash cards. However, if you look at the new one, the bones are very similar, but the usability is on a completely different level. This went from good to great in just one release. Loving it!

Note that this video comes in two parts. This is part one, and part two will be where the real fun begins.

So, these new models form a new Pareto frontier, which roughly means that you can choose how much speed you are willing to sacrifice to get more intelligence. If you are typing something and you want the AI to autocomplete your text, it does not need to be as smart as Einstein, but it needs to be super fast, so you probably need the nano for that. But for most things, like the flash card app, you will probably want to invoke the regular 4.1.
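If you are curious what choosing along that frontier looks like in practice, here is a minimal sketch, assuming the OpenAI Python SDK and the API model IDs gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano (the description above notes these are likely API only). The routing rule is purely an illustration of the speed-versus-intelligence trade-off, not an official recommendation:

```python
# A minimal sketch of picking a model tier along the speed/intelligence
# Pareto frontier. The task-to-tier mapping here is made up for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def pick_model(task: str) -> str:
    """Pick a tier: fast-and-cheap for simple tasks, full model otherwise."""
    if task == "autocomplete":   # latency-critical, does not need Einstein
        return "gpt-4.1-nano"
    if task == "summarize":      # middle ground on the frontier
        return "gpt-4.1-mini"
    return "gpt-4.1"             # full model, e.g. for the flash card app

response = client.chat.completions.create(
    model=pick_model("build_app"),
    messages=[{"role": "user",
               "content": "Create a single-file flash card web app."}],
)
print(response.choices[0].message.content)
```

Note that the only thing that changes between tiers is the model string; the call itself is identical, which is what makes this kind of routing cheap to experiment with.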
But it gets better. On coding tasks, 4.1 can outperform the previous 4.5, I know, and surprisingly, 4.1 outperforms even those much slower, thinking AIs on coding benchmarks.

Also, wow. The context window is now 1 million tokens long. You can put heaps and heaps of textbooks in there, thousands of pages in total, and ask questions about them. And then, what happens? This is called the needle-in-a-haystack test, and OpenAI says that it recalls all of it very well. However, when looking for 8 needles in the haystack, the accuracy decreases considerably. Respect to OpenAI for showing a little weakness there too, in the name of scholarly integrity. Independent tests also seem to verify that, and here, Google DeepMind's Gemini 2.5 Pro reigns supreme. I'd love to see some more rigorous testing on this to get to the bottom of it; a minimal version of such a test is sketched at the end of this segment. This is important, because even if you don't put in many textbooks, remembering your past conversations and understanding you will be super important going forward. It better not miss that marriage anniversary date!

And, yes, I hear you asking: Károly, GPT 4.5 is already out, and this is GPT 4.1? Yes. And look, it gets better, I mean funnier: look at this wall of other models. You could say that the marketing could be a bit better here, and I would agree. However, this also shows the breakneck pace of innovation and competition here. More on that in a moment.

Now, benchmarks. I enjoy seeing numbers going up over time as much as the next Scholar. I mean, having it defeat hundreds of PhD-level questions, mathematical and biological olympiad-level questions, is incredible. These systems do unbelievably well on them.

But here is the problem with these benchmarks. Almost all of these AI assistants are trained on nearly the whole internet. What does that mean? That means that almost whatever your question is, they have already seen something that is similar. And what does that mean? It means these benchmarks will mean less and less over time. Thus, I would not take them too seriously. Perhaps as one little data point. One little pointer.

So then, is testing AI assistants reliably impossible? Well, not quite. There is a solution. Sort of.

Okay, this was part one. Normally, this is where most other videos end. Elsewhere. But not here. Here we talk about research papers, and the wider context around these papers. For me, that is the coolest part; that's what makes Two Minute Papers different. So here comes the coolest part. Part two.

Here, we are going to answer questions like: Can we really know how smart these AI systems are? Why do they say that it is devilishly difficult to train them? How do you make a small textbook bigger? Yes, you heard it right.

Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér.

So, the problem with benchmarks is that they ask questions that these AIs already know. How about a test with things they don't know? Enter Humanity's Last Exam, a paper that just appeared, where the authors ask the smartest people in the world to create questions that none of the current AI systems can answer. These are really, really tough questions from all kinds of disciplines: classics, ecology, mathematics, computer science, linguistics, chemistry, you name it.

And the result? Now hold on to your papers, Fellow Scholars, and look at that. Other tests? Easy-peasy. Humanity's Last Exam? Wow. They fail spectacularly on that. A stunning result.
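As promised, here is a minimal sketch of a single-needle version of the haystack test, assuming the same OpenAI Python SDK as before. The filler text, the needle, and the prompt are all made up for illustration; the real evaluations vary needle depth, position, and count (for example, the 8-needle variant mentioned above) far more systematically, and a real run at this length costs real money:

```python
# Toy needle-in-a-haystack test: hide one fact inside a long filler text,
# then ask the model to recall it. The filler below is on the order of a
# couple hundred thousand tokens, well within a 1M-token context window.
import random
from openai import OpenAI

client = OpenAI()

filler = "The sky was clear and the market was quiet that day. " * 20_000
needle = "The secret passphrase is 'velvet-otter-42'"

# Insert the needle between two random sentences of the filler.
sentences = filler.split(". ")
pos = random.randrange(len(sentences))
haystack = ". ".join(sentences[:pos] + [needle] + sentences[pos:])

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user",
               "content": haystack + "\n\nWhat is the secret passphrase?"}],
)
print(response.choices[0].message.content)  # should mention 'velvet-otter-42'
```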

Segment 2 (05:00 - 10:00)

And even if you plug in the newer ones, once again, Gemini 2.5 Pro pops up. Google is getting their mojo back in AI. So, yes, I would very much like to see more of these systems tested on Humanity's Last Exam. I think we should keep asking for it. Doing my part here.

But... wait a second. I hear you asking: Károly, okay, but this one can be gamed too, right? You just put these questions, or similar ones, in the training dataset, yum-yum-yum, it eats it up, and suddenly the next version of the AI is incredibly good at that, because it has already seen it. Like with other benchmarks. A toy version of the overlap check that can catch this is sketched below.

Well, not quite. Because, get this, many of the questions are reserved in a hidden dataset that is not published anywhere, making this one a bit more truthful than most other tests. In short: private datasets might still be a good yardstick to measure AI progress in the future. Please test new systems like GPT 4.1 on those too.
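To make the "it has already seen it" worry concrete, here is a toy sketch of the kind of overlap check used to detect benchmark contamination. The 8-gram rule, the corpus, and the question are all hypothetical, and real decontamination pipelines are much more thorough:

```python
# Toy contamination check: flag a benchmark question if any 8-gram from it
# appears verbatim in a training document. Real pipelines normalize text,
# hash n-grams at scale, and also try to catch paraphrases.
def ngrams(text: str, n: int = 8) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, corpus_docs: list[str], n: int = 8) -> bool:
    q_grams = ngrams(question, n)
    return any(q_grams & ngrams(doc, n) for doc in corpus_docs)

corpus = ["... the integral of x squared from zero to one equals one third ..."]
question = "Show that the integral of x squared from zero to one equals one third."
print(is_contaminated(question, corpus))  # True: the model likely saw this
```

Note that a paraphrased question slips right past a verbatim n-gram check, which is exactly why a hidden, never-published question set is the stronger guarantee.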
And as you see, luckily for us, there is breakneck competition between these AI labs to create the best models out there, often for free. So we are completely spoiled here. OpenAI came up with ChatGPT and took the world by storm. But now, Google DeepMind has Gemini 2.5 Pro, which is an absolute powerhouse at a very low price point, so much so that it might even steal the show for now. And DeepSeek is also on their tails, just a tiny bit behind, but it is all free to use for all of us. Amazing.

So all of these, including GPT 4.1, are amazing gifts for all of us, especially since these systems are devilishly difficult to train. Why is that?

Now, I don't have a lot of eye candy for this part, so I really appreciate that you are all Fellow Scholars, and you know that most of the value is in the narrative, not in the visuals. Huge thank you and huge respect to all of you for that.

Now get this. In an interview, scientists at OpenAI were asked: if they had to redo GPT-4, a state-of-the-art flagship system from just two years ago, what would it take? The answer was something absolutely incredible; I couldn't believe my ears: they said 5-10 people would be enough to do that. And now, for their more recent models, they need hundreds of people and all the might and compute of OpenAI.

But here comes the crazy part: both compute and training data are growing like crazy, but compute grows even faster than data. What does that mean? It means that data has become the bottleneck. So the goal now is to use the tons of compute to squeeze every drop of information out of this training data. That means data efficiency. Do you know what system out there is really data efficient? The human brain. Oh yes! That is huge. I mean, not the brain, but the realization. So let me say it again: we are not compute constrained. It's nice to have more graphics cards, of course, but now what we need most is more human ingenuity to make better use of the data we already have.

Imagine that you need to take an exam and you have access to a textbook. And you have all the time in the world. Fantastic. You think you are lucky, until you find out that there is one big problem: the textbook is really tiny. Not even close to enough for the exam. Why? Because it only has two problems in it. But the test will have a hundred problems. So what do we do? Well, first, what you don't do is just memorize the two problems it contains. This won't help with the other 98. So, instead, you dissect these two problems. You try to understand the fundamental principles, the methods, and the reasoning behind the solutions. And that, Fellow Scholars, is going to be the next chapter for AI too: lots of compute, and comparatively very little data.

But the training gets even more difficult. Why? Well, there are small bugs, tiny little problems, during the training of an AI system. Is that a problem? No and yes. You see, imagine a dripping faucet in your house. Nobody cares about it. But note that the new system is more than a hundred times more demanding, a hundred times more complex. So remember that dripping faucet that everyone ignored? Well, multiply it by a hundred; that is now a broken pipe that pours water into the foundation of the whole house, and then it slowly starts sinking.

Oh yes, that's a classic. A small problem magnified by a hundred is suddenly not a small problem anymore.

Segment 3 (10:00 - 12:00)

And don't forget, this is just how things are at the moment. But the landscape is changing really, really fast. Once again, there is breakneck competition. Remember Sora, OpenAI's text-to-video AI? When they first showcased it, the news took the world by storm. We couldn't believe how good it was. And by the time they released it? I would go so far as to say that now, it might not be able to compete with Google DeepMind's Veo 2, which has appeared since. And there are so many more models being published for free, you can't even keep track of them. Some of them are just 7 billion parameters; they run almost anywhere.

So, these are all amazing gifts to humanity from OpenAI. I am really thankful for them, and I am also thankful for the fact that there is huge competition between many other labs too. The result? We, the users, the Fellow Scholars, get spoiled. Often for free. Thank you! So whatever you hear today, AI this, AI that, this is still just the beginning of humanity's AI journey. And it is already so incredibly capable. Loving it. What a time to be alive!

This was super fun. I hope you enjoyed the journey, and consider subscribing and hitting the bell icon if you wish to see more like this. And check out our new sponsor, because it helps you try a bunch of amazing AI systems for free or for very cheap.
