# DeepMind's New AI Looked At 1,000,000,000 Images!

## Metadata

- **Channel:** Two Minute Papers
- **YouTube:** https://www.youtube.com/watch?v=zOU6usZRJvA
- **Date:** 09.12.2022
- **Duration:** 8:41
- **Views:** 151,945
- **Source:** https://ekstraktznaniy.ru/video/13366

## Description

❤️ Check out Weights & Biases and sign up for a free demo here: https://wandb.com/papers 
❤️ Their mentioned post is available here: http://wandb.me/flamingo

📝 DeepMind's paper "Tackling multiple tasks with a single visual language model" is available here:
https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model

🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible:
Aleksandr Mashrabov, Alex Balfanz, Alex Haro, Andrew Melnychuk, Benji Rabhan, Bryan Learn, B Shang, Christian Ahlin, Edward Unthank, Eric Martel, Geronimo Moralez, Gordon Child, Jace O'Brien, Jack Lukic, John Le, Jonas, Jonathan, Kenneth Davis, Klaus Busse, Kyle Davis, Lorin Atzberger, Lukas Biewald, Luke Dominique Warner, Matthew Allen Fisher, Matthew Valle, Michael Albrecht, Michael Tedder, Nevin Spoljaric, Nikhil Velpanur, Owen Campbell-Moore, Owen Skarpness, Rajarshi Nigam, Ramsey Elbasheer, Richard Sundvall, Steef, Taras Bobrovytsky, Ted Johnson

## Transcript

### Segment 1 (00:00 - 05:00)

Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. Today we are going to have a look at DeepMind's Flamingo AI, which has looked at over 1 billion images, and see what it has learned from them. And, yes, we will also try to break it. Successfully. Now, its goal was to fuse AI models that understand language with other AI models that understand images. And what comes out? Absolute magic.

For instance, it can be an amazing assistant. We can give it a picture and ask what it depicts: a bowl of soup with a monster face. I particularly like this one because this image was also made by a different AI. So this is an AI commenting on another AI's work. How cool is that?

Now, we can also subject it to a test that is interesting because it is designed to break the brains of some humans. Look. We need to read the text and then name the color that was used to write it. If you try to do it quickly, it is easy to answer incorrectly, and apparently, this little AI is passing with flying colors. But it gets better! Look! It knows that this is the Stroop test, and it also knows that humans read this kind of text more slowly. So, let's pop the question: is it challenging for you too, little AI? "I am not affected by this difference." Whoa. It is flexing. But it backs it up. What a time to be alive!

Now, what it can also do is something called few-shot learning. What is that? Well, this is learning from only a few examples, something humans are really good at, but machines, not so much. For instance, we look at an image and say that this is a chinchilla. We look at another image and say this is a shiba. Then, the third time, we look at an image and say "this is". So, does it know what it is supposed to do with this? Remember, neural network-based learning methods typically require thousands and thousands of training samples to learn something new, so, can this new one do it? Yes! Oh boy. That is amazing. So, this is few-shot learning.

And I have a hard time overstating how incredibly useful it is. For instance, we can quickly teach it to read handwritten text from a piece of paper, or, here comes my favorite: we can even ask it to reverse-engineer the prompts that the AI used to create these images. We teach it with two examples where we tell it the piece of text that was used to generate the image, and now, for the third time, it knows the drill and does it itself. Or, we can even ask for the ingredients, or even the nutrients, that are present in a meal, or what songs the soundtrack of a movie contains.

And here is another reassuring thing: it does not fall for the good old apple + iPod trick that many previous techniques fell for.
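To make the few-shot idea above concrete, here is a minimal sketch of how such an interleaved image-and-text prompt could be assembled. This is a hedged illustration only: the tuple-based prompt format and the model call at the end are hypothetical stand-ins, not DeepMind's actual Flamingo API.

```python
from dataclasses import dataclass

@dataclass
class Example:
    image_path: str  # path to a support image
    label: str       # the text we pair with that image

def build_few_shot_prompt(support: list[Example], query_image: str) -> list:
    """Interleave (image, text) pairs, then end with the query image
    and an unfinished sentence for the model to complete."""
    prompt = []
    for ex in support:
        prompt.append(("image", ex.image_path))
        prompt.append(("text", f"This is a {ex.label}."))
    prompt.append(("image", query_image))
    prompt.append(("text", "This is a"))  # left open: the model fills in the answer
    return prompt

# Two shots, just like the example in the video:
shots = [
    Example("chinchilla.jpg", "chinchilla"),
    Example("shiba.jpg", "shiba"),
]
prompt = build_few_shot_prompt(shots, "flamingo.jpg")
# answer = model.generate(prompt)  # hypothetical call; the hoped-for answer: "flamingo."
```

The point of the pattern is that nothing is retrained: the two labeled examples live entirely in the prompt, and the model infers the task from their structure.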
So now, let's try to answer three Scholarly questions. Question number one: is it really better than previous techniques? If we compare it to previous zero-shot techniques, a challenging case where the AI has to perform something completely new, there is no contest: the new one is so much better. However, what I found even more interesting is that it can give fine-tuned AIs a run for their money too. This is insanity. So, what is that? Fine-tuned means an AI that was specifically designed and trained for one task, and one task only. This new method can do many things at the same time; it is a generalist, so it is not expected to beat specialized techniques. How could it?

And here comes the insanity part: it is not only competitive with these fine-tuned techniques on several datasets, but, wow, it even outperforms them on some of them, despite having access to ten, and sometimes even a hundred, times less training data. I am stunned. These improvements are concentrated around video-based question answering datasets like NextQA and iVQA.

Question number two: how does it improve with the number of shots? You remember the flamingo example; we had two shots there. The answer is that it gets it early: it does not need tens of examples to understand what we are trying to ask.
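If we wanted to quantify that "improvement with the number of shots" question, the measurement could look something like the sketch below, reusing `build_few_shot_prompt` from the earlier snippet. Again, `model`, `benchmark`, and the attribute names are hypothetical placeholders, not the paper's actual evaluation harness.

```python
def accuracy_at_k_shots(model, benchmark, k: int) -> float:
    """Fraction of benchmark queries answered correctly when the
    prompt contains k labeled support examples."""
    correct = 0
    for task in benchmark:  # each task holds support examples, a query image, an answer
        prompt = build_few_shot_prompt(task.support[:k], task.query_image)
        prediction = model.generate(prompt).strip().rstrip(".")
        if prediction == task.answer:
            correct += 1
    return correct / len(benchmark)

# Sweep the shot count and watch where the curve flattens:
# for k in (0, 1, 2, 4, 8):
#     print(k, accuracy_at_k_shots(model, benchmark, k))
```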

### Segment 2 (05:00 - 08:00)

Its zero-shot performance is formidable, so it can do things it has never been taught before, and that is incredible. But even in other cases, after just a few examples, it really gets it.

And, three, the usual suspect: model size. Yes! As I hoped, results still improve if we train a bigger neural network, which is great news: two more papers down the line and it will be so much better, even without algorithmic improvements. If we increase the model size, it gets smarter, and here is a beautiful chart showing that the bigger model also learns significantly better from only a few shots. Dear Fellow Scholars, this is perhaps artificial intelligence being born right before our eyes. What a time to be alive!

But wait, we said that it is excellent at video-based question answering. So, let's push it to its limits and see what it can do. Whoa, that is very nice. It understands that this good boy is being weighed, understands what a video game avatar means, and reads a sequence of text even if only a few letters are shown at a time as we pan the camera through it.

And, hold on to your papers, because it also understands humor. Yes, that's right. This needs careful prompting, but it can do it. Here, we are looking at an image where people are getting amused by Obama's little prank with the scale. With a little guidance, it successfully identifies what is unusual and amusing about this image.

But we haven't broken it yet. And now, you will see that not even this technique is perfect. For instance, let's try to mess with this little AI and break its brain. How do we do that? Well, of course, with silly, irrelevant questions. We can even ask it what it can see outside the window. And it says: a parking lot. Well, not quite! Also, whom is the person texting? It says, of course, the driver. Well, you cannot possibly know that, little AI.

And once again, this is an incredible paper, perhaps a step towards the highly coveted general intelligence that so many people think we will never reach. I am stunned. So, what do you think? Does this get your mind going? Let me know in the comments below!

Thanks for watching and for your generous support, and I'll see you next time!
