❤️ Check out Weights & Biases and sign up for a free demo here: https://wandb.com/papers
❤️ The post they mention is available here: http://wandb.me/flamingo
📝 DeepMind's paper "Tackling multiple tasks with a single visual language model" is available here:
https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model
🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible:
Aleksandr Mashrabov, Alex Balfanz, Alex Haro, Andrew Melnychuk, Benji Rabhan, Bryan Learn, B Shang, Christian Ahlin, Edward Unthank, Eric Martel, Geronimo Moralez, Gordon Child, Jace O'Brien, Jack Lukic, John Le, Jonas, Jonathan, Kenneth Davis, Klaus Busse, Kyle Davis, Lorin Atzberger, Lukas Biewald, Luke Dominique Warner, Matthew Allen Fisher, Matthew Valle, Michael Albrecht, Michael Tedder, Nevin Spoljaric, Nikhil Velpanur, Owen Campbell-Moore, Owen Skarpness, Rajarshi Nigam, Ramsey Elbasheer, Richard Sundvall, Steef, Taras Bobrovytsky, Ted Johnson, Thomas Krcmar, Timothy Sum Hon Mun, Torsten Reil, Tybie Fitzhugh, Ueli Gallizzi.
If you wish to appear here or pick up other perks, click here: https://www.patreon.com/TwoMinutePapers
Thumbnail background image credit: DeepMind
Thumbnail background design: Felícia Zsolnai-Fehér - http://felicia.hu
Károly Zsolnai-Fehér's links:
Instagram: https://www.instagram.com/twominutepapers/
Twitter: https://twitter.com/twominutepapers
Web: https://cg.tuwien.ac.at/~zsolnai/
#deepmind
Table of Contents (2 segments)
Segment 1 (00:00 - 05:00)
Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. Today we are going to have a look at DeepMind’s Flamingo AI that has looked at over 1 billion images, and see what it has learned from them. And, yes, we will also try to break it. Successfully. Now, its goal was to fuse AI models that understand language with AI models that understand images. And what comes out? Absolute magic.

For instance, it can be an amazing assistant. We can give it a picture and ask what it depicts: a bowl of soup with a monster face. I particularly like this one because this image was also made by a different AI. So this is an AI commenting on another AI’s work. How cool is that?

Now, we can also subject it to a test that is interesting because it is designed to break the brain of some humans. Look. We need to read the text and then name the color that was used to write it. If you try to do it quickly, it is easy to answer incorrectly, and apparently, this little AI is passing with flying colors. But it gets better! Look! It knows that this is the Stroop test, and it also knows that humans read this kind of text slower. So, let’s pop the question: is it challenging for you too, little AI? “I am not affected by this difference.” Whoa. It is flexing. But it backs it up. What a time to be alive!

Now, what it can also do is something called few-shot learning. What is that? Well, this is learning from only a few examples, something humans are really good at, but machines, not so much. For instance, we look at an image and say that this is a chinchilla. We look at another image and say this is a shiba. Then, the third time, we look at an image, and we only say “this is”. So, does it know what it is supposed to do with this? Remember, neural network-based learning methods typically require thousands and thousands of training samples to learn something new, so, can this new one do it? Yes! Oh boy. That is amazing.

So, this is few-shot learning. And I have a hard time overstating how incredibly useful it is. For instance, we can quickly teach it to read handwritten text from a piece of paper, or, here comes my favorite: we can even ask it to reverse-engineer the prompts that another AI used to create these images. We show it two examples where we tell it the piece of text that was used to generate the image, and now, for the third time, it knows the drill and does it itself. Or, we can ask for the ingredients, or even the nutrients, present in a meal, or what songs the soundtrack of a movie contains. And, here is another reassuring thing: it does not fall for the good old apple + iPod trick that many previous techniques fell for.

So now, let’s try to answer three Scholarly questions. Question number one: is it really better than previous techniques? If we compare it to previous zero-shot techniques, a challenging setting where the AI has to perform something completely new, there is no contest. The new one is so much better. However, what I found even more interesting is that it can give fine-tuned AIs a run for their money too. This is insanity. So, what is that? Fine-tuned means an AI that was specifically designed and trained for one task, and one task only. This new method is a generalist that can do many things at the same time, so it is not expected to beat specialized techniques. How could it?
And here comes the insanity part: it is not only competitive with these fine-tuned techniques on several datasets, but, wow, it even outperforms them on some of them despite having access to ten, and sometimes even a hundred, times less training data. I am stunned. These improvements are concentrated around video-based question answering datasets like NExT-QA and iVQA.

Question number two: how does it improve with the number of shots? You remember the flamingo example; we had two shots there. The answer is that it gets it early on, and it does not need tens of examples to understand what we are trying to ask.
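For the more technically minded Fellow Scholars, here is a minimal Python sketch of how such a few-shot prompt is assembled, following the chinchilla and shiba example above. Flamingo itself was not publicly released, so the model call at the end is a hypothetical placeholder; the part that matters is the interleaved image-text sequence that lets just two examples teach the model the task.

```python
# A minimal sketch of Flamingo-style few-shot prompting, as in the
# chinchilla/shiba example above. Everything model-related here is a
# hypothetical stand-in: Flamingo's weights and API were not released,
# so this only illustrates the interleaved image-text prompt format.

def build_few_shot_prompt(examples, query_image):
    """Interleave (image, caption) pairs, then end with the query image
    and the open-ended cue "This is", which the model must complete."""
    prompt = []
    for image, caption in examples:
        prompt.append(("image", image))                 # visual input
        prompt.append(("text", f"This is {caption}."))  # its label
    prompt.append(("image", query_image))               # the unlabeled query
    prompt.append(("text", "This is"))                  # left open on purpose
    return prompt

# Two labeled examples are all the model needs to "know the drill".
examples = [
    ("chinchilla.jpg", "a chinchilla"),
    ("shiba.jpg", "a shiba"),
]
prompt = build_few_shot_prompt(examples, "flamingo.jpg")

# In a real system, each "image" entry would be encoded by a vision
# encoder and each "text" entry tokenized, then the whole interleaved
# sequence fed to the model, e.g. (hypothetical API):
#   answer = model.generate(prompt, max_new_tokens=10)
#   print(answer)  # -> "a flamingo."
for kind, content in prompt:
    print(kind, "->", content)
```

The interleaving is the key design choice: the language model sees images and text as one sequence, so teaching it one more example means appending two more entries to the prompt, with no retraining at all.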
Segment 2 (05:00 - 08:00)
Its zero-shot performance is formidable, so it can do things it has never been taught before. That is incredible, but even in the other cases, after just a few examples, it really gets it.

And, three, the usual suspect: model size. Yes! As I hoped, results still improve if we train a bigger neural network, which is great news; two more papers down the line, and it will be so much better even without algorithmic improvements. If we increase the model size, it gets smarter, and here is a beautiful chart showcasing that a bigger model also learns from only a few shots significantly better. Dear Fellow Scholars, this is perhaps artificial intelligence being born right before our eyes. What a time to be alive!

But wait, we said that it is excellent at video-based question answering. So, let’s push it to its limits and see what it can do. Whoa, that is very nice. It understands that this good boy is being weighed, understands what a video game avatar means, and reads a sequence of text even if only a few letters are shown at a time as we pan the camera through it. And, hold on to your papers, because it also understands humor. Yes, that’s right. This needs careful prompting, but it can do it. Here, we are looking at an image where people are amused by Obama’s little prank with the scale. With a little guidance, it successfully identifies what is unusual and amusing about this image.

But we haven’t broken it yet. And now, you will see that not even this technique is perfect. For instance, let’s try to mess with this little AI and break its brain. How do we do that? Well, of course, with silly, irrelevant questions. We can ask it what it can see outside the window. And it says: a parking lot. Well, not quite! Also, whom is the person texting? It says, of course, the driver. Well, you cannot possibly know that, little AI.

And once again, this is an incredible paper, perhaps a step towards the highly coveted general intelligence that so many people think we will never reach. I am stunned. So, what do you think? Does this get your mind going? Let me know in the comments below! Thanks for watching and for your generous support, and I'll see you next time!