# Microsoft's New 'KOSMOS 2' Multimodal Takes Everyone By SURPRISE! (Now RELEASED!)

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=A7gBbBkLkns
- **Date:** 02.07.2023
- **Duration:** 17:27
- **Views:** 255,512
- **Source:** https://ekstraktznaniy.ru/video/14796

## Description

Microsoft's New 'KOSMOS 2' Multimodal Takes Everyone By SURPRISE! (Now RELEASED!)

Paper - https://arxiv.org/abs/2306.14824
Demo - https://github.com/microsoft/unilm/tree/master/kosmos-2

Welcome to our channel where we bring you the latest breakthroughs in AI. From deep learning to robotics, we cover it all. Our videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on our latest videos.

Was there anything we missed?

(For Business Enquiries)  contact@theaigrid.com

#LLM #Largelanguagemodel #chatgpt
#AI
#ArtificialIntelligence
#MachineLearning
#DeepLearning
#NeuralNetworks
#Robotics
#DataScience
#IntelligentSystems
#Automation
#TechInnovation

## Transcript

### Segment 1 (00:00 - 05:00)

so Microsoft has released another large language model called Kosmos-2, and this is a multimodal large language model, which is very interesting. Now, if what I just said confused you, let me explain: multimodal large language models are essentially language models you can use with modalities other than text. For example, with this large language model, which is actually a working product and not just a research paper, you can submit images and get back a response, and this is the very big next step in artificial intelligence. As you know, ChatGPT has taken the world by storm, but everyone working in artificial intelligence right now is trying to move the needle by looking at image understanding, and that is what Kosmos-2 aims to do. Now, Kosmos-2 by Microsoft is a little bit different because of how they tackle certain problems, and I think the way they've approached a multimodal large language model in this research paper, and in the live demo I'll show you later in the video, is arguably how it's going to be done in the future.

So let's take a look at the abstract so we can understand what they're talking about. It says: we introduce Kosmos-2, a multimodal large language model, enabling new capabilities of perceiving object descriptions and grounding text to the visual world. They also state that this work lays out the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action and world modeling, which is a key step towards artificial general intelligence. If you don't know what artificial general intelligence is, that's an AI system capable of pretty much any task, better than humans at essentially everything, and this is a key step towards it.

So what exactly is Kosmos-2 and what can it do? We're going to take a look at some of the examples from the research paper, because they showcase just how good it is at recognizing images, categorizing them and grounding them in reality. What we can now see are three separate images, and we're going to look at those images and the questions submitted with them, and see if the AI can actually recognize them. Right here we have a picture of an emoji, and the question asks the large multimodal AI: can you locate the left eye of the emoji? The completion is a small box around the left eye of the emoji. Essentially, what we're looking at here is how Kosmos-2 actually works: in the abstract they talk about how they use bounding boxes to identify objects in an image, and that's how this AI labels and categorizes different things, which will be interesting further on in the demo, because you'll see it identify real things in real life with the same bounding boxes.

Then we have a second image, a picture of two cows near some vegetation, with the question: how many cows are here? Please answer. You can see two cows present in the image, and it does this right here: it draws two boxes over the two cows, which is definitely going to be really interesting for those with vision problems, or for a multitude of different applications. For the third picture we have a sign that says 'welcome to Carnaby Street'. The question asks what the sign says, and the multimodal AI Kosmos-2 outputs 'welcome to Carnaby Street' and puts the box around the sign. Now, you might wonder why it's using a bounding box rather than highlighting everything differently; I'm not exactly sure why they've chosen this, maybe it's just the most effective method.

So from these examples we can see that Kosmos-2 clearly understands how to locate certain things within a picture, and the different questions highlight different kinds of understanding. From the first one we see that Kosmos-2 can clearly locate certain things within an image; from number two, that Kosmos-2 is able to count how many things there are within a given image; and in the third, Kosmos-2 is able to read text from an image. So although you might think these are just three things chosen at random, this isn't random at all: it's clearly demonstrating an ability to perform three separate tasks across three separate domains, which is really good for a multimodal AI, because users are going to submit all sorts of images for various different applications.

Now, further on we have three more separate images that require different sorts of responses, and the questions asked of them showcase just how good Kosmos-2 is at being a multimodal AI. The first image we see asks: why is this animal unusual? And what we see here is a picture of a turtle and

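The bounding boxes described above are, per the paper, written into the model's text output as discrete "location tokens" attached to phrases. As a rough illustration of what consuming such grounded output could look like, here is a small parser; the exact token names (`<phrase>`, `<object>`, `<patch_index_XXXX>`) and the 32×32 patch grid are assumptions based on the paper's description, not official Microsoft tooling.

```python
import re

GRID = 32  # assumed: the image is quantized into a 32x32 grid of patches

def parse_grounded(text):
    """Extract (phrase, normalized bounding box) pairs from grounded output.

    Expects spans like:
      <phrase>a snowman</phrase><object><patch_index_0044><patch_index_0863></object>
    where the two patch indices mark the top-left and bottom-right grid cells.
    """
    pattern = re.compile(
        r"<phrase>(.*?)</phrase><object>"
        r"<patch_index_(\d{4})><patch_index_(\d{4})></object>"
    )
    results = []
    for phrase, tl, br in pattern.findall(text):
        tl, br = int(tl), int(br)
        # Flat patch index -> (row, col) -> box corners in [0, 1].
        x0, y0 = (tl % GRID) / GRID, (tl // GRID) / GRID
        x1, y1 = (br % GRID + 1) / GRID, (br // GRID + 1) / GRID
        results.append((phrase, (x0, y0, x1, y1)))
    return results

# Hypothetical grounded caption in the assumed format:
sample = ("<grounding>An image of <phrase>a snowman</phrase>"
          "<object><patch_index_0044><patch_index_0863></object> "
          "warming himself by <phrase>a campfire</phrase>"
          "<object><patch_index_0445><patch_index_0891></object>.")
for phrase, box in parse_grounded(sample):
    print(phrase, box)
```

A downstream app could then scale those normalized corners by the image's pixel width and height to draw the boxes the video shows.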
### Segment 2 (05:00 - 10:00)

then we have a hare, or a picture of a rabbit, and essentially we know this animal is unusual because turtles are not notoriously fast. The completion from Kosmos-2 is: this animal is unique because it is a turtle; turtles are known for their slow and steady pace, whereas rabbits are known to be agile and quick; it is not common for a turtle to be faster than a rabbit, especially in terms of speed. So this first example shows us that Kosmos-2 has not just image identification but a level of understanding, and it's important to note that this is completely different from AIs that only have image recognition, because although image recognition is very good at identifying that something is an orange, a chair or a table, understanding the nuances in images and decoding them is a huge step up.

Take a look at this image right here: Kosmos-2 says this is a boy attached to the boat, and that shows a level of understanding. Now, the third and last example is definitely very interesting, because it shows us once again that Kosmos-2 can understand differences and extrapolate certain things from images. The question says: what is the biggest difference between bottle one and bottle two? And the answer is: the biggest difference between bottle one and bottle two is the label. That's actually very interesting, because as you can see from the labels, one has a lemon and a strawberry and the other has a melon on it, and the biggest difference clearly is the fruit on the label. Once again, this large multimodal AI that Microsoft has just released presents a level of understanding not really seen in many AIs. The only time we've really seen this level of image understanding, as opposed to image recognition, is with GPT-4's image identification, which is slowly being rolled out amongst users, if you did watch our recent videos; so stay aware for that, because that is going to be very interesting, and later on in the video we will try this, demo it and compare it to OpenAI's.

Now, this is one of the last examples they have in the paper, but it's one of my favorites, because it shows that Kosmos-2 can provide detailed descriptions. We can see here they have an input prompt that says: describe this image in detail. Without looking at the description, I would describe this as a snowman sitting next to a campfire, probably trying to warm his hands, but at the same time it's a little bit confusing because he could be melting; so, a snowman at a campfire, probably having a hot meal. Then we can see the complete description, along with all of the things this AI has managed to recognize and craft together: a snowman sits next to a campfire in the snow; he's wearing a hat, a scarf and mittens; there are several pots nearby, likely containing a hot meal for the snowman; the scene is serene and peaceful, with the snowman looking at the fire with a warm smile. I've got to be honest with you guys, that is a much better description than mine, and probably better than most people could give of that image. Being able to describe the sense of emotion, the aura, is definitely something that not everyone is capable of, and being able to locate and identify exactly what every single piece of the image is about is going to prove very useful. Countless times on the internet we see images where people are struggling to identify certain things, and think about applications where you need to quickly identify what's going on in an image.

Then we have a bit more technical stuff, where they describe how it works; I'm going to gloss over most of it. You can see that right here we have 'a dog in a field of flowers', and essentially what the pipeline does is break everything down into parts: you can see 'a dog', 'a field' and 'flowers'. It then tries to merge these together and drop the substrings, as it's called: it drops 'flowers', it drops 'a field', since both are contained in 'a field of flowers', and then it says 'a dog in a field of flowers'. So this is essentially how the AI works: taking every single piece of that image and merging it all into one cohesive caption, and I think this is very effective when it comes to providing us with the information we need.

If we look at some of the results compared to other benchmarks, we start to realize that Kosmos-2 is actually better than it first appears. At first glance we see that other models like VisualBERT, GLIP, FIBER and GRILL often have higher results, but what they don't have is zero-shot capability, and that's where Kosmos-2 excels; you can see zero-shot here, which the other models lack. Then in this table, once again, Kosmos-2 and GRILL are the only visual models able to do referring expression comprehension with zero-shot accuracy. So if you're

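The "drop the substrings" step the narration walks through can be sketched in a few lines. This is a simplified reconstruction of the idea, not the paper's exact pipeline (which expands noun chunks into referring expressions using a parse of the caption): keep only the expressions that are not contained inside a longer one.

```python
def drop_substrings(expressions):
    """Keep only expressions that are not substrings of a longer expression.

    E.g. 'a field' and 'flowers' are both contained in 'a field of flowers',
    so only 'a dog' and 'a field of flowers' survive.
    """
    kept = []
    for expr in expressions:
        if not any(expr != other and expr in other for other in expressions):
            kept.append(expr)
    return kept

chunks = ["a dog", "a field", "flowers", "a field of flowers"]
print(drop_substrings(chunks))  # ['a dog', 'a field of flowers']
```

The surviving expressions are the ones that get paired with bounding boxes, which is why the final grounded caption reads "a dog in a field of flowers" rather than listing every noun chunk separately.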
### Segment 3 (10:00 - 15:00)

wondering what zero-shot is in artificial intelligence: zero-shot refers to a scenario where a model is capable of performing a task without any specific training or examples related to that task; instead, the model leverages its general knowledge or pre-trained capabilities to make predictions or generate outputs for tasks it has never seen before. So essentially this means Kosmos-2 is on a completely different level.

Now, let's get on to some of the live demos with Kosmos-2. What's also interesting, if you're wondering about the text capabilities of this model, because as you know it is able to identify images: it's actually pretty good. It's on par with ChatGPT 3.5 in terms of its ability to work with text and predict the next word, so Kosmos-2 isn't just an image classifier or something that can understand images; it's actually really good at the natural language tasks we know and love from ChatGPT.

So now, here is the moment you've likely been waiting for: the live demo that Kosmos-2 presents us with. I do want to state that after releasing this video, if the video blows up, there might be times when this page is down frequently; that does happen when we release these videos. So please give it some time: either try to use it early, or if the video has a lot of views, just wait a while, because since these pages are demos, what often happens is the page gets overloaded and requests take quite a long time to be processed. I've got to be honest with you guys, this software is far better than I thought. Genuinely, I thought this was going to be a very poor demo, but the live demo is absolutely crazy, and I've literally been waiting hours, because I was greeted with a '502 Bad Gateway' page that was down for quite some time.

As you can see, I actually tried to grab an image where visibility was quite bad. You can see right here that we have an image of a plane wreck; the plane is rusted, it actually looks like it's part of the coral, and you can barely see the divers, but Kosmos-2 is successfully able to identify the divers and quickly identify the plane wreck, which is really interesting. So what I'm going to do is submit this again to see if it can generate another response that's far more detailed, as there are two different options which give you different responses: one being a brief response and one a detailed response. You can see here that after around 60 seconds we get this output; it says: this image features a large, old, rusted airplane sitting on the ocean floor; divers are swimming around the airplane, exploring its structure and the surrounding area; there are several people in the scene, some closer to the airplane and others further away; the divers are spread out across the image. We can see that's a pretty accurate representation.

Then I got it to caption this image, where it says 'a leather soccer ball on a field'. What I like about this as well is that you can easily see which parts of the image it's talking about. A lot of the time these multimodal AIs just tell us what's in the image, but imagine you didn't know where something was, or maybe you have a visual disability: Kosmos-2 is able to outline exactly where it is and tell you what it's doing. I think this is really useful, and we're going to see more applications for this in the future. Then I submitted another image where we have a man and his horse, and Kosmos-2 gets this really well: you can see it highlights the beach, highlights his horse and highlights the man, so this is going to work on a range of different tasks. I also wanted to present it with increasingly difficult images as we went on, just to see how far we could push it, and this was one of the ones I thought was going to be a little bit harder, but you can see that Kosmos-2 handled it pretty easily: it's an image of a person adjusting a rod with the rod holder right there, and it's got the person, it can actually put the bounding box over their arms, then the rod holder and the actual rod. So this is where you can see it's able to identify every single piece of the image and realize what's in it.

Then we did another example with an image that was a little more detailed, and you can see right here that it actually managed to get this one as well: it highlighted the baked cod, highlighted all the vegetables around it, and said 'an image of baked cod and vegetables on a baking tray'. Even though this was a brief description, it's still a pretty good one, because I actually searched this up and it does look like baked cod, so it is really good at identifying things in images. And I didn't actually know this tool was called a grinder, which is why I input this image, but it got it pretty much correct, and I had to do a double-check with a Google search to see if this was right: you can see right here it says 'close-up of a worker cutting a granite slab with a grinder, and dust is flowing off the pavement'. So I don't think there's any image it can't handle; it seems to be

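Those boxes the demo keeps drawing over divers, horses and rod holders are, per the paper, written by the model as discrete location tokens rather than raw coordinates. As a rough sketch of that encoding direction, here is how a normalized box could be mapped to two patch-index tokens; the 32×32 grid size and the `<patch_index_XXXX>` token naming are assumptions based on the paper's description.

```python
import math

GRID = 32  # assumed: the image is discretized into a 32x32 grid of patches

def box_to_patch_tokens(x0, y0, x1, y1):
    """Map a normalized box (corners in [0, 1]) to the top-left and
    bottom-right patch-index tokens that stand in for the box."""
    col0, row0 = int(x0 * GRID), int(y0 * GRID)
    # ceil - 1 makes a box ending exactly on a grid line land in the
    # last cell it actually covers.
    col1 = min(math.ceil(x1 * GRID) - 1, GRID - 1)
    row1 = min(math.ceil(y1 * GRID) - 1, GRID - 1)
    tl = row0 * GRID + col0  # flatten (row, col) to a single index
    br = row1 * GRID + col1
    return f"<patch_index_{tl:04d}>", f"<patch_index_{br:04d}>"

# A box covering the lower-left quadrant of the image:
print(box_to_patch_tokens(0.0, 0.5, 0.5, 1.0))
```

Quantizing boxes this way is what lets a plain text decoder emit locations: each of the 1024 cells is just one more token in the vocabulary.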
### Segment 4 (15:00 - 17:00)

performing very well on virtually every image I throw at it, and this is why I say Kosmos-2 is really good, because this was one of the images I threw at it and I thought there was no way it would get it. Some of you might know what these are, but honestly I was very perplexed; I was confused; I was thinking, what are these, Skittles, random furry things? I genuinely had no idea, and then it told me that these were colorful pom-poms with googly eyes. For some of you that might be something you immediately recognize, but this object is pretty foreign to me, and with a Google search I was able to quickly verify that this is what it is. So you can imagine the number of applications where you want to immediately know what something in an image is, online or maybe when you're trying to buy something. These responses were genuinely generated in around five to six seconds, so this is actually pretty quick; for demos that is exceedingly fast, as demos usually take anywhere between 60 seconds and around 2 minutes to accomplish their task, simply because of the number of users making requests, although as we mentioned, once we release this video that might change.

Now, this is an image where it starts to hallucinate. You can see that it says there are several cars visible in the scene, when there are no visible cars except one; then it says the parking garage is filled with various vehicles, including a truck parked to the left, a car on the right side and a motorcycle parked further to its right. I don't see a truck to the left or a motorcycle to the right, so it's clear that with the more detailed descriptions this hallucination can sometimes occur. Then one of the last things I wanted to do was test whether it could actually handle black-and-white images, and it can, except there was a small hallucination with the arm, because you can see right here that it says 'a dog' when it's in fact not a dog, it's just that person's arm. It does say a man, a trash can and the street, and the more detailed explanation doesn't actually hallucinate as much, although it does describe this as a bench, which is actually not a bench, it's a sign. But honestly, this is one of the most promising visual large language models I've seen so far, and with multimodal AIs coming out soon, I genuinely cannot wait for GPT-4's full image release. The only reason I think it's currently being delayed is that they want to ensure it's safe, and we know that OpenAI always maintains their safety routine of rigorously testing large language models and multimodal AIs, ensuring they're safe before public release.
