# This New AI Vision Model Beats Everything (Molmo Ai)

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=WgQlG-IjlzE
- **Date:** 06.10.2024
- **Duration:** 16:51
- **Views:** 32,489
- **Source:** https://ekstraktznaniy.ru/video/14031

## Description

This New AI Vision Model Beats Everything (Molmo Ai)

Prepare for AGI with me - https://www.skool.com/postagiprepardness 
🐤 Follow Me on Twitter https://twitter.com/TheAiGrid
🌐 Check out my website - https://theaigrid.com/


Links From Today's Video:
https://molmo.allenai.org
https://www.youtube.com/watch?v=spBxYa3eAlA

Welcome to my channel where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos.

Was there anything I missed?

(For Business Enquiries)  contact@theaigrid.com

#LLM #Largelanguagemodel #chatgpt
#AI
#ArtificialIntelligence
#MachineLearning
#DeepLearning
#NeuralNetworks
#Robotics
#DataScience

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

So in today's video we're diving into Molmo, a groundbreaking family of state-of-the-art multimodal AI models that are setting new standards. Molmo doesn't just interpret images and text like other AI models; it takes things a step further by learning how to point at what it perceives, unlocking a whole new level of interaction with both the physical and virtual world. Whether it's outperforming models 10 times its size or closing the gap between open and proprietary systems, Molmo is paving the way for next-gen AI applications that don't just understand their environment but actively interact with it. Take a look at this quick demo, and then I'm going to dive into some of the rather shocking benchmarks.

"Is this option vegan?" "No, this option is not vegan." "Count the number of people." "Counting shows a total of 21." "Convert this table to JSON." "Here's the table converted to JSON format." "I want to sell my bike on Craigslist, write me a description for it." "Schwinn bike for sale, blue with white accents, $300 or best offer. Do you want to buy it?" "It's Monday at 1 p.m., can I park here and for how long?" "Yes, you can park here on Monday at 1 p.m. You have six hours left to park." "I really like electronic music, what show should I go to?" "If you're into electronic music, you should definitely check out the show featuring Tao and Zed at the Showbox SoDo." "Can you write me a knock-knock joke?" "Knock knock." "Who's there?" "Rocket." "Rocket who?" "Rocket to the moon and back, baby." "Can you point to the model that takes the least time to train?" "Nice." "Which of these models uses reinforcement learning?" "The image shows a table of different models and their associated methods." "Hey Molmo, order me a pumpkin spice latte from Starbucks." "I'll place the order through Uber Eats. Your pumpkin spice latte is on its way."

So we could see just then and there that we do have a variety of different things this open-source model is able to do, and I've got to be honest with you guys, this is remarkably surprising. One of the things we saw there was the incredible vision capabilities. We also saw that this model can point at certain things within an image: you can see right here that it was able to point at certain things within the table and then of course convert that to JSON, which was rather incredible. So all of this is really effective. One of the things they also somewhat teased, though they don't go into it much on their blog, is that web agents powered by Molmo can now browse the internet for you, and here you can see someone ordering something from Uber Eats. I'm wondering how effective this is going to be, as I do know that web agents are some of the hardest things to build, with agents running into the complexity of multi-step reasoning.

The most surprising thing about Molmo is that it's got incredible vision capabilities that are on par with state-of-the-art models 10 times its size. You can literally see here that the human evaluators rated Molmo 72B higher than every other closed-source model and basically put it on par with GPT-4o. You can see the human Elo preference evaluations used 15,000 image and text prompt pairs: they queried each vision-language model for responses and presented the resulting image-text-response triplets for all VLM pairs to a set of 870 human annotators, who gave the rankings. So this is incredible, because it manages to surpass Gemini 1.5 Pro, Claude 3.5 Sonnet, and Gemini 1.5 Flash, and even though Llama 3.2 was recently released, it also crushes Llama 3.2 on these visual analysis benchmarks.
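
To make those Elo preference numbers a bit more concrete, here is a minimal sketch of how a ranking like that can be computed from pairwise human judgments. The K-factor, starting rating, and toy judgments below are illustrative assumptions, not values from the Molmo evaluation.

```python
# Minimal sketch: deriving Elo-style rankings from pairwise human preferences.
# K-factor, initial rating, and the toy judgments are assumptions for illustration.
from collections import defaultdict

K = 32          # update step (assumed)
INITIAL = 1000  # starting rating (assumed)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, model_a, model_b, a_preferred: bool) -> None:
    """Update both models' ratings after one human preference judgment."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    s_a = 1.0 if a_preferred else 0.0
    ratings[model_a] += K * (s_a - e_a)
    ratings[model_b] += K * ((1.0 - s_a) - (1.0 - e_a))

# Toy judgments: (model shown as A, model shown as B, did annotators prefer A?)
judgments = [
    ("Molmo-72B", "Closed-Model-X", True),
    ("Closed-Model-X", "Molmo-72B", False),
    ("Molmo-72B", "Closed-Model-Y", True),
]

ratings = defaultdict(lambda: float(INITIAL))
for a, b, a_preferred in judgments:
    update_elo(ratings, a, b, a_preferred)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```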

### Segment 2 (05:00 - 10:00) [5:00]

So it's absolutely incredible in terms of its size, speed, and of course its effectiveness. This is not a model you want to count out at all, because it is really effective. Now you might be wondering: okay, how on Earth did they do this? Well, they've probably taken some tips from how Phi-3.5 was built, which is that high-quality data is all you need. We can see here that they state that data quality wins over quantity. Usually people just train these robots on millions and billions of different pictures and descriptions that they find on the internet, but the problem is that those descriptions are usually messy and confusing, which makes it hard for the AI to understand the right things and to learn. Because of this, robots can sometimes hallucinate, which is the biggest issue in AI right now. Instead, these guys used really good data, even if there was much less of it: basically, a few good pictures instead of a million blurry ones. They used about a million image-description pairs, which is a lot less than others, but because their data is cleaner and more detailed, their robot actually gets a lot better at analyzing these images.

They also have PixMo, which is the specialized way they gather their data, in two steps, or really two types of data. First there's the dense captioning, which means they get people to describe pictures in lots of detail: not just "a dog" but "a brown dog sitting under a tree with a red ball next to it." Then part two is the fine-tuning, which is helping the robot do specific things, like answering questions about things in images, reading documents, or even pointing at things in an image. Instead of asking people to write what they see, which can be slow or incomplete, they ask people to speak their descriptions out loud for a minute, and it turns out that when people talk, they explain more about the picture: they describe all the details and where things are, like "the dog is sitting left of the tree." They collected these detailed speech descriptions for thousands of pictures, making sure the robot learns from people and not from other robots. And I keep saying robot, but I'm talking about the neural network / the AI here; it just makes it a lot more effective, and that's why this model manages to crush the others.

Now, one example I did see on Twitter: someone tweeted that there is only one AI model that can tell the time from a clock face, and it's the most overlooked model out there. It's not GPT, Claude, or Gemini; it's Molmo by Ai2. And of course that is true: GPT-4 gets it wrong, Gemini 1.5 Pro, the most recent model, gets it wrong, Claude gets it wrong too, but Molmo manages to get it.

What's crazy about this is that they also have some really small models. Their most efficient model is MolmoE-1B, based on their fully open mixture-of-experts LLM, which nearly matches the performance of GPT-4V on both academic benchmarks and human evaluation. And of course you can see they refer to their best-in-class Molmo model: Molmo-72B achieves the highest academic benchmark score and ranks second on human evaluation, just slightly behind GPT-4o.

So right now I'm going to show you guys some more demos. One of the demos they have here is really cool: they've integrated this into a Vision Pro, showing how advanced AIs with vision capabilities can transform our understanding of our environment. "Hey, I'm going to talk about how we could use Molmo, our multimodal AI model, with an Apple Vision Pro to enable us to better understand and interact with the world. This is a model that can answer questions based on an image, and it can also augment and explain its reasoning through pointing. So let's get into it. Okay, let's start off with something basic. Hey Molmo, what is this machine used for?" "This is a sophisticated coffee station featuring a large stainless steel machine with a digital display, surrounded by coffee cups, creamers, and sugar packets." "Nice, that was pretty cool. Now let's have it augment our vision by predicting points as visual tags on the image. Hey Molmo, can you give me an overview of these options?" "The image shows a coffee machine with various drink options. Coffee is available in different strengths, including a Coffee Strong x2 option. For those who prefer milk-based drinks, there are specialty choices like cappuccino and latte. Flavored specialty drinks are also offered, such as cafe mocha and vanilla latte. Choc/van options are available for chocolate and vanilla lovers. Lastly, there are extras like hot water and a chocolate shot for customization." "Nice, that worked pretty well too. Now remind me again, what is espresso?" "Espresso is a concentrated, strongly flavored coffee made by forcing hot water through finely ground coffee beans under high pressure."
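
If you want to try one of the open checkpoints yourself, here is a rough sketch of loading it through Hugging Face transformers and asking for a detailed caption, in the spirit of the dense-caption data just described. The repository id and the `processor.process` / `model.generate_from_batch` helpers follow my recollection of the published model card example and rely on the repo's `trust_remote_code` implementation, so treat the exact names and settings as assumptions and verify them against the current model card.

```python
# Hedged sketch: querying a released Molmo checkpoint for a detailed caption.
# The repo id, helper names (processor.process, model.generate_from_batch) and
# generation settings mirror the Hugging Face model card pattern as recalled;
# they depend on trust_remote_code, so verify before relying on them.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

REPO = "allenai/Molmo-7B-D-0924"  # assumed checkpoint name

processor = AutoProcessor.from_pretrained(
    REPO, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    REPO, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Any test image works; this URL is just a placeholder.
image = Image.open(requests.get("https://picsum.photos/536/354", stream=True).raw)

# Build inputs from one image plus a dense-caption style prompt.
inputs = processor.process(images=[image], text="Describe this image in detail.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}  # batch of 1

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)

# Decode only the newly generated tokens.
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))
```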

### Segment 3 (10:00 - 15:00) [10:00]

"Point to espresso." "Pointing to espresso." "Point to maximum strength." "Pointing to maximum strength." "Point to double shot." "Pointing to double shot." "Why won't it let me start?" "The machine is likely not allowing you to start because the cup hasn't been placed on the designated cup holder or dispensing area yet." "Okay, that was great. Now let's get a cup and fill it up, and we can interact with Molmo as we walk around in the world. Here we can quickly summarize information that we're presented with, so I'm going to pick up this sign and ask: hey Molmo, can you quickly summarize this sign for me?" "The weekly lunch menu for September 23rd to 27th is displayed, featuring five days of diverse cuisine options including B and Baguette, Cafe Turk, Cedars Indian, Bloom Fusion, and Cameron Catering." "Finally, let's try one more example. Here I'm going to go into a conference room, and I'm going to ask how many chairs are in the scene. Molmo will be able to give me an answer much quicker than if I were just counting, so it's really augmenting my ability to see and interact in the scene. Hey Molmo, how many chairs are in this scene?" "Counting the chairs shows a total of 17."

Then we have another example, where they're using Molmo in robotics to aid robots with their vision. Robots are pretty good, but sometimes the vision models aren't that good, and I think this is going to help out quite a lot: if robots know exactly where to place things and exactly where things are, a good onboard vision model is going to be essential for ensuring progress in robotics. So take a look. "Here we can see what the robot is observing, and we're going to ask a couple of questions of the Molmo model and see what it says. One of the first questions we can ask is: what are the objects in this scene? Just to get a sense of what objects it can see here. As you can see, the video is being updated." "The scene contains several objects: a white countertop with a wooden base; a variety of snacks including Lay's chips, gummy bears, and Cheez-Its; a blue water bottle and a white thermos; a laptop with stickers; a white coffee mug; a tray with napkins and paper plates; a kitchen area with a white table, chairs, and a refrigerator; a microwave; a garbage can; a string of flags; a window; a blue chair; a water fountain; a man wearing a gray sweater and black pants; a woman with brown hair wearing a white tank top. These items and people create a busy kitchen or break room atmosphere with a mix of food, drink, and everyday objects." "So now let's try to ask how we can pick up one of the objects. Let's say we want to grab the white bottle, so we can say: point to the white bottle, just to see what we get. You can see that the point is shown here in the image. But let's say we want to be more precise and grab the object by the handle, so we can say: point to the handle on the white bottle. As you can see, it is pointing to the handle of the white bottle. At this point we are able to issue a command for the low-level controller on the robot to go and perform the task. Now, after the bottle is moved, we want to find a place in the scene where we can put the dirty bottles, so let's say: point to where I should put the dirty dishes. You can see that it points to the sink, so the low-level controller can take over again and the robot will move towards the sink. Okay, so now the robot is here, and we want to put the bottle that is in hand into the sink. We need to find the free spaces in there."
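
Since these robot steps all build on Molmo's pointing answers, here is a small sketch of how one might turn a "point" response into pixel coordinates that a low-level controller could act on. The XML-like tag format and the percentage coordinates are assumptions based on publicly shared Molmo examples, and the image size and response string below are made up for illustration.

```python
# Hedged sketch: parsing a Molmo-style pointing answer into pixel coordinates.
# The <point x="..." y="..."> tag format with percent coordinates is an
# assumption from public examples; verify against real model output.
import re
from typing import List, Tuple

POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"[^>]*>')

def parse_points(model_output: str) -> List[Tuple[float, float]]:
    """Extract (x, y) pairs given as percentages of image width/height."""
    return [(float(x), float(y)) for x, y in POINT_RE.findall(model_output)]

def to_pixels(points, image_width: int, image_height: int):
    """Convert percentage coordinates into integer pixel coordinates."""
    return [(round(x / 100 * image_width), round(y / 100 * image_height))
            for x, y in points]

# Hypothetical response to "Point to the handle on the white bottle."
response = '<point x="61.5" y="40.2" alt="handle of the white bottle">handle</point>'
targets = to_pixels(parse_points(response), image_width=1280, image_height=720)
print(targets)  # e.g. [(787, 289)] -- a grasp target for the low-level controller
```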

### Segment 4 (15:00 - 16:00) [15:00]

"So we can just ask Molmo: point to the free space in the sink, and then press enter, and as you can see it's pointing to a free space in the sink. Again, a low-level controller can take over from there. We can also ask a couple of other questions. For example, we can ask it to point to the tea bags, or we can even ask: point to the things I need to make tea, and we can see that the Molmo model is pointing to all the tea bags in the scene. Point to the machine that I can use to make coffee with, and you can see that it's pointing to the coffee machine." So now that you've seen all of that, let me know what you think about Molmo. This was definitely rather a surprise; I didn't expect this just after the Llama 3 release. But this is why many individuals have consistently said that AI is advancing faster than you can imagine. I can't imagine how many other companies are out there training their own models and doing their own things that are going to surprise us in a few months. If you enjoyed this video, I will see you all in the next one.
