# Llama 3.2-vision: The best open vision model?

## Metadata

- **Channel:** Learn Data with Mark
- **YouTube:** https://www.youtube.com/watch?v=pUg_pHr91Dg

## Contents

### [0:00](https://www.youtube.com/watch?v=pUg_pHr91Dg) Segment 1 (00:00 - 04:00)

Llama 3.2 Vision is an open multimodal large language model by Meta, and it's finally available on Ollama. It comes in 90-billion and 11-billion parameter sizes. I'm working on a Mac M1 Max with 64 GB of RAM, so the 90-billion model is not really going to work for me; I'm going to go with the 11-billion one. In this video we're going to find out: how well does it work, does it answer follow-up questions well, can you get it to compare images, and how fast is it? I'm going to be giving it tasks that I'd usually do with ChatGPT or Claude, which isn't really fair given the relative sizes of the models, but let's see how it does.

First, let's have a look at my Ollama version. You can see I'm on 0.4.1; you need to be on 0.4.0 or higher for this to work, so make sure you update if you're on an earlier version. Then we run `ollama serve` to start the Ollama server. This might already be running in the background if you're working on a Mac, in which case skip this step. Next we pull down the `llama3.2-vision` model. This defaults to the 11-billion version; if you want the 90-billion version, you'll have to go and work out what the tag is for that. Then we just run a simple query, saying "hello", so that it loads into memory. If you want to check which models are in memory, you can run `ollama ps`, and you can see we've got the vision model loaded; under the size column you can see it's taking up 12 GB of RAM.

Right, let's now launch it again in verbose mode so we can see the stats afterwards, and we're going to ask it to look at this image here. It's some code on an image that I created with Carbon; you can see it's a SQL query for DuckDB. We ask it: "Can you extract only the code from this image and give it to me as a markdown code block?" I'm going to start a timer in the corner so we can see how long it takes, and we'll speed things up a bit. It takes just over 30 seconds; if you look at the prompt eval duration, that's the time it takes until you get your first token back, which is 32.8 seconds, and it took 34 seconds in total. But the output is good; it's got it pretty much spot on. We can then ask it what the code does, and this time we get an immediate response. I think it's done a pretty good job of explaining how this query works.

Let's try something else. I often ask ChatGPT to critique my YouTube thumbnails, so I'm going to ask it to do the one from the last video. We say, "Hey, critique this YouTube thumbnail," and give it the image. It takes a little while until we get the response, but then it gives us a bunch of advice about how we might change it. I'm not sure I entirely agree with the suggestions it's come up with here. A cool thing, though, is that we can ask it what it thinks the video is about, and here it does a pretty good job, I'd say; it works out perfectly what this video is about.

I sometimes get ChatGPT to convert notes from my notepad if I want to have them on my computer, so let's see how Llama 3.2 Vision gets on. I ask, "Can you extract the handwriting, please, from this image?" where I've written something on a piece of paper. You can see it initially says, "I can't extract handwriting from an image, but I can help you with other tasks. Would you like me to transcribe the text in the image instead?" Which, I mean, sounds like the same thing to me, but sure, can you do that? This time it says, "The handwritten text reads..." and then it gets it absolutely spot on.

Now, something that people have asked in other videos I've done about vision models is: can it compare images against each other? So let's see how Llama 3.2 Vision gets on. We're going to say, "Can you compare these images of Ronaldo and Messi, two famous football (or soccer) players?" And it actually just crashes out of Ollama, saying the vision model only supports a single image per message. Okay, so that doesn't work. But I thought maybe I could upload the first image of Ronaldo, ask it about that, and then once it tells me about Ronaldo, ask it to compare him to Messi. So I said, "Okay, let's start with Ronaldo," and it just refuses to tell me who it is. I try another way: caption it. Still refuses. Okay, who do you think it is? Still refuses. I couldn't come up with a prompt that would get it to tell me who this is, which is really weird, because other times that I've tried, it has worked. I'm not entirely sure what's going on here, but if you have any ideas, let me know in the comments below. Apart from this weirdness with the pictures, though, this is the best open vision model that I've played with so far. If you're interested in learning about others, check out this playlist next.
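The single-image prompts shown in the video can also be sent programmatically through Ollama's HTTP API rather than the interactive CLI. A minimal sketch, assuming a local server on the default port 11434 and that `llama3.2-vision` has already been pulled (the prompt text and image path are illustrative, not from the video):

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_request(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a generate-request payload; Ollama expects images as base64 strings."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # one complete JSON response instead of a token stream
    }

def ask_about_image(prompt: str, image_path: str) -> str:
    """Send one prompt plus one image to a locally running Ollama server."""
    with open(image_path, "rb") as f:
        payload = build_request("llama3.2-vision", prompt, f.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server; the file name is hypothetical):
# print(ask_about_image(
#     "Can you extract only the code from this image as a markdown code block?",
#     "duckdb_query.png",
# ))
```

With `"stream": False` the call blocks for the full generation, so on this hardware you'd expect it to return after roughly the 34 seconds seen in the video for an image-heavy prompt.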

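Since the model rejects more than one image per message, one possible workaround for comparisons, like the sequential approach attempted in the video, is to put each image in its own turn of a single conversation via Ollama's `/api/chat` endpoint. A sketch under the same assumptions as above (helper names and file paths are illustrative):

```python
import base64
import json
import urllib.request

CHAT_URL = "http://localhost:11434/api/chat"  # default Ollama chat endpoint

def image_message(text: str, image_bytes: bytes) -> dict:
    """One user turn: the vision model allows at most one image per message."""
    return {
        "role": "user",
        "content": text,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
    }

def chat(model: str, messages: list) -> dict:
    """Send the running conversation; return the assistant's reply message."""
    body = json.dumps({"model": model, "messages": messages, "stream": False})
    req = urllib.request.Request(
        CHAT_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]

# Two-turn comparison sketch (requires a running server; file names hypothetical):
# history = [image_message("Describe this player.",
#                          open("player_a.png", "rb").read())]
# history.append(chat("llama3.2-vision", history))  # keep the reply in context
# history.append(image_message("Now compare him to this player.",
#                              open("player_b.png", "rb").read()))
# print(chat("llama3.2-vision", history)["content"])
```

Keeping the first reply in the message history is what lets the second question refer back to the first image, though as the video shows, the model may still refuse to identify specific people.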
---
*Source: https://ekstraktznaniy.ru/video/38892*