100+ Insane ChatGPT Vision Use Cases
26:16


The AI Advantage · 07.10.2023 · 59,060 views · 2,053 likes · updated 18.02.2026
Video description
ChatGPT Prompting Course including Weekly Live Events: https://aiadvantagecourse.com Today we look at 100+ ChatGPT use cases as detailed in the Microsoft paper The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). Links: https://browse.arxiv.org/pdf/2309.17421.pdf #aivision #chatgptvision #gpt4v Free AI Resources: 🔑 Get My Free ChatGPT Templates: https://myaiadvantage.com/newsletter 🌟 Receive Tailored AI Prompts + Workflows: https://v82nacfupwr.typeform.com/to/cINgYlm0 👑 Explore Curated AI Tool Rankings: https://community.myaiadvantage.com/c/ai-app-ranking/ 🐦 Twitter: https://twitter.com/TheAIAdvantage 📸 Instagram: https://www.instagram.com/ai.advantage/ Premium Options: 🎓 Join the AI Advantage Courses + Community: https://myaiadvantage.com/community 🛒 Discover Work Focused Presets in the Shop: https://shop.myaiadvantage.com/

Table of Contents (6 segments)

  1. 0:00 Segment 1 (00:00 - 05:00) 1284 words
  2. 5:00 Segment 2 (05:00 - 10:00) 1205 words
  3. 10:00 Segment 3 (10:00 - 15:00) 1184 words
  4. 15:00 Segment 4 (15:00 - 20:00) 1199 words
  5. 20:00 Segment 5 (20:00 - 25:00) 1167 words
  6. 25:00 Segment 6 (25:00 - 26:00) 306 words
0:00

Segment 1 (00:00 - 05:00)

So it has become pretty clear that GPT-4 with vision is absolutely insane, and a team of researchers at Microsoft just published a brand-new paper that pushes this vision module to its limits. Want to learn how to do something? No problem. And what about this X-ray, what does it say? Yep, it perfectly identifies the injury. Or maybe you just want a new way of interpreting TikToks? Yep, it can do that too. As a matter of fact, this paper outlines over 100 use cases that they put to the test, and look at it: this monstrosity is over 160 pages long. That's why I went through the whole thing and picked out the most interesting and fascinating use cases, which I'll break down for you right now. But first we have to talk about why all of this is actually so meaningful, and that is the fact that prompting in GPT just changed. If you follow my tutorials, you will know that I always break a prompt down into instructions and context. The instructions remain the same: you're still going to ask GPT to do a specific thing. But the context game has now completely changed, because up until now you had to specify every little detail of the activity you're trying to perform. Now you provide an image or three, just like in this very first example, where they prompted "How much did I pay for tax?" and then uploaded three images of receipts. As somebody who does the accounting for my own company, this is a godsend. By zooming in you'll see that this one says "tax", this one says "sales tax", and the third one says "total tax", and yep, there you go: it perfectly recognizes all three of them and adds them up for you. It even gives you clues as to where to find the sum on each receipt. Just think of the time savings for a solo entrepreneur who has to do this for hundreds of receipts every single month. Can't wait for this one. But moving on; as mentioned, I'll only highlight the ones that I found most significant, and the next one is the analysis of this picture right here. "What would the missing image look like?" is the prompt,
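As a minimal sketch of what the receipts example looks like in practice: the function below packs one text question together with several images (as base64 data URLs) into a chat-completions payload. The model name and the payload shape follow the OpenAI vision API format as I understand it at the time of the paper; treat both as assumptions rather than the paper's own code, and the receipt bytes here are obviously fake placeholders.

```python
import base64


def build_vision_request(question, image_bytes_list, model="gpt-4-vision-preview"):
    """Build a chat-completions payload pairing one question with N images,
    each encoded as a base64 data URL."""
    content = [{"type": "text", "text": question}]
    for raw in image_bytes_list:
        b64 = base64.b64encode(raw).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {"model": model, "messages": [{"role": "user", "content": content}]}


# Placeholder bytes stand in for three receipt photos:
payload = build_vision_request(
    "How much did I pay for tax in total across these receipts?",
    [b"fake-receipt-1", b"fake-receipt-2", b"fake-receipt-3"],
)
```

The payload would then be sent with an API client; the point is just that "context" is now one question plus a list of images, not a paragraph of hand-written description.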
and this is the image provided. As you can see here, anything marked in red it got wrong. Okay, so it didn't get this use case right, but why was that? Well, it didn't have enough info. So this is not the death of prompt engineering or anything like that; it just makes it simpler, and you still have to understand the same core concepts. Because if you go in and give a little explanation about the matrix and the order in which it's supposed to be read, it gets it right, and it figures out that this is supposed to be a star if it's read left to right. Yep, perfect. And as we look at more and more of these, you will realize that it's really not just about image recognition, because we had that before. This is image recognition with the understanding of GPT-4 baked into it, so it can reason; it can make sense of these scenes and the relations between different objects, just as GPT-4 does with text, but now applied to images. And if you bring that understanding to an ID, then you can do things like this. You can provide it with a specific template that you want it to fill out (this is not for everybody), and you can say "give me the results in a JSON format", and it will do that. As you can see right here, it just fills out the template that was provided above. And it's not perfect, okay: it doesn't get the hair right, although it's clearly stated that it's brown here. I don't know if some of the security features trip it up or what's going on, but it does have problems here and there. So this is not a magic bullet for all use cases; on something messy like an ID, where there are a lot of visual elements, don't expect perfect results, although it does get it right on this one. But the point here is that, just like with text prompting, you can use templates for it to fill out, and you can ask for specific output formats: very powerful techniques that we have explored on this channel before. And the next point I want to
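The template-plus-JSON trick can be sketched like this: embed the template in the prompt, then parse the reply strictly and keep only the expected keys. Everything here is hypothetical scaffolding (the template fields, the simulated reply string); the only real API used is Python's `json` module.

```python
import json

# Hypothetical template for the ID example:
ID_TEMPLATE = {"name": "", "date_of_birth": "", "hair_color": "", "eye_color": ""}


def template_prompt(template):
    """Ask the model to fill a fixed template and reply with strict JSON."""
    return (
        "Fill out this template from the ID photo and reply with JSON only, "
        "using exactly these keys:\n" + json.dumps(template, indent=2)
    )


def parse_reply(reply_text, template):
    """Parse the model's JSON reply; drop any stray keys it invented."""
    data = json.loads(reply_text)
    return {key: data.get(key, "") for key in template}


# Simulated model reply (a real one would come back from the API):
reply = ('{"name": "Jane Doe", "date_of_birth": "1990-01-01", '
         '"hair_color": "brown", "eye_color": "blue", "extra": "noise"}')
filled = parse_reply(reply, ID_TEMPLATE)
```

Filtering to the template's own keys is a cheap guard against exactly the kind of here-and-there mistakes the video points out.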
highlight here is an important one, and that is the fact that you can point. Because if I were to describe a point in my frame just with words, well, if I said "neon flame", it's pretty clear what I mean. But if I say "the upper right corner of the picture in the background", well, is this the upper right corner? Do I see a mirrored image? Do I mean my perspective when I turn around, or upper right from my perspective? If I skip the words and I simply do this, you know which upper right corner I mean, right? Pointing at something makes it super obvious, and it's no different with the vision module. Here you can see there are different ways to do this: you can specify coordinates, you can crop in, you can point an arrow at it, you can draw a box around it, or you can use a hand drawing. And the brand-new vision feature inside of ChatGPT is going to have a drawing tool, so you're going to get to point at specific things and it will pick it up. In practice it looks something like this: if you circle these lights here in the background, basic image recognition would just tell you, okay, these are four light bulbs. But it goes deeper than that, and that's the whole point here. Look at the output: "the pointed region in the image is a row of hanging lights on a wire". Fair enough. "The lights are small and round, and they are evenly spaced along the wire." Then it talks about more details of the location, but here's the kicker: "the lights appear to be decorative and are likely used to provide ambient light in the evening". So it doesn't just describe the lights; it understands the purpose of the lights in the context of this picture. And I'm sure if you prompted on top of that, it could describe the light that would be falling on top of the beer bottle, because it understands the scene deeply. That is a really important concept, because if we go to the next example, where we have these two arrows, object one and object two, and then a yellow circle, the deep
understanding really becomes obvious. Look at this: "object one is a glass bottle of water; it is clear and has a blue label on it". All right, that's correct. "Object two is a glass bottle of Magna beer; it is green and has a red and white label on it." Yep, correct too. But here's the tricky one: in the circled glass there appears to be a clear liquid, likely water, so it is likely that the contents of the circled glass are from object one, the glass bottle of water. Why I find this particularly interesting is that this glass of water is half covered by the snack, and it also deduces
5:00

Segment 2 (05:00 - 10:00)

that the water likely comes from the bottle that is on the table. And think about it: this is a translucent liquid on a sunny day, and it's half covered. If it gets this, just imagine the possibilities. Or actually, you don't have to, because the work has been done here, and we can just look at what results were produced. Here's a little prompting lesson baked into the review. When they ask what the reading of the speedometer is, and it's clearly somewhere around 9 mph, it gets it wrong; it says 22. The same happens when they prompt it a little differently or go into even more detail: it still gets it wrong. But look at this: as soon as they provide it with three examples down here (okay, two didn't suffice), boom, right away it gives the right answer. And the lesson here is the following: this technique is called few-shot prompting, and it's when you provide multiple examples, ideally three or more. Basically, by doing that you extend the training data of the base model, and you allow it to reproduce the pattern that you just introduced, because at a base level all these language models just recognize patterns and then reproduce them. By introducing the recurring pattern in this image, it is able to effectively read it out. So if you ever get stuck, now we have a remedy: just provide it with some examples. This goes for text prompts too. But moving on from our little prompting lesson: if you want more lessons like that, check out my prompt engineering course, which takes you from zero to a level where you're effortlessly interacting with ChatGPT and using all the newest features in just a few hours, and if it doesn't work for you, I have a no-questions-asked money-back guarantee. Okay, so that's it for my shameless self-plug; now let's look at some other use cases, because here it gets really interesting: we start dealing with humans. As you can see, it's amazing at recognizing celebrities. It has been trained on all this data and their images, so it knows who all these
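The few-shot trick from the speedometer example amounts to interleaving worked examples (image plus known answer) before the real query in the message list. The sketch below builds that structure under the same assumed OpenAI message format as before; the URLs and answers are placeholders, not the paper's actual data.

```python
def few_shot_messages(examples, query_image_url, question):
    """Interleave worked (image, answer) examples before the real query,
    so the model can copy the demonstrated reading pattern."""
    messages = []
    for image_url, answer in examples:
        messages.append({"role": "user", "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]})
        messages.append({"role": "assistant", "content": answer})
    # The actual image we want answered comes last:
    messages.append({"role": "user", "content": [
        {"type": "text", "text": question},
        {"type": "image_url", "image_url": {"url": query_image_url}},
    ]})
    return messages


# Three examples, since two didn't suffice in the paper's test:
msgs = few_shot_messages(
    [("https://example.com/dial1.jpg", "The speedometer reads 30 mph."),
     ("https://example.com/dial2.jpg", "The speedometer reads 55 mph."),
     ("https://example.com/dial3.jpg", "The speedometer reads 70 mph.")],
    "https://example.com/dial4.jpg",
    "What is the reading of the speedometer?",
)
```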
people are. As you can see, it identifies every single one of them with ease, and not just that, it also understands what these people stand for. Down here you have the CEO and co-founder of Nvidia, and just from this image it deduces that he's likely holding a GPU. So again, it doesn't just see; it understands. But this goes beyond celebrities: it can easily recognize various landmarks, and some of these are not that obvious. And beyond celebrities and landmarks, it can also recognize food, and this is where I really stopped in my tracks and I was like, what? I mean, look at this low-quality image. There's barely any color contrast here, right? It's just fish inside of some form, and it just knows this is unagi donburi, and that it originates from Japan; it's a type of rice bowl dish, typically served in a deep bowl. What? All right, or what about this? It could be any steak in the world, right? But no, it knows that this dish is called braised short rib with chimichurri, and that it features a succulent short rib that has been slow-cooked until tender and topped with a flavorful chimichurri sauce. I mean, I don't know about you, but this got me leaning back in my chair like, really? Okay, so it knows more details about various cuisines and landmarks than I do, fair enough. But when we hop down to this one: you can feed it an X-ray with a little pointer, and by simply asking "what's wrong?" it tells you there appears to be a fracture or break in the fifth metatarsal bone, commonly known as a Jones fracture. Really? This model is not even supposed to be specialized in medical use cases. Just think about the models we're going to have that only do this and have only been trained on data like this. It's incredible. And the next one is scary, because if you think about it, we already have a problem with people googling their symptoms; this is going to take it to the next level. "Look at the CT scan of a lung in the image below; tell me what's wrong," and then it tells you this could indicate a lung infection or inflammation.
Luckily, most people don't have CT scanners at home, right? But honestly, the implications for the medical profession here are severe. Okay, so it can recognize brands; that's a no-brainer. But here's the next use case that had me just looking at my screen and slowly shaking my head like, is this even real? Because look, in this case it's not just a simple prompt; it actually gives it four steps to follow: tell me the size of the image, localize each person, recognize each person, and then generate detailed captions for each bounding box. And look at it: the image is this big, here are the four bounding boxes, here are the names of the people, and here you have captions for each one of them. So if you visualize this, which is not going to be possible right away, but with DALL-E 3 released and this being able to pick up images, it's just a question of time until these two merge, and you're going to be able to recognize images and produce new images. But anyway, this is just a visualization of the output, and it perfectly recognizes each one of them, it almost perfectly places the boxes on top of them, and then it captions each one of these people. Like, look at that: Geoffrey Hinton, computer scientist and cognitive psychologist known for his work on artificial neural networks. And you know it could go deeper than that; we just asked for a simple caption. So, not bad; this is pretty impressive. But hey, don't worry, the fascination doesn't end here, because next up they looked at a bunch of memes. And it's not just that it picks up on this little procrastination meme with Kermit the Frog right here; no, it actually gets one of my favorite memes, and yep, it understands the humor: this woman is angrily pointing and yelling, and this cat is sitting there unimpressed. And then it defines it. So it understands jokes. Not that that would be something new; we knew GPT-4 could do that, but it just hits a little different when you see it applied to images. Not sure how to feel about this, but I
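To visualize boxed output like the four-people example, the model's boxes need to be mapped onto image pixels. A common convention (assumed here, not taken from the paper) is normalized `(x1, y1, x2, y2)` corners in the 0-1 range; the sketch converts those to pixel coordinates. The detections dict is hypothetical sample data, with only the Hinton caption drawn from the video.

```python
def to_pixels(box, width, height):
    """Convert a normalized (x1, y1, x2, y2) box into integer pixel
    coordinates for a width x height image."""
    x1, y1, x2, y2 = box
    return (round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height))


# Hypothetical model output for a 1000 x 800 image:
detections = [
    {"name": "Geoffrey Hinton", "box": (0.05, 0.10, 0.30, 0.90)},
    {"name": "person_2",        "box": (0.35, 0.12, 0.60, 0.92)},
]
pixel_boxes = {d["name"]: to_pixels(d["box"], 1000, 800) for d in detections}
# pixel_boxes["Geoffrey Hinton"] -> (50, 80, 300, 720)
```

From here, any drawing library can render the rectangles and captions over the photo.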
do know one thing: this is absolutely fascinating and I love it. "Which of the organisms is the producer in this food web?" This looks quite confusing, right? It would take you a few seconds to figure out where that goes. Well, GPT-4 Vision has no problem: the producers in this food web are the berries and the flowers, grasses and the seeds; these are the organisms that use
10:00

Segment 3 (10:00 - 15:00)

energy from the Sun to produce their own food through photosynthesis. Okay, not bad; that's a pretty complicated image. And you could do the same with any type of illustration, just like so, and it gets it right. But here's another one that got me personally fascinated. I looked at this and I was like, wow, we're really entering a whole new era of surveillance. Because look at this one: "suppose you're a detective; what can you infer from the visual clues in the image?" Okay, and then it goes ahead and dissects the image: the room belongs to someone who's likely a young adult or teenager, as evidenced by the casual clothing and the desk with a computer. So just think about the implications of this. If you're going to have AI-powered surveillance cameras, they're going to be scanning entire scenes, not just for humans but also for physical clues, and those might be as subtle as a certain body language you express through your movement. Now that is some super scary stuff, but right here you can already see this model is able to pick up on various clues, connect the dots, and arrive at various conclusions. Again, this one produces a lot of mixed feelings, but I guess it shows the power of this thing. Also, I'm looking forward to the day we get a model so powerful it can play GeoGuessr with the best of them. Anyway, let's get back to some use cases you could actually put to work today, and one of them is that it can read flowcharts like this. Let's not go deeper into that; I think it makes a lot of sense. But I personally thought the next one was really interesting: here you just feed it a floor plan and then you can ask it things about the floor plan. "Where's the bathroom for the second bedroom?" "According to this floor plan, it appears to be located directly adjacent to the bedroom." Okay, perfect. I'm sure you could provide it with some criteria in advance for how you want your apartment to be laid out, and then you could feed it a bunch of floor plans,
asking it to filter out the ones that don't fit your expectations. Quite convenient, if you ask me. But even more convenient than that is the next idea: feeding it a paper with diagrams, visualizations, and a bunch of text, and then saying "describe the paper in detail and highlight their contribution". As you can see, this goes a level beyond just copy-pasting the text into something like Claude 2 with a large context window; no, this will pick up on all the little details, like the references and the diagrams too, because it picks up on the visuals. But this task is a little too complex, and it starts making mistakes. Look, it gets this number wrong, and this part is completely wrong. So while this would be one of my favorite use cases, we might not be there yet. But where we did arrive already is a level where translation seamlessly blends with image recognition, because right here they're prompting it in various languages; for example, this one says "describe this image" in a different language, and GPT-4 comes back like somebody who speaks that language. That's perfect, and guess what, it can do this in any language. These capabilities aren't just fascinating or interesting; when you're on the road and traveling, this can actually become a game changer, because look at that: you could just photograph any sign and it's going to give you the translation in your very context. These two in particular I found super interesting. Here's a Greek stamp: 0% probability of you deciphering this without the help of something like GPT-4 Vision, right? But there you go; now we know the currency and all the other details. And what about this one, where the text is not even properly readable? This is handwriting that is kind of hard to decipher, but look, GPT-4 Vision has no problem: it identifies that it's Portuguese, and that it says "it is not normal to be afraid to walk alone on the street". A minor thing I have to point out: two exclamation marks instead of three here. But super helpful if you're on the go and
you find yourself in a situation where you need to figure something out in a foreign language. Just think about going to a restaurant in a foreign country and taking a picture of a menu in a language you don't speak; well, this problem seems to have been solved forever. And I know we had Google Lens, but now I will also be able to ask "what should I order?", and because I would have saved my preferences in my custom instructions while traveling, it will give me a recommendation. Google Lens doesn't do that, for now. So beyond just translation when you're traveling, it obviously also understands various cultures. Right here it identifies the corresponding culture and then gives you the translation in the local language, just like so. Truly amazing stuff. But let's move on from the travel use cases to something you might use every single day, depending on your job: just feed it a picture of a table, and it can reformat it into something brand new, in any format you desire. Now, this is something I've already been doing with Notion AI, but now you can just input an image and say "reformat it to XYZ" and it will do it. Very useful if you need this for work. And talking about useful: imagine they turn this into a little widget that could help you on your internet adventures, especially if you're trying to learn a brand-new piece of software. You could just go in and say "what is this icon used for?", and if the model is trained on that software, it will be able to tell you immediately. This unlocks a whole new feature set. I don't have to tell you how great it would be to have a personal assistant that could explain every single thing on the internet, right? And from the looks of it, this can do it. But more on navigating the internet with this later on in the paper, because now we're about to talk about video. Because at the end of the day, what is video? It's just a series of images, right? You see 24 images per second and you perceive it as video. So what if you take some of those
images? Although the model cannot process video yet, you can certainly feed it frames from a video, and if you feed it five frames of how sushi is being prepared, you can ask it to analyze the images and decide the correct order in which they should be displayed. Yep, no problem
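The video-to-frames step can be sketched as plain arithmetic: pick a handful of evenly spaced frame indices from a clip, then extract those frames with whatever video library you use. The sampling function below is a minimal sketch of that idea, not anything from the paper.

```python
def sample_frame_indices(total_frames, n_samples):
    """Pick n evenly spaced frame indices from a clip, one per equal-sized
    window, so a video collapses into a few representative stills."""
    if n_samples <= 0 or total_frames <= 0:
        return []
    if n_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / n_samples
    # Take the middle of each window of `step` frames:
    return [int(i * step + step / 2) for i in range(n_samples)]


# A 10-second clip at 24 fps -> 240 frames, sampled down to 5 stills:
indices = sample_frame_indices(240, 5)
# indices -> [24, 72, 120, 168, 216]
```

Those five stills are then sent to the model exactly like any other batch of images.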
15:00

Segment 4 (15:00 - 20:00)

at all, because it understands how sushi is prepared, so it can do this easily. But beyond something photorealistic, what if the idea is a little more abstract? What if it's not hyper-detailed, just like this drawing of various South Park characters done with keyboard characters? Yep, no problem: there are four characters in this abstract figure, they originate from the cartoon South Park, and it even names them. I mean, that's how deep the understanding runs. You could give it just the rough outlines and it will interpret something that is not even close to final form. And it gets even better than that, because look at these low-resolution images of various people: it picks up on every single one of these emotions perfectly. Fear or anxiety? Yep, that seems about right. I mean, heck, that's a little scary; this is not even high-resolution footage, and this publicly available model will be able to pick up on these things easily. I understand that this technology is getting quite scary at this point. And look, it doesn't just recognize emotion; it can also work with emotion. This is where we enter, like, dystopian territory, because they ask it, "after seeing the below image, how might people emotionally feel and react?" and, you know, who cares about the little grammatical mistakes; GPT-4 can pick it up anyway. And look at that: now GPT-4 Vision is going to try to pick up how humans could feel: "people might feel a sense of awe and wonder at the beauty of the sunset and the colors in the sky; they may feel peaceful and calm, or even inspired; some may react with a sense of gratitude for being able to witness such a beautiful moment in nature; others may feel a sense of longing to be in a place like this, or a desire to travel and see more of the world's beauty". Yep, brother, GPT-4 seems to understand. Okay, but it goes further, because it can actually understand how various content will produce various emotions. Look at this example. Here they just
prompted with the question: which one would people consider to have higher aesthetics, this one versus this one? It's probably B, right? "Based on societal standards and norms, more people will likely consider the second image as having higher aesthetics," and then it reasons through it, so it really understands. And not just that; it's even funny. If you go a step further and say "describe the image in a humorous way": "it looks like the sun is trying to hide behind the horizon, but it's not doing a very good job; it's like it's playing peekaboo with us, and the seaweed is just chilling there like, hey, don't mind us, we're just hanging out; the rocks are probably feeling left out, they're like, hey, what about us, we're part of this beach too". And if you go down the dark route, you can say "describe this image to make people feel uneasy and anxious", and it can do the same thing to the nicer image: describe the image to lessen somebody's desire to visit. "The image shows a rocky beach at sunset, with seaweed and algae covering the ground; the water is murky, and there are no clear areas to swim or relax; the rocks are sharp and slippery, making it difficult to walk or explore." Hey, it's doing a good job of persuading me not to go to this beach. But it's time to look at another downside, because it's not perfect. In the simple example where it's supposed to spot differences between images (everybody's familiar with this little riddle from their childhood, right?), it actually doesn't do a perfect job. It gets some of the things right, but then it says the hair is a lighter shade of brown compared to image 2, and that's actually not the case; they're the same shade. So again, it's not perfect yet. But what it can do is identify irregularities. If you ask it "what is wrong with the object in the image?", it gets all of these examples right. I particularly enjoyed this little screw: "the object in the image appears to be a screw or bolt with a damaged or stripped head". Let's be real,
this is not easy to identify; it would take most people a second to figure out what's going on here, but it gets it just right. And if you extrapolate this concept, you will arrive at the conclusion that you can use this for any kind of problem: damaged tires, or you could go as far as evaluating insurance claims. I mean, if I were an insurance company, I'm not sure how much trust I would put into this, but this is certainly going to change a lot of jobs, because you're going to be able to simplify a lot of processes where humans were absolutely needed beforehand. Okay, but I think now it's time to learn something again, because if you just give it an image of your grocery shopping, like so, it gets about half of it wrong. But here's the trick: you crop out the various items and you teach it first. So if you're a supermarket and you train a model on all the items you carry, then it becomes really easy to take this picture and analyze everything perfectly. And that's exactly what a lot of businesses are going to do: they're going to train their own models over time, feed them their various products and the data relevant to their own business, and then use models like this to analyze shopping carts from low-resolution images like this one. And again, this is not even a specialized model; this is a general-purpose model that does it all, and it gets this right if you train it up on these few images. I mean, that's impressive. And then here in the bottom part there are a lot of medical examples. I don't want to go too deep into that, but it seems to get all of these right. Now, I'm not sure if these are cherry-picked, but it barely made any mistakes; just this one, where it seems to have missed a fracture, which obviously could be critical. But on all the other ones it did super well, which I find super impressive, again considering that this is not a specialty medical model. Okay, and for all you AI art enthusiasts, this is the reason that DALL-E 3
works so darn well: because it is able to rate these pictures very accurately. If you just ask it "what is happening in the image?" and "on a scale of 1 to 10, decide how similar the image is to the text prompt 'a cake on the table with the words Azure Research written on it'", then it looks at it and it reevaluates: hey, okay,
20:00

Segment 5 (20:00 - 25:00)

the text is incorrect here. And because they have a powerful language model, they're able to produce an equally powerful image model, which is DALL-E 3 right now. And this is the fascinating thing: if you take this approach of, okay, you create something, then you use another instance of ChatGPT to critique the output, then you recreate it, and then you use another instance to critique it again, we're kind of entering autonomous agent, a.k.a. AutoGPT, territory. So just know that by adding visual capabilities, these AutoGPTs will get so much better, because they're going to be able to evaluate the results at a whole different level than they could up until now. And if you go a step further with this idea, you say something like "imagine that you're a home robot", and it is asked to go to the kitchen to fetch something from the fridge; the image below shows your current position; please plan your next action. Well, it does it: turn right and move forward towards the hallway. Yep, that's correct. So yeah, I'm just looking forward to the people that hook this up to a 360 camera and let the robot walk around; this will be some incredible stuff. And beyond that, there are a bunch of examples here of how it can browse the web for you, and I think after all the examples we looked at, it becomes quite obvious that a clean web interface should not be a problem for something that can navigate some of these abstract and complicated real-world situations. I mean, think about it: buying an ergonomic keyboard on Amazon is ridiculously simple compared to analyzing an X-ray it has never seen before, right? And yeah, while it's not perfect, if you look at the mistakes, they don't really matter here. It successfully navigates all the way to the cart and then proceeds to check out, and to me this just indicates that the current browsing model they released with Bing is just a lobotomized version of what they could actually do, because obviously the vision module here can navigate the internet quite
perfectly, whereas Bing just does a search and clicks a random link, and that's the best-case scenario. You're goddamn right. So you can just imagine what internet search capabilities they have behind closed curtains, where they use the newest version of their vision module to navigate it. Okay, and we're closing in on the end of our little journey here, but there are a few exceptionally interesting ones at the end, so let's look at graphical user interface navigation for mobile apps. Okay, so this is a new way of looking at TikToks: you feed it four images and then you ask "explain the story shown in the images below", and it kind of does it. But here's the one that really caught my attention: "transcribe the video content given the frames below". Okay, so it's asking for a full transcript of this TikTok video without hearing it, without seeing it; it just has these images to go off of. And here you go: "hi everyone, it's Na Explorer here, and I'm going to be sharing seven places you need to visit in Washington state", and this video doesn't even have captions. Just think about it: if you feed it TikTok screenshots, most of them have captions, so it will be able to transcribe them perfectly. So already we're able to analyze short-form video content with this model. Interesting, right? That's not something I would have thought of by myself before this paper. And again, it's quite good at being funny and understanding humor. But here's an interesting question: what happens if you take this vision module and pair it with Bing search? Because up until now, we only had the text module of GPT-4 paired with Bing, but now we get the vision module. So look, if you just feed it this image and ask "where was this photo taken?", GPT-4 Vision comes back with "sorry, I cannot answer the question accurately, as no context or information is provided with the photo". But if you feed this to the model with search, well, it figures out this image was taken in Izmir, Turkey, it's related to the earthquake event, and it gives you the
magnitude of the earthquake. And here it's clearly described that this happened on February 6, 2023, which is after GPT-4 Vision's training cutoff; that's why GPT-4 Vision fails to identify the exact location without the plugin. So this is where the abilities start to unfold: all of the plugins in the plugin store, Advanced Data Analysis, Bing, DALL-E 3 image generation, and GPT-4 Vision, over time all of those will merge into one model which can do it all. So it seems like OpenAI thinks it's too early to give those capabilities out to the public, but honestly, they're just one step away from this multimodal model that can do it all, as you can see right here. And I think we're almost ready to wrap this little exploration session up, but one thing has to be pointed out here, and that is the use case on page 154: self-reflection. Here we go one step further into this multimodal idea of all the advanced capabilities of ChatGPT Plus combined into one, and that is the fact that it can actually self-reflect and self-correct when you give it the right circumstances. Look at that last example: given a user's imagined idea of the scene, convert the idea into a self-contained sentence prompt that will be used to generate an image. Here's the idea: a photo of a dog that looks like the one in the given image, running on the beach. "Based on the above information, I wrote a detailed prompt exactly about the idea, following the rules." And look at that: in this case we spawn multiple instances of GPT-4, and it talks to itself and uses Stable Diffusion XL to generate the image (this would be DALL-E 3 in OpenAI's case). It goes ahead and prepares the prompt for you: "a blue dog running on the beach with a happy expression on its face". Here's the image; now the visual model looks at it again. Here we follow up with a second prompt, and this could be fully AI-generated (they did it manually here), but it's basically saying that you're trying to improve the sentence prompt by looking at the images
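The self-reflection example boils down to a generate-critique-regenerate loop. Here is a minimal sketch of that control flow; the generator and critic are stub functions standing in for the image model and the vision model (everything named here is hypothetical), so only the loop logic is the point.

```python
def refine_prompt(idea, generate, critique, max_rounds=3):
    """Generate -> critique -> revise loop: keep amending the prompt with
    the critic's feedback until the critic reports no mismatch."""
    prompt = idea
    for _ in range(max_rounds):
        image = generate(prompt)
        feedback = critique(idea, image)
        if feedback is None:          # critic found no mismatch: done
            return prompt
        prompt = prompt + " " + feedback
    return prompt


# Stubs standing in for the image generator and the vision critic:
def fake_generate(prompt):
    return f"<image for: {prompt}>"

def fake_critique(idea, image):
    return None if "pug" in image else "The dog should be a pug."


final = refine_prompt("a blue dog running on the beach",
                      fake_generate, fake_critique)
# -> "a blue dog running on the beach The dog should be a pug."
```

In the paper's version, both `generate` and `critique` are model calls, which is exactly the AutoGPT-style multi-instance pattern the video describes.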
generated by an AI art generation model and finding out what is different from the given idea. Okay, so now it looks at it and it says: the dog in the generated image does not look like the one in the given image; the dog in the given image is a pug, while the dog in the generated image is a different
25:00

Segment 6 (25:00 - 26:00)

breed. To address this issue, we can modify the prompt to specifically mention that the dog should be a pug; here's the new prompt. It regenerates it, then it figures out: still not a pug, it should be a pug. And then finally it arrives at "a blue pug running on the beach with a happy expression on its face, with a sunset and palm trees in the background", and this is actually what we were looking for. So although the first result did not correspond with the idea that was specified up here, the final one did. And this is the note I want to end on: although these models might not be perfect, they have the ability to talk to themselves and improve their thinking process. And if you wrap your head around how deeply these models actually understand some of the images that we will be able to feed them, you might also understand that adding image generation, internet browsing, and the ability to write and execute code to a model that can pretty much understand most images deeply unlocks a level of capability that I think most people are not ready for yet. But one thing is for sure: I'll be here keeping you up to date on all the latest developments, and in various live streams we'll be testing these tools as soon as they become available. So I hope you found this helpful, and if you want information like this as soon as possible, you might want to consider joining my course, because we discussed this paper during a live event in a small group as soon as it came out. All right, and if you want more content on ChatGPT Vision, then this video right here is just for you. I'll see you there.
