Microsoft's VISUALChatGPT Takes the Industry By STORM! (NOW UNVEILED!)
Duration: 10:28


TheAIGRID · 08.04.2023 · 330,276 views · 2,444 likes

Video description
Microsoft Visual ChatGPT - https://github.com/microsoft/visual-chatgpt Microsoft Visual ChatGPT - https://huggingface.co/spaces/microsoft/visual_chatgpt Welcome to our channel where we bring you the latest breakthroughs in AI. From deep learning to robotics, we cover it all. Our videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on our latest videos. Was there anything we missed? (For Business Enquiries) contact@theaigrid.com #LLM #Largelanguagemodel #chatgpt #AI #ArtificialIntelligence #MachineLearning #DeepLearning #NeuralNetworks #Robotics #DataScience #IntelligentSystems #Automation #TechInnovation

Contents (3 segments)

Segment 1 (00:00 - 05:00)

So Microsoft, honestly, is completely dominating the AI race. They just released a completely new tool that we've all pretty much been waiting for: introducing Visual ChatGPT. Visual ChatGPT connects ChatGPT to a series of visual foundation models so you can send and receive images while chatting. Remember when GPT-4 was announced and we were all teased with multimodal models? ChatGPT was of course upgraded from 3.5 to GPT-4, but one key feature we really wanted was this right here: you can see that this image shows exactly what happens when you bind an image to the power of ChatGPT. This is definitely very interesting, and the paper goes into a lot of detail about how it all actually works. Remember, this is a working demo that you can try in the link below, but I'm going to show you exactly what you should try and which examples work really well, because for some reason the community hasn't picked up on this. So let's take a look at this paper, titled "Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models". Looking closely, you can see it is built on several foundation models: BLIP, Stable Diffusion, Pix2Pix, ControlNet, and of course some detection models. You can see how it all works: there is a user query, and there is iterative reasoning that eventually arrives at a final answer. The initial question was "please generate a red flower conditioned on the predicted depth of this image, then make it look like a cartoon, step by step", and you can see it actually achieves that. So what are some of the examples, what does this look like, and how does it work when we're actually using it? This is Visual ChatGPT. You have to understand that this is only a demo, so I'm not sure if this is the full-fledged version, because it does say
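The iterative reasoning described above amounts to ChatGPT acting as a controller that decomposes a request into a chain of tool calls, one visual foundation model per step. Here is a minimal, hypothetical sketch of that decomposition for the red-flower example; the tool names, stub functions, and the idea of a pre-decided plan are all illustrative simplifications, not the actual code from the Microsoft repo (where ChatGPT itself chooses each step from the conversation).

```python
# Hypothetical sketch of the controller loop described above: each visual
# foundation model (VFM) is a "tool", and a request is decomposed into a
# chain of tool calls. Tool names and signatures are illustrative only.

def depth_estimator(image):
    return f"depth({image})"                       # image -> depth map

def depth_to_image(depth, prompt):
    return f"img[{prompt}|{depth}]"                # depth map + text -> image

def style_transfer(image, style):
    return f"{style}({image})"                     # restyle an image

TOOLS = {
    "DepthEstimation": depth_estimator,
    "DepthToImage": depth_to_image,
    "StyleTransfer": style_transfer,
}

def run(plan, image):
    """Execute a plan: a list of (tool_name, extra_args) steps, piping each
    tool's output into the next. In the real system ChatGPT decides the
    next step dynamically instead of following a fixed plan."""
    result = image
    for tool_name, extra in plan:
        result = TOOLS[tool_name](result, *extra)
    return result

# "Generate a red flower conditioned on the predicted depth of this image,
# then make it look like a cartoon" decomposes into three tool calls:
plan = [
    ("DepthEstimation", ()),
    ("DepthToImage", ("a red flower",)),
    ("StyleTransfer", ("cartoon",)),
]
print(run(plan, "photo.png"))
# -> cartoon(img[a red flower|depth(photo.png)])
```

The key design point is that each tool's output becomes the next tool's input, which is exactly why a single prompt can produce a multi-step edit.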
that this is a demo of the work "Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models", and of course there are many examples. All you need to do is paste your OpenAI key here, and I'm going to show you the examples I've tried, because it's actually pretty cool, definitely interesting, and gives us a tease of what to expect with GPT-4. But first, let's look at some of the examples we have right here. You can see one says "can you generate a cat for me", and it simply generates a cat. Now you could be asking: wait, how is this different from Midjourney or Stable Diffusion? I'd argue it is different, and you're about to see why — those are simply prompt-to-image generators, whereas here it says "can you replace the cat with a dog and then remove the book", and you can see it simply does that. Then it says "that's cool, could you generate the canny edge of this image", and it does that instantly. Then it says "now generate a yellow dog based on this image", and it does that as well, which is very cool. Then of course we have this last one: when you send in an image it replies "received", so you know the system has the image, and then it's asked "what color is this motorcycle" — the motorcycle is black — "can you remove the motorcycle", and boom, the motorcycle is gone. This is really interesting, because now we're starting to get models that actually have this kind of capability embedded in them, and of course Microsoft is working on tons of different projects incorporating many different large language models, which I will cover in tomorrow's video. Essentially, what Microsoft said here was: instead of training a new multimodal ChatGPT from scratch,
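The "canny edge" request above corresponds to a standard edge-detection preprocessing step (the same kind of edge map ControlNet conditions on). As a rough illustration, here is a toy gradient-magnitude edge detector in plain NumPy; a real pipeline would use something like OpenCV's `cv2.Canny`, which adds Gaussian smoothing and hysteresis thresholding on top of this idea.

```python
import numpy as np

def edge_map(img, thresh=0.5):
    """Toy gradient-magnitude edge detector on a 2-D grayscale array.
    A stand-in for the Canny edges used as ControlNet conditioning;
    real code would call cv2.Canny instead."""
    gx = np.zeros_like(img, dtype=float)
    gy = np.zeros_like(img, dtype=float)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # horizontal central difference
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # vertical central difference
    mag = np.hypot(gx, gy)                   # gradient magnitude
    return (mag > thresh).astype(np.uint8)   # binarize: 1 = edge pixel

# A white square on a black background: edges fire only around its border.
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0
edges = edge_map(img)
print(edges)
```

Running this prints a ring of 1s around the square's boundary, with 0s both inside the square and in the background, which is the shape outline a depth- or edge-conditioned generator would draw on top of.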
"we build Visual ChatGPT directly based on ChatGPT and incorporate a variety of VFMs" — and that's from Microsoft. Now remember, this is very different from GPT-4's multimodal feature, which will be released later; it's not the same thing, and I just want to clarify that before people get confused. If you're wondering what VFMs are, they're visual foundation models that essentially let computers see: they transfer images into text via descriptions. The reason this is cool is that it shows that when different pieces of AI work together, we can get really interesting software. One of the most interesting examples I wanted to showcase, before I get into a live demo of the tool, is exactly how this works. The user starts by saying "hello, how are you", and the software replies "Hello, Visual ChatGPT here to help you with a wide range of tasks." The person says "I like drawing but I'm not good at drawing, can you help me? I'd like to draw an apple", and it generates the apple. What's also cool is that the user then uploads a very basic image, which you can see right here, and asks Visual ChatGPT: "this is a sketch of my apple and a drinking glass, can you help me improve it?" It replies "I generated a new image based on your sketch; the image is saved as" this file right here. So this is really cool when it comes to generating images from initial sketches; obviously there are other programs out there that can do this, and I'm not sure whether they're connected to this one, but it definitely shows how quickly Microsoft is moving with this AI software. What's also cool is that this person then asks
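The phrase "transfer images into text" is the core trick: every image in the conversation is reduced to a unique filename plus a caption from a captioning VFM (such as BLIP), and that text is what ChatGPT actually reasons over. Here is a minimal, hypothetical sketch of that bookkeeping; the caption function is a stub, and the history format is made up for illustration, not copied from the repo's actual prompt manager.

```python
# Hypothetical sketch of how an uploaded image becomes text the chat model
# can reason about: it gets a unique filename handle plus a caption, and
# both are injected into the conversation history as plain text.
import uuid

def caption_vfm(path):
    # Stand-in for a real captioning model such as BLIP.
    return "a sketch of an apple and a drinking glass"

class History:
    def __init__(self):
        self.turns = []

    def add_image(self, src_path):
        name = f"image/{uuid.uuid4().hex[:8]}.png"   # handle used by later edits
        cap = caption_vfm(src_path)
        self.turns.append(f"Human: provided a figure named {name}. "
                          f"Description: {cap}.")
        return name

    def add(self, role, text):
        self.turns.append(f"{role}: {text}")

    def prompt(self):
        return "\n".join(self.turns)

h = History()
name = h.add_image("my_sketch.png")
h.add("Human", f"improve the sketch in {name}")
print(h.prompt())
```

This also explains the "the new file name is..." replies seen throughout the demo: every tool output is written to disk and referenced by its filename so follow-up requests can edit it.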

Segment 2 (05:00 - 10:00)

it to make this into a watercolor image, and you can see it gets the prompt right. It also says "wow, this is really cool, can you tell me what color the background is" — "the background color is blue". Like I said, this is very different from traditional ChatGPT models or traditional prompt-to-image models, where you simply ask for an image. Sure, you could use Midjourney or Stable Diffusion just to generate that image, but if you were to ask Midjourney or Stable Diffusion what color the image is, they wouldn't be able to tell you. Yes, Midjourney does have a new describe feature, but that doesn't match what's going on here, because you can literally ask "can you tell me what color the background is" and get "the background color is blue", then say "can you remove the apple in this picture and then describe the image" — the apple is removed, and the description says it now contains a drinking glass with a blue background. What's also interesting is that sometimes it does fail: you can see right here that this is the apple and this is the shadow of the apple, and although the apple is removed, the shadow remains. The user then says "there are still some shadows on the table in the image, can you help me replace the table with a black table", and it does that very quickly and pretty accurately. So this is Visual ChatGPT, this is me exploring the demo, and let's take a look at some of the examples I've done, because boy oh boy is this interesting. You can see that I clicked one of the preset examples, "generate a figure of a cat running in the garden", and it generates this image, which we can open in a new tab — obviously this isn't the best
image; I'm pretty sure this uses Stable Diffusion, which is why it looks like that, and maybe they could incorporate Midjourney in the future, but that is that. Then I sent it this image of Jack Kilby, an American electrical engineer, and asked "who is this?", and it just says the image is of a man in a suit and tie wearing glasses. So it does a very good job of describing what the image shows, but it doesn't identify who the person is, and I'm guessing they don't want facial recognition services on this kind of platform, because there are a lot of privacy concerns around that. Of course, like I said, you can definitely use this, so I asked it to generate an image. I did try to use Visual ChatGPT for the last experiment, but for some reason it seems pretty buggy; I'm not sure exactly what the bug is — maybe you guys in the comment section can figure it out. Here is another example: I once again said "generate a figure of a cat running in the garden", and this time it generated a much better image, which definitely looks more realistic. Then I said "can you make the cat disappear", and it replied "I have removed the cat from this image, the new file name is" right here, and you can see it shows the image without the cat — and this time it actually removes the shadows too. Then I noticed there are flowers on the side — bright pink or purple flowers — so I said "okay, can you make the pink flower yellow", and it says "I have replaced the pink flower with a yellow flower, the new file name is" and so on. This definitely does seem to be working, but I wonder whether Microsoft is even going to extend this project, because
I think this is just a demo. I'm guessing what they really wanted to do with this software was show what's possible: even if GPT-4 isn't multimodal yet, there are still workarounds to get the outcome you want. You can see that if I put something in the chat again, it usually doesn't register for some reason — I think maybe you get three tries, I'm not entirely sure — but it doesn't always work. When it does work, I get lucky: I click "describe this image" and it says the image shows a cat jumping over a purple flower, which is only true of the first image, if anything. So yes, there are inconsistencies with this model, which are to be expected, as they said in the paper. Reading the paper, they acknowledge significant limitations: although Visual ChatGPT is a promising approach for multimodal dialogue, its limitations include dependence on ChatGPT and the VFMs — it relies heavily on ChatGPT to assign tasks to the right VFM and execute them, and if the wrong one is chosen, that affects the accuracy of the output. It also needs heavy prompt engineering, which means that if you don't really know what to write into the text box, the output you get may not be very good. As the paper puts it, Visual ChatGPT requires a significant amount of prompt engineering to convert VFMs into language and make these model descriptions distinguishable; this process can be time-consuming and requires expertise in both computer vision and natural language processing. It also mentions limited real-time capabilities, because of course something that runs in real time is going to be far better than something that is just general.
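The "prompt engineering" limitation above is concrete: every VFM has to be described in natural language precisely enough that ChatGPT can tell the tools apart and route requests correctly. A rough illustrative sketch of rendering such tool descriptions into a system prompt is below; the tool names and wording here are invented for illustration, not the repo's actual prompts.

```python
# Illustrative sketch of the prompt engineering the paper describes:
# each VFM needs a natural-language description distinct enough that
# the chat model can pick the right tool. All names/wording invented.
TOOL_SPECS = [
    {"name": "Edge Detection On Image",
     "desc": "useful when you want the canny edge map of an image. "
             "Input: the image file name."},
    {"name": "Remove Something From The Photo",
     "desc": "useful when you want to erase an object from an image. "
             "Input: the image file name and the object to remove, "
             "separated by a comma."},
]

def system_prompt(specs):
    """Render the tool descriptions into one instruction block."""
    lines = ["You can use these tools:"]
    for s in specs:
        lines.append(f"> {s['name']}: {s['desc']}")
    lines.append("Reply with a tool name and its input, or a final answer.")
    return "\n".join(lines)

print(system_prompt(TOOL_SPECS))
```

If two descriptions overlap (say, two tools both claim to "edit an image"), the controller starts dispatching to the wrong VFM, which is exactly the accuracy failure mode the paper warns about.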

Segment 3 (10:00 - 10:00)

Of course a general tool is still useful, but something more real-time is going to be far more effective than something limited to whatever you can feed it at a computer. It's also important to add that this will not be replacing GPT-4's multimodal features, since some people may be confused. As you can see from this comment, someone asks "how can I use GPT-4 with images?" and OpenAI responds: "we aren't operating this service right now, but when it does release, we'll announce it to the community."
