Microsoft NEW AI Agents ARMY Is Here! Fully Autonomous SOFTWARE DEVELOPERS (AutoDev)
14:29

TheAIGRID · 17.03.2024 · 50,149 views · 1,096 likes

Video description
✉️ Join My Weekly Newsletter - https://mailchi.mp/6cff54ad7e2e/theaigrid
🐤 Follow Me on Twitter - https://twitter.com/TheAiGrid
🌐 Check out my website - https://theaigrid.com/
Links from today's video: https://arxiv.org/pdf/2403.08299.pdf
Welcome to my channel, where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos. Was there anything I missed?
(For business enquiries) contact@theaigrid.com
#LLM #Largelanguagemodel #chatgpt #AI #ArtificialIntelligence #MachineLearning #DeepLearning #NeuralNetworks #Robotics #DataScience

Contents (3 segments)

Segment 1 (00:00 - 05:00)

There has been a ridiculous number of AI agents released lately by many different groups, and Microsoft has stepped into the arena with something I could call Devin 2.0. They've called it AutoDev: automated AI-driven development. This is fascinating because it's very similar to what we had last week with Devin, which shook the entire industry: people realized that future careers and the whole landscape are going to change in ways we never expected. Artists were the first ones on the chopping block, and now it may be software engineers, because AutoDev shows what's now possible in software engineering. The paper states that the landscape of software development has witnessed a paradigm shift with the advent of AI-powered assistants, and it references GitHub Copilot, which you all know. But GitHub Copilot is just that, a copilot: it can't build, test, and execute code or perform Git operations, so tools like it are constrained by limited capabilities, focusing on suggesting code snippets and file manipulation within a chat-based interface. To fill this gap, they present AutoDev, "a fully automated AI-driven software development framework designed for autonomous planning and execution of intricate software engineering tasks." So this is a fully automated AI agent that can plan and execute intricate software engineering tasks, and this is their first iteration; I can almost guarantee they'll come up with new iterations, unless GPT-5 comes up with something we didn't think of. I'm not sure whether Microsoft built this in response to Devin, but something like this takes time to build, so the timing is probably a coincidence.

The paper says AutoDev enables users to define complex software engineering objectives, which are assigned to AutoDev's autonomous AI agents to achieve. These agents can perform diverse operations on a codebase, including file editing, retrieval, build processes, execution, testing, and Git operations. They also have access to files, compiler output, build and testing logs, and other tools, which lets them execute tasks in a fully automated manner with a comprehensive understanding of the contextual information required. The crazy thing is that this isn't a single AI agent: it's a group of agents working in the framework, and each agent has a different role. In their evaluation they tested AutoDev on the HumanEval dataset, obtaining promising results of 91.5% and 87.8% Pass@1 on code generation and test generation respectively, which is actually pretty good. The one thing I wonder is why they didn't compare against other benchmarks, which I'll come back to in a moment.

Now for the architecture, which I find very interesting. It's quite similar to Devin's, but I think this one has more AI agents working inside it. Figure 1 of the paper shows AutoDev enabling an AI agent to achieve a given objective by performing several actions within the repository: the eval environment executes the suggested operations and provides the AI agent with the resulting outcome. In the conversation shown, the purple messages are from the AI agent while the blue messages are responses from the eval environment. The user gives the objective; the agent runs the code, recognizes a failure, tests a fix in the environment, re-runs the tests to confirm success, and then reports back to the user. This is based on GPT-4, which isn't exactly old, but given where we are in the LLM space, other recently released AI systems do improve on it quite a bit. The reason I find this framework a bit more fascinating is the collaborative agents: the paper says the user can define the number and behavior of the agents, assigning specific responsibilities, permissions, and available actions. For example, the user could define a Developer agent and a Reviewer agent that collaboratively work towards an objective. This is essentially an agent swarm: multiple agents working together, each independently working out part of the solution, which can be rather more effective. It will be interesting to see whether it actually works better than a single agent doing everything. The framework diagram explains how all of this fits together.
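The agent/environment loop described above (the agent suggests an operation, the eval environment executes it and feeds the outcome back until the objective is reached) can be sketched roughly as follows. Everything here is an illustrative stand-in: the class names, the command set, and the scripted "agent" are assumptions for demonstration, not AutoDev's actual interface.

```python
# Sketch of an AutoDev-style suggest/execute/observe loop (illustrative only).

class EvalEnvironment:
    """Executes agent commands against a (here: in-memory) repository."""
    def __init__(self):
        self.files = {}

    def execute(self, command, arg=None, body=None):
        if command == "write":
            self.files[arg] = body
            return f"wrote {arg}"
        if command == "retrieve":
            return self.files.get(arg, f"error: {arg} not found")
        if command == "test":
            # Stand-in for running the real test suite in the sandbox.
            return "pass" if "solution.py" in self.files else "fail"
        return f"error: unknown command {command!r}"

class ScriptedAgent:
    """Stand-in for the LLM: replays a fixed plan and stops when tests pass."""
    def __init__(self, plan):
        self.plan = iter(plan)

    def next_action(self, last_outcome):
        if last_outcome == "pass":
            return None  # objective reached
        return next(self.plan, None)

env = EvalEnvironment()
agent = ScriptedAgent([
    ("write", "solution.py", "def add(a, b):\n    return a + b\n"),
    ("test", None, None),
])

outcome = ""
while (action := agent.next_action(outcome)) is not None:
    outcome = env.execute(*action)
    print(f"agent -> {action[0]}: env -> {outcome}")
```

In the real system the agent is a GPT-4 call and the environment runs builds and tests inside Docker, but the shape of the conversation (purple agent messages, blue environment responses) is this same loop.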

Segment 2 (05:00 - 10:00)

You can think of this like a kitchen. The conversation manager is the head chef, and the specialized chefs are the AI agents: there are multiple agents, and the head chef coordinates all of them as they work on the same meal. The eval environment is the kitchen where they work; it holds the ingredients, namely the repository state and the tools, and Docker is a safety seal around it that keeps everything clean and secure and prevents cross-contamination. Then there's the tools library, where the specialized chefs keep their utensils: file editing, retrieval, building, execution, testing, and Git, the recipe-tracking version control system. These are the operations the agents use to do whatever they need to do. The conversation manager is where you type in your objective; it figures out what to do and gets the agents to do it. In short, you tell the system what you want to achieve and it coordinates the different specialized parts to make it happen, with everything done in steps and checked along the way. This setup is a little different from Devin's, so it will be interesting to see how the framework compares to Devin on benchmarks.

It is, of course, based on GPT-4. The paper says the agents, comprising large language models like OpenAI's GPT-4 and small language models optimized for code generation, communicate through natural language. They receive objectives from the agent scheduler and respond with actions specified by the rules and actions configuration; each agent, with its unique configuration, contributes to overall progress towards the user's objective. So again, they have multiple instances of GPT-4 acting as specialized agents: one is the reviewer, one is the developer, and so on. Compared to Devin, this is an interesting framework because it takes a different approach: an entire swarm of agents collaboratively completing your task, rather than one singular AI agent.

On the benchmarks, they state that for code generation on HumanEval, AutoDev achieves top-three performance on the leaderboard without extra training data, unlike LATS and Reflexion. LATS stands for Language Agent Tree Search; the tree search part means searching through multiple candidate solutions to find one that solves the coding problem. Reflexion is a different technique that also improves performance over the base model. Both build on the base model with extra training on top, which is how Reflexion is able to reach 94. AutoDev isn't better than LATS, but it doesn't require any extra training, and that's why it's so impressive: the zero-shot GPT-4 baseline sits at 67%, while AutoDev reaches 91.5% with zero extra training. AutoDev also attains a Pass@1 score of 87.8% on the HumanEval dataset modified for the test generation task, a 17% relative improvement over the baseline using the same GPT-4 model.

To break this down further, the results table compares humans and two GPT-4-based approaches. Pass@1 tells us how often the first solution given was correct; it's like asking someone a question and seeing if they get it right on the first try. Passing coverage shows how many of the problems could be solved at all, regardless of how many attempts were made, and overall coverage combines how many problems were solved with how accurately they were solved. Humans, who wrote the tests and are good at them, get a perfect 100 on the first try, and almost 99.4% of the problems can be solved. With AutoDev on GPT-4, first-try success drops to 87.8%, so it's not quite as good as a human, but it's still impressive: its passing coverage is 99.3%, meaning it can solve about as many problems as humans can, even though its overall coverage is lower because it doesn't get everything right on the first try.
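The Pass@1 and coverage numbers being compared here can be made concrete with a tiny sketch. The helper names and the toy attempt data below are made up for illustration; this is not the paper's evaluation harness or its actual HumanEval results.

```python
# Illustrative computation of two of the metrics discussed: Pass@1 (fraction
# of problems solved on the very first attempt) and coverage (fraction solved
# by any attempt at all).

def pass_at_1(results):
    """results: one list of boolean attempt outcomes per problem."""
    return sum(attempts[0] for attempts in results) / len(results)

def coverage(results):
    return sum(any(attempts) for attempts in results) / len(results)

# Four toy problems: solved immediately, solved on retry, never solved,
# solved immediately.
results = [[True], [False, True], [False, False], [True]]

print(f"Pass@1:   {pass_at_1(results):.0%}")   # 2 of 4 solved on the first try
print(f"coverage: {coverage(results):.0%}")    # 3 of 4 solved eventually
```

This is why a system can post 87.8% Pass@1 but 99.3% passing coverage: retries recover most of the problems missed on the first attempt.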

Segment 3 (10:00 - 14:00)

The zero-shot baseline, which is just GPT-4 attempting the task without any special preparation (hence "zero-shot"), gets the answer right on the first try 75% of the time. That's still less than AutoDev or humans, and while it can still solve a lot of problems, it isn't matching AutoDev: its passing and overall coverage sit at around 74. One benchmark I would have loved to see is SWE-bench, the real-world software engineering benchmark. Maybe they didn't run it because it wasn't one they wanted to do, or because AutoDev didn't perform that well on it and people would have made the comparison; either way, if AutoDev is open-sourced and Microsoft turns it into an actual product, it's going to be fascinating to watch. Meanwhile, OpenAI's silence is completely deafening: it's been over a year since their last major model release, and we're seeing people surpass GPT-4 in all sorts of categories. Either, in the most unlikely scenario, GPT-4 did so well that they're now struggling to match it, which I don't think is the explanation at all, or GPT-5 and the other AI systems they're working on are simply so advanced that they don't care about things like Devin or AutoDev, because they're so far ahead they feel no worry and no pressure to release tools while the competition isn't anywhere near close. Maybe that's why the silence is deafening. It will also be interesting to come back to Devin's 13.86% on SWE-bench for software engineering and see how that number improves in the future.

As for the actual worked example, which I probably should have covered earlier in the video, we can break it down. First, the system identifies the error: the tests aren't passing as they should, and you can see an AssertionError along with the failing test case and the expected output. Then comes the reasoning: the agent decides the test case is incorrect and needs to be fixed, says "let's correct this," applies the update in the environment, runs the tests again, reports that all tests pass and the goal has been reached, and ends the interaction. Overall this is a really effective, very simple feedback loop, and we've seen the same pattern in other software, so it will be interesting to see whether it behaves like what we saw with Devin.

One of the last things they discuss is that AutoDev allows AI agents to communicate progress on tasks or request human feedback using the "talk" and "ask" commands respectively. Anecdotally, these commands have proven helpful for developers using AutoDev to understand the agents' intentions and gain insight into their plans: while an agent is doing something, you can simply ask it why it's doing what it's doing, which helps you understand it, direct it, and help it along. Their future plans involve deeper integration of humans within the AutoDev loop, allowing users to interrupt agents and provide prompt feedback; in other words, in the future you'll be able to stop the agents completely and tell them to do something else instead. That will be interesting, but like I said, we'll see whether they still pursue it with GPT-5 on the horizon, AI agents on the horizon, Devin on the horizon, and Maisa's KPU showing ridiculous reasoning capabilities. All in all, we are moving in a rapidly changing world, and AutoDev is another step in that direction. With that said, what do you think about these agent swarms, with autonomous AI agents all working together to achieve a goal? Do you think this will be just for coding, or will there be other applications, like AI agent swarms for companies' marketing reports? Autonomous AI agents working together have been talked about a lot, and there have been some early attempts, but when we do get this multi-agent setup in the future, I think it's going to be really interesting. Let me know what you thought, and if you enjoyed the video, I'll see you in the next one.
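The fix-and-retest feedback loop walked through in this segment (run the tests, let the agent propose a repair on failure, apply it, re-run until everything passes) might look like this in miniature. The canned `agent_propose_fix` is a purely illustrative stand-in for the GPT-4 call; none of the names here are AutoDev's real API.

```python
# Minimal sketch of the identify-error -> fix -> retest loop (illustrative).

def run_tests(code_ns):
    """Run a toy test suite against the compiled code namespace."""
    try:
        assert code_ns["add"](2, 2) == 4
        return True, None
    except AssertionError as exc:
        return False, f"AssertionError: {exc}"

def agent_propose_fix(error):
    # Stand-in for asking the LLM to repair the failing code given the error.
    return "def add(a, b):\n    return a + b\n"

source = "def add(a, b):\n    return a - b\n"  # deliberately buggy version
for attempt in range(3):
    namespace = {}
    exec(source, namespace)       # "build" step: load the current code
    ok, error = run_tests(namespace)
    print(f"attempt {attempt}: {'pass' if ok else error}")
    if ok:
        break                     # goal reached; end the interaction
    source = agent_propose_fix(error)
```

The first attempt fails with an assertion error, the "agent" supplies a fix, and the second attempt passes, which is exactly the shape of the interaction shown in the video.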
