# New OPEN SOURCE Software ENGINEER Agent Outperforms ALL! (Open Source DEVIN!)

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=TgB6JO6gup0
- **Date:** 02.04.2024
- **Duration:** 16:02
- **Views:** 22,114

## Description

How To Not Be Replaced By AGI https://youtu.be/AiDR2aMye5M
Stay Up To Date With AI Job Market - https://www.youtube.com/@UCSPkiRjFYpz-8DY-aF_1wRg 
AI Tutorials - https://www.youtube.com/@TheAIGRIDAcademy/ 

🐤 Follow Me on Twitter https://twitter.com/TheAiGrid
🌐 Checkout My website - https://theaigrid.com/

Timestamps
00:00 Announcement
00:57 Open Source
02:26 Compared Benchmarks
03:21 How it works
05:07 New Design
06:34 Limit Information
08:30 Easily Configured
10:15 Demo
12:03 Paper Release
12:45 How expensive
14:36 Open Source Models Powering

Links From Today's Video:
https://swe-agent.com/

Welcome to my channel, where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos.

Was there anything I missed?

(For Business Enquiries)  contact@theaigrid.com

#LLM #Largelanguagemodel #chatgpt
#AI
#ArtificialIntelligence
#MachineLearning
#DeepLearning
#NeuralNetworks
#Robotics
#DataScience

## Contents

### [0:00](https://www.youtube.com/watch?v=TgB6JO6gup0) Announcement

So there has been an announcement of an advanced-level open-source software engineering agent, and you can see here that this is really striking, because it was only recently that Devin became the first autonomous software engineer, something that took the industry by storm. So in this video I'm going to give you ten key takeaways on this open-source agent and what it is actually able to do. Here we can see the announcement: it says SWE-agent is our new system for autonomously solving issues in GitHub repos; it gets similar accuracy to Devin on the software engineering benchmarks, takes 93 seconds on average, and it's open source, and we designed a new agent-computer interface to make it easy for GPT-4 to edit and run code. So let's take a look at the ten things you should know. One of the first is that it is open source.

### [0:57](https://www.youtube.com/watch?v=TgB6JO6gup0&t=57s) Open Source

The first thing you should know is that it is completely open source, and you can see right here that this is absolutely incredible: on the comparative benchmarks it achieves 12.29% compared to Devin's 13.84%. Now why is this so crazy? Well, remember that Devin, an actual software engineering agent that was not open source, had a $25 million Series A funding round, while this small team of open-source developers has managed to achieve relatively similar results in, I would argue, a short amount of time and with a lot less capital. That goes to show that open source can achieve remarkable results in shorter time spans than bigger teams, which makes it quite surprising how effective this team has been at building something as quickly and as well as they have. What's crazy as well is the distance between the two: it's not like there is a massive difference, it's only about 1.5 percentage points, so you could argue that these systems are practically similar, and it will be interesting to see how their abilities increase with scale and with future models like GPT-5 or upgraded versions like GPT-4.5. Now, like I said before, point number two is that if we look once again at Devin's benchmark, we can see that the other systems it was comparing

### [2:26](https://www.youtube.com/watch?v=TgB6JO6gup0&t=146s) Compared Benchmarks

itself against were far lower in comparison. But if we look at the compared benchmarks in point number two, we can see that it has really closed the gap, if one even exists, because it means open source has pretty much caught up to the state-of-the-art closed source in terms of what is possible. That just goes to show that with agents, unlike with LLMs themselves, open source could potentially catch up or even overtake. I'm guessing the reason open source could catch up to closed source on comparative benchmarks is that both of these systems are built on base-level GPT-4, or potentially Claude Opus, considering those models have advanced planning and coding capabilities natively built into them. Now let's move on to point number three, which is exactly how this works. So how does this software engineering agent work? It works

### [3:21](https://www.youtube.com/watch?v=TgB6JO6gup0&t=201s) How it works

by interacting with a specialized terminal, which allows it to open, scroll through, and edit files. It also allows it to edit specific lines with automatic syntax checks, and to write and execute tests, and the custom-built interface is critical for good performance. So this is where they essentially describe how it works: in a specialized terminal that allows it to think through its actions. Looking at the demonstration, we can see that there are thoughts and actions, and then there are observations where it is able to check what it is doing. Right here it states that our reproduction script confirms the reported issue: maximum and minimum are not being converted to R, let's search for files related to R code generation. It searches, then we can see the observation, and then more thoughts and actions. So right here we can see exactly what the system is thinking and then its action: the thought of the AI system is that the responsible file is likely here and we should do this, then of course we see the action at the end, and from the observation it goes back to thought and action again. So it seems like the system in point number three just thinks, then acts, then observes what's been done, and then thinks once again. What's cool about this is that we can see this is an open-source software agent capable of seemingly long-term planning, or at least of planning iteratively as it moves forward. Now point number four is rather fascinating, because I saw something that I didn't think we would see. Essentially, what we also saw in point number
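The think → act → observe cycle described above can be sketched in a few lines. This is a toy illustration under my own assumptions: `think` stands in for an LLM call, and the action names and observation strings are invented, not taken from the SWE-agent codebase.

```python
# Toy sketch of a thought/action/observation agent loop.
# `think` is a stand-in for an LLM; the action set is hypothetical.

def think(history):
    """Stand-in for the LLM: decide the next action from all observations so far."""
    joined = " ".join(history)
    if "bug reproduced" not in joined:
        return ("run_tests", None)        # first, confirm the reported issue
    if "fix applied" not in joined:
        return ("edit_file", "fix line")  # then attempt a fix
    return ("done", None)

def execute(action, arg):
    """Stand-in for the environment: run the action, return an observation."""
    if action == "run_tests":
        return "bug reproduced"
    if action == "edit_file":
        return f"fix applied: {arg}"
    return "stopped"

def agent_loop(max_steps=10):
    history = []
    for _ in range(max_steps):
        action, arg = think(history)      # thought -> action
        observation = execute(action, arg)
        history.append(observation)       # observation feeds the next thought
        if action == "done":
            break
    return history
```

The key point the video makes is that each observation is fed back before the next thought, so planning happens iteratively rather than all up front.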

### [5:07](https://www.youtube.com/watch?v=TgB6JO6gup0&t=307s) New Design

four was the fact that there is a new design. It says: simply connecting an LM to a vanilla bash terminal does not work well; our key insight is that LMs require carefully designed agent-computer interfaces, similar to how humans like good user interface design. For example, when the LM messes up indentation, our editor prevents it and gives it feedback. So essentially what we can see here is that the language model needs an agent-computer interface that is very friendly in order for it to work effectively, and they said that simply connecting it to a vanilla bash terminal just doesn't work well. So they've designed a new interface that works well natively with these LLMs, to make sure they understand exactly what is going on and are more effective. We can see right here that there is LM-friendly environment feedback going to the agent; the agent-computer interface then offers very simple commands it can use, such as navigate the repo, search files, use the file viewer, and edit lines; those get converted into computer actions, and the results come back. So we have this entire system where the editor prevents mistakes and lets it work more effectively, and it's clear the new design made a huge difference to the performance of this agent. Now, there was also something else on the design: in point number five we also saw that they were basically limiting
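A minimal sketch of such an agent-computer interface, assuming a toy in-memory file store. The command names (search files, view, edit lines) follow the video, but the class and method signatures here are hypothetical, not SWE-agent's actual API.

```python
# Sketch of an agent-computer interface: a small, fixed command set the
# model calls instead of a raw bash terminal. Implementations are toy
# stand-ins; the feedback-on-mistakes idea mirrors the editor described above.

class ACI:
    def __init__(self, files):
        self.files = files               # {path: list of source lines}

    def search_files(self, term):
        """Return paths whose contents mention `term`."""
        return [p for p, lines in self.files.items()
                if any(term in ln for ln in lines)]

    def view(self, path, start=0, window=100):
        """Show a bounded window of a file, never the whole thing."""
        return self.files[path][start:start + window]

    def edit(self, path, lineno, new_line):
        """Replace one line, with a trivial 'lint' check as feedback."""
        if new_line.rstrip() != new_line:
            return "error: trailing whitespace"   # feedback, not silent failure
        self.files[path][lineno] = new_line
        return "ok"
```

The design choice is that every command returns a short, structured result the model can read, which is what "LM-friendly environment feedback" amounts to.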

### [6:34](https://www.youtube.com/watch?v=TgB6JO6gup0&t=394s) Limit Information

the information given to this AI system. They said that another example is that they discovered that, for viewing files, letting SWE-agent view only 100 lines at a time was better than letting it view 200 or 300 lines, and much better than letting it view the entire file. So essentially what they're stating is that they didn't want to give the AI system entire files when letting it complete the task; it's only allowed to view 100 lines at a time, and that was much better than 200 or 300, because I'm guessing that viewing the whole file likely increased the complexity of what was being done, maybe confused the model, and fewer lines let the model process what was going on better. So from this we could judge that the internal agent works better when it has fewer things to attend to, which I'm guessing is not too surprising, though you would think that if an agent had access to the entire file it might perform better. I'm guessing that showing it just 100 lines at a time allows it to plan better, be more effective, and dedicate all of its compute to ensuring what it does is correct. They also say good agent-computer design is even more important when using GPT-4, so if you are building an advanced AI software engineering agent, it is possible that limiting it to viewing 100 lines at a time might be better than 200 or 300. That is something interesting, and I wonder whether in future there will be an optimal number of lines for a software engineering agent to view, or whether there will be multiple software engineering agents, maybe three or four, possibly collaborating on different parts of the entire codebase, fixing it at one time. Now point number six was also rather fascinating: the software engineering
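The 100-line window described above might be implemented roughly like this. It is a sketch under my own assumptions; the header format and function names are invented, not taken from the project.

```python
# Sketch of a windowed file viewer: the agent sees only `window` lines
# at a time (the video reports 100 worked better than 200 or 300, and
# much better than the full file).

def view_window(lines, offset, window=100):
    """Return the visible slice plus a header telling the model where it is."""
    total = len(lines)
    end = min(offset + window, total)
    header = f"[showing lines {offset + 1}-{end} of {total}]"
    return [header] + lines[offset:end]

def scroll_down(offset, window=100):
    """Advance the window by one page."""
    return offset + window
```

The header line matters: it tells the model how much of the file it has not seen, so it can decide to scroll rather than assume it has the whole picture.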

### [8:30](https://www.youtube.com/watch?v=TgB6JO6gup0&t=510s) Easily Configured

agent can additionally be easily configured and extended to improve future research on software engineering agents. Since the agent is open source, anyone can experiment and contribute new ways for agents to interact with computers. This is something I find quite fascinating, because now we have a system that is completely open source, which means development is likely going to accelerate even more. Remember, if we look back at point number two, the compared benchmarks showed the open-source agent wasn't far off Devin: in the announcement benchmarks it scored 12.29%, only roughly 1.5 percentage points lower than the closed-source Devin. That essentially means that, since this is easily configured and now completely open source, further development by other individuals, and maybe even companies, could take this to a whole new level, which will increase competition. I do wonder what kind of software engineering agents people are going to build with this, because it seems very effective, and so far this seems very promising. This is something I would argue has been built remarkably quickly compared to open-source LLMs: if you remember, after the release of GPT-4 and GPT-3.5, open-source chatbots took quite some time, considering the rigorous amount of pre-training and aligning that needed to be done to those models, but here it seems people can easily build on top of existing closed-source models and get these advanced software engineering agents. Point number seven is also rather fascinating: they actually link to a pretty cool demo in
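As a sketch of what "easily configured and extended" could look like in practice, here is a hypothetical command registry where a contributor adds a new tool without touching the agent loop. The names and structure are my own invention, not the actual repository's API.

```python
# Sketch of an extensible command registry: researchers register new
# agent commands declaratively, and the core loop only calls dispatch().

COMMANDS = {}

def command(name):
    """Decorator: register a tool under `name` so the agent can invoke it."""
    def register(fn):
        COMMANDS[name] = fn
        return fn
    return register

@command("search")
def search(files, term):
    """A contributed command: find files whose text contains `term`."""
    return [p for p, text in files.items() if term in text]

def dispatch(name, *args):
    """The agent loop's only entry point into the toolset."""
    if name not in COMMANDS:
        return f"unknown command: {name}"
    return COMMANDS[name](*args)
```

With this shape, extending the agent is just writing one decorated function, which is the kind of low-friction contribution path an open-source project benefits from.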

### [10:15](https://www.youtube.com/watch?v=TgB6JO6gup0&t=615s) Demo

which you can try it and see how the entire thing works, and I'm going to show you that right now before we get to the other points. So this is point number seven: here we have the advanced software engineering agent, and you can see how it works internally. This is the web page, where they have a lot of material, but we can see the demo right here. Essentially, if you just click Next Step, you'll see exactly how it works. We've got the issue right here, the one we are trying to solve; you can see all of the code that has been put in, and of course the bug is described right here. Then we can see the next step: it says to start addressing the issue we should do this, and you can see in the terminal what the system is doing; it's trying to reproduce the bug, it pastes the script in, and then you can see that it's done that. Now, I'm not a software developer, but if you are, this is really good, because you can see exactly what steps it takes: on the left-hand side is its workspace, and it has its terminal and its editor. This entire run took around 38 steps to complete. You can see the error has been successfully reproduced, which confirms the described issue, and before proceeding with the fix it navigates to the relevant file; you can see it opening the tools library. It's really effective at showing you exactly how things work, and you can also make the view full screen, including the terminal. I think this is really cool because it actually shows you how the AI system is working; with Devin we did get to see a few demos, but I really do think this website is very effective. Now, in addition,

### [12:03](https://www.youtube.com/watch?v=TgB6JO6gup0&t=723s) Paper Release

at point number eight they talked about a paper release. One of the things many people want is technical details, and on their Discord they said they are aiming to release the paper by April 10th, because that is when they think they'll be able to get it out. If you don't know, the paper is essentially where the technical details of exactly what's going on get released: how it works, all the benchmarks, what open-source or closed-source systems they used, how they tuned it, and some of their initial experiments on what was and wasn't effective. So next Wednesday should be the release of the full paper, where you can dive into more details. Point number nine was

### [12:45](https://www.youtube.com/watch?v=TgB6JO6gup0&t=765s) How expensive

rather interesting, because it's about how expensive this is to run. One thing you probably already know about AI systems is that agentic tasks, where the model has to do multiple reasoning steps, require it to output a lot more tokens than a simple zero-shot task. Now, what's crazy is that they said: we limit this at $4 per task, and on average we spend much less for each solved task, and we'll have a number in the paper next week on the average tokens in and out. So right here they make clear they don't want this to be an extremely expensive system, and that's completely understandable: for this to be viable as something people can use day to day, it shouldn't be very expensive. If you can get a software engineering issue solved for 50 cents or so, that's very effective, but if every task took $10 to solve, it would get very expensive very quickly, because there are a lot of tasks, and at scale it wouldn't be cost-effective since you'd rack up the bill fast. Of course, other models are coming out, getting cheaper and more effective, so I do think the cost per token will go down quite a lot over time, but for now they've set a limit of $4 per task, and on average they spend much less per solved task. In terms of how long it takes, 93 seconds on average is pretty incredible: I think Devin, if I remember correctly, took around 5 to 10 minutes to solve an issue, but I can't verify that, so don't take that as me hating on Devin; still, 93 seconds on average is very, very impressive. Now the last point, point number 10, is
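The $4-per-task cap could be enforced with a simple budget tracker like this sketch. The token price here is an illustrative placeholder, not a real API rate, and the class is my own invention rather than anything from the paper.

```python
# Sketch of a per-task cost cap like the $4 limit mentioned above.
# The per-token price is a made-up placeholder.

class CostTracker:
    def __init__(self, budget_usd=4.0, usd_per_1k_tokens=0.03):
        self.budget = budget_usd
        self.rate = usd_per_1k_tokens
        self.spent = 0.0

    def charge(self, tokens):
        """Record a model call; return False once the budget is exhausted."""
        self.spent += tokens / 1000 * self.rate
        return self.spent <= self.budget

    def remaining(self):
        return max(self.budget - self.spent, 0.0)
```

An agent loop would check `charge()` after every model call and stop (or return its best partial result) when it comes back `False`, which is how a hard per-task ceiling keeps average spend below the cap.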

### [14:36](https://www.youtube.com/watch?v=TgB6JO6gup0&t=876s) Open Source Models Powering

whether they will use open-source models, and they said that could be great, but right now they mainly use closed-source models because they are quite strong, and in the original software engineering benchmark (SWE-bench) paper they found that a lot of existing open-source models were fairly far behind. So basically they're saying they could use models like Llama 2 or Mistral, but the point is that closed-source models like GPT-4 and Claude Opus are quite a lot better than those open-source models, and for that reason they're going to continue using them, which makes sense. Now, there are benefits to open-source models: they can run locally, which is really good in terms of privacy, but once again closed-source models have billions of dollars of investment behind them and are just far more effective at this time. So it seems they won't be using open-source models for now, though maybe they'll let you do that, and I wonder how effective it would be, considering open-source models aren't as good as closed-source ones. So let me know what you think: do you think this is very effective, something that's really cool? Do you think it's going to take down Devin, because it is right on its heels? And I wonder whether open source could actually overtake closed source in the very near future. With that being said, it's been TheAIGRID, and I'll see you all in the next video.

---
*Source: https://ekstraktznaniy.ru/video/14416*