Go to: https://www.hostinger.com/wesroth
and use code: WESROTH
for an additional discount on HOSTINGER yearly plans.
______________________________________________
My Links 🔗
➡️ Twitter: https://x.com/WesRoth
➡️ AI Newsletter: https://natural20.beehiiv.com/subscribe
Want to work with me?
Brand, sponsorship & business inquiries: wesroth@smoothmedia.co
Check out my AI Podcast where me and Dylan interview AI experts:
https://www.youtube.com/playlist?list=PLb1th0f6y4XSKLYenSVDUXFjSHsZTTfhk
______________________________________________
00:00 how to build anything with Hermes Agent
09:42 Installing on a VPS (Hostinger Sponsor)
15:40 connecting to your VPS with SSH
17:05 how to install Hermes Agent
25:40 Using Hermes Agent
27:35 The Point of Hermes
28:52 Security
31:42 what I managed to build...
#ai #openai #llm
Table of contents (8 segments)
how to build anything with Hermes Agent
In this video, we're going to be talking about Hermes Agent: how to install it, how to use it, and we'll also take a look at this thing that I built with the assistance of Hermes Agent as well as the newly released GPT 5.5. I'll explain why I used Hermes. You don't technically need it, but it helped me out of a few tight spots.
This thing that you're looking at is pretty interesting. Here you can see four gravity wells, aka suns. These are four suns, and these blue dots are ships that our AI model has to fly around. It doesn't give direct commands. It has to actually write code for how to fly them. The gravitational pull is real. It has a limited amount of fuel, so it has to use the directional thrusters to keep the ships flying in the right direction. There's conservation of momentum. There are collisions. And if you get too close to a sun, you might even fall in.
The goal of the game is to stay within this little circle as much as possible. So, if I speed up the game a little bit, notice that the circle moves around. I don't want to say unpredictably. There is a very specific movement pattern to it. If you watch it for a while, you might even be able to figure it out. So, it's not random, but it might take a few iterations and runs before you understand what pattern it's following.
This entire simulation, Grav or Gravity Well, was built entirely by AI large language models. The website, all the functionality, everything was built with large language models, and all the scripts written to fly those ships were written by large language models. So, I did not really do any work other than sort of directing it and giving it guidance.
So, you may have noticed that every tick, or unit of time, that the ship stays in this circle gets a plus one point, and at the end it gets a score, 238 in this case. And that's actually a fairly impressive score. How do these large language models get the high score? It's not a one shot.
Actually, we give it 20 iterations. We give it a description of the game in English: how everything works, how gravity works, how thrusters work, how the entire simulation works, in basic English with some math and some functions. You're seeing the entirety of the LLM prompt right here. The model generates a script. The script gets put here. This is the actual bot code that we're going to use. And this bot code basically controls those three ships. Each ship has its own logic.
So here, for example, is Claude Opus 4.5. These last two iterations didn't run, so just kind of ignore those. Basically, the point here is, notice it starts with a low score, 46. But over time and through many different trials and iterations, it actually hits a high score of 276. This is what a smart model that's learning over time looks like. Here's Claude Sonnet 4.6, so a medium-size model. Notice it also learns. There's a learning curve, but it just tops out right here at around 78. It can't break past that.
So, the first attempt might look a little bit like this. They're just crashing into the suns. They're flying off away from the playing field. They come back at way too high velocities, and most of them eventually just crash and burn. This was Sonnet 4.5. It got a score of one. And that one point, I think, was pure accident. Here's Claude Sonnet 4.5. Here is its learning rate. So notice it's learning. It still tops out, but not bad. It still does pretty well. There is a bit of a learning curve.
We have our leaderboard here, and I'm beginning to flesh out the PvP mechanics, where all of them just kind of go up against one another. Here's kind of what that's looking like. There are four teams here. We have multiple iterations of Claude as well as some other models. Those are explosions from ships colliding. So, uh, exciting stuff. So, the AI agents built the entire thing for me.
And when I went to sleep last night, they started running all these different models in a sequence to test them out, to see how well they could perform on those 20 or more iterations. So, notice it starts around 2:17 a.m. and it just keeps going. We're testing GPT 5.4. We've also tested 5.5 Pro and 5.5. We tested Grok 4.20. We tested DeepSeek V4 Pro, which recently came out. We tested Gemini 3.1 Pro Preview and many, many more. The Anthropic models we test through their API. And so it finished all the tests by 5:32 a.m.
So this is what an AI agent can do for you. No, I don't mean pilot little virtual ships around virtual gravity wells. I mean what you see here. As somebody that covers AI, whenever a new model comes out, I want to know: is it good? There's always a bunch of benchmarks, and some of them I know exactly what's in them, some I don't. And I know some of the companies train directly on the benchmarks so that they can basically get the highest possible score and look like they're doing really, really well. I always wanted to have a battery of my own tests and benchmarks I can run to see how smart they truly are.
This benchmark asks a simple question. Can this model receive instructions in English, then write pretty specific code based on those instructions, and then keep iterating on that code until it's as good as it can make it? I have the model iterate over 20 tries until it gets the best possible score. Then I take the program that's created and I run it through a hundred different seeds. Each seed is basically a slightly different position of the gravity wells, aka suns, and it also slightly changes the movement of the actual circle that they have to follow.
Now, I have about six more tests like this queued up that I'm developing right now. This one took about 40 hours to complete. Again, mainly me entering prompts into the agent, not touching code at all, pretty much.
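Here is a rough Python sketch of the two scoring ideas described so far: plus one point for every tick a ship spends inside the circle, and then running the same bot across a hundred seeds. Treat it as an illustration under assumptions, not the real benchmark. It tracks a single ship instead of three, the circle's motion in toy_run is invented (a plain orbit with a seed-dependent starting angle), and summarizing with mean, min, and max is a guess, since the video doesn't say how the hundred scores are combined.

```python
import math
import random
from statistics import mean


def score_run(ship_path, circle_path, radius):
    """Count the ticks in which the ship is inside the target circle."""
    return sum(
        1
        for (sx, sy), (cx, cy) in zip(ship_path, circle_path)
        if math.hypot(sx - cx, sy - cy) <= radius
    )


def evaluate_over_seeds(run_once, seeds=range(100)):
    """Run the same bot once per seed and summarize the scores."""
    scores = [run_once(seed) for seed in seeds]
    return {
        "runs": len(scores),
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
    }


def toy_run(seed, ticks=200):
    # Invented stand-in for one simulation: the seed only changes where the
    # circle starts on its orbit; the "ship" just parks at one spot.
    rng = random.Random(seed)
    phase = rng.uniform(0.0, 2.0 * math.pi)
    circle_path = [
        (3.0 * math.cos(phase + 0.1 * t), 3.0 * math.sin(phase + 0.1 * t))
        for t in range(ticks)
    ]
    ship_path = [(3.0, 0.0)] * ticks
    return score_run(ship_path, circle_path, radius=1.0)


summary = evaluate_over_seeds(toy_run)
print(summary)
```

Because each run is driven by its seed, rerunning the same bot on the same seeds gives the same numbers, which is what makes a comparison between models fair.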
I think there were three lines that I had to put in manually, and those were just because it was sensitive information. It was the API keys that were needed to run this. Other than that, I haven't touched anything. So, me and my AI agent build it during the day. During the night, I tell it to run all the simulations and test everything. So, basically the grunt work, the grind, I tell it, here, just do this until 5 in the morning, and when I wake up, tell me the results. Notice some of these run times are 10 p.m. Some of them are like 3:00 a.m. So, the day is for collaborative AI work. The night is for automated AI agents just grinding away and doing all this stuff to get ready for the next day.
Stuff like this is extremely valuable to me, but also I'm planning to put this out there, to host it on my website. And I'm thinking about open sourcing the code. I just don't want anybody training against it, but I also want to make it available if people want to use it to build something similar or to build on top of it. And hopefully this will be useful to someone else as well, either as just information or as code they can reuse.
If I'm doing a good job conveying to you how amazing this is, how valuable this is, how game-changing this is, please do me a favor: just leave a comment below and say something like "this is amazing, I'm hyped," or that it motivates you to build something of your own. Whatever you feel. And if you think that these AI agents are useless, if I haven't convinced you, also let me know, because I'm curious. I know some people out there still think that AI can't create stuff, that it isn't truly helpful. As for myself, watching my AI agents build this over the last 2 days, I'm excited for what is to come.
Now, most of the work here was done with the newly released GPT 5.5, which, notice, got a very low score, compared to Claude at least. And I do want to rerun it, because we weren't going directly through the OpenAI API key.
We were using OpenRouter. So once we have an apples-to-apples comparison, I have a feeling this is going to get a lot higher.
Now, this video will be a quick tutorial about how to use Hermes Agent. I'm going to walk you through how to install it in the exact way that I personally installed it and used it. Now, Hermes Agent is open source. It's free. But you can also use OpenAI's Codex, or Anthropic's Claude Code, or OpenClaw, which is open source. It was one of the more original agents, or claws, as they started to be known. So most of my work on the benchmark was done using Codex, but I found that using Hermes Agent for a few specific tasks really helped me out.
So here's how Hermes Agent looks when you install it and it's ready to go. Notice that it comes with certain skills that are already available upon installation. There are a lot more skills that you can install from the community, but notice it comes with Claude Code and Codex. So if I say, "Are Claude Code and Codex installed locally?", it actually uses those skills to try to figure it out. So it says yes, it found them. We can use them. I already knew that, since I installed both of them on this virtual private server, on this machine.
The reason this is kind of cool is that you can have Hermes Agent call both those environments, basically run an instance of Claude Code or Codex, and have them working on stuff independently in their own instances. They're not controlled by Hermes Agent; rather, Hermes Agent is the user giving them prompts. To demonstrate, I'm going to say, "Open up Codex and Claude Code, guide them through a game of tic-tac-toe, tell me who wins." Why tic-tac-toe? Uh, why not? Let's see if this works. So apparently they played a tic-tac-toe game and managed to draw.
Okay, next let's dive into what Hermes is and how to get yourself familiar with it.
If you're not interested in the actual tutorial but you do want to learn more about the benchmark that I'm building, in the last chapter of the video I'll show you how I'm going to use Hermes Agent to actually run a lot of these simulations, using the more advanced models to pit them against each other and have them do a little bit of PvP, player versus player combat. So, with that, let's dive in. — Good artists copy. — By the way, just so you don't get confused, this is Hermes Agent. This is what they built. When I was building my initial benchmark, I had a few different designs or layouts that I wanted to try. For one of them, I said, "Hey, just copy the Hermes Agent look." And the bot created this, which I got to say is very, very solid. So, again, I took a screenshot of this, which you're seeing here, and I said, "Make it look like this." And behold, here it is. I feel like it nailed it. I love the look. All
Installing on a VPS (Hostinger Sponsor)
right, so we're going to install Hermes Agent from scratch. We can do this on a local machine, like a desktop or an old laptop that we have lying around. I've used a mini PC to install some of these agents, and that works really well. Some people love their Mac minis, which is another way to go. In my previous video about Hermes Agent, I showed you how to install it on a VPS, a virtual private server, inside Docker with a kind of one-click install that was provided. So, it's super simple, super fast. If you prefer doing that, I'll leave a link down below with the specific timestamp for that portion of the video. But in this video, let's go ahead and do it manually, as Nous intended. I don't know if they actually intended it that way, but let's do it the full manual route. Believe me, it's not very difficult. In fact, really, here are the lines that you need to know to be able to do it.
Now, installing agents on a virtual private server, a VPS, comes with a lot of advantages. It means that your AI agent is always available. You don't have to deal with hardware maintenance. It's someone else's job to make sure that it's online and connected to the internet. So, here I'll show you how to install it on Hostinger. Hostinger is a company that I've been working with for a while. They are a sponsor of this segment of the video.
Now, I know some people get a little bit suspicious when we're talking about sponsored products. Here's the thing. Number one, Hostinger is the product that I personally use. As I showed you in the previous video, I have an account with them, actually multiple accounts for different VPS and other services. I use the KVM 2 service and recommend it to others. When I host these AI agents online, at this point I use Hostinger exclusively. Also, this is not an affiliate deal where I get paid per signup. I'm not incentivized to get you to buy anything.
Basically, Hostinger sponsors a section of the video so I can walk you through the actual installation. I chose them as a sponsor because I know, like, and trust them. You know how some of these companies are like, "Oh, what can I help you with?" They've got that little pop-up saying, "Oh, tell me what your problem is," up until the point where you want to cancel your recurring billing, at which point they're like, "Oh, we no longer speak English or any other language. We just don't understand what you're saying." And trying to figure out how to not get rebilled is like this Easter egg treasure hunt where you have to find where they buried that button. Once you click the button, there's like 50 pages of, "Are you sure? Are you really sure? But are you really, really sure?" You know what I'm talking about.
You know how it should be done? Here, this is my personal account. Look at this. This will blow your mind. I click on billing. Look at this. This is literally the next page. Auto renewal. Do you want it or no? Let's say I don't want this one anymore. Are you sure? Yeah. Done. This is why I like them and trust them. They don't weaponize my ADHD against me. So, the link is going to be in the description and pinned comment.
Once you get to the page, it will probably look something like this. You're going to have a couple of choices of which plan to pick. As you've seen in my account, I use KVM 2. It's got two vCPU cores. With AI agents, this prevents them from freezing and running into resource issues. It's got eight gigs of RAM, which is more than enough for one agent, with room to grow. And finally, it's got 100 gigs of NVMe disk space. As these AI agents grow and get more knowledgeable about what you're trying to do, more and more information gets stored on disk. This ensures that all those disk scans and retrievals are near instant. At $8.99, it's a no-brainer, and I will even be able to get you a discount on that.
By the way, if you're not seeing the same options as I am on the first page, just make sure you have VPS hosting selected. That's the thing that you want: KVM 2 for agents. This is what I used to use in the past for n8n. But whatever the case, get to this place. So, here we select KVM 2. You're going to get to a page that looks kind of like this. They have different options that you can choose if you want them. At the bottom here, choose a server location with the best latency. United States, that's it for me.
And then we select our operating system. We want a plain operating system, plain OS. This will give you just a basic, no-frills, vanilla installation of whatever system you want. I would go with Ubuntu. I like Ubuntu. I've been using it for a while, and I'm noticing that a lot of these companies, AI startups, even ones that were recently acquired by Facebook, tend to run things on Ubuntu. I'm very pleased. If you're wondering what Ubuntu means, it kind of means namaste.
Anyways, we click on that and it'll ask you which version you want to use. Basically, this is going to be the latest version. And then if you scroll down, the first LTS, that's long-term support. So those are more like the stable, non-experimental versions. Now, it's not going to be a big deal which one you pick in general. If you get confused, you probably know what I'm going to say: ask your favorite chatbot. So here I'm using Opus 4.7, and Claude Opus is recommending to stick with the LTS, the long-term support. That's the more stable version. That has been my recommendation in the past as well. But Claude, I love Claude, went even a step further, saying, "By the way, Hermes Agent, that's what they test against." So when they put out a new release, they test it against the LTS version, which means that you're probably going to be a lot safer just picking that one. All right, so we're going to pick 24.04 LTS for you.
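The LTS advice boils down to a simple rule: out of whatever versions are offered, take the highest-numbered one that is marked LTS. As a small Python sketch of that rule (the function name pick_latest_lts and the example labels are illustrative, not copied from the Hostinger screen):

```python
import re


def pick_latest_lts(labels):
    """Return the highest-numbered label that is marked LTS, or None."""
    candidates = []
    for label in labels:
        match = re.search(r"(\d+)\.(\d+)", label)
        # Only keep entries that carry a version number and the word "LTS".
        if match and "LTS" in label.upper().split():
            version = (int(match.group(1)), int(match.group(2)))
            candidates.append((version, label))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# Illustrative menu: one newer non-LTS release and two LTS releases.
options = ["Ubuntu 25.10", "Ubuntu 24.04 LTS", "Ubuntu 22.04 LTS"]
choice = pick_latest_lts(options)
print(choice)
```

With that example list, the newer non-LTS release is skipped and "Ubuntu 24.04 LTS" is picked.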
Whatever is the most recent LTS one, just pick that. So just pick the highest number with LTS after it. Click confirm. Next, we scroll back up. We do have a coupon code. Using the coupon code WESROTH gets you an additional discount on the yearly plans. We're going to hit apply and then click continue. Next, register by creating an account. Log in with Google, or if you already have an account, log in. Enter your most favorite payment method and click submit payment. Congrats, your journey begins now.
All right. Next, you have to secure your VPS access. So basically, you set a root password. This is kind of like the admin password, the main password that you use to log into your server. You can create an SSH key. We'll talk about that later. You can always add it later in the VPS dashboard. For the time being, just create a root password. I'm going to hit generate. Make sure you keep that password. You're going to need it later to log into your server, into your machine. And then hit next. They have some features here like the malware scanner and Docker Manager. These are free. The Docker Manager is great. We used it for the previous install. This might be worthwhile. For the time being, I'm not going to select anything. Just hit finish setup, and it's going to set up your VPS.
All right. So, in about a minute or two, everything is set up. You have your VPS dashboard, and you also have your SSH access. So we're going to
connecting to your VPS with SSH
copy this thing. So I'm going to go ahead and open up Windows PowerShell. Different operating systems will have different versions of this. This is basically a command line interface, CLI, or it might be called a shell or a console. Technically, terminal is probably the most accurate term. So open up whatever terminal you have on your operating system. And what we're going to do is enter this command: ssh. That's secure shell. That's a way to connect to another machine on the internet securely. Then space, root. Root is kind of like the admin. So we're not just a user. We're like the user. We're the VIP. Roll out the red carpet. So root, at, and then those numbers. Those numbers are your virtual private server's IP address. So here, 72.62.100.87. It might ask you if you're sure. We're going to say yes. And it's going to ask you for the password. That's the password that you set up just a minute or two ago. I type it in and hit go.
All right. And now we're logged in to our virtual private server on Hostinger. So notice it says root at. So root is us, the super user, the admin, at svr and the number. That's the server that we're on. So if you see this, that means you're on that machine, operating that machine. Whatever commands you enter here will get executed on that machine that we just created through Hostinger. And as always, I encourage people to also use their favorite chatbot as they're doing this and just ask questions, ask for explanations, clarification. So right now, I'm going to ask. You see what it says here, "Back at it, Wes?" You know I am. So we're
how to install Hermes Agent
going to ask it how to install Hermes Agent on a fresh Ubuntu install. Here's the thing. Even if you know how to install it already and you've done it before, this is still very helpful, because your AI agent can give you some extra insights and knowledge. Also, over time, as it learns more and more about you, it might connect dots that you didn't think about before, and it's getting better and better at doing that. So, here it's saying that all you do is open up a terminal, that's this right here, and run this command that installs everything.
By the way, really fast. You know that there are a lot of people out there walking you step by step through how to install everything. They seem very, very smart, like they know what they're doing. I'm showing you literally how to do it so you don't need anybody to show you how to do it. Somebody sitting there walking you through step by step how to do something technical, that's the past. This is the future. This is the thing that will allow you to learn at light speed.
Notice a couple of things here. It mentions: don't use sudo. If you don't know what that is, great. You don't have to worry about it. If you do know what that is and you're planning to use it, this is that extra little insight that we were talking about. Now, remember previously when I asked it which version of Ubuntu we should use? Well, it remembered that, and now it's referring to it, saying, hey, this is the right thing on the right operating system. Good job.
All right, so we're going to copy this command, paste it into our terminal, and hit enter. By the way, depending on where you're copying and pasting from, sometimes it might create issues, so just make sure that there are no random characters that get attached before or after. Make sure everything looks more or less correct. It does. We'll hit enter. And this is the Hermes Agent installer. It gets to work installing everything that you need.
Meanwhile, let's take a look at what it says next. It says, "Reload your shell afterwards with this command." And once the install finishes, run this: hermes model. It also recommends using OpenRouter. We'll take a look at a couple of other options as well. And notice it even gives us a VPS-specific consideration. Basically, it's saying on a VPS, switch the terminal backend to Docker after setup. What this does is sandbox the agent. So, it kind of prevents it from doing various shenanigans on the actual system that you're running.
Now, it's asking us how we would like to set up Hermes, a quick or full setup. But the important point is, at this point you're more or less done installing and getting everything set up to run on your virtual private server. Now it's just a matter of running the full setup, choosing your options, adding various API keys, etc. So here let's do the full setup.
This is where we select our provider. In the past we've talked about OpenRouter. OpenRouter is basically all of the models, all the different AI providers, in one place. You get one API key, you put it in there, and you're able to pull from most of the AI models that you've heard of. They have 300-plus models, and, you know, that's a lot of models. Now, since then, Nous Research, who are the people behind Hermes Agent, have created their own service that's similar. It's called Nous Portal. Of course, it's up to you which one of these you want to use. You don't even need to use one of the aggregators that have everything underneath them. You can just go with, for example, Anthropic or OpenAI Codex or whatever you want. Keep in mind that all those models will be in one of these top two. You'll be able to use those models as part of one of these two. The really cool thing about Nous Portal, and again, please understand that these are somewhat new things, Hermes Agent, Nous Portal, etc. But Nous Research, I've known them.
We've interviewed a few people from Nous Research. I like the company. I like their mission. They're an open-source company. They seem to be really behind open AI for everybody, neutral AI, as in don't let the big AI labs dictate what's moral. The power to control your AI agents should come from you, the user, not from the AI providers. So, I like the company, but obviously do your own research and see what works better for you.
The big advantage of using Nous Portal, which again is a pretty new subscription, is that they wrapped up a lot of the stuff that you're going to be using for Hermes Agent within that one link. You get access to web search, image generation, text to speech, browser automation, etc. Now, normally you would have to sign up for, for example, Firecrawl, Browser Use, etc., to get those API keys and set them up during the setup process. Going through Nous Portal just bypasses all that. It's all included. In the previous video, I showed you how to set up OpenRouter. In this video, let's go through how to set up Nous Portal.
And now we're going to select Nous Portal. So, it's asking us to open this website and then, if prompted, enter this code. That takes us to this page: choose a plan to connect Hermes Agent. Basically, you have different choices, including the free plan, which is $0 a month. Now, the important thing to understand here is that however much you're paying, like if you're paying 10 bucks a month or 20 or 100, you just get that money as credits towards what you're going to use. So, in that sense, it's not really a fee that you're paying Nous Research. It's just a credit that you get towards usage. All right, so I'll start out with 20 and see where that takes us. Then scroll down, subscribe, and connect. And once you're done, this should automatically update. Here, we're going to be selecting which models to use.
So apparently 35 minutes ago, Teknium, who is one of the people from Nous Research, is saying that we can try Kimi K2.6 in Hermes Agent for free for 24 hours. They do have a partnership with Kimi, which is an open-source model. And by all accounts, it works very, very well for Hermes Agent. So you may or may not see something like this. I'm going to, of course, try Kimi K2.6 out since it's free currently. And here we're going to click enable tool gateway, which is various tools like web search, image generation, etc. And notice here, this automatically gets prefilled because it's enabled via the Nous subscription. This is one of the reasons why you want to be using it. If not, you would have to provide these API keys, probably each one of them separately. Do we want to add another credential for same-provider fallback? Not at this time.
All right. Next, it's going to ask me which terminal backend to use. Now, if you recall, Claude recommended that we use Docker instead of local. Local is the one that runs directly on that machine. Docker kind of puts it in a sandbox. I'll tell you a little secret that I've learned the hard way. If you choose Docker right now, the whole thing crashes, since it's a fresh Ubuntu install. Nothing else is installed on there. No Docker, no other software. So, for now, just use local. But later, if you want to put it in the sandbox, use Docker. You have to install everything that you need for that first and then go ahead and do that. But I'm going to say local.
So, next it's asking for the messaging working directory. This is specifically for when you're messaging it from something like Telegram. Basically, do you want its work area to be where you installed Hermes, or do you want to create a separate workspace? You can just hit enter to accept the default. So, next it's going to ask you if you want to enable sudo support. Sudo is super user do.
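Going back to the terminal backend choice for a second: the "Docker crashes on a fresh install" gotcha is really just "check that Docker exists before you pick it." Here is a minimal Python sketch of that check. It is not part of Hermes; choose_terminal_backend is a made-up helper, and the strings "docker" and "local" simply mirror the two options named in the setup.

```python
import shutil


def choose_terminal_backend(prefer_docker=True, which=shutil.which):
    """Return "docker" only if a docker executable is on PATH, else "local"."""
    # shutil.which returns the executable's path, or None if it isn't installed.
    if prefer_docker and which("docker") is not None:
        return "docker"
    return "local"


backend = choose_terminal_backend()
print(backend)
```

On a fresh Ubuntu box with nothing installed, this prints local. Finding the docker executable only shows that Docker is installed, not that its daemon is running, so treat a "docker" result as a first check rather than a guarantee.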
Sudo kind of escalates whatever command you're using to the highest privilege. So, it means like, I do so, I command you to. I am the super user. You do this. Here's the xkcd comic that illustrates this. This person says, "Make me a sandwich." The other person says, "What? Make it yourself." The person sitting here goes, "Sudo make me a sandwich." The person goes, "Okay." This is exactly how that command works. Since we're running this as root, we don't really need it. It just saves an extra password. If you're running it as a different user, giving this to the agent would allow it to do certain administrative tasks, like installing various packages, etc. So, for the time being, I'm going to say no.
Next, they're asking for maximum iterations. So, this is maximum tool-calling iterations per conversation. It defaults to 60. 90 is what you want to be using for most tasks. 150 is for open exploration or deep research or things where it really needs to do a lot of tool calls. So 60 is a good place to start. I'll bump it up to 90 and then hit enter.
How much should it be talking? Off; new, just showing you the new tool calls; all, so showing every tool call; or verbose, where it just spells out everything. We'll keep it on the default, which is all.
Compression threshold. So this is, when the context window gets to a certain size, when do we compress it and then start a new context window. With a higher threshold, like 0.95, you get more of the context window to use. It fills up higher, so you don't reset it as early, but there's potential to run into issues, you know, running out of room in the context window. 0.5 is the default. Let's start with 0.5.
Also, your sessions, when do they get reset? The recommended one is inactivity plus daily reset. So it's like, if you leave it alone for a while or at a certain time each day, whichever comes first. We'll do that one. Inactivity timeout, 1440. Again, you can change this once you know what's better for you. I'll leave it as is. Daily reset hour. It recommends four. So, I'm guessing it's just, when are you least likely to be awake? So, probably three or four are the most likely candidates for most people, I feel like.
And how do you want to talk to it? I'm going to be using Telegram. Do we want to configure some tools for the command line interface? Here are the tools that it has already enabled. So, we're going to hit enter to confirm and hit done. And we're at the end. Now I can hit yes to start chatting with Hermes Agent. So here it is.
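Two of those settings are easier to picture as code: the compression threshold and the "inactivity plus daily reset" rule. The Python sketch below only illustrates the logic as described in the setup prompts. It is not how Hermes implements it, the function names are made up, and it assumes the 1440 inactivity timeout is in minutes (24 hours), which the prompt doesn't spell out.

```python
from datetime import datetime, timedelta


def should_compress(used_tokens, context_window, threshold=0.5):
    """True once the conversation fills `threshold` of the context window."""
    return used_tokens >= threshold * context_window


def should_reset(last_activity, last_reset, now,
                 inactivity_minutes=1440, daily_reset_hour=4):
    """Reset on long inactivity or at the daily hour, whichever comes first."""
    # Assumption: the inactivity timeout is measured in minutes.
    idle_too_long = now - last_activity >= timedelta(minutes=inactivity_minutes)
    # Most recent daily reset boundary at or before `now`.
    boundary = now.replace(hour=daily_reset_hour, minute=0, second=0, microsecond=0)
    if boundary > now:
        boundary -= timedelta(days=1)
    crossed_daily_boundary = last_reset < boundary
    return idle_too_long or crossed_daily_boundary


# Default threshold: 60% full triggers compression; at 0.95 it does not yet.
print(should_compress(60_000, 100_000))
print(should_compress(60_000, 100_000, threshold=0.95))

# Last reset was yesterday at noon and it's now 5 a.m., past the 4 a.m. boundary.
print(should_reset(
    last_activity=datetime(2025, 1, 2, 4, 30),
    last_reset=datetime(2025, 1, 1, 12, 0),
    now=datetime(2025, 1, 2, 5, 0),
))
```

The trade-off from the setup prompt shows up directly in should_compress: a higher threshold lets the window fill further before anything is compressed, which leaves less headroom if a single step produces a lot of output.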
Using Hermes Agent
We're using Kimi K2.6 with 31 tools. And let's test it by asking it for an image of a flower. It's using its image create tool to create that image for me. And it's right there. To open up links, use control-click. And there it is. Now, I've walked you through the Telegram setup in the previous video, so feel free to check that out or ask Hermes or your favorite chatbot. In this video, let's do some of the more interesting and advanced things that Hermes can do. — Let's add some GPTs. — So, really fast, here's Hermes Agent. This is how you enable GPT 5.5 on it, and how you can also enable GPT Images 2.0 on it. So, first and foremost, run hermes update. This will update to the latest version. And you can now select a provider and model like this. Now, currently, I'm using Nous Portal. That's the Nous Research subscription that has all the models underneath it. But in order to use 5.5, as of this second at least, you actually have to log in through OpenAI Codex via OAuth. It'll ask you to open this URL in your browser. Once you log in, it'll ask you for the code. And once you're signed in, you can select the model, GPT 5.5 for us. Then we run hermes. All right, there we go. GPT 5.5 running in Hermes. Now I'm going to hit Ctrl+C to exit.
And we're going to test out one more thing: hermes tools. This is where we're able to configure which model creates the images for us. So we're going to configure our CLI, and for our image generation we're going to hit reconfigure an existing tool's provider, and we're going to do image generation. And so we have the Nous subscription, which is what I'm running. We have fal.ai, which has all the FLUX models, and NanoGPT. But we're going to be using OpenAI Codex auth. So this is where we're able to use the new GPT Image 2 inside of, well, Hermes and anything else. No API key required. And we have a choice between medium, low, or high. Let's start with medium, the balanced approach, and we're all done.
Run hermes, and away we go.
The Point of Hermes
Okay, so really fast, what is Hermes Agent? Hermes Agent was built by our friends over at Nous Research. It came shortly after OpenClaw made its debut, because I think a lot of people realized what a big demand there was for these effective, intelligent, and truly useful AI agents. Or at least, with things like these, they were sort of scaffolding or wrappers around these models that allowed them to do some pretty impressive things. The big exciting claim with Hermes Agent is that it grows the longer it runs. Persistent memory and auto-generated skills: it learns your projects and never forgets how to solve a problem. So for the different skills that it creates, if you keep using them over time, it tries to make them better and more efficient. It approaches them kind of like a science experiment: can I iterate on this over time? That's also kind of the point of my benchmark, by the way. And just so you don't get confused, this is Hermes Agent. This is what they built. When I was building my initial benchmark, I had a few different designs or layouts that I wanted to try. For one of them, I said, "Hey, just copy the Hermes Agent look." And the bot created this, which I've got to say is very, very solid. So, again, I took a screenshot of this, which you're seeing here, and I said, "Make it look like this." And boom, here it is. I feel like it nailed it. I love the look.
Security
— Quick note on staying safe. — By the way, when running things like Hermes and OpenClaw and Codex and Claude Code, there are ways to launch them with different levels of approval needed. For projects like this, where I'm starting at zero and I need it to, you know, run for 48 hours and be completed, I personally don't want it to stop every 5 minutes and say, "Oh, is it okay if I make a folder for this?" or "Is it okay if I create this text file?" It's like, just go for it. A lot of these agents, these programs, will have something similar to this, where you launch it and say, "Hey, dangerously bypass approvals and sandbox." So it skips all confirmations. It executes commands without sandboxing. It's extremely dangerous, as OpenAI says here. I just want to throw that out there: I prefer to have it run on, you know, full auto, safety off, etc., but I make sure I run it in such a way that if things do indeed blow up, the blast radius is contained. That's my approach to things. Some people agree with it. Some people find it absolutely insane. So make sure you're doing what's right for you. At this point, my AI agents are spread across a number of different virtual private servers. Some of them are running on a mini PC that I have here on my desk, but that would be a little bit difficult to show. I showed it while installing in one of the videos. I also have this old laptop that I found somewhere. It's a Lenovo laptop. I dusted it off. I probably hadn't used it in years, but I installed Linux on it, plugged it in, and put a bunch of agents on it. So those are the three different places that I run my AI agents. I would not install one on my main computer and run it like this, bypassing the confirmation prompts, etc. So if you're getting into this, just think about how many layers of safety you can have between you and the agent if stuff goes wrong.
Hermes has skills for, for example, 1Password, which is a great way to manage all your passwords, making sure that if something gets exposed you can isolate it, quarantine it, cut it off, cancel it. There are also ways of running these in a Docker container, which helps isolate some of the damage. With a lot of these agents, I SSH into them, meaning that I am sitting here in front of my own computer at home, I open up a command line interface, Terminal or PowerShell on Windows, right? And then I use this command to connect to that remote machine, and then I can give commands like this. This Hermes is running somewhere in Boston. Yeah, I don't want to show it, just to be extra safe, but that machine is located somewhere on the other side of the country. I secure shell into it, and then I'm able to control it. Everything happens over there. So if it causes some sort of a nuclear meltdown over there, uh, you know, I've never been to Boston. I don't know. Is it nice? Would we miss it? I have no way of knowing. But as we all start using things like this more, and especially with things like Mythos out there (and in fact, GPT-5.5 is also potentially a very good kind of hacker, or at least it can find security exploits), this is just a quick reminder to be a little bit more mindful of all the data that we're all oozing out there all over the interwebs.
what I managed to build...
— Okay, let's build. I just wanted to show you how I used Hermes to add to the game and to test the game. You already saw how this benchmark looks and how GravWell functions. Here I'm trying to implement a new feature. I called it a duel, and I asked Hermes, which is being run by GPT-5.5 by the way, to add that feature. Basically, it's a 1v1, and it's going to allow us to pit two models against each other directly. In the previous PvP arena, what I did is I took each model's best iteration, their best code output, for four models, threw all four into the arena together, and just saw who would win. Notice Claude Opus 4.7 had an 88.3% win rate. They each also got assigned an Elo score to see just how good they were. But this is a bit different. This is going to have two models iterating on that code live, testing it against each other. So they each get their diagnostic report back at the same time, and they propose changes to their code, not necessarily at the same time, but it gets submitted and they go through each iteration step together. So it verified everything is working. I verified that everything's showing correctly on my screen, and now I'm telling it, okay, let's open up a fresh instance of Claude Code and Codex. Basically, Hermes Agent can call both of those almost like tools. It opens up actual instances of them, just like you and I would, and it's going to feed them the information that it needs to. It's going to work with them. So this is us, and we tell Hermes what to do. That's our friendly bot Hermes. And Hermes talks to its two sub-agents: one is Codex and one is Claude. Now of course, don't forget that Hermes gets some context from the game. So that would be GravWell. I did not come up with that name. If you don't like it, blame Claude. I should delete that arrow. It's going to get too confusing. But blame Claude. If you like the name GravWell, then I'll take all the credit.
But any criticism, direct it towards Claude. So Hermes gets all the data from GravWell, and it tells both Codex and Claude, you know, step one: just give me your initial output. In that case it's telling them about the game and saying, okay, create your first script. They do so and return the script to Hermes. Hermes takes the script and runs it through GravWell. Again, with GravWell you can just paste in the code. It runs it and gives you back a diagnostic. Right. So after you give it the scripts, it gives Hermes a diagnostic report, one for Codex and one for Claude in this case. Then Hermes gives Codex and Claude that report, saying, here's how your first idea did. You know, here are all the results. Go ahead and see if you can do better based on this. Like, what did you learn? How can you improve? So step two, that's our iteration. I can't spell. I should do another iteration on writing that word, but let's just forget it. And this continues for however many iterations we set. I'm still trying to figure out where that learning drops off. There's probably some sort of drop-off point after which increasing the number of games doesn't really amount to much. I think right now that's somewhere around 20 games, but we'll figure it out as we get more and more data. By the way, the next idea I want to test is how Sakana AI did it with their Darwin Gödel Machine, where it keeps generating various attempts and the ones that do better continue, almost like an evolutionary lineage. So hopefully that's all coming. For the time being, we're just keeping it kind of simple, or, you know, simple-ish. But the point is, I just give one command to Hermes and then I go to sleep or whatever. So sleeping is step two. So I'm telling it to open up a fresh instance of Claude Code and Codex, go step by step with them, and play the game. And by the way, you might recall that at the beginning of this instance I had it do exactly that and have them play a game of tic-tac-toe.
So Codex went, Claude went, Codex went, Claude went, or whatever order they went in. So we already saw that it's capable of it. And it also has that small example of what we're looking for in its context window. I'm also letting it know that if they mess up and the code doesn't work right, maybe they have some weird symbol in there that breaks the whole thing, just tell them what the error message was and let them try again. They're pretty smart. By the way, if you're wondering, we're running Opus 4.7 on Claude Code and GPT-5.5 with high thinking on Codex. I'll actually ask it to double check, but that's what should be running on there. I'm going to say, let's stop at 10 iterations and give a report, but be ready to continue to 20 after reporting. I just want a halfway checkpoint: let us know what's going on, give us a little report. So I'm going to hit Enter and away we go. Now, about doing it this way versus using the API key: this is not going to be good for, you know, a fair benchmark. But for troubleshooting and testing things out, this is excellent, because we're not spending any credits on API calls. We're using Claude Code and Codex, so that use is covered under their respective plans and we shouldn't have any issues. I should have told it to send us the replays so we can see them in real time, but I'm sure it will report the scores as we go along. All right, let's let it cook, and we'll come back and see what's happening in a bit. All right, so that took quite some time, somewhere just under an hour or so. And notice it created a specific skill to do this job. It called it GravWell GPT agent loop. If we keep running the skill over time, it should get better and more efficient. In this one, GPT-5.5 high won seven rounds and Claude Code Opus 4.7 won three rounds.
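The duel loop described above (ask each agent for a script, run it through the simulator, hand back the diagnostic, retry on errors, repeat for a set number of iterations) can be sketched roughly like this in Python. This is an illustrative outline, not the skill Hermes generated. Diagnostic, run_duel, round_wins, and the prompt wording are all hypothetical, and the agents and simulator are passed in as plain callables standing in for Codex, Claude Code, and the game.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Diagnostic:
    """Hypothetical shape of the report the simulator hands back."""
    score: int
    error: str = ""  # non-empty when the submitted script failed to run


Agent = Callable[[str], str]             # prompt in, pilot script out
Simulator = Callable[[str], Diagnostic]  # pilot script in, diagnostic out


def run_duel(
    agents: Dict[str, Agent],
    simulate: Simulator,
    game_rules: str,
    iterations: int,
    max_retries: int = 2,
) -> Dict[str, List[int]]:
    """Run the iterate-and-improve loop and return each agent's score history."""
    history: Dict[str, List[int]] = {name: [] for name in agents}
    prompts = {name: game_rules for name in agents}
    for _ in range(iterations):
        for name, agent in agents.items():
            script = agent(prompts[name])
            report = simulate(script)
            retries = 0
            # If the script breaks, pass the error message back and let the agent retry.
            while report.error and retries < max_retries:
                script = agent(f"Your script failed with: {report.error}. Please fix it.")
                report = simulate(script)
                retries += 1
            history[name].append(report.score)
            # Feed the result back so the next attempt can build on it.
            prompts[name] = (
                f"{game_rules}\nYour last script scored {report.score}. "
                "What did you learn? How can you improve?"
            )
    return history


def round_wins(history: Dict[str, List[int]]) -> Dict[str, int]:
    """Count rounds won outright by each agent; tied rounds count for nobody."""
    names = list(history)
    wins = {name: 0 for name in names}
    for scores in zip(*(history[name] for name in names)):
        best = max(scores)
        leaders = [n for n, s in zip(names, scores) if s == best]
        if len(leaders) == 1:
            wins[leaders[0]] += 1
    return wins
```

Calling run_duel with iterations=10 and then again to continue would mirror the "stop at 10, report, be ready to go to 20" checkpoint, though carrying state across the two calls is left out of this sketch.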
I also asked it to add a replay feature to the list of all the battles that were, you know, simulated, so that I can see how they unfolded in real time. So here's that. It has all of the battles that were recorded. Let's say we take this last one, load the replay, and play it. You can see it playing out in its entirety from beginning to end. Blue is Claude Code Opus 4.7. Red over here is Codex GPT-5.5 high. It's very interesting. Let's see if we see any collisions. This is 10 iterations in, so hopefully they've had some time to work on their scripts. Looking great so far. Nobody's crashing into the suns. Everybody's using just a little bit of fuel here and there. That's one of the tricks here: you're not trying to fly all over the map. You're just tapping the gas ever so gently to get where you're going. And also, you're not chasing the circle, you're trying to get in front of it. Absolutely terrific game. So Codex GPT-5.5 high got 68 and Claude Code Opus 4.7 got 43. I'm loving this so far. I think this is going to be a great benchmark. This is just one of several different things that we're going to be testing. This one was just the most visually interesting, I thought. One really interesting thing I found when I was messing around with it earlier: if you have a model that over time slowly becomes really good at writing these scripts, it becomes a good pilot overall. At that point, you can ask it to do whatever you want within the context of the simulation, and it tends to do it pretty well. Like, if you say, you know, slingshot the ships one at a time around a star, it will do that very well. If you say, put each ship as close to the star as possible and just hover there, you know, thrusting towards the star so you stay in one place, it will do that and it'll be very good at it. So I had a blast designing this. Hopefully this was helpful.
I encourage you to try this for yourself, either with Codex or Claude or OpenClaw or Hermes like we tried it here. All of them have their own special quirks and abilities. But right now, from where I'm sitting, GPT-5.5, especially on high, whether you're running it in Codex or using the OAuth login, which is kind of the same as Codex, within Hermes, is doing incredibly well. It's surprisingly good at these very long-horizon tasks, these big projects. Every once in a while it gets a little bit silly and you have to go back and forth with it and troubleshoot, but those times are getting few and far between. By the way, let's compare it to the first round really fast, so you can see what a big difference those iterations make. So let's, uh, play it from the beginning. This is the first time they're attempting to write scripts to pilot these ships. Okay, so these two already collided and blew up. That would be GPT-5.5 high, which just crashes two of its ships right out of the gate. The second one goes through it too fast. That's one of the rookie mistakes these models make. They just expend all of their thrust to accelerate as fast as possible towards the star, and they end up flying out into the emptiness of space and have to use tons of fuel to slowly come back. So you can see how bad this is right now. And over time, with each iteration, it keeps getting better and better. Sometimes not in a straight line. It'll go up and down. But think about how great that last game was. These are obviously pilots that are struggling, and the last ones were ace pilots that really know what they're doing. So, kind of an interesting thing to behold. Let me speed this up a little bit so we can get to the end of it. I wonder if there are going to be any more crashes. But notice how much time they spent outside of the range of where we're playing the game.
They have to spend a lot of fuel thrusting back towards the center. So at this point, the game is pretty much over. It runs for 200 ticks, so most of that time the ships just spent outside of these solar systems. There you go. Anyways, if you made it this far, thank you so much for watching. My name is Wes Roth.