n8n's AI Workflow Builder can iterate and fix its own workflows. @theflowgrammer sat down with the AI Squad to break down how it works under the hood.
Oleg shares which custom AI Agent Tools the feature leverages during generation and iteration, plus a look at how the custom evaluations framework was set up for such a complex feature.
Chapters
00:00 - What & Why
01:30 - High Level Architecture
02:55 - Iteration as key requirement
04:46 - Custom Agent Tools
08:05 - Importance of Evals
08:36 - Evals Framework for AI Workflow Builder
11:36 - Evals vs. Feature Implementation Effort
13:35 - Wrap up
The AI Workflow Builder is available on n8n Cloud; a self-hosted version is in the works. It lets you type out your use case in text and generates a workflow.
Follow @theflowgrammer on LinkedIn: https://www.linkedin.com/in/maxtkacz/
What & Why
So David, we're talking about the AI Workflow Builder. Could you explain the feature and what it does?

Yeah, very simply, the AI Workflow Builder is a way of building workflows from text. You type in what you want, you hit a button, and a workflow appears.

And could you get into why we're building the AI Workflow Builder? What are n8n's goals for it?

Yeah, I think it's not this kind of genius idea, to be honest. Everybody has had this idea, and the community has been crying out for it to the point where they've been building their own versions of it. If ever there was a sign that something is really wanted by people, it's that they go and take the time to build it themselves. The promise of it has kind of been there ever since ChatGPT came out, right? When ChatGPT first came out, everyone tried this; the technology wasn't there yet. I think now it is. What does it do? At a very simple level, it helps people understand n8n. We know that n8n has a bit of a steep learning curve for some people, maybe the people who aren't super technical. Having a workflow laid out for you can help you understand the concepts. But of course, more ambitious than that, it's also something that helps you build workflows that maybe you didn't know how to build, and do that faster. So those are the main aims of the feature. n8n is a product for technical people, and we are building this feature to make technical people's lives easier. But there is this kind of tantalizing promise that, if we can get it working really well, maybe this can also make n8n a tool that less technical people can use.
High Level Architecture
So Oleg, could you maybe walk me through what happens when I type in my prompt? Let's say we've got a blank workflow. I type in my prompt, I hit generate — what happens next?

So you type your prompt on the front end, on the client in the browser, and we send your prompt to the backend service, the ai/build endpoint. It accepts all this data and passes it to the Builder service, which runs the AI agent. There's a lot of context in the system prompt regarding how workflows should be built and the best practices; we explain all the moves. We also inject context about the current state of the canvas. For example, if you already have nodes on the canvas, it's helpful for the AI to know that it can use them in the generation. The agent has a set of tools that mutate the state of the workflow to get it to a state that matches whatever you asked for. The whole loop looks like this: you send the prompt on the front end, it goes to the controller, then we send the request to the LLM, and the LLM might decide to call a tool. Once the tool executes, we process the output of the tool, stream it back to the client, and this goes back and forth until the LLM decides it's finished and sends the final message.
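The loop described here — prompt in, the LLM may call a tool, the tool output goes back into the conversation, repeat until a final message — can be sketched roughly like this. All names are illustrative, not n8n's actual internals; the LLM call is stubbed, and the real service streams results to the client rather than collecting them:

```typescript
// Minimal sketch of an agent loop like the one described (illustrative only).

type ToolCall = { tool: string; args: Record<string, unknown> };
type LlmReply = { done: boolean; message?: string; toolCall?: ToolCall };
type Workflow = { nodes: string[]; connections: [string, string][] };

// A tool mutates the workflow state and returns an observation for the LLM.
type Tool = (args: Record<string, unknown>, state: Workflow) => string;

function runAgent(
  callLlm: (history: string[]) => LlmReply, // stubbed LLM
  tools: Record<string, Tool>,
  prompt: string,
): { state: Workflow; finalMessage: string } {
  const state: Workflow = { nodes: [], connections: [] };
  const history: string[] = [`user: ${prompt}`];

  for (let step = 0; step < 20; step++) {
    // Safety cap on iterations so a confused model can't loop forever.
    const reply = callLlm(history);
    if (reply.done) return { state, finalMessage: reply.message ?? "" };
    const call = reply.toolCall!;
    const observation = tools[call.tool](call.args, state);
    history.push(`tool(${call.tool}): ${observation}`); // feed result back
  }
  return { state, finalMessage: "stopped: iteration limit reached" };
}
```

A scripted stub can drive it: a first reply that requests an "add node" tool call, then a second reply that finishes.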
Iteration as key requirement
I think with this AI stuff, it's often super helpful to compare to how a human would build, and humans build iteratively. AI, I think, has a much easier time if it can build iteratively too. So one of the things that was really important for us to get in is the idea of executing and refining: the AI does something, you run it, and it can send the information of how it ran back to the AI so that it can improve on what's gone wrong.

When there's an error in that generated workflow, or there's a broken connection or something, is it getting that information as well? And how does that iterative loop work?

So there are several types of errors that might happen during the generation. One error could be, for example, when it's connecting the nodes: the LLM sometimes might use the wrong connection type for a sub-node. We try to catch these kinds of errors at the service level, already in the tool, and give that feedback in the tool response to the LLM, so it immediately self-corrects by sending a different parameter. That works quite well. But another kind of error is when you actually get a workflow on your canvas, you click the execute button, and the execution fails. We send that state to the LLM, and that message would also include context regarding the workflow; if there is an error, we would also send the execution error, so the LLM can fix it or help you fix it.

So it's able to iterate, basically. Is that a correct statement?

Oh yeah, very much so. It has a memory, so a session is built. The session is keyed to your user ID and workflow ID. So when you're chatting about your workflow, it's that same session, and we do some sort of compression over the context window, but you can go back and forth with it.
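The service-level catch described here — a tool rejecting an invalid connection and handing the reason back to the model instead of failing — might look like this sketch. The connection names follow n8n's "main" / "ai_tool" convention, but the checking logic is invented for illustration:

```typescript
// Sketch: a "connect nodes" tool that validates before mutating and, on
// failure, returns a corrective message the agent can act on (illustrative).

type ConnectionType = "main" | "ai_tool";

interface NodeSpec {
  name: string;
  inputs: ConnectionType[];  // connection types this node accepts
  outputs: ConnectionType[]; // connection types this node emits
}

function connectTool(
  from: NodeSpec,
  to: NodeSpec,
  requested: ConnectionType,
): { ok: boolean; feedback: string } {
  if (!from.outputs.includes(requested)) {
    return {
      ok: false,
      feedback: `'${from.name}' has no '${requested}' output; available: ${from.outputs.join(", ")}`,
    };
  }
  if (!to.inputs.includes(requested)) {
    return {
      ok: false,
      feedback: `'${to.name}' does not accept '${requested}' input; available: ${to.inputs.join(", ")}`,
    };
  }
  return { ok: true, feedback: `connected ${from.name} -> ${to.name} (${requested})` };
}
```

Because the tool answers with feedback text rather than throwing, the failure lands in the conversation history and the model can immediately retry with a different parameter.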
Custom Agent Tools
Can you walk me through what tools it has access to, and then maybe pick one tool that you think is particularly interesting and go into a bit more detail there?

Yeah, so these are some of the tools we have. We have things like node search; that's the first tool the LLM would be calling, to understand what nodes and connections are available. This one is very slim, because it's just for exploring, and we don't want to overburden the LLM's context window with too much data. So we only send the connections, the name, and the node type. Then we have node details, which focuses on a specific node that the LLM knows might be useful for the generation. It gets details about the node — its parameters, at least a partial node type definition — so it can confirm it's indeed possible to create the workflow with it. And then you have a tool for adding nodes to the canvas. Another one is for connecting nodes. That's one of the trickier tools, because we need to do a lot of checking whether the connection the LLM wants to make is really possible. With this tool, we played around a bit with letting the LLM specify the connection type itself, but that turned out to be too complex to work reliably. So now we're doing a lot of it deterministically. The LLM just says, okay, I have this node and I want to connect it to the other node, and we compute the connection type dynamically. Sounds pretty easy, but our connection types are quite complex, because a node might have several. For example, the agent node could have a main input connection and a main output connection, or it could be used as a tool, in which case it has a tool output connection. The LLM needs to understand when to use what. So we do a lot of this deterministically, and we basically tell the LLM which connection type we used.
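The deterministic part — the LLM names two nodes and the service computes which connection type to use — could be sketched as an intersection of what the source emits and what the target accepts. This is a guess at the approach; n8n's real resolver handles many more connection kinds and edge cases:

```typescript
// Sketch: pick a connection type deterministically instead of letting the
// LLM choose one (illustrative; not n8n's actual resolver).

type ConnType = "main" | "ai_tool" | "ai_languageModel" | "ai_memory";

interface NodePorts {
  name: string;
  outputs: ConnType[]; // what this node can emit
  inputs: ConnType[];  // what this node can accept
}

function resolveConnectionType(from: NodePorts, to: NodePorts): ConnType | null {
  // First output type of `from` that `to` accepts wins. A null result means
  // "not connectable", which the tool reports back to the LLM as feedback.
  for (const out of from.outputs) {
    if (to.inputs.includes(out)) return out;
  }
  return null;
}
```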
And if it's trying to connect two nodes that don't have a compatible connection, we let it know that it's not possible.

I guess the tool that's doing the most heavy lifting is the update node parameters tool. That tool is a sub-chain, because we pass all the context about the current state of the workflow, maybe some execution data schema, and we also pass the node type definition. And as I said before, these node type definitions are pretty heavy; in some cases it can be 30,000 tokens just for the node type definition. It wouldn't be very feasible to have all of this in the main agent's context. So we split it out into a separate LLM chain call, which is only responsible for filling out the parameters. The way it communicates between the root agent and this tool is that the root agent only says: okay, I want to edit this node, here's the node ID, and here is a natural-language array of changes that I want to make to this node. We pass this to another LLM call with the node type definition, which actually does these changes, keeping all the node type definition stuff in its own context window and returning only the parameters to update the state. It has a single responsibility: configuring the node. And because we're using tools, we can also call these in parallel.
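The split described here — the root agent passes only a node ID plus natural-language changes, and a separate chain holds the heavy node type definition — keeps the 30,000-token definitions out of the main context. A rough shape, with hypothetical names and a stubbed chain standing in for the LLM call:

```typescript
// Sketch: delegate node configuration to a sub-chain so the heavy node type
// definition never enters the root agent's context (illustrative).

interface UpdateRequest {
  nodeId: string;
  changes: string[]; // natural-language edits from the root agent
}

// The sub-chain sees the full definition; the root agent never does.
type ConfigureChain = (
  typeDefinition: string, // can be tens of thousands of tokens
  request: UpdateRequest,
) => Record<string, unknown>; // only the new parameters come back

function updateNodeParameters(
  definitions: Map<string, string>, // nodeId -> node type definition
  workflow: Map<string, Record<string, unknown>>, // nodeId -> parameters
  chain: ConfigureChain,
  request: UpdateRequest,
): Record<string, unknown> {
  const def = definitions.get(request.nodeId);
  if (def === undefined) throw new Error(`unknown node: ${request.nodeId}`);
  const params = chain(def, request); // heavy context stays inside the chain
  workflow.set(request.nodeId, params); // root agent only sees the result
  return params;
}
```

Because each call touches only one node, several of these updates can run in parallel, as mentioned in the conversation.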
Importance of Evals
And what was the most complicated aspect of implementing this feature?

It's a lot of code for sure, but it's also a lot of system prompting, and that is obviously not as structured as just, you know, writing code and writing functions; it's more nuanced. So the evaluations were important to know if we're making the right changes, and to figure out whether maybe a slight sentence change changed the outcome.

Would it be possible to see one of these evals?
Evals Framework for AI Workflow Builder
So this is one of our datasets. It's basically mimicking those example prompts that you get on a blank canvas — when you have an empty workflow, we generate suggestions for you. And we use those same suggestions here: we take a suggestion, generate the workflow a number of times, and run both programmatic evaluations and LLM-as-a-judge evaluations across a couple of categories, measuring how well the generation went.

There are several metrics that we're tracking. There's an overall score, but then we're trying to determine whether the connections it made make sense; whether the functionality of the workflow matches the description of what was asked; how good the node configuration is; whether it picked the most optimal path. We also, of course, track the input and output token counts. But then we also have some programmatic checks. For example, one checks whether it's using the expression syntax correctly. We've seen that sometimes, in the agent node, it wouldn't include the input expression — it would just have a message, but without the expression. That's what's being taken care of here. We also have programmatic checks to determine whether the whole generated workflow is valid. Each one of these categories has its own evaluator, and they run in parallel: we generate the workflow, and then maybe eight different evaluators run in parallel, each focusing on a specific area of the workflow and giving you a score.

I think there's also an interesting strategy in how the score is computed. Rather than asking the LLM for a number directly, we have this concept of violations, where a violation can be minor, major, or critical. It's basically feedback about what went wrong during the run. And depending on this category, we compute the evaluation score. So the LLM doesn't need to worry about giving out numbers; it can just say, okay, here is the violation, I would classify it as critical, and we decrease maybe 25 points from the final score for that run.
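The violation-based scoring can be sketched directly: the judge only classifies, and fixed per-severity deductions turn the classification into a number. The 25-point critical deduction is mentioned in the conversation; the minor and major weights below are invented placeholders:

```typescript
// Sketch: compute an evaluator score from LLM-classified violations.
// The critical weight (25) comes from the discussion; the minor and major
// weights are placeholders, not n8n's actual values.

type Severity = "minor" | "major" | "critical";

interface Violation {
  severity: Severity;
  description: string;
}

const DEDUCTIONS: Record<Severity, number> = {
  minor: 5, // placeholder
  major: 15, // placeholder
  critical: 25, // as mentioned in the conversation
};

function scoreFromViolations(violations: Violation[]): number {
  // Start from a perfect score and deduct per violation, clamped at 0,
  // so the judge never has to produce a number itself.
  const deducted = violations.reduce(
    (total, v) => total + DEDUCTIONS[v.severity],
    0,
  );
  return Math.max(0, 100 - deducted);
}
```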
Ah, so you make it more of a classification task for the LLM, versus a qualitative "hey, be a senior engineer — or a senior flowgrammer — and review this and tell me if it's a good workflow." You kind of break it down into more discrete terms. And then you said there are eight running in parallel. Can you explain that a bit? That seems interesting.

Yeah, so this is where we have these evaluators that are LLM-as-a-judge evaluators. Let's have a look at an example: the connections evaluator. You see, this is the prompt for the evaluator, and here we're asking for some analysis as reasoning, giving a summary of the evaluation, and then the violations, each of which has a type — critical, major, or minor — and a description. The points deducted are computed programmatically. Again, you see this is pretty involved, but since we're running these evaluations only in CI or locally, they don't affect users; they are not run when you're generating. And here we basically generate the workflow.
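Fanning out several narrow evaluators concurrently, each judging one aspect of the same generated workflow, is straightforward with Promise.all. The evaluator shape and aspect names here are illustrative:

```typescript
// Sketch: run independent LLM-as-a-judge evaluators in parallel over one
// generated workflow and collect per-aspect scores (illustrative).

type Evaluator = (workflowJson: string) => Promise<{ aspect: string; score: number }>;

async function runEvaluators(
  workflowJson: string,
  evaluators: Evaluator[],
): Promise<Record<string, number>> {
  // Each evaluator is independent, so all of them can run concurrently.
  const results = await Promise.all(evaluators.map((e) => e(workflowJson)));
  return Object.fromEntries(results.map((r) => [r.aspect, r.score]));
}
```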
Evals vs. Feature Implementation Effort
Here's the user prompt, and there's the output of that scoring.

What percentage of the engineering time would you estimate was actually spent on the testing framework and the evaluations so far?

I'd say maybe 20%.

Sounds about right. And that's both for the programmatic and for the LLM-as-a-judge evaluations. Now we're also bringing some of the programmatic evaluations into the generation itself, because they don't need LLM calls — we can deterministically check whether some conditions hold. It works as sort of a final step that the agent can run to make sure that all the things we check deterministically are still passing.

Nice. And that's basically a cheap check, versus again an LLM call that's high-latency and costly, or having to run the whole workflow. These are checks where we know it's not going to work, because just by looking at the syntax of the workflow, it's no good, right?

Yeah. And there's also some variability in these LLM-as-a-judge evaluations: you call it five times and you get five slightly different results.

Mm-hmm. For the actual evaluation, we call it more than once to get some sort of averaged score.

Just for all the decision makers watching this right now, who might not think to spend 20% of their engineering budget on evals: why are they important, and how would this project have gone if you weren't given the time to build out this eval set?

It's very important, because if you want to make some change, you should have some degree of confidence that the change you're doing is actually helpful, and it's very hard to do that without evals — copy-pasting prompts and checking all the workflows by hand would probably take more of your time than running the evals. So I think it also speeds up the development. Even though these evaluations take a couple of minutes to run, you can have more confidence that the actual output reflects the changes that you want.
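The cheap deterministic checks — things you can reject just by looking at the workflow's structure, no LLM call needed — might include dangling-connection and orphan-node detection like this sketch. The specific rules are examples, not n8n's actual rule set:

```typescript
// Sketch: deterministic sanity checks on a generated workflow, usable both
// in evals and as a final agent step (illustrative rules, not n8n's own).

interface SimpleWorkflow {
  nodes: string[];
  connections: { from: string; to: string }[];
}

function checkWorkflow(wf: SimpleWorkflow): string[] {
  const problems: string[] = [];
  const names = new Set(wf.nodes);

  // Dangling connections: endpoints that name nodes that don't exist.
  for (const c of wf.connections) {
    if (!names.has(c.from)) problems.push(`connection from unknown node '${c.from}'`);
    if (!names.has(c.to)) problems.push(`connection to unknown node '${c.to}'`);
  }

  // Orphan nodes: present on the canvas but wired to nothing.
  const wired = new Set(wf.connections.flatMap((c) => [c.from, c.to]));
  for (const n of wf.nodes) {
    if (!wired.has(n) && wf.nodes.length > 1) problems.push(`orphan node '${n}'`);
  }

  return problems; // empty list means the cheap checks pass
}
```

Because this is pure structural inspection, it costs microseconds rather than an LLM round trip, which is what makes it viable as a check the agent reruns after every change.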
Wrap up
One thing that we spend a lot of time thinking about is this: we talked about the initial generation, where you go from nothing to a fully formed workflow, hopefully in a single prompt. But if you don't, and you have to iterate on it, how do you audit that? How do you know what's changed? Obviously, code has a great answer to this, which is diffs. You have the green and the red, and you see what's been added and what's been taken away. With a workflow, with a visual graph like that, it's a little bit harder. So we're spending a lot of time thinking about how we can give you a nice, concise overview of what the AI has done, so you can audit it. I think auditing and understanding what the AI is changing is going to be one of the most important things. We've all used Cursor and been so tempted to just accept every change it makes, and you end up in some pretty uncomfortable places if you do that. So we want to make sure that's not what happens when you use this feature. And we're really hopeful that the fact that n8n is visual is going to make that much easier.