Evaluations in Agentic Workflows - n8n Builders Berlin (Live Demo)

n8n · 13.11.2025 · 2,706 views · 86 likes


Video description
Recorded at the Advanced Track of n8n Builders Berlin, this talk features JP van Oosten, who leads the AI team at n8n, explaining how he uses evaluations to make AI workflows more reliable. In the talk, JP walks through:

➡️ Why evaluations matter for AI workflows
➡️ How to handle inconsistent LLM outputs, context drift, and edge cases
➡️ Using evaluations while building, before deployment, and in production
➡️ Comparing models and prompts (including simple A/B-style comparisons)
➡️ Tracking things like correctness, helpfulness, and token usage
➡️ Working with evaluation triggers, data tables, and metrics inside n8n
➡️ Using LLM-as-a-judge with reference answers

Good for: anyone building AI workflows or agents in n8n who wants a clearer, more systematic way to test and monitor them.

Chapters:
00:00 Intro
02:00 Why AI Evaluations Matter
04:50 Evaluation Methods
06:22 How to use evaluations in n8n
06:49 Pre-Deployment Checks
07:41 Monitoring in Production
11:20 Live Demo
21:05 Q&A

👤 Connect with JP on LinkedIn: https://www.linkedin.com/in/jpvoosten/

#n8n #aiworkflows #AIEvaluations #automation #workflowautomation #llm #aiagents

Table of contents (8 segments)

Intro

Hey everyone, welcome to our first session. Our next speaker is JP, and he's engineering manager of the AI team here at n8n. JP brings more than 20 years of AI expertise, and I just checked that with him and it sounds true. It's long, but he has a PhD from the University of Groningen and hands-on experience from multiple successful startup exits. At n8n, he leads our AI team at the intersection of research and real-world applications. And today he will dive into practical methods for making AI-powered automations production-ready and show why evaluations are the cornerstone of reliable AI workflows. Take it away, JP. — Hello. Let's see if my slides work. All right, looks all right. So, thank you all for being here. It's hard with the lights to see all of you, so if you want to interject something, please shout. Today we'll be talking about evaluations in your AI workflow. Let's do a quick show of hands: who is not using AI in their workflows yet? One hand. Okay, so hopefully you still get some use out of this presentation and the little demo that I'm going to give. The more important question, though: who has experienced inconsistent AI outputs? Okay, I think that's almost everyone that's using AI in their workflows. And who here has already been using evaluations? A couple of hands. This is mostly a talk about how you set up evaluations, how to think about evaluations, and things like that. I do hope that I get to some of the advanced stuff so that you still get some use out of it. So let's start first with why you should care about evaluations.

Why AI Evaluations Matter

Evaluations are a way to think about productionizing your workflow. How do I put it into production? They help you think about what's going to happen when your workflow hits the real world: real users, real people asking questions to your agents, and things like that. Of course, there are LLM inconsistencies. Slight changes to your prompt, or sometimes even the same prompt, can lead to wildly differing outputs. You can have context drift: in longer conversations, LLMs can gradually lose sight of the original instructions, so you want to check for that before you put something in production as well. There are edge cases, of course. Sometimes you have input variations like missing variables or unexpected formats, which also lead to weird outputs. And, importantly, the AI can change underneath you. Providers can change models, they can change APIs, things like that. Recently Anthropic released Claude Sonnet 4.5. Does your workflow work with that version? You want to check that. Maybe it's even better than before, but sometimes it behaves a little bit differently on your use cases. So when you're ready to make your workflow go into production, how do you actually do that? There's of course traditional testing. You can think about it as unit testing and things like that, but that really assumes that your workflow is deterministic, right? It always gives you the same output, and you can really check whether the answer is correct or not. That's hard if you, for example, create a Q&A bot: you get an answer, but you have no idea whether that answer is going to be correct. There are also probabilistic changes. If you have a classification problem, how do I know whether something should be A or B? That is why traditional testing might not be enough. So we need a systematic approach. We need to address that nondeterministic nature.
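The gap between traditional unit testing and evaluation can be sketched in a few lines. This is a minimal illustration of my own (the function names are made up, and `difflib` stands in for a real similarity measure such as embeddings or an LLM judge):

```python
from difflib import SequenceMatcher

def exact_match(actual: str, expected: str) -> bool:
    # Traditional unit-test assertion: brittle, because an LLM can
    # rephrase a correct answer and still "fail" the test.
    return actual == expected

def fuzzy_score(actual: str, expected: str) -> float:
    # Evaluation-style check: a graded score in [0, 1] that tolerates
    # rewording instead of demanding byte-identical output.
    return SequenceMatcher(None, actual.lower(), expected.lower()).ratio()

reference = "Rome was founded by Romulus, according to legend."
llm_output = "According to legend, Rome was founded by Romulus."

print(exact_match(llm_output, reference))        # False: wording differs
print(fuzzy_score(llm_output, reference) > 0.5)  # True: content overlaps
```

The point is not the specific similarity function, but that a graded numeric score can be tracked over time, whereas a boolean equality check cannot distinguish a rephrased correct answer from a wrong one.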
We need a way to assess that quality, and we need a system for understanding whether our workflow is ready for production. And even once it's in production, we need to check whether it's still keeping up with all the use cases that exist in the real world. That's why you want to have evaluations. They are your safety net. They make sure that everything works as it should, and they give you objective metrics to help you build your workflow and track its quality over time. So how do I actually evaluate my AI workflow? First, you should probably start with some vibe checks. I hope this is readable for the people in the front. If you're in the back, you

Evaluation Methods

can look at the screen. Vibe checks are maybe not automated, but they're still very important. You want to use the thing that you're working on. You want to chat with your AI agent, put some data into your classifier, and things like that, because this helps you get an idea of how the AI is working in real life. It sparks ideas for what to actually check automatically later, and it helps you get a feel for what's going on in your workflow. Then you might move on to some deterministic checks. Things like: did all my tools get called? Is the classification correct? Did I retrieve the right documents? When people talk about retrieval-augmented generation, or RAG, they often think about evaluating the answer at the end. But one of the crucial parts of a RAG system is actually getting the right documents, because that is what it is basing its answers on, right? So does that part of the RAG actually work? And then finally you can think about LLM-as-a-judge. This can answer questions like: does this answer from my AI actually match the reference answer that I thought it should have? That's an important part of the quality checks for Q&A bots and things like that. Before I move on, any questions so far? Everything clear? Then let's move on to
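The deterministic checks mentioned above (were all tools called? did the retriever return the right documents?) can be written as plain functions. A minimal sketch, with an assumed trace format of my own; n8n exposes this differently in its nodes:

```python
def tools_all_called(expected_tools, trace):
    # Deterministic check: did the agent call every tool we expected?
    called = {step["tool"] for step in trace}
    return set(expected_tools) <= called

def retrieval_recall(relevant_ids, retrieved_ids):
    # RAG check: what share of the known-relevant documents did the
    # retriever actually return? 1.0 means nothing relevant was missed.
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0
    return len(relevant & set(retrieved_ids)) / len(relevant)

trace = [{"tool": "search_docs"}, {"tool": "summarize"}]
print(tools_all_called(["search_docs"], trace))              # True
print(retrieval_recall(["doc1", "doc2"], ["doc2", "doc9"]))  # 0.5
```

Checking retrieval separately from the final answer matters because a perfect-sounding answer built on the wrong documents will still score well with an end-to-end judge.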

How to use evaluations in n8n

how you can use evaluations. There are three parts that I'm going to talk about today. You can use evaluations while you're actually building your AI workflow, you can use them just before you're going to put it into production, and you can use them to monitor your workflow while it's in production. That last one is mostly for things like: does the model change, and whatnot. So let's talk a little bit about while you're building your workflow.

Pre-Deployment Checks

This allows you to test different iterations of your prompts. We made it easy to run over your entire data set with the workflow that you're building, so that you can quickly check whether all the cases you collected are actually coming through correctly, that there are no errors at the end, and stuff like that. This is also for catching edge cases early: put in some empty values, maybe some unexpected formats. And then finally, you want to iterate based on those evaluation results. You want to see quickly that your AI agent, for example, really messes up in certain use cases, and you want to correct for that by either changing your prompt, trying out different models, things like that. So that helps you when you start out with building your AI workflow.
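Running a whole data set through a workflow and collecting errors, including the edge cases with empty or unexpected inputs, is essentially a loop like the following. This is a hand-rolled sketch of the idea, not n8n's implementation; `toy_workflow` is a deliberately fragile stand-in:

```python
def run_over_dataset(workflow, dataset):
    # Push every collected case through the workflow and record outputs
    # and errors, instead of poking at one input by hand in the chat.
    results = []
    for row in dataset:
        try:
            results.append({"input": row, "output": workflow(row), "error": None})
        except Exception as exc:
            results.append({"input": row, "output": None, "error": str(exc)})
    return results

def toy_workflow(question):
    # Stand-in for the real workflow: chokes on empty input.
    if not question:
        raise ValueError("empty question")
    return f"Answer to: {question}"

dataset = ["What caused Brexit?", "", "What triggered the French Revolution?"]
failures = [r for r in run_over_dataset(toy_workflow, dataset) if r["error"]]
print(len(failures))  # 1: the empty-input edge case surfaces immediately
```

The try/except around each row is the important part: one bad input should produce a recorded failure, not abort the whole evaluation run.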

Monitoring in Production

Then when you're almost ready to start putting it into production, you can start by setting up some monitors for checking performance and seeing how your workflow is improving from one version to the next. You also might want to check the cost-quality trade-off before you put it into production. While you are running your evaluations, we automatically track things like the number of tokens that you use, so you can compare, for example, the number of tokens to the quality of your workflow. That really helps with selecting one model over another. And that's also where A/B testing comes in: you can test different models, prompts, and stuff like that just before you put it into production. Then, as you put your workflow into production, you hit the activate switch and integrate it into whatever you're doing. You want to keep monitoring that workflow, right? You can build some checks for regressions, so you catch cases that used to work properly before. You want to check performance trends over time. So keep adding different use cases: look at your executions, look at what your users are doing with your AI workflow, and keep adding those use cases to your test set. Then you can iterate using that and track it over time. And of course you can check whether new models actually help or break your workflow. With the new Claude Sonnet 4.5, does that matter for your workflow? Does it help? Does it break it? Is it the same? Those are good things to see. And maybe you want to check, for example, whether it's even faster for the things that you're doing. So those are some of the things you can check once it's in production. Now, this is the scary part: I'm going to show it to you live. There have been some people already showing evaluations before.
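Before the demo, the cost-quality trade-off described above boils down to aggregating per-run scores and token counts per model. A sketch with hypothetical numbers (in n8n the token counts are tracked automatically; the model names and figures here are invented):

```python
# Hypothetical evaluation runs: each row is one execution with a judged
# correctness score (1-5) and the tokens that execution consumed.
runs = [
    {"model": "model-a", "correctness": 4.8, "total_tokens": 1400},
    {"model": "model-a", "correctness": 4.6, "total_tokens": 1300},
    {"model": "model-b", "correctness": 4.5, "total_tokens": 600},
    {"model": "model-b", "correctness": 4.3, "total_tokens": 700},
]

def cost_quality(runs):
    # Group runs by model, then average quality and token cost per model.
    buckets = {}
    for run in runs:
        b = buckets.setdefault(run["model"], {"correctness": [], "tokens": []})
        b["correctness"].append(run["correctness"])
        b["tokens"].append(run["total_tokens"])
    return {
        model: {
            "avg_correctness": sum(b["correctness"]) / len(b["correctness"]),
            "avg_tokens": sum(b["tokens"]) / len(b["tokens"]),
        }
        for model, b in buckets.items()
    }

for model, stats in cost_quality(runs).items():
    print(model, stats)
```

With numbers like these, model-b is slightly less correct but uses roughly half the tokens, which is exactly the kind of trade-off you want visible before picking a model for production.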
I don't think that we've shown evaluations together with data tables in action yet. This is not a demo of data tables, it's a demo of evaluations, but I'm going to use the new data tables feature to show it to you. So if I go to this workflow, this is a very simple basic LLM chain. I attached an Anthropic model with Claude Sonnet 4, and I can open the chat here and ask it anything. There's no real system prompt here, so it's just plain old Claude, and I can ask it, for example: who built the city of Rome? It will come up with an answer and figure that out. This is nice, but how do I know whether this is actually correct or not? How do I know this is in the right tone of voice? How do I know whether it's not confabulating things? And that's where evaluations come in. I prepared a little data table, and this is what data tables look like. You can probably not read everything here, but this is, for example, the question: what caused the fall of the Roman Empire? And I have a reference answer here. The reference answer is actually very important. If I look at it, it says something like: the immediate trigger for the fall of the Western Roman Empire, blah blah. It gives you an answer for what this is. And this is important, because now I can check whether the AI gives me the same kind of facts as this reference answer. So let's try to build that out. What I can do is

Live Demo

I can add another trigger: the evaluation trigger. And here is a new tab called source, and that's a data table. From the list I can pick the Q&A evaluations table. Is this readable for you guys? Yeah? Okay. No? Okay. — So, better. All right. So, I picked this data table that I have here. When I press run, it pulls up one of those rows from the data table. We can see here the question and the reference answer. I want to hook this up to my LLM chain, but I cannot do that immediately, because it's expecting the chat input from the chat trigger. So what I need to do is add a Set node, pull in the question, and call it chat input. Now I can attach this to my basic LLM chain, clean it up a little bit, and say execute workflow. What's happening here is that it does that as you expect, and after it's done, you can see that it starts over, but with the next question. If I look at the edit field, it now says: what caused Brexit? And now: what triggered the French Revolution? And so on. So there's a theme to these questions, but it continues to run. This allows you to quickly see: hey, does my workflow error out somewhere? Imagine that this is a more complicated workflow that does something with the output of the LLM chain. You want to check whether that's still working, right? And you can do that using this feature. Let's stop it now. But then, after it runs, it would also be cool if I could see the actual answers in the nice format that the data tables give me. So I can add something here that says set outputs. What that does is: I can pull up the Q&A evaluations data table again and add an output. And what did I call it? I called it actual answer. So: actual answer, and I can pull in the text from the LLM chain.
So what that does now is: if I execute this step, it will of course execute the previous nodes and such, and then in a bit, when it's done, we can see that happening in... apparently not. I'll execute the workflow from the fetch-data-set-row step, and that should give me an output in my data table. So here we see: the fall of the Roman Empire was caused by a complex combination of factors, blah blah. So this works. It will continue running over my entire data set. The reason it didn't work before, I think, is that it didn't realize it was started from the evaluation trigger. But I can make that very explicit. If I want to do something else in my evaluation, call another LLM in between my evaluations or things like that, I can do that by adding something we call the check-if-evaluating node. This gives you an if node that lets you distinguish between an evaluation run and a normal run. These outputs are nice and they can help you. For example, if you have some checkboxes, you can quickly see whether something is correct or not. So, while you're building, that helps you quickly visualize how your workflow is doing on these values. Of course, this is something that can help with your vibe checks, checking whether your workflow works, and so on. But I want to track that over time, right? I have the while-building part covered now. You can use these set outputs while you're building your workflow. You add some use cases and go: oh man, I need to tweak my LLM here, or I need to use a different LLM there. That helps while you're building, because it's quickly iterating over your entire data set; you don't have to do all that manually every time. But what happens if we want to track this over time, and see how what our models do evolves?
Let me save the workflow and go to the evaluations tab. The evaluations tab helps you set up your evaluations. It has two checkboxes that we already finished: we wired up the data set, done; we wrote the outputs back to the data set, also done. Now it says: set up a quality score. I can tap that, and I see there's this other operation called the set metrics node, which helps us evaluate whether the answer from the AI is correct. There are a couple of metrics that we provide automatically: correctness, helpfulness, string similarity, categorization, tools used. But you can always define your own metrics, and there you can do whatever you want, as long as it's a numeric output. This is important: you want to be able to track it over time in the graph that we can show later. Let's take correctness, because we've been building out this Q&A bot and we want to hook it up to our metrics and check whether it's doing well. So we need to add a model here and wire things up. Let me run this one again, and then run this, so it kicks off the entire workflow from the right evaluation endpoint, and then I can start dragging things into my evaluation node. What I want to do here is go to my when-fetching-a-data-set-row node, pull in the reference answer, that's the expected one, and put it in expected answer. And from the basic LLM chain node I pull in the output text, and that goes in actual answer. There's a prompt here that you can change. If you think: okay, I have some very specific prompting techniques I need for my specific use cases, then you can change, for example, what five means versus what four means, and so on. Let's keep the default for now. And I hit save. What I can do now is go back into the evaluations tab.
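The "anything numeric can be a metric" point is worth making concrete. A minimal sketch of a custom metric of my own invention, a keyword-coverage score; the keyword list and answer are illustrative:

```python
def keyword_coverage(answer: str, required_keywords) -> float:
    # Custom metric: fraction of required facts/terms mentioned in the
    # answer. Any function qualifies as a metric as long as it returns
    # a number you can plot and compare across runs.
    answer_lower = answer.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in answer_lower)
    return hits / len(required_keywords)

answer = "The Western Roman Empire fell after invasions by Germanic tribes."
score = keyword_coverage(answer, ["Western Roman Empire", "Germanic", "Byzantine"])
print(score)  # 2 of 3 keywords found
```

A metric like this is cheap and deterministic, so it complements an LLM judge: it can't assess phrasing, but it never disagrees with itself between runs.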
And voila, the third bullet has also turned into a checkbox, and I can click run evaluation. What this does is kick off the workflow for each of the values in my data set. So we have to wait a little bit, and maybe I can go somewhere where I already did that. I did it here, and you can see the evaluations. This is what you get: a table with some evaluations. And here you see what I mentioned before: you can choose between completion tokens, prompt tokens, total tokens, execution time. So you can measure, for example, that this workflow is now running faster than before because I changed my model. But correctness is more relevant to me. I apparently changed something in the prompt here, so that it went from 4.2 to 3.4, and then it moved back up to 4.8. This is telling me something about what the workflow did over time. And we can check: it's still running, of course, because it's AI, it goes out to the cloud and all that. But it's interesting, because I can now click through to this run, and this gives me some more information. I can see that these test cases all succeeded. I can see that correctness was four in this case, but otherwise it was mostly fives. That's useful, right? I can go back to all runs and click this one, and this one finished with some errors. That's also really useful to see, because now I can dig into that particular execution, see what happened there, and fix it if I need to. The interesting thing, and I think I cut something short here, is that apparently it's talking like a pirate. This might interfere with the factfulness of the prompt, but I also asked it to maybe insert a joke about a parrot. Okay, that's fun, but it's not adding to the factfulness of this answer. So that's why there were some points deducted.
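The LLM-as-a-judge mechanic behind that correctness score is essentially: build a grading prompt from the question, reference answer, and actual answer, send it to a model, and parse a numeric score out of the reply. The sketch below is generic and of my own construction (n8n ships its own built-in judge prompt, which is not reproduced here); only the prompt assembly and score parsing are shown, with no real LLM call:

```python
import re

JUDGE_PROMPT = """You are grading an AI answer against a reference answer.

Question: {question}
Reference answer: {reference}
Actual answer: {actual}

Rate factual correctness from 1 (contradicts the reference) to
5 (covers the same facts). Reply with exactly: Score: <number>"""

def build_judge_prompt(question, reference, actual):
    # Assemble the grading prompt that would be sent to the judge model.
    return JUDGE_PROMPT.format(question=question, reference=reference, actual=actual)

def parse_score(judge_reply: str) -> int:
    # The score must come back as a number so it can be tracked over time.
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {judge_reply!r}")
    return int(match.group(1))

prompt = build_judge_prompt(
    "What caused the fall of the Roman Empire?",
    "A combination of invasions, economic decline, and political instability.",
    "The empire fell due to barbarian invasions and internal decay.",
)
print(parse_score("Score: 4"))  # 4
```

Forcing a constrained reply format ("Score: <number>") and failing loudly when it's missing is what keeps a nondeterministic judge usable as a tracked metric.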
So, you can actually also start to modify your prompt a little bit to see if the LLM-as-a-judge is really helping you evaluate that. That was the Q&A answers. And this is still running. It's terrible. There's a question over here. — Yeah. So, at the top there, when you've run an evaluation, if you click on that first entry, can you actually go to the execution and look at it? — Yeah. The question is: can I go to an execution if I click on a particular entry? Yes, this is clicking through to the execution. I can see here what it did: that I actually added a system prompt that says answer the user's question, but in the voice of a pirate. I can see that right here. And I can do that for all the other evaluations. So many questions. Okay, how much time do I have? Five minutes. Okay, let's do the questions. I wanted to show you all kinds of cool stuff, but there are so many questions. Who raised their hand first?

Q&A

Let's go with you, sir. — So what's your benchmark for correctness? Is it the initial data set you put in the tables? — So, this is why I have the reference answer in there, right? I check with an LLM whether the reference answer matches the one now generated by the AI. I know what the answer should be for this particular question, for what caused the fall of Rome, and I can check that against the newly generated answer. — The judge part. — Yeah, the judge then checks whether all the things are still in there. And this reference answer is really important. — My question would be: this works perfectly for static data, but I can have another branch of the workflow which actually brings in the underlying values to check. Let's say, for example, I want to double-check the recent stock price of something. — Yeah. — So how does it deal with that? Does it read the whole batch of answers? Does it go piece by piece? When it validates a piece, does it take whatever it finds in the data table at the beginning of the process? — It depends on how you wire it up. If I want to check after a full branch, I can wire it up later or earlier, right? In this case, I just want to check my basic LLM chain. If I want to check a whole bunch of things, I would wire it up before and after those things. So this is important about how you do that. — My question was: you have 20 examples which you were waiting to execute. Would it check whatever it found at the beginning of checking that particular item, if it changes in real time? — I don't understand the question. — Okay, let me give you a specific example. Let's say you want to double-check a stock price.
It's a stupid example, but I think it explains the phenomenon well. Let's say I go into a data table to reference it. — Yeah. — Is it going to... — So basically you want to know whether it's pulling in data from other sources that might change on the fly. Okay. Yeah, that might be an issue, but then you would probably need to wire up some sort of static source of that data to pull in: okay, does it properly work when I pull in data from somewhere else? But there are other data sets you could also use for it. For example, you could create a data set that checks whether your tools are all properly called, right? And if you can check that, then maybe you can use that. — Thank you. — You mentioned that one of the objectives is to A/B test easily, in the eval table that you showed later. Let's assume I want to do what you said: test models, test prompts, and so on. How would I see, for each of the variables I change, that I'm not confusing things? — Let's see, evaluate models here. What I did there was use the model selector to make, for example, the selection between one model or the other. And if I go into evaluations, you can see here, it's not scrolling, that I have, for example, category accuracy OpenAI and category accuracy Anthropic. I can compare these two that way. And the data table will then look something like... where do I have it here?
So I have, for example, the model provider here: it's OpenAI or Anthropic. In my model selector I can now say: okay, use the first model when the provider is Anthropic, and use the other model if it's OpenAI. So I can use the model selector to select between the two. And you can do the same with prompts: you can put the prompt in your data table, duplicate it over all your examples, and change which prompt you want to use when evaluating. — Hi, I'm wondering, if you have the judge for the evaluation be an LLM: who judges the judge? How do you make sure that the nondeterministic nature of the judge is not actually... — That's a very good question. Yeah, you need to go in there and try things out, and then deliberately try to mess with it to see if it actually works. But you're right, eventually you should maybe even get an evaluation for your evaluations. That's really not necessary in most cases, though, because you're trying to evaluate a very small part of it, right? You just want to check whether the answer roughly matches the reference answer. But yeah, in theory you're right: you want to evaluate the evaluator, and then who evaluates the evaluator of the evaluator? You get into this weird loop. — Hey, so I've been already using evaluations and I think they're pretty cool. The first part of my question got answered now, because we experienced inconsistent results. For example, one of our main use cases right now is: we make a change, for example of architecture, to an agent, and we want to check whether it's worth pushing to production. And then it feels like you need to run an evaluation a thousand times or something to really get a statistically significant result. So yeah, that problem exists.
But also, my question is, maybe it's not a question, it's more a feature request. We've been experimenting with setting custom helpfulness metrics, and you also mentioned tone of voice, so assessing tone of voice as well. Right now you can only do that by adjusting the prompts, pretty much, and I think it would be cool if you could basically stick in a workflow call or something, where more complicated evaluation logic lives. — Yeah, you can still get custom metrics if you want. If you have this if-evaluating bit that I put here, you can make this branch of your evaluation as complex as you want. You can put all kinds of logic there. You could even call out to another workflow to do the evaluation. So that's perfectly possible. As to your first part: yes, sometimes you need to add the same test case multiple times, just to get a little bit more confidence in those test cases. — I think we're going to have to wrap up now. So, if you have more questions, JP will be around. You can find him. — You can find me here. Yeah. — Thanks, JP. You know what's going to happen now. — That's really cool. Tiny TV. I'll send you the file next week. Thanks. — Thanks so much. — Yeah. Thank you.
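The A/B comparison discussed in the Q&A, splitting evaluation results by a model-provider column in the data table, reduces to grouping rows by provider and computing per-group accuracy. A minimal sketch with invented rows (the column names mirror the talk's example, not any fixed n8n schema):

```python
# Hypothetical evaluation rows: one per run, tagged with the provider
# chosen by the model selector and a 0/1 category-correct flag.
evaluation_rows = [
    {"model_provider": "anthropic", "category_correct": 1},
    {"model_provider": "anthropic", "category_correct": 1},
    {"model_provider": "openai", "category_correct": 1},
    {"model_provider": "openai", "category_correct": 0},
]

def accuracy_by_provider(rows):
    # Tally (correct, total) per provider, then turn it into an accuracy.
    totals = {}
    for row in rows:
        correct, seen = totals.get(row["model_provider"], (0, 0))
        totals[row["model_provider"]] = (correct + row["category_correct"], seen + 1)
    return {provider: correct / seen for provider, (correct, seen) in totals.items()}

print(accuracy_by_provider(evaluation_rows))  # anthropic: 1.0, openai: 0.5
```

As noted in the Q&A, with small sample sizes like these the difference may not be statistically meaningful, so repeating test cases (or growing the data set) before trusting a per-provider comparison is advisable.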
