I can add another trigger: the evaluation trigger. It has a new tab called Source — that's a data table — and from the list I pick the Q&A evaluations table. Is this readable for you guys? Okay, better.

So I picked the data table, and when I press run, it pulls up one of the rows from that table. We can see the question and the reference answer here. I want to hook this up to my LLM chain, but I can't do that immediately, because the chain expects the chat input from the chat trigger. So I add a Set node, pull in the question, and call it chat input. Now I can attach this to my Basic LLM Chain, clean it up a little, and hit execute workflow. It runs as you'd expect, and after it's done, you can see that it starts over with the next question. If I look at the edit field, it now says "What caused Brexit?", then "What triggered the French Revolution?", and so on. There's a theme to these questions, and it keeps running. This lets you quickly see: hey, does my workflow error out somewhere? Imagine a more complicated workflow that does something with the output of the LLM chain — you want to check whether that's still working, and you can do that with this feature. Let's stop it now.

After it runs, it would also be cool to see the actual answers in the nice format the data table gives me. So I can add a node here called Set Outputs. With that I can pull up the Q&A evaluations data table again and add an output. What did I call it? Actual answer. So: actual answer, and I pull in the text from the LLM chain.
Now if I execute this step, it will of course execute the previous nodes first, and then, when it's done, we can see it happening in — apparently not. I'll execute the workflow from the fetch dataset row trigger instead, and that should write an output to my data table. Here we see: "The fall of the Roman Empire was caused by a complex combination of factors", blah blah. So this works, and it will continue running over my entire data set. The reason it didn't work before, I think, is that it didn't realize it was started from the evaluation trigger. But I can make that explicit: if I want to do something extra during an evaluation — call another LLM in between, things like that — I can add what we call the Check If Evaluating node. It gives you an If node that lets you distinguish between an evaluation run and a normal run.

These outputs are nice and they can help you: with some checkboxes, for example, you can quickly see whether something is correct. That helps you, while you're building, to quickly visualize how your workflow is doing on these values — your vibe checks, checking whether your workflow works, and so on. The while-building part is now covered: you use these Set Outputs while building your workflow, you add some use cases, you realize "oh, I need to tweak my LLM here, or use a different LLM there", and because it quickly iterates over your entire data set, you don't have to do all of that manually every time. But what happens if we want to track this over time and see how what our models do evolves?
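Conceptually, the loop described above — pull a row from the data table, run the workflow on its question, write the answer back — can be sketched like this. This is a minimal illustration of the idea, not n8n's implementation; `run_workflow` is a hypothetical stand-in for the Basic LLM Chain, and the column names are assumptions.

```python
def run_workflow(chat_input: str) -> str:
    # Placeholder for the Basic LLM Chain; a real run would call an LLM.
    return f"(answer to: {chat_input})"

def run_evaluation(rows: list[dict]) -> list[dict]:
    for row in rows:
        # The Set node maps the dataset's "question" column to the chat input.
        answer = run_workflow(row["question"])
        # The Set Outputs node writes the result back to the data table row.
        row["actual_answer"] = answer
    return rows

rows = [
    {"question": "What caused Brexit?", "reference_answer": "..."},
    {"question": "What triggered the French Revolution?", "reference_answer": "..."},
]
for row in run_evaluation(rows):
    print(row["actual_answer"])
```

The point of the evaluation trigger is exactly this outer loop: instead of you re-running the workflow by hand for each question, it iterates the whole data set for you.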
Let me save the workflow and go to the Evaluations tab. This tab helps you set up your evaluations, and it shows two checkboxes we've already finished: we wired up the data set — done — and we wrote the outputs back to the data set — also done. Now it says "set up a quality score". Tapping that shows another operation, the Set Metrics node, which helps us evaluate whether the answer from the AI is correct. There are a couple of metrics we provide automatically — correctness, helpfulness, string similarity, categorization, tools used — but you can always define your own. You can do whatever you want here, as long as it's a numeric output. That's important: you want to be able to track it over time, in a graph we can look at later.

Let's take correctness, because we've been building this Q&A bot, we want to hook it up to our metrics, and we want to check whether it's doing well. So I need to add a model and wire things up. Let me run this again so it kicks off the entire workflow from the right evaluation entry point, and then I can start dragging things into my evaluation node. I go to my "when fetching a dataset row" trigger and pull the reference answer — that's the expected value — into expected answer, and from the Basic LLM Chain node I pull the output text into actual answer. There's a prompt here that you can change: if you have very specific prompting techniques for your use cases, you can redefine, for example, what a five means versus a four. Let's keep the default for now, and I hit save. Now I go back into the Evaluations tab.
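To make the "any numeric output works" point concrete, here is a sketch of the simplest of the built-in metric ideas, string similarity, using Python's standard-library `difflib`. This is an illustration of the concept only — n8n's actual string-similarity metric may use a different algorithm or scale.

```python
from difflib import SequenceMatcher

def string_similarity(expected: str, actual: str) -> float:
    # Ratio in [0, 1]; higher means the actual answer is textually
    # closer to the reference answer. Case is ignored here.
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()

score = string_similarity(
    "The Roman Empire fell due to many factors.",
    "The Roman Empire fell because of many factors.",
)
print(round(score, 2))
```

Correctness, by contrast, is an LLM-as-a-judge metric: a model scores the actual answer against the expected one (that's why it needs a model wired up and has an editable prompt), but the output is still just a number, so it plots on the same over-time graph.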
And voilà, the third bullet has also turned into a checkbox, and I can click Run Evaluation. This kicks off the workflow for each row in my data set. We have to wait a little, so let me go to a workflow where I already did that. Here, under Evaluations, this is what you get: a table of evaluation runs. You see what I mentioned before — you can choose between completion tokens, prompt tokens, total tokens, and execution time, so you can measure, for example, that this workflow now runs faster than before because I changed my model. But correctness is more relevant to me. Apparently I changed something in the prompt, so it went from 4.2 down to 3.4 and then back up to 4.8. That's telling me something about what the workflow did over time. It's still running, of course — it's AI, it goes out to the cloud and so on.

The interesting part is that I can now click through to a run, which gives me more information: I can see that these test cases all succeeded, that correctness was four in one case but mostly fives otherwise. That's useful, right? I can go back to all runs and click another one, and this one finished with some errors. That's also really useful to see, because now I can dig into that particular execution, see what happened there, and fix it if I need to. And the interesting thing — I think I cut something short here — is that apparently it's talking like a pirate. That might interfere with the factuality of the answer, and I also asked it to insert a joke about a parrot. Okay, that's fun, but it's not adding to the factuality of the answer. So that's why some points were deducted.
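The over-time graph described above is just the per-run average of a numeric metric. A tiny sketch of that aggregation, with made-up per-question scores chosen to reproduce the 4.2 → 3.4 → 4.8 trajectory mentioned (the run names and score lists are hypothetical):

```python
from statistics import mean

def run_averages(scores_by_run: dict[str, list[float]]) -> dict[str, float]:
    # One averaged correctness score per evaluation run; this is the
    # series the evaluations graph plots over time.
    return {run: round(mean(scores), 1) for run, scores in scores_by_run.items()}

history = {
    "run-1": [4.0, 5.0, 4.0, 4.0, 4.0],  # original prompt
    "run-2": [3.0, 4.0, 3.0, 4.0, 3.0],  # pirate prompt: factuality drops
    "run-3": [5.0, 5.0, 4.0, 5.0, 5.0],  # fixed prompt
}
print(run_averages(history))  # → {'run-1': 4.2, 'run-2': 3.4, 'run-3': 4.8}
```

Because every metric is required to be numeric, the same averaging works for custom metrics too, not just the built-in ones.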
So you can also start modifying your prompt a little to see whether the LLM as a judge is actually helping you evaluate the Q&A answers. And this is still running — terrible. There's a question over here. — "At the top there, when you've run an evaluation, if you clicked on that first entry, can you actually go to the execution and look at it?" — Yes. The question is: can I go to an execution if I click on a particular entry? Yes, this is clicking through to the execution. So I can see here what it did — that I actually added a system prompt that says "answer the user's question, but answer in the voice of a pirate". I can see that right here, and I can do that for all the other evaluations. So many questions. How much time do I have? Five minutes. Okay, let's do the questions. I wanted to show you all kinds of cool stuff, but there are so many questions. Who raised their hand first?