# Building Trustworthy, High-Quality AI Agents with MLflow

## Metadata

- **Channel:** Databricks
- **YouTube:** https://www.youtube.com/watch?v=NcHCkPMww7Q
- **Date:** 22.05.2026
- **Duration:** 32:16
- **Views:** 1,490

## Description

Building AI agents presents unique challenges, as outputs can be free-form and unpredictable, often requiring specialized domain expertise to evaluate quality. This session explores how MLflow provides a unified platform to manage the full agent development life cycle. Key topics include using MLflow tracing for end-to-end observability and debugging, leveraging automated LLM judges to scale expert feedback, and employing the prompt registry for versioning and optimization. The talk also highlights the role of an AI gateway in providing essential governance through permissions, rate limits, and input guards to manage costs and data privacy.

Key Takeaways:
- Implementing end-to-end observability with MLflow tracing for step-by-step execution analysis.
- Scaling quality assessments through automated LLM-as-a-Judge evaluations and human expert alignment.
- Iteratively improving agent performance using evaluation datasets and automated prompt optimization.
- Ensuring production-grade governance and cost control with a centralized AI gateway.

## Contents

### [0:00](https://www.youtube.com/watch?v=NcHCkPMww7Q) Segment 1 (00:00 - 05:00)

Whenever you launch a new agent, there's a risk: a compliance risk from leaking PII data, or the agent could even give users offensive information. You never know. The AI gateway gives you the three most important parts: permissions, rate limits, and input guardrails. With all three in place, you have proper cost control and access control.

Hi everyone. I'm a senior solutions architect here at Databricks. Our topic is how we can build trustworthy, high-quality AI agents with MLflow. Let's start with some MLflow history.

We started building MLflow at Databricks more than 8 years ago. Our original goal was to simplify the ML stack: to make it simpler to ship a classifier, say. Since then, MLflow has evolved to become the largest open-source platform for AI operations, helping developers build high-quality AI agents and machine learning models on a unified platform. MLflow now has more than 25 million monthly downloads and is supported by an ecosystem of nearly a thousand contributors. Our mission today is in the AI era: to provide individual developers, researchers, and organizations with an open platform that helps them ship high-quality agents as quickly as possible.

Before we dive deep into agent development, it helps to take a quick look at the best practices for software development, which have become a consistent, more or less well-defined theme across the industry. Developers start by writing code and running it locally. Then they do some unit testing. After the developers do their own testing, the official QA process starts. Finally, we launch the product and collect telemetry, which helps us ensure everything is working after deployment and alerts us if something breaks. But what does that look like for agents? Building agents is a completely different paradigm.
There are several layers of complexity involved. First and foremost, agent outputs are free-form and can be unpredictable. If I send the same question to the agent five times, I may get five different answers. Output quality is subjective and requires domain expertise. What you think is good may appear bad to a co-worker. And often it isn't even something developers can measure, nor should they have to, because they are simply not the domain experts here. This means that, as a developer, you have to navigate across organizational boundaries to understand what good actually looks like.

Another thing is collaboration. Besides the domain experts we talked about, a platform engineer also needs to be involved. You want to make sure that, once the agent is deployed, it is deployed in a highly scalable, highly available fashion. It needs to scale to the use cases and, more importantly, it has to be cost-effective. Whenever you launch a new agent, there's a risk, right? A compliance risk from leaking PII data, or the agent could even give users offensive information. You never know. Last but not least, folks constantly talk about the trade-off between cost, latency, and quality. For example, you might pick a cheaper, faster LLM off the shelf, but with degraded quality. Will your users be happy about it? What happens if they're not?

With all that being said, we have seen a pretty common anti-pattern emerging: everyone just writes and runs agents locally, ships them, and hopes for the best. And unfortunately, most of the time, we start to get bug reports, calls from compliance, and all the other hazards we just covered. Fortunately, there is a better way. We interviewed hundreds of open-source developers and organizations to define a more streamlined process for building high-quality agents.
First and foremost, the developer still needs to build a working prototype, and it needs to pass their own sanity checks. Second, instead of shipping it, since quality is subjective, we bring in domain experts for a test drive. That will uncover quality issues. From there, we reproduce, fix, and verify those issues. These cycles ensure the final agent actually meets expert standards. Once we're satisfied with the performance, we get stakeholder sign-off for production. And then, last but not least, we release the agent and handle all the production-oriented concerns like guardrails, fallbacks, monitoring, and rate limiting. This sounds pretty simple, right? What's the big deal? Why do we need a platform for it? The devil is obviously in the details. Let's take a further look at the fundamental components

### [5:00](https://www.youtube.com/watch?v=NcHCkPMww7Q&t=300s) Segment 2 (05:00 - 10:00)

of the ML platform that powers this life cycle. First, we need end-to-end observability. You need to understand what is going on at each step. Where is the knowledge coming from? What context is being passed into the different models? This is powered by the tracing component. You also need evaluation capabilities. How are you going to collect feedback from domain experts? How can you automate the overall process with LLM judges to scale up the domain experts' time and capability? You also need a prompt registry. Prompts remain extremely important to LLM performance. We need versioning, and we need to compare different versions of agent prompts, including the parameters and the actual code being executed. Last but not least, there's an emerging category of AI gateway products designed to provide governance and cost management for agents and LLMs. MLflow has a gateway module to help you eliminate surprises in cost and provide guardrails for content, so that you don't run into compliance issues or the other hazards we discussed before.

Is that enough? These capabilities are great, but not enough. This space evolves so fast that every developer is in FOMO mode; they want to grab a new tool off the shelf every day. So the platform you build today has to be compatible with a broad ecosystem of large language models, agent authoring frameworks, and programming languages, so that your developers can build agents with their preferred tools while benefiting from the same standardized workflow. This is why building a platform with those requirements is so hard, and it is also why we expanded MLflow's capabilities for agent development.

Let's go back to the agent life cycle and take a closer look at how MLflow can help us accelerate this journey. We start with tracing, which helps us build the prototype faster. What do I mean by that?
With a single line of code, you can trace 40-plus large language model providers and frameworks, such as OpenAI, LangChain, Bedrock, and so on. For these libraries, you only need to add one line: `mlflow.<library>.autolog()`. Before diving into the generated trace itself, I want to call out that the generated traces conform to the OpenTelemetry standard. If you haven't heard of OpenTelemetry, it is the number one industry-standard specification for observability. That means the traces aren't only usable within MLflow on Databricks; you can integrate them with your own backend, such as Datadog, Grafana, or whatever you use today. It will be compatible, so there's no vendor lock-in at all.

Now let's move on to the demo to see what a trace actually looks like. For the demo, we'll focus on an example agent designed to answer customer support questions for a telecom company. We'll just call it the telco agent. The first thing we do is import the MLflow library from the Python SDK and add the MLflow autolog call to the code, which automatically enables tracing for the agent. Then let's come over to the user chat UI. Here we submit an example query about a cell phone upgrade. Let me enlarge that a little bit. Cool. Because we added MLflow tracing to our agent, as soon as the request is processed, a trace is generated for the developer to analyze.

Coming to the MLflow UI, we're switching back to the developer view. Throughout the demo, I'll be bouncing between the developer view and the user view. In the developer view, you can see step-by-step execution information detailing how the agent processed the request. This includes a timeline view of each operation with its inputs, outputs, and latency. This is very powerful when building a prototype.
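To make the idea of a trace concrete, here is a minimal standard-library sketch of what "one span per operation, with inputs, outputs, and latency" can look like. It is not MLflow's implementation or API; the `Span`, `traced`, `classify`, and `answer` names are hypothetical and exist only for this illustration.

```python
import functools
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Span:
    """One recorded operation: what went in, what came out, how long it took."""
    name: str
    inputs: Dict[str, Any]
    output: Any = None
    latency_ms: float = 0.0


# Spans are appended when each call finishes, so inner calls appear first.
SPANS: List[Span] = []


def traced(func: Callable[..., Any]) -> Callable[..., Any]:
    """Hypothetical decorator that records every call to func as a Span."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        span = Span(name=func.__name__, inputs={"args": args, "kwargs": kwargs})
        start = time.perf_counter()
        try:
            span.output = func(*args, **kwargs)
            return span.output
        finally:
            span.latency_ms = (time.perf_counter() - start) * 1000
            SPANS.append(span)
    return wrapper


@traced
def classify(query: str) -> str:
    # Toy router: a real agent would call an LLM here.
    return "billing" if "bill" in query.lower() else "product"


@traced
def answer(query: str) -> str:
    topic = classify(query)
    return f"[{topic}] handling: {query}"


reply = answer("Can I upgrade my phone?")
for span in SPANS:
    print(f"{span.name}: {span.inputs} -> {span.output!r} ({span.latency_ms:.3f} ms)")
```

Reading the recorded spans in order shows which step produced which intermediate value, which is the same kind of question the trace UI in the demo answers.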
Developers can submit an example query and, if they're dissatisfied with the response, use the trace UI to quickly debug, fix the agent, and keep iterating. Without it, most of the time it's just a guessing game. Cool. Tracing helps developers build a working prototype really fast. Now we can start gathering feedback from domain experts, and this is where the labeling capability comes in handy. MLflow offers a built-in chat-style UI for testing agents and recording feedback. Simply share this UI with any member of your organization and they can

### [10:00](https://www.youtube.com/watch?v=NcHCkPMww7Q&t=600s) Segment 3 (10:00 - 15:00)

use it to interact with the prototype agent and submit feedback on the go. The feedback can include a variety of quality metrics, such as correctness, relevance, safety, and so on. MLflow also provides APIs for collecting feedback from any agent. If you don't want to use the built-in MLflow UI, you can build your own application and, with just a few lines of code, integrate the whole feedback collection process. MLflow also enables domain experts to label existing traces through the UI, helping developers collect richer feedback. This reduces the back-and-forth between developers and domain experts. Finally, all feedback is stored directly alongside the MLflow traces, providing a unified view of quality and execution information.

Now let's come back to the agent UI. We can see that the actual response here is not very helpful, right? We asked about plan information, but the chatbot says it recommends reaching out to the sales or device upgrade team. It just refers you to a human specialist, which is not helpful at all. From here, I can hit the thumbs-down button to provide feedback and explain why the response is unsatisfactory. On the developer side, you can see the assessment side by side with the trace itself: you select the trace, and you see the assessment. This is how we collect feedback.

Now let's go back to the traces tab. During internal testing, developers commonly need to ask their testers for additional information about why a response is unsatisfactory and what the agent should have done instead. In our example, we see several thumbs-down ratings from internal testing, but with no clear explanation or justification for the rating.
So we are going to use MLflow's labeling capabilities to systematically capture why testers gave those ratings. First, we come over to the labeling schemas tab and create a labeling schema that defines the information we want to collect. Beyond pass and fail, we can enable a comment field so that the domain experts, or whoever is testing the agent, can provide real comments on what needs to be improved. After creating the labeling schema, we navigate to the labeling sessions tab and create a labeling session. This is a queue of traces that we ask our domain experts to label. We give the labeling session a descriptive name and select the labeling schema we created in the previous step, which tells the domain experts how the traces should be labeled. With that, our labeling session is created.

Next, we come back to the traces tab and filter for all the traces with negative feedback that need more justification or context. Once we identify those traces, we select them and add them to our labeling session. After we have selected all the traces to be labeled, we click export to save them to the labeling session. When I navigate back to the labeling session, all of the selected traces have been added and now appear in the UI.

For the next step, we share the labeling session with our domain experts, or whoever can provide additional context. Here we use the MLflow UI to request input from any member of the organization. They don't have to be part of the Databricks workspace or platform. Sharing the labeling session generates a link, which you can send directly to whoever needs to give feedback.
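The filtering step in this walkthrough (find negatively rated traces that lack a justification, then queue them for labeling) can be sketched with plain data classes. This is only an illustration of the workflow, not MLflow's labeling API; every class, field, and function name below is a hypothetical stand-in.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feedback:
    """A tester's rating on one trace, with an optional free-text reason."""
    trace_id: str
    passed: bool
    comment: str = ""


@dataclass
class LabelingSchema:
    """What reviewers are asked for: pass/fail, optionally with a comment."""
    name: str
    allow_comment: bool = True


@dataclass
class LabelingSession:
    """A named queue of traces to be labeled under one schema."""
    name: str
    schema: LabelingSchema
    trace_ids: List[str] = field(default_factory=list)


def needs_justification(feedback: List[Feedback]) -> List[str]:
    """Return IDs of traces rated negatively with no usable explanation."""
    return [f.trace_id for f in feedback if not f.passed and not f.comment.strip()]


feedback = [
    Feedback("tr-1", passed=False),
    Feedback("tr-2", passed=True),
    Feedback("tr-3", passed=False, comment="Should have listed upgrade plans."),
    Feedback("tr-4", passed=False, comment="   "),
]

session = LabelingSession("routing-review", LabelingSchema("pass_fail_with_comment"))
session.trace_ids.extend(needs_justification(feedback))
print(session.trace_ids)
```

Only the thumbs-down traces without a real comment end up in the queue, so experts spend their time where context is actually missing.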
When a domain expert opens the link, they see a chat-style UI for the traces that were added to the session. They are also asked to provide additional input according to the labeling schema we configured previously. Now, in addition to the pass or fail rating, the domain expert can fill in a more detailed explanation for the rating. This detailed information is saved back to each trace so the developer can review it in the MLflow UI. So as you can see, up to now, everything

### [15:00](https://www.youtube.com/watch?v=NcHCkPMww7Q&t=900s) Segment 4 (15:00 - 20:00)

is created based on your requirements. Nothing is pre-configured. You can define your own questions, and you can call out which metrics you want to collect. Now that we have collected feedback from the testers, it's time to home in on quality issues. Based on that feedback, what is going wrong with our agent?

Let's come back to the trace we were looking at previously. We know from the human tester that the agent's response isn't particularly useful. The way our agent works in this example is that it starts by classifying the query to determine whether it relates to the user's account, telecom products, billing information, or several other topics. The query is then routed to a specialist, which we can call a sub-agent, designed to handle that specific type of query. Looking at the routing behavior, we can see that the request should have been routed to product information, but instead it went to the billing department. As a result, the billing sub-agent attempted to handle the question but lacked the information required to answer a question about products. That's why we got that piece of feedback.

Cool. We have now identified an example of a clear issue with request routing. Luckily, this trace has only two hops. What if I have an agent with, say, 10 hops? As a developer, do you really want to go through every single step yourself? This is another capability we introduced in MLflow. It's called MLflow Assist. By clicking to debug the error in this trace, it helps you analyze what's going on. It gives you a problem summary and a root cause, and it does additional analysis on top of the root cause. And then, we can see the real issue here.
An ambiguous routing problem and a missing routing category. Cool. That's great. But still, this is only one single example, right? It is worth spending time on how prevalent this issue is relative to overall agent quality. Should you prioritize fixing this problem, or should you spend your time fixing others? Even with an AI assistant, it would be very time-consuming to analyze hundreds or even thousands of traces to identify routing problems. This is where MLflow's automated LLM judges and evaluation capabilities can help us find those specific examples in a matter of minutes.

"LLM judge" is an industry term, often used for the automation of evaluation. Given a description of the issue, an LLM judge can find all the traces that share the same issue. The example here showcases a formality judge, and we'll carry that over into our actual demo. We give it a name, we give it instructions, and we give it the model to use. As simple as that, we have defined an LLM judge.

On top of that, MLflow already includes a variety of built-in judges. Some analyze responses to determine whether they are correct, relevant, safe, or conform to domain-specific guidelines. Others analyze tool-calling behavior to ensure agents access data correctly and take actions in an appropriate manner. MLflow's retrieval judges focus on measuring the quality of retrieved documents, such as their relevance to the query. This goes back to the knowledge base. Last but not least, MLflow also integrates a variety of judges from popular third-party libraries. If you're using DeepEval or RAGAS, we support those directly.

One more thing: how can we automate judge alignment? MLflow has a capability for judge optimization.
It ensures your judges accurately detect quality issues, and you can align the judges you define with human experts to make sure their quality keeps improving. Simply pass a list of labeled traces to the judge's align API. MLflow will leverage state-of-the-art instruction optimizers (right now we mostly use GEPA or DSPy, if you're familiar with them) to tune the judge's instructions to match the labels. Cool. Now let's home in on our original routing issue. Here, we're going to create a

### [20:00](https://www.youtube.com/watch?v=NcHCkPMww7Q&t=1200s) Segment 5 (20:00 - 25:00)

routing accuracy judge. We give it a name, routing accuracy, and select the model we want to use. Then we write the instructions the judge should follow as it goes through each trace: what the available agents are, which trace it is looking at, and what action should have been taken. You can also specify the output type; it can be the default, or it can be boolean. Then I select the traces to run it on. In our case, we select the previous trace with the routing issue, which should come back as wrong. We select this trace, click select, and run the judge on it. Cool. In a few seconds, you see the result: the judge tells you that a routing issue was identified. Now we have checked the quality of this judge. It gives me the result I expect.

How can we run this judge on hundreds of traces to see how prevalent the issue is? MLflow provides a high-throughput evaluation API that runs one or more LLM judges across a set of traces in parallel. In the code snippet here, we have our routing accuracy judge included. You can also include relevance to query and tool call relevance; those are built in. And you can provide additional guidelines on top of that to specify which judges should be used. Now that we have built a judge that accurately detects the routing issue, let's run it on the most relevant 200 traces to find more examples. Having more examples will help us implement a more robust fix and verify that it generalizes across different types of queries. After a short while, MLflow produces a comprehensive evaluation report with the judge's rating for each of those 200 traces. We click on it to open the evaluation report.
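As a rough illustration of "one or more judges across a set of traces in parallel", the sketch below fans a dictionary of judge functions out over a list of traces with a thread pool. It is a standard-library stand-in, not MLflow's evaluation API: the judges here are simple rule-based functions rather than LLM calls, and the `expected_route` field is an assumed label that a real LLM judge would infer from its instructions instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Trace = Dict[str, str]
Judge = Callable[[Trace], bool]


def routing_accuracy(trace: Trace) -> bool:
    # Stand-in for an LLM judge: compare the chosen route with an assumed label.
    return trace["routed_to"] == trace["expected_route"]


def has_response(trace: Trace) -> bool:
    # Second toy judge: the agent produced a non-empty answer.
    return bool(trace["response"].strip())


def evaluate(
    traces: List[Trace], judges: Dict[str, Judge], max_workers: int = 8
) -> List[Dict[str, bool]]:
    """Score every trace with every judge; results keep the input order."""
    def score(trace: Trace) -> Dict[str, bool]:
        return {name: judge(trace) for name, judge in judges.items()}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score, traces))


traces = [
    {"routed_to": "billing", "expected_route": "product",
     "response": "Please contact the sales team."},
    {"routed_to": "product", "expected_route": "product",
     "response": "Here are three upgrade plans."},
]

report = evaluate(traces, {"routing_accuracy": routing_accuracy,
                           "has_response": has_response})
print(report)
```

Threads suit this shape of work because real judges spend most of their time waiting on model responses, so many traces can be in flight at once.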
We can see that right now only 79% of the traces have the correct routing classification and specialist routing. That means about 40 of the 200 traces contain routing accuracy problems. This confirms the issue is systematic and important for us to address. 79% is not a good number, for sure. Now I know this is a problem we should spend time fixing, so let's dig into it.

Now that we have identified the traces with routing issues, we can turn them into reproducible test cases using MLflow evaluation datasets. Think of an evaluation dataset as a snapshot of the agent inputs that have routing issues. Developers can add the inputs from existing traces to a dataset so that they can test those inputs against a new version of the agent and verify that quality has actually improved.

Coming back to the MLflow UI, let's build an evaluation dataset by selecting all the traces with routing accuracy issues from our evaluation report. After selecting them, we create a new evaluation dataset to add them to. Every MLflow evaluation dataset is backed by either S3 or Azure Blob Storage, enabling developers to govern and query those datasets in the same way as their other datasets. In this example, we're just writing into Unity Catalog, which is the Databricks solution for the governance layer. After creating the dataset, we click the export button to copy the inputs from the traces into the dataset. Coming into the dataset UI, we see a row for each of the traces. Each row contains the inputs that were passed to the agent, which can be edited, as well as a link to the trace the row came from. Developers can use tags and field filters on top of that to fine-tune the evaluation dataset as needed.
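The arithmetic behind these numbers, and the step of turning failing traces into dataset rows, can be sketched as follows. This assumes exactly 158 of the 200 traces pass (which rounds to the reported 79%); the `JudgedTrace` class and the row layout are hypothetical and are not MLflow's dataset schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JudgedTrace:
    """A trace plus the routing judge's verdict."""
    trace_id: str
    inputs: Dict[str, str]
    routing_correct: bool


def pass_rate(traces: List[JudgedTrace]) -> float:
    """Fraction of traces the judge rated as correctly routed."""
    return sum(t.routing_correct for t in traces) / len(traces)


def build_dataset(traces: List[JudgedTrace]) -> List[Dict[str, object]]:
    """Keep only failing traces; store their inputs and a link back to the trace."""
    return [
        {"inputs": t.inputs, "source_trace_id": t.trace_id}
        for t in traces
        if not t.routing_correct
    ]


# Assumption for illustration: the first 158 of 200 traces are routed correctly.
traces = [
    JudgedTrace(f"tr-{i}", {"query": f"question {i}"}, routing_correct=i < 158)
    for i in range(200)
]

rate = pass_rate(traces)
dataset = build_dataset(traces)
print(f"{rate:.0%} correct, {len(dataset)} failing rows")
```

Under that assumption the failing set has 42 rows, which matches the "about 40 examples" used as test cases later in the talk.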
The power of these test cases is that they provide a structured way to fix real-world issues as you iterate. You're not just guessing at customer inputs; this is actual data from your domain experts. We have now identified a real issue with our agent through internal testing and set up an evaluation dataset with about 40 examples of the issue. Now, we can start implementing

### [25:00](https://www.youtube.com/watch?v=NcHCkPMww7Q&t=1500s) Segment 6 (25:00 - 30:00)

a fix. As I mentioned at the beginning, prompts still play a key role in agent quality, and examining the routing instructions, the routing prompts, in our agent is a natural first step. MLflow's prompt registry helps developers analyze and modify different versions of prompts, bringing structure and reproducibility to prompt engineering. Let's see how the prompt registry can help us debug this issue. We open the prompt registry, where we see a prompt for each of the components. We can take a closer look at the prompt the agent uses to make routing decisions. Inspecting this part of the prompt, we can easily locate the instructions about how the agent should route queries.

As we know, improving a prompt manually can be very time-consuming. Developers sit there spending hours making a change and testing manually to see whether it actually resolves the issue. This is where MLflow prompt optimization comes in handy. We have built-in functionality to automatically optimize and improve prompts. It is similar to the judge tuning capability we discussed before. Just specify a version of the prompt from the MLflow prompt registry, an evaluation dataset, and one or more LLM judges. MLflow will leverage prompt optimizers to automatically generate a higher-quality prompt based on the ratings from the LLM judges, which also helps you align the prompt with the judges.

Let's hop back to our notebook. In this notebook cell, we load our evaluation dataset and our judge. Then we call the `mlflow.genai.optimize_prompts` function, specifying the agent, the dataset, and the prompt we want to optimize. We use GPT-5 in this case to generate new candidate prompts based on the dataset and judge output.
Now let's fast-forward to the next step. Here we go. After running the prompt optimization, we see that several new versions have been created, with the best-performing one as the most recent version. So whichever is on top always has the better performance. I've generated a new version of my routing prompt. How can I be sure it is actually an improvement? It's time to verify that the agent now routes queries more accurately. This time, instead of passing a list of pre-existing traces to the evaluation API, we specify the evaluation dataset together with the new version of the agent that uses the updated prompt. We don't have to change the agent code at all, since the agent always reads the latest prompt from the registry. We execute the code, and a short while later a new evaluation report is generated. If we open it up, we see that routing accuracy claims to have improved to 100%. It improved dramatically.

We need to verify those results, since we can't simply take them on faith. This time, we can see that the updated agent, given the reproduced query, routed it to the correct specialist. So the agent really does appear to be doing the work correctly. Our judge also produces a reasonable and detailed explanation for why the product specialist was the correct choice, so we can be confident in the ratings. One more capability: we can compare the results with a previous evaluation. This gives us a side-by-side comparison of the same query with the improved routing logic and the original logic. Here, we can clearly see that the 21-point increase in routing accuracy, from 79% to 100%, is for real. And we can also compare both agents on individual inputs to see what the actual improvements look like.
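The side-by-side comparison described here comes down to two things: the change in accuracy between runs, and which individual inputs flipped. A minimal sketch, assuming each run is a mapping from input ID to a pass or fail verdict (the `compare_runs` helper and the input IDs are hypothetical, not part of MLflow):

```python
from typing import Dict, List, Union

Summary = Dict[str, Union[float, List[str]]]


def compare_runs(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> Summary:
    """Compare two evaluation runs on the inputs they have in common."""
    shared = sorted(baseline.keys() & candidate.keys())
    base_acc = sum(baseline[k] for k in shared) / len(shared)
    cand_acc = sum(candidate[k] for k in shared) / len(shared)
    return {
        "baseline_accuracy": base_acc,
        "candidate_accuracy": cand_acc,
        # Difference in percentage points, not a relative percentage.
        "delta_points": round((cand_acc - base_acc) * 100, 1),
        "fixed": [k for k in shared if not baseline[k] and candidate[k]],
        "regressed": [k for k in shared if baseline[k] and not candidate[k]],
    }


# Assumption for illustration: 79 of 100 inputs pass before, all pass after.
baseline = {f"q{i:03d}": i < 79 for i in range(100)}
candidate = {key: True for key in baseline}

summary = compare_runs(baseline, candidate)
print(summary["delta_points"], len(summary["fixed"]), summary["regressed"])
```

Going from 79% to 100% is a gain of 21 percentage points; as a relative improvement over the baseline it would be larger, which is why the sketch labels the unit explicitly. Tracking a `regressed` list alongside `fixed` guards against a new prompt that repairs some inputs while breaking others.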
We see that the updated agent now provides detailed product recommendations, in contrast to the previous version, which only escalated to a human. We have now successfully identified the issue, used our LLM judge to find more examples of it, optimized our agent's prompt to fix it, and verified that the issue has actually been addressed. As we saw, the workflow we went through was mostly UI-driven or API-driven. A lot of agent developers rely on coding assistants today to build and iterate on their agents. MLflow has a built-in MCP server that you can simply pass in the ML

### [30:00](https://www.youtube.com/watch?v=NcHCkPMww7Q&t=1800s) Segment 7 (30:00 - 32:00)

into the ML MCP server URL for your coding agent to pick it up. After all the testing, with everything in place, it's time to get leadership buy-in on the product. We do this with a really simple MLflow UI: quickly compare the overall quality of the agent and build custom quality dashboards to share with stakeholders. What are the KPIs they care about most? Give them actual data-driven results so that they can make decisions.

Last but not least, we release the agent to production and keep monitoring its quality. Fortunately, nothing really changes. Since we already added MLflow tracing to the agent, we can simply deploy it and begin collecting traces from production. Additionally, the APIs we used to gather feedback stay the same when the feedback comes from real-world users. To ensure that any regressions are detected, we can keep running our LLM judges online to monitor the quality of the agent. Finally, we can analyze production traces and add them to evaluation datasets in order to implement and verify fixes, exactly the same way we did with internal testers.

The last part I want to talk about is the AI gateway. I want to zoom in on it here. It gives you the three most important parts: permissions, rate limits, and input guardrails. With all three in place, you have proper cost control and access control.

Combining all the pieces, we have covered the overall agent development life cycle and how MLflow can be part of the journey and help you accelerate the overall experience. This is our new release and roadmap. I won't go into much detail on that, but the key idea is that we are constantly improving MLflow. We are constantly learning from market trends and building the required capabilities so that you can focus on building agents yourself. And that's my session for today. Thanks for listening.
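The three gateway controls named in this segment (permissions, rate limits, input guardrails) can be illustrated with a small in-process check. This is a toy sketch under stated assumptions, not the MLflow AI gateway or its configuration: the `Gateway` class is hypothetical, the guardrail is a single email regex standing in for real PII detection, and the rate limit is a simple sliding window.

```python
import re
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Very rough email pattern, used only as a stand-in for PII detection.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class Gateway:
    """Hypothetical gateway applying three checks before a request reaches a model."""
    allowed_users: Set[str]
    max_requests: int
    window_seconds: float
    history: Dict[str, List[float]] = field(default_factory=dict)

    def check(self, user: str, prompt: str, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        # 1. Permissions: only known users may call the model.
        if user not in self.allowed_users:
            return "denied: permission"
        # 2. Input guardrail: block prompts that appear to contain an email address.
        if EMAIL.search(prompt):
            return "denied: input guardrail"
        # 3. Rate limit: count accepted requests inside the sliding window.
        recent = [t for t in self.history.get(user, []) if now - t < self.window_seconds]
        if len(recent) >= self.max_requests:
            self.history[user] = recent
            return "denied: rate limit"
        self.history[user] = recent + [now]
        return "ok"


gateway = Gateway(allowed_users={"alice"}, max_requests=2, window_seconds=60.0)
requests = [
    ("bob", "What plans do you offer?", 0.0),
    ("alice", "My email is alice@example.com", 0.0),
    ("alice", "What plans do you offer?", 0.0),
    ("alice", "And for families?", 1.0),
    ("alice", "And for students?", 2.0),
    ("alice", "And for students?", 61.0),
]
for user, prompt, at in requests:
    print(user, "->", gateway.check(user, prompt, now=at))
```

Putting the checks in one place in front of every model call is what makes cost and access predictable: a blocked request never reaches a paid model endpoint.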

---
*Source: https://ekstraktznaniy.ru/video/52957*