# Trace Any AI Agent with OTel, MLflow, and Unity Catalog

## Metadata

- **Channel:** Databricks
- **YouTube:** https://www.youtube.com/watch?v=zWe8IRsTh_g
- **Date:** 04.06.2026
- **Duration:** 14:13
- **Views:** 631

## Description

AI agents generate massive volumes of trace data, but traditional observability tools make this data expensive to retain and difficult to govern. This demo explores how to use OpenTelemetry (OTel), MLflow, and Unity Catalog to unify your AI observability stack. See how streaming agent traces directly into the Databricks Platform allows you to securely govern your data, build custom token cost dashboards, and run continuous LLM evaluations without the risk of PII deadlocks.

Learn more about Agent Tracing and AI Observability with managed MLflow here: https://www.databricks.com/product/managed-mlflow

Read the launch blog to learn more about Governing AI agents at scale with Unity Catalog: https://www.databricks.com/blog/governing-ai-agents-scale-unity-catalog

Read the blog Observability for any agent, anywhere: Production-ready tracing with OpenTelemetry & Unity Catalog on Databricks: https://www.databricks.com/blog/observability-any-agent-anywhere-production-ready-tracing-opentelemetry-unity-catalog-databricks

TIMESTAMPS:
00:00 – Challenges in AI Agent Observability
02:28 – The Continuous Improvement Flywheel
04:20 – Demo: Building a Support Manager Assistant
05:39 – Setup: Trace Integration with MLflow and Unity Catalog
08:27 – Analyzing Traces and Native Dashboards
10:13 – Offline Evaluation and LLM Judges
13:04 – Closing the Loop for Continuous Improvement

## Contents

### [0:00](https://www.youtube.com/watch?v=zWe8IRsTh_g) Challenges in AI Agent Observability

Databricks recently released the ability to send OpenTelemetry traces from any agent running anywhere to Unity Catalog, so you can take full advantage of the end-to-end observability stack on Databricks. Before we see that in action, let's look at some of the common observability challenges that we see in the field.

Teams will often have an agent running in production, and they want to see how it's doing. So they send the traces to an observability vendor, which could be a traditional APM or a more modern agent-specific tool. The first thing they run into is that they end up sending PII or sensitive data to a third-party tool without the right governance in place, which opens up security risks. The next thing they face is that these observability vendors become very expensive at scale, because the traces themselves are often large and carry a lot of data.

Once the traces are in place, teams often want to run offline analytics, such as building dashboards and asking questions in natural language. However, a lot of these vendors simply don't have those capabilities. They are also missing much of the business context that exists elsewhere in the organization, so teams have to sync that data over to the lakehouse. They end up with data silos and the need to build and maintain brittle, expensive pipelines.

Teams also want to do production monitoring on top of the traces to see how the agent is doing in production, by running LLM judges as well as any custom scores they may have. But a lot of these vendors don't provide those capabilities, so teams have to do custom work to make that happen. Finally, those same production traces need to be used to iterate on the agent as part of an offline eval process. But a lot of those tools either don't provide those capabilities or end up being two separate stacks, so again the team needs to build custom tooling or pipelines to make this happen.

### [2:28](https://www.youtube.com/watch?v=zWe8IRsTh_g&t=148s) The Continuous Improvement Flywheel

Now, with this new capability, you can use a couple of lines of code to instrument an application or an agent and send OpenTelemetry traces from the agent running anywhere, including outside of Databricks, on any stack and with any framework, into Unity Catalog tables. These are simply tables in UC, so they have all of the governance capabilities that Unity Catalog provides. They're also just Delta tables, so they work very well at scale at a fraction of the cost.

Once the traces are in the tables, we can use the full end-to-end MLflow stack directly on top of them: inspect the traces, search them, look at sessions, and do offline evals. You can also run offline analytics directly on top of those traces, so you can build dashboards, ask questions in natural language, as you're seeing here, and use any of the rich capabilities that Databricks provides in this space.

Then you can use the production monitoring capability on top of the traces to see how the agent is performing in production. Those same traces can be taken and used as part of the offline eval process to iterate on the agent. So you end up with this continuous improvement flywheel where you're observing, evaluating, and improving your agent as you continuously iterate.

### [4:20](https://www.youtube.com/watch?v=zWe8IRsTh_g&t=260s) Demo: Building a Support Manager Assistant

We walked you through the flywheel and why customers care. Now I'm going to spend the next few minutes on what this actually looks like running on Databricks and how customers can close this loop using the Lakehouse platform. Let's dive right in.

Starting with our agent: we built a LangGraph support manager system that calls a Genie space as a tool. Let's start by asking a question: which team has the best performance? Now it's calling our tools, in this case our Genie tools, to answer the question. Let's wait for the answer. There we have it, we see our answer. The interesting part isn't the agent design; it's the setup code and how easy it is for customers to set this up on Databricks. Let me show you.

### [5:39](https://www.youtube.com/watch?v=zWe8IRsTh_g&t=339s) Setup: Trace Integration with MLflow and Unity Catalog

The setup has two parts. The first part is a one-time setup script. Let's take a look at it. The setup script creates the MLflow experiment and binds it to the UC catalog and schema that we have specified over here. That single binding tells Databricks to auto-provision the Delta tables for our traces.

The second part is the agent code itself. Three lines are needed for our configuration here: we set the tracking URI, set the experiment name, and then run autolog. And that's it. From now on, every LangGraph node, every tool call, and every model call gets captured automatically.

One detail worth flagging is that the agent itself does not need to run on Databricks compute. In this demo, it runs on a Databricks app in one workspace, and the traces flow to a completely different workspace. The same flow works if you're running your agent on your own laptop, on your Kubernetes cluster, or in another cloud. It does not matter. Customers often assume that they need to migrate the runtime for this, but they don't. Once traces land in Unity Catalog, every governance control just works. That alone solves the PII review friction that a lot of customers hit when using other SaaS observability tools.

Next, let's take a look at the experiment we have created for our agent. Here I have already pulled up the experiments UI page, and these are our traces. If you recall, earlier we asked which team has the best performance, and here it is. Every invocation shows up as a trace, and I can click into any one of these traces to see the full execution path. Let's take a look at the one from the question we asked earlier, which team has the best performance. Here we can see the entire execution path: every LLM call, every Genie tool call, and every reasoning step is shown on the details page. And that's it. Traces for our chat app agent now land directly in Unity Catalog, on the Lakehouse.
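The three configuration steps described in this section (tracking URI, experiment name, autolog) might look roughly like the sketch below. It is not the code shown in the video: the experiment path is a hypothetical placeholder, and the function takes the `mlflow` module as an argument so that the sketch can be exercised without a workspace connection. `mlflow.set_tracking_uri`, `mlflow.set_experiment`, and `mlflow.langchain.autolog` are existing MLflow calls; the one-time script that binds the experiment to a UC schema is not reproduced here.

```python
def configure_tracing(mlflow_module, tracking_uri, experiment_name):
    """Apply the three configuration steps from the talk to the given mlflow module."""
    # 1. Point MLflow at the workspace that should receive the traces.
    mlflow_module.set_tracking_uri(tracking_uri)
    # 2. Select the experiment that the one-time setup script bound to a UC schema.
    mlflow_module.set_experiment(experiment_name)
    # 3. Capture LangChain/LangGraph nodes, tool calls, and model calls automatically.
    mlflow_module.langchain.autolog()


# Assumed usage inside the agent process (requires mlflow and workspace credentials):
#   import mlflow
#   configure_tracing(mlflow, "databricks", "/Shared/support-manager-agent")  # hypothetical path
```

Because these calls only configure the client, the agent can run anywhere that can authenticate to the target workspace, which matches the point made above about not migrating the runtime.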

### [8:27](https://www.youtube.com/watch?v=zWe8IRsTh_g&t=507s) Analyzing Traces and Native Dashboards

And here you can see the catalog and the schema where our traces are landing. Now let's talk about what customers see from day one without building anything. Let's take a look at the overview page. The MLflow experiment UI now ships with native dashboards. There are dashboards for traces: trace volume over time, latency, errors, token usage, tokens per trace, as well as a cost breakdown. If we look at the tool cost tab, we can see the individual tool costs. In this case we have, for example, our ask support data tool, and here we have the performance for that tool over time, a latency comparison, and so on.

For most teams, this is enough. They just want to know whether their agent is up, whether it's failing, whether it's fast, and what it costs. All of that is here right now. But what if the customer needs a metric that's not included here today? The answer is that, because the data is already in UC, customers can create custom dashboards over their traces table. One reason to use a custom dashboard would be something like calculating cost per trace using contract pricing. The key takeaway is that customers are not tied to just these default, native dashboards. If a metric is not available here today, they can simply build their own custom dashboard for it.
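As an illustration of the cost-per-trace example, the arithmetic behind such a custom metric could be sketched as follows. The prices, field names, and the `TraceUsage` type are hypothetical; in practice this would more likely be a SQL query over the traces table, whose schema is not shown in the talk.

```python
from dataclasses import dataclass

# Hypothetical contract prices in USD per one million tokens.
CONTRACT_PRICE_PER_MILLION = {"input": 2.00, "output": 8.00}


@dataclass
class TraceUsage:
    """Token counts for one trace (illustrative shape, not the real table schema)."""
    trace_id: str
    input_tokens: int
    output_tokens: int


def cost_per_trace(usage, prices=CONTRACT_PRICE_PER_MILLION):
    """Return the USD cost of a single trace under the given contract prices."""
    input_cost = usage.input_tokens * prices["input"]
    output_cost = usage.output_tokens * prices["output"]
    return (input_cost + output_cost) / 1_000_000


def total_cost(usages, prices=CONTRACT_PRICE_PER_MILLION):
    """Sum the per-trace cost over a collection of traces."""
    return sum(cost_per_trace(u, prices) for u in usages)
```

Keeping the price table separate from the calculation means a renegotiated contract only changes one dictionary, not the dashboard logic.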

### [10:13](https://www.youtube.com/watch?v=zWe8IRsTh_g&t=613s) Offline Evaluation and LLM Judges

Now, the part that matters the most for the flywheel: taking these production traces and turning them into evaluation data. Let's jump over to our evaluation runs page. Let's select one of the runs we have over here and open it in a new tab to see exactly what it looks like. Here we have some traces on this run. Let's pull one of them. This is the prompt "tell me who my top two performers were," and here we can see the full execution path and every prompt and every response. But users might now wonder: was this a good response? That's where the eval comes in.

You might be seeing some things over here on the right, like the assessments and the feedback. But before diving into what these mean, let me show you our offline eval script. What you're seeing right now is a small script that does three things. First, it pulls the recent traces directly from UC. Then it builds an evaluation dataset from those production prompts that you saw earlier on the page. And then it runs the MLflow judges against that dataset using some custom guidelines that we wrote. In this case, we have three guidelines. One of them, for example, is actionable insights, which says the response must provide at least one concrete, data-backed recommendation. With this, each trace gets scored against each judge.

Now let's go back to the eval run that we were looking at earlier. Here we can see that the traces were scored against each of our judges. We can see which ones passed, which ones failed, and the reason why. Let's take a look at one of them. For example, this one failed actionable support insights. Let's look at the reason why. It says that while the response mentions analyzing data and providing actionable takeaways, it does not explicitly state a recommendation based on data. That is the reason why this one failed.
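The second and third steps of the script described above (build an evaluation dataset from traces, then score each row against each judge) can be sketched in plain Python. This is not the script from the video: the trace fields are assumed, and `keyword_stand_in_judge` is a hypothetical keyword check standing in for an LLM judge so that the data flow can be run without a model.

```python
from typing import Callable, NamedTuple


class Assessment(NamedTuple):
    passed: bool
    rationale: str


# A judge takes (request, response) and returns an Assessment.
Judge = Callable[[str, str], Assessment]


def build_eval_dataset(traces):
    """Keep only the prompt and response of each trace (assumed field names)."""
    return [
        {"trace_id": t["trace_id"], "request": t["request"], "response": t["response"]}
        for t in traces
    ]


def run_judges(dataset, judges):
    """Score every dataset row against every judge; one result per (row, judge) pair."""
    results = []
    for row in dataset:
        for name, judge in judges.items():
            verdict = judge(row["request"], row["response"])
            results.append({
                "trace_id": row["trace_id"],
                "judge": name,
                "passed": verdict.passed,
                "rationale": verdict.rationale,
            })
    return results


def keyword_stand_in_judge(request, response):
    """Hypothetical placeholder, NOT an LLM judge: passes if 'recommend' appears."""
    if "recommend" in response.lower():
        return Assessment(True, "Response states a recommendation.")
    return Assessment(False, "Response does not explicitly state a recommendation.")
```

Each result carries both the pass/fail flag and a rationale, mirroring the pass, fail, and reason columns shown on the evaluation run page.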

### [13:04](https://www.youtube.com/watch?v=zWe8IRsTh_g&t=784s) Closing the Loop for Continuous Improvement

One thing to note: these scores are also written to the OTel annotations table in the same UC schema as our traces. This means they can be registered as production scores. For example, you can run an auto-eval every 15 minutes on new traces as they arrive and write those results back into the annotations table. This uses the same code as the offline eval script from earlier, just running continuously.

And that's the loop closed. Production traffic becomes the eval set. The scores tell us where to invest. We improve the agent. New traces flow in. And the loop keeps going. That's the continuous improvement flywheel.

To recap: the customer has an agent, and it doesn't matter where it runs. They instrument it with MLflow and OTel. Traces feed evals. Evals drive scores. And scores drive iteration. That's the flywheel. Thank you, everyone, for attending this walk-through.
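The 15-minute scoring loop mentioned in this section can be sketched as a watermark-based cycle that only scores traces newer than the last run. The field names, the `score` callable, and the in-memory lists are assumptions for illustration; the talk does not show how the production job selects new traces or writes annotations.

```python
from datetime import datetime, timedelta, timezone

# Cadence mentioned in the talk.
SCORING_INTERVAL = timedelta(minutes=15)


def next_run_at(last_run):
    """Return when the next scoring cycle is due."""
    return last_run + SCORING_INTERVAL


def run_scoring_cycle(traces, last_scored_at, score):
    """Score traces newer than the watermark; return (annotations, new watermark)."""
    new_traces = [t for t in traces if t["timestamp"] > last_scored_at]
    annotations = [{"trace_id": t["trace_id"], **score(t)} for t in new_traces]
    # Advance the watermark only if something new was scored.
    watermark = max((t["timestamp"] for t in new_traces), default=last_scored_at)
    return annotations, watermark
```

Carrying the watermark forward between cycles keeps each trace from being scored twice, so the same scoring function can serve both the offline script and the continuous job.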

---
*Source: https://ekstraktznaniy.ru/video/52954*