# Must Haves For Agents in Production

## Metadata

- **Channel:** Sam Witteveen
- **YouTube:** https://www.youtube.com/watch?v=aIy85-gIDzI
- **Date:** 15.04.2026
- **Duration:** 15:13
- **Views:** 4,302
- **Source:** https://ekstraktznaniy.ru/video/49618

## Description

Most LLM agent demos look great. Most fail in production. Here are the 7 must haves  every team needs to nail before shipping — tools, MCP servers, budget monitoring, tracing, and agent evals. Explore in depth at: https://www.truefoundry.com/ai-gateway?utm_source=influencer&utm_medium=youtube&utm_campaign=sam

Checkout a Live Demo: https://www.truefoundry.com/live-demo-lp?utm_source=influencer&utm_medium=youtube&utm_campaign=sam
Docs - https://www.truefoundry.com/docs/create-and-setup-your-account?utm_source=influencer&utm_medium=youtube&utm_campaign=sam

Twitter: https://x.com/Sam_Witteveen 

⏱️Time Stamps:
00:00 Intro
01:42 Model Control
03:47 Prompts and Prompt Registry
05:40 Guardrails
07:17 Budget Limiting
08:54 Tools and MCP Servers
10:02 Monitoring and Tracing
11:45 Evals

🚀 ABOUT TRUEFOUNDRY 
TrueFoundry provides an enterprise-grade AI Gateway that encompasses an LLM Gateway,  MCP Gateway, and Agent Gateway—enabling enterprises to connect, observe, and govern  agentic AI applic

## Transcript

### Intro [0:00]

Okay, so clearly everyone's building agents right now. And recently in another video, I talked about the whole sort of distinction between single-user agents and multi-user agents. And the thing is, while people are building a lot of single-user agents for themselves, almost nobody is thinking about what actually happens when you want to ship one of these multi-user agents into production. And over the past 2 years, I've seen so many different things go wrong, both with my own team and with other people who I've consulted with. You see things like API keys getting leaked. You have things like rogue agents running up like a 10K bill overnight. And then you have even more weird things where suddenly you realize that for 1 day last week, your agents just kind of hallucinated for 200 different users. And while people are doing all these amazing things with Claude Code and OpenClaw, and this is really having its moment, if you're going to put something actually into production that real-world users use, there are seven things that you better make sure you've got locked down before you put any agent into production. And in this video, I'm going to go through each one and talk about the different things that you need to be focused on.

Now, to show these off, I need an actual platform. Recently, I've been doing a number of things with the guys from a startup called TrueFoundry, and they've agreed to sponsor this. And one of the reasons I agreed to this is that they actually have all seven of these things in one place, which is actually rare even for a lot of the hyperscalers. But here's the thing, even if you never use TrueFoundry, these seven things apply to whatever system you're building on. Think of them as your checklist. So, let's get into it.

### Model Control [1:42]

Okay, so the first thing is model control. You need a unified layer between your code and your models. And this is for a number of different reasons. So, first up, if you're building any agentic system nowadays, you probably don't just want to be using one model. You want a variety of different models. And often that's even going to be a variety of models from a variety of different providers. While Anthropic's Claude models are fantastic for things like tool calling and stuff like that, they're not great perhaps at multimodal. And in that case, maybe you want to use Gemini. And then other times, you basically just need really sort of simple calls. And if you realize that, okay, we're going to be doing a lot of those calls, you may even want to use just an open model that you could fine-tune to give you back a very specific JSON output, etc.

The other thing, too, is you don't want to be hardcoding model names, especially with all the labs updating their models almost monthly at the moment. You want to be able to easily swap in and swap out models. And you really want to abstract any API keys to one particular place where you lock down the security. Another thing that I'm seeing a lot that relates to this is that the model companies are deprecating models really quickly. If you had something working fantastic on Claude 3.5 Haiku, and that gets deprecated, you better have something that you can swap in very quickly.

So, in TrueFoundry here, they're letting you connect up lots of different model providers. But then you've also got a playground where you can actually test these out. This is basically where we can look to see, are we getting the structured outputs that we want? Are we needing to make changes to the system prompt? All those sorts of things. So, you want something that easily lets you connect to all your different models from different providers. And even nowadays, be able to quickly test open models with things like OpenRouter. And of course, one of the cool things that you can do here is that not only can you select which model, but you can select which region you want it to be configured in.
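
To make the "don't hardcode model names" point concrete, here is a minimal sketch of a routing table that agent code could call instead of naming models directly. This is not TrueFoundry's API; the role names, provider labels, and model names (`claude-old`, `claude-new`, and so on) are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRoute:
    """One logical role mapped to a provider and an ordered list of models."""
    provider: str
    models: tuple  # preferred model first, fallbacks after it


# Hypothetical roles and model names, for illustration only.
ROUTES = {
    "tool_calling": ModelRoute("anthropic", ("claude-old", "claude-new")),
    "multimodal": ModelRoute("google", ("gemini-vision",)),
    "json_extract": ModelRoute("self_hosted", ("small-finetuned",)),
}

# Models a provider has retired; maintained in one place, not in agent code.
DEPRECATED = {"claude-old"}


def resolve(role, routes=ROUTES, deprecated=DEPRECATED):
    """Return (provider, model) for a role, skipping deprecated models."""
    route = routes[role]
    for model in route.models:
        if model not in deprecated:
            return route.provider, model
    raise LookupError(f"no live model configured for role {role!r}")
```

Agent code asks for `resolve("tool_calling")` and never sees a concrete model name, so a deprecation becomes a one-line change to the table rather than a code change in every agent.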

### Prompts and Prompt Registry [3:47]

The second thing that you've really got to lock down is your prompts. Your prompts are your intellectual property, right? If you're doing something for any kind of LLM app, the prompt is often what makes the difference. And while models are getting better at more generalized prompts, if you're looking to do things with structured outputs, which you certainly should be, your prompts are your IP. And you've got to stop thinking about these as just being strings. In many ways, you want to think of these as being like a second tier of code. So, you want to make sure that you version them. Often you don't even embed them into the code itself.

One of the things that's really nice in TrueFoundry is that they have a prompt registry. So, for each project that you've got, you can go and put in your different prompts in there. When you're working them out, you can come to a playground. And this is not just a simple sort of chat interface for the playground. Here, you can basically see what's going to work with different tools. Here, you can see things like, am I getting the right structured outputs out? As well as being able to test, for example, okay, how is the OpenAI model working for this versus the Anthropic model working for this?

If you're working in a team of people as well, one of the things you will often do is have some people that are just working on the prompts, right? You kind of want to abstract your agent logic away from the prompts that are actually being used. So, a prompt registry lets you save the entire config: the prompt text, the model, the temperature, any guardrails attached, any sort of tools that you're going to be using for this. And your workflow for something like this would be that you experiment in a playground. You save it to the registry. You then publish it to an agent running your evals. And then you want to be able to swap out different versions of how all this comes together.
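
The idea of a registry entry that holds the whole config can be sketched in a few lines. This is an in-memory toy under stated assumptions, not how TrueFoundry stores prompts; `PromptConfig`, `PromptRegistry`, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptConfig:
    """Everything needed to reproduce a call, not just the prompt string."""
    text: str
    model: str
    temperature: float = 0.0
    guardrails: tuple = ()
    tools: tuple = ()


class PromptRegistry:
    """Append-only store: saving never overwrites an earlier version."""

    def __init__(self):
        self._versions = {}  # prompt name -> list of PromptConfig

    def save(self, name, config):
        """Store a new version and return its 1-based version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(config)
        return len(versions)

    def get(self, name, version=None):
        """Fetch a specific version, or the latest when version is None."""
        versions = self._versions[name]
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return versions[version - 1]
```

Because versions are immutable and numbered, an agent can pin `get("support_triage", version=3)` while someone else experiments with version 4, and rolling back is just asking for the older number.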

### Guardrails [5:40]

Next up is guardrails. And before your agent talks to even a single user, you need to make sure that you've got good input and output guardrails if you're doing any kind of serious project in production. These tend to fall into a number of different hooks. You've got sort of pre-LLM, post-LLM, and then sort of pre-tool or post-tool. And you could think of those as also being pre-MCP or post-MCP. One of the classic examples of this, if you're doing anything for a big company, is that you've got a lot of things around the law and around compliance. So, you need to be able to deal with things like personally identifiable information, PII, and also PHI, which is protected health information, if you're doing anything like that.

Now, there are a number of different services out there for redacting these things. You can also make your own little models for doing this. But when you're actually working on different products, you don't want to have to reinvent the wheel each time for this. So, you want to look for a service that basically has these sort of guardrail systems built in. Now, a number of the hyperscalers have things like this. TrueFoundry's guardrail system lets you use both things from commercial providers, but also things that you want to build internally. And one of the things that I think is really cool here is that because you're actually using them as a gateway for making your calls, this is literally just adding an extra header to your call to automatically have the guardrails that you've set up yourself run on the agent calls that you're using in your code.
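
As a rough illustration of the pre-LLM and post-LLM hooks described above, the sketch below chains plain functions around a model call. The regexes are deliberately naive stand-ins for a real PII service, and `redact_pii`, `block_terms`, and `guarded_call` are hypothetical names, not part of any gateway's API.

```python
import re

# Naive patterns for illustration; real PII/PHI detection needs far more care.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text):
    """Pre-LLM hook: mask emails and US-style SSNs before the model sees them."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)


def block_terms(banned):
    """Build a post-LLM hook that withholds responses containing banned terms."""
    def hook(text):
        lowered = text.lower()
        if any(term.lower() in lowered for term in banned):
            return "[response withheld by output guardrail]"
        return text
    return hook


def guarded_call(prompt, llm, pre_hooks=(), post_hooks=()):
    """Run pre-hooks on the prompt, call the model, run post-hooks on the reply."""
    for hook in pre_hooks:
        prompt = hook(prompt)
    response = llm(prompt)
    for hook in post_hooks:
        response = hook(response)
    return response
```

Here `llm` is any callable that takes a string and returns a string, so the same wrapper works with a stub in tests and a real client in production. Pre-tool and post-tool hooks would follow the same shape around the tool call.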

### Budget Limiting [7:17]

The fourth thing that you want to make sure that you've got set up is some kind of budget limiting. So, if you don't have any actual way of limiting the use and the budget that you're spending, very quickly these things can go wrong. Don't forget, anything with these LLMs is almost impossible to predict exactly how it's going to go. And I've seen it time and time again, even with code that I've written myself for testing, often you can be sort of one runaway loop away from a nightmare invoice here. And I've got to make the point here that the big cloud providers, they don't make this easy, right? There's an argument to be made that perhaps this is even by design. They're not making it easy for you to set budget caps and things like that.

Here in TrueFoundry, you can see that we can just come into controls, we can set budget limiting. And then I can set up a budget that, okay, this is going to be called quite a bit in our agent, but I want it to just max out at $1,000 per day. I can then set what this is actually for. I can even then set, okay, what model does this apply to? So, in this case, we've got the Groq models, where we want to use the Kimi K2 model. And we can set it up so that we've got a limited budget of $1,000 per day for this particular model that we're using in this agent. And the thing I'm going to say is that if you've got multiple projects going on, and especially developers, this isn't something that's optional. You've got to basically limit your liability from rogue agents and rogue processes suddenly spinning up a huge OpenAI bill or Anthropic bill, etc.
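
The $1,000-per-day cap on a single model can be expressed as a small check that runs before each call. This is only a sketch of the idea in application code, assuming you can estimate a call's cost up front; `DailyBudget`, `BudgetExceeded`, and the `kimi-k2` label are hypothetical, and a gateway would enforce this centrally rather than in each agent.

```python
import datetime as dt


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push a model past its daily cap."""


class DailyBudget:
    """Tracks spend per (model, calendar day) against a fixed USD limit."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self._spent = {}  # (model, date) -> USD spent so far

    def charge(self, model, cost_usd, today=None):
        """Record a charge and return the remaining budget, or raise."""
        today = today or dt.date.today()
        key = (model, today)
        spent = self._spent.get(key, 0.0)
        if spent + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"{model}: {spent + cost_usd:.2f} USD would exceed "
                f"the {self.limit_usd:.2f} USD daily limit"
            )
        self._spent[key] = spent + cost_usd
        return self.limit_usd - self._spent[key]
```

Raising before the call is made, rather than logging afterwards, is what stops a runaway loop: the loop fails on its next iteration instead of showing up on the invoice.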

### Tools and MCP Servers [8:54]

All right, the fifth thing that you want to really be focused on is your tools. So, this can be generic scripts. It can also be MCPs. You want to make sure if it's an MCP, do you have your authentication for the MCPs all in one place? Increasingly, agents are using all sorts of tools, right? Whether that's APIs, whether that's things like browsers, etc. And the production problem here is that if you've got 15 MCP servers, your agent has to access all of them. And you want to make sure that you've got some kind of granular control over that. So, you want to have different sorts of permissions for that. You want to have a central place where you're doing authentication for that kind of thing. And really, what you want here is where your agent authenticates with your gateway, and then your gateway handles all the other security that's going on here. So, I really think that this honestly deserves a video of its own. Perhaps we'll do that in the future. But make sure that any sort of tools you've got being used by your agents are being locked down, especially if they're tools that can cost you money via compute or via API calls.
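
The "agent authenticates with the gateway, gateway handles the rest" pattern comes down to one choke point that checks permissions before any tool runs. The sketch below shows only that permission check; `ToolGateway`, the agent IDs, and the `server/tool` naming are hypothetical, and real MCP authentication involves considerably more than a dictionary lookup.

```python
class ToolGateway:
    """Single entry point for tool calls, with per-agent allow-lists."""

    def __init__(self, grants):
        # grants: agent_id -> set of tool names that agent may call
        self._grants = {agent: set(tools) for agent, tools in grants.items()}
        self._tools = {}

    def register(self, name, fn):
        """Make a tool available behind the gateway."""
        self._tools[name] = fn

    def call(self, agent_id, tool, **kwargs):
        """Run a tool only if this agent has been granted access to it."""
        if tool not in self._grants.get(agent_id, set()):
            raise PermissionError(f"{agent_id} may not call {tool}")
        if tool not in self._tools:
            raise LookupError(f"unknown tool {tool}")
        return self._tools[tool](**kwargs)
```

An agent with no entry in `grants` gets nothing by default, which is the safer failure mode when tools can cost money through compute or API calls.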

### Monitoring and Tracing [10:02]

The sixth thing that you've got to make sure that you've got on your checklist is you need to see every request, every response, every error, every latency spike. So, you need to be able to trace a single user's journey through your agent. You've got to remember that in many ways your agent is like a black box. Your user reports a bad response, and if you've got no idea of what actually caused that to happen, you could find yourself stuck with all these possible things that it could be. That could be everything from things like 500 errors on models, which are a lot more common than people think. It can be things like your tools not giving back the right context. Somebody's changed an API and the response format coming back from it has changed. And really you want to be able to see these traces in the context of your agent and how everything maps from one thing to another.

So, yes, there are other tools out there for doing this, and there are open-source projects that do this. In TrueFoundry, they've got a whole monitoring suite where you can look at everything from model metrics to your sort of tool metrics with your MCPs, through to the actual raw traces where you can break it down and see, "Okay, what's going on for each of these things?" Now, just like with models, we can also select the region that we want our traces to actually be stored in. So, traces generated on TrueFoundry are OpenTelemetry compatible, which means we can set up a close to real-time export to any sort of compatible system like Datadog, New Relic, etc. One of the issues if you're not using a gateway to do your observability is that you actually have to set it all up. Here, because we're going through the gateway, everything is logged by default.
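
To show what "trace a single user's journey" means in practice, here is a small stdlib-only sketch that records one span per step under a shared trace ID. It is not the OpenTelemetry API and not what a gateway does internally; `span`, `spans_for`, and the record fields are hypothetical names chosen for the example.

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # in-memory store; a real system would export these


def new_trace_id():
    """One ID per user request, shared by every step of that request."""
    return uuid.uuid4().hex


@contextmanager
def span(trace_id, name, store=SPANS):
    """Record latency and any error for one step (LLM call, tool call, ...)."""
    start = time.perf_counter()
    record = {"trace_id": trace_id, "name": name, "error": None}
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)
        raise  # tracing observes failures; it does not swallow them
    finally:
        record["latency_ms"] = (time.perf_counter() - start) * 1000
        store.append(record)


def spans_for(trace_id, store=SPANS):
    """Return every recorded step for one request, in completion order."""
    return [s for s in store if s["trace_id"] == trace_id]
```

When a user reports a bad response, `spans_for(trace_id)` answers the first question directly: which step failed or was slow, whether that was the model returning an error or a tool returning an unexpected format.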

### Evals [11:45]

That brings me to the last thing that you really want to be on top of as you're going into production, and that is evals. Without evals, you're really just flying blind here, right? You need to systematically measure whether your agent is accurately doing a good job, and catch any problems before your users do. And the evals sort of come in two different parts. You can think of evals as like before you're going into production of checking, "Okay, does everything work? Is the system doing what we think it's doing?" etc. But often far more important are the evals that happen after you've gone into production. And these can be a few different things. These can be things like where suddenly you realize, "Hey, there's a new model out, it's half the price. We need to basically take a few hundred of the previous traces and run them through this model and see if it's going to work for what we're trying to do here." Also, the negative use case is where you see sort of 3 weeks later after you're into production that suddenly now about 15% of your queries are just not working the way that they used to, and no one's really noticed. So, you need to have evals that both work on the whole entire system, but also on each component. You've got to think of evals as like your dynamic way of building tests. And you want to have full test coverage for this because often that's going to tell you where you need to basically update a prompt or you're going to need to update a tool.

All right, so just to reinforce these particular things. The first thing you want to be thinking about is some kind of model control as you go through this. The second one is going to be your prompts and setting up some kind of prompt registry where you can both play with different models and try looking at the different outputs that you're getting back. Third is going to be guardrails. You want to basically protect both the inputs to your system and the outputs of your system. Inputs for protecting against things like people trying to hack your prompts or that kind of thing. And of course things like PII and PHI information. And then outputs for things like you often want to make sure that your agent is not mentioning competitors or not having any sort of obscenities and stuff like that in there. Fourth thing is budget limiting. Don't go broke. Really try to lock these kinds of things down. The fifth thing is your tools and MCP servers. Make sure that they're working as you expect, that you've got tests to be able to test each one, that you've got any sort of authentication in a nice central place. And then sixth is your monitoring and tracing so you know actually what's working, what's going wrong with this kind of thing. And then finally you've got your evals again to sort of build on the monitoring and tracing in here.

So, regardless of what platform you choose to use, you really want to make sure you follow these principles. If you want to try out what I've shown you today, I've put a link to TrueFoundry in the description. You can get started there. You can hook up your own models, your own keys to actually get started and make it easy for you to take something and actually put it into production with your team. So, while this video in many ways is aimed at teams and companies who are putting agents and LLM apps into production, I think these seven things are key to what you need to be focused on for any kind of multi-user agent that you want to eventually put into production. Anyway, as always, let me know in the comments what you think. If you like the video, please click like and subscribe, and I will talk to you in the next video. Bye for now.
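
As a footnote to the evals section, the "take a few hundred previous traces and run them through the new model" idea can be sketched as a replay loop. The trace shape (`input`/`output` keys), the `check` signature, and the 0.95 threshold are all assumptions made for this example, not something shown in the video.

```python
def replay_eval(traces, candidate, check):
    """Replay stored inputs through a candidate model and return the pass rate.

    traces:    list of {"input": ..., "output": ...} taken from production logs
    candidate: callable mapping an input to the new model's output
    check:     callable (input, new_output, old_output) -> bool
    """
    if not traces:
        raise ValueError("need at least one trace to evaluate")
    passed = sum(
        1 for t in traces
        if check(t["input"], candidate(t["input"]), t["output"])
    )
    return passed / len(traces)


def is_safe_to_swap(pass_rate, threshold=0.95):
    """Gate a model swap on the replay pass rate (threshold is an assumption)."""
    return pass_rate >= threshold
```

The same loop run on a schedule against the current model, rather than a candidate, is one way to notice the "15% of queries quietly stopped working" case, since the pass rate would drop against earlier known-good outputs.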