# Why Your Agent Looks Fine in Testing but Breaks in Production with Ben Hylak

## Metadata

- **Channel:** Jason Liu
- **YouTube:** https://www.youtube.com/watch?v=nKmPZVxfzY0
- **Date:** 16.12.2025
- **Duration:** 54:05
- **Views:** 202

## Description

Building reliable AI agents requires more than just prompts and models; it demands systematic experimentation and production-focused evaluation strategies. What happens when your chatbot makes legal commitments on behalf of your brand, or when users interact with your AI system in completely unexpected ways?

In this talk, Ben Hylak (Co-Founder, Raindrop) discusses the shift from traditional unit-test style evals to production-driven experimentation, based on lessons learned from building monitoring tools for conversational AI applications.

We discuss:
• Why offline evals can only take you so far. Users interact with AI systems in unexpected ways that can't be anticipated ahead of time
• Real production challenges: chatbots making legally binding promises (airline refunds), content moderation failures, and how rules and legal expectations now apply as AI becomes mainstream
• The evolution from simple tools (get_weather) to extremely open-ended capabilities (search the internet, run shell commands, generate images) and why companies need domain-specific data models that LLMs can query
• Differentiating monitoring (is your app doing well? is this prompt better?) from tracing (latency, token counts, costs)
• Testing hypotheses in production: using semantic search to find conversation patterns, label examples, and track distribution changes over time

Ben reveals why focusing on experimentation and production monitoring matters more than achieving perfect offline test scores, and shares insights from building tools that detect issues across production data in multi-turn conversational applications.

About Raindrop: https://www.raindrop.ai/

Connect with Ben:
LinkedIn: https://www.linkedin.com/in/benhylak/
X/Twitter: https://x.com/benhylak

TIME STAMPS
00:00 Introduction to Evals Debate
02:20 Ben's Background and Experience
03:41 Technical Difficulties and Examples
13:24 The Importance of Signals and Intents
17:49 Discovering and Clustering Issues
24:44 Clustering User Frustrations
26:09 Tracking Discovered Issues
30:47 Fixing Issues Beyond Prompts
42:49 Deep Search and LLM as a Judge
48:49 Monitoring AI Products Effectively
52:00 Future of AI Tools and Data Models
53:58 Q&A Session

If you want to learn more about improving RAG applications, check out: https://improvingrag.com/

Stay updated:
X/Twitter: https://x.com/jxnlco
LinkedIn: https://www.linkedin.com/in/jxnlco
Site: https://jxnl.co/
Newsletter: https://subscribe.jxnl.co/

## Contents

### [0:00](https://www.youtube.com/watch?v=nKmPZVxfzY0) Introduction to Evals Debate

As many of you have probably been following on Twitter, there's been a lot of debate about how people should think about evals, right? Do we want to build these things like unit tests, or do we want to run things in production? Once you start looking at the lectures from week four, my belief really is that you can only do so much in these offline settings, and a lot of what you'll end up having to do in real life is understand where your users are coming from and what kinds of questions they are actually asking of these applications, right? Maybe you have a system and things are working just fine, but as you deploy it in production, you realize people really care about time filters, or people really need to organize their data by contact information or by data type. And these aren't really things you can design a priori, right? If you knew them ahead of time, of course you could. But realistically, what we find in production is that users tend to use these very soft systems in very strange ways. What Raindrop does is let you define these intents in production and then build these things offline whenever it makes sense. So this is why I'm really excited to bring in Ben, the CTO, to talk a little more about how to think about building reliable agents, the intents we have to bring in, and the intents we have to discover from our user data. As always, if you have any questions, please add them in the Slido. We shared a link in the Zoom chat. There you can add your questions and upvote the questions you want to hear answered. And with that, Ben, the floor is yours. — Sweet. Thank you. And yeah, I'm going to try to leave a good amount of time for questions at the end. With these sorts of things,
I think the advice is always so hard to generalize because every single product is so different. So a lot of it is just: what is your unique case? But if there's one thing I could have you take away from today, it's this idea of experimentation. I think it's becoming more and more important as agents get better, actually, and that's something I'll get into. If you walk away with any one word, if you remember anything about this talk, it's just that one word.

### [2:20](https://www.youtube.com/watch?v=nKmPZVxfzY0&t=140s) Ben's Background and Experience

So yeah, a little meme here, but first of all, a little bit about me. Like Jason said, I'm CTO of a company called Raindrop. Before Raindrop, I was doing avionics at SpaceX. I was a designer at Google. Oh, sorry, a designer at Apple on the human interface team for four years. So I've had kind of a weird background, but now we're making what we call Sentry for AI products. And I think it's actually a really exciting time for AI products. Deep research is actually old now, which is crazy, but I think it was one of the first longer-running agents that was actually starting to become useful, right? It was the first time you'd have something that would run for 15 or 20 minutes and you'd think, "Wow, that was worth 15 or 20 minutes," right? So that's actually a huge deal. At the same time, that same company that could make something that good also makes Codex, which, at least for me, will pretty consistently do this kind of very silly stuff. If you know how to whisper to it in the right way, it can have pretty good results, but if you don't, it'll do pretty silly stuff. This is obviously a very silly unit test. Surprisingly, that actually is the SHA hash of the word hello. But yeah, and also I think the other kind

### [3:41](https://www.youtube.com/watch?v=nKmPZVxfzY0&t=221s) Technical Difficulties and Examples

of — Wait, Ben, I don't know if it's me, but I hear you moving slides and I don't see the slides moving. — Okay, let's see. — I don't know if it's just me, but... okay, I see it now. There we go. — Oh, I just re-clicked the share button. Okay. Can you guys see slides moving now? — Yep. — Cool. Okay. Awesome. So anyway, this is the one Codex example. I asked it to create a test and it created five tests like this that were all testing whether it correctly hashed the word hello. This is obviously just a garbage test. You'd fire someone if you hired them and they made a test like this. And I think the other thing we're seeing now is that there was a one- or two-year period at the beginning of the development of AI where the laws and rules, none of these things really mattered. Now I think companies are finding out that as this becomes mainstream, there are rules, right? There are laws, there are expectations. Just one crazy example is Virgin Money. I mean, this is a really bad one. Someone used the word "virgin" in a Virgin Money chatbot and it threatened to end the conversation with them. There was a lawsuit recently where, I forget which airline it was, but the chatbot promised a refund to someone, and then when the user called customer support they said, "No, you can't have a refund. That was a hallucination." And the consumer who was promised a refund won the lawsuit. And it makes sense. If your brand tells someone they can have a refund, you actually do owe them that. These chatbots are now talking on your behalf. And we're seeing this over and over again.
It's similar to the Character.AI lawsuit as well, where there was a case in which one of their users was suicidal and the chatbot, their agent, didn't tell them not to do it. Character tried to argue, essentially, "This isn't us," and the judge said, "No, this is your company. This is you that's producing this." So that's the territory we're moving into. Again, it's just weird how unevenly good these things are. I'm trying to use Google Cloud and I'm asking where my credits are, and it says, "Oh, are you talking about Azure credits? Audible credits?" These kinds of problems are happening in every product. Grok had the South Africa genocide thing recently. It's also very bad with a lot of stuff. All I'm trying to say here is that there's this crazy dichotomy between how good the models are and how they do once they're actually integrated into products. That performance isn't lining up at the moment. And then, also just setting the scene here, one question I think we get a lot is: will it get easier to make AI products? Should you as builders just say, "Let's wait until the model gets better"? I think the first answer is yes and no, right? These models are getting better, but they're also failing in such silly ways already. And I think one of the key problems here is that communication is actually a really hard problem. Not even just with a model: if you think about another human, it's really hard to tell someone what you want. So as these agents get more capable, there's actually more and more undefined behavior.
— Ben, I don't think the slides are moving still. I still see the Codex slides. — What? Okay, let me see what I can do here. — Apologies for the technical difficulties. — Now we see test input, expected output. — This is terrible. Okay. — It's okay. We'll fix this in post. — Yeah. Can you see me going through slides right now? — No, we just see test input and then expected output. — Okay. — Because you're using Keynote, right? — I am. This is totally a Keynote issue. Let me see if I can full-screen this then. — Yeah. — Is this... — So we see test input and expected output. The eval, probably. — Okay. So now you see me going through? — Yeah. — Okay. Just tell me if you can see this now. — Yep. Still moving. — Okay. I wish I could tell you this is the last time I use Keynote. Oh, you guys missed so many good examples. I mean, you guys missed this one. Okay, so let's backtrack a little bit. These are the Azure credit ones, where it's asking me if I need Azure credits. And this is Google Cloud Console, right? So Google Cloud's chatbot is asking me if I'm talking about Roblox credits. This is the Virgin Money one. So anyway, you guys can still see me going through slides? I have PTSD now. Okay, great. — Yep. — Okay. — Okay, fantastic. — And this is what I was referring to before as well, so it makes more sense. Paul Graham had this tweet along the lines of, "I think AGI means the end of prompt engineering." And it's like, no. If you think about how hard it is, when you hire a junior engineer onto your team, to context-load them so they actually do the right thing, it's a really hard problem. So I don't think this problem is going away. And that gets us to the eval problem. So now we're on this slide.
Hopefully you guys can see test input, expected output. This is what most people think about when they think about evals today. There are a lot of different words, and I think the terminology is really confusing. These are sometimes called offline evals. If someone just says the word eval, they're often talking about this: you have a test input and some sort of expected output. Maybe the test is, "What's the capital of the United States?" and you're expecting "Washington, D.C." Obviously those inputs and outputs would be whatever is relevant for your business, whatever your product is. But that's the idea, right? There are all different ways of trying to judge that output, but the core of an eval generally is that you have some sort of input and some sort of output. The problem is that as agents run longer and longer, as they get more and more capable, there's this entire context and conversation that can span days or weeks or months before that point. That's often very hard to have an eval set for. And then you start layering in things like memory, for example, where the agent is trying to compress memory over the course of a long conversation. Now you have tools, right? All different kinds of tools. Potentially the user can add custom tools in some cases, or have different sets of tools enabled. So you end up with this impossibly large, practically infinite combination of different states that could exist at any given time. So it's increasingly hard to say, "Given this input, I expect this output." It really depends on so much external state.
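The "test input, expected output" shape described above can be sketched in a few lines of Python. This is only an illustrative sketch: `EVAL_CASES`, `run_offline_eval`, and `toy_agent` are hypothetical names, the substring check is one of many possible ways to judge an output, and `toy_agent` stands in for a real model call.

```python
from typing import Callable

# Hypothetical eval set: each case pairs a test input with an expected output.
EVAL_CASES = [
    {"input": "What is the capital of the United States?", "expected": "Washington"},
    {"input": "What is the capital of France?", "expected": "Paris"},
]


def run_offline_eval(agent: Callable[[str], str], cases: list[dict]) -> float:
    """Return the fraction of cases whose output contains the expected answer."""
    passed = 0
    for case in cases:
        output = agent(case["input"])
        # Simplest possible judge (an assumption): case-insensitive substring match.
        if case["expected"].lower() in output.lower():
            passed += 1
    return passed / len(cases)


def toy_agent(question: str) -> str:
    """Stand-in for a real model call; it has no memory, tools, or history."""
    answers = {"united states": "Washington, D.C.", "france": "Paris"}
    for key, answer in answers.items():
        if key in question.lower():
            return f"The capital is {answer}."
    return "I don't know."


if __name__ == "__main__":
    print(run_offline_eval(toy_agent, EVAL_CASES))
```

Note that the agent here is a pure function of a single string. The difficulty raised in the talk is that a real agent's output also depends on conversation history, memory, and the enabled tools, none of which this input/output pair captures.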
And I think, again, we're seeing this. OpenAI is really at the forefront of encountering and trying to figure out a lot of these problems, and they had this sycophancy incident. I really liked their second postmortem. They had this really good quote: real-world use is what helps us spot problems and understand what matters most to users. So it did really well in evals, but they can't predict every issue, right? That's the quote: we can't predict every issue. So that brings us to testing versus monitoring. Testing is where I put evals, offline evals. Monitoring is more what we're talking about today. I think both are important. As much as possible, try to think about this like normal software engineering. In every new domain there's a tendency to try to reinvent everything from the ground up, and certainly with AI and agents there are a lot of new things, but as much as possible try to apply old paradigms. In normal software engineering we have tests and we have monitoring. You probably have unit tests for some of the most critical parts of your codebase. You probably use monitoring for a lot of other things, like if you make a change and you want to know whether something's going to be slower or faster. If something's not so critical, you can ship it and see if there are issues with it afterward. It's really a judgment call.
I think in general we see a lot of software engineering moving from testing and really rigorous QA toward more monitoring, right? Especially if you compare the CD-ROM era to now. So we've seen a huge shift toward things like Sentry and Datadog,

### [13:24](https://www.youtube.com/watch?v=nKmPZVxfzY0&t=804s) The Importance of Signals and Intents

and I think one of our core beliefs is that in order to build reliable agents you need signals. You need some sort of live signal from your app that things are going well or poorly. Again, bringing it back to traditional software engineering: with Sentry you have these issues, these errors, and you also have the number of users affected and the number of times it happened. These things are really important. If you open Sentry and you see there's some error, you don't really know what it is, but it happened to one user one time, you're probably not going to care about it, right? If you saw that it happened to 80% of your users, you're going to wake up your engineers and make sure it gets solved right away. So it's really important to have signals whose impact you can actually understand. The thing that makes this really tricky is that for agents there's often no concrete error, right? Think about any of the cases I just mentioned. It's going to kick someone off for using the word "virgin", like Virgin Airlines. That's not an error, right? There's no exception that gets thrown. If you think about the OpenAI one, or about promising refunds in the wrong situations, or an agent using the wrong date or forgetting things, there's no error that gets thrown, and that's what makes our job as builders really tough. So this is what we describe as the anatomy of an AI issue. We think about it as a combination of two things. The first is signals and the second is intents. A signal is some sort of sign from your app that something's going poorly or well.
And these can be explicit. Thumbs up or thumbs down can be a signal, especially if you have enough users. And what I think is really interesting and very new for AI products is that we have implicit signals. These are something weird about what the agent's doing or what the user is saying, something we can detect in either the user input or the agent output that tells us whether something's going well or not. For example, a user complains that the agent forgot something they already told it, or the agent says it doesn't know and will have to be reminded of something, or refuses something, or says it can't complete a specific task for a specific reason. There's an infinite number of these and they change per domain, but these implicit signals end up being really interesting. The second part of this is intents, because this is what the users are actually doing. For example, if you have a coding agent, there are two levels of intents: turn-by-turn intents and conversation-level intents. If you imagine any interaction with Cursor, there's probably an intent where you present the problem, right? Maybe more specifically it's adding a new feature. You say, "I want to add a new menu for such-and-such." Maybe you want to add an entire new page. Maybe you're creating an app from scratch. Maybe you've got a bug and you're debugging. And then after that, each turn has its own intent: maybe it's correcting the previous output, maybe it's adding another feature. These look very specific depending on your product. There's also often a higher-level intent in the conversation.
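To make the "signal plus intent" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the regex patterns, the `Turn` shape, and the function names are hypothetical, and in practice the detectors would likely be model-based and tuned per domain. The intent label is assumed to have been assigned upstream.

```python
import re
from dataclasses import dataclass

# Hypothetical detectors for implicit signals. They only ask "does this text
# show issue X?", they do not grade overall quality.
SIGNAL_PATTERNS = {
    "refusal": re.compile(r"\bI (?:can't|cannot|won't) (?:help|do|assist)\b", re.I),
    "forgetting": re.compile(r"\b(?:I already told you|you forgot)\b", re.I),
    "win": re.compile(r"\b(?:thank you so much|this is so helpful)\b", re.I),
}


@dataclass
class Turn:
    user_id: str
    role: str    # "user" or "assistant"
    text: str
    intent: str  # turn-level intent, e.g. "add_feature" or "debug" (assumed labeled upstream)


def detect_signals(turn: Turn) -> list[str]:
    """Return the names of every signal whose pattern appears in the turn."""
    return [name for name, pattern in SIGNAL_PATTERNS.items() if pattern.search(turn.text)]


def issues_for(turn: Turn) -> list[tuple[str, str]]:
    """Pair each detected signal with the turn's intent: (signal, intent)."""
    return [(signal, turn.intent) for signal in detect_signals(turn)]


if __name__ == "__main__":
    turn = Turn("u1", "user", "I already told you my deadline is Friday.", "debug")
    print(issues_for(turn))
```

Pairing the two is the point: "forgetting" alone is a broad bucket, while "forgetting during debug" is closer to something a team could investigate.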
At the conversation level, maybe you're building a landing page, like a marketing page, or maybe you're working on an internal dashboard. It really depends. But both those turn-by-turn intents and conversation-level intents are usually interesting. And this is where we have to do a little more legwork. With Sentry, clustering those issues is, I don't want to call it easy, because it was hard at the time, but there's an error code, there's a trace, you can tell it came from the same place. So it's very easy to cluster. It's certainly very easy to discover. We have to have a bit of a different loop for AI products, and that's what we call the discover, track, and fix

### [17:49](https://www.youtube.com/watch?v=nKmPZVxfzY0&t=1069s) Discovering and Clustering Issues

loop. So I'm going to go through each of these steps: what we mean by discover, track, and fix. When we say discover, what we mean is defining some sort of initial signal and then clustering those things. This is how we think you discover issues. So first of all, what is a signal? I talked about it before. We think of them as ground-truthy indicators of your app's performance. A signal should be something where, if the number spikes by a few percentage points, you care about it, just like if you see error rates go up two or three percent in your normal app, you probably care. You're probably thinking, "Wait, that doesn't really make sense." A lot of the time you want to count signals not by the number of events but by the percentage of users that day who hit that signal. What I mean by this is that if you just count the number of events, you'll often have a single user who complains about something a hundred times, and that's not really that useful. If you can say that x percent more users than yesterday complained about something, then that's an interesting signal. So we talked about these explicit signals and these implicit signals. ChatGPT, again, uses explicit signals a lot. They have thumbs up, thumbs down, and they have a regenerate button, but they actually use more than you'd even know, like the copy button. Interestingly enough, if you even just copy a portion of text from a chat, they'll track that request and which portion you copied. You can see here at the bottom: feedback, feedback type, copy, selected text. So they're actually recording you copying text from a message as feedback, right? Super interesting. They of course do A/B tests pretty frequently.
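The difference between counting events and counting the share of users per day can be sketched as follows. This is a small illustration with a hypothetical event shape, `(day, user_id, signal_hit)`, and a hypothetical function name; a real pipeline would read these from an analytics store.

```python
from collections import defaultdict


def daily_user_rate(events):
    """Map each day to the share of that day's active users who hit the signal.

    `events` is an iterable of (day, user_id, signal_hit) tuples (assumed shape).
    A user who triggers the signal 100 times in a day still counts once.
    """
    active = defaultdict(set)    # day -> users seen that day
    affected = defaultdict(set)  # day -> users who hit the signal at least once
    for day, user_id, signal_hit in events:
        active[day].add(user_id)
        if signal_hit:
            affected[day].add(user_id)
    return {day: len(affected[day]) / len(users) for day, users in active.items()}


if __name__ == "__main__":
    events = [("mon", "u1", True)] * 100  # one user complaining 100 times
    events += [("mon", "u2", False), ("mon", "u3", False), ("mon", "u4", False)]
    events += [("tue", "u1", True), ("tue", "u2", True), ("tue", "u3", False), ("tue", "u4", False)]
    print(daily_user_rate(events))
```

In the sample data, Monday has 100 complaint events and Tuesday only 2, yet the per-user rate doubles from Monday to Tuesday, which is the change worth noticing.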
Speaking of A/B tests, they acquired Statsig, actually, since I made the slide, so that goes to show. And every app really has different kinds of explicit signals. Again, we're talking about explicit signals almost like an analytics event right now. These are things like thumbs down, thumbs up, regenerating, upgrading. And these are really underutilized right now when people are building AI products. I think there's a lot of alpha in this. If you could just look at all the messages that people regenerated, or regenerated more than a certain number of times, you'll often find patterns there. You'll discover interesting things there. So I'll skip this. The other way is that you can actually search. We have a feature called deep search. It's really powerful, almost like deep research. Oh, sorry, actually that's not why this example is in here. This example is here because it's one of the feedback signals we use to make our product better. We have implicit signals as well, but we also have explicit signals. For example, when someone performs a search, whether it's a semantic search or deep search, we present results and have them mark whether each one is relevant or not. So it's training the model under the hood, but we'll also flag searches where more than a certain number were marked as wrong relative to the ones that were marked correct. And this helps us know whether there are certain kinds of queries we're underperforming on. So again, this analytics-style approach is really underutilized at the moment. There are also implicit signals. I think these are the most fascinating, interesting, and novel. Again, we think about implicit signals as detecting, not judging.
So instead of asking Claude to rate how good this grammar is or how compelling this writing is, which is a bit of a circular problem, because a model that's good enough to judge another model's output is either the same model or a more expensive model, and these things end up being really finicky and really hard to perfect, you think about it more as: does the response or input have issue X? These end up being easier to make reliable. For example, the common ones we see, and the ones most of our users start with, are things like refusals, task failure, the model being lazy about something, and forgetting. And then on the flip side, on the positive side, wins are actually useful too. This is the user saying, "Oh my god, thank you so much, this is so helpful," or, "I really loved that suggestion you gave yesterday." These sorts of things are very useful signals as well. Once you define these high-level signals, like task failure or user frustration, these are not fixable things yet. They're just starting to filter down the data a little bit. The next step is that you actually have to find patterns in those things. And the question is, how do you find patterns? One way is that you just look, and I think this is a really underutilized skill. You can just look at the data, especially the things that are flagged as user frustration, and things will stand out. It's a little hard to do this if you're looking at 100% of your trace volume. But once you have even those initial coarse buckets, you'll start to see patterns emerge yourself. So anyway, really underutilized, and I would say even the prerequisite to number two, which is clustering your data. There are a lot of different ways to cluster your data.
Generally it's something where you have some sort of prompt in between that's describing the events themselves. If you can narrow it down to just a specific bucket, like just user frustration, that's a great way to avoid having to go through millions and millions of events. But what you're looking for, again, is patterns. And you can do this not just on the implicit data like user frustration. You can take the exact same clustering approach on user feedback, on events that are being regenerated, and so on. On the explore front, plain text search and semantic search are really useful and, again, underutilized here. If you can just do keyword search for things like "sorry" or "I hate that" or "fu" and all these sorts of things, text search is really, really powerful and underutilized.
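The "narrow to a bucket, then look for patterns" flow can be sketched with nothing more than keyword search and a crude grouping step. This is a toy under stated assumptions: the keyword list is hypothetical, and the greedy token-overlap grouping below is a simple stand-in for the prompt-based or embedding-based clustering mentioned in the talk, not the method Raindrop uses.

```python
import re

# Hypothetical keywords for a first-pass "user frustration" bucket.
FRUSTRATION_KEYWORDS = ("sorry", "i hate that", "doesn't work")


def keyword_search(messages, keywords=FRUSTRATION_KEYWORDS):
    """Keep messages containing any keyword (case-insensitive substring match)."""
    return [m for m in messages if any(k in m.lower() for k in keywords)]


def tokens(text):
    """Lowercased word set used for the overlap comparison."""
    return set(re.findall(r"[a-z']+", text.lower()))


def greedy_cluster(messages, threshold=0.5):
    """Group messages whose Jaccard word overlap with a cluster's first message
    meets the threshold; otherwise start a new cluster."""
    clusters = []
    for message in messages:
        words = tokens(message)
        for cluster in clusters:
            seed = tokens(cluster[0])
            union = words | seed
            if union and len(words & seed) / len(union) >= threshold:
                cluster.append(message)
                break
        else:
            clusters.append([message])
    return clusters


if __name__ == "__main__":
    messages = [
        "I hate that the upload is stuck again",
        "the upload is stuck, I hate that",
        "I hate that you forgot my name",
        "what's the weather today",
    ]
    for cluster in greedy_cluster(keyword_search(messages)):
        print(cluster)
```

Even this crude version separates "uploads are stuck" from "the agent forgot something". As the talk notes later, precision matters less at the discovery stage than at the tracking stage.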

### [24:44](https://www.youtube.com/watch?v=nKmPZVxfzY0&t=1484s) Clustering User Frustrations

All right, to give an example, let's say you have user frustration. If you cluster this, what you're going to end up with are subclusters. Okay, it doesn't handle math well, right? Maybe you don't care about that. Uploads are getting stuck: the user is trying to upload a file and they're complaining that it's never actually uploading. This is interesting, right? Because this is in the chat, but it's probably not actually about the agent. Maybe the agent is often using the wrong dates, like years in the past, or saying it can't access contacts even though your app allows you to connect your contacts and it should have access to contacts. Maybe it's a tone thing. Maybe it's forgetting what the user has already told it. So this process of defining some sort of initial signal, clustering those things, and then looking at those clusters is, really succinctly, how you discover new issues in your product, issues you don't already know about. And there are a lot of things to do from here. You could create evals from the ones you find, and so on. But that is the discover step. And as far as discovering intents goes, it's the exact same process: you take your data, you cluster it, and now you have intents. So I won't go into this in too much detail, but that's essentially how you discover intents as well. The second one is

### [26:09](https://www.youtube.com/watch?v=nKmPZVxfzY0&t=1569s) Tracking Discovered Issues

tracking. So once you actually discover those things, you actually have to be able to track them. And I'll explain why, which is like if you think about century, that like number of events and number of like users it happened to is really critical. So discovery is useful for finding out things you didn't know. Um but as far as actually understanding which ones are the sort of like most important, um it doesn't quite solve it yet. Um but if we go through user frustration for example, you can see these subclusters where it's like okay I don't care about it not handling math well actually uploads getting stuck was actually an old issue I don't care about anymore like we fixed it like we found out about it you know from if someone pinged us I don't care um maybe like those two ones that are in red were things I actually completely didn't know about and they're things I'm going to like really fix soon where like the things that are in yellow are like yeah like um these are things I know about and like I would like to make it better, but they're never going to go away, right? They're not like fixable things. Maybe um one difference between discovery and tracking is that accuracy doesn't matter as much for discovery. Like you can imagine that those clusters might have actually and often would have a lot of events in there that sort of like don't belong perfectly. Um but like maybe for example for user frustration some of the ones in that cluster will be just sort of the user just generally being mad at something. Um right but not the agent itself. Um so while it doesn't matter for discovery much it does really matter for tracking and tracking is the prerequisite to fixing. 
Why? Because when you think about actually fixing something, again we look to Sentry. You can see they show you a breakdown: here are the browsers it's affecting, here are the devices, here's the environment. They give you all this metadata, all these tags, that help you sort through the issue and understand what the culprit is. You want to do the same thing for AI products. Show the model that's most commonly affected. If you support voice and chat, which one is it happening to more? If you have an intent router, was it routed to a specific intent? In our app, driving home that word "experiments" again, we think about it as: which tools were invoked in this issue? Did those tools have errors more commonly? How often were they invoked on average? You want as much reliable metadata as possible, information that gives you breadcrumbs for solving the problem. This is also where intents are really useful. Even if you don't cluster user frustration, you can combine it with intents: user frustration when someone asks for help with their math homework, versus when they ask about pricing, versus when they try to generate an app. Even though user frustration is a broad category, each of these combinations starts to become a describable issue, which is a really important concept here. The last step here is that you will often refine and define new issues.
This process of looking at data, clustering, and creating new issues doesn't go away. So you want to find as many ways as possible of getting that data in front of you, and make the setup for clustering as easy and streamlined as possible. These are kind of obvious, but: look at your data, even the events that aren't being flagged as issues. Talk to your users. Always adjust existing issue definitions. Take something like the agent forgetting something the user told it. That might seem really easy to define up front, but what you'll find is that if the user says, "Hey, are you forgetting things? I never said that," and you're narrowly trying to test your memory system, that might actually be more of a hallucination. If the agent just makes something up, that's a different kind of problem. So being very precise about these things often does matter.
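The Sentry-style breakdown mentioned in this section (which model, which intent, which tools are associated with a tracked issue) might look roughly like the sketch below. The event schema, issue name, model names, and tool names are all hypothetical; the only point is that per-tag rates and per-tool error rates fall out of simple counting once events carry reliable metadata.

```python
from collections import Counter

# Hypothetical production events. "issues" holds labels assigned by some
# upstream tracker; "tool_calls" is a list of (tool_name, succeeded) pairs.
EVENTS = [
    {"model": "model-a", "intent": "search", "issues": ["wrong_dates"],
     "tool_calls": [("web_search", False)]},
    {"model": "model-a", "intent": "search", "issues": [],
     "tool_calls": [("web_search", True)]},
    {"model": "model-b", "intent": "search", "issues": ["wrong_dates"],
     "tool_calls": [("web_search", True), ("calendar", False)]},
    {"model": "model-b", "intent": "homework", "issues": ["wrong_dates"],
     "tool_calls": [("calendar", False)]},
    {"model": "model-b", "intent": "homework", "issues": [],
     "tool_calls": [("web_search", True)]},
    {"model": "model-a", "intent": "homework", "issues": [],
     "tool_calls": [("web_search", True)]},
]


def issue_rate_by_tag(events, issue, tag):
    """For each value of a metadata tag, the share of events with the issue."""
    total = Counter(e[tag] for e in events)
    hits = Counter(e[tag] for e in events if issue in e["issues"])
    return {value: hits[value] / total[value] for value in total}


def tool_error_rates(events, issue):
    """Per-tool error rate, restricted to events that carry the issue."""
    calls, errors = Counter(), Counter()
    for e in events:
        if issue not in e["issues"]:
            continue
        for name, ok in e["tool_calls"]:
            calls[name] += 1
            if not ok:
                errors[name] += 1
    return {name: errors[name] / calls[name] for name in calls}


print(issue_rate_by_tag(EVENTS, "wrong_dates", "model"))
print(issue_rate_by_tag(EVENTS, "wrong_dates", "intent"))
print(tool_error_rates(EVENTS, "wrong_dates"))
```

Crossing an issue with intent in this way is what turns a broad category like user frustration into something describable, for example "wrong dates, mostly on search requests, mostly when the calendar tool errors".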

### [30:47](https://www.youtube.com/watch?v=nKmPZVxfzY0&t=1847s) Fixing Issues Beyond Prompts

The last step here, before I wrap up the formal part, is fixing issues. Obviously there are prompt changes, and I'm sure there's a lot of material on that on the internet already, so I'm going to talk about things beyond prompt changes, because I think that's interesting. One of our customers, Oleve, is a studio with five-plus million users. Actually, those are all old numbers; 6 million ARR is very old now. They have a bunch of amazing apps, and when we were working together they came up with a framework that I think is really interesting. It's called Trellis. The idea is that once you have an app in production, the challenge is: how do you grow that thing without messing up other things you've already fixed? I think people see this a lot with prompts. You get your prompts right, or right enough to grow and get some initial number of users. People like your app. And then what? The last thing you want to do is mess that up. One really interesting way to think about AI products is: how can you discretize, or bucket, your app's functionality and build for those buckets independently, so that there's as little crosstalk between them as possible? And one of the realizations here is that tools are really sub-agents, right?
If you look at how Claude Code uses tools, or how ChatGPT uses tools: generating an image in ChatGPT is just a tool call. The model gives a description of the image that should be generated, and a whole other model goes and generates it. Web search is now a tool too: there's another web search agent that interprets the request properly and goes and does it. So you want to offload as much as you can. To give an example from our own work: we have a SQL-like agent, essentially. You can explore all of your data, ask it questions, generate charts, and so on. One thing that's needed is that the agent has to generate a query in our query language, which is not standard. If we try to force all the knowledge of how our query language works into the main agent, (a) it's slower, (b) it gets confused and starts using the query language all the time, and (c) if we have to change our query language, we now have to change the main prompt for our main agent. So what we did instead was have a separate model (I think we used GPT-5 nano for that) which understands the query language, has all the context, and is essentially invoked as a tool. It gets a description of what the query needs to do. So the main agent describes in natural language, given all the context of the conversation and the request, what the query should be, and a sub-agent translates that request into the query language. Right?
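A minimal sketch of that "tool as sub-agent" shape, with everything stubbed so it runs offline. The query syntax, the tool name `write_query`, and the `stub_small_model` function are invented for illustration; they are not Raindrop's query language or API. The property being shown is that the query-language documentation lives only in the sub-agent's prompt, so the main prompt never has to change when the language does.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# (system_prompt, user_message) -> completion text. Any model client could be
# wrapped to fit this shape; the stub below just keeps the sketch runnable.
ModelFn = Callable[[str, str], str]

# Hypothetical docs for a made-up query language. Only the sub-agent sees them.
QUERY_DOCS = "Syntax: COUNT events WHERE <field> = <value> [SINCE <n>d]"


@dataclass
class QuerySubAgent:
    """A tool that is really a small model with its own specialized prompt."""
    model: ModelFn
    name: str = "write_query"
    description: str = "Describe the data you need in plain English."

    def __call__(self, request: str) -> str:
        # The sub-agent owns all knowledge of the query language.
        return self.model(QUERY_DOCS, request)


def main_agent_prompt(tools: Dict[str, QuerySubAgent]) -> str:
    """The main prompt lists tool names and descriptions, never the syntax."""
    lines = ["You are a data assistant. Available tools:"]
    lines += [f"- {tool.name}: {tool.description}" for tool in tools.values()]
    return "\n".join(lines)


def stub_small_model(system_prompt: str, user_message: str) -> str:
    """Stand-in for a small, cheap model; returns a canned translation."""
    assert "Syntax:" in system_prompt  # it was given the language docs
    return "COUNT events WHERE issue = 'wrong_dates' SINCE 7d"


tools = {"write_query": QuerySubAgent(model=stub_small_model)}
prompt = main_agent_prompt(tools)

# The main agent would emit this natural-language argument as a tool call.
query = tools["write_query"]("How many wrong-date issues happened in the last week?")
print(prompt)
print(query)
```

Swapping the model behind `write_query`, or rewriting `QUERY_DOCS`, touches only the sub-agent, which is the modularity argument made in the talk.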
So the more you can discretize the functionality in your app, the more you're able to make targeted changes. One, you're able to create new tools without changing the main prompt or the main functionality. Obviously there can still be side effects, but it's as modular as you can get when making an AI product. And two, you can really target and improve these tools, experiment with different models for them, and so on. This is sort of Trellis. Again, this is their thing, and it's very similar: you start by launching something that actually works. You do that first. Then you observe, and you try to discretize based on intents and based on issues. Essentially, iteratively, for every intent you ask: can you break that thing out into a targeted workflow that you can refine, instead of it being just one part of a general agent? And really you're just doing this recursively. This is, again, where tool calls come in. Even if it's just normal Sentry or monitoring, have a way to actually see the tools your agent is calling and the errors they're hitting. If one tool is getting called five times in a row and failing every single time, those are the kinds of things you need to know about. You really have to start thinking about tools as almost an extension of your prompt. Even the name of your tool can really impact performance. Labs like Anthropic and OpenAI are doing RL for specific tool names. But the last takeaway here is that, the way we think about it, this AI magic really has to be engineered, repeatable, testable, and attributable, right?
You have to be able to attribute it to something, not have it be accidental, because otherwise it'll wash away like a sand castle as soon as the next model comes out. So, yeah.

— Awesome. Thank you so much. I think we have probably 15 minutes for questions, and then maybe we can do some live ones. I would also love to hear a little more about case studies you might have, or horror stories of how things have gone wrong for other app developers. I think the main question was: is there any case for not actually using evals? One of the questions was, "I don't see any cases where production applications shouldn't need evals." So we can talk a little more about that: when does it make sense to use evals versus actually doing something in production?

— I think pretty much everyone should have evals, to be clear. Again, as much as you can, try to model it like normal software engineering. It's really a question of prioritization. Would you ship an MVP of something without rigorous tests? You might. It depends on how critical it is. Just be honest about it. You probably want some sort of monitoring, though. Oftentimes when I ship a new MVP, I'll just throw in Sentry or Vercel Analytics or something, just to see whether anyone even used it. We use evals a lot. I think evals are really important. The question is more about proportion: how much time do you spend on each? And I think that really depends on the product you're building.

— Yeah. I would also add that sometimes I feel like the eval is kind of superfluous.
Oftentimes we can just go and fix the issue itself. There have been plenty of times where there are a couple of bugs, and it's one thing to fix the bug and then write 15 more tests to show that you have full test coverage. But practically, there's a lot of ground you can cover just by fixing the issue and knowing these things will stay fixed with guardrails, rather than trying to measure, or make up, a bunch of really difficult evals. I guess another question was around sentiment analysis. On top of things like user frustration and wins, do you feel like there are other examples of how we can use sentiment analysis to understand how well our systems are working?

— So, yes. I'm not 100% sure about sentiment in this context, but if I reframe the question as semantics, then yes, there are a lot of things. There are a lot of really specific examples for different products that you find as you start going through the data. The date one, where it's using dates that are multiple years in the past, is really tricky, because it's not just whether it references a past date. Think of a search engine: if it says this thing happened in 2021, it might literally have happened in 2021. But if it says, "Here are some recent events from the last few years," and it starts with 2023, then that's an issue. So it gets very specific to each product. At the highest level, though, one of the ones we think about is task failure. This is every single time the agent says it can't do something: "I'm sorry, I can't do that right now," or "I'm not able to do that," or "I don't have permission," or "this isn't working." That's actually one of the most interesting ones, I think.
Honestly, that's related to refusals, but we bucket them differently. A refusal is the agent saying it's not allowed to do something, versus saying it can't do it. Those are some of the ones on the agent side. On the user side, there's forgetting, for example. For anything that has a memory system, especially companions, I think that ends up being really interesting. And again, you can then bucket those: what kinds of things is it forgetting? Sam Whitmore, who was working on Dot, has a really good article about memory and how there are a lot of different kinds of memory. Is it situational? Is it about a friend? And so on.

— That makes sense. On the same topic of data, of user frustrations, people are really curious about other interesting stories, or what kinds of features you can put in beyond just the text or the messages themselves.

— Sorry, one more time?

— So, when you think about understanding things like user frustration, I guess the question is really looking for case studies: what kinds of features are you using on top of just the message history?

— Yes, yeah. There are a lot of really interesting things. Again, I think about two buckets. I kind of use the words explicit or implicit, or manual or semantic. For example, if you see that a tool failed five or ten times in a row, that's very interesting. You can just see that in the OTel logs. That's generally really interesting, and the reason why it failed is very interesting.
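As an illustration of the explicit kind of signal, here is a small sketch that finds runs of consecutive failures of the same tool. It assumes the tool calls have already been pulled out of traces into an ordered list of `(tool_name, succeeded)` pairs; that extraction step, the tool names, and the default threshold of five are assumptions for the example.

```python
from itertools import groupby


def failure_streaks(tool_calls, min_streak=5):
    """Return (tool_name, streak_length) for every run of consecutive failed
    calls to the same tool that is at least min_streak long.

    tool_calls: ordered list of (tool_name, succeeded) pairs.
    """
    streaks = []
    # groupby with no key groups consecutive identical (name, ok) pairs.
    for (name, ok), run in groupby(tool_calls):
        length = sum(1 for _ in run)
        if not ok and length >= min_streak:
            streaks.append((name, length))
    return streaks


# Hypothetical sequence of calls from one conversation.
calls = (
    [("web_search", True)]
    + [("upload_file", False)] * 5
    + [("upload_file", True)]
    + [("web_search", False)] * 2
)
print(failure_streaks(calls))
```

A check like this needs no model at all, which is why it is cheap to run across all production traffic. The reason for the failure still has to come from the error payloads.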
You can also see things like the user repeating themselves, which is often a very interesting one. You can detect that just with embeddings: essentially, compare the user's message to the one before. Or you could do it with a prompt that looks at the whole conversation. That ends up being very interesting. I think users will often just copy and paste the previous message, or regenerate, or whatever it is. The last one I'll add, on the easier end, is that there are a lot of regex-related things, for formatting and stuff like that, which you can often just apply as a check. That's often one of the more useful ones as well. Keywords are also really underutilized. I don't know if that answers the question exactly.

— Yeah, I think that definitely covers a little more of this idea of feature engineering when it comes to doing this kind of detection. I think one confusion folks are having is really around using LLM as a judge. I

### [42:49](https://www.youtube.com/watch?v=nKmPZVxfzY0&t=2569s) Deep Search and LLM as a Judge

know that you guys have some pretty cool tools around using LLM as a judge, especially around searching things and then labeling things yourself. Could you talk a little more about that? I think that's one of the coolest features of Raindrop.

— Yeah. It's a really good question. We use LLMs to label data for people: label issues, label tests. A lot of people ask us, "Is that LLM as a judge?" I think LLMs as a judge, as you'll usually see them described, involve getting an LLM to score the output. You'll see this example of "how funny is this joke?" or "how good is this writing?" Maybe it's binary, maybe it's a score from one to five. We've seen a couple of problems with that approach. One is that (a) it requires a lot of evals, and (b) you essentially need evals for your evals: you need a way to know that those scores are aligned with the scores you would choose. So it becomes another product to build. And we're all here because we know how hard it is to get a prompt right in the first place. So now, every single time you want something scored, you need to have an eval for it. The other problem, which is really crucial and not talked about enough, is that this sort of thing is most useful when you have really good coverage across all of your data. If you're using the same model to score the output as you're using to generate it, or potentially a more expensive model, you're only going to be able to cover a very small percentage of production data. So what we do instead is this feature we have called deep search.
I was showing a little video of it before, but the way it works is that it starts by doing a semantic search and then pulls in an LLM to rerank and score, and essentially decide for every single event whether it's a match for what you're searching for or not. We think about it as essentially a binary classifier. You describe an issue, like "using the wrong dates," and then you define criteria. As you mark things as right or not, we generate questions and ask them to you, like "Do you mean when it uses this phrase?" What you end up with is not "Claude thinks this is a three out of five," but "this issue that you've defined, and calibrated the meaning of, is present and is affecting 8% of users." So we automate that entire process of searching through your data. We train a little model under the hood so that we can process millions of events a day without anyone going broke. The product is raindrop.ai, and the feature is called deep search.

— Craig, I don't fully understand the question you're asking here, but maybe you can chime in. It's around model capabilities.

— Well, I'm just asking: it seems to me that I haven't heard announcements from any of the model creators that they're really catering to helping us solve this challenge, which is how do we control the outputs from these models in ways that let us reliably produce results for users? And what kinds of features would we be listening for if they were working on it?

— That's a good question. I think it's a really hard problem, and there are actually more things being done than you would think.
They're often very buried in the developer documentation, and they don't go viral on Twitter. For example, OpenAI, with GPT-5, released CFG support (context-free grammars), which makes generating DSLs, or any schema or language, a lot more constrained. I found it to be a little slow in practice. But that's the kind of feature where you can essentially say, "These are the rules of my language; adhere to them." If you're familiar with structured outputs, that's another example. We use structured outputs a ton: you can say, "Here is a Zod schema" (I think they support Pydantic as well), "generate an object that conforms to it." I would say CFGs are kind of an extension of that, since some things are very hard to define as an object schema. Those are some of the ones that stand out to me. But I do think there's more. For example, I don't know if GPT-5 still has logprobs, but log probabilities did make debugging specific prompts a lot easier for a while. That was in the OpenAI API, I think just for completions, though. No one really used it, even though I did. I thought it was very useful. Anyway, there's a lot of stuff buried in the documentation.

— I would also call out that Cohere and Anthropic, for example, have citation APIs, right? One of the questions I saw here was, "How do you make sure models are citing things correctly?" There are APIs for those kinds of things. I don't know if GPT-5 has citation APIs. I imagine they're probably working on that to some extent.
Yeah, but I think you're right in the sense that these are not the things that tend to go viral, so maybe they're just harder to discover. You know, a job for the devs. One question I think would be

### [48:49](https://www.youtube.com/watch?v=nKmPZVxfzY0&t=2929s) Monitoring AI Products Effectively

really great to hear your answer on is: how do you differentiate yourself from things like LangSmith or Phoenix or any of these other monitoring tools? My general idea is that this is more tracing versus actually understanding how people are using your product, but I would love to hear more.

— Yeah, it's a good question. We really think of ourselves as monitoring. Pretty much every platform will say they have monitoring for AI products. I think what they mean by monitoring is things like latency, token counts, and costs, which are obviously important. But when we talk about monitoring, what we mean is: is your app doing well or not? Is this prompt better than that prompt? Is this model better than that model in the real world? We have the benefit of focus. LangSmith, Arize, and so on generally treat everything as a prompt or a span, as a high-level thing. We really focus on conversational apps. Multi-turn apps is actually the better way to put it: apps where there's a user input and an assistant output. And we are the only platform that actually trains models to detect these things across all of your production data. We really focus on experimentation. We really focus on alerts, for example: if you push a change and it's worse, you'll get a notification describing what changed. Most of our customers actually use us with either LangSmith or Braintrust, and for now they're quite compatible.

— Yeah, I generally find that the one place where Raindrop really shines is the ability to test a hypothesis, right?
If I'm using something like Braintrust, maybe I notice that all of a sudden things are more expensive because the token count is higher, and I can track that, or maybe the latency has gone up. But it's really hard for me to go in and say, "You know what, I have a hypothesis. Can I find examples of this kind of usage? How often am I giving discounts?" I could track the discount creation tool, but if there are just conversations about it, that's where something like deep search really helps. I can say, "I want to search for examples where I'm offering coupons and discounts," label 20 examples, and now I have a way of monitoring what that distribution looks like over time. That monitoring over time is something I think is really valuable. I know we have five minutes left, and the question I ask everyone is: what is something you feel folks are not thinking about as they build out these applications? People have suggested things like being a little more cache-optimal, or thinking about evals, obviously. But from the production applications you've seen, what is something you feel folks are not asking themselves as they build out these applications?
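The hypothesis-testing workflow from this section (define an issue, classify every event as match or no match, then watch the share of affected users over time) can be sketched as follows. Everything here is an assumption made for illustration: the events are invented, and a keyword check stands in for the calibrated binary classifier, which in the product described is a small trained model rather than keywords.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical conversation snippets: (day, user_id, text).
EVENTS = [
    (date(2025, 12, 1), "u1", "Can I get a discount on the annual plan?"),
    (date(2025, 12, 2), "u2", "How do I export my data?"),
    (date(2025, 12, 3), "u3", "What is the refund policy?"),
    (date(2025, 12, 4), "u4", "Reset my password please"),
    (date(2025, 12, 8), "u1", "Sure, here is a coupon for 20% off"),
    (date(2025, 12, 9), "u2", "I was promised a discount yesterday"),
    (date(2025, 12, 10), "u3", "Where are my invoices?"),
    (date(2025, 12, 11), "u4", "Thanks, that worked"),
]


def matches_issue(text):
    """Stand-in binary classifier for 'conversation involves a discount'.
    A real one would be calibrated on labeled examples, not keywords."""
    return any(word in text.lower() for word in ("discount", "coupon"))


def weekly_user_share(events, classifier):
    """Share of active users per week with at least one matching event."""
    seen, hit = defaultdict(set), defaultdict(set)
    for day, user_id, text in events:
        week = day - timedelta(days=day.weekday())  # Monday of that week
        seen[week].add(user_id)
        if classifier(text):
            hit[week].add(user_id)
    return {week: len(hit[week]) / len(seen[week]) for week in sorted(seen)}


shares = weekly_user_share(EVENTS, matches_issue)
for week, share in shares.items():
    print(week.isoformat(), f"{share:.0%} of users")
```

The output is a statement of the form "this issue affected X% of users this week", which is the kind of number the talk contrasts with a one-to-five judge score. Comparing consecutive weeks is also the natural hook for an alert after a prompt or model change.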

### [52:00](https://www.youtube.com/watch?v=nKmPZVxfzY0&t=3120s) Future of AI Tools and Data Models

I think one shift we're seeing a lot is in the kinds of tools that models use. If you think about a year or two ago, what a tool meant was get_weather, right? That was a tool. Now it's "search the internet." It's "run any random shell command." It's "generate a photo," where the input is just a description of what to generate. So these tools are getting extremely open-ended. The reason I think it's relevant for us as builders is that, increasingly, products are going to need what I keep calling a domain-specific language, or whatever. Instead of just having an API and saying, "Here are all of my APIs," you have to think about your company as a data model: what is your company's data model, and how do you give an LLM access to it? Instead of just get_weather or things like that, it needs to be able to run, essentially, the SQL query that gets it the exact information it needs. So how do you allow it to do that, but tied to your company's data model and scoped to the correct access, and so on? I think companies that are in the data space have a leg up here. But increasingly, models will need really open-ended and extremely powerful tools, and we as builders need to be thinking about how we define those primitives in our company really well, so that models can interpret them and actually invoke them and write whatever it is.

— That was great. Awesome. On behalf of

### [53:58](https://www.youtube.com/watch?v=nKmPZVxfzY0&t=3238s) Q&A Session

everyone, thank you so much for all this information. I'll see you in SF too. — See you soon. — Awesome.

---
*Source: https://ekstraktznaniy.ru/video/52972*