# Why the Best AI Agents Start With Evals, Not UI

## Metadata

- **Channel:** Jason Liu
- **YouTube:** https://www.youtube.com/watch?v=JdrrYtpj-XA
- **Date:** 13.01.2026
- **Duration:** 50:02
- **Views:** 286
- **Source:** https://ekstraktznaniy.ru/video/52970

## Description

Building AI agents in production requires more than just connecting an LLM to tools. What if the key to shipping successful agents lies in designing the right evaluation framework before you write a single line of UI code?

In this talk, David Kim (Design Engineer at Braintrust) joins us to share lessons learned from building and shipping Loop, Braintrust's prompt optimization agent, based on real-world experience going from prototype to production.

We discuss:
• Starting with "aspirational evals" - designing evaluation environments that define what success looks like before building the product
• Why they built Loop with four specific tools: run eval, get summary, sample rows, and edit task
• How Claude Sonnet performance improvements made the agent viable for production after initially poor results
• Shipping to production in two weeks once eval performance validated the concept
• Real challenges post-launch: handling complex multi-step workflows, preventing infinite loops, and mana

## Transcript

### Introduction and Session Overview []

Thanks, everyone, for coming to this session, the second session of today: shipping an agent to production from a designer's perspective. I think this talk is going to be pretty exciting. Most of us know Braintrust as a tool that helps with observability, but today's talk won't actually be about that. I asked Anka and the team to think a little more about how they're building their own agents. So this will mostly be about how to think about these things from a design perspective: how do you think about building the agent itself, and then a little more about how Braintrust uses Braintrust, I imagine, to run these experiments and really learn what works in production. As always, if you have any questions, we'll leave them to the end. If you check the Zoom chat, there'll be a Slido link. Please add your questions there and upvote questions before the Q&A, and we'll kick off from top to bottom and ask any questions that make sense. With that in mind, David, the floor is yours. — Thank you. Appreciate it. Hi. My name

### Braintrust's Journey in Building an Agent [1:01]

is David and I'm a design engineer at Braintrust. Like Jason mentioned, at Braintrust we relatively recently went on a journey of building an agent ourselves, a journey with a lot of learnings. We've partnered with and helped our customers build incredible agents: companies like Notion and Vercel, who have incredible agents in production. But we ourselves hadn't had much of an AI surface in our product before this. So this has been a great opportunity for us to go through the process, dogfood, and go on this journey together with our customers. Today's talk will walk through that journey: how it started, how it evolved once it made it into production, and, to end, the lessons learned through that process. My hope is that as you go on to build your own agents, for your side project, your business, or as part of a company, some of these can be helpful.

To start off, I want to talk about how Loop started. Jason mentioned that you all sort of know what Braintrust is. It's an eval infrastructure and observability startup. We let you do things like get logs of your app in production, and we provide the eval infrastructure, the primitives to run evals, so you can create a flywheel of getting signals from users and iterating on your AI app to make it work better. I don't mean this to be a pitch for the product; I'm describing what our product does because the agent is going to be used in that context. Given our platform and the product that we have, we had this idea: we really think we have all the context necessary for an agent to go and optimize a prompt. So we have the data

### Initial Challenges and Prototyping [3:12]

set, we have the prompt, we have the score, and so we can have the agent run evals in a loop and see: if I edit this in the prompt, will the scores go up for the task? It can go in an unsupervised way, in a loop, and see if it can improve the prompt. So we thought we had all the context to make this happen.

We took this idea and translated it into an eval. It looked something like this. We had various scenarios set up. On the easier end, we had an eval environment where the dataset was something like: the input is "the boy with a lightning scar on his forehead" and the expected answer is "Harry Potter". Another could be "a boy who got bit by a spider and started shooting webs", and the answer would be "Spider-Man". We had hundreds of these inputs and a starter prompt instructing the LLM to guess what the movie is. Then we can use a score like exact match to see if the prompt is doing the job. On the more difficult end, we had eval setups for SQL generation, and for taking an API response and deducing what platform and what the API response was for.

What we did was prototype Loop around this eval design and give it four tools: run eval, get summary, sample rows, and edit task. Run eval, as the name suggests, is the ability to run the eval in the environments I just described. Get summary gives the agent the ability to retrieve context around the summary of the environment. It can see what prompts are there, and, if an eval has been run prior, the summarized data: average score, average latency, duration, cost, things like that.
It can also sample rows from the eval that's been run to get detailed information, so it can look at the exact score and the exact duration and cost for each of the inputs coming from the dataset. Edit task is the ability for the agent to actually suggest and overwrite the prompt. So in theory, Loop will go into these environments, run the eval, see how the scores come out, get the summary, sample rows to see what it thinks it can do to improve the app, make those edits, and then go in a loop until it no longer feels it can improve on things.

What we found when we ran these evals with various environments is that at first it didn't perform well. This is what our company would call an aspirational eval. It's an eval that starts from an idea, and we know it isn't going to perform well at first because it's a hard task. But as you tweak prompts and as new models come out, what we saw was that the performance on these evals went up. It was Claude 4 Sonnet when we distinctly felt, okay, this performs well enough that if we were to package it up as a product and put it out into the world, it would bring good utility to our customers.

So when Claude 4 Sonnet came out and we started seeing pretty consistent good performance out of Loop, we sprinted and did a bunch of prototypes to see what the UI experience should feel like. We played around and flirted with a Dynamic Island kind of experience, and also tried floating things. We played around with different microanimations, loading states, and things to communicate state.
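The setup described above (a dataset, an exact-match score, and four tools used in a loop) can be sketched roughly as follows. This is a minimal illustration, not Braintrust's implementation: `Environment`, `optimize`, `fake_llm`, and `fake_propose_edit` are hypothetical names, and the two fake functions stand in for real model calls so the sketch runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

DATASET = [
    {"input": "the boy with a lightning scar on his forehead", "expected": "Harry Potter"},
    {"input": "a boy who got bit by a spider and started shooting webs", "expected": "Spider-Man"},
]


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0


@dataclass
class Environment:
    """Hypothetical eval environment exposing the four tools from the talk."""
    prompt: str
    task: Callable[[str, str], str]  # (prompt, input) -> output
    rows: list = field(default_factory=list)

    def run_eval(self) -> None:
        self.rows = []
        for row in DATASET:
            output = self.task(self.prompt, row["input"])
            score = exact_match(output, row["expected"])
            self.rows.append({**row, "output": output, "score": score})

    def get_summary(self) -> dict:
        scores = [r["score"] for r in self.rows]
        avg = sum(scores) / len(scores) if scores else None
        return {"prompt": self.prompt, "avg_score": avg}

    def sample_rows(self, n: int = 5) -> list:
        return sorted(self.rows, key=lambda r: r["score"])[:n]  # worst rows first

    def edit_task(self, new_prompt: str) -> None:
        self.prompt = new_prompt


def optimize(env: Environment, propose_edit, max_steps: int = 5) -> float:
    """Run eval, inspect, edit, repeat; stop when no edit improves the score."""
    env.run_eval()
    best = env.get_summary()["avg_score"]
    for _ in range(max_steps):
        previous = env.prompt
        candidate: Optional[str] = propose_edit(env.get_summary(), env.sample_rows())
        if candidate is None:
            break
        env.edit_task(candidate)
        env.run_eval()
        score = env.get_summary()["avg_score"]
        if score <= best:  # no improvement: restore the previous prompt and stop
            env.edit_task(previous)
            env.run_eval()
            break
        best = score
    return best


def fake_llm(prompt: str, text: str) -> str:
    # Stand-in for a model call: only answers tersely when the prompt asks for it.
    title = "Harry Potter" if "scar" in text else "Spider-Man"
    return title if "title only" in prompt else f"I think it is {title}."


def fake_propose_edit(summary: dict, worst_rows: list) -> Optional[str]:
    # Stand-in for the agent's reasoning over the summary and sampled rows.
    if summary["avg_score"] == 1.0:
        return None
    return summary["prompt"] + " Answer with the title only."


env = Environment(prompt="Guess the movie.", task=fake_llm)
final_score = optimize(env, fake_propose_edit)
print(final_score, "|", env.prompt)
```

The `max_steps` cap is an assumption added here to keep the loop bounded; the talk only says the agent loops until it no longer thinks it can improve.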
But the important thing here is that we had the eval setup clearly designed to see whether this agent can do the task we want it to do, and whether, when the agent can do this task well, it will bring good utility to our customers. All of that was encapsulated in the eval, so we had the clarity to be able to sprint and ship this thing really fast. At the end of two weeks, I think, we ended up with this in production: a traditional conversational agent UI on the side that can interject in the UI when it thinks it has a suggestion to make. You're all probably familiar with how Cursor works. It's very similar: when it needs to take an action that requires confirmation, it will ask, but it will also go in a loop, owning its own control flow, to get the task done.

That is the story of how we were able to go from an idea, set up an eval, and over a longer period wait for new models to drop, do prompt tweaking, and do some optimization of the tools to get it to work well, and then sprint really fast to put it out into the world as a product. Now I'll go into how it evolved from that to what it is today. Once this was out in production, what we saw was that

### User Feedback and Iterative Improvements [9:21]

our users were trying to do things that Loop couldn't do at the time, but we were getting really good signals through our observability on what they wanted to do. They would type things like "hey, I want to generate rows" into the chat, only to find out that Loop could not, at the time, edit dataset rows in our product. They wanted to create scores, whether LLM-as-a-judge or code scores, only to find out that it couldn't. They wanted Loop to summarize things for them, or generate a BTQL query, which is our own flavor of SQL. We built our own data store because we wanted it to perform well with AI workloads. The problem is that because of that we have our own flavor of SQL with a bit of a learning curve, and of course users wanted AI to do that for them. Anytime there's a UI error on the screen, there's a lot of copy-pasting it into the chat to see if Loop can fix it. And users were also asking general questions about Braintrust that we can answer by doing retrieval on the docs.

These signals that we got from the logs, we then translated into evals. The great thing about having observability and eval infrastructure set up, and I'll talk about this again later, especially in this kind of AI application where the main entry point for users is, not always but often, an open prompt box, is that users will tell you what they want to happen in natural language. That can be an incredibly rich signal for what they want and what we should build next. That's what we did, and this is an example of how we translated that into an eval for generating synthetic dataset rows.
We took sentences like "generate 10 more rows", and there were a lot of inputs users gave with specific instructions, and we created a dataset from those and also injected more that we thought were relevant to this eval. Then we set up Loop with an additional tool, the ability to edit data, and ran Loop against this eval, looking specifically at whether it follows the instructions and generates the specific rows users are asking for.

The other important thing was semantic consistency. If we go back to the movie guesser example, the rows would look something like: the boy with the lightning scar on his head is Harry Potter; the boy bit by a spider is Spider-Man. Then the thing Loop could generate from that, keeping the semantic consistency, would be something like: a hobbit chasing after a ring to rule them all is Lord of the Rings. So you want it to infer the pattern that exists in the context and suggest things that keep that consistency. But while keeping the consistency, you don't want Loop to generate the same thing 10 times in a row. You want it to bring diversity within the consistency. So we set up scores that check for things like that. We were able to run these evals, and the evals that already existed made sure things didn't regress, and ship this relatively quickly.

For the next one, we didn't have to create any new tools.
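As a rough illustration of the kinds of checks mentioned for generated rows, here is a sketch of three cheap code scores. The function names are hypothetical. Note that true semantic consistency (does a new row follow the "describe a character, name the movie" pattern?) would likely need an LLM judge; the structural check below is only a weak proxy for it, swapped in so the example stays self-contained.

```python
def instruction_following_score(generated: list, requested_count: int) -> float:
    """1.0 when the agent produced exactly the number of rows the user asked for."""
    return 1.0 if len(generated) == requested_count else 0.0


def diversity_score(generated: list) -> float:
    """Fraction of generated rows that are distinct after light normalization."""
    if not generated:
        return 0.0
    distinct = {
        (row["input"].strip().lower(), row["expected"].strip().lower())
        for row in generated
    }
    return len(distinct) / len(generated)


def structure_consistency_score(existing: list, generated: list) -> float:
    """Fraction of generated rows with the same fields as the existing rows."""
    if not existing or not generated:
        return 0.0
    expected_keys = set(existing[0])
    matching = sum(1 for row in generated if set(row) == expected_keys)
    return matching / len(generated)


existing_rows = [
    {"input": "the boy with a lightning scar on his forehead", "expected": "Harry Potter"},
]
generated_rows = [
    {"input": "a hobbit chasing after a ring to rule them all", "expected": "Lord of the Rings"},
    {"input": "A hobbit chasing after a ring to rule them all", "expected": "Lord of the Rings"},
    {"input": "a boy who got bit by a spider", "expected": "Spider-Man"},
]

print(instruction_following_score(generated_rows, requested_count=3))
print(diversity_score(generated_rows))
print(structure_consistency_score(existing_rows, generated_rows))
```

Here the second generated row duplicates the first apart from capitalization, so the diversity score penalizes it even though the row count matches the request.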
We just had to build in more UI entry points. This was the problem users were conveying to us through the logs: hey, I'm going to copy-paste this error message I see in the UI, and I want you to fix it. So we took all the UI messages users were copy-pasting into the prompt box, turned that into a dataset, and ran it against this eval to see whether, when Loop was done editing the data underneath, it resulted in a successful run. If it did, it means it resolved the error it was given. In addition to taking the logs and turning them into a dataset, we also looked into our codebase to see what other errors can surface in the UI and added those to the dataset to bring more completeness to the eval design.

Another one is "summarize the experiment". Users would use the prompt box to ask Loop to summarize things for them, and this Loop had the tools to do. This is another case where we didn't build any tools, but we had to do some fun, interesting, slash annoying tweaking. Whenever users asked anything related to summary, summarize, or overview, Loop would over-index on the word "summary" and use the get summary tool, which, as I explained before, grabs context that's averaged out: for an experiment, it grabs things like the average score, average cost, latency, and so on. Loop would over-index on the word "summary", use the get summary tool, stop there, and just spit back the summary data it got. In reality, if we zoom out and think about asking our friends or co-workers "can you summarize this for me?", what you're looking for is not "take the average data and tell me what it is". What you're looking for is: can you go through this, look at it deeply and in detail, grab the most important bits and nuggets, and give me that information.
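One way an eval could catch the "stop after get summary" behavior is to score the agent's tool-call trace alongside the text it returns. The sketch below assumes the trace is available as a plain list of tool names; that shape, the function names, and the 120-word budget are all illustrative assumptions rather than anything stated in the talk.

```python
def detail_lookup_score(tool_calls: list) -> float:
    """1.0 if the agent looked at individual rows, not only the averaged summary."""
    return 1.0 if "sample_rows" in tool_calls else 0.0


def conciseness_score(summary_text: str, max_words: int = 120) -> float:
    """1.0 within the word budget, decaying proportionally once it is exceeded."""
    words = len(summary_text.split())
    if words == 0:
        return 0.0
    return 1.0 if words <= max_words else max_words / words


shallow_trace = ["get_summary"]
thorough_trace = ["get_summary", "sample_rows"]

print(detail_lookup_score(shallow_trace))
print(detail_lookup_score(thorough_trace))
print(conciseness_score("Scores dropped on SQL rows with nested joins."))
```

Scoring the trace this way checks behavior without hardcoding a tool order into the prompt itself.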
So in this case we would want Loop to use not only get summary but also sample rows, to get a detailed view of the rows and use that information to give a more detailed but still concise summary. So we checked for things like completeness and conciseness to design this. I remember in the prompts for this we were tempted to hardcode it: when this happens, use specifically these tools, and only then proceed to the next thing. But we saw that doing something like that resulted in regressions across the board. So we ended up making the instruction a little more vague, something along the lines of: when you have to give an overview or summarization of some information, what is generally correct is to get detailed information about the thing at hand, get all the context, and then and only then proceed with what you were going to do. We saw that this resolved the issue of Loop over-indexing on certain words, and it also didn't regress on the other tasks we want Loop to be able to do.

The other one, and this is my personal favorite: we saw that users were trying to parse the logs. Logs come in about how users are interacting with your AI app, and they'd parse the logs to get analytical insights. For example: how many user interactions did I have today? Over the past week, by what percentage did it go up? Break it down by the hour: was it higher during the day or at night? Over a month, what's the week-to-week increase in user interactions? Being able to ask questions like that and have the agent slice and dice the logs to come up with the analytical insight was something we saw users wanted to do.
I'm showing two separate things here, but the first thing we did is exemplified by "find popular use cases in the logs", which is more of an analytical thing. For this we built two tools: infer schema and run BTQL. Infer schema is the ability for Loop to grab context around the shape of the data it's looking at. It gives it all possible fields and popular values for those fields, so Loop can take that context and generate a valid BTQL query, because it knows the shape of the data at that point. So we were able to get Loop to give really good analytical insights and test that with this eval.

Then we realized we could take the same tools and provide another experience users were asking for, which is: help me write this query. I don't know BTQL. I know some SQL; this is how I would write it in SQL. What does that look like in BTQL? So we built another eval that was geared more toward generating a query than running queries to gather context for an analytical insight.

The fun thing here was that this was one of the first times we played around with dynamic prompting. If you remember the analytical insight part, you can imagine the behavior we want from Loop is to be trigger-happy with the run BTQL tool to grab different kinds of context: I want you to make experimental queries to grab the right context and make the analytical insight rich. Whereas with the other one, it's more like: just generate this query. I'm not asking you to come up with some analytical insight. Just generate it; I'll run it.
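To make the infer schema idea from earlier in this passage concrete, here is a minimal sketch of a tool that reports every field seen in a batch of log records plus the most common values for each. The function name and the record shape are assumptions for illustration; the real tool's output format is not described in the talk.

```python
from collections import Counter, defaultdict


def infer_schema(records: list, top_n: int = 3) -> dict:
    """Map each field name to its most frequent values across the records."""
    value_counts = defaultdict(Counter)
    for record in records:
        for field_name, value in record.items():
            value_counts[field_name][str(value)] += 1
    return {
        field_name: [value for value, _ in counts.most_common(top_n)]
        for field_name, counts in value_counts.items()
    }


logs = [
    {"workflow": "optimize_prompt", "model": "model-a"},
    {"workflow": "generate_rows", "model": "model-a"},
    {"workflow": "generate_rows"},
]

schema = infer_schema(logs)
print(schema)
```

Handing a compact description like this to the model before it writes a query is what lets it reference fields and values that actually exist in the data.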
So even though programmatically the mechanism to generate the query, validate it, and so on is the same, the user intent, or the UX the user expects, is very different. So depending on the page you're on, we give slightly different context and slightly different prompts. On one end, if it's being used for analytics: be trigger-happy, go out there, find the insights you need, run it five, six, seven, eight, nine times if you need to. With the other one, we wanted it to be concise, be fast, and have a different behavior. So this was an interesting one for us as well.

There's a bunch of other stuff that we ended up building based on the signals we saw in the logs. But all in all, Loop went from a very specialized prompt optimization thing, which is still the main thing Loop does well, to something broader. We were able to expand the category of workflows Loop is good at and can solve for, and ended up with a suite of these tools and a lot of evals to make them work well. Now it can do a lot more. In fact, I think either last week or two weeks ago, generating synthetic dataset rows overtook prompt optimization as the most used workflow, just by a little bit. It was pretty interesting to see how we were able to take signals from users through our logs, build this out, and cater to the users. And I think the fact that it's being used a lot is a testament to the fact that we were able to solve a problem for the users.

So this encapsulates how it went from just making it into production to using the signals we get from users through our logging to inform what we build and how we design the evals, becoming more of a general-purpose assistant agent for the product.
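The page-dependent prompting described above can be sketched as a shared base prompt plus a per-page guidance block. The page keys and the prompt wording here are invented for illustration; only the idea (same tools, different behavioral guidance depending on where the user is) comes from the talk.

```python
BASE_PROMPT = "You are an assistant inside an eval and observability product."

# Hypothetical page identifiers mapped to behavior guidance.
PAGE_GUIDANCE = {
    "logs": (
        "The user wants analytical insight. Run several experimental queries "
        "to gather context before answering."
    ),
    "query_editor": (
        "The user wants a query they will run themselves. Write it and return it. "
        "Be concise and fast; do not run exploratory queries."
    ),
}


def build_system_prompt(page: str) -> str:
    """Compose the system prompt for the page the user is currently on."""
    guidance = PAGE_GUIDANCE.get(page, "")
    return "\n\n".join(part for part in (BASE_PROMPT, guidance) if part)


print(build_system_prompt("logs"))
print("---")
print(build_system_prompt("query_editor"))
```

Unknown pages fall back to the base prompt alone, which keeps the default behavior unchanged when a new page is added before its guidance is written.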

### Lessons Learned and Best Practices [22:34]

I want to go into lessons learned across the board now. I've talked about a lot of them already, so it might be a bit repetitive, but I want to drive some of these home. You don't have to use Braintrust, but you really should start with eval infrastructure and observability. There are a lot of great tools out there. Once again, in an AI app, whether you're building an agent or another kind of AI capability, logs are so rich. It's a little different from traditional products, where we use logs mainly for debugging: what errors pop up and things like that. We were able to use logs as a real signal for what users wanted out of our agent, and also to check where the vibes are off and turn that into a dataset to eval, to get existing features to behave correctly and provide a great user experience. So logging is great; please start with it. And having eval infrastructure will allow you to move fast and make sure things don't regress. I will talk about how regressions did happen for us, and they will probably happen for everyone, but it will help with that.

Another thing that was really helpful for me as a designer is thinking about evals as a design tool. If we think about the design process, and I can very confidently say as a designer that the design process is different for every company and every person, largely I think it maps into these buckets: understand the problem the user has, design potential solutions to that problem, bring them to the users, whether by shipping or by getting in the room with the user, get feedback, use that signal to better inform your understanding of the user problem, design a different solution, and then get feedback again.
It's an iterative loop of these buckets. For us, we can encapsulate most of this in an eval. The logs are the way we understand the problem users have, or the thing users want out of your agent. You take that insight and design an eval around it, and the eval is the prototype, or the design solution, in this case. Based on the eval, you tweak your system prompts or your tools to make the agent behave a certain way, then ship it, and you can do this iteratively. When I thought of evals as this prototyping or design tool, I started writing a lot of them. There are a lot of evals I have that Loop just performs poorly on, but I'm really looking forward to Loop performing well on them, because I know then we can try out more novel user experiences. So thinking about evals as a design tool has helped us move really fast and build out new experiences really fast as well.

And the parallel holds in more ways than one. I don't know if you work with designers, but you've probably dropped into a Figma file that looks like a crazy mess. There are 10,000 unnamed frames. You have no idea where things are or what's going on. What am I supposed to be looking at? I noticed this happens with evals too: I've run a bunch of them and they end up being eval number 10,000, and I forget whether the good one was 9,000 or 9,001. So I think this parallel, jokingly, works well too. But use evals as a design tool, and I think what you'll see is that the agents you end up with, once you obsess over making them perform well in various conditions, will have generally good UX.

Next: system prompt design.
I think I talked about this a little bit, but what I would suggest, and what we have seen work well, is to try not to do the "you must use this tool and that tool in this order" thing, because it's just going to create regressions across the board. See if you can, and this is tricky and going to be annoying, find the right words to give the agent general values, guidelines, or beliefs it should have about how to work, and let that be the guidance for the way it uses the tools. We found that works really well. We've also found that organizing prompts in a specific way works really well for us as we rapidly tweak them, whether we're building out new experiences or we see regressions or new cases in our logs where our agent wasn't performing well. Once you break them down into different sections, it's easy to go in there and snipe a change without letting everything go bad. So that's another thing I wanted to share.

Another thing is that regressions happen. We have a crapload of evals. We're not going to run all of them every single time we make a change to a tool or a tiny bit of a system prompt. We're going to run the ones we think are really important not to regress on. So just know that regressions happen, but do your best to mitigate them by having those important evals run in CI/CD, or by constantly looking at the logs. Have alerting or live scoring set up so you can be alerted when the vibes go really off on live user interactions. But also don't beat yourself up if regressions happen.

The other one is: don't wait for perfect to release. We released something pretty minimal, in this prompt optimization case, and found this beautiful thing, which is that users wanted a lot more out of our agent.
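The idea of running only the important evals on every change can be sketched as a small regression gate: compare the current scores of a chosen subset against stored baselines and flag any drop beyond a tolerance. The eval names, scores, and the 0.02 tolerance are made up for illustration, and how scores are produced or stored is left out.

```python
def find_regressions(baseline: dict, current: dict, critical: list,
                     tolerance: float = 0.02) -> dict:
    """Return {eval_name: (baseline_score, current_score)} for critical evals that dropped."""
    return {
        name: (baseline[name], current[name])
        for name in critical
        if current[name] < baseline[name] - tolerance
    }


baseline_scores = {"prompt_optimization": 0.90, "generate_rows": 0.80, "summarize": 0.75}
current_scores = {"prompt_optimization": 0.85, "generate_rows": 0.80, "summarize": 0.60}

# Only the evals judged too important to regress on are checked on every change.
critical_evals = ["prompt_optimization", "generate_rows"]

regressions = find_regressions(baseline_scores, current_scores, critical_evals)
for name, (before, after) in regressions.items():
    print(f"REGRESSION {name}: {before:.2f} -> {after:.2f}")
```

In a CI job, a non-empty result would typically fail the build. Note that `summarize` also dropped here but is not flagged because it is outside the critical subset, which is exactly the trade-off described: faster checks, at the cost of some regressions slipping through to be caught by logs or live scoring.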
So don't let the perfect be the enemy of the good. Regressions will happen. You're not going to have a perfect agent, possibly ever. So just get out there and learn what you can.

The other thing, and this is another thing evals allowed us to do: I used this graph before to portray the idea of an aspirational eval, where we set up that optimization eval, and when new models came out, with new prompt tweaking, we got it to work really well. Another part of that lesson is that when you have these evals set up, the beautiful thing is that with a one-line change you can swap in a new model, run the agent against the same test cases, and see how it does. You might not release that new model with the agent, because you find that, oh crap, it's thinking too long and making the UX bad, but at least it'll be an informed decision, as opposed to waiting around a week or two to release that new capability or new model with the agent. So have evals set up, and you'll see that when new models come in, you'll have a lot more confidence to ship or not ship, and you'll know what you need to do next.

If I'm being honest, something we haven't been great about, and that I have on my to-do list, is that Loop doesn't perform very well with GPT models and o3 models, the OpenAI models. We know this through our evals, and we're working on prompt tweaking so that it behaves well on those models as well. What we've seen is that different models require different prompting strategies, and evals catch that. So we're working on it, but I'm grateful we were able to catch that behavior.

Another thing, and I had no idea this would end up happening, but I'm thankful we had the infrastructure and the observability set up for it, is collaborating with other functions.
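The "one-line change" for trying a new model, mentioned above, can be illustrated with an eval runner that takes the model name as a parameter and records latency alongside the score, since a model that scores well but thinks too long may still hurt the UX. `run_eval`, `fake_call_model`, and the model names are hypothetical; the fake function stands in for a real model call so the sketch runs offline.

```python
import time


def run_eval(model: str, cases: list, call_model, scorer) -> dict:
    """Run every case against one model and report average score and latency."""
    scores, latencies = [], []
    for case in cases:
        start = time.perf_counter()
        output = call_model(model, case["input"])
        latencies.append(time.perf_counter() - start)
        scores.append(scorer(output, case["expected"]))
    return {
        "model": model,
        "avg_score": sum(scores) / len(scores),
        "avg_latency_s": sum(latencies) / len(latencies),
    }


def fake_call_model(model: str, text: str) -> str:
    # Stand-in for a real model call; the two "models" differ only in casing.
    answers = {"model-a": "Harry Potter", "model-b": "harry potter"}
    return answers[model]


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


CASES = [
    {"input": "the boy with a lightning scar on his forehead", "expected": "Harry Potter"},
]

MODEL = "model-a"  # swapping in a new model is this one-line change
result = run_eval(MODEL, CASES, fake_call_model, exact_match)
print(result["model"], result["avg_score"])
```

Because the cases and the scorer stay fixed, any difference between two runs can be attributed to the model (or to prompts that suit one model better than another).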
I'm trying really hard not to say "non-technical" other functions, because I think that's a derogatory term. But I showed the evals: it's code written with our SDK. It's hard to parse through, though it's a lot easier now because you can get coding agents to describe what's going on for you. Still, it's very daunting for non-technical users to go in and make changes or design their own evals. With this infrastructure and the logging, I was able to collaborate with Mangyang, who's the head of data. She has an incredible amount of domain knowledge around experimentation and how prompt optimization should happen. She was able to go in, set up live scores to see where the vibes were off, see those conversations in real time, and annotate what she thought was wrong. Then she'd study it, go to our UI playgrounds, prompt-tweak Loop, and create a measurable difference in the acceptance rate of one of our tools, the edit task tool. That was really cool to see.

I know we work with a lot of customers who are building, for example, healthcare apps or apps that law firms use. In many cases, when you're building a healthcare agent, or agents that do law tasks, for lack of a better word, it's the lawyers doing the eval design and the prompt tweaking, or the doctors or behavioral scientists doing the system prompt work. It's going to be so much better when those people can go into the UI and play around with things immediately, as opposed to having them figure out how to run these things via code. So this is another benefit we got because we started with eval infrastructure and observability from day one. I think that's about it in terms of the talk I prepared.

### Q&A Session [33:43]

I'm happy to, I believe I'm answering questions now.

— Yeah, we'll definitely jump into some questions. Thank you for this talk. This has been super insightful. I actually just saw the GitHub Action slide and might want to ask about that later, but let's go over some of the top questions first. How does that sound? Cool. I think the first question folks asked was around some of the scoring tools you had. They said that Braintrust offers many LLM-as-a-judge scorers: which ones did you find most useful for the team, and how often do you have to make your own evaluators? I saw some correctness and conciseness stuff. I would love to hear more.

— Yeah. To the question of which ones are most helpful: this is a cop-out answer, but I think it just depends on what you're evaluating. What I can say is that I didn't use our autoevals, as we call them, our out-of-the-box evaluators, exactly as is. I started with them when I first designed the evals, but I would look at the evals and see: okay, wait, it's scoring things, but the vibes are really off here. So then I'd go, okay, I want this to score the way I want it to, and change up the LLM-as-a-judge as I went along. So I think they're great as a starting point, and we provide a lot of them, but they're not perfect. Use those LLM judge prompts as a starting point, but go off of the autoevals and build the one that you feel judges the task the way you would want it judged.
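A simple way to tell whether a judge scores "the way you would" is to label a handful of rows yourself and measure how often the judge agrees. The sketch below assumes both sets of scores are already available as parallel lists of numbers; the function name and the example scores are illustrative.

```python
def judge_agreement(human_scores: list, judge_scores: list, tolerance: float = 0.0):
    """Return (agreement_rate, indices_where_judge_and_human_disagree)."""
    if not human_scores or len(human_scores) != len(judge_scores):
        raise ValueError("score lists must be non-empty and the same length")
    disagreements = [
        index
        for index, (human, judge) in enumerate(zip(human_scores, judge_scores))
        if abs(human - judge) > tolerance
    ]
    rate = 1 - len(disagreements) / len(human_scores)
    return rate, disagreements


human = [1.0, 0.0, 1.0, 1.0]   # your own scores for four eval rows
judge = [1.0, 1.0, 1.0, 0.0]   # the LLM judge's scores for the same rows

rate, to_review = judge_agreement(human, judge)
print(rate, to_review)
```

The disagreeing rows are the ones worth reading closely: they show where the judge prompt needs tweaking before its scores can be trusted at scale.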
And I wish there was a perfect answer, but see the vibes, see how it's judging, and compare. Take a dataset or an eval, score it yourself, have that LLM judge score it, see how it differs, and prompt-tweak the judge itself to get it to behave the way you would score the thing. At that point I think you're going to get a better response than using the thing out of the box.

Totally makes sense. Evan had another great question here, which is around the best practices on deploying and hosting agents on the web. Do you use something like the Claude Agent SDK? Do you have your own SDKs? How is everything done behind the scenes?

— Yeah, for personal projects I love the AI SDK coming out of Vercel. For us, we just built our own. We didn't use any SDKs; we built our own code on top of the proxy that we have. We have a proxy that allows us to hit any model that we have configured. But outside of that, we didn't use any SDK, and we wrote our own code on top of things to have the most minute control over how things work. But I've heard from our customers great success using AI SDK and other SDKs, so I don't think you can go wrong with any of those solutions. It's really going to depend on the specific kind of interaction, the UX you want to provide to the user, but I'm sure you'll find success with any of those as well.

Makes sense. One of the big questions that people had was really around your comment on looking at the logs and finding these other use cases, right? Can you walk us through a little bit more about how you went through this analysis process? Were you just reading, you know, 100,000 logs on day one and just understanding what people wanted?
Were you interviewing people, or were you maybe doing something like clustering, or building classifiers to understand what the proportions are? Because I thought I heard you mention that one use case has now been growing compared to another. I think clarity on that would be super helpful.

— Yeah. No, that's great. So at first when we shipped, I was going through them one by one, because at first release it's not like you have a bunch of users hammering this thing at all times. That's how at first we were gathering signals around, oh wait, the users are asking to generate dataset rows, but we obviously can't do that yet. We saw many cases of that, so we built up the confidence that that's what we should build next. Then, at the point when we had the capability for Loop to slice and dice the logs, to do fuzzy things with the logs, we did start classifying things: okay, look at the past week's interactions and break them down into what kind of workflow users are trying to do. It's a dumb classifying thing, but we use that to inform ourselves at some points too. But a lot of it for us was going through the logs, combing through them, and seeing where users are asking for things that were a bit different than "optimize this prompt for us."

The other thing is, in case I gave the impression that you don't have to talk to users because you have logs: please know that our users tell us all the time, hey, this is cool, but I want to be able to do this with the agent, and we use that as a signal as well. That's a huge signal for us. So we talk to users as well to inform what we do.

Another good question that Evan had was around using things like subagents.
I think a lot of folks have had different experiences with write-only subagents and read-only subagents. Have you explored that at all, and does it make sense for the team to explore these things? What have your experiences been so far?

— We haven't explored it yet, but the conversation came up recently, because with more tools at play, and the more we give the agent the ability to grab more context, we're seeing it run out of the context window more and more. So we started thinking about strategies to make the most out of the context window: compression, or subagents, and things like that. We haven't done anything on that front yet, but that's something we want to try, and when we try, I'm sure you'll see a blog post coming out of us with some evals. So that's something you can look forward to. To be honest, we haven't tried yet, but the idea sounds good for certain use cases: to be able to throw a subagent at a certain task and have it come back with only the important parts.

Yeah, that makes sense. Oh, this is a good question. One of the questions was around how you recommend organizing your prompts to make them easy to find or update. In the past I've seen a lot of companies just have prompts in code, in which case I'm a little confused about how Braintrust would be able to edit that. Or does Braintrust now have some prompt management features as well?

Yeah. So we have our prompts in code, but I think there are multiple strategies. If we're thinking specifically of Braintrust, you can upload a prompt to our product and edit it in the UI, but in code do something like prompt.load, and it will just grab the latest version of that prompt. So if you have someone who wants to edit prompts in the UI and make that the prompt that's being used in your app, you can create links like that as well.
Another thing is we have this thing called remote eval, which allows you to run the agents locally but open up parameters to the agents and surface that up to the UI. So for example, and this is exactly what I did with Mangyang, in the slide where I talked about collaborating with non-technical teammates, very convenient: I opened up a parameter called system prompt, and you can hook that into the agent that's running locally. So in the UI someone can just go and edit that prompt to see how that tweaks the behavior of the LLM. So there are multiple things you can do. It's just whatever fits the workflow best. For us, the source-of-truth system prompt lives in code, but there are other ways I think you could go about it.

I see. That's very cool. Another thing I would love for you to share a little bit more on was the GitHub Actions. Can you talk a little bit about that? I think we really glossed over some of those really cool features.

— Yeah. So we have GitHub Actions where, and I'm not an expert with GitHub Actions, but what I know we can do is say: when there are changes to these files, run them. In this case, for us, it's when there are changes to system prompts or any of the tool code, run the GitHub Actions, and you can define exactly the evals you want to run. For us it would be the evals for what we think are the most important workflows: prompt optimization, the quality of synthetic data generation, the quality of BTQL query generation. We run them whenever there are any changes to code that could affect the agent. That way you can watch out for regressions: if you've made a change to another surface area in the agent, you see whether the other parts regress or not. And you'll just see, and it's obviously not deterministic.
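Because eval scores are not deterministic, a check that runs on every change usually needs some slack before it flags a regression. Here is a minimal sketch of such a gate, assuming scores in [0, 1]; `find_regressions`, the 0.05 tolerance, and every score below are made up for illustration and do not describe Braintrust's actual GitHub Actions integration.

```python
def find_regressions(baseline, current, tolerance=0.05):
    """Return the evals whose score dropped by more than `tolerance`.

    `baseline` and `current` map eval names to scores in [0, 1].
    Drops within the tolerance are treated as run-to-run noise.
    """
    flagged = {}
    for name, base_score in baseline.items():
        if name not in current:
            continue  # eval was not run this time; nothing to compare
        drop = base_score - current[name]
        if drop > tolerance:
            flagged[name] = round(drop, 4)
    return flagged


# Hypothetical scores for the three workflows named in the talk.
baseline = {"prompt_optimization": 0.82, "synthetic_data": 0.90, "btql_generation": 0.75}
current = {"prompt_optimization": 0.79, "synthetic_data": 0.78, "btql_generation": 0.76}

flagged = find_regressions(baseline, current)
print(flagged)
```

A gate like this only narrows down where to look. Someone still has to open the flagged runs and judge whether the drop is real.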
You'll sometimes see that it didn't actually regress, but the scores might come down a little bit. So it will be a requirement for you to once again go in there and see if the vibes are off, or if it's just, okay, it dropped 3% but it's not really a regression, and do that critical thinking yourself. But yeah.

I feel like if you just offload all the thinking to the LLM, what you're saying is you believe it's smarter than you. Let me take another look at some of these questions.

— Oh, yeah. One question here I'm really curious about: you talked about how these GPT models, these OpenAI models, are really underperforming the Anthropic models. What do you feel the solution would actually look like? Do you feel like you would have two separate prompts for two separate models? Maybe you would just say, you know what, I just want to move forward with Claude 4 until GPT sort of cleans up their mess. How do you think about situations where one family of models really underperforms another?

— Yeah. So for us the answer is really easy. I think it would be harder for other products. We are an infrastructure provider, model agnostic, and we are strictly bring-your-own-key. We have a bunch of customers who, because of how their company is set up, are only allowed to use OpenAI models, or only allowed to use Anthropic, or they can use whatever. So we want to have separate prompts, each prompt specialized for the model or the provider. If there's a generalized thing we can do for the provider, great. But I think what will happen is we'll have a specific prompting strategy for the o3 models, and then the GPT models separately, and then we're going to use the ones we use today for Claude.
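The per-model strategy described here could be sketched as a simple lookup from model name to prompt. Everything in this example is hypothetical: the prompt texts, the model name strings, the prefix matching, and the generic fallback are assumptions for illustration, not how Loop is implemented.

```python
# Placeholder prompts, one per model family mentioned in the talk.
PROMPTS = {
    "claude": "System prompt tuned for Claude models.",
    "o3": "System prompt tuned for o3 reasoning models.",
    "gpt": "System prompt tuned for GPT models.",
}

# Assumed fallback for models that have no specialized prompt.
DEFAULT_PROMPT = "Generic system prompt."


def select_prompt(model: str) -> str:
    """Pick the system prompt for a model by matching its name prefix."""
    name = model.lower()
    for prefix, prompt in PROMPTS.items():
        if name.startswith(prefix):
            return prompt
    return DEFAULT_PROMPT


# Model names below are illustrative strings only.
for model in ["claude-sonnet-4", "o3-mini", "gpt-4o", "some-other-model"]:
    print(model, "->", select_prompt(model))
```

Keeping the mapping in one place means each family's prompt can be evaluated and tuned separately, which matters when customers bring their own keys and can only use one provider.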
So for us it's easy, because we want to be agnostic. If you have a product and you're bringing your own inference, you might even say we're going to hide what model we use, and in that case you can just use one prompt. But yeah.

— Yeah, that definitely makes sense. I didn't realize that even these models were on their own keys. That's a pretty good feature, actually. Let me take another look at the questions. But before I do that, one question I like to ask everyone is: what do you think is something folks are not really thinking about as they build out these agents? We went over some of the key takeaways, from having better evals to looking at your logs, but what do you think is still one mistake folks are making as they build these things out?

— Sorry, was that a question for me?

— Yes.

— Oh, sorry. Totally zoned out. One mistake, and I get in this mindset as well, and I've seen it in my co-workers and others, and this is going to be a very designer-y answer: when you see evals and think, okay, let me move the needle on this percentage, let me make this perform, you hyper-optimize for that. But what you have to always keep in mind is that the thing that matters the most is the user experience you're providing. If you do a great job designing the eval, there will be a good mapping between the success criteria of that eval and the experience the user has. But keep in mind that the person on the other end, using this agent, is another human. There's a human aspect to it. For example, and I'm going to make up an example, you can have GPT-5, incredibly smart, and have it run forever and tackle a complex task.
But if that took, let's say, five or ten minutes, is that a better experience than Claude, if we said Claude 4 took two minutes and did a little worse? Which one's the better experience? So I think you have to think about what kind of experience users would want to have. Evaluate it, but I think you still have to think about that as well.

That totally makes sense. I think with that we can definitely wrap things up. Before we go, is there any other message you wanted to leave for the audience here?

Yeah, I mean, I'm pretty sure everyone says this, but it's an incredibly exciting time. We have this thing where, just by tweaking the prompt, you can mold a program to behave a certain way; you can inject your belief about how it should behave through prompts, whereas before you had to write incredibly complex code to make programs behave a certain way. So experiment, build out new things, and try things out, whether you're technical or non-technical. System prompts can really change the way models behave, and the more evals I write, the more I see that. So have fun with it, and I think you'll find that it is really fun.

— Thank you for sharing your knowledge here. Thank you everyone for coming by.

— Thank you. Take care.
