The AI Skill That Will Define Your PM Career in 2025 | Aman Khan (Arize)
46:27

Peter Yang · 12.01.2025 · 8,809 views · 151 likes · updated 18.02.2026
Video description
My guest today is Aman Khan. Aman is the Director of Product at Arize AI. Last year, we both heard the Chief Product Officers of OpenAI and Anthropic share that AI evaluations will be the most important skill for PMs in 2025. Aman gave me a crash course on how to build this critical skill in our interview.

Timestamps:
(00:00) Evals force you to get into your user's shoes
(03:42) 5 skills to build right now to become a great AI PM
(07:43) How curiosity leads to better AI products
(10:20) You can build AI without the job title
(17:00) A deep dive into why AI evals are so important
(24:53) Example of running evals for an AI customer support agent
(31:32) When humans need to be in the loop for evals
(35:46) How to get better at writing evals right now
(39:39) My personal prompt and eval for transcripts
(44:23) Catching the AI wave is a lot like surfing

Get the takeaways: https://creatoreconomy.so/p/the-ai-skill-that-will-define-your-pm-career-aman-khan

Where to find Aman:
LinkedIn: https://www.linkedin.com/in/amanberkeley/
Arize: https://arize.com/

📌 Subscribe to this channel – more interviews coming soon!

Table of contents (10 segments)

  1. 0:00 Evals force you to get into your user's shoes
  2. 3:42 5 skills to build right now to become a great AI PM
  3. 7:43 How curiosity leads to better AI products
  4. 10:20 You can build AI without the job title
  5. 17:00 A deep dive into why AI evals are so important
  6. 24:53 Example of running evals for an AI customer support agent
  7. 31:32 When humans need to be in the loop for evals
  8. 35:46 How to get better at writing evals right now
  9. 39:39 My personal prompt and eval for transcripts
  10. 44:23 Catching the AI wave is a lot like surfing
0:00

Evals force you to get into your user's shoes

Evals actually force you to get into the shoes of your user. You can no longer just hypothesize, "I wonder if they'll do this, or I wonder if they'll do that." It forces the PM to be way more empathetic about the customer and what the customer is going to try to do, and to articulate that in writing, which is a whole other skill. You want to translate your understanding of the user into text, and then use that to iterate on the system. I think that's hard. I also think there's a high level of subjectivity and non-determinism, which makes it challenging: it's not as easy as with code to predict what the outcomes are going to be, and you're trying to model human behavior as much as possible. Let's say the LLM writes a sentence. Was that a good sentence? It depends a lot on the context; it can depend on the different scenarios you're putting the system into. So understanding and looking at the data, seeing "okay, here are the clusters of types of scenarios the system is going to be put into," and being able to break the system up into more components and more fine-grained scenarios makes writing the evals easier, because now you're iterating in a tighter loop. But it is really hard to think about this when you're first getting started.

All right, welcome everyone. My guest today is Aman Khan, Director of Product at Arize AI. Aman and I attended a talk recently where the Chief Product Officers of OpenAI and Anthropic both agreed that writing AI evaluations will be the most important skill for product teams moving forward. So I'm really excited to dig in with Aman today on the key skills that AI PMs need, including learning how to get better at writing evaluations. So welcome, Aman.

Yeah, awesome, thanks so much for having me, Peter. And yeah, it was a pretty crazy moment to be sitting in the audience there. I remember we both left the talk and started talking about evals, and
yeah, really glad we get to do this.

Yeah, so maybe you could start by introducing yourself, and how you became an AI PM at a company that specializes in evals.

Yeah, for sure. So funnily enough, I've actually been working on the same problem for most of my career. I started by working on evals for self-driving: I started as an engineer at a company in the self-driving space, and realized that it's really messy trying to figure out what exactly we need to do to make the car better or worse. And this is super applicable now, even in the LLM and generative world, where you're trying to work with these black boxes but you're not really sure: am I making this thing better? How do I measure the performance of the system? I kept that through line by realizing there are actually a lot of products to be built in this space, a lot of really messy problems like dashboards and getting the right metrics and right data in place. That allowed me to leap to Spotify to work on the machine learning platform, and while I was there I was still working on the same sort of data problem: how do we use data in the machine learning models we were shipping in production? It was actually there that I discovered the company I'm at now, called Arize. Spotify is actually a customer of Arize, which worked out in a funny way. So I've just kept that through line going: trying to make it easier for people, engineers and product people building with AI systems, to understand the impact of their models. I think this is going to be a pretty long-tail problem, so there are a lot of really interesting product challenges along the way to solve.

Yeah, because these models are like a black box; you can't really tell what's going on. But everyone keeps talking about AI PMs. It's kind of become a name, this "AI PM" thing, and everyone wants to become an AI PM these days. So if you had to list
3:42

5 skills to build right now to become a great AI PM

a couple of key skills that you can build right now to become an AI PM, what would they be?

Yeah, so it might even be helpful to recap for folks, in case this term is new, just to set the right terminology and define it correctly. I view AI PMs as people that are building products with or around AI. It sounds silly to say now, because the AI field is very new and everyone's trying to catch up, but ultimately I do think that long term, all PMs are going to be some flavor of AI PM. Either you're building on top of some sort of LLM API or some model in the backend performing some action in your product, or you're going to be using these tools in your day-to-day to solve problems. So it's helpful to frame this as AI starting to become a core part of the skill stack for PMs in the first place. Once you start from there, there are a few high-level ways to think about how to build your own knowledge in the space. I look at maybe five different areas here.

I would start with the fundamentals: what are the high-level areas of AI and ML that are really being used in the industry today to actually deploy these models, and what would you want to apply in your products, or to the problems you're trying to solve for your customers? There are a couple of helpful resources. I plugged the Andrej Karpathy one in the last podcast I did with Lenny, but another really helpful resource I've been referencing myself is a really nice algorithm map that Marily, who's an AI product manager at Google, has put together, where she goes through "here's when you would use this algorithm for this thing," and
this is what machine learning is useful for. I'll give you an example: understanding, okay, what do I do to predict a user's actions? Should I use a regression model or a classification model? It helps you frame the decisions you want to make in the form of statistics and models, and I think that's just a helpful framing to have. So basically, the goal here is to understand what these things are good at and what they're not good at. That's really the core intuition you're trying to build. I'll pause there in case there are any thoughts.

No, I think that's a really good point, because it's important to realize that there are actually different types of models, like you said, that solve different problems. A model that solves discovery is very different from an LLM, right?

Totally, yeah. It's like showing up to a hardware store and saying "I need a tool," and the people there ask "what tool do you need?" You can't just use a hammer for everything. There's that expression: if you're a hammer, everything's a nail. It's the same thing with AI. If you're using LLMs for everything, you're probably going to have a bad time. These things aren't necessarily great at everything; you might need a different algorithmic approach. They're very good at generalizing, but it's helpful to understand how to think about the space more broadly. So I do think that fundamental starting point is really important. And you might actually find, as you're going through this journey, that certain algorithms or certain spaces of machine learning or AI are more interesting to you, and more worth going down that deep rabbit hole for. It's very likely, and people in the industry are already talking about this, that we're hitting a plateau with what LLMs can do, and ultimately we're going to need to combine different algorithms to actually get the results that
we want. We actually see that ourselves with customers we work with: it's not just an LLM going and doing something, it's usually combined with a ranking model or a classification model to give you a result. So that's the foundation. I'll go through a couple more areas, which I actually think are core product skills in the first place: customer obsession
7:43

How curiosity leads to better AI products

and curiosity are really the two things that are going to drive you. Coming back to it: now that you have the foundation, you have at least the basics. You're not going to have a master's or PhD in the subject, but at least you know what this thing is good at. Now you let your curiosity drive you to the next stage, which is: what are the problems I'm trying to solve for my customers, and what's the most important thing to solve for them? Then take that curiosity and keep trying to prototype, and basically try to solve the problem for yourself. There's this thought that prototyping is going to be a pretty core skill. Curiosity is going to pivot into prototyping and just being able to try the tools yourself. You'll learn from AI experiences firsthand what is useful in your day-to-day and what's a terrible tool. So you're prototyping and developing that intuition. And then, last but not least: okay, you've got the foundation, you've got the knowledge, you understand what to go and solve for your customers, and you know how to build prototypes. You're not done there. You still need to understand how well your system is performing overall, and that's really where evals and observability come in. I do think that's going to be maybe one of the most important skills for PMs to learn: you can't just ship with AI, you have to be able to measure what your system is doing.

So can you quickly recap the key skills again?

If I had to boil it all down, I would say: the fundamentals, customer obsession, curiosity to learn and prototype, learning from great AI experiences, and then evals. Those make up the five pillars of what you'll need to learn as an AI PM. And if I were to put it into one sentence, I would say: build knowledge, experience, and best practices on shipping high-quality AI products in the real world, and then be able to measure the impact.

Got it. I actually think the curiosity part is pretty important. A lot of big-company PMs actually don't have that curiosity; they're too busy writing slide decks and making docs, and sometimes they don't actually play with the stuff themselves.

Exactly. I mean, that's sort of the realization I had myself as well. Imagine you could just show up to the meeting with a prototype instead of a PRD. I think that's the shift PMs are going to have to undergo as they figure out how to communicate better in this new world. And also, I think it's important to emphasize that you don't even have to have PM as a job title to start
10:20

You can build AI without the job title

working on these skills, you know? Because AI has arguably made it easier to ship stuff yourself than to try to get buy-in at a company, you could say.

Yeah, it's funny. I'm sure you've made fun of this stuff in the past, but there's always that trope of "the PM is the CEO of the product." That's what people say. But I actually think that has never been true until now, because now you actually have these superpowers. As the "CEO of the product," you can actually go and say, "let me write some marketing copy and test it out, let me build a quick version of the website." You don't have to wait for the engineering team. You can just go and start getting things done, and that can really be motivational for the team, and I highly encourage it. That's where you'll learn, "okay, this is where I need help, this is where we need to iterate." So you actually can be the CEO of the product now. Before, I think people were just saying that to feel better about themselves.

Yeah. And when people think AI tools, AI products, a lot of people just go toward the chatbot, right? Just make another chatbot. But I think a lot of the really good AI products are actually not very sexy; they're more focused on fixing tiny little friction points in the core workflow. So I'm curious: out of ChatGPT, Claude, and such, what are some of your favorite AI tools in the market right now?

Yes, lately I've been pretty obsessed with the prototyping tools out there, like coding agents. We can plug a Twitter thread where I break them down. There are a lot of them, so it's a little overwhelming. I spent a weekend just trying a bunch and then dumping my own thoughts, and I've been using those a ton for prototyping.

I'll give you an example. We were sitting in a meeting and we were like, darn, we really need to ship this usage dashboard for a customer so they can see, across their enterprise, how many people are using the product. And we were going to need to put an engineer on this for three days, or a week, or three weeks, to figure out how to pull the data into the right place and then build a nice UI. While we were debating all of that, I literally just went to Replit. Replit lets you type in a prompt, and it's not really a chatbot, because it doesn't just give you a bunch of text: it actually goes and writes code for you to build a real, functional prototype. I literally did that in the meeting. I copy-pasted the Pendo docs, which is a tool we use, and basically told it, "okay, build a usage dashboard for this customer, here's an API key." By the end of the meeting we had something we could look at and go, "wow, we just did that while we were talking about it." So I think that bias to action is going to help a ton, and I'm pretty obsessed lately with the coding agents.

Maybe two more examples. One is that I needed a quick refresh of my personal website. I still remember the first time I coded my personal website: first HTML, then I wanted to do something a bit more like React, and I took a week in college, at the end of my nights and weekends, to write my personal portfolio. Then you have the Wix templates, but you don't really like those; you want it to be a bit more personal. Now, with Vercel, there's a tool called v0 which builds really beautiful front ends. It does more than that as well; it's a pretty capable product, but personally I like its look and feel, and you can just drop images in and get a pretty functional prototype out the other end. So I used that to build my personal website.

And then, last but not least, I've been playing around a lot with Claude writing styles. You can give it a sample of your writing now, and Claude will try to mimic the way you write based on that. It's pretty good. It's not fully there yet, but it does help with the cold start problem a little.

Yeah, that's what I've been doing with Claude: giving it a taste of my really good writing and then, being lazy, asking it to edit stuff for me.

Yeah, it works out well. And then the skill just starts to become: how do you get the right product out of that? How do you prompt it the right way to actually get the result you want?

Yeah, the prototyping is interesting. I also try to play around with some of this stuff. I feel like it's pretty good for making little toy apps and proofs of concept for yourself, and maybe a personal website too, but it hasn't gotten to the point where you can actually make a SaaS product that I'd like. But I feel like it's pretty close. I don't know, I feel like a year from now it could happen.

Yeah, I think that's totally true. What's funny is that if you go on YouTube, there are so many videos. A little hack for anyone listening who wants to try this out but maybe is sitting on their sofa and doesn't want to reach for their laptop: just go to YouTube and type in "Cursor" or "v0," and you'll see a ton of tutorials where people build a SaaS app in like 10 minutes. A lot of people are claiming to be able to build entire companies on top of that. I don't know how feasible that is, necessarily, but it is really interesting to see where the space is headed, for sure.

Got it, yeah.
I'm not sure I believe people claiming they're making a lot of money from this stuff, but we'll see.

Yeah. But just to wrap up this chat: what I really love about these AI tools is that there's so much time wasted by PMs. Even to get a job as a PM, you go through case interviews that are kind of hypothetical, and then in your PM job you're writing all the docs and going through all the reviews, and none of it is actually about the real product you're trying to build, right? So I hope that with these AI tools, we actually shift the attention back to the actual product we're trying to deliver to customers. That is my dream.

Yeah, I think we're getting closer. You were at the same conference, and there was this talk from Claire Vo with Venn diagrams of functional areas and jobs. I actually do believe in the thesis that jobs and roles are going to be less important, and that skills, and how you use tools to your advantage, are going to be more important. That's an example where you're not just going to be writing docs. Pretty soon, I think it'll be a lot more about actually solving the end user's problems, even more so.

Yeah, that will hopefully make the PM job a lot more fun. But yeah, let's talk about evals, the most important skill for PMs. So what exactly is an eval, and why is it so important?

Yeah, I think it's helpful to frame this in the context of a product, and I'll make it really simple. This is the mental model I use to think about building products in the
17:00

A deep dive into why AI evals are so important

first place. If you really boil it down, imagine a diagram: a product is a box in the middle, and you have an input and an output. The input could be a customer coming to solve a problem, and the output is the solution the product provides. To make it tactical: let's say I go to a website and type in, "I'm going to San Francisco, help me book a trip, help me book everything in the middle," and the outcome is that you actually get your trip booked. There are a lot of steps in the middle, and we're oversimplifying, but at the end of the day what you care about is that the user has a good experience. Evals are the way you measure how good or bad the box in the middle, the product, is.

Before, in the software 1.0 or 2.0 days, you could check: are we calling the right API? Is the code constructed correctly? That's a very deterministic system. Now the entire system is a lot less deterministic, because large language models are statistical systems, so there's a certain degree of variability. There's work being done to make these systems more reliable and to reason better: OpenAI released o1 pro just yesterday, a much more reliable model, but it's much slower. So really what you need are tools that measure the goodness of the system, and those can come in a few different flavors, which we can get into: code-based feedback, and using other LLMs to measure the performance of the system as a whole. So at a high level, evals are how you measure the goodness of the system, and that consists of your data, your eval method, and the metric you want to use to analyze the system as a whole.

So: how good is the output given the input, right? How good is the black box at making the output given some user input?

Exactly. How good is your product at solving this problem, qualitatively and quantitatively?

Yeah, I think the difference is that in the good old days, the user only had one type of input: click a button and it does this. But now the user has a thousand different types of inputs.

Definitely. And these systems are becoming multimodal. There's voice: did it hear you correctly? And even compared to the product PM days of "is the user clicking this button," now you can get much deeper and understand why they're doing something or not. The data becomes much richer to use for debugging with LLMs.

Yeah, that's a really good point, because
the user is actually maybe telling you the problem through chat, as well as clicking a button.

Yeah, exactly.

So can you talk briefly about where evals fit in the process of building an AI product?

Yeah. I was giving this analogy to someone: imagine you're studying to become a doctor. Imagine that's the product you're trying to build, some sort of doctor. The eval is basically the test you give the doctor before the doctor actually starts seeing patients. There are a ton of reviews; you want to make sure the system understands what it's getting into. There are technical things, scientific things you're trying to solve for, to make sure this system is actually good enough before it gets into production. An eval sits at that early step. Think of it as the medical exam, the test you're giving the system, and you want it to be representative of the real world. It doesn't make sense to test doctors on theoretical things they're never going to apply. So you really want the eval, at the earliest stage of development, to be what you use to guide the system as a whole. Once you've identified a problem and figured out, okay, here's what the inputs look like, here's what it should do, and here's the output we're going for, the eval helps you measure at the early stages of development whether you're actually building the right system. That's really where evals start.

Okay, so is it kind of analogous to running an A/B test or something? Or maybe A/B tests can be part of evals?

Yeah. I almost feel like evals are a new kind of A/B test. Before, you would have to split your audience and
get into production and see what certain people are doing: do they like this color or that color better? Evals are sort of the way you do that with LLM-based systems. Evals are this new flavor of A/B test, in a way.

That's a good way to put it. Okay, so let's talk about the how-to a little bit. First of all, I'm not an expert on writing evals, but you are. When I write evals, I usually start with a manual process: what are the five most common questions the user will be asking, and then I try to write, okay, here are the five ideal answers. Honestly, it felt more painful than trying to write a PRD, because you have to put yourself in the user's shoes. So why is this so hard, man? And is there a way to make it easier?

Yeah, that's a great question. You bring up a really good point, which is that evals actually force you to get into the shoes of your user. You can no longer just hypothesize, "I wonder if they'll do this or that." It forces the PM to be way more empathetic about the customer and what the customer is going to try to do, and to articulate that in writing, which is a whole other skill. You want to translate your understanding of the user into text, and then use that to iterate on the system. I think that's hard. I also think there's a high level of subjectivity and non-determinism, which makes it challenging. It's not as easy as with code to predict what the outcomes are going to be, and you're trying to model human behavior as much as possible. So when the system is subjective, that can be really tough.

I can give a very analogous example from the self-driving space. When we were trying to solve evals for a self-driving car, we basically had humans sitting and saying, okay, did the car make this left turn correctly? Was it a good left turn to make? Was it smooth? Did the car commit at the right time? Was it too fast? Whatever it might be. And it turns out that's actually pretty subjective. Someone's "good left turn" is different; some people would say, "no, I would never have made that left turn." And then you start adding in more variables: imagine someone is crossing on the sidewalk at the same time and the car makes the left turn. Should it have done that? It can get really specific and nuanced. I think that's really relatable to the LLM world now, because it's hard to say: the LLM writes a sentence, was that a good sentence? It depends a lot on the context; it can depend on the different scenarios you're putting the system into. So understanding and looking at the data, seeing the clusters of types of scenarios the system is going to be put into, and being able to break the system up into more components and more fine-grained scenarios, makes writing the evals easier, because now you're iterating in a tighter loop. But it is really hard to think about this when you're first getting started, for sure.

Got it, yeah. You've got to think about all the edge cases, think about where it could go wrong. Okay, so let's
24:53

Example of running evals for an AI customer support agent

make it more practical and take a real example. I think a pretty common AI product is a customer support AI agent, right? It helps you answer any customer support questions. How would you use different types of evals to make this thing better?

Yeah. So the classic customer support case, like "I'm trying to get a refund," that type of customer support agent?

Yeah, for retail or something.

Yeah, for retail. So there are a couple of ways you want to think about the system, and a lot of it comes down to the data itself. Maybe to reframe your question: what are the ways I can measure my system as a whole? How do I know if this retail agent is actually solving the customer's problem? An example is "I want to return this laptop." Well, let's think about the system a little more deeply. You actually have a few things you can measure. Let's imagine it's a chatbot, just to simplify the system and get started. There's a user typing in "I want to return my laptop," and the agent can perform some actions, maybe reason about: are you within the return window? Is the laptop undamaged? Are you actually able to return this thing? Those are all rules, things the customer support agent can have coded into it. That's pretty straightforward, in the sense that you can check it: you have some inputs from the user, and you're transforming those into the rule-based system the agent needs to follow. So that's the high level, where you have real user data and the LLM's outputs, and you can compare them where there's actually some ground truth. For
instance, I can say, "no, I'm not within the return window," and if the agent says, "okay, let's continue to process your return," that's an example where the agent has gone off the rails; it's hallucinating. You can test that by having another human grade the responses and mark "this is an example where the agent was incorrect." So that's your immediate starting point: getting the right data in place, the input from the user and the human label. I think you can also use some pretty straightforward, simple classifications or guardrails. If I start swearing in the chat, that's a string match; it doesn't need to be a super sophisticated system. But you can also ask: is this customer being rude, and do you want to detect that? Are they trying to jailbreak my system, or get it to say something it shouldn't, which happens sometimes and makes the news? That's another example of where you'd want to check that the system is not being manipulated. So that's the high level of where to get started with the data. Then, when you have those first five or ten examples, you can use them to generate even more data for testing. You could simulate more scenarios where the LLM might be incorrect, and make sure your system is robust against those different scenarios. And then, last but not least, you want to actually test this: you go out and test with beta users, and you do the A/B test like you were saying. You might be testing different prompts against each other too, or different tools, or different parts of the system that you're editing, and you want to compare that against the original examples you were using in experiments and see: how far off are our experiments from the real world? Do we need to make our experiments better? So this actually starts to look more
like a iterative loop where you start with the data you kind of build on top of that in different met with different metrics you get more and more data and then eventually you use the real world to actually improve the system as a whole okay so just so I understand it so step one is you have a bunch of like maybe you upload your return policy to the AI and then you have a bunch of past real human agent conversations with customer support agents right so maybe that's the data that you start with and then maybe you write a few round choose answers like for different stuff okay and then there's like easy ways to filter El words or like are completely not related to the customer support that's step one and then you actually get some other LM that you get some other to generate sythetic data set based on your human data set and maybe like get one LM to evaluate another LM right exactly yeah so you're kind of building this the click on that you're using the initial human labels as the way for you to seed is this you know how do you augment or automate that part of the job with an llm that you don't need to have a human labeling every single you know response once you actually ship the product you can have another system sitting on top and doing that okay and then we get to a point where like okay this is actually good enough to test with real users so then you run an AB test or maybe like yeah you run AB test with like maybe like 5% goes to this AI think and 5% still go to the real human agents and you compare some super metric like what are some metrics for saying that you would look like accuracy or yeah so I think it depends very highly on the system that you're trying to build so you give a good example of like customer support agent there's you do want to check like accurate like the rules like if the llm following the rules that's accuracy that's a pretty simple like comparison you can also do if the llm has to give more open-ended responses and you want it to the 
tone of your company be maybe extra professional if it's a legal company maybe you want it to sound more like casual and fun if it's like a travel company so you can do things like matching the tone is the llm starting to get frustrated if the user is getting frustrated like you don't want that to happen right so that's an example where you actually want to validate depending on the use case and pick the right metrics but at the Baseline you're looking at correctness you're looking at hallucination which is the llm you know responding correctly when it has the context and then what is the overall tone and sentiment those are maybe a few starting points that are pretty common that we see but it gets a lot more specific depending on the agents for instance if you're you know going back to the tool example like is the llm using a hammer when it should be using a screwdriver just as an example for like a different API you want to make sure it's using the right tools as well to solve the problem but it depends a lot on the use case and generally it's like you're using human labels to validate as you go through so like how do you EV validate something like
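The rule-based layer described above can be sketched in a few lines. This is an illustrative sketch, not Arize code: the policy constant, the `Ticket` fields, and the blocklist terms are all hypothetical stand-ins for whatever your real return policy and guardrails look like.

```python
# Sketch of rule-based eval checks for a hypothetical retail support agent:
# compare the agent's decision against coded policy rules, and run a simple
# string-match guardrail over the user message.
from dataclasses import dataclass

RETURN_WINDOW_DAYS = 30            # hypothetical policy constant
BLOCKLIST = {"damn", "jailbreak"}  # toy guardrail terms, not a real list

@dataclass
class Ticket:
    days_since_purchase: int
    item_damaged: bool
    user_message: str
    agent_approved_return: bool    # what the agent actually decided

def policy_allows_return(t: Ticket) -> bool:
    """Ground-truth rule the agent is supposed to follow."""
    return t.days_since_purchase <= RETURN_WINDOW_DAYS and not t.item_damaged

def eval_ticket(t: Ticket) -> dict:
    """Compare the agent's decision to the rules and flag guardrail hits."""
    expected = policy_allows_return(t)
    return {
        "correct": t.agent_approved_return == expected,
        "hallucinated_approval": t.agent_approved_return and not expected,
        "guardrail_hit": any(w in t.user_message.lower() for w in BLOCKLIST),
    }

# The "off the rails" case from the conversation: the user is outside the
# return window, but the agent keeps processing the return anyway.
bad = Ticket(days_since_purchase=45, item_damaged=False,
             user_message="I want to return my laptop",
             agent_approved_return=True)
print(eval_ticket(bad))  # flags an incorrect, hallucinated approval
```

In a real system the labeled tickets would come from the human-graded examples Aman describes, and LLM-graded checks would sit alongside these deterministic ones.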
31:32

When humans need to be in the loop for evals

Peter: ...tone? Do you randomly sample some of the AI's answers and get human labelers to evaluate them?

Aman: I think what you're hitting at is the whole point: humans are still in the loop for the initial evaluation, because someone needs to ensure you're still solving problems for real humans on the other end. Even if there's an agent on the other end, you're the one determining whether your product actually solves the problem. Go back to the simple input/output view: humans need to look at the inputs and the outputs before you can start handing that off to another LLM to do it for you.

Peter: Let me go a little deeper. Say this customer support agent has a fancy prompt, but I also have a retrieval system: I'm retrieving different parts of my return policy or documentation based on the questions the user asks. If something goes wrong, how do I know whether to update my prompt or my retrieval? How do I know what to fix?

Aman: Man, Peter, you're always asking the hard questions. I joke, but your question is actually a good example of how to determine whether the system is performing well: you want to break the problem up into smaller problems and components. Think of it like building a Lego structure. The structure consists of blocks, different components; it's not one big Lego block. You want to break your system down as much as possible and evaluate each component in its own iterative loop. That's a principle borrowed from software engineering: you have unit tests at the individual, granular level, and that's very applicable to the AI evals you write. You want to be testing for hallucinations using the context you have, as well as correctness and tone, and then you keep going deeper.

To use your question: how do I know if I need to tweak my prompt, change my retrieval, or change my model? You should be able to iterate across all of those and see which one has the biggest impact. So coming back to your point: I can't tell you up front "you need to tweak the prompt" or "you need to tweak the model." You need a tool that helps you iterate across all of those things really well, because they're all parameters you can update. Picking the right tool to iterate on all of those parameters is actually the problem you want to solve; it's the diagnosis on your overall system.

Peter: Because you don't really know until you actually make the change.

Aman: Exactly, it's the speed of the iteration loops. We've seen systems where we thought, man, this is going to be so much harder to solve. For example, we have our own co-pilot in the platform that helps users debug their AI (we're actually building AI to help you debug AI), and all the time we'll hit a scenario where one use case keeps coming up, a customer keeps getting stuck, or the agent isn't good at one thing, and then we'll add one line to the prompt and it completely solves the problem. It doesn't happen every time, but you never know, so you have to be able to iterate very quickly on different ways to solve the problem.

Peter: Got it. So the waterfall PRD is dead. I can't just write a PRD and head over to a designer and an engineer like, "hey, go do this." It doesn't work anymore.

Aman: It could be fun to rip on that in a whole different setting: waterfall is dead, so what does product management look like now with AI, and what's the right term to use? I actually have no idea what the right analogy is there.

Peter: You just have to iterate every hour, every few hours.

Aman: Clock cycles, maybe that's the way to look at it, like clock cycles for a GPU.

Peter: So we started this...
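The "iterate across prompt, retrieval, and model and see which has the biggest impact" idea can be sketched as a small parameter sweep against a labeled set. Everything here is hypothetical: the variant names, the two labeled examples, and the toy `run_variant` stub, which stands in for calling the real system.

```python
# Sketch: treat prompt, retrieval, and model as parameters, score every
# combination against a small labeled set, and see which change matters.
from itertools import product

# Tiny hypothetical labeled set: (query, expected decision).
labeled = [
    ("return laptop, day 45", "deny"),
    ("return laptop, day 10", "allow"),
]

def run_variant(prompt: str, retriever: str, model: str, query: str) -> str:
    # Stand-in for the real system call; deterministic toy behavior so the
    # sweep below has something to measure. Here only retrieval matters.
    if retriever == "policy_docs" and "day 45" in query:
        return "deny"
    return "allow"

def score(prompt: str, retriever: str, model: str) -> float:
    hits = sum(run_variant(prompt, retriever, model, q) == y for q, y in labeled)
    return hits / len(labeled)

# Sweep every combination of the three "parameters" and keep the scores.
results = {
    (p, r, m): score(p, r, m)
    for p, r, m in product(["prompt_v1", "prompt_v2"],
                           ["none", "policy_docs"],
                           ["model_small", "model_large"])
}
best = max(results, key=results.get)
print(best, results[best])
```

On this toy set, the sweep shows that adding retrieval, not changing the prompt or model, is what moves the metric; that is the "diagnosis" step Aman describes, just miniaturized.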
35:46

How to get better at writing evals right now

Peter: ...conversation by talking about how you don't need all those credentials, a PhD in ML or anything, to get better at writing evals. So say I'm a random person with a laptop. How do I get better at writing evals?

Aman: Let's go a level deeper. We've been talking about evals as how you measure your system; it can be code, or it can be an LLM evaluating your system overall. But I want to get one level deeper on that: if the eval is an LLM evaluating your system, it's really just another prompt. For example: "You are the best customer support evaluator in the world; your job is to check the work of this other agent." That's the eval prompt you're writing, and all of a sudden you've created a judge LLM. You say, "here's the context, here's what the agent responded with; was the agent correct or not correct?" When you look at that system as a whole, it's really just prompts, just text. So in the same way that you have to build that first early version of the product, the PM has an opportunity to put themselves in the shoes of what the customer experience should look and feel like.

And here's the thing to think about: we all know that if you type something into Claude or ChatGPT and ask "was this correct or incorrect," you'll get a long response back, like "oh, the agent was incorrect because it used the context here but didn't reference that." That's not something you can use to improve the system; it's just another essay. The way you get evals to be usable is to think about the outputs in the form of labels. You really want to say, "at the end, answer with only two words: correct or incorrect," and you probably want some enforcement, some system that pulls the "correct" or "incorrect" out. We actually have an open source tool that does just that; it's two lines of code, and you can use off-the-shelf prompts we provide or write your own. You provide what we call rails, which is "only correct or incorrect," so you don't get the essay on the other end. That's one very common problem that comes up.

You also want evals to be fast, so picking the right models and multi-threading is pretty important. You don't want to sit there waiting for 100 rows to go one by one to ChatGPT and come back; you want the APIs to feel fast and responsive.

Then there's using the right prompts. You have a great prompting guide; having the right prompting in place and iterating on your prompt can have a surprisingly big impact. The most classic example is providing examples in the prompt. Going back to the agent that checks the other agent: "you are a customer support evaluator checking this other agent; make sure it's answering correctly; here's an example of a correct answer and an incorrect answer." If you just put that in your prompt, I guarantee it will improve the results of your system as a whole. There are a lot of other tweaks you can do along the way, and that gets you to your first level of writing a good eval you can build on top of. One last tip: don't try to optimize too soon. Optimize for the best result first, rather than trying to save money on model cost in the first place.

Peter: Got it. I was just thinking, I have this prompt that...
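The "rails" idea, forcing a judge LLM's answer into a fixed label set and extracting it, can be sketched like this. This is not the Arize Phoenix API, just a minimal stdlib illustration with the judge call mocked out; the prompt text, `RAILS`, and `snap_to_rails` are all made up for the example.

```python
# Sketch: a judge prompt with few-shot examples and a rails extractor that
# snaps a possibly verbose judge response to "correct" / "incorrect".
import re

RAILS = ("correct", "incorrect")  # the only labels we accept

# Hypothetical judge prompt; note the few-shot examples and the final
# instruction constraining the answer to the rails.
JUDGE_PROMPT = """You are grading a customer support agent.
Context: {context}
Agent answer: {answer}
Example of a correct answer: "You are outside the 30-day window, so no refund."
Example of an incorrect answer: "Sure, let's process your return."
Respond with exactly one word: correct or incorrect."""

def snap_to_rails(raw: str) -> str:
    """Pull the first rail label out of a possibly verbose judge response."""
    # Match on word boundaries so the "correct" inside "incorrect" does not
    # count; checking the longer label first is extra safety.
    for label in sorted(RAILS, key=len, reverse=True):
        if re.search(rf"\b{label}\b", raw.lower()):
            return label
    return "unparseable"  # signal for a retry or a human look

print(snap_to_rails("incorrect"))                           # -> incorrect
print(snap_to_rails("The agent was Incorrect because..."))  # -> incorrect
print(snap_to_rails("Hard to say!"))                        # -> unparseable
```

The "unparseable" fallback matters in practice: it is how you notice the judge ignored the rails and wrote the essay anyway, instead of silently miscounting it.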
39:39

My personal prompt and eval for transcripts

Peter: ...I use to turn these conversations into written interviews, in Claude. Sometimes it does the edit and it's not that great, so I have to go in and manually clean it up. Then sometimes I'll paste my manual edit back in and say, "hey, what's different between your edit and my edit?" It lists all the differences, and then I say, "okay, can you update this prompt so it gets to my edit in one shot?" So that's like a pretty dumb version of an eval, right? That's basically what I'm doing there.

Aman: I think that's a really smart version of an eval, because you're using your data. Take that exact example: with an entire LLM system, you're basically building a system using LLMs and AI, and you're improving it with evals. You're saying, "here's my data, this is what you wrote, this is what I preferred; now use that to go and update the prompt in the first place." You're building a system that learns from itself, and you, Peter, in this case, are the eval: good or bad. Eventually you could probably build an agent that was Peter checking his own work on the written output, asking "how would you improve this, pull out the interesting insights Peter would want," and feeding that back into the first prompt. Now you have a multi-agent system. In a way you're building a mini company; that's how I view it. You're the CEO of the product, orchestrating these agents as employees.

Peter: Yeah, I need a senior agent to manage the intern agent.

Aman: A little group PM agent.

Peter: But doing this stuff in Claude is pretty manual. So let's talk about Arize. How can Arize help people with evals?

Aman: At Arize we build systems that help you build AI in the first place, at three levels. First, we have a ton of free education: free courses and handbooks that help you get started. You can go straight to our website to download those immediately. That's a helpful resource if you got some takeaways from this episode and want to go deeper on a few of the subjects we talked about, so I highly recommend checking it out.

The second level is open source tooling. We mentioned evals, but you're not going to want to sit there and write your whole system to run evals; it doesn't make sense for engineers to rebuild what's already out there. We have open source tools that help PMs and engineers who are getting started write evals, and also trace and monitor their products. We help you do that breaking-down of your system using traces, which is the Lego block analogy, and then evals to evaluate the system as a whole. All of that is open source and free to use; you can host it yourself, and you don't need security permissions. Honestly, we're just trying to get more people to look at their system and their data, so we're going to keep putting out more and more there for free.

Last but not least, we have an enterprise product for more mature products and teams that need security, support, and scale. It's also free to try; there's no gate on the product, and if you have any feedback, I'd love to hear it. We're really trying to get more people to look at their data and build better AI systems, because let's be honest, we need more and better AI systems, more different perspectives, and different PMs thinking about how to use these tools. That's really my goal.

Peter: That's awesome. I didn't realize you have all these open source tools. Where can people find them?

Aman: The open source tool is called Phoenix, at phoenix.arize.com, and there's a GitHub repo; you don't even have to talk to anyone. We also have a Slack community you can join if you have any questions. It's funny: the first line of defense for questions is actually a pretty good AI agent, and then you get more discussion with humans on top of that. It's a good place to learn, get resources, and basically learn from others as well.

Peter: Awesome. Any closing words of advice for folks watching this podcast?

Aman: The thing that keeps coming back to me is this general sense in Silicon Valley that the AI revolution is here and people feel like they're catching up: "how do I pivot my job, how do I get that AI PM title?" So many people on LinkedIn are updating their resumes, trying to catch up. My note...
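Peter's transcript loop, diffing the model's edit against his own and feeding the differences back into the prompt, can be sketched with the standard library's `difflib`. The function names and prompt wording here are hypothetical, and the actual LLM call is out of scope; this only shows the diff-and-accumulate step.

```python
# Sketch: compute line-level differences between a model edit and a human
# edit, then fold them back into the prompt as preference examples.
import difflib

def edit_feedback(model_edit: str, human_edit: str) -> list:
    """Line-level differences between the model's edit and the human edit."""
    diff = difflib.unified_diff(
        model_edit.splitlines(), human_edit.splitlines(),
        fromfile="model", tofile="human", lineterm="")
    # Keep only the changed lines, dropping the ---/+++ file headers.
    return [d for d in diff
            if d.startswith(("-", "+")) and not d.startswith(("---", "+++"))]

def augment_prompt(base_prompt: str, feedback: list) -> str:
    """Fold the observed preferences back in so the next run one-shots it."""
    notes = "\n".join(feedback)
    return (f"{base_prompt}\n\n"
            f"Past corrections (- is what you wrote, + is what I prefer):\n{notes}")

fb = edit_feedback("We discussed evals at length.",
                   "Evals came up again and again.")
print(augment_prompt("Turn this transcript into a written interview.", fb))
```

This is the "you are the eval" pattern in miniature: the human edit is the label, the diff is the feedback signal, and the augmented prompt is the updated system.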
44:23

Catching the AI wave is a lot like surfing

Aman: ...here is that I honestly think it's a lot like surfing. If you've surfed before, you know that things come in waves, and you have to predict or see what wave is coming and time it. So much of surfing comes down to being in the right place on the wave before it arrives. And my note is that the wave is actually still pretty early. There's a lot of room for different perspectives and approaches, and bringing your own unique approach to building with AI is really what the ecosystem needs right now. We just need more people in the right spot, so that when the wave really starts to pump, there are more people riding it. That's how I view it: you're not late, in fact you're probably still pretty early, and the goal is really just to get to the right spot on the wave.

Peter: I want to emphasize that getting to the right spot is not taking five AI courses or getting a playbook on how to do this AI PM thing. It's actually trying stuff out yourself.

Aman: Try it out yourself, and join the right communities; I think that's another tip. Be around people who are trying these tools out and building AI products, so you can learn from them and they can learn from you. That's really what I'm trying to leave people with.

Peter: Awesome. Where can people find you online if they have more questions?

Aman: If you want to reach out, my website is amank.a (a new domain there), and you can always reach me on LinkedIn or X. I might be slow to respond; if I don't respond, just hit me again with another message, and I promise I'll try to get back to you. But yeah, I'm really trying to help more people build with AI.

Peter: Awesome, man. This was an awesome conversation. Thanks so much.

Aman: Thanks for having me, Peter.
