AI Won't Replace Engineers, But This Framework Will Change How They Build with Rohit Girme

The Data Engineering Show - Podcast, 07.05.2026, duration 19:41, 60 views, 2 likes

Video description

In this episode of The Data Engineering Show, host Benjamin Wagner sits down with Rohit Girme, Staff Software Engineer at Airbnb, to explore how Airbnb built a Gen AI evaluation platform to assess LLM outputs across product surfaces, from customer support bots to search and booking experiences. Rohit shares insights into Airbnb's infrastructure choices, evaluation workflows, and lessons learned about leveraging AI tools while maintaining human orchestration.

*Chapters:*
[00:00] Intro
[00:39] Building a Gen AI Evaluation Platform at Airbnb
[04:10] From Customer Support Bot to Evaluation in Action
[05:03] Why Monolithic Prompts Fail: The Case for Specialized Judges
[07:07] Real-Time vs. Offline Evaluation: A Dual Approach
[10:54] Using AI as a Tool, Not a Replacement: The Human Orchestrator
[12:38] Measuring Real Productivity Beyond Token Consumption
[15:30] Zero to One is Easy, One to N Still Needs Humans
[17:48] Key Takeaways & The Future of AI-Driven Engineering

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here: https://www.fame.so/follow-rate-review

Table of contents (9 segments)

Intro

Rohit: Earlier, with code, it was deterministic, more rule-based. You knew what the code did. But with AI, it's a black box to us as well, so we have to figure out another way to evaluate the surface. We've been figuring out how to use AI in the product. The issue here is: how do we know what we are building will work correctly? We've started building what we call the Gen AI evaluation platform. If it's a real-time interaction, you keep it lightweight; you need smaller models tuned very specifically for that. Then we also have an offline workflow, which uses more prompts, more metrics, bigger models.

Benjamin: Rohit is a staff software engineer at

Building a Gen AI Evaluation Platform at Airbnb

Airbnb. He has been there for the last seven and a half years.

Rohit: I've been working at Airbnb. More recently, of those years, I have been on the search and ML infra org. Zero to one is easy now. But the one to N, which is the scaling part, I think we still haven't figured that out. You still need humans for that.

Benjamin: The Data Engineering Show is brought to you by Firebolt, the cloud data warehouse for AI apps and low-latency analytics. Get your free credits and start your trial at firebolt.io. All right. Hello and welcome back to The Data Engineering Show, everyone. Today I'm super happy to have Rohit Girme on. Rohit is a staff software engineer at Airbnb and has been there for about the last seven and a half years. Welcome to the show. Do you want to introduce yourself and tell the listeners what you're working on?

Rohit: Cool. Yeah, thanks, Benjamin. Thanks for having me. Excited to be here and share what I've learned till now. As you said, for the last seven and a half years I've been working at Airbnb as a staff software engineer. More recently, I have been on the search and ML infra org, figuring out how we do evaluation in this world where everyone's using AI: how do we get the right data, is the data good, what is the definition of a golden set or evaluation set, and so on and so forth.

Benjamin: Okay, very cool. Nice. So maybe take me through some of the projects you're working on at the moment, like what's top of mind.

Rohit: Yeah, so as I said, for the last year and a half, as some of the GPT models and Claude and DeepSeek have taken off, we've been figuring out how we use AI in the product in general, and you can see this in the Airbnb earnings call and everywhere about how we are using AI. But I think the issue here is: how do we know what we are building will work correctly? Is it going to produce the right results?
So that's why we've started building what we call the Gen AI evaluation platform. If someone comes to us saying, "Hey, I'm integrating this LLM in our app or in our product surface", and it could be a third-party one or a fine-tuned open-source one, the question is: how do I make sure it works, and what does the evaluation story look like? Because earlier, with code, it was deterministic, more rule-based. You knew what the code did, or at least most of the time, and then you wrote unit tests around the code to make sure it worked the way it should. But with AI, it's a black box to us as well, right? We don't know how it's working underneath. We are not the owners of the code inside it. So we have to figure out another way to evaluate the surface.

Benjamin: Right. Okay. So take us through the tech stack when you run these evals, right? How does it work under the hood? What systems do you use for it? Do you self-host? Do you use services? Would love to understand that better.

Rohit: Yes. So this is very similar to what every other company's doing, but I guess tailored for Airbnb. We use Python for all of our services and to

From Customer Support Bot to Evaluation in Action

build all of the code and to build the framework itself. Then we reach out to either Azure or AWS for our hosted LLMs, and internally we use some of the open-source inference frameworks; vLLM, I think, is the most popular one. So that's the hosting part. Then we have several layers before that: we have a gateway to talk to all of these services, and we have our own serving infra built on top of Kubernetes to run these vLLM-based inference engines. On top of that is the evaluation platform that we built, and it's configurable in a DAG sort of way. So if you know Airflow, which came out of Airbnb, you can create this DAG saying, like, hey,

Why Monolithic Prompts Fail: The Case for Specialized Judges

here, these are all my steps, and then go run it.

Benjamin: Okay, take us through a concrete example, maybe a product feature in Airbnb that's infused with AI, and then how it connects to that eval platform and how data ends up there.

Rohit: Yeah. So I think one example that's already out there is the CS (customer support) bot. The earlier issue was that when users raise a customer bug, it goes to a human support agent, and they figure out what the policy is and how to resolve it. So there was some friction there, and to make it easy we introduced an LLM layer on top of the customer agent. When people reach out, they hit the customer support bot, and if the bot cannot answer the question, it's pushed on to a human. But we need to make sure this interaction between the customer and the bot is functioning as necessary. So that's the use case for evaluation.

Benjamin: Right, okay. And then that ends up in your eval step, right? How does a customer support interaction like that end up mapping to, like, an Airflow DAG, basically?

Rohit: Yeah. So I think there are different steps in the evaluation, right? First is when you actually define what we call virtual judges, because they are replacing human judges. You need a dev-test cycle similar to your software engineering flow, where, let's say, you tune your prompt and figure out the best way to identify some issues. We found out that having one giant VJ, or giant prompt, doing everything doesn't work. So we figured out that we need to split it into multiple ones, each focusing on one specific metric, like: does the data retrieved for this policy even make sense? So content relevance could be

Real-Time vs. Offline Evaluation: A Dual Approach

one judge. The other is: given this customer issue and this policy, does the output of the bot rely on just this, or is it hallucinating and coming up with some random stuff? So that's one metric, and so on and so forth. We build a bunch of these. That's where the framework or platform comes in: it helps you tune the prompt and test it against some data. Once you have that ready, you need to plumb it into a larger workflow: this is my incoming conversation, now run these five judges in parallel or in some sequence. If it's in sequence, it's more like an if-else: if the first VJ says something's going wrong, then we run another one to figure out what the actual issue is. So they could be in sequence or parallel.

Benjamin: Gotcha. But does that then also run during my support interaction? So if I have a support interaction with the Airbnb chatbot, it might in the background call five LLMs or agents that generate different responses, and then there's this agent-driven judge that decides which one I see? Or is all of this offline?

Rohit: Yeah. So again, in the flow of evaluation you start with dev-test, where you tune your prompt. The next step is figuring out what the right workflow is, and you test that in isolation. As part of that step, maybe human evaluation is also needed, right? First the volume is handled by the LLMs, then you filter it down, and then they are uploaded to humans. That's the workflow. But where you run it is also important. You have to run it when the CS interaction is actually happening, and then as a follow-up we passively monitor these conversations, because in real time you cannot have the customer wait for several seconds until your LLM figures out whether this is a problem or not.
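The judge workflow described above (several single-metric judges run in parallel, with a follow-up judge run only when one of them flags a problem) could be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not Airbnb's implementation: the names `Verdict`, `content_relevance`, `groundedness`, `diagnose`, and `evaluate` are hypothetical, and the judges are plain string checks standing in for LLM calls.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Verdict:
    metric: str
    passed: bool
    reason: str = ""


async def content_relevance(conv: dict) -> Verdict:
    # Stand-in for an LLM judge: is the retrieved policy about the customer's topic?
    ok = conv["topic"].lower() in conv["retrieved_policy"].lower()
    return Verdict("content_relevance", ok)


async def groundedness(conv: dict) -> Verdict:
    # Stand-in for an LLM judge: is every claim in the bot reply backed by the policy?
    policy = conv["retrieved_policy"].lower()
    ok = all(claim.lower() in policy for claim in conv["bot_claims"])
    return Verdict("groundedness", ok)


async def diagnose(conv: dict, failed: list) -> Verdict:
    # Hypothetical follow-up judge, run only when a first-pass judge fails.
    names = ", ".join(v.metric for v in failed)
    return Verdict("diagnosis", False, f"failed: {names}")


async def evaluate(conv: dict, judges: list) -> list:
    # Parallel stage: every single-metric judge sees the same conversation.
    verdicts = list(await asyncio.gather(*(judge(conv) for judge in judges)))
    # Sequential "if-else" stage: dig deeper only if something went wrong.
    failed = [v for v in verdicts if not v.passed]
    if failed:
        verdicts.append(await diagnose(conv, failed))
    return verdicts


if __name__ == "__main__":
    conversation = {
        "topic": "refund",
        "retrieved_policy": "Refund policy: guests get a full refund within 48 hours of booking.",
        "bot_claims": ["refund within 30 days"],  # not supported by the policy text
    }
    for verdict in asyncio.run(evaluate(conversation, [content_relevance, groundedness])):
        print(verdict)
```

Keeping each judge to one metric means a failing verdict already says which property broke, and the more expensive follow-up step only runs on the conversations that need it.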
So the kind of model used and the kind of prompts used also differ. If it's a real-time interaction, you keep it lightweight, with sub-second latency, so you have to adjust your models accordingly. You need smaller models with fewer parameters, tuned very specifically for that. And this happens in real time. You check that it's not outputting garbage and that it's adhering to the brand values and what the company stands for, while also trying to solve the problem. So that is real time, but then we also have an offline workflow, which uses maybe more prompts, more metrics, and bigger models to figure out: something did happen, now how do we fix it?

Benjamin: Awesome. Okay, that's interesting. Nice. So if you look ahead to the next couple of months, what's on your roadmap that you're really excited about? What's a big area of focus to improve for you and the team right now?

Rohit: Yes, good question. First of all, there are some incremental enhancements that we want to make to improve the surface areas that use AI, across different areas in the product. The first one was customer support. Maybe then it's the search experience, maybe it's the booking experience. So all these flows need to be updated. But going ahead, the other thing is also how we make it very easy and fast for users to experiment and onboard

Using AI as a Tool, Not a Replacement: The Human Orchestrator

these LLMs, right? Because as our CEO says, we need to move at the pace of AI. You can't use your traditional slow-moving process. Everything needs to be very quick. So we are figuring out ways to make this faster, which could mean improving the tooling, improving how people use the tools, everything in the flows. Are the evaluations running faster? How do we adjust? Let's say you have a certain bandwidth assigned by Azure. Are we using the entire bandwidth? Are we using all the tokens available? Are we getting rate limited, and how do we solve that? All the way to: do we have the right tooling for, let's say, observability? Are we getting the right traces back from the system? Once we have the traces, are we using them to build the right product? So that's the cycle: you build something, it outputs traces, and the traces go back into your product. I think that's the entire gamut that we are looking at.

Benjamin: Interesting. Cool. Nice. So maybe shifting gears a bit. This is a very, let's say, AI-forward part of your product at the end of the day. How do you build that leveraging AI? Take us through your software engineering life cycle: what tools do you use to build and test, how's that going, what's working well, what's not working well?

Rohit: I think that's a great question, very relevant today, right? Because I think no one's denying AI slop. It's there, and maybe it's here to stay. What I feel, or what I see, is that humans should be the orchestrators of these tools and not just hand off everything to them. I think

Measuring Real Productivity Beyond Token Consumption

that's the best use case right now. So the way I personally use it is, because I have all the context, I will split the problem into smaller chunks and give enough context to these tools, including, let's say, Claude Code or Codex, to solve small problems. Then I will do the part of combining everything together, instead of just handing off the whole problem to the LLM and having it figure things out. Most of the time, what I've seen is that if we hand off everything, it will make a lot of assumptions. It doesn't know the code well enough, because context is limited, even though I know there are millions of tokens of context. I think this works great when you are building from zero to one, because there is nothing there, so there's no context and you can build cleanly. But in companies where you have, I don't know, millions or billions of lines of code, it's not enough, and that's where I think we as humans come into play: you have to break down the problem and then use it as a tool.

Benjamin: Makes sense. How do you learn as a team? Let's say you figure out the specific workflow where Claude Code works great, right? How does organizational learning work in this world, around how to best leverage these tools and become better and better?

Rohit: Yeah, I think that's relevant. Again, I think we start with the age-old way of surveys.
We survey developer experience and figure out what needs to be improved. But one thing we have seen is that documentation has become even more relevant now. Earlier, humans had a lot of tribal knowledge, you know, like one person who's been there for 10 or 15 years knows everything about the code. But that doesn't scale, right, because he's one person. Now, if it's an LLM, it needs to know everything, so everyone can scale up. So I think documentation is an important way we are going ahead: how to format the docs so that agents can use that content better. There are different formats; I think direct access is one very popular one. Documentation is one aspect. The other is that everyone's using LLMs, right? So how do you measure productivity with LLMs? One way is to ask: is everyone using LLMs, and how many tokens are they using? But I think that's not enough. You need to figure out whether they are building something on top. So we are adding a lot of scaffolding or instrumentation around these calls: if someone uses an LLM and it outputs something, was that helpful or not? I think that's one way to figure out if it's helping or not. And I think lastly it's also about attitude

Zero to One is Easy, One to N Still Needs Humans

or your way of thinking. Earlier, what we did is we tried to solve a problem, and if we got stuck, we went and searched using some search engine or looked at docs, right? But I think that probably needs to flip if you want to move fast: you get all your information using AI or the LLM, you process that, then you break down the problem and solve it in chunks. So I think that's like a 180-degree shift from the earlier approach, which I think is helping.

Benjamin: Super interesting. Nice. Cool. So, awesome. Unrelated to your specific roadmap, maybe to close out today's episode: it feels like engineering is changing so much right now in general, and it's a very exciting time to build, but it's also crazy to think about how the future of engineering looks, right? And how much the way we build will change half a year from now. What are you excited about, and what do you hope will actually improve in terms of our development workflows as we go through the years and the models and the workflows get better? Not necessarily related to Airbnb, actually.

Rohit: So I was reading this article which talked about a survey about what the general population feels about AI. In some countries it's very positive, in others negative. Maybe that's how the narrative is being conveyed. Some companies in the States have this dystopian view of "we will replace 50% of the programmers with AI." Maybe that does happen, I don't know. Maybe that's a marketing technique to bump their valuation, I don't know. What excites me is that there's actual value in this, right? I've seen myself how I have supercharged my day-to-day workflow, either at work or at home, where access to information is so easy. I don't have to read 10 websites to get what I want. So text summarization, information

Key Takeaways & The Future of AI-Driven Engineering

summarization, or information democratization, is much easier with LLMs. So I see that being helpful, and that may evolve. The second is that there are always going to be blind spots for every person, and that's why we work in teams. But I think with AI it'll become even faster, because you have this very short cycle of you talking to the AI instead of five humans in a team. So there will be a faster cycle there, which I think would help in this whole software development process. When you put all of these together, I think shipping products or features would become even faster. Earlier, if you were following the waterfall model, it took, I don't know, weeks or months to go from planning to feature; now it'll be days. And lastly, the way the internet democratized information, I think with LLMs it's capability that would be democratized. So if you have a good idea, you can build it very quickly. But there's a caveat, right? Zero to one is easy now. But the one to N, which is the scaling part, I think we still haven't figured that out. You still need humans for that.

Benjamin: Awesome. Cool. Well, thank you so much for being on the show today. It was great having you. And excited to follow your work at Airbnb.

Rohit: Thanks, Ben. Great chatting with you. Hopefully this was helpful to everyone.

Benjamin: Definitely. The Data Engineering Show is brought to you by Firebolt, the cloud data warehouse for AI apps and low-latency analytics. Get your free credits and start your trial at firebolt.io.
