Become an Agentic AI Engineer who designs production-ready systems: https://academy.towardsai.net/courses/agent-engineering?ref=1f9b29
Our FREE Agentic AI Engineering Guide. A 6-day email course: https://email-course.towardsai.net/
#ai #agents #agenticai
Оглавление (12 сегментов)
Segment 1 (00:00 - 05:00)
Welcome. Let me share with you what we believe is the AI engineering foundations that you need to know as AI engineers coming from the Towards AI team. I've given this in a workshop very recently and I thought it would be an interesting video as well for you to know and why not give it for free since it's just part of the workshop and the actual five other hours are really more applied and super interesting, but we have to reserve that for the companies interested. Still, I hope you enjoy this foundation talk of around an hour, I would expect. First, why learn to build with language models? Well, because they can be linked with existing products nowadays and with existing workflows which have been difficult based on the non-deterministic nature of LLMs. But now, we can finally link them with structured outputs and tons of other features that we add. And this is interesting for AI engineers because LLMs are not good enough out of the box. They do mistakes, so we need to mitigate those mistakes, those errors, those hallucinations with retrieval, fine-tuning and other techniques. But also, they just generate text. So, we can implement tools, web search, better prompting and tons of other things like agents to have them do things rather than just generating text. And when you do that properly, they bring a competitive advantage to the application and fortunately for us, you don't need huge budgets and it's not just the biggest companies that can implement language models. We can now all do that, even a very, very small startup, which means it's very relevant to learn to be able to implement language models into existing applications. And any software developer can do it with some upskilling and practice, which we hope to provide in this training. Well, this is the foundational training and typically in companies we give like a whole day after that of teaching the agentic stack and agentic coding and proper ways to do so, which will have some more videos on that, so subscribe to see them. But otherwise, we also give trainings inside companies and reserve a lot of the more applied training for the companies that contact us. If you are at a company and want such a training, please reach out. Now, what do we teach exactly and in this training? Well, we teach the AI engineering role. I already shared this slide many times, so I will just go very quickly over it, but it's basically the new role between ML engineers where I come from and product engineers or full stack engineers. It teaches how to build around language models, but not to train them or to build the language models themselves or to integrate them into existing applications, create the infrastructure around it, create the toolings, the chains, the agents, everything around language models, where the core competency is architectural decisions. And who are we to teach this? On my end, I'm Louis François Bouchard. I'm the CTO and co-founder of Towards AI. I was a PhD student until I fully switched into Towards AI to educate and build AI solutions for companies in 2020. And this, as I said, comes from a workshop we did at a conference with my friend Paul and my colleague Omar, where we did our best to teach the agentic coding and best practices to the conference. So, if that interests you and you want a personalized training for your company, please reach out and we might come. But otherwise, for this talk, it would just be me and our lessons from Towards AI. So quickly, Towards AI, well, we are an educational platform in AI since 2019 and we've done tons of things. We wrote a book in 2024 and working on a second edition. We have an online learning platform with online courses to upskill AI engineers and we do coaching and training for companies. We've built tons and tons of free resources and not only on this YouTube channel, but also in our blogs, Discord community and newsletters. Now, let's get into it. I will do a very quick overview and typically I tell people to stop me with anything, but right now it's hard for you to stop me. So, I will just try to go quickly, but try to be as understandable as possible. So, first we need to understand the obvious, the language models. So, LLMs are obviously in the AI landscape. But, the interesting thing here is that they generate stuff. And not only they generate stuff, but they understand things even better. And this is where generative AI comes from. Typically, we only saw predictive artificial intelligence
Segment 2 (05:00 - 10:00)
especially before ChatGPT, where they mostly would learn representation from data in order to classify something or to understand patterns, which generative LLMs also do. But, here with the goal to predict what happens and what happens next. To create new data similar to the ones that we have. So, in the case of language models, to create new text similar to text we have given it to teach it. — [snorts] — Some examples of predictive versus generative AI systems is the image classification or sentiment analysis, just basically giving a label to some inputs. And the generative examples is more of continuing the generation, so generating more text, more images, more tokens, generating pixels. And in order to do that, there are four components that make large language models exist. First, we have the data that is necessary. Then we have tokens, embeddings, and a deep neural network, which is called the transformer, the most important architecture for powering language models. And I believe those are the four core components we need to understand for a good foundation of language models. And the other things, the math, the whole everything works underneath, isn't so important for AI engineers compared to those four pillars. First, data. So, there are two kinds of data to train these language models, which is important to know just to be able to understand their limitations and what they are good and bad at. First, we have basically the whole internet and tons of books. They are trained to replicate the internet, Wikipedia articles, and stolen books, basically. But just tons of writing that is available on the web. And also your GitHub repositories that are public, ideally only public. There's another type of data that is needed is the curated data sets that are basically samples of questions and answers that we want the model to imitate. So, that's a much more expensive type of data to produce, but we also need that when training language models. We'll see that quite shortly. And then, since language models just work with numbers and don't know what text is, we need a way for them to understand our language. And these are the tokens. Basically, our text is split into smaller chunks that each are associated a number. It's basically like a large English dictionary where each specific word has its own number. And so, this is what models see instead of the actual letters, actual words. This is also why we have some problems with like the number of Rs in strawberry because they don't know letters, they just know that, for example, the first word the is 976, which doesn't mean anything. And for models to mean something, we need to transform these tokens into embeddings. Embeddings are basically other numbers. So, basically just more numbers to represent the same token, but we can see that as plotting each word into a multi-dimensional space. Here, the example is just on a three-dimensional space, XYZ, but in reality it's like thousands of dimensions, which we cannot see with our eyes. So, we just stick with three for the example. And here, what we see when it plots kitten, cat, and dog is that they're all very close together in the space, which means that the embeddings for cat and kitten are very similar, which means to the models that cat is very similar to kitten and means somewhat the same thing. So, this is what the embeddings do for language model. They create the meaning of each word. And all this meaning is just based on uh these two points and their distance from each other. So, the only understanding language models have from text and from tokens and from words is how close they are to each other. That's it. They don't have any other type of understanding than mathematical distance in this what we call the latent space here in this three-dimensional space, but thousands of dimensions in the real case. And lastly, we have the deep neural network or the transformer, which very quickly has two crucial blocks that are repeated many times in order for them for the embeddings to be transformed and understood by the model. First, we have the attention mechanism
Segment 3 (10:00 - 15:00)
that will basically be used to compare each word with the others to have the model understand what's happening in the sentence. So, for example, here the word bank, we need to tell the model if we are talking about the financial bank or just a bank of a river. And thanks to the sentence and the attention, the token bank or the embedding of the word bank will talk quote unquote talk with the other embeddings to be able to understand what's the situation here. So, that's the role of attention and it's crucial for the model and for the embeddings to have a common sense in the sentence and to properly understand the sentence. And then, we have the other part of the network, which is called the feed forward layer. And this is basically a tiny neural network inside the transformer that will basically transform each embedding into a new version based on the transformation from the past attention layer. So, we are just transforming the current embedding into a better one for the language model's understanding. So, it's not really important to fully understand the the types of connections and models here. It's just important to understand that there are two crucial blocks, one that serves to connect each embeddings together and another that serves to evolve each embedding individually. And then we repeat both many times until we predict the next token. And we do that again and again for each new token. So we do one token at a time, which is why transformers are called auto-regressive because it can just generate one token or one word at a time, which is why it also allows streaming because it's by default how they generate text. But to be able to generate these tokens efficiently and in good quality, we need to train these models. So what's this training about exactly? Well, it has three phases, the pre-training, the post-training, and uh in the post-training, we have instruction fine-tuning or instruction tuning and reinforcement learning or reinforcement fine-tuning. I'll go over these very briefly to understand what type of model goes out of these because we can only need pre-training for some tasks and only need uh post-training or more post-training for some other tasks. So first, pre-training. The goal of pre-training is to learn how to write to predict the next word and basically understand our language. And as a consequence, the model will learn somewhat of a lossy compression of the whole internet and books it digests. And what this means is that when uh you then give a sentence to a model, it works as a total completer and just continues on the sentence. So if you ask a question or an instruction, it might give you very unexpected behaviors like continuing the instructions here that we see on screen or uh just asking you more questions instead of answering your questions or uh answering your instruction. This is because the model wasn't trained to answer you. It was trained to complete the sentence. This is what pre-training does. For post-training and more specifically fine-tuning or instruction fine-tuning, we are refining the model for our needs and it's the same thing as the initial training where we teach it to write better, but in this case, we give it examples of what we want. So, we teach it we give it a question in the prompt and then we teach it to regenerate exactly the same answer token by token. And this makes an instruct model, which means that it follows instructions. If you ask it to write a poem, it will do that instead of just completing text. And as I said, here it needs a very expensive curated data set. And finally, we have reinforcement learning from either human or AI feedback, which serves to align the model. Here, we are not teaching the model token by token anymore. We are rather gathering examples of answers it gives, ranking them with a human, training a model a reward model to imitate these human rankings, and then using this new reward model to further train the initial language model. So, basically, we are just trying to align it with the better responses that we get and not teaching it to regenerate tokens, if that makes sense. So, we basically have for each answer one training signal, good or bad
Segment 4 (15:00 - 20:00)
compared to uh when we were doing pre-training and fine-tuning, we had a reward signal for each token generated telling it exactly if it was the right token or not. And this has the benefits to um make the model talk the way you want and use uh like bullet points if you want to use bullet points and the types of answers you want or even don't want like teaching it to refuse malicious queries. This is in the reinforcement part you will teach it. And then once you do this final training step you can serve the model to your users. And what this is it's called inference. It's basically streaming the tokens. You have chat features, history and from the open AI library you can have a system prompt and assistant prompt and user prompts which are basically list of dictionaries between the assistant and the user. And obviously you are trying to balance quality, cost and efficiency with either smaller or larger models and augmenting language models with memory and tools as we will discuss. So just to recap what's going on. We have a general knowledge LLM that we first pre-train. It learns to understand our language and generate it back. Then we fine-tune it to get an instruction model to follow instructions. Then we train it further to align with our needs and to teach it reasoning and other things that we want to teach to basically answer our questions better. Then optionally you can distill the model into a smaller one. So basically teaching a smaller model from the big model to save on cost and increase on efficiency. That's fully optional but for example all big companies do that with Gemini Pro and then Gemini Flash. They distill the Pro version into the into a smaller flash one for us to use. And then they serve it to user where each new query is tokenized, embedded as we saw and processed auto regressively one token at a time to answer the user. And here I mentioned autoregressively. So, this means that we generate one token at a time, which implies that each new token is based on the previous sent and the training data of the models, what we used to train them or what OpenAI used to train them. It also means that it's purely statistical. It's not really intelligent or it's debatable, but it's just it works on by using statistics and what are the chances that the next word is this one. It's definitely not conscious and at least it doesn't have the same intelligence that we have. They are probabilistic and not deterministic, uh which can be a problem for many applications when especially when you implement them in your applications. They remember fact if they see that often, but it's not a real memory. You cannot be sure that they will remember a fact. So, that can be quite tricky to handle and deal with and you need to be careful around that. Also, it cannot tell the difference between a truth and a lie. And even though it says that it was mistaken, it doesn't know any of this. It just generates tokens. And that is all true whatever the size of the model. But, the good thing is that all of that is just for now. As you know, language models evolve extremely rapidly and this may all change in the next few months even. I also put a list of useful resources here for more theory because that's basically the entire theory that I wanted to share. But, it's definitely not the end of this workshop. There's a lot of other things that are relevant to know as AI engineers that isn't the theory. And the first of which are the limitations of language models. We definitely want to know about them in order to best implement them in applications. So, what are the biggest limitations of language models? First, the hallucinations and errors. You all know this already, so I will go very quickly, but they can basically invent things. Fortunately, we can use more reasoning or tools like coding to ensure and force the models to not hallucinate. There are tons of fixes that we can do for hallucinations, but we always need to be careful about that. Second, biases. All language models are biased, that's normal. All humans are
Segment 5 (20:00 - 25:00)
biased even, and there's no real fix for this. The best thing you can do is to have the best data possible, and since typically we use language models from other companies, there's nothing we can do. But we can still prompt it as good as possible and give it future examples as neutral or as representative as we want to diminish the bad biases and augment the good biases. But in any case, language models reflect the stereotypes present in training data, and there's some things we can do about it, but not so much. But we need to be careful about the stereotypes and have evaluations to measure those and change our instructions and prompts to be able to limit those. Then, we have the classic knowledge cut off because these models are trained once and then typically not trained again. They probably need internet access to be able to have updated information, but otherwise, typically some knowledge is not in the training data set, and it may cause problems in this case, especially if you work with like your proprietary data. So, here you may need to add information in its context through rag or fine-tuning or whatever technique that best fit the situations. We will talk about those very shortly. Another limitation is the context window, because even though these models can go up to 1 million tokens, which is very large, it typically uses techniques like infinite attention that basically just works as usual processing around 100,000 tokens at a time and do that in a loop. And each time it processes the next 100,000 tokens, it compresses the previous ones, so you lose information at each steps, which means that the results degrade over time the more tokens you add. This is called needle in a haystack, and basically it happens because to train long context models, you would need a large data set of very large books and content where you ask questions about the book and you want the model to understand the whole book to be able to best answer your question. But this type of data set would be extremely expensive to produce, and instead, what they do is that they insert a needle in the book, so just one fact in the book, and they ask the model about that fact, and they basically train the model to retrieve it better. Which is far from ideal because it just teaches the model to retrieve one thing, but ideally you want the model to be able to leverage the whole context to be able to answer your query. Which means that the more information you give the model, the less chances you have that the model will use the whole information to answer your question. And typically, instead, you want to use something like retrieval to chunk larger context into smaller pieces that you can only insert if relevant. Now, next limitation is the reasoning limitations of language models. They are basically very struggling with basic logic. They overfit to memorized examples rather than just genuinely reasoning. Just like the examples here that where we see that they are basically very bad at math. But unfortunately here, they can use code to be better. This is an old example and it has been fixed, but there's one recent with even Opus 4. 7 that is very popular these days. When you tell it that you want to wash your car and you ask it if you should walk or drive because it's just 50 m away. It says that it's shorter to walk there because it's super close. But you need to wash your car. So obviously you need to go by car, but anyways, that just shows how Claude or any language model is not truly reasoning or like genuinely thinking like we do and it's just generating tokens. So that's just a limitation to always keep in mind when you use language models or when you implement them because you cannot assume they will understand the user's query. So, how to make LLMs work for you? First, you need to pick the right pattern. Whether it is a workflow, an agent, multiple agents, you can use them interchangeably. And if you use the wrong one, you can end up either spending way more tokens than desired like using a CrewAI multi-agent for something just one LLM code could have done. Or on the opposite, you could just build a workflow that would not work at all for your use case.
Segment 6 (25:00 - 30:00)
And to be able to choose which framework is best, you need to think about the task and ask yourself questions. Which is what we're going to do here. And typically, I like to follow some kind of progression ladder like this one where you definitely want to just stop at the first level that passes your test set or evaluation set. So basically, you need to build yourself some kind of evaluation set. So examples of questions and answers that you actually would like your model to follow. And then, you try just a prompt in ChatGPT to see if it works. If it doesn't, you try longer context adding more information. If it's too much information and it doesn't work, you try adding retrieval and then augmenting the workflow and then going to agents and multi-agents. But every step to the right here increases the cost, the latency, and the debugging complexity, and it decreases the control you have over your system. But it adds autonomy. So it has pros and cons, and the only way to figure out what you should do is to think about their problem and try them one by one from simpler to more advanced. But typically, the best is to start with just prompting and Kag because you don't need to create any type of infrastructure, you just test with whatever system that is already there. So even prompting can be quite advanced. You can have your system prompt, but also few-shot examples and detailed instructions on how you want it to do things. It can learn a lot on the fly giving the right examples to your system and even counter examples. So you will start with the prompt and the few shot examples, run your evaluation sets, which is basically you can have just a quick judge rating the I actual answers of your system and giving true or false binary rating on different criterias that you come up with. But basically, you just have an evaluation set and you evaluate it either by looking at it or by having some kind of um, prompt with another language model evaluating it. Then, obviously, if the quality passes uh, the the minimum bar, well, you ship it, you stop there, but otherwise, you add the next step, which would be CAG or context caching. And here, for context caching, you basically want to use it for adding extra content to the prompt, but that you know before query time. Which means that, uh, you have like a static context, like a big report that you know that you will query and ask multiple questions. Or like, if your users want to ask questions about some of your policies, you can just give it all your policy completely. But you only want to do that if it fits 200,000 tokens or so, maybe less, but not more, because otherwise, the results start degrading because of the limited context window that we discussed earlier on. And if the data that you want to give to the model is much larger, in that case, you may want to use rag. For rag, you basically want to use it when you don't know what exactly the user will ask about at query time, but you have the answers in your data. You just want to give it based on what the user asks about. Which basically means when the knowledge changes frequently or when you need citations or traceability to answer the user confidently. And ideally, you want to optimize this part first if you add additional data, because if you don't give the proper data to your language model, even the generation doesn't matter. So, rag is basically a more dynamic way of providing the right data based on the user question. But, it's often overkill, as we said, when it's smaller than 200,000 tokens, when your data is smaller, like your policy is just a two-page PDF, you just want to always give it directly. It will be way less expensive and easier to put in place. Then, after adding a basic retrieval or contest caching, you may want to jump into workflows if your task is not solved. When to use them is when the steps are known beforehand. So, when you can define each step, the order, and the validation gates, so when to exit the workflow or continue on depending on different cases. So, if you already know your steps and it's always the same order, well, you just build a workflow directly. It's way cheaper, way faster. You just have hard-coded conditions or directly defined in the prompts. It's perfect, deterministic, easier to debug, to test
Segment 7 (30:00 - 35:00)
and to improve. But, if you need more dynamic branching and more autonomy, you then may want to switch to agents. Which is our next step. So, when to use agents? Well, it's when the next step depends on what you would discover at the current step or mid execution. The best signal to use agents is when you have potential incomplete data from different APIs that you use, when you need to branch dynamically under certain conditions that you cannot predefined, or where the steps can vary depending on the user querying the system. And here are some general directions that we give to expose tools via MCP wherever possible so that any of your systems can use it whether you are using cloud or cursor, you will be able to use the MCP from the client. We also have some non-negotiables. You need termination conditions to not have infinite loops. You need maximum iterations more than subjective ending such as looping until the results are good enough. This often doesn't work. It's much better to test yourself and for example have just three iteration of feedback. That's it and it happens every time. We've seen that in our case having a maximum and minimum of three iteration or whatever the number based on your current use case is much better than just trying to reach a minimum limit of subjective or quantitative feedback that a judge would automatically give it. In any case, that's when you have loops agentic loops. You just want to have more deterministic number of loops than a more flexible subjective number of loops. This is all with the same goal of determination conditions to not enter infinite loops that would cost way too much money and not yield better result necessarily. And then we have the thin agent heavy tools that we heavily rely on where basically you want as few agents as possible. Just one general agent is ideal with a tools serving as capabilities. So it's basically your tools doing all the work and the agent using them. But agents and workflows don't live separately. We typically use both. So real architectures are compositions of workflows and agents with a deterministic router where you basically use workflows for known intents and steps that you know in order and use agents for the more open-ended parts of the query or the system. And some systems, examples of this is like a support system where you would have a very specific workflow for FAQ with retrieval, but an agent handling the refund and escalation with multi-step action. And for coding, for example, you could have a workflow for lint fixes, but agents to figure out why some tests would fail. Now, since all architecture need to manage context and basically manage a discussion over time, we need to talk about memory. And with these models, we can have different types of memories. First, we can have the working memory, which is basically the current and past few turns prompts, the tool outputs, and every discussion that you had with the language model recently. Then, we can have the episodic memory, so just past sessions that you saved or that the model saved for you, that you retrieve only if relevant. And you have the profile memory, which we used in our course to give the system user fact, such as how to write, how to reply, preferences, and different settings that you want the model to have based on the user. Which is the default system for more advanced system to have all three of these in three different layers to be able to debug and use separately, independently. Some do's and don'ts for these is that first, you only want to store the facts that are actually sourced, and that you can attach a confidence score to them. You don't want to store everything, obviously. So, you don't really want to rely only on the user, but rather, if you can confirm the fact online or if you have a source attached to it from a folder or from a past research of your research agent or whatever, you may want to save it. Otherwise, it's not necessarily relevant. Then, similarly, you don't want to blindly take what the user says because the user can be joking, it can be wrong, they can change their mind. So, you definitely want to do some processing for the memory. And that's also to avoid prompt injection because it can be a way to
Segment 8 (35:00 - 40:00)
inject models, to sneak untrusted input into the system memory. And here specifically for multi-agent or anything that persists across sessions, the memory access should be very explicit. The system should clearly call something like read or write instead of doing it silently. You should be able to trace if it's reading or writing to your memory. Um so that you can be able to revert or review what's being done. And speaking of memory, this also means we add it ultimately to the context of the model so that it can use it. And unfortunately, we cannot add everything to the context because it becomes way too big. And bigger context windows don't fix the context problems. Like you cannot just send more to the context and it will work. As I said, 1 million tokens isn't 1 million truly useful tokens. You have the needle in a haystack or lost in the middle problem that is real until 100 or 200,000 tokens that the results get worse and worse after that. So, you need to optimize the token. Likewise, tokens cost money. So, you have a budget per task and you need to respect it. You don't want to shove everything into the context and pay the cost of all these tokens. So, in order to optimize that, you want to compact aggressively. You want to summarize the old turns, replace the tool outputs with just the the final decision, the final valuable output, and you even want to reset if the system feels like the session is drifting. This is what Cloud does a lot where it sees that you are trying you know different things, it may ask to just reset the session into a new one to be more efficient. That's a very good thing to do in many cases. And lastly, you may want to delegate to tools or sub agents with their own context window so that you reduce the current agent or the current LLM's context window. And so, talking about compacting as much as possible, what deserves to be saved or to be added to the memory to then be added back into the context? Well, the main preferences, constraints, tasks, facts, decisions, anything restated each session that you keep on iterating, that you truly need to give back every session should be added to the memory. And you should skip everything else, whether it is like guesses or tool outputs or just verbatim logs or timely things, you don't want them in the memory. And ideally, you want to let the user be able to inspect and edit memory themselves just like ChatGPT does with their memory. Because memory poisoning is real, and users can help with that by removing random facts that your system decided to save that wasn't automatically relevant. Okay, now let's switch very abruptly to fine-tuning or just changing LLM's. Should you do something to your language model instead of just building around it, what should you do with it directly? Well, for fine-tuning, you typically don't really want to do that very often because fine-tuning changes the behavior of the model, but it isn't ideal to bring new knowledge to it. And there's tons of downsides when you decide to go the fine-tuning route because first, it's very complicated to do, very costly to do, you need data to do it. And then, if you do it, you are stuck with this model, and if there's a new better model, you need to get that model and refine to it. And typically, proprietary models like Gemini or whatever model is much better than the open alternative that you can fine-tune. And if you are to fine-tune, well, it's important to know that SFT and other fine-tuning techniques is pretty bad at injecting new facts and ensuring that these facts stay verifiable and are actually inside the model. It's typically much better to use rag or even tag to inject knowledge, especially in production. And so, when do you want to fine-tune? It's mostly when prompting fails and you have lots of labeled examples already that matched what happens in production. And a good example for that is if you need to teach a new language to the model, like a European language that is not really used that much and that ChatGPT doesn't work that much. Or to teach some kind of programming language that it doesn't know about yet. And typically, you may want to use some efficient techniques like Laura to fine-tune your models. Otherwise, you could use also reinforcement learning techniques
Segment 9 (40:00 - 45:00)
if you have pairs of examples like an answer is better than another, you can use reinforcement learning techniques to train the models to imitate answers and not others. It's easier to do and needs much less data. And also, it's to teach behavior, not new knowledge. And typically, our clients do that most times because they want to have more privacy, so they want local models. That's I guess the best reason to fine-tune and retrain smaller models. Otherwise, you are probably better off using whatever Gemini Cloud or GPT models with better system around it, better retrieval, better prompting. And so, with this, we have the next question that most people ask, should you just use a bigger model? Well, not necessarily, because a bigger model is much more expensive. It can be 75 times more costly or even way more if you use much cheaper one. It all depends on the task. So, you want to use the right model for a task. You want to use cheaper models for easier tasks, whether it is summarizing, routing, classification, just digesting data. Smaller language models are really good at doing that. And you want to use frontier larger models for synthesizing, for generating things, for planning, for more complicated tasks. And typically, you want to break tasks into subtasks before updating your model, and separating the better model for the better pipeline, and the smaller models for the easier parts of the pipeline. And here, I added that you may want to distill models. You can do that as well, which is another way to retrain a model using a bigger one to train a smaller ones. But this talk is just like to know about all these, not to teach how to do them, because we actually teach in a more applied way normally with the teams or the people directly or in our courses. This is more of a what to use and when. And here, distillation is very relevant if you want your own super small model running locally that could be very good on a specific sub field that the larger model is good at. So, you just you could use distillation in this case and there are libraries that help you implement that. Next thing we have to talk about is evaluations. Because as I mentioned, when you go from prompts to CAD, rag, workflows, agents, and multiple agents, and tools, you always need to evaluate for multiple reasons. What are these reasons? First, making a prototype takes approximately 10 minutes, but optimizing it can be extremely long if not forever. Because AI is not just magic, it doesn't solve itself by itself, even though cloud code codes itself, but that's another thing. It's typically software development and it requires the same rigor if not more because it's actually not deterministic, so harder to match with existing software engineering. And evaluations allows for many things. First, easier debugging, easier improvement, so knowing what to work on next, ensuring alignment with real user needs. It allows us to have confidence in our models, find edge cases, and failures that we wouldn't have found otherwise. And it helps with the march of nines. So, basically going from 90% success rate to 99% then 99. 9% reliably without just making cycles and turning around with different fixes that break other things. Because without evaluations, it's impossible to know if you are really progressing or just feeling the progress. And I have an example to illustrate that, a true example. LinkedIn developed a chatbot for a skill assessment, skill fit assessment to their user. And the chatbot responded by saying that users were terrible fits. That's it. So, it's like makes sense. Obviously, if it's a terrible fit, you want to know. But, that's not what the users wanted. They wanted to know, "Why am I a terrible fit and how can I not be a terrible fit? " And LinkedIn shipped this chatbot without having proper user feedback and user like evaluations. They just thought their chatbot worked because they had evaluations in place. They could tell if the user was a good fit or not. But, they couldn't be truly useful to the user. And this is why you need proper evaluation that considers your users or that reflects the real interactions with your users. Which is why you need to
Segment 10 (45:00 - 50:00)
spend time building these evaluations at first because if you spend time there when starting, you will save a lot of time in the future and you won't have such uh viral examples of the wrong thing to do or basically just dissatisfy your users. And what does evaluation look like in practice? Well, typically, after a proof of concept, you want quantitative data. So, you want to use metrics. And in addition of the subjective ones, obviously, you want to have your vibes as well. Look at the results, look at the examples, and feel if it makes sense or not. But, you also want standardized metrics and quantitative metrics. If applicable, you may want to use benchmarks. But, that's really not uh the core of evaluation. Ideally, you want to create your personalized evaluations where the goal is to prevent problems, solve limitations without creating new ones, understand your system's performance and what to work on and improve next. It allows for AB testing and for optimizing your system one element at a time. And as I said, you can continuously improve your system thanks to evaluations. And ideally, you'd align it with your real users. And there are three levels of evaluations to help you do that. First, we have the unit tests that are basically quick automated tests that you can set up in place, just like regular software engineering, with regex rules or strict conditions. Then you have the manual evaluations. You just look at the results or you set up an LLM as judge to do it that for you. Basically, to have some criterias that you give the model to and that you use to evaluate answers of your systems with the same data set that you use and reuse as you have all the system. And lastly, you have AB testing to literally test the system changes with real users, not in a test scenario. And lastly, we have to talk about that because what we build is facing real users. And what I'm talking about is security. It matters because language models introduce tons of new attack surfaces and risks beyond traditional software. I will go quickly over that because I believe you're all aware of this, but LLMs can be compromised and you can have data leaks. You can make them generate harmful content, do reputational damage to the company, and so you need to consider this at every step or every layer of your current workflow. There have been many examples of chatbots showing that they weren't secure enough. I'm not even talking about the club but era but before that where some people abused of big companies chatbots set in place without proper protection. And these protections are guardrails that are basically mechanisms that the AI engineers or ourselves can put in place to make LLMs and chatbots safer. We can do that at inputs, at outputs, or anywhere inside our workflow to prevent malicious uses. What does that look like in practice? In practice, an input guardrail can be just prompt filtering. You detect some keywords or even you have some smaller language model or classifier to detect if it's a malicious intent or not. You can have topic restriction as well as we do with our AI chatbot. We tell it to just answer AI queries. So, that's one thing you can do very easily. And on the output, you can have content moderation. So, have another LLM analyze the response for toxicity and just block or restrict any unsafe output types. And you can have something even more powerful, which would be format validation. So, you just always generate following a specific format, and if there's an attacker, they don't know the format you are using. So, if they are able to make the LLM say whatever they want, it probably won't be in the format that your system expects, and it will block it automatically. That's things we teach in the course. There are a few guardrails you can use specifically for agents like permission checking. So, just having the agent limited to do specific types of calls. You can have argument checking to ensure that the parameters passed to the tools are safe and well-formatted. And you can obviously analyze the outputs, same thing here. And regarding these attacks, we actually were part of a big research called the Hacker Prompt, where we tested many
Segment 11 (50:00 - 55:00)
people on trying to attack models, and we analyzed the different ways to attack models. We found out there were many different types of attacks that we all documented into the paper. But that's just to say that there are many different ways to attack models, and guardrails and defenses that you can put in place to prevent that. So, what's an attack specifically? Well, it's any type of untrusted text reaching the model. Just like we see here on screen, this is basically a very simple attack for agents, where you would try to convince the agent in the metadata of the the product or anywhere in the product page to buy the thing rather than following the normal instructions of the agent. This one is too obvious. It's just a meme, but this happens in metadata and hidden to the human eye. Basically, and we have two main different ways to do that, either direct injection, where you tell it, just like here, to ignore its instructions and do something else, like buying the shoes, or you can have indirect injections, where you have hidden instructions inside specific tools or even inside MCP, which you need to be careful of. And speaking of being careful of, we have a few best practices defenses you should put in place in your applications. First, you definitely want to use just better models, well-aligned models, which is why companies that you trust, like OpenAI, Anthropic, and Google, might be better than some other companies, just because they have guardrails in place already. Likewise for some open models, like Lama. You want to have robust prompting with clear instructions, explicitly telling the model to not follow the user's instructions and just answer the query in specific conditions. You may want to add input filtering and analysis as we discussed just for specific keywords or classifying intent. You may want to use parameterization, which is basically the same thing I described with structured parameters where the model exchanges in some structured way hidden to the user so that if the user injects something to make the LLM say something it shouldn't, it won't be in the right format and this will be flagged automatically. You may also want to have output validation, so just checking if anything is weird or not related to the topic it should be talking about. You want to limit capabilities of agents as much as possible, just reading if it just need to read or whatever type of uh capabilities you may want, you need to give it the least privilege possible. — [snorts] — And then you definitely want to monitor and adapt. You want to test, to run evaluations, and to constantly check for various attacks that exist. And typically these defenses are in various layers. You have an input, a policy, a runtime, an output, and monitoring layers that I all discussed already. And you want security at each of these layers. For inputs, you want to validate and sanitize every input you receive. For the policy, you need to separate the policy from the data that you have because no model should have access to both your policy, your data, and everything at the same time. It should have only access to what it needs to do its current job. For runtime, you also need least privilege for every tool access and every MCP server. And you need a sandbox for execution just to not break anything in your current environment, for example. For the output, you obviously want to validate against the schema before acting and route any irreversible or um non-negligible action to human approval beforehand. And finally, for monitoring, obviously, you want to trace everything. You want to use tools like Opaque to be able to easily monitor and see what's happening. And finally, some key takeaways is that LLMs are probabilistic autoregressive generators. So, it's not magic. It's not conscious. It's not even really intelligent. It just predicts the next token, but it does that very efficiently, and it's extremely impressive how well it can do that. But still, you need to be careful because it's just predicting the next token. It's not truly thinking about the problem and everything around it. Then, you need to give the model all it needs to succeed. You need to curate the context as much as possible because that's our main role as AI engineers. You want to compact the model. You want to trim what's irrelevant, add what's relevant. You really want to manage this context budget as much as possible
Segment 12 (55:00 - 57:00)
because that's where we can actually bring value since OpenAI and the other companies create the models, but all these models have the same shared limitations of both the context, which cost us money, but also adds delay and reduce quality as the more you add token there. So, the best thing we can do is to minimize the tokens, but optimize the quality of these tokens going in at each exchange. Then, obviously, you always want the simplest solution that worked. You want to start simple, just a prompt and add as you go based on your evaluations, based on the performance of your system. Which leads to the last one, to measure everything. You want five checks, you need to look at your exchanges, at what's happening, but you don't want to only be based on that because as soon as you will try to improve on something, you will get something else worse. And that's pretty much it for the foundations for AI engineers. Next, to upskill as AI engineers, the true secret is to practice and to build. Like, obviously, you won't become an engineer after this talk, but hopefully, it helps situate a bit all the solutions, things to be careful of, things to do, things to add. And then, uh you either learn by building or you learn by building with some help, some direction, which we give in our courses. So, please, if you're interested, check out the Towards AI Academy or reach out to me directly if you want a personalized training in your company. But in any case, I hope this presentation helped you, and I definitely invite you to subscribe to the channel for more videos about AI engineering with an upcoming one about vibe coding or better said, agentic coding, and how we do that properly at Towards AI.
Другие видео автора — What's AI by Louis-François Bouchard