# AI Series: Prompt engineering and RAG - Explained

## Metadata

- **Channel:** AWS with Chetan
- **YouTube:** https://www.youtube.com/watch?v=VZrGuQiqFpQ
- **Date:** 28.04.2026
- **Duration:** 23:38
- **Views:** 414

## Description

In this video, let's get the basics of AI clear by understanding how we can customize an LLM's response using different techniques, focusing specifically on prompt engineering and Retrieval Augmented Generation (RAG).
In the next video, we will build an end-to-end RAG-based Intelligent Document Processing application.

In this Video:
0:00 Techniques to customize FM/LLM response
2:15 Prompt Engineering
6:18 Why RAG?
8:38 What is RAG
10:07 How RAG works?
12:48 Vector Embedding Flow
15:19 Chunking
17:13 Vector embeddings
19:39 Vector embedding demonstration
21:23 Vector databases
22:08 Recap
22:51 Architecture for RAG application (Next video in Let's Build Series)

## Contents

### [0:00](https://www.youtube.com/watch?v=VZrGuQiqFpQ) Techniques to customize FM/LLM response

But, actually they are in 3D space, and that's where they are kept closer to each other. All right. So, in this lecture, let's first talk about what prompt engineering is and what its limitations are. And then, we will see how RAG, that is retrieval augmented generation, can overcome those limitations. And there, we will talk about what RAG is, how it works, and while doing that, we will also understand the internal functionality of RAG. For example, what chunking is, what vector embedding is, and there, I will also show you a small demonstration, and then, how RAG-based applications are designed. So, there, I'm also going to show you the architecture, and in the next video, we are going to build a complete RAG-based application, which is based on this architecture. All right. Now, if you look at any LLM or foundation model, there are different ways in which you can customize the output of the model. And this is typically required because foundation models or LLMs don't have all the data, typically your private data. Now, for doing this, there are a few techniques, and if you compare them across the dimensions of complexity, quality, cost, and time, then the simplest method is prompt engineering, where you don't need to modify the LLM as such, but just by providing a different prompt, you can get a different output. So, that's the first technique. However, there are certain limitations of prompt engineering, and that's where we go to the next step, that is using the retrieval augmented generation technique. So, in this video, we are going to focus on these two techniques. Now, apart from that, as you can see, there is fine-tuning, and there is also the option to build the FM from scratch, but as you might have guessed, that requires a lot of data and a lot of computing, and it is a very expensive operation. 
And that's why typically only big enterprises can afford to invest in building an FM from scratch or fine-tuning those FMs. Okay. So, now let's talk about prompt engineering and RAG. So, I'm

### [2:15](https://www.youtube.com/watch?v=VZrGuQiqFpQ&t=135s) Prompt Engineering

sure you know what a prompt is. So, a prompt is just the input that you send to the LLM, and you expect a response from the LLM. So, the user will send the prompt, for example, who invented the airplane, and the foundation model or the LLM will respond with text output. Now, the LLM can answer this because it has been trained on a huge amount of publicly available data from the internet. So, this was a generic question, and the LLM could answer it. But, what if you have to provide information to the LLM which it probably never saw? Or you want the LLM to answer the question based on certain context that you provide. For example, you are adding some additional context, like the effect of climate change. And now, you want the LLM to answer the question based on this context. So, the LLM can definitely understand this and can give the response based on this context. So, here, basically, you have influenced the response of the LLM by providing this additional context. Right? Now, next, when we talk about prompt engineering, there are different techniques in that. For example, something called zero-shot prompting, in which you do not provide any examples to the LLM, and you simply expect the response. So, for example, if you give this prompt, then the LLM will tell you whether the sentiment of this particular statement is positive or negative. And the LLM does that based on the thousands and millions of similar sentences it has seen and its understanding of the sentiment behind them. So, that's called zero-shot prompting. However, in your business, it is possible that the sentiment depends on a lot of different factors, and in that case, you want to provide some examples to the LLM so that it can understand the sentiment a little better. So, in that case, you will use few-shot prompting, where you also provide some examples. So, here, you are providing some sentences, and you are also telling the LLM whether the sentiment for each statement is negative or positive. 
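The difference between the two prompting styles described here can be sketched in a few lines of Python. This is only an illustration: the example sentences, the labels, and the function names `zero_shot_prompt` and `few_shot_prompt` are assumptions made up for this sketch, not the prompts shown on screen in the video, and no model is actually called.

```python
# Illustrative sketch only: builds prompt strings, does not call any LLM.
# The statements and labels below are invented for this example.

def zero_shot_prompt(text: str) -> str:
    """Ask for a sentiment label without giving any examples."""
    return (
        "Classify the sentiment of the following statement as Positive or Negative.\n"
        f"Statement: {text}\n"
        "Sentiment:"
    )


def few_shot_prompt(examples: list[tuple[str, str]], text: str) -> str:
    """Prepend labelled examples so the model can infer the labelling rule."""
    lines = ["Classify the sentiment of each statement as Positive or Negative.", ""]
    for statement, label in examples:
        lines.append(f"Statement: {statement}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final statement is left unlabelled for the model to complete.
    lines.append(f"Statement: {text}")
    lines.append("Sentiment:")
    return "\n".join(lines)


# Hypothetical business-specific examples.
EXAMPLES = [
    ("The delivery arrived two days early.", "Positive"),
    ("The ticket was closed without a fix.", "Negative"),
]

if __name__ == "__main__":
    print(zero_shot_prompt("The new dashboard is slow."))
    print()
    print(few_shot_prompt(EXAMPLES, "The new dashboard is slow."))
```

The only difference between the two prompts is the block of labelled examples, which is why few-shot prompting stays practical only while the number of examples is small.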
And now, based on these examples, it should provide you the answer. So, this is called few-shot prompting. All right. So, this works as long as there is a limited number of examples and the questions are fairly generic. Now, let's move forward, and let's ask a different question to this LLM. For example, if you have some kind of internal policies in your company, and you want to ask a question about them, do you think the LLM will be able to answer that? So, let's check that. For example, you ask the LLM, "How many personal leaves am I entitled to take at my L3 job level?" Now, if you see, this is very specific information about your company policy. So, it is highly likely that the LLM has never seen this document, and so it doesn't know about it. So, in this case, the LLM will still respond, and typically, it will respond based on other information that it learned from the internet. So, it is possible that there are plenty of documents about similar policies out there on the internet, and the LLM might just provide the answer based on those documents. So, in that case, it may provide the answer, say, 25 days. And this is called hallucination. And the LLM is doing that because it doesn't have the information about your internal company policy. And that's a big problem. So, how do you solve this problem? Now, the ideal way to solve this problem is to just provide additional context to the LLM while you ask that question. And for doing that, you might provide the policy document in the prompt to the LLM. So, you just upload those documents, which have all the information about the policy, and then you ask this

### [6:18](https://www.youtube.com/watch?v=VZrGuQiqFpQ&t=378s) Why RAG?

question. Now, this time, the LLM will read all those documents, will understand your internal policies, and based on that, it will now provide the actual answer from your company policy document. So, it will provide the correct answer. So, this works, right? However, there are certain problems. Now, imagine that these policy documents could be a few MBs in size, or even some GBs if it's a big enterprise. And if you are sending these PDF documents along with every prompt, imagine how many tokens your LLM will consume. And that is really a problem, because then you would have to pay a lot for using the LLM. So, cost is a big problem there, because you are unnecessarily sending all the documents to your LLM every time. All right. So, what is the solution to this? And as you might have guessed, the solution is RAG, but for this, we need to tell the LLM how to use this information for answering the question. And here, we are not talking about all the information in all the policy documents; RAG should supply only the context which will help the LLM to answer that specific question. Which means that you shouldn't send all these documents, but only the parts of the documents which are relevant to your question or prompt. And that is exactly what RAG is. So, RAG basically enables you to augment the prompt at run time, and along with the prompt, it also sends the additional context, which will help the LLM to answer that particular question. And as you introduce RAG into your application, you are basically fixing this token limit issue, as well as the unnecessary cost of the LLM. All right. So, I hope it is clear why RAG is required. And now, let's talk about how RAG works. So, you still need to have access to your company policy data, but this data will now be stored in the RAG application. And it will store that in the form of vectors, which we are going to talk about soon. But here, let's first understand the flow of the same

### [8:38](https://www.youtube.com/watch?v=VZrGuQiqFpQ&t=518s) What is RAG

application when using RAG. So, on top of this company data, basically, you add one more layer, the RAG layer. And now, if the user asks the same question about the number of leaves, then the GenAI application, which is sitting in the middle, will now talk to the RAG application to get the relevant information about that question. So, the RAG application will now extract only the relevant information about that question, and will respond with that additional context. And now, your GenAI application will send an augmented prompt to the LLM, where it will also add that additional context. And now, because the LLM has all the information to answer that question, it will respond with the correct answer. So, this is how the RAG-based application will work. However, the most important thing to understand here is what this RAG layer contains, how exactly it searches for the relevant information, and then how it sends back the relevant information to provide that context. So, now, let's understand that. So, in order to use RAG, there is a prerequisite. And the prerequisite is that your company data, or say, policy data, should already be stored in some kind of vector database. And in order to store it there, there is an ingestion process for those documents. So, the

### [10:07](https://www.youtube.com/watch?v=VZrGuQiqFpQ&t=607s) How RAG works?

first step is ingest and store. So, here, as you can see, there is company data which could be in your SharePoint or in databases. And then, you need to get this data and you have to create the vector embeddings of that data. Now, what is a vector? What is a vector embedding? We will talk about that. But basically, you are converting the text into a numeric format which is easier to search based on meaning. And this is also called semantic search. So, here basically, you should read all that data and convert it into vector embeddings. And for that, you use vector embedding models, which are not LLMs, but special machine learning models which can convert the text into a numeric format based on the semantics, or the meaning, of that word or sentence. So, basically, you will feed in all these documents, and the embedding model will convert that text into the vector format. And you will store these vectors in some kind of vector store or vector database. And again, we are going to talk about this vector store shortly. So, the point is that all the text data is now stored in the vector store in the form of numeric data. Right? So, that's the first phase, or say, the prerequisite to use RAG. And now, as the user sends the same prompt to the GenAI application, the query that the user is sending should also be converted into the vector format. So, your GenAI application will again use the embedding model and will get the vector representation of the query. So, now your GenAI application has the vector embedding for this prompt, or say, query, and all the documents and the information are already stored in a vector format in this vector store. And now, this GenAI application will query this vector store based on the input vector that has been created. And the vector store will basically respond with the relevant chunks of the information that it has stored. 
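The query step just described (compare the query vector with the stored vectors and return the closest chunks) can be sketched as a small in-memory search. This is a minimal sketch under stated assumptions: the three-dimensional vectors, the chunk texts, and the names `VECTOR_STORE` and `top_k` are all invented for illustration, and cosine similarity is one common choice of closeness measure, not necessarily what a given vector database uses.

```python
import math

# Hypothetical in-memory "vector store": (chunk text, embedding) pairs.
# Real embeddings come from an embedding model and have hundreds of
# dimensions; these 3-dimensional vectors are hand-written for illustration.
VECTOR_STORE = [
    ("Leave policy: personal leave entitlement depends on job level.", [0.9, 0.1, 0.0]),
    ("Expense claims must be filed through the finance portal.", [0.1, 0.9, 0.1]),
    ("The office is closed on public holidays.", [0.5, 0.1, 0.7]),
]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], store, k: int = 2):
    """Return the k stored chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Pretend this is the embedding of the user's question about leave.
    query = [0.8, 0.2, 0.1]
    for score, text in top_k(query, VECTOR_STORE):
        print(f"{score:.3f}  {text}")
```

A production vector store would use an index instead of scanning every vector, but the contract is the same: a vector goes in, the nearest chunks come out.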
So, as a result of this search, it will provide the parts of the text from those documents which match the user query. And this will be the additional context that the GenAI application will send to the LLM. And now, the LLM can answer this question based on this additional context. Right? So, this is how the RAG flow works. Now, the most important part to understand here is how the text in these documents is converted into the vector format. And for that, let's understand the vector embedding flow.
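The last step of this flow, where the GenAI application adds the retrieved chunks to the user's question, can be sketched as plain string assembly. The prompt wording and the function name `build_augmented_prompt` are assumptions for this sketch; the video does not show the exact template it uses, and the chunk text below is a placeholder rather than a real policy.

```python
def build_augmented_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    # Placeholder chunk standing in for what the vector store returned.
    chunks = ["Leave policy: personal leave entitlement depends on job level."]
    question = "How many personal leaves am I entitled to take at my L3 job level?"
    print(build_augmented_prompt(question, chunks))
```

Because only the matching chunks are included, the prompt stays small no matter how large the full set of policy documents is.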

### [12:48](https://www.youtube.com/watch?v=VZrGuQiqFpQ&t=768s) Vector Embedding Flow

So, as I said, there will be this company data, and there will be your application which you have built to create those vector embeddings. And this application will perform a few steps. First, it will just load the document from the source. And the next thing it will do is read all the text. Now, if your documents are in PDF, then you can use some Python libraries to simply read that PDF document. Now, next, you need to convert this text into the vector form, and for that, as I said, you will use an embedding model. However, if you see, the PDF document could be of any size and any length. And again, there will be a limit on how many bytes or tokens you can send to this embedding model. So, it's not feasible to send the entire file in one go and expect the response back from the embedding model. So, now basically, instead of sending the whole file in one go to this embedding model, you are going to split this file into smaller subsets of text, you can say. And this is called chunking. So, you are going to split one big file into smaller chunks based on the chunk size that the embedding model supports. All right. So, once you split your file into chunks, the next thing that this application should do is send each chunk to the embedding model to create the vectors out of those. So, this is simply a vector representation of that chunk. And once that is done, this vector information along with the original data should be stored in some kind of vector database. And once it is stored there, it has all the information about that particular text, which file it came from, and the vector representation of that text. So, at the end of this ingestion process, this vector database will have all the information about all the chunks from all your files. So, which means that this is a kind of preprocessing that you need to do in order to use RAG. 
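The ingestion steps above (load, split into chunks, embed each chunk, store the vector with its text and file name) can be sketched end to end with the standard library. Everything named here is an assumption for illustration: `toy_embed` is a hashed bag-of-words stand-in that carries no real semantics, where a real pipeline would call an actual embedding model, and the returned list stands in for a vector database.

```python
import hashlib
from dataclasses import dataclass

DIMENSIONS = 8  # Toy size; real embedding models use hundreds of dimensions.


@dataclass
class VectorRecord:
    source_file: str      # which file the chunk came from
    chunk_text: str       # the original text of the chunk
    vector: list[float]   # the vector representation of the chunk


def toy_embed(text: str) -> list[float]:
    """Stand-in for an embedding model: hashed bag of words, NOT semantic."""
    vector = [0.0] * DIMENSIONS
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % DIMENSIONS] += 1.0
    return vector


def split_into_chunks(text: str, max_words: int) -> list[str]:
    """Simplest possible chunking: fixed number of words per chunk."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def ingest(documents: dict[str, str], max_words: int = 20) -> list[VectorRecord]:
    """Chunk every document, embed each chunk, and collect the records."""
    records = []
    for file_name, text in documents.items():
        for chunk in split_into_chunks(text, max_words):
            records.append(VectorRecord(file_name, chunk, toy_embed(chunk)))
    return records


if __name__ == "__main__":
    # Hypothetical document content, already extracted to plain text.
    docs = {"leave_policy.txt": "Personal leave entitlement depends on job level. " * 10}
    for record in ingest(docs):
        print(record.source_file, len(record.chunk_text.split()), record.vector)
```

Keeping the file name next to each vector is what lets the application say which document an answer came from, and lets it re-ingest only the document that changed.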
And over time, if your policy changes, then again, you would have to repeat this process for that particular document. All right. So, that's the end-to-end vector embedding flow. And within that, now let's quickly talk about chunking and embedding. As in, how they work and which tools you typically use for them. So, if you talk about

### [15:19](https://www.youtube.com/watch?v=VZrGuQiqFpQ&t=919s) Chunking

chunking, it simply means splitting a large document into smaller, focused pieces of text. Now, when you do this chunking, there are typically different strategies. As in, you can just create chunks of a fixed size. For example, say 512 tokens or words. However, the problem with this strategy is that your sentences may be broken in the middle, and that's where they lose the context. So, this is kind of the fastest way, and it works for well-structured documents, for example, a CSV file, because there the record size is fixed. However, for unstructured documents like PDFs, this doesn't work. And hence, you might go with sentence- or paragraph-level splits. Or further, you can also go with the recursive strategy, where it will first try to split by paragraph, but if the size is a problem, then it can further break down at the sentence level. And finally, the most accurate one will be semantic-based splitting. That means it knows the context of a particular sentence or paragraph, and it will split at that boundary. So, likewise, there are different common strategies for chunking, and depending on your use case, you will use one of those strategies. And in order to do this chunking, you can use different tools like LangChain, LlamaIndex, unstructured.io, and many others. So, as a result of this step, as you know, one big file will be split into smaller chunks of text. So, that is chunking. And here is a simple example where one big paragraph has been split into separate sentences while keeping the context, or the meaning, of each sentence intact. All right. So, that's chunking. And now, let's talk about the vector embeddings.
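To make the contrast between fixed-size and recursive splitting concrete, here is a small sketch in plain Python. It is an assumption-laden simplification: it counts words rather than tokens, detects sentences with a simple punctuation rule, and uses the made-up names `fixed_size_chunks` and `recursive_chunks`. Libraries such as LangChain or LlamaIndex provide their own splitters, which are not reproduced here.

```python
import re


def fixed_size_chunks(text: str, max_words: int) -> list[str]:
    """Cut every max_words words, regardless of sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def recursive_chunks(text: str, max_words: int) -> list[str]:
    """Split by paragraph first; split oversized paragraphs by sentence.

    A single sentence longer than max_words is kept whole in this sketch.
    """
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = " ".join(paragraph.split())
        if not paragraph:
            continue
        if len(paragraph.split()) <= max_words:
            chunks.append(paragraph)
            continue
        current = []
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            candidate = " ".join(current + [sentence])
            if current and len(candidate.split()) > max_words:
                chunks.append(" ".join(current))
                current = []
            current.append(sentence)
        if current:
            chunks.append(" ".join(current))
    return chunks


# Hypothetical sample text, invented for this example.
SAMPLE = (
    "Leave policy applies to all staff. Requests go through the HR portal.\n\n"
    "Managers approve requests within five days."
)

if __name__ == "__main__":
    print(fixed_size_chunks(SAMPLE, 8))   # first chunk ends mid-sentence
    print(recursive_chunks(SAMPLE, 8))    # every chunk is a whole sentence
```

With a limit of eight words, the fixed-size splitter cuts the second sentence in half, while the recursive splitter falls back from paragraph to sentence level and keeps each sentence intact.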

### [17:13](https://www.youtube.com/watch?v=VZrGuQiqFpQ&t=1033s) Vector embeddings

So, as I said, vector embeddings are nothing but a representation of the text in the form of numbers. And these numbers are spread across dimensions. Now, this is interesting. So, for example, king is a word which may represent different things in different contexts, right? So, king may be the actual king, meaning the king of an empire, something like that. Or, in the context of playing cards, king means one of the cards. Or it could also relate to the king of the jungle, and in that sense, it could mean a lion. So, this means that the same word can have different meanings based on the context where it has been seen. And likewise, there could be hundreds of dimensions across which the same word can be interpreted. And that's why these embedding models use N dimensions. So, for example, if you go with, say, an Amazon Titan embedding model, it supports, say, 512 dimensions. So, the vector embedding for that particular text or sentence will be created across these dimensions, and according to the context, there will be a weight assigned along each of those dimensions. And similarly, if you see, there will be other words like queen. Now, if you see, semantically king is very close to queen, and that's why these numbers closely match. So, now in the vector database, when you search for queen, it might also give you responses which are close to king. However, if you look at apple, which is far from king and queen, these numbers differ, because in the vector space they are at a distance from each other. And shortly, I'm going to demonstrate how these words might look in the vector space. So, the point here is that for every text, the vector embeddings are created across multiple dimensions. Now, this is for words. However, you can do the same for sentences as well. Now, the point here is that you are still keeping the meaning, that is the semantics, of the word intact while you create these vector embeddings. 
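The idea that king and queen sit close together while apple sits far away can be checked with a distance calculation. The four-dimensional vectors below are invented by hand purely to illustrate the point; they are not output from Amazon Titan or any other embedding model, and Euclidean distance is just one of several measures a vector store might use.

```python
import math

# Hand-written toy vectors, NOT real model output. Real embeddings would
# have hundreds of dimensions (for example 512, as mentioned above).
TOY_EMBEDDINGS = {
    "king":  [0.90, 0.80, 0.10, 0.05],
    "queen": [0.85, 0.90, 0.15, 0.05],
    "apple": [0.05, 0.10, 0.90, 0.80],
}


def distance(word_a: str, word_b: str) -> float:
    """Euclidean distance between two words in the toy vector space."""
    return math.dist(TOY_EMBEDDINGS[word_a], TOY_EMBEDDINGS[word_b])


def nearest(word: str) -> str:
    """Return the other word whose vector is closest to the given word."""
    others = [w for w in TOY_EMBEDDINGS if w != word]
    return min(others, key=lambda other: distance(word, other))


if __name__ == "__main__":
    print(f"king-queen: {distance('king', 'queen'):.3f}")
    print(f"king-apple: {distance('king', 'apple'):.3f}")
    print("nearest to queen:", nearest("queen"))
```

A search for "queen" in this toy space returns "king" as its nearest neighbour, which is the behaviour the vector database relies on when it matches a query to stored chunks.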
Now, in the next video, when we build the HR policy RAG-based application, you will see that sentences like this are converted to their corresponding vectors, and then during search, it tries to match the closest vector. All right. So, with that, now let me show you these vectors in action. And

### [19:39](https://www.youtube.com/watch?v=VZrGuQiqFpQ&t=1179s) Vector embedding demonstration

these vectors look something like this. For example, if you provide text such as cat, dog, kitten, puppy, and so on, they are all represented here in three dimensions, because as humans, we can only visualize up to three dimensions. But, as I just explained, these vectors are actually created across many, many dimensions, which we can't visualize. So, now I'm just showing you the same vectors, but converted into three-dimensional space. All right. So, to visualize the vector embeddings for these words, I'm basically using this simple app, which I got from GitHub, and you can find the link to the app in the description, and you can also deploy it. Now, the only thing I changed is the texts that I wanted to use. All right. So, now if you look at all this text, for every text there is a vector embedding, and it is represented in 3D space. So, if you look closely, if you just go for, say, this particular text, which is elephant, and then if you look at sad, you might see that they are close. But, actually they are in 3D space, and if you just turn it around, you will see that there is this elephant, and there is this whale, which are both animals, and that's why they are kept closer to each other. Now, as I said, we can only visualize in 3D, but there are 512 such dimensions for this embedding model. So, basically what it does is keep all the semantically close objects, or vectors, close to each other. So that now, when you query, it will provide you the answer based on all the other nearby vectors. So, this is basically how the vectors are stored and queried. And for storing these vectors, there are

### [21:23](https://www.youtube.com/watch?v=VZrGuQiqFpQ&t=1283s) Vector databases

again different tools and databases that you will see, and the very popular ones are Pinecone, or you can even use a PostgreSQL database with the pgvector extension, and likewise there are quite a few other open-source databases in which you can store the vectors. Now, in the next video, when we build the application, we will simply use Amazon S3 Vectors, which is a new type of S3 bucket that we can create, and this is one of the cheapest options, because as you know, S3 itself is very cheap, and we don't really need to have these expensive databases. So, in our application, we will use Amazon S3 Vectors to store our vector embeddings. All right. So, just to recap, what we

### [22:08](https://www.youtube.com/watch?v=VZrGuQiqFpQ&t=1328s) Recap

learned in this lecture is that we first talked about different ways to modify, or say, customize the FM response, and there, the important techniques that we talked about are prompt engineering and then retrieval augmented generation. Now, while prompt engineering works, there are problems, like you can't send the whole document in the prompt, because it will be very expensive, and that's where we use the RAG technique. And this is what the RAG architecture looks like. Now, within RAG, we further talked about how chunking works, how vector embedding works, and how vectors are stored across multiple dimensions. Right? So, I hope it is clear, and now with this knowledge, in the next video,

### [22:51](https://www.youtube.com/watch?v=VZrGuQiqFpQ&t=1371s) Architecture for RAG application (Next video in Let's Build Series)

that is, in the Let's Build Series, we are going to build a fully functional RAG-based intelligent document application. And there, we are going to use all these AWS services, but the most important part is that we are going to use Bedrock for the LLM and for the embedding model, and we are going to use an S3 vector bucket to store our vectors. So, I highly recommend that you build this application from scratch. So, for that, just subscribe to my YouTube channel and keep a watch, because next week I'm going to release the video on building this RAG-based application from scratch. All right. With that, I hope you enjoyed this lecture, and I will see you in the next video. Thank you.

---
*Source: https://ekstraktznaniy.ru/video/50053*