# RAG Explained in 20 Minutes | Retrieval Augmented Generation + Hands on Project

## Metadata

- **Channel:** Cloud Champ
- **YouTube:** https://www.youtube.com/watch?v=RosLeHGBLoY
- **Date:** 05.05.2026
- **Duration:** 20:04
- **Views:** 3,410

## Description

RAG explained simply and in detail for beginners | What is RAG in AI | How RAG works in LLMs
In this RAG tutorial, you will learn Retrieval-Augmented Generation (RAG) from scratch with a complete hands-on project.

Links to free hands-on RAG labs (KodeKloud):
- RAG Crash Course: https://kode.wiki/3ORhYr2
- RAG Tutorial: https://kode.wiki/4tDz1fb
- What is RAG: https://kode.wiki/4mYQTil

We start with a simple explanation of how Large Language Models (LLMs) work and why they fail when dealing with real-time or private data. Then we introduce RAG as a solution and walk through the full pipeline step by step.

Through the RAG project, you will also build your own RAG pipeline from scratch using embeddings, vector databases, and semantic search.

This video is designed for developers, DevOps engineers, and AI enthusiasts who want a practical understanding of how modern AI applications, such as ChatGPT with custom data, are built.

What you will learn:
- What is Retrieval-Augmented Generation (RAG)
- Limitations of LLMs
- How embeddings work
- What is a vector database
- Semantic search explained
- RAG architecture step by step
- Building a complete RAG pipeline
- Real-world use cases

Timestamps:
- 00:00 Introduction
- 00:27 RAG explained with ChatGPT example
- 01:50 What is RAG
- 04:35 How RAG works
- 05:24 RAG architecture
- 07:30 Vector databases
- 09:10 Hands-on RAG pipeline project
- 19:38 Conclusion

Keywords:
- RAG explained
- Retrieval Augmented Generation tutorial
- RAG tutorial for beginners
- Build RAG system
- LLM with external data
- Vector database tutorial
- Embeddings explained
- Semantic search tutorial
- ChromaDB tutorial
- LangChain RAG
- OpenAI embeddings tutorial
- AI project tutorial
- Generative AI tutorial
- LLM architecture explained
- ChatGPT with custom data
- RAG pipeline step by step

If you want to go deeper into AI systems and understand how everything connects, watch these videos next:
- MCP Explained: https://youtu.be/Xs9AwE2lyHg?si=j-fCvTAc1Cxeib5N
- Backend of AI Applications: https://youtube.com/shorts/XsooLu1h6Q8?si=nwC11oCW7MVWj54n
- Vector Database Explained: https://youtube.com/shorts/6uQrgyGF-UE?si=lYBBMyg4hM-zy1EY
- Generative AI vs AI Agents vs Agentic AI: https://youtu.be/i2hPuwapxOg?si=pcCzH6spEJgxN41U
- What is Openclaw: https://youtube.com/shorts/l51qziR8lLA?si=KwGMWLwFPNkYAbqD
- NemoClaw explained: https://youtube.com/shorts/jWTE-sR3kig?si=iwg7MkWMMPXgJJYO

These videos will help you connect the full picture of building production-ready AI systems from scratch.
Subscribe to CloudChamp for more hands-on AI, DevOps, and cloud projects.

## Contents

### [0:00](https://www.youtube.com/watch?v=RosLeHGBLoY) Introduction

In this video, you will learn what RAG, or retrieval-augmented generation, is. We will also look at how different companies use RAG, and then do a hands-on demo to set up your own RAG pipeline. So, make sure you watch this video till the end. Let's go. Before we get started: I have created detailed notes about RAG that include everything you should know. If you want me to share these notes with you, comment "RAG" below and I will share them in the video description. All right. Now, to

### [0:27](https://www.youtube.com/watch?v=RosLeHGBLoY&t=27s) RAG Explained with ChatGPT example

understand RAG properly, let's start with a ChatGPT example. If you go to ChatGPT and ask, "What is the capital of India?", it will reply with an answer. It has been trained on a huge amount of data, such as PDFs, web pages, Wikipedia and a lot more, so it knows that the capital of India is New Delhi. But what if you ask ChatGPT something internal to your company, for example, "How much sales did we make on Saturday?" ChatGPT does not have access to your company's internal documentation and does not know your sales figures. So it will reply that it does not have enough context or access to your company's internal data, or it might simply make something up. This is because ChatGPT was not trained on your company's internal documentation; it was trained on public internet data. I hope that makes sense. Now, let's try another example. This is an error I got from a Docker container running on my local machine. If I paste this error into ChatGPT, it might give me a vague answer that is most likely wrong. But if I paste the same error into the Docker AI assistant on docs.docker.com, it gives me the right answer and points me to the correct documentation and the sources the answer came from. So, how come ChatGPT gave me a wrong answer while Docker AI gave me the right one, with the right documentation? The answer is RAG, or retrieval-augmented generation. So

### [1:50](https://www.youtube.com/watch?v=RosLeHGBLoY&t=110s) What is RAG

retrieval-augmented generation, or RAG, is a technique that combines information retrieval with large language models to generate more accurate, up-to-date and context-aware responses. How does it do that? Instead of relying only on what a model was trained on, RAG lets the model fetch relevant data from external sources, such as documents, PDFs, databases or APIs, and use that information while generating answers. So, when you ask an LLM a question, it does not answer only from its training data. It first retrieves data from those sources and then generates the response it gives you. Without RAG, your model could hallucinate, give you outdated information, or answer without any knowledge of your own data. To explain this more clearly, I have a project that uses a document, a text file with information about Japan, and two files: insert.py and question.py. I will explain what they do soon, but first let's run one. If I run python3 question.py, I can ask it a question, just as I would ask ChatGPT. Here I'm using the nomic-embed-text model, an embedding model that I run locally through Ollama. When I run this file, it asks me to enter my question. If I ask something related to Japan, it gets the information from this text file. But if I ask about anything other than Japan, for example "What is Kubernetes?", it should reply that it has no context or information about it. So, let's ask "What is Kubernetes?". This is the user query. Using RAG, it searches the document and responds: "There is no mention of Kubernetes or any other technology in the provided context. The text only discusses Japan." The information is fetched only from the document.
It is not drawing on the data the model was trained on. Let's try another example: "What to eat in Japan?" Now the response says you can eat sushi, ramen, tempura and so on, and that information was fetched from this particular file. Let's try another one: "Give me a 3-day Japan itinerary." Because this document has information about Japan, the model uses it to generate a 3-day itinerary. You can see it suggests Tokyo on the first day, then Kyoto, and then Tokyo again or, optionally, Osaka. But if I ask about anything else, say "What is the capital of India?", it responds that it does not have that information, because the document says nothing about India. So, with RAG you can ground responses in documents, text files and PDFs like this one to get correct answers.
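The demo does not show the code in question.py. The sketch below assumes one common way to produce the "there is no mention of Kubernetes" behaviour: an augmentation step that places the retrieved chunks into the prompt and tells the LLM to answer only from them. The function name, the instruction wording and the sample Japan chunk are hypothetical.

```python
def build_grounded_prompt(question: str, context_chunks: list[str]) -> str:
    """Build a prompt that tells the LLM to answer only from the retrieved context.

    Hypothetical helper: the video's question.py is not shown, so this is an
    assumed version of the augmentation step, not the author's code.
    """
    # Number each chunk so the model (and a reader) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(context_chunks, start=1)
    )
    if not context:
        context = "(no relevant context was retrieved)"
    return (
        "Answer the question using ONLY the context below.\n"
        "If the context does not contain the answer, say that the provided "
        "context has no information about it.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Invented sample chunk standing in for the Japan text file.
prompt = build_grounded_prompt(
    "What is Kubernetes?",
    ["Japan is an island country in East Asia. Popular dishes include sushi and ramen."],
)
print(prompt)
```

Because the instruction and the context reach the model together, an off-topic question such as "What is Kubernetes?" should lead the model to say the context lacks the answer rather than fall back on its training data. Whether it actually does so depends on the LLM.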

### [4:35](https://www.youtube.com/watch?v=RosLeHGBLoY&t=275s) How RAG works

In these RAG notes, if you scroll down, you will find a section called "How RAG works". It has an architecture diagram of a basic RAG pipeline showing every component of a RAG setup. Besides the LLM, there is a vector database, which is a special kind of database. The vector database holds all your internal documents, PDFs, etc. But you cannot store them directly. You first need to convert the documents into embeddings, or vectors, using an embedding model. Then, whenever a user sends a query or asks a question, the question is also converted into a vector. A similarity search is run between the question vector and the vectors stored in your vector database. The matching context is then passed to the LLM, which creates a response and gives it back to you. I hope that makes sense. If not, let me show you with an architecture diagram. So, let's start with an

### [5:24](https://www.youtube.com/watch?v=RosLeHGBLoY&t=324s) RAG Architecture

architecture diagram showing what happens when you don't use RAG, which is a simple ChatGPT setup. You have an LLM, or large language model: GPT-4 for ChatGPT, Gemini, Claude Opus, etc. Suppose a user asks this LLM a question or gives it a prompt, for example "What is Kubernetes?". The LLM does not understand raw text; it works with numbers. So, the text is first broken down into tokens, and this process of splitting your prompt into tokens is called tokenization. The tokens are then converted into vectors, also called embeddings. A vector is simply an array of numbers, and its length depends on the model you are using. For example, the Japan example used nomic-embed-text, which produces 768-dimensional vectors. Here is the query that creates a table in the vector database with dimension 768. With a different model, the dimension will differ: it could be 768, 1,536 or more. For example, the OpenAI documentation shows that the default embedding length is 1,536 for the text-embedding-3-small model and 3,072 for text-embedding-3-large. So, the vector size depends on the model. Once the query or prompt is converted into vectors, the vectors are passed through the LLM. The LLM has been trained on a huge amount of data and uses these vectors to find patterns. Once it has worked out the answer, it passes the resulting vectors to a decoder. The decoder converts vectors back into tokens and tokens into text, and returns the answer to the user: Kubernetes is a container orchestration tool. That is your basic ChatGPT setup.
This is how ChatGPT works when there is no RAG setup and it is not using your internal documents to give you an
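The point above is that text becomes tokens, tokens become vectors, and the vector length is fixed by the model rather than by the text. The toy below illustrates that idea under stated assumptions. It is not how GPT-4 or nomic-embed-text work internally. The regex word tokenizer, the randomly initialised lookup table and the 8-dimension size are all hypothetical. Mean pooling is one common way sentence-embedding models turn many token vectors into a single vector.

```python
import random
import re

# Toy illustration only: real models use learned subword tokenizers and trained
# embedding tables, not the hypothetical ones below.
EMBEDDING_DIM = 8  # real models use e.g. 768 (nomic-embed-text) or 1,536


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens (a stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyEmbedder:
    """Maps tokens to ids, ids to vectors, and mean-pools them into one vector."""

    def __init__(self, dim: int, seed: int = 0):
        self.dim = dim
        self.vocab: dict[str, int] = {}       # token -> token id
        self.table: list[list[float]] = []    # token id -> vector
        self.rng = random.Random(seed)

    def token_id(self, token: str) -> int:
        # Assign a new id (and a random vector) the first time a token is seen.
        if token not in self.vocab:
            self.vocab[token] = len(self.table)
            self.table.append([self.rng.uniform(-1, 1) for _ in range(self.dim)])
        return self.vocab[token]

    def embed(self, text: str) -> list[float]:
        ids = [self.token_id(t) for t in tokenize(text)]
        if not ids:
            return [0.0] * self.dim
        # Average the token vectors: the result always has self.dim numbers.
        return [sum(self.table[i][d] for i in ids) / len(ids) for d in range(self.dim)]


embedder = ToyEmbedder(EMBEDDING_DIM)
print(tokenize("What is Kubernetes?"))  # ['what', 'is', 'kubernetes']
print(len(embedder.embed("What is Kubernetes?")))  # 8, regardless of text length
```

This is also why the vector database table must be created with the same dimension as the embedding model's output: a 768-dimensional column cannot hold 1,536-dimensional vectors.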

### [7:30](https://www.youtube.com/watch?v=RosLeHGBLoY&t=450s) Vector Databases

answer. With RAG, by contrast, you have an extra component: a vector database. There are different vector databases, like Pinecone, Weaviate and Milvus. You can also use a Postgres database as a vector database, which is what I'm using in the Japan example. If you want a step-by-step video showing how to build a RAG pipeline from this example and how to use Postgres as a vector database, let me know in the comment section. With RAG, the vector database stores all your documents, PDFs, internal files, etc. A vector database cannot store files directly, so you convert them into embeddings using an embedding model. The embedding model takes the documents and converts them into vectors, and these vectors are stored in the vector database. Now, say a user asks a question like "How much sales did we make?", and the sales information is in one of the documents. That document's content has already been converted into vectors and stored in the vector database. When the user asks the question, the question is converted the same way: first into tokens, then into vectors. We then compare this question vector with the vectors stored in the vector database. This step is called semantic search. Semantic search is the process of finding the closest vectors, so that the most relevant content is retrieved. The vector database passes the retrieved information to the LLM as context. The LLM uses it to frame a proper answer and gives that answer to the user. So, this is how RAG actually works. I hope this explains everything, and to make it even easier, we are now going to do a
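Semantic search, as described above, means finding the stored vectors closest to the question vector. Below is a minimal brute-force sketch using cosine similarity over an in-memory list. The 3-dimensional vectors and document titles are invented; real embeddings have hundreds of dimensions. Real vector databases usually avoid scanning every vector and use approximate nearest-neighbour indexes such as HNSW, which the lab below configures.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, stored, top_k=2):
    """Brute-force nearest-neighbour search over (text, vector) pairs."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return scored[:top_k]


# Invented vectors standing in for embeddings stored in a vector database.
stored = [
    ("Weekend sales report", [0.9, 0.1, 0.0]),
    ("Remote work policy", [0.0, 0.2, 0.9]),
    ("Pet policy", [0.1, 0.9, 0.1]),
]
question_vec = [0.8, 0.2, 0.1]  # pretend embedding of "How much sales did we make?"
results = semantic_search(question_vec, stored)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

The retrieved texts, not the vectors themselves, are what get passed to the LLM as context.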

### [9:10](https://www.youtube.com/watch?v=RosLeHGBLoY&t=550s) Hands on RAG pipeline project

hands-on demo. I will show you how to set up your own vector database, convert your documents into embeddings and store them in it, then ask a question and run a semantic search. Everything will be explained. All right, now it's time for the hands-on demo. We are going to set up a complete RAG pipeline, and for this practice I'm using a free KodeKloud lab. The link is in the description. We are given a scenario: a company named TechCorp, whose CEO has just burst into the office saying, "We have 500 GB of critical documents, like policies, product specs and meeting notes, and nobody can find anything. Our competitors are using AI. We need a solution today." So, the CEO wants an AI that can use all these policies and documents to answer questions. Your mission is to build an AI-powered knowledge assistant that can instantly answer any question about TechCorp's documents. We start by setting up a vector database, which is ChromaDB here. Sentence Transformers are used to convert documents into embeddings, Flask provides a web server, and OpenAI is used as the LLM. KodeKloud labs provide everything you need, so the hands-on practice is easy to follow. I'm just going to copy and paste this command, but to explain: it installs all the dependencies, including ChromaDB. When it finishes, it prints a message saying "ready". Then we can check whether everything was installed: whether the virtual environment was created, whether the uv package manager was installed, whether all the required packages are installed, and whether setup completed. Everything is done, so let's click Next. The next step is to explore which documents you have, such as pet_policy.md and remote_work_policy.md. You can run this command to list all the documents in this lab.
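The lab's setup check verifies that the required packages are installed. A small script can do the same locally before you start. The module list below is an assumption based on the components named above (ChromaDB, Sentence Transformers, Flask, OpenAI). The lab's actual install command and package list are not shown in the video.

```python
import importlib.util

# Assumed import names for the components mentioned in the lab; adjust to match
# the packages your own setup actually installs.
REQUIRED_MODULES = ["chromadb", "sentence_transformers", "flask", "openai"]


def missing_modules(names: list[str]) -> list[str]:
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("Ready")
```

`importlib.util.find_spec` only looks a module up without importing it, so the check is fast and has no side effects.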
You have an employee handbook, meeting notes, product specs and customer FAQs, and the employee handbook contains several sections you can check out if you want. These documents will be converted into embeddings and stored in the vector database we just installed. So far we have only installed it; we also need to create a collection to store the vectors. Here we will convert the documents into vectors and then run a semantic search over them. For this, you create the "AI brain" for storing document vectors, and the code is provided. This code initializes the vector database, ChromaDB, and creates a collection named techcorp_docs. I'm going to create a Python file named init_vector_db.py and paste in the code from this section: create the file, copy the code, paste it, save it. Then cd into the folder and run python init_vector_db.py. This initializes the vector database and creates the techcorp_docs collection. The collection uses an HNSW index with cosine similarity. If any of this does not make sense, you can copy it and ask ChatGPT about it, but these are requirements for setting up the RAG pipeline. Now it says "AI brain is ready". Let's check that the initialization script and the ChromaDB directory were created. They were, so let's click Next. Next, each document is broken down into chunks for better search. This needs another file, test_chunking.py, which contains a function that chunks a document, plus a sample document. I'm going to create the file first and then copy and paste the code.
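Chunking splits each document into smaller pieces so that search can return just the relevant passage. The lab's test_chunking.py is not shown in detail, so this is a minimal sketch of one common approach: fixed-size character chunks with a small overlap. The `chunk_text` name and the default sizes (500 characters, 50 overlap) are assumptions, not the lab's settings.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly.

    The overlap keeps text that straddles a chunk boundary visible in both chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    chunks = []
    step = chunk_size - overlap  # how far each new chunk starts after the previous one
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size].strip()
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
    return chunks


sample_doc = "TechCorp sample paragraph. " * 20  # invented placeholder text
for i, chunk in enumerate(chunk_text(sample_doc, chunk_size=200, overlap=20)):
    print(i, len(chunk))
```

Production pipelines often split on sentence or paragraph boundaries instead of raw characters. The fixed-size version is simply the easiest to reason about.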
Now I run this file. The check verifies that the chunking script was created, that the chunking test output exists, and that the chunk count is in the right format. It produced two chunks from 506 characters; you can see the output here. Let's click Check, and everything passes. So, now we have a database and the documents have been chunked. Next, we need to create embeddings for the chunks. The model understands that "pets permitted" means "dogs are allowed", so their similarity is 92%, whereas "remote work" has nothing to do with dogs being allowed, so that similarity is only 18%. Semantic search means finding the closest vectors, as we already know. Let's create this file. This code uses the all-MiniLM model to convert words or sentences into vectors. Each vector has 384 dimensions, that is, 384 numbers, as you can see here. I'm going to copy this into the file and run it with python test_embedding.py, which creates the embeddings. You can see it's using all-MiniLM-L6-v2, a Hugging Face model that converts text into vectors, and it has finished. Let's click Check. All right, let's click Next. So far, the documents are converted into vectors and we have a database. Now we need to put those embeddings into the vector database we created. The goal of this step is to process all the documents through chunking and embedding and load them into the database. This code has a function that ingests the document vectors into the vector database. If you are actually doing this lab with me, I highly recommend you understand the code and what you're doing; otherwise it will not make sense. KodeKloud provides every command and all the code, so you can use it for reference. If you don't understand something, you can also ask ChatGPT about any line of code.
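The 92% versus 18% figures above are similarity scores between embeddings, shown as percentages. The sketch below shows that calculation under stated assumptions: it normalises vectors to unit length, where the dot product equals cosine similarity, and scales by 100. The 4-dimensional vectors are invented stand-ins for real 384-dimensional all-MiniLM-L6-v2 embeddings, so they will not reproduce the lab's exact numbers. Libraries such as sentence-transformers can return unit-length vectors directly via `encode(..., normalize_embeddings=True)`.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def similarity_percent(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors as a percentage (can be negative)."""
    a_unit, b_unit = normalize(a), normalize(b)
    # For unit-length vectors the dot product equals the cosine similarity.
    return 100 * sum(x * y for x, y in zip(a_unit, b_unit))


# Invented vectors standing in for real sentence embeddings.
pets_permitted = [0.7, 0.6, 0.1, 0.0]
dogs_allowed = [0.6, 0.7, 0.0, 0.1]
remote_work = [0.0, 0.1, 0.9, 0.4]

print(round(similarity_percent(pets_permitted, dogs_allowed)))
print(round(similarity_percent(pets_permitted, remote_work)))
```

Normalising once at ingestion time is a common design choice: every later comparison becomes a plain dot product.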
So, I save this file and run the code, which loads all the vector embeddings into the vector database. You can see the first file was split into nine chunks, and their vectors were created and stored in the database. Let's click Check to confirm everything succeeded, and move on. In this step, we ask a question. The question is converted into a vector, and a vector search runs over the documents inserted into the vector database. The purpose is to build a semantic search that understands meaning, not just keywords. A traditional SQL search matches keywords, but a semantic search uses an embedding model to find text with similar meaning, not only exact keyword matches. We're going to create the file test_search.py. In this code, you have queries like "What is the pet policy at TechCorp?", "Tell me about Cloud Sync Pro features." and "How many days of remote work are allowed?", and it runs a semantic search over the documents. I paste this code and run python test_search.py, which prints the results for these questions. For "What is the pet policy at TechCorp?", the top semantic-search result is from pet_policy.md. It says employees with pet allergies should notify HR, and that pets are not allowed in the cafeteria, conference rooms, server room, etc., which is exactly what the question asks. For the second question, "Tell me about Cloud Sync Pro features.", it searches another document, cloud_sync_pro.md. The top result is a 100% match for the question, the next one is an 85% match, and the next one a 70% match.
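The ingestion and search steps above can be summarised as: chunk each file, embed each chunk, store it with an id and its source file name, then rank stored chunks against a question. The sketch below uses a hypothetical in-memory `TinyCollection`. It is not the ChromaDB API, though its `add(ids=..., documents=..., metadatas=...)` keywords loosely mirror ChromaDB's `collection.add`. The `embed` function is a bag-of-words stand-in, which matches shared words only; a real embedding model such as all-MiniLM-L6-v2 also matches meaning. The file contents are invented samples.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyCollection:
    """In-memory stand-in for a vector-database collection (not the ChromaDB API)."""

    def __init__(self):
        self.records = []  # (id, vector, document text, metadata)

    def add(self, ids, documents, metadatas):
        for id_, doc, meta in zip(ids, documents, metadatas):
            self.records.append((id_, embed(doc), doc, meta))

    def query(self, text, n_results=1):
        q = embed(text)
        ranked = sorted(self.records, key=lambda r: cosine(q, r[1]), reverse=True)
        return [(r[0], r[3]["source"]) for r in ranked[:n_results]]


def ingest(collection, files: dict[str, str], chunk_size: int = 200) -> None:
    """Chunk each file, then store every chunk with an id and its source file name."""
    for filename, text in files.items():
        chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        collection.add(
            ids=[f"{filename}_chunk_{n}" for n in range(len(chunks))],
            documents=chunks,
            metadatas=[{"source": filename} for _ in chunks],
        )
        print(f"{filename}: {len(chunks)} chunks")


collection = TinyCollection()
ingest(collection, {
    "pet_policy.md": "Dogs are welcome in the office every Friday. " * 3,
    "remote_work_policy.md": "Employees may work from home with manager approval.",
}, chunk_size=40)
print(collection.query("Can I bring dogs to the office?"))
```

Storing the source file name as metadata is what lets the search step, and later the UI, report which document an answer came from.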
So, whenever you ask a question, a similarity search runs on the vectors in the vector database, covering all the documents we inserted. This is the retrieval step at the heart of RAG. Now, let's check that everything worked here and move on. Task eight is a complete RAG pipeline test, which exercises all three phases: retrieval, augmentation and generation. So, I'm going to create another file, test_rag_pipeline.py. I hope this makes sense. If not, do the hands-on practice to really understand it. If you haven't skipped any part of the video, you should be able to follow everything here. The file executed properly, and here is the response to the question "What are the benefits of working at TechCorp?" First, it retrieves information from the documents and reports that it found three relevant documents. Then it augments, which means preparing the context for the AI, and passes that context to the LLM. The LLM prepares a clear answer and returns it: "Based on the TechCorp documents, employees enjoy comprehensive health insurance, 401(k) matching up to 6%", and so on, listing the benefits of working at TechCorp. This shows the RAG pipeline is working properly, and the check confirms it: all the checks pass. Now we just need a web interface, so that users can ask questions easily instead of running these scripts. Let's set up a Flask application on port 5252. The app is now running; you can see it on port 5252.
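The pipeline test above runs three phases in order: retrieve, augment, generate. The lab's test_rag_pipeline.py is not shown, so the sketch below is an assumed structure. The `retrieve` and `generate` callables are hypothetical placeholders for a vector-database query and an LLM call (in the lab, ChromaDB and OpenAI). Fake versions are included so the code runs without a database or API key.

```python
from typing import Callable


def rag_answer(
    question: str,
    retrieve: Callable[[str, int], list[dict]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> dict:
    """Run the three RAG phases and return the answer with its sources."""
    # 1. Retrieval: fetch the top_k most similar chunks from the vector database.
    hits = retrieve(question, top_k)
    # 2. Augmentation: put the retrieved text into the prompt as context.
    context = "\n\n".join(hit["text"] for hit in hits)
    prompt = (
        "Use only the following company documents to answer.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    # 3. Generation: the LLM writes the final answer from the augmented prompt.
    answer = generate(prompt)
    return {"answer": answer, "sources": [hit["source"] for hit in hits]}


captured = {}


def fake_retrieve(question: str, k: int) -> list[dict]:
    # Pretend vector-database results (invented sample data).
    docs = [
        {"text": "Employees receive health insurance.", "source": "employee_handbook.md"},
        {"text": "Dogs are welcome in the office every Friday.", "source": "pet_policy.md"},
    ]
    return docs[:k]


def fake_generate(prompt: str) -> str:
    # Pretend LLM: records the prompt it was given and returns a fixed string.
    captured["prompt"] = prompt
    return "stub answer"


result = rag_answer(
    "What are the benefits of working at TechCorp?", fake_retrieve, fake_generate, top_k=2
)
print(result)
```

Passing the retriever and the LLM in as functions keeps each phase testable on its own, and the real ChromaDB query or OpenAI call can be swapped in without changing the pipeline.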
To access it, click the three dots, choose the View Port option, and select port 5252, where your app is running. Click Open Port, and a UI opens where I can ask questions like "Can I bring my dog to the office on Monday?" The answer starts "Based on TechCorp's documents, here's the relevant information", quotes the TechCorp pet policy, and explains that dogs are welcome in the office every Friday, so not on Monday. This answer is based on the pet_policy document, and you can see the source is referenced. Next I ask, "Do I get family health insurance, or only for myself, working at the company?" This question is about the benefits of working at TechCorp. The answer is based on the health and wellness section: it covers 80% for dependents, 100% for the employee, and so on. So, the RAG pipeline is working: you can ask your questions, and it answers from the documents instead of making answers up. Your RAG system is now live, so try more questions about benefits, products, etc. The link for this lab is in the description. It is completely free, so you don't have to buy a KodeKloud subscription. Along with this lab, I highly recommend two others. The RAG crash course lab is also free and has several modules. The last one has all the labs, covering different RAG search methods; if you want to go deeper into BM25, hybrid search, etc., check it out. Once you're done, try it out yourself. So, this was our video on RAG, or
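The web UI above shows each answer together with the document it came from. A minimal sketch of that last formatting step follows. It assumes the pipeline returns an answer string plus a list of source file names, possibly with repeats when several chunks come from the same file. The function name and output layout are hypothetical; the lab's Flask app code is not shown.

```python
def format_with_sources(answer: str, sources: list[str]) -> str:
    """Append a de-duplicated, order-preserving source list to an answer."""
    # dict.fromkeys keeps the first occurrence of each source, in order.
    unique_sources = list(dict.fromkeys(sources))
    lines = [answer, "", "Sources:"]
    if unique_sources:
        lines += [f"- {src}" for src in unique_sources]
    else:
        lines.append("- (none retrieved)")
    return "\n".join(lines)


print(format_with_sources(
    "Dogs are welcome in the office every Friday.",
    ["pet_policy.md", "pet_policy.md", "employee_handbook.md"],
))
```

Showing sources lets users verify an answer against the original document, which is one of the practical advantages of RAG over a plain chatbot.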

### [19:38](https://www.youtube.com/watch?v=RosLeHGBLoY&t=1178s) Conclusion

retrieval-augmented generation. I hope this video was informative. If you have any questions or doubts, feel free to let me know in the comment section. And if you want a separate, dedicated RAG project that explains settings like temperature and top-k chunks, how to work with different vector dimensions, or how to use Postgres as a vector database, let me know in the comment section. I'm happy to create another video. See you in the next video. Bye-bye.

---
*Source: https://ekstraktznaniy.ru/video/51722*