GenAI Project 1 - LLM Fine-Tuning with LoRA on Google Colab | Text-to-SQL
1:38:40

GenAI Project 1 - LLM Fine-Tuning with LoRA on Google Colab | Text-to-SQL

Siddhardhan 16.04.2026 2 890 просмотров 116 лайков

Machine-readable: Markdown · JSON API · Site index

Поделиться Telegram VK Бот
Транскрипт Скачать .md
Анализ с AI
Описание видео
🤖 My end-to-end Machine Learning & Generative AI Course - Udemy: https://linktr.ee/siddhardhan In this video, we build a complete LLM Fine-Tuning project with LoRA on Google Colab and create a practical Text-to-SQL system step by step. You will learn how to fine-tune a lightweight large language model efficiently using PEFT / LoRA, prepare the dataset, train the model on Colab, and test it on real natural language to SQL examples. Colab notebook link: https://drive.google.com/file/d/18_n8VYa287Hf2sOPAKE_G9ok8ptyvspn/view?usp=sharing #generativeai #genai #artificialintelligence #ai #llm #finetuning

Оглавление (20 сегментов)

Segment 1 (00:00 - 05:00)

Hello everyone. I'm Siddharthan. Welcome to generative AI project one. In this project, we will learn how to fine-tune a large language model using Google Colab for a practical task, and that is text-to-SQL. By the end, you will be able to take natural language questions and convert them into SQL queries. While pre-trained LLMs are powerful, fine-tuning allows us to specialize them for specific use cases. Here, we will train a smaller model on text-to-SQL data, so it follows instructions more precisely and generates output in the exact format that we expect. So, let's get started. So, before getting into the hands-on part, let me quickly show you the Udemy courses that I have. So, right now I have two courses. One is a complete generative AI course, and the other one is a complete machine learning course. So, here I've started from the very basics and all the way up to advanced concepts. So, in generative AI course, you have topics like prompt engineering, rag, AI agents, MCP, and we also have topics like deployment and capstone projects as well that you will build at the end of the course. Similarly, in machine learning, I've started from the basics of machine learning, basics of Python, and we have machine learning models intuition, capstone projects, etc. So, take a look in case if you are interested. I'll provide the link for these courses in this video description, so check that out. So, now let's get started with the hands-on part of how we can fine-tune these large language models. The first thing that you need to do is access Google Colab and create a Colab notebook like this. So, I have this dot ipynb Colab notebook created, and next you need to go to this runtime, click this change runtime type, and select this T4 GPU. By default, Colab notebooks will have CPU as their environments hardware, but we need this GPU for our fine-tuning. So, I have a subscription-based plan, so I have access to these large GPUs like A100, etc., but free tier comes with this T4 GPU, so you don't have to worry about getting these larger GPUs. So, we will take a smaller model, and we will be fine-tuning it on this T4 GPU, which is like a much more smaller memory compared to this other larger GPUs. So, access this runtime type, select T4 GPU, click on this save. Now, as I said, we will be using a smaller model, and that is Tiny Llama 1. 1 billion chart version. So, we will take this smaller model and fine-tune it on our text-to-SQL data. And these are the steps that we will be doing this for this particular lesson. So, first is checking the GPU availability. We will print whether GPU is available or not, what is the CUDA version, etc. So, we will first check that, and then we will load this base model, which is this Tiny Llama model, and to save memory, we will be loading this in a floating-point 16, a smaller precision compared to the default precision of this model. And then we will inspect how the base model answered some prompts. So, what we will do is, before fine-tuning, we will send it some prompts, basically some questions, and understand how this base model is generating SQL queries, so that later, once we complete this fine-tuning, we can understand how this fine-tuned model is giving the responses. So, for that comparison, first we will understand the output structure of this base model for natural language to SQL conversion. And in the fourth step, we will prepare a small instruction dataset. So, we can use a dataset that is available on Hugging Face that has, you know, your SQL queries and natural language pairs, so we will be using that to fine-tune our model. And then we will attach this LoRA adapters and train it. So, here we are not going to fine-tune the entire model, rather than that, we are going to use this concept called as LoRA. So, I'll explain more about this LoRA later once we get to it, but yeah, just remember that we will be using this specific technique. And then we will inspect how this fine-tuned model answers these same prompts that we have used before. So, we will do a comparative analysis of how the output were before fine-tuning and how it was after this fine-tuning. And finally, we will save this adapter. Save this adapter is nothing but how you can save this fine-tuned version of this model. So, this is what we are going to discuss overall in this particular notebook and in this lesson. So, we will get started with the actual code. The first step that we need to do is set up our environment. So, I'll create a text cell, and I'll say environment setup. So, in this environment setup, we basically install all the required libraries that we need. I'll just make this as a section, so that our code is like more, you know, structured, and looks good in in, you know, terms of structure-wise. Right. So, these are the libraries that we need. So, first I'll run this installation command. So, we are getting connected to a GPU T4 GPU environment, and this installation script is getting executed. So, first we need this Transformers library, which is basically Hugging Face library. We need it in order to load this Tiny Llama model. And then we have this PEFT. PEFT stands for parameter efficient fine-tuning. And within PEFT, we have methods like LoRA, QLoRA, etc., which we will be discussing

Segment 2 (05:00 - 10:00)

later in this particular lesson. And then we have Accelerate. Accelerate is used to automatically load your large language model into GPU. It's basically for better handling and easier handling of these models into our devices like CPU or GPU. So, for that we need this Accelerate. And then we have this Datasets library, which is also from Hugging Face. So, this would have our text-to-SQL dataset, and in order to load that, we need this library. And then we have TRL. TRL stands for us the functions and methods that we need in order to fine-tune our model. And then we have SentencePiece and Protobuf, which are required by the tokenizers of the LLM. So, here we have this Tiny Llama model, right? So, this Tiny Llama model or this Tiny Llama LLM, or any LLM for that matter, would have a tokenizer. And the job of a tokenizer is to split the prompt or the text that is coming in into smaller pieces called as tokens, and the SentencePiece and Protobuf are required for that internally by these tokenizers. So, these are the libraries. Transformers for loading the model. PEFT is the library that's going to give us the LoRA method in order to fine-tune the model. Accelerate is for efficient loading of model to the GPU. And then we have this Datasets to load our text-to-SQL GPU. TRL as our fine-tuning-related code that we can use. SentencePiece and Protobuf, as I said, are used for the tokenizers internally. So, this these are the libraries that we need. So, in this next part of the code, we are going to check the availability of GPU, what's the version of, you know, let's say CUDA, PyTorch, etc. So, first I'll just like run these two lines of code. So, first we are importing the system library in order to print the version of Python that is installed in this Colab environment, and then we are also importing PyTorch. PyTorch is basically PyTorch library, and we are also printing PyTorch version. Let me type this as PyTorch. So, it's going to say that Python version is, let's wait for this output. Right. So, Python version is 3. 12. 3, and PyTorch version is 2. 10. 0. So, this CU128 means CUDA 1 12. 8. Uh yeah, so I think that is the version number, basically. It's 12. 8 or 1. 28. I think it's 12. 8, that maybe you can refer to it. So, you have this CUDA version, but in case you don't have any GPU, you can also install the CPU version of these libraries as well. In the next step, I'm going to check the availability of GPU. So, let me run this code. So, this is the first part. Maybe first I'll just like show this part of the code. So, this is going to tell you whether CUDA is available. So, we are using this torch PyTorch library that we have imported and say torch. cuda. is_available. So, this is going to return you a boolean. If you have a GPU in your environment, it's going to return a true. If you don't have a GPU, in case you have only a CPU environment, it's going to say false. So, this is just to validate whether like we have, you know, GPU available in a specific instance or machine. And now, in case CUDA is available, that means like in case if you have any Nvidia GPUs that is available, we are going to print the name of this GPU, and then like get the properties like what's the memory, what's the RAM, etc. So, let's print it. So, first we have the Python version, and then we have the PyTorch and CUDA version. We have It says whether CUDA is available or not, and then it says GPU is Tesla T4 GPU, so this is also another Nvidia GPU. And then we have this GPU memory of 14. 56 GB. Again, we won't use you know, all of this memory, so we will only use like a part of it. As again, we are using only a smaller model, and even that we are loading in a smaller version of FP16. So, we will get to that. So, this is basically for us to understand our environment in a more better way, so that according to that, we can load whatever model that we need. So, the next step, let me type it over here, is load the base model in FP16 version. So, this is basically the precision in which we want the model in. So, 1. 1 B basically represents there are 1. 1 billion parameters to it, right? And each of this parameters will be represented in some decimal values, and let's say the default precision value is floating-point 32, and this is going to take slightly larger amount of memory, GPU memory. So, we can load this in a smaller version of lesser precision, that is floating-point 16. So, that's basically the concept of this. So, I have a text over here. Let me paste it for your easier understanding. One second. All right. So, here I'll say we load the model in half precision, that is floating-point 16 precision, and on T4 GPU, this is the sweet spot, and this is like much faster than 32-bit like precision values. Right. So, the next step is loading the model. So, for this, I'm going to say from Transformers

Segment 3 (10:00 - 15:00)

import AutoTokenizer. And then, AutoModel for causal LM. Right. So, these are the two methods that we need. And again, we have already imported torch, but let me just like put it here as well. So, that in case you don't have you don't run this code in your notebook, you still have this PyTorch imported over here. So, I'll say import torch. Right. So, AutoTokenizer is used to load the tokenizer for our TinyLlama model and AutoModel for causal LM. This loads the actual large language model. So, these are the two things that we need. So, next thing that I have to do is provide the model ID. So, model ID, we have to provide this TinyLlama model name. So, go to Google and search for TinyLlama 1. 1 billion chat V1. So, go to this huggingface site and here you will be having this copy symbol. So, copy this. So, it says TinyLlama/tinylama-1. 1b-chat-v1. So, copy and paste it over here. So, this is the model that we need. Right. So, for this model, first we need to load this tokenizer and then load the model as well. So, let me copy those specific code and then I'll explain you what is happening. Right. So, first in this part, we are loading a tokenizer and the model. So, what is a tokenizer? Let's say we send a prompt like I love machine learning or let me say I love ChatGPT. So, let's say that this is the prompt that I'm sending to a large language model. So, before it kind of can process this text, first it uses a tokenizer to split this text into smaller pieces. So, it will be something like uh I and then it would split this as love and then ChatGPT can be split into chat. And this would be split into GPT. So, this one sentence has been split into four tokens. So, this is basically what a tokenizer would do. Later, it would also convert this into unique IDs, but just remember that it's basically a process of splitting text into smaller pieces. So, for that, we need this tokenizer. And we use this AutoTokenizer that we have imported, say dot from_pretrained and provide this model ID. And the important point to note here is not all models use the same tokenizer. Each model has its own tokenizer. So, I'm saying that AutoTokenizer. from_pretrained, provide the model ID that is TinyLlama. So, it's going to load the tokenizer for this specific model. And in the next step, we are loading this AutoModelForCausalLM. This is loading the actual model itself. So, we have this AutoModelForCausalLM, similarly say dot from_pretrained, provide your model ID, which is this TinyLlama-1. 1B. So, provide this data type. What's the data type that you need? So, we have discussed that we'll be using FP floating point 32, not uses, but uses floating point 16. So, I'm saying torch. float16. So, it's going to take much more smaller amount of memory in our GPU. Device map as auto. So, auto as in like if a GPU is already there, then it's going to load the model on the GPU. And this is why we have installed this accelerate library. So, this is basically the part of how you would load this tokenizer as well as the model. So, let's execute this. Now, I'll explain the purpose of this part of the code. So, this is like nothing complex. So, here we are basically loading our tokenizer. And after loading the tokenizer, what we are saying here is check if this tokenizer has a padding token. If it doesn't have a padding token, then we are saying that for padding token, you can use EOS token. EOS stands for end of sentence token or end of string token. And say tokenizer. padding_side is equal to right. So, let's try to understand what this three lines of code does. As I said, first we are simply checking if a tokenizer has a padding token. So, what it basically means is, let's say we have a input data something like this. I love ChatGPT and there are totally four tokens. And now, there is some other sentences saying, let's say machine learning is or I'll say ML is Yeah, so I'll say machine learning is a branch of artificial intelligence. So, let's say we have like other

Segment 4 (15:00 - 20:00)

sentence and this sentence have like these many number of tokens. Maybe let's count this quickly. Right. So, similar to this specific sentence, this sentence can also be broken down into smaller tokens. So, this sentence has 1 2 3 4 5 and 6 tokens overall. But the first sentence overall has like only four tokens, right? So, now what happens is when we send multiple sentences as a batch to our tokenizer or for the model, these sentences should have the same number of tokens. Here, there is a mismatch. So, we will add a padding token something like this. So, it would be looking like this. These are special tokens. So, it would look like pad. And then we would add another this These are kind of like placeholder tokens for us pad. So, now if you count this, this particular sentence has six tokens overall because we have added two padding tokens. And this sentence without padding tokens, it has six tokens. So, let's assume that six is the number of tokens it needs. Now, some tokenizers or the models won't have a specialized padding token. So, here I'm saying use a end of string token. So, EOS token is basically after a string or a sentence is ended, it would automatically add this EOS token. And this is for telling the model that the sentence has ended. Now, what happens is some models, as I said, won't have this padding token. So, here we would say instead of padding, you can simply use this EOS token itself. So, this is basically what we are doing over here. So, first check whether tokenizer has specialized padding token. If not, for padding tokens, you can use the EOS token as well, end of sentence token as well. So, that's basically a simple process that we are doing over here. And here we are saying tokenizer. padding_side is equal to right. That means it's just like saying here we know that uh two tokens are kind of less in this first sentence. So, I'm saying that I don't want to add it in the left side. So, I don't want to add this padding token kind of in the left side. So, rather add it on the right side. One second. Right. So, here I'm saying add it in the right side. Not on the left side of the sentences. So, that's basically the process for it. So, import your torch, import your tokenizer using which we are going to load the tokenizer and AutoModelForCausalLM for loading our large language model. And then we are saying that this is the exact model that I need to load. So, first load your tokenizer and then load your model with specifically this floating point 16 version. And then check if your tokenizer has a padding token. If it doesn't have any specialized padding token, then you can use this EOS token as well. And padding tokens should be added towards the right side. So, as simple as that. So, that is the next part that we have done. Okay. So, let me just like remove this. And after this, we can also set up few things. So, I'll quickly copy this code and paste it and I'll explain what is happening over here. Right. Let me run this. So, this model is already loaded. So, if I run this, it's not going to load it again. So, it will be cached and it will be loaded from there. So, yeah. So, the model is getting loaded. So, basically it's not getting downloaded again. Right. So, it says total parameters 1100. 0 million. So, it's basically a 1. 1 billion version model. So, here we have this specific part of the code to print the number of parameters in terms of millions. So, it says like it has uh 1,100 million parameters, which is basically 1. 1 billion parameters overall. Memory footprint basically shows you what is the total size that this model has occupied. So, it says like it has occupied 2. 05 GB of size. So, we are printing basically those basic things. So, here nparams is equal to sum of p. um numel. Numel is basically number of elements for p in model. parameters. So, it's basically a way of saying count the number of parameters in this model and then print it in terms of million values. And then here we are printing the size that it has occupied using this memory model. get_memory_footprint. That is all. And here we have two important configurations that we are setting up. So, once we load this language model, so we are saying that model. config. use_cache is equal to false. And then we are saying model. config. pretraining_tp is equal to one. So, use cache. So, what this use cache is? Uh these models use a concept called as attention mechanism, which is basically the model's way of getting context from other tokens and other words in a sentence. I'll just explain this in to you in a simple way and later videos, I'll make much more detailed videos on LLM working, attention mechanism, etc. So, right now just remember that we have a concept called as

Segment 5 (20:00 - 25:00)

as KV caching inside this model. So, for each of this query, the model calculates or calculates this KV and Q, K, and V vectors. For each of this kind of like uh tokens, okay? So, Q, K, and V. And what this caching what this Q KV cache means is instead of calculating this K and V again and again, so we would kind of cache these values for the previous tokens that we have calculated and reuse this. And the main thing is this is used in terms of inference. Inference is basically, let's say you send a prompt like what is machine learning? And the model should answer this. So, this is the prompt. The model should answer this, right? And this is basically my inference part where your model is answering a given prompt or a question. For these inferences, we enable this KV caching because in that case the inference is going to be much faster because it's not going to recalculate this K and V values again and again for each of the tokens. Whereas in the case of fine-tuning and training, we don't have to enable this. So, that's basically what we are providing here. So, use this model. config use cache is equal to false, that means don't you know, perform this KV caching as I'm kind of training. And this is also required for rate later gradient checkpointing. So, we would calculate this gradient. Gradient basically tells the model how it should update the weight values and the parameter values. So, for this we don't need this KV caching process. I'll just put it simply whatever we have discussed. So, there is a process of caching that happens in the inference part. But we are not doing inference right now. We are going to perform fine-tuning, which is similar to kind of training LLMs. So, we don't have to enable this KV caching. And during this training part, we also need to checkpoint these different gradient values that we will come across later in this code. So, for this purpose, we are saying that don't do this caching of K and V vectors. And this says model. config. pretraining_tp is equal to one. So, this is more on the GPU side of things. So, here I'm just like saying that I'm not running or loading my model in multiple GPUs. So, this is basically for what we call a tensor parallelization. So, this value controls this. So, if you have, let's say uh six GPUs or five GPUs and you have a much larger large language model that has like, let's say 70 billion parameters or like 100 400 billion parameters, we would load that on multiple GPUs. And this particular config. pretraining_tp, this is for TP basically stands for this tensor parallelism. So, it's basically for efficient loading of model in multiple GPUs. So, this parameter basically controls that. So, these are the configurations that we are providing. So, first we have loaded the tokenizer and we have configured the padding token as the end of sentence token. We have loaded the model in floating point 16 format and we have said that I don't want this KV caching right now as I'm fine-tuning it. And we have provided this pretraining TP value as one as we are not doing any parallelism as we only have only one GPU. And then we have printed the total parameters, memory footprint, etc. So, don't worry if some of the technical terms like KV cache, what is meant by this K and V doesn't make sense, don't worry about this. As I said, I'll definitely make more videos about the core theoretical side of it. So, right now we have successfully loaded the model and we have set up some configurations. So, this is our second step. Now, the third step is define a prompt and test the base model. So, what we are going to do in this specific step is we will create some prompts, kind of test prompts, and feed this to the base model, see how it is responding. And then after fine-tuning, we will also send the same prompts and check how the fine-tuned LLM is working. So, this is what we are going to do here. Uh I'll Let me just like add this text as well over here saying Yeah. So, we use TinyLlama's chat template and task is text to SQL. Given a table schema and a question, produce a SQL query. So, first I'll show you this function that we are building. So, I'm going to say define build_prompt. And this is going to take two parameters as input. One is schema, which is a string data type and it's going to take a question, which is also a string data type. And the output from this is also a string. So, this is the function that we are interested in. So, here we need a system command or basically the system prompt.

Segment 6 (25:00 - 30:00)

And let me quickly maybe paste this. That would be a better way. So, I'll just like paste this code and I'll explain what is happening over here. Right. So, this is how the input are going to be for this text to SQL conversions. So, 1 second. Oh, there is a typo in the schema. Okay. So, let's understand what is the input that we are going to send and what is the output that we are going to send. So, we will send the schema of a table and we will send a question. By looking at both of these things, the LLM has to provide a answer. Say, for example, let's say we have a schema and question like this. So, I would send a schema saying like there is a table. So, we are saying that there is a table called as employees and that has ID as one column, name as one column, department and salaries as like different columns in it. And the data type for this ID is integer, name is text, and salary is integers, etc. So, we would send a schema like this and a question saying that list the names of employees in the engineering department earning more than, let's say, 100 K US dollars. So, now the model has to give a SQL query that would get this relevant output. So, it should say something like this. Uh you know, select name from employees for engineering department, etc. And then get the number of people who has like more than this salary. Some SQL queries the model has to send. Now, uh we don't want to create these system prompts and send this for every time, right? So, let's say that this is like one query that I'm sending and a different query for a different schema and question. So, we don't want to write the system prompt every time. So, we create this reusable build prompt function that's going to add the system prompts to it, structure this prompt in a better way, etc. So, let's try to understand this part of the code now. Right. So, basically I'm saying that to this build prompt function, I will send two things. One is the schema of the table, the other one is the question for which the LLM should refer to the schema and then frame the SQL query according to the schema. So, that is what we are interested in. So, we are saying that the system prompt is you are a SQL assistant. Given a table schema and a question, reply with only the SQL query, nothing else. So, this is a strong condition that we are providing. So, system prompt is basically we just like let the LLM know what we are expecting from it, what role it is playing. So, here we are saying that it is basically a SQL assistant. And we are going to give the schema, which is this line. And then we are going to give a question, which is the second line that we have. And it should reply only with SQL query. So, it should not explain about the question. query, anything. We are only expecting a SQL query that we can, let's say, later run against a database, get some output. That is like the next step. But all that we are expecting is only the SQL query as output and nothing else. Now, I'm creating this user message as the schema and question and then add it to this messages. So, this message list is how we would usually send input to the LLM, right? So, I hope you have already worked on it. So, usually it would be something like this. A list that has role as system. And then there would be this key called as content. So, here there would be some system message, something like you are a helpful assistant. Okay? And then there would be another role just like this. And then another role. So, first let's say we have a system message basically telling LLM what it needs to do. And then there would be a user message. And in this content we here we would send the user message saying, let's say, the user is asking a question on what is ML, what is machine learning? And then we would have this assistant response over here that would say that, you know, machine learning is a branch of AI or something. So, ML AI. So, first we won't send this assistant text. So, we send these two things. One is the system prompt and the user question. And now the LLM, by looking at these two things, would answer this, which is as it says ML is a branch of AI, etc. So, this part, this assistant part, comes from this LLM. So, we need to take these input schema and question, create a list like this with a system prompt and the user prompt. So here, instead of this you are a helpful assistant, we are going to add this you are a SQL assistant. Give this given the table name schema and everything, give me a SQL query. And that we are adding

Segment 7 (30:00 - 35:00)

to the system message, just like the way that we have added here. Similarly, for user content user query, we are adding this content user, which is basically the combination of schema and the question that they are providing. Now, this is what gets This is basically the content that will be sent to the LLM and the LLM would respond something like this. In this case, for this particular schema and question, it would answer with a SQL query. So this is how we are framing the prompt. So maybe I'll show you how this process works. So once we create this messages list, we use this tokenizers. apply_chat_template. Now, what happens is once we have this uh messages like this or once we create these uh roles. The LLM expects specialized tokens something like this. It would say message start or it would say like instruction start or like before the start of this messages or it would at the end of this messages it would say instruction stop etc. So it would add these specialized templates that the model is trained on. So all of these open-source models has their own templates and you don't have to remember how this template starts for let's say a tiny llama. Now, for that we can use this tokenizers and use this dot apply chat templates, pass on your messages, it's going to automatically apply that. So I'll show you how this will look like. So I have this test prompt over here. So I'll say a test prompt one. Let me quickly copy this and show this to you. So test prompt one, I'm calling this build prompt function and passing the schema content and this question content and I'm printing this test prompt one. So first, what it will do is put this in this list called as messages that has as the system message and the user's content, which is basically the schema and question, and it's going to apply this chat template. So let's run this. So I need to execute this cell and run this. So if you see, this list that we have earlier created with this uh role of system, role of user content etc. has been replaced with this tag of system and it says you are a SQL assistant given a table schema and everything and then we have this uh user question of create table employees etc. and the question is list the name of this employees etc. The only difference is that it has added this tag called as system, user etc. So similarly, a different llama model or a different let's say a gamma model would have different tags similar to the system, user etc. So that's what this tokenizer. chat_template is going to add. I'll put this through a print statement. This would be like more clear from here. So I'll say test prompt one and now it says system uh with this like lesser than symbol, this vertical bar symbol etc. It says you are a SQL assistant, give me the SQL query. This is basically coming from the system prompt that we have provided over here. And then the user message has schema and then it has question in it, which we have combined in this particular step. So this chat apply chat template is nothing but adding this proper tags to the list that we have earlier created because the model expect the tags like this. So as simple as that. So this is what this build prompt is saying. So if I give a different uh query over here, different question under different schema, so these two parts will be updated. So create table, so this part will be updated with the new schema and this question user question. So that's all is going to change. So every time you want to send a new query, you can reuse this build prompt template function. So that is the idea. Now, we have used this tokenizer, we have converted this chat template and tokenize is equal to false. So what it means is once the tokens are basically created. Basically, as I said, tokenizer split the text into smaller pieces. So as we have seen in this example of I love chat GPT example, longer sentence is split down into smaller pieces, right? And then they are converted into token IDs. Here we are saying don't tokenize it yet, later I will do this. Just now just apply the chat template alone. I don't want to split this prompt into smaller pieces right now. I don't want that. So we are saying that tokenize is equal to false. And here it says add_generation_prompt is equal to true. So this true is basically going to add this assistant tag to it. So that when we pass this entire thing, so this entire thing will be passed to this LLM. Now the LLM knows that I have to provide the answer after this. To trigger that, we are providing this assistant tag and that's what this add_generation_prompt true is going to do. So these are the things that are basically happening here. So we framed a system prompt and this function then can be later used to just like update your schema and your question. And to that we will add this system message and then we would apply this apply chat prompt template so that the system tags and all these tags are kind of like attached to it. So this is basically what we are doing here. And in the next step, what we are going to do is create this generate function, which is basically sending this prompt to the LLM

Segment 8 (35:00 - 40:00)

and getting the output. So I'll paste the function for that over here. So here we are saying torch. no_grad. No grad stands for don't calculate the gradients for this because we are not doing any training here, right? Later we will do fine-tuning, but this generate function is only for generating a response. So as I said, gradients are basically calculated to tell the model or the training system that this is how I want to update the values of this particular model. And that happens in the training part, but this is inference where we are only sending a text and getting an output from this. So say this at torch. no_grad, which is saying that don't calculate any gradients. This is in the inference step. And we are creating this function called as generate. So to this generate function, we send this prompt and say max_new_tokens. That means what's the size of the token. So we are just like having specifically having this a lesser number of tokens so that our response comes quickly. Nothing critical in it. And this prompt is basically what we have created in the previous step. So now, first is that the user would send a schema and a question to this build prompt and that would frame the prompt that is like understandable by the model. Later we are passing this to the generate function. Now, we create a variable called as inputs, use this tokenizer that we have loaded in the previous step, pass the prompt return_tensors pt. model. device. So this is basically the part where the uh you know, input text is basically converted into token IDs. So first it will split the sentence into smaller pieces called as tokens and later these tokens are getting converted into token IDs. So that happens here. And this return_tensors pt says that return this in terms of PyTorch tensor data type. dot_to_model. device. So what this does is now in this case, device is GPU as we have loaded the model inside this GPU. So the reason we are doing this is the tensor once we get it from the tokenizer will be by default, let's say will be in CPU, but the model is loaded in the GPU, right? So we have used this device_map is equal to auto. So accelerate would load this model inside GPU as we have a GPU CUDA environment. Now we are saying that this input should also be in that same device. So if tensors are on the CPU and the model is on GPU, it's not going to work. So both the data as well as the model should be on the same device. So this is what it does. So once we get the prompt, this specific prompt that we are looking at, this tokenizer is going to split it into tokens, get the token IDs from this and move it to GPU. And now we call this model. generate. So this model we have loaded from here. So we have said model is equal to auto model for causal LM. So we have loaded the model, right? So now we are sending it. And this model would again send token IDs as output. Basically, the answer for this question that the user has asked, the LLM would generate it, but it will be generated as tokens, not as strings. Later, once we get this output IDs, we would decode this using this tokenizer. decode. So this is the whole process. First, split the data into smaller pieces called as token, get the token IDs, pass that to this model. generate. So this is going to generate the output, then decode it. So this is the main flow that is basically happening over here. Now, we have some uh you know, parameters over here. So first, we send the inputs, the input tokens that we have created, max_new_tokens, the value. So what is the size of the output? It says that output should only have at a maximum of 120 tokens or those 120 smaller pieces. do_sample is false. So this is like another parameter. I'll just like give a very quick uh explanation for this. Let's not dive deeper into it. So do_sample, what it means is uh a causal LM basically means it is going to generate one token at a time. Now, let's say that the user is asking a question of list the name of the employees in the engineering department etc. So it would start with a select query, right? So this is what the model is going to return as the first token. But what happens is the model would say that the probability of the term select is 50%. And word, let's say uh something like SQL query or I'll say delete is 10%. So similarly, it's going to provide probability of what's the next word is going to be. Now, when I say do_sample is equal to false, it's always going to get the top token, basically the one that has the highest probability. We have the top P, top K different methods and each of these methods has a way of selecting different tokens. The reason we have this do_sampling is like you don't always want the same uh you know, output. So you kind of sometimes want the creative output, you want to explore like different words. Let's say in case you are building a LLM for poetry, writing poetry, poems, etc. So, you don't always want the same value. So, if you use this do sample as false, the output is going to be deterministic. You're going to get similar values. So, there we would use sampling techniques like top P, top K

Segment 9 (40:00 - 45:00)

etc. But here we just want to keep things simple and just like consistent. So, we are saying like do sample as false. Just remember that this is just like using your temperature value. So, we would have used this temperature value as uh 0. 0, right? It's kind of similar to it. It works slightly differently, but right now remember this remember this. Do sample false is like providing a temperature value of 0. 0, similar to it. And then we are passing our pad token IDs, EOS token IDs to the model, etc. So, that it knows which is a pad token, which is a EOS token, and all that. Now, this is a simple slicing technique. Now, what happens is once the model has generated this output, it won't give a SQL query something like this. Select, let's say employees, uh and let's say this SQL query is continued. Let's assume that this is how this SQL query is going to come. But uh how this will work is Right. So, now instead of giving the output, the model is also going to give this input string as well. So, it would be something like this. So, it would first paste this input prompt and then continue with its answer. Now, I need to slice only this particular output and then print it to the user. So, this is what is happening over here. So, here we are identifying the input length of this input IDs. So, I'm basically counting the tokens for this inputs, removing that, and only printing the output. So, initially it would be something like this. From this I'm calculating what are what is the size of this input tokens and then slicing it and providing the output. So, those are the two lines that is basically happening over here. So, first we get the schema and question from the user and we build this structured prompt with model expecting tags and all that. And then we send this to this generate function that's going to accept this prompt and then it's going to generate the output. And then we are just like slicing the input and then generating only the output. So, this is the generate function. So, now let's call this generate function and see how the output is looking like. So, let me run this generate function again and get the output. It says to answer this question, you can use the following SQL query. Select name from employees where department is equal to engineering and salary is 100k. So, this query will return a list of names of employees in the engineering department who have a salary greater than 100,000. Now, it has generated this SQL query, but the problem is in the system prompt we have clearly mentioned that given a schema and question, reply only with the SQL query, nothing else. So, basically I've said that once you're generating, I only want this SQL query. I don't want this sentence, etc. So, this is what we are going to basically do this fine-tuning for and the purpose for this. So, we do this fine-tuning where a generic model that is trained on a pre-trained model basically is good at generic task, but it's not that much great in following style, structure for specific domain related use cases. So, in this we want a model that's better at following the specific instruction that we are kind of providing it. And the reason is now a larger model can follow this instruction better, but for the specific use case we can take a much more smaller model and train it on this data. So, there are like several advantages. It follows instruction for that specific task much more easily and now you can use a pretty smaller model instead of using like let's say a open AI model that has like let's say trillions of parameters. So, these are some of the advantage of it. So, in this case we are going to fine-tune a model that can follow this instruction of just given a query, just give me a output. Don't give me this uh you know, extra content text, etc. Now, you can think that I can simply do this with a few-shot prompting or like other methods, but the idea is there are some use cases which is not doable just by using a few-shot prompting or a one-shot prompting technique. So, there we have to go with this fine-tuning approach. So, there are like use cases that can be solved with a better prompt engineering strategy that, but there are use cases that cannot be solved with these things where you need like a fine-tuning approach where you want the model to follow a specific process, specific behavior, and style. So, that's what this fine-tuning is for. But again, this is a simple use case that can happen without fine-tuning, but the whole purpose of this video is to give you this understanding of how you can fine-tune a model for a much more complex use case, okay? So, long story short, remember that this process can be achieved by prompt engineering, but we are mainly focusing on understanding how we can instead of doing all those things, simply tune a model to follow that instruction better and just like answer what I'm expecting, which is basically given a natural language question, reply with only the exact SQL query that is expected. So, this is what we are trying to follow over here. Just like this, I have like other three

Segment 10 (45:00 - 50:00)

you know, examples that we can test it out over here. So, here we have like one schema saying that there is a employees table that has ID, name, department, salary, the same question that we have asked and the question also. And then there is another table called as orders that has order ID, customer ID, you know, amount, etc. And then we have this question of what is the total order amount per customer ID in 2024, etc. So, basically we give different schema, different questions trying to identify or understand how this base model is answering. So, we take this probes list which has like specific schemas and question. We are basically going to do what we have done over here. So, we got the schema and question, apply this build prompt, and then pass down to this generate function. Similar thing we are going to iterate. So, I'm going to first take this set of schema and question, pass it to this build prompt, and then generate, print the output. And then for this probe two, for this example two, I'm going to send the schema and question to this build prompt and generate. Similarly do this for third thing. So, we are doing this process. So, now we understand how the model is answering before fine-tuning. So, now probe one, we are also printing the question and the answer that we got. So, probe one, we have the question list the name of the employees in the engineering department. And the answer again, it doesn't follow the system prompt. It gave us some explanations and some extra text here as well. Similarly for probe two and probe three, we are like seeing additional output as well. So, this is the condition of the model before fine-tuning. So, now let's see how it works after like we perform this actual fine-tuning thing. So, I'll also add a text over here saying expect the base model to talk about the query, add commentary, and re-explain the schema or produce malformed SQL, and that is before state. So, this is the expected behavior when you are working on a generic pre-trained model. So, now in the next step, what we are going to do is load and prepare the data that we are going to fine-tune our model with and then we would perform this LoRA-based fine-tuning get a finalized fine-tuned version that would maybe follow the instruction in a better way. We will get to know that. So, let's maybe understand that. So, the next step let's focus on that. So, I'll say the fourth step is load and prepare the data. Prepare the data set. Right. So, this is the fourth step. So, we are going to use this BMC2 SQL create context data set and this would have our question, context, and answer triples. We will take a small slice so training finishes in a few minutes on T4 GPU. So, this is the data set that we are going to work with. So, first I'll show you how this data set looks like. So, for this as we have installed this data sets library, which is a hugging face data sets library, we are using this load data set method. And then I'm going to say this. So, raw is equal to load data set. So, I'm loading this BMC2 SQL create context data set, split as train. So, this data set would have like a training chunk and it would have a validation set test set. So, we are only taking the training data set. So, printing the full size of this data set and a example row. So, understanding what is the content is going to look like. And then we are going to only keep like a small portion of it. So, this data set almost have like close to 80,000 data points, but for fine-tuning we don't need like that much. So, I'm just going to take a very small piece of this, which is 3,000. You can even take more of this if you want to achieve more accurate results, but for this demonstration I'll go with a smaller number of data points. We first I'll explain you what is happening and then we will continue. So, this load data set that we have used over here is going to download this SQL create context data set, the training chunk alone, and print what's the size of this data. Basically, how many rows are there? And print a example row as well. So, full data set size it says 78,577. So, about 78k data points are there. Example row is this. So, we have our answer, question, and context. So, I'll maybe explain you the exact structure it has and then we will look into it, but just focus on this part of the code. So, here I'm saying I want to create two data sets. One is my training data set and the other one is evaluation data set to understand what is the loss and the performance is like. So, I'm taking this raw data set, shuffling it with seed 42. So, 42 is like random state value so that every time if you use the seed 42, you're going to your data set is going to be shuffled in the same way. So, first I'm shuffling it and I'm selecting only 3,000 values from this. So, from this overall raw data that has 78,000, I'm getting only 3,000 shuffled random data points from this and now I'm splitting that into 5% test

Segment 11 (50:00 - 55:00)

data set and the remaining 95% will be test data set. Uh Sorry, the 5% will be test data, the remaining 95% will be training data. See it again, random state if you use this 42 number, if you are trying this code, if even you are using this 42, your data is going to be split in the same way that my data is getting split. So, that is the thing. If you use a different number like 41 or 1 2 3 etc., your data is going to be split in a different way. So, now once we identify which are my training data and test data, we are creating this training data set and evaluation data set and printing it. So, now it says full data set size is 78,000. Example row is this that has three main things, answer, a question and the context of it. And uh it says training data which is like 95% is 2850 out of this 3,000 and the 5% of 3,000, here we have like 150. So, this is our training data and test data. So, now let's understand this data better. So, I'll say understand the data set. So, I'll print this train underscore DS of zero. So, it has like this answer, question and context. So, there are like three keys. Maybe first I'll just like say train DS dot keys to get the actual keys that we have. I mean, we know already that is like answer, question and context. Now, let's try to print this. So, I'll say print uh train DS train underscore DS of zero. Zero means out of 3,000, just print the first uh data point alone. And within that we have answer, question and context, right? So, first I'll print this context. So, create table employees. I'll add a line break over here. So, this basically prints uh 70 equal to signs. And now I'm going to similarly print my question and context. Sorry, question and answer. So, this will be answer. So, here we have seen that this has question as one key and answer as the other key. So, let's print it. So, this is how the first data point looks like. So, what is this context, question and answer? So, here I'll add another string part and say this is basically the schema that we were providing in our prompt. So, schema and this is my question. And this would be my answer. So, here I'll say answer. So, just like the way we have tried earlier, right? So, we were providing this uh schema and question and we were expecting an answer. So, this data set has just like a supervised uh learning. So, we have this, you know, input and the output. Input is this schema and question and the output is this answer. Let me just add this colon over here. Right. So, what we are going to train our LLM with is if you get a schema like this and question like this, all you have to do is give me only the answer as the SQL query. So, don't explain it. Don't give me additional details or anything. Just give me this output like this. So, you can basically use this process for some other fine-tuning task that you are doing. So, it doesn't always have to be text to SQL or text to Python or anything, but this is the process that you can follow. So, now what we are basically doing is we are going to tell the model that if uh you get a schema and question pair like this, this is the style of output that I'm expecting. So, this is what we are going to teach the model. We are not trying to teach it how to talk. How to talk is something that the model already knows from his pre-training from the data that is present on the internet. So, we know that large language models are trained from the data that is present on the internet, right? So, there it learns language, there it knows how to speak. So, there it uh understands or basically it creates its own knowledge, right? In fine-tuning, we are basically telling it how to answer, how to answer in a different way for the specific task that we are interested in. So, that's what we are trying to teach the model. So, this is how my uh data set is like going to look like. Now, what we have to do is, as I've said earlier, this tiny llama model expects the input in this format. So, there has to be a system message, user tag and then assistant tag, right? So, now this data is now present as like in dictionaries. Maybe I'll also print it over here. I'll say train DS of zero. So, we have answer, question and context. So, this format of data has to be converted into this style of data, system, user and for answer we can have this assistant and then all that. So, this is what we are trying to convert over here. So, let me quickly copy and

Segment 12 (55:00 - 60:00)

paste this conversion here. Right. So, let's try to understand what is happening in this case. Now, I will send each row. Let's say this is the first data point I have. So, each data point will be present as a row in this train DS. If I print this train DS, right? So, it has a data set that has this many features. Maybe I'll say colon two. So, this is my first row. Uh answer, question, context. Let me check. So, we have a answer. Sorry. So, there is a answer and there is a question. The context. Hm. Okay. So, look at it. These answer, question and context, think about these as columns and this list as rows. So, how we have to look at it is the first row as this as answer and the corresponding question is the first question that you have over here and the corresponding values like context. Right now, basically it is in the form of a dictionary, but you have to look at in terms of tables, rows and columns. So, uh maybe I'll just like show it to you. So, you would have answer as one column. And then there would be question as other context. So, these are the three columns that you have. And the first row one would have, let's say, answer one and the corresponding question is present in Q1. And this context is present as C1 etc. So, similarly you would have like uh second data point that would be present like this. So, second question would have answer two. Context two and question two. So, this is the structure of this. As you can see, this represents answer one, the first row. The corresponding question is in this specific line and then the corresponding context is present. So, this is one row and this is the second row. Look at second value of answer, second value of question and the second value of context. So, this is kind of a table structure. Now, I'm taking each row. When I say one row, that means I'm taking this specific row, which is first answer, corresponding question and corresponding context. Now, add it to the system message, user message assistant the same way as we have done before. So, context is basically the schema that we have defined. Question is the question based on that schema, which is this. And finally, we have this answer, right? Answer would go in this assistant uh you know, type. So, we are creating this as uh the required structure and creating this column called as text. Now, I'm using this train DS dot map format example remove columns train DS columns column names. What I'm doing is this mapping function of format example now has to happen for all the rows that is present over here. So, now uh maybe I'll run this and show this to you. So, earlier we had this as present something like this, right? So, now this has been converted into this. So, system, user, uh assistant message etc. Maybe I'll just like print this train DS over here. Train DS of zero. So, now it doesn't have question, context, answer, rather it only has text as its column name. So, column name or a key name, you can think about it. So, previously you had answer, question and context, right? So, this we have converted into, if I show this as zero colon three, we have text one. So, this is my first text. Okay? So, this basically combines these three things and it has printed. And now this is the second text. And this is the third text. So all we are doing is we are taking the data that is in this format of answer, question, and context, convert it into system schema question, and the assistant response. So that is the conversion that we are doing over here. So if you look at this training trained years of zero, we have got this text. I'll try to put this inside a print statement that would be easier to visualize.

Segment 13 (60:00 - 65:00)

Okay. So we have this text and this has the system message. And then you have the schema user, etc. So now the model can understand better because it is the way that the model is expecting. So I'll remove this. This is like basically formatting of the data that we have. Now, this is the entire data processing technique. Now let's move on to the actual fine-tuning part. So I'll copy some content over here and say attach LoRa adapters. Right. So here I'll add some text. I'll explain what is mentioned over here. But first I'll explain about what this LoRa is. All right. So pre-training is basically the way in which we train this large language models on the data that is present on the internet, right? Internet has vast amount of data and we train on it. Now, I want to fine-tune it for a specific task in this case, which is text-to-SQL. But I don't want to do full fine-tuning. Full fine-tuning basically means it's like trying to update all the 1. 1 billion parameters that we have. We don't want that. We can achieve this by doing a partial fine-tuning. And this method is basically called as PEFT. So PEFT stands for parameter efficient fine-tuning. So this is what it stands for. Basically it says instead of fine-tuning the entire model, instead of fine-tuning all the parameters of the model, I'm going to add a slightly uh you know, smaller subset of parameters and only fine-tune that. So think about this pre-trained model as a textbook. Full fine-tuning is rewriting all the pages, whereas PEFT is just adding one additional page to it. So that's what PEFT is, okay? So instead of doing full fine-tuning, I'm going to add a smaller uh or let's say that I'm not just going to fine-tune this full model, I'm a part of it. The entire model is going to almost entire part of the stay the same. Only a small portion is going to be updated. So that is the idea. So PEFT is the idea or uh the concept or like the old paradigm, the way to do this PEFT is like one technique is what we call as LoRa. So LoRa stands for low-rank adaptation. So in this think about this LoRa as we have these parameters, right? To this parameter matrix, we are going to add a very small matrix and then uh update these parameter values alone. So that that's what this LoRa is. Again, the same concept of PEFT, but we the actual implementation is adding this LoRa. So you take this entire model that has 1. 1 billion parameter. Now instead of fine-tuning this entire parameter, let's say I'm just going to take a subset. Let's say I'm going to instead of fine-tuning or updating this 1. 1 billion parameter, I'm just going to update 2 million parameters, which is like a million is a much smaller number compared to this billion, right? So this is what LoRa does. Again, maybe in a later video I'll explain in a more detailed and depth way, but this is mainly focused on hands-on, so I'm trying to give you a overview. So just remember that we are going to add a smaller subset and fine-tune it. And then we have another method called as QLoRa. Nothing different. It is also the same concept as this LoRa, but here we would be doing this fine-tuning on a quantized version of the model. So quantized version as in like even instead of taking 16-bit models, maybe we would use a quantized model that has only like 4-bit precision and then fine-tune it. So in that case the model would be much smaller and the fine-tuning would require like much more smaller amount of data compared to this LoRa. So LoRa, as I said, low-rank adaptation, where we are adding like a small matrix and updating its values. QLoRa is like doing that on a quantized version of the model so that you can do that in a much more smaller amount of uh you know, compute power that you have. So this is the whole concept. Main takeaway, remember that PEFT, which stands for parameter efficient fine-tuning, is not about doing full fine-tuning of the model, just only let's say do this in a efficient and smaller way. LoRa is like adding that smaller matrix and doing on it. QLoRa is doing it on a quantized model. So that is the overall idea. So now it says we have like LoRa inserts low-rank trainable matrices, the same thing that we have discussed, into specific linear layers, attention projections, while freezing the base weights. For LLaMA family models, the

Segment 14 (65:00 - 70:00)

common target modules are Q projection, K projection, V projection, O projection, and optionally the MLP projections. So this QKV, as we have discussed earlier, are part of this attention layers. So in attention mechanism, we would have this Q, K, and V vectors that are basically created, which are kind of important uh part of this model as attention is the core kind of uh you know, uh mechanism that happens inside this large language model, which is mainly responsible for context understanding. So it's just going to update this Q, K, and V, as well as all the output projections, these four matrices, in a smaller amount. So this is what we are going to do in this LoRa method, okay? You can alter like other parameters and other parts of the model as well. So here we are going to mainly focus on this QKV and O projection. So that is the idea. So now I'll say here I'll add this specific piece of the code and I'll explain it. So from PEFT, so PEFT library we have installed in our environment setup code, LoRa config. So here we would provide some details about the exact configurations that we want to provide for this LoRa. I'll explain it when we are using this. And get PEFT model. So we would pass the model that we have loaded from Hugging Face, pass this LoRa config, and this would convert this into PEFT model that would later be updated. So it's basically that part of adding that additional adapters that we are using, LoRa. So we have this LoRa config and then we have this get PEFT model. And then these are some of the simple configurations that we are providing. So model. gradient_checkpointing_enable. So enable gradient checkpointing to save VRAM during backward pass. So backward pass stands for backward propagation. So now I'll explain this. So I've mentioned about this uh KV cache, right? About like in the top of the code where we have configured it. And this is like one reason for it as well. So here we have this use_cache as false and this is exactly why we have done this gradient checkpointing uh later. So now I'll explain what is the purpose of this. So model. gradient_checkpointing_enable. So when we train a model during the forward propagation, so the model learns the mistakes using this loss. And we calculate these gradients that tells the model how it should update the weights. There are like several gradients that are being calculated. So what happens is these gradients will be basically saved in memory uh mainly using this KV cache. But that's going to consume uh quite a amount of memory in your GPU. So now what we are saying is that I don't want to cache this gradient values. I'm just going to uh save only part of this uh let's say this checkpoints. Remaining I calculated when I need it. What happens because of this is as the gradients are not getting saved, uh we have like lesser consumption of memory. But there is a disadvantage that the training takes slightly more time. So this is like a trade-off that we have to make. So here we are saying that don't checkpoint the values as I don't have like a larger memory to use. I'll recalculate it when I need it. It's okay if I, you know, have to wait a bit more time for this training. So that is what this gradient checkpointing do does. And enable input require_grads. So basically it says that uh we have we will be freezing like majority of the weights in the model, right? So we have let's say 1. 1 billion model and we have said that only part of the parameters will be uh updated. Now it's just like letting PyTorch know that I have freezed majority of the model, but still I'm doing this fine-tuning. So there is some updation that is happening. So I need this gradient calculation. Otherwise, let's say PyTorch would think that we are not doing any gradient calculations or training. So that is the process. So first part it's saying that don't save the checkpoints because this could consume a lot of let's say GPU. And this is like saying I'm still like I still need gradients. Don't think that I'm not doing training just because I have frozen like majority of the weights. So these are like a simple way to explain this, but I said at some time let's also like go through this in a more detailed and depth way. So these are the two configurations. Now let's look at this important part. So LoRa config is r is equal to 16. So this r16 is like kind of the size of the matrix or the adapter that we are adding. So, usually we would use values like 8, 16, 32, 48, etc. And if you want a majority of the model to be updated in the fine-tuning, then you can use like a slightly larger value. Again, in that case, you also need like a slightly larger GPU size. So, that's mainly about the rank. So, think about this R value as the amount of change or fine-tuning that you have to do on your model. So, larger value means larger amount of parameters are going to be updated. And

Segment 15 (70:00 - 75:00)

then we have this LoRA alpha as value as 32. So, this is how strongly do you want to apply this transformations or this like updations. So, usually if you divide this alpha by R. So, in this case, alpha is 32, R is 16, right? So, 32 divided by this 16 is two. So, usually we would have the scaling factor as two, but it's not like a fixed thing. You can also increase the scaling factor. I'll just explain you in a very high-level overview of how this works. Let's say the base model at some point gives you a value as 10. Now, fine-tuned model or this LoRA updated model would give the value as let's say 12. Now, the difference is two, right? Now, this difference is then magnified by a factor of let's say two. So, that's what the scaling does. So, it's basically when we fine-tune the model, how strongly do you want to apply that changes that we have applied. So, that is the main difference. So, think about this R rank as the amount of updations that you have to do and alpha is the strength in which you want to you know, apply this updations to your fine-tuning to your model. LoRA dropout similar to the dropout layers that you have. So, some of the neurons won't fire basically like adding dropout so that we are not facing any overfitting issues. Bias none basically says that I don't want to update any bias value. So, we have weights and bias values, right? So, this is saying that don't update bias value. I just want to update the weight values. Task type causal LM. Causal LM is stands Causal LM means like next token prediction basically. So, when we say something like the sky is blue because I can send this as a prompt and let's say ChatGPT would like complete this sentence, right? So, this is an example for causal LM. And all these large language models follow this causal LM kind of technique. There are other types like sequence to sequence that is let's say a transformer-based model that has encoder and decoder. Like when you give let's say sentence one, it would be translated to sentence two. So, one sentence is getting one sequence is getting converted into sentence another sentence. Say for example, converting English text into French text. So, that is like a sequence to sequence example. And then you also have text classification task, etc. But for large language model, causal LM basically means next token prediction. So, that we are also providing over here. This tiny llama is again language large language model-based architecture. So, this also follows causal LM. And the target model. So, this we have mentioned here. So, we want to add this smaller matrices on this Q projection K V and O projection values. So, this is my LoRA config. So, we have provided the size to which we have to make this updation the strength, the dropout layers, etc. And then now call this get peft model method that we have imported. Pass the model that you have loaded and pass this LoRA config that we have. And now the model that you're going to get is the peft-based version that basically has this additional adapters that we have added to this Q projection K projection and all this projection. So, now to this base model, we have added some additional matrices which will be later updated. So, that is the idea. And then you can print this trainable parameters as well. So, let's see. So, now what it says is trainable parameters are 4. 5 million and all parameters in total there are about 1. 1 billion parameters. And out of that, I only want to you know, train 4. 5 million parameters that we have added. So, trainable is 0. 40798. Even less than 0. 5%, right? So, not this entire 1. 1 billion parameter size is updated. Only like a very small minor portion is updated, which is less than 5%, which is this 4 million out of this 1. 1 billion. So, this is the power of this peft-based and the LoRA-based process where instead of fine-tuning it on the entire data set, we are fine-tuning it on a very smaller portion. Okay? So, this is the configuration. Now, let's look at the actual training script. So, let's call this sixth step as training. So, I'll create a text cell over here. So, put this as train. Let's just make this bold and create this as a section. And then another text cell, paste this. So, here we will be using this SFT trainer. SFT stands for supervised fine-tuning. So, I hope you remember supervised machine learning. So, we know that supervised machine learning means we train the model with input as well as the output or in other words, the labels, right? So, it's similar to this. In the supervised fine-tuning, it is called as supervised fine-tuning because we train with the input, which is this let's say this schema and the questions are my input and the target label or the output that we are expecting is this assistant response output. So, basically when we get an input of this schema and this

Segment 16 (75:00 - 80:00)

question, we need output of this. So, this is basically training with input and output pair. So, that's why this is called as supervised fine-tuning approach. So, we are going to use this SFT trainer that we have from TRL. So, TRL is that again phase library that we have installed in the first step. And here I've provided some parameter values that we will be using. Maybe I'll quickly copy and paste this code. So, we have quite a high number of parameters that we have. Let's not dive deeper into this. So, instead what I'll do here is I'll create a text cell and add some text on what each of this like let's say individual parameters do. So, I'll just put this as SFT trainer or let's say SFT config explained. So, I'll just like create this text within the section. You can expand and save it. So, here I've basically explained what is the purpose of like each of the parameters that we have used over here. So, later just like go through this when you access this notebook. So, I'll just like skip this part alone. Uh 1 second. Yeah. So, it's pretty much it. So, all these parameters are explained over here. Now, let's see what is happening in this code. So, from this TRL library that is transformers and reinforcement learning, we are importing this SFT trainer, which is supervised fine-tuning trainer and SFT config. So, first we have to provide the configurations in this SFT config class. Later pass that to SFT trainer. So, first we are saying that the adapters or the checkpoints have to be stored in this output directory called dot slash tiny llama sequel LoRA. Uh sorry, the runtime got disconnected. Let me just connect this again. Okay. Right. So, what happens is once we fine-tune this, the adapters that we can later load. So, what we would do is later when we want to use this fine-tuned model, we would just like load this adapters with the actual model. So, for saving that for persistence, we have this dot slash tiny llama sequel LoRA. So, this would create a folder in this file section and there it will be saved. So, I'll show this to you once this fine-tuning has completed. So, all these informations have to be provided in this SFT config. So, we are creating this SFT_config call this class that we have over here. First, we are providing this output directory, which we have assigned over here. And then we have number of training epochs. How many epochs you have? What is the batch size, the learning rate, etc. So, all these are provided over here. So, as I said, so go through this. This will be in a section. So, you can just like expand the section and go through the purpose of each of this parameter when you get time. I'll just like skip this as again we have like several of these parameters here. So, first we assign or create this SFT config with each of our kind of hyper parameters. And you can just like see how you can update these values like how you can you know, increase this epochs. Here I've only taken one epoch for mainly for the time purpose of the LLM being like fine-tuned, but you can increase this to two, three, etc. as well. Similarly, you can increase the batch size and all that. Now, here we have this tokenize function. So, here we have a batch of text, right? So, we have about 2850 data points for train and 150 data points for evaluation. So, we are going to use this tokenize function to take one batch at a time and you know, do this tokenization process, which is basically getting the sentence, converting it into tokens and then token IDs. So, this batch of text is basically represents this text that we have created over here within this text key. So, that's what we have provided here. Batch of text truncation true. So, that means maximum length each data point can have is 512 is the length. So, we are not going to use padding. So, instead of adding padding tokens, we would just like kind of truncate it. So, all the sentences would be truncated to 512 values and return this tokenized output. So, now we have created a function. We didn't perform this tokenization yet. Now, this entire function is applied to all this data set, all the rows of this data set using this training dataset. map. So, we are passing this tokenize. So, what it means is for all the data that is present in my training data, apply this tokenize function in a batched process and remove this column name called as text so that you only now have the token IDs that you have created and that will be saved in this training token and evaluation token. So, this is like preparing the data basically, okay? And now we call this trainer. So, trainer is equal to SFT trainer, provide your model, the PEFT model that we have got after adding the adapters, and then arguments. Arguments, pass your SFT config that we have created over here, all the hyper parameters and the training configurations, and pass your training data set, which is your training tokens, and pass your evaluation data set, which is evaluation token. Processing class is basically a tokenizer. Now we call this trainer. train. So, this is where the actual fine-tuning actually happens. So

Segment 17 (80:00 - 85:00)

uh forward pass will happen, backward propagation will happen, and the gradients will update the weights. So, the entire process happens here. As I said, uh this happens only for one epoch, but feel free to increase this number of epochs, but as I said, it might take some time. So, this uh yeah, I think this might take like 8 minutes, I think, for this particular one epoch, but yeah, take your time. So, let me see the time that is given. So, it says 7:52, 7 minutes 52 seconds, closer to 8 minutes. So, you can also try increasing the epochs as well. So, let it train. In the meantime, I'll uh just like explain you the remaining part of the code. So, I'll uh after this I'll create another text cell called this as seven. Compare Let's assume that the model has completed fine-tuning. Now we have to compare the output that we get from this fine-tuned model with the output that we got from the base model before fine-tuning. So, here I'll say compare after fine-tuning. Right. So, I'll just make this bold and let's create a section here as well. So, we earlier had these three probes, right? Basically, three examples that we were testing, schema question, and we have seen that it didn't give the SQL queries, but also some of the explanation and extra text. So, we are going to uh pass the same data and get the output and compare the two cases, before and after cases. So, I'm going to pass this over here. So, here I'm saying model. config. use_cache is equal to true. So, if you remember, earlier we have said this model. config. use_cache is equal to false because, as I said, for training, we don't need this KV cache, the K and V of attention cache. Uh whereas here we are running this inference. Having this KV cache is basically means that the model is not going to recompute K and V uh vector values again and again. So, it's going to save a lot of time on the inference part. So, just think about this as a caching mechanism inside the model. So, here we are enabling this cache, previously that we have uh you know, disabled to save memory in the training part. And model. eval, this is also saying that I'm going to not train right now. I'm going to run this model for inference. So, that's what this puts this model on. So, instead of training mode, we are putting this on evaluation mode. So, basically it tells the model that uh you know, we don't want dropout to happen, we don't want any uh you know, weight updations, etc. So, those configurations specific for inference. After that, we iterate over the probes. Probes is that three set of examples that we had uh with schema and the question. So, we pass that to the build prompt. Now we generate the answer, and then uh generate would again have this model, right? But now this model is the trained model. So, we have this answer is equal to generate. So, if you just like go to this generate uh function Let me access this, right? Yeah. So, it would access this model, but we are also fine-tuning or loading this PEFT model in the same variable called as model. So, after this you wouldn't have that original model. If you want to look at the original values, you have to load it separately and check the output. But this is mainly for uh you know, saving the memory part. I don't want to model that again and again, that's why. So, here we are using the same model name, but you can also create this generate function in a way where you can also, let's say, pass a model parameter. So, uh let's say you have one version that is before a fine-tuning, after fine-tuning. Now, what you can do is uh call this generate function, but also provide whether you wanted on a before model or after fine-tuned model. That would be like another uh thing that you can try, but right now we are not providing that configuration on this function level. So, I'm just like building this prompt, generating this, but remember that this prompt uh would be from the fine-tuned version, and then we are printing it. So, before outputs are already saved in this base outputs, let me show this to you. So, we have iterated over this probes, right? So, here we have saved the previous output that we got. So, we are not going to recompute it again. So, to this base outputs, which is the one that you're seeing over here, to this we are going to compare the values that we are going to get after this. So, this is basically after fine-tuning. So, this is going to be compared with the base output. Base output means the output that we got from the base model. So, that will be compared here, and we will see whether the model has like improved or not. So, that is something that we would do. So, it says we still have 8 minutes and 20 seconds. Maybe let's wait for it. If it doesn't get completed, maybe I'll just like show you the sample outputs, that would be like easier. So, this is basically what you should see. I'll add a text over here. So, before we add some verbose, chatty, often wrong syntaxes, re-explain the schema. So, this was the kind of output, but now it will be like uh you know, proper uh SQL queries without any extra additional informations or sentences is

Segment 18 (85:00 - 90:00)

what we are expecting. So, let's verify that as well. So, let's say that we have fine-tuned our model, we have uh tested the model and compared its output as well. Now we have to save these adapters. So, I'll say save the adapters. You're not going to save the entire model, it's only the adapters that we will be saving, so it's not going to take a lot of storage for you. Maybe I'll just add it as a text as well. So, this is something to remember. We are not saving the entire model, okay? So, later when you want to use the fine-tuned version, you will load that model from Hugging Face Tiny Llama, and then add these adapters that are trained. Earlier we have added these adapters from this LoRA config, right? These are like untrained adapters. Later in this SFT, the adapters are like getting trained. Basically, the values are like getting updated. So, now we have to see how we can save this. — [clears throat] — You can see this uh Tiny Llama SQL LoRA that we have created as the output directory. So, here we can uh create this adapter directory, okay? So, uh Tiny Llama SQL LoRA adapter, and say model. save_pretrained_adapter_directory, uh tokenizer. save_pretrained. So, save like these two things, and we can also like uh print the output size and all those things as well. So, it should be like closer to 20 MB only, not like a larger size. So, later you can only like load this model alone. Uh and then this is how you would load this model again. So, let's say that you have fine-tuned your model and you have saved the adapters as well. And I'll just give a sample code of how you would load the model. So, I'll just put this in a text cell, but if you want to load it again, just copy and paste it in a cell code cell, okay? So, this is in a text cell right now. So, let's say we have saved this adapter here. Now in a different environment, in a different notebook, I have to load the fine-tuned version. Now import this auto model for causal LM, auto tokenizer, same way that we have did before, PEFT uh from import PEFT model. So, first load this base model, auto model for causal LM from pre-trained. So, pass your model ID, the data type, device, etc. Now call this model is equal to PEFT model from uh dot from pre-trained. Pass your base model that you have loaded. So, this is the difference. This is not the fine-tuned version. So, this we download a fresh model from Hugging Face, the Tiny Llama version, and now convert it into PEFT model. Pass your base and the adapters that we have saved in our local files. Similarly, uh we can just like get the tokenizer as well. So, this is the overall process. So, we are loading the model, adding the adapters, fine-tuning it, saving only the adapters, and later when we want this fine-tuned version, similarly we would load the base model, and then add these adapters that we have created. Similarly, do this for tokenizer as well. Then you can uh instead of using the base model, you can use this model. Uh base is the uh fine like the base version of this Tiny Llama. Model is the fine-tuned version. So, you can simply use that. So, this is the overall flow. Let me see if it has like completed. It does like 3 minutes like 43 seconds. Just like uh some amount of time is spending. In the meantime, I'll just like add these recap steps. I'll also explain this entire process of what we done in a short recap, so that it's easier for you to remember. Base model loaded in FP16, so it was about like 2. 2 GB of VRAM in our GPU. So, LoRA adapters uh attached. So, we have used RS16, and target is mainly that attention projection, K projection, V projection, Q projection, and then output projection. And we got about 4 million trainable parameters, and the overall parameter size was about 1. 1 billion. And trained one epoch on 3 3,000 about 2850 training SQL examples. This took about like 8 minutes on T4 GPU, and the peak VRAM basically means like what's the maximum amount of RAM used, so that also we can maybe like print and show. Maybe I'll just like add the code for it after the training has happened, so we can also print that. And we saved the adapter component alone, which would be like about 20-25 MB in size. And the we would see a visible difference in the output. So, earlier we saw some you know, additional text with the SQL queries and all that. Now, we would see like a clean SQL output after the model has been fine-tuned. So, this is about it and here I've also like added some examples on how you can take a next step for this or maybe like try this as well. And just I just wanted to add this particular note on accuracy component. Let me just like put this Uh the main thing that we have tried to achieve here is that the style and the structure in which we want the output, but still as this is a pretty smaller model, the accuracy might be low. So, the SQL queries might not be that accurate, but it's kind of like this code is scalable. You can simply instead of using a 1. 1 billion model, you can use a 7 billion version model, which is still smaller. It's not like as long as like a 400 billion version or a 70

Segment 19 (90:00 - 95:00)

hundred billion version. Still smaller, so usually we can take the 7 billion version and then fine-tune it with maybe slightly more amount of data and then you know, those things could kind of slightly increase the performance and the accuracy of the queries that you get. So, this I've added as a note on accuracy. Just keep this in mind. So, when you are doing this on your let's say organization, you have some task for which you have to fine-tune it, you can follow this exact code, but this code is scalable. By using like a different model and using like different slightly larger data set, this is like completely doable. Just go through this. So, this notebook is a demonstration of the LoRA workflow, not a production SQL model with a small 1. 1 billion base 3,000 example in one epoch. The fine-tuned model reliably learns the response style, which is clean SQL instead of chatty explanation, but the main thing is it is also not semantically correct. So, I've also added steps on how you can improve the performance. Performance as in getting more accurate SQL queries. So, style learning is what we are doing and not basically not task mastery, but we can do even more fine-tuning for that. To improve accuracy, the same workflow scales up. Use a larger base model, which can be a Qwen 2. 5 or 3 billion model or a 7 billion LLaMA model with QLoRA. And more provide more training data. So, we have about 78,000 example, but again we can use maybe 5,000, 6,000 or 10,000 examples. That would be slightly better and add like more epochs. So, that is like one way to improve the performance. So, these are some things that you can still keep on mind. Okay? So, that I just thought that I'll mention that as well. Uh Okay. Now, I think the model has completed the training. Let's look at it. So, we have totally let's say we have printed the logs at 50 steps. We got the training loss, uh evaluation loss, etc. So, basically the model has completed this trainer. train, which is the fine-tuning part. So, go through this parameter configurations that we have. Maybe now I'll print that peak memory used and some components. So, when you use this torch. cuda. max_memory_allocated, it says like what's the maximum amount of memory used for this entire process that we were doing. It says like 4. 10 GB, which is well under our 14 GB allocated size in this T4 GPU, so it's pretty good. So, now let's compare this after fine-tuning. So, let's run this. So, we get this output after LoRA and before LoRA. So, for this question, list the names of employees in this engineering department earning more than 100K, we have also sent the context as well. So, before it kind of gave a chatty response with this additional text, etc. But after LoRA, if you see, we got only the SQL queries in a proper structure instead of this line breaks and then extra text, etc. We get we got only the SQL queries. So, if for second question also, we are getting only SQL queries as well. So, this is how the style and structure change our fine-tuning is going to do, but it can also It's just like a simple example in the SQL queries, but for other similar tasks also, you can perform this. Now, we didn't change the system prompt. queries or anything. It's the exact same thing, but as we have trained it for this specific use case and specific task, it follows that same exact you know, rules much better. So, this is the overall idea. So, after this we got structured SQL queries in the same style that in which we have provided the data set as well. So, this process is done. We have this tiny LLaMA SQL LoRA. The training checkpoints are saved over here. Now, let's save this adapters. So, adapter size is 20. 67, so this would have created this. You can also like download this adapter directory and then when you are loading a new model as I said, you can first load the base model from Hugging Face and then put this through PEFT and just like add your adapters directory over here and that's going to load the fine-tuned model. So, that is the overall process that we have done and it we have successfully performed this fine-tuning, which follows this instruction of text to SQL conversion much better. So, I'll just go through this code again and I'll quickly give you a quick recap of whatever we have done so that it's easier to reinforce what you have learned. Right. So, first we have checked the availability of GPU, loaded a base model in FP16 version, inspected how the base model performed before fine-tuning for queries on this schema and questions and we have prepared a instruction data set basically that context, question and answer and then we have attached this LoRA adapters using PEFT and then we have inspected after training how the model has performed. So, we could clearly see the difference before and after fine-tuning and then we have saved the adapters and we have used this tiny LLaMA. So, we have like all the required libraries that are installed. Important ones are PEFT that is going to give you the LoRA configurations and then we have this TRL which has the SFTTrainer, which is supervised fine-tuning trainer. SentencePiece is put up of mainly for

Segment 20 (95:00 - 98:00)

internal requirements of the tokenizers. So, we have checked the GPU that is available for us, what is the memory and all that. And in the next step, second step, we have loaded the tiny LLaMA 1. 1 billion, a very small model. And even this small model was able to perform well after the fine-tuning that we have done. So, we have loaded the tokenizer and the purpose of this tokenizer is to split the text into smaller chunks or smaller pieces and then we have loaded the model as well. We have added some configurations for the caching, KV caching and all that. We have discussed about that. Then we have printed the total size, the memory footprint, etc. And the two important functions, one is to build the prompt. Here we get the schema and question from the user, added the system prompt to it and applied this chat prompt template so that model-specific special tokens are added to it. So, we have like added this chat template as well. And then we have tested this build prompt template and then we have this generate function, which is going to perform tokenization, pass this tokens to the model, got the outputs and decoded the token IDs back to the human-readable output. And then we finally saw that when you send a query with a schema, it gave the query, but we the structure and the style is not what we were expecting. Even though we clearly said that don't you know, provide additional information, reply with only the SQL query, it still kind of add some additional questions and all those things were still present. We have tested with this three probes or test prompts. Still it's the same. So, in the next step, we have loaded this smaller BMC2 data set. I mean, there were like 78,000 data points, but we have taken a smaller subset that is 3,000 data points. That could be well under our GPU limit. And then we have formatted this example to be in that same system user and assistant chat template. And then we have attached this LoRA adapter. So, we got this base model in inside this variable called as model and then we have added this LoRA config. That's going to add that additional LoRA adapters. We have said that this is the rank, which is the size of the adapters that you want to add and what's the strength it to. So, usually we would have a scaling factor of two. And then we have said that these are the target modules or the projections to which we want to add this adapters, mainly the attention-related projections. And yeah, so we have got this PEFT model and we have seen that out of this 1. 1 billion, we are going to only train like about uh uh percentage and percentage like less than half percentage, which is like 4 million is what we were about to train. And then we have used this SFTTrainer from TRL and then we have performed this trainer. train to fine-tune this model. And finally, we have seen that after fine-tuning, it was like more focused and it actually gave what we were looking for. So, that is the overall process and then we have saved the adapters and I've also added the code to tell you or let you know like how you can load this model later. Just maybe save this directory in your Google Colab or your local file. Later, you can simply load it along with the adapter. So, base model can be loaded. After that, you can load this adapter as well as tokenizers that we have saved in this LoRA adapters. That from that you can save it. So, this is the overall process. Always remember this note on accuracy that I have added and yeah, the overall recap. So, I hope you have understood whatever we have covered today. Please practice the code that we have done. So, that would be like super helpful. So, later when you have to work on a different fine-tuning task, you can use this code as reference. Just change what are the things that are necessary. You will be able to kind of like do that in a pretty good way. And if you have any doubts, you can let me know in the Q& A or the comment sections. I'll be happy to help you. So, let me know there. I hope you had a good time learning about this. I'll see you in a different topic. Thanks for watching.

Другие видео автора — Siddhardhan

Ctrl+V

Экстракт Знаний в Telegram

Экстракты и дистилляты из лучших YouTube-каналов — сразу после публикации.

Подписаться

Дайджест Экстрактов

Лучшие методички за неделю — каждый понедельник