# DL Project 11: Build an IMDB Sentiment Classifier with BERT + Gradio UI (Full Python Tutorial)

## Metadata

- **Channel:** Siddhardhan
- **YouTube:** https://www.youtube.com/watch?v=f5Zd8scYU40
- **Date:** 23.04.2026
- **Duration:** 1:22:01
- **Views:** 882
- **Source:** https://ekstraktznaniy.ru/video/50875

## Description

🤖 My end-to-end Machine Learning & Generative AI Course - Udemy: https://linktr.ee/siddhardhan

DL Project 11 — Fine-tune BERT on the IMDB movie reviews dataset and deploy it as an interactive Gradio web app. Full Python walkthrough using Hugging Face Transformers, Datasets, and the Trainer API. We cover tokenization, fine-tuning, evaluation (88% accuracy), inference, and building a shareable UI — all on a free Colab GPU.

🔗 Notebook / Code: https://drive.google.com/file/d/1JnxftYfn1hlqM2lVijvx2Lu7IggUntEI/view?usp=sharing

🔔 Subscribe for more AI Engineering projects

If this helped you, drop a 👍 and let me know in the comments what DL project you want next!

#BERT #HuggingFace #SentimentAnalysis #DeepLearning #Python #NLP #Gradio #MachineLearning #AIEngineering #DLProject

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

Hello everyone, I'm Siddharthan. If you are new here, I teach AI engineering, the practical kind. Today, we are doing something every ML engineer should know how to do, and that is fine-tuning BERT for text classification. We are going to take Google's BERT model, teach it to classify IMDb movie reviews as positive or negative, and walk through the entire Hugging Face code to do this. By the end, you will have a trained model hitting about 85 to 90% accuracy that is saved to disk and running in a shareable web UI that you can share with your friends or interviewer. I've put the full notebook link in the description. Hit subscribe if you want more videos like this, and let's jump in. And this is the finalized Gradio application that we will build. Let's say that here I'm providing this movie review. I can also type it in. Here I'm saying this movie was absolutely fantastic, brilliant acting, and a gripping story, and I can click on the submit button. And this will invoke the trained BERT model, and this will give the output as either positive or negative, and I will get a confidence score. So, this is the final application that we are building, and you will also get a shareable live HTTPS link. So, if you're going to an interview, you can share this link to the interviewer. So, instead of saying that I have built this sentiment classifier, you can directly send them this link, and they can check this out. So, this is what we are trying to build in this particular video. And before getting into the hands-on part, let me quickly show you my Udemy courses. So, right now I have two courses live on Udemy. One is complete generative AI course, and the other one is a complete machine learning course. So, in these courses, I have started from the very basics and all the way up to capstone projects. In generative AI course, we have concepts like prompt engineering, rag, AI agents, MCP, etc. 
And in machine learning, we have a Python basics, ML basics, intuition of different ML models, and we also build several machine learning projects and capstone projects. So, take a look at this in case you are interested in learning these courses. So, I'm also planning to add more courses like deep learning, ML ops, etc. So, please take a look at this in case this is useful to you. I'll give you the link of these courses in my video description. With that being said, let's get started with today's video that is building an IMDb sentiment classifier with BERT. For this particular use case of IMDb sentiment classification, we don't need any paid subscriptions, or we don't have to purchase any compute units in Google Colab. The T4 GPU that is available on free version of Google Colab is sufficient for us. So, the first thing that you can do is create a Colab notebook like this one, and go to this runtime tab, select this change runtime type, and select T4 GPU. Okay. I have purchased some compute units, so I can access this H100 A100 GPUs, but for you, it might be kind of grayed out. So, you can just select this T4 GPU that should be available on free tier of this Google Colab. So, select this one. And in this notebook, as I have mentioned earlier, we are going to fine-tune a BERT model on IMDb movie reviews dataset for binary sentiment classification, positive or negative. Basically, we are going to take a pre-trained BERT model and fine-tune it for this specific use case of IMDb movie review classification, where if I give a movie review as raw text, it's going to classify that as positive or negative along with some confidence score. So, this is the idea. The main takeaway here is that we are not following classical machine learning approaches of using some ML models or using a recurrent neural networks, the ones that we usually do. So, we are not going to do that. 
Instead, we are going to use this transformer based model called as BERT, and it's like super powerful, you know, compared to this previous models that we would use for NLP, mainly because of the attention mechanism that this has. I'll explain slightly more about this BERT model, what's its architecture, again, in a very high-level overview when we kind of use it in our code in the later part of this video. But right now, remember that we are going to fine-tune a pre-trained BERT model, which is a transformer based model for this particular binary classification problem. So, this is the use case. And these are the individual steps. First, we will install all the required libraries and set it up. And the next step is loading this IMDb dataset, and then we will look at this dataset, understand what are all the components of it. And in the next step, we will split this dataset into train, validation, and test dataset. And the next step is downloading the tokenizer for this BERT model. So, before sending this data to the actual BERT model, so this data has to be chopped down into smaller pieces called as tokens. For that purpose, we will download a tokenizer and then tokenize the dataset that we have, the text data that we have, and then we will load a pre-trained BERT model. And in the next step, this is where the actual training or fine-tuning happens. So, we will teach this BERT model how to do this specific task of movie review classification, and we will be specifically using Hugging Face library

### Segment 2 (05:00 - 10:00) [5:00]

and we will be using the Trainer class, which is a much easier way to build the classification version of the model compared to using raw PyTorch or TensorFlow. I'll touch upon that as well. So, we will learn fine-tuning with the Hugging Face Transformers library, then evaluate the fine-tuned model on a test dataset, and finally build a predictive system: a function to which you can pass new reviews and get the output as either positive or negative. Then we will save the fine-tuned model as artifacts and create a Gradio UI, just like the one we saw earlier, that passes a review to the model and shows the final output. So, this is what we are going to build in this video, and let's get started with the actual steps. The first step is installing the libraries. Here, I'll write `!pip install`, with the exclamation mark in front because this is a shell command run from a notebook cell. I'll add `-q`, which runs the install in quiet mode and suppresses messages like "Requirement already satisfied" and the rest of the installation log. The first library is `transformers`, the Hugging Face library. Then we have the `datasets` library, which is also part of Hugging Face; we will download the IMDb dataset through it. Then I'm installing `scikit-learn`, mainly for evaluation metrics such as the accuracy score and the precision score. Then I'll install `accelerate`, and finally `gradio`. We won't be using `accelerate` explicitly; it is required by the Transformers library for efficiently loading our model onto the GPU. And Gradio is for building our simple user interface. So, these are the libraries. 
Most of these libraries should be pre-installed in Google Colab. But if you are running this on your local machine, say you have an NVIDIA GPU and want to run this in Jupyter or VS Code, make sure that you have installed all the required libraries. So, this is our first step; let the installation run. The second step is imports and setup. Here let's import all the required dependencies. First I'll say `import random`, `import numpy as np`, and then `import torch`. This is the PyTorch library, which is used internally by Hugging Face. Then, from `datasets`, let's import `load_dataset`. This is similar to the dataset loaders in scikit-learn, from which we have loaded the breast cancer dataset or the Boston house price dataset. Similarly, `load_dataset` gives us access to the Hugging Face datasets. Then, from `transformers`, we are going to import multiple things, so for better structure I'll open a pair of parentheses and list all the names inside. The first thing we need is `AutoTokenizer`. The next is `AutoModelForSequenceClassification`; with this we will be downloading our BERT model. Then I'm going to import `TrainingArguments`. The fourth import is `Trainer`. I'll explain all these imports in a bit. And then we have `DataCollatorWithPadding`. Finally, I'm going to say `from sklearn.metrics import accuracy_score`, and I'm also going to import `f1_score`. These are the two metrics that I'm focusing on. So, let's run this. Now, as I said earlier, we are going to use Hugging Face, that is, the Transformers library, for fine-tuning this model. 
Now, PyTorch and TensorFlow are core deep learning libraries that we can use to build and train our models, where we would manually set up the training loop, etc. Hugging Face provides a wrapper that makes this much easier to use. So, it is a high-level library compared to PyTorch or TensorFlow, and it can use either of these two libraries internally. Here, where I'm fine-tuning this BERT model, it will be using only the PyTorch library internally.

### Segment 3 (10:00 - 15:00) [10:00]

Along with that, it gives us access to several open-source models, BERT models, datasets, tokenizers, etc. The reason I've chosen it is that I wanted to start simply, so that things are easier for you to understand. Later, we will definitely do this in raw PyTorch or TensorFlow, but right now we are focusing on Transformers. First, we have imported these utility libraries. We have `random`, which is used for random number generation. NumPy is also used internally, for example for weight initialization when the model is training. As I said, `torch` is also used internally by Hugging Face. The main reason we are importing these three is to set up the seeds, so I'll write that part of the code first so that you can understand it. After executing the import statements, I'm going to run this cell, where we set a seed of 42 and then call `random.seed`, `np.random.seed`, and `torch.manual_seed`; and if `torch.cuda` is available, that is, if a GPU is available, we also set the manual seed for CUDA with the same seed of 42. This is mainly for reproducibility. When we fine-tune or train this model, there are several places where values are generated at random. For example, your weights get initialized in a random way. So, if you run this notebook once and then run it again after some time from top to bottom, you won't get the exact results that you got earlier. For that reproducibility, we set up this seed. If you use the same number every time, you're going to get similar results. If I run this once with 20 and the next time with 42, I'm going to get different results; but if I use 20 both times, it's going to be reproducible. So, for that exact purpose, we need to set up the seeding part. 
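The seeding cell described above might look like the following minimal sketch. The helper name `set_seed` is an assumption for illustration, and so is the choice to skip NumPy or PyTorch when they are not installed; in the notebook both are imported unconditionally. `torch.cuda.manual_seed_all` is one way to seed CUDA; the transcript does not say which CUDA call the notebook uses.

```python
import random

SEED = 42  # the value used in the video; any fixed integer works


def set_seed(seed=SEED):
    """Seed every random number generator the training run may touch."""
    random.seed(seed)  # Python's built-in generator
    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's global generator
    except ImportError:
        pass  # assumption: tolerate a missing NumPy outside Colab
    try:
        import torch
        torch.manual_seed(seed)  # PyTorch CPU generator
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)  # all visible GPUs
    except ImportError:
        pass  # assumption: tolerate a missing PyTorch outside Colab


set_seed()
```

Calling `set_seed` again with the same value before a run resets the generators, which is why two top-to-bottom runs of the notebook can produce similar results.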
And the three places that can cause this randomness are the `random` library, the NumPy library, and the `torch` library. All of these are used internally, so make sure that you set the seed values for all of them for reproducibility. Now, the next imports are important. From `datasets`, we are importing `load_dataset`, which is mainly used to download our IMDb dataset from Hugging Face. You can also download it from Kaggle, but then you have to authenticate with the Kaggle API, etc., so I've chosen to use the Hugging Face dataset. And `transformers` is the Hugging Face library. From that, we are importing `AutoTokenizer`. As I said, we will be downloading the tokenizer for the BERT model using this `AutoTokenizer` class. Then `AutoModelForSequenceClassification`: with this, we will be downloading a BERT-based model for sequence classification, basically for text classification. ChatGPT, for example, uses GPT models, which are large language models of the kind we call causal language models. In case you want to download that kind of model, you would use `AutoModelForCausalLM` instead. You might have already seen that. Then we have the `TrainingArguments` import. Here we will provide the configuration for training: hyperparameters, etc. And this is the actual `Trainer`. We will call `trainer.train` to run the epochs and fine-tune the BERT model that we have. After `Trainer`, we have `DataCollatorWithPadding`. Before we pass the text data to the model for training, we have to add padding to it, and `DataCollatorWithPadding` adds this padding dynamically. I'll explain this in detail when we actually use it, so that it makes much more sense. Just remember that this is a preprocessing step that we do before passing the data to the model. 
And after this, we have, from `sklearn.metrics`, which you might have used already, `accuracy_score` and `f1_score`. Here we are importing `f1_score` as a check against imbalanced data: if the classes were imbalanced, accuracy alone could hide it, and the F1 score would reveal it. Along with these, you can also import `confusion_matrix`, `precision_score`, `recall_score`, etc., or `classification_report`. Okay? So, these are the imports that we need, and we have also set up the seeding for reproducibility. So, this is the initial step: first we installed all the libraries, and next we imported them. The third step is downloading the IMDb dataset, so I'll call it "load the IMDb dataset". Let me add a text note about this: we load the IMDb data directly from the Hugging Face dataset hub, so no Kaggle login is required. It comes pre-split into training data and test data, and each has 25,000 reviews. So, this is the content of this dataset.
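Returning to the two metrics imported above, a small worked example may help show why F1 is tracked next to accuracy. This is a sketch on hypothetical, deliberately imbalanced labels (the IMDb splits themselves are balanced), and the helpers `accuracy` and `f1_positive` are hand-written stand-ins for illustration, not scikit-learn's functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_positive(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: 90 negative reviews, 10 positive reviews.
y_true = [0] * 90 + [1] * 10
# A useless model that predicts "negative" for everything.
y_pred = [0] * 100

print("accuracy:", accuracy(y_true, y_pred))
print("f1:", f1_positive(y_true, y_pred))
```

The always-negative model still reaches high accuracy on this data while its F1 for the positive class collapses, which is the kind of problem accuracy alone would not show.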

### Segment 4 (15:00 - 20:00) [15:00]

So, I'll say `dataset` is equal to `load_dataset`, the function that we imported from the Hugging Face `datasets` library, and within it provide `imdb` in quotes. This will download the IMDb dataset, and you will see the download progress over here. It shouldn't take that long. Ignore the warning that you get about the HF token. This is a public dataset, so you don't need to provide any authentication. For some datasets, though, you have to make sure that you have the required permission; for those, we would get a Hugging Face token from our Hugging Face account and provide it. In this case, as this is a public dataset, we don't need it. Now I'll print this dataset and show you its contents. I'll say `print(dataset)`, where `dataset` is the variable that we assigned over here. It says that it's of type `DatasetDict`. This `DatasetDict`, which is basically a dictionary, has three components: train, test, and unsupervised. The train component has two columns: one is `text` and the other is `label`. In total, there are 25,000 rows. Let me give an example: the first row has review one, say "the movie was good" or something like that, in the first column, and its label, let's say positive, in the second. That is one row. Similarly, you have 25,000 rows, and the two columns are `text` and `label`. So, this is my training data. Similarly, I have a test dataset of 25,000 rows. And then there is the unsupervised split, which holds 50,000 unlabeled reviews and is meant for unsupervised tasks, in case we want it for some other purpose. But we are doing text classification, which is a supervised learning approach, so we will focus only on the train and test parts of the dataset. 
So, that's about it. We have successfully downloaded the dataset, and we have it in this `dataset` variable. Now, let's look at the data and understand the exact reviews and contents present in it. Here I'll say "look at a sample". Let's take one review and see what it has. I'll say `sample` is equal to `dataset[0]`. Oh, okay, sorry: first I have to access the training data, so it should be `dataset["train"][0]`. This accesses the 25,000 data points that we have and, from those, the first one. Then I'm going to print the label. `sample` is the first review, and we know that the column names are `text` and `label`, so access the label with `sample["label"]` and print it. Sorry, the key should be within double quotes. It says the label is zero. In this dataset, zero represents negative reviews and one represents positive reviews. I'll also print the exact review for which we got this label. I'll say the review is `sample["text"]`: just the way we accessed the label, we can access the text. This is going to be a long text, so I'll show at most the first 500 characters and then add a continuation mark; let's just put three dots there. Maybe I can also add a new line, `\n`, for better readability. So, we have the label as zero, and we know that zero is for negative and one is for positive. The review says, "I rented I'm Curious Yellow from my video store because of all the controversy that surrounded it when it was first released," etc. So, this is presumably a negative review, since we got the label zero. This is how the data is going to look. You have the text of the movie review. 
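The sample inspection described above can be sketched as follows. The function names `format_sample` and `load_first_train_sample` are hypothetical and only used for illustration; the `datasets` import is done inside the loader so that the formatting helper can be tried without downloading anything.

```python
def format_sample(sample, max_chars=500):
    """Return the label plus the first `max_chars` characters of the review."""
    preview = sample["text"][:max_chars]
    # As in the video, the three dots are appended unconditionally.
    return f"Label: {sample['label']}\nReview: {preview}..."


def load_first_train_sample():
    """Download IMDb from the Hugging Face hub and return the first training row."""
    from datasets import load_dataset  # imported lazily; requires `datasets`

    dataset = load_dataset("imdb")  # DatasetDict with train/test/unsupervised
    return dataset["train"][0]      # dict with "text" and "label" keys


# Stand-in row with the same two fields as an IMDb example.
example = {"text": "A gripping story with brilliant acting.", "label": 1}
print(format_sample(example))
```

In the notebook, `print(format_sample(load_first_train_sample()))` would show the label and the start of the first review, assuming the dataset can be downloaded.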
And then, you have the label for it, either zero or one. So, this is the data that we are working with. I'll add a text note here saying that each sample has two fields, `text` and `label`. The label is either zero or one, zero representing negative and one representing positive, and the text is the actual movie review that the user has written. So, this is the data that we have. I'm not going to spend a lot of time analyzing this data, understanding the data splits, and all that. Let's move on to creating a subset and fine-tuning, as that is the core idea of this video, but feel free to do some analysis and understand more about this data. So

### Segment 5 (20:00 - 25:00) [20:00]

here I'll skip that part. Here I'll say "take a small subset". We're not going to train on the entire 25,000 reviews, mainly to keep training fast. If you train on all 25,000 data points, it's going to take a long time, which is not ideal for this video. So, I'll take a subset, but you can increase the subset size and train on that. I'll say `train_size` is equal to 2,000 and `test_size` is 500. Basically, I'm not going to train with 25,000 reviews; I'm going to train with 2,000 reviews, and I'm going to take 500 data points as my test dataset. I'll call the training subset `small_train`. Now I can access the train dataset with `dataset["train"]` and call `.shuffle` with `seed` equal to `seed`. This is also a place where our data gets shuffled at random, and we want the same split every time; that's why we provide the seed, which is the value 42 that we set up earlier. There is nothing critical about using exactly 42. You can use a different number, but it's a common convention. After this, I call `.select` with `range(train_size)`. I'll explain what is happening in this code in a bit. So, we got the subset of the training data, and similarly we have to get one for the test data. Now I'm going to split off the validation data as well, so I'll say "split the training data into train plus validation". Here I'll say `split` is equal to `small_train.train_test_split`, and within that I can set `test_size` to 0.1 or 0.2, anything that you want, and then `seed` equal to `seed`. In the next step, I'm going to say `train_ds` is equal to `split["train"]` and `val_ds` is equal to `split["test"]`. Next, I'll print the lengths of these datasets, and then I'll explain what we have done in this code. 
So, this shouldn't be train size. This should be test_size. Okay. So, this is what we are expecting. So, I want to train with 2,000 data points. I have a validation data set that is 200 data points and a test data set of 500 data points. So, now let's understand what we are doing. So, the purpose of this training data is this is going to be passed on to the model when we are training it or in this case when we fine-tune it, right? And test data is used at the final step of this process. So, the model should not see this test data before that. It's only used for final evaluation. So, for that we have this 500 data points for evaluation. And now we have this validation data set in between. Now, let's say that we are training this model for 10 epochs. At each epoch, so epoch basically means in one epoch the model looks at the entire 2,000 data points that you have. After each epoch, we would evaluate we evaluate it with this validation data set. So, or let's say you want to do hyperparameter tuning as another process. So, in these use cases these different purposes as I said, one is evaluating after epoch or performing hyperparameter tuning, etc. We can use this validation data set for validating the model, understanding how the model is performing, whether the performance is increasing or not. This test data set should not be used anywhere during these processes. Test data set should be used only at the end of this finalized model. So, this is the reason we are splitting this into this kind of a split. Now, let's understand how we end up ended up with this split with this exact code. So, I have accessed this data set. I'm accessing this training data that has about 25,000 data set that we have seen over here, right? So, I'm accessing this particular subset. So, I have a training data set that has like 25,000 data points. So, I'm accessing the training data. I'm shuffling it and then selecting the first 2,000 rows from it. 
So, `select(range(train_size))`, with a train size of 2,000, selects the first 2,000 data points from the shuffled data. The reason we shuffle before that is that, in this 25,000-row dataset, there is a chance that the first 12,500 rows are all negative reviews

### Segment 6 (25:00 - 30:00) [25:00]

and the next 12,500 positive reviews. If we call `select` with a range without shuffling, there is a chance that all 2,000 data points would be negative reviews. We don't want that. So, first we shuffle the training data to make sure that the positive and negative reviews are mixed, and then select the first 2,000 data points. Similarly, we do that for the test data. Okay, here I have to select `test`: `small_test` is shuffled and selected from the test data that we have over here. This test data won't be looked at while the model is being trained. That is the idea. So, let's run this again. Now, once we have the 2,000 data points for training, we split them further. Right now I have training data of 2,000 data points and test data of 500 data points, and the training data has to be split further into training data and validation data. So, I'm taking `small_train`, the 2,000 data points that we had, and splitting it into two subsets. Here `test_size`, which is effectively the validation size, is 10%. 10% of 2,000 is 200, so those 200 data points are my validation data, which goes into `val_ds`. The remaining data is my training data, which is `train_ds`. Now I can print `train_ds` over here. So, this is the final size: 2,000 data points was my initial training data; from that, I took 90%, which is 1,800, as the final training data; and the validation data is 10% of 2,000, which is 200 data points. Finally, I have 500 test data points. Use the training data for training, the validation dataset for evaluating your model after each epoch, etc., and the test dataset for the final evaluation of the model. So, this is the split that we are working with. 
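The reasoning about shuffling before selecting, and the 1,800 / 200 arithmetic, can be checked with a small standard-library sketch. It works on a plain list of labels rather than a Hugging Face dataset, the helper names `shuffled_subset` and `train_val_split` are hypothetical, and the assumption that the 25,000 training rows are sorted with all negatives first is only the worst case the transcript warns about.

```python
import random

SEED = 42


def shuffled_subset(rows, size, seed=SEED):
    """Shuffle a copy of `rows` reproducibly, then keep the first `size` items."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled[:size]


def train_val_split(rows, val_fraction, seed=SEED):
    """Shuffle a copy of `rows` and split off `val_fraction` as validation data."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Assumption: labels sorted, 12,500 negatives (0) followed by 12,500 positives (1).
labels = [0] * 12_500 + [1] * 12_500

unshuffled = labels[:2_000]                   # select without shuffling
small_train = shuffled_subset(labels, 2_000)  # shuffle, then select
train_ds, val_ds = train_val_split(small_train, 0.1)

print("labels without shuffling:", set(unshuffled))
print("labels with shuffling:", set(small_train))
print("train / validation sizes:", len(train_ds), len(val_ds))
```

With the `datasets` library, the corresponding calls described in the transcript are `dataset["train"].shuffle(seed=seed).select(range(train_size))` for the subset and `small_train.train_test_split(test_size=0.1, seed=seed)` for the train/validation split.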
Now, once we have split the data, next is building the tokenization for our model. I'll create this sixth step and call it "tokenize the text". Here I'm going to provide the model name, or you can call it the model ID. I'll say `model_name` is equal to `bert-base-uncased`. You can refer to the Hugging Face documentation for the exact names. And I'll say `max_length` is equal to 256. This is the maximum number of tokens that we want in each review. You can increase the number of tokens, but that would increase the training time, so I'll choose 256 here. Now I'll create the tokenizer: `tokenizer` is equal to `AutoTokenizer`, and within that pass `model_name`. Oh, sorry, this should be `AutoTokenizer.from_pretrained`. This is going to download the tokenizer for the BERT base model. Before going further, let me show you how this tokenizer works. So, what is a tokenizer? Let's say that we have a movie review saying "The movie is unbelievable". This has to be passed to the model, either for training or during inference, to classify it as positive or negative. But the model cannot understand raw text; we have to convert it into a numerical format, and for that purpose we use a tokenizer. What the tokenizer does is split the sentence into smaller pieces: "the", "movie", "is", "unbelievable", and so on. Longer words that the model hasn't seen earlier can be split into smaller pieces. For example, "unbelievable" may be kept as one token or split into two or more smaller tokens. This processing is what a tokenizer does. For example, I'll call this tokenizer on this text. 
And print it. It's going to split this into tokens. We have 101 as the CLS token and 102 as the separator token. I made a video very recently about tokens and tokenization; go through that. We get `input_ids`, `token_type_ids`, `attention_mask`, etc. So, maybe I'll just show this.

### Segment 7 (30:00 - 35:00) [30:00]

`tokenizer.tokenize`. So, this is how the sentence is split, and you can see that the uppercase letter has been converted to lowercase because this is an uncased model. This is the process: first, it splits the sentence into smaller pieces, like the individual words that we are seeing, and then it converts these tokens into IDs. The word "the" is converted into token ID 1996. "Move" would have... okay, I have to say "movie", right? "Movie" has a token ID of 3185, "is" has 2003, and "unbelievable" has a token ID of 23653. The 101 and 102 are the additional tokens that get added, as I said: the CLS token and the separator token. This is what the tokenizer does, and later these token IDs are passed to the model and processed. So, the raw text is not passed; rather, the token IDs are fed to the model. Now, we also have `token_type_ids`, which is one of the additional outputs. We can also use this BERT base model for sentence similarity calculation: I would send one sentence, like "this movie is unbelievable", along with another sentence, like "this movie was bad", and try to get the similarity between the two. In that case, all the tokens of the first sentence are given an ID of zero, and the tokens of the second sentence are given an ID of one. Here we are working with only one sentence, as this is a classification task, so all the values are zero. Then we have the concept of padding, which I'll explain in a later section of this video. Actual tokens have an attention mask of one. Padding tokens are just placeholder tokens that make sure your sentences are of the same length. So, that is the idea. 
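To make the three tokenizer outputs concrete, here is a toy stand-in written in plain Python. It is not the real WordPiece tokenizer: the vocabulary contains only the IDs quoted in the transcript plus `[PAD]` = 0 (the padding ID in `bert-base-uncased`), it splits on whitespace, and the function name `toy_encode` and its `pad_to` parameter are hypothetical.

```python
# Toy vocabulary: special tokens plus the four words from the example sentence.
TOY_VOCAB = {
    "[PAD]": 0,
    "[CLS]": 101,
    "[SEP]": 102,
    "the": 1996,
    "movie": 3185,
    "is": 2003,
    "unbelievable": 23653,
}


def toy_encode(text, pad_to=None):
    """Mimic the shape of a BERT tokenizer's output for a single sentence."""
    # Uncased model: lowercase first, then wrap the words in [CLS] ... [SEP].
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    input_ids = [TOY_VOCAB[token] for token in tokens]  # KeyError on unknown words
    attention_mask = [1] * len(input_ids)               # 1 = real token
    if pad_to is not None and pad_to > len(input_ids):
        n_pad = pad_to - len(input_ids)
        input_ids += [TOY_VOCAB["[PAD]"]] * n_pad
        attention_mask += [0] * n_pad                   # 0 = padding
    token_type_ids = [0] * len(input_ids)               # single sentence: all zeros
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


print(toy_encode("The movie is unbelievable"))
print(toy_encode("The movie is unbelievable", pad_to=8))
```

The second call shows how padding positions receive an attention mask of zero while the real tokens keep a mask of one.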
So, basically, actual tokens get an attention mask of one, and padding tokens get zero. I'll explain this later. The input token IDs are the main output that the tokenizer produces; just keep that in mind. I'll remove this code for now; you can run it later to get a better understanding. So, we have downloaded our tokenizer, which is going to split the text. Now I'll create a reusable function for this, so that we don't repeat the tokenization code again and again, with an input parameter called `batch`. We have seen how you can take one movie review and convert it into tokens; here, we are creating a function that takes a batch of inputs and tokenizes them. It gets `batch` as input and returns `tokenizer` called on `batch["text"]`: I'm accessing the `text` component because the review text is in the column called `text`. Then I set `truncation` to `True`, and `max_length` equal to `max_length`. The next step is "tokenize all three sets". I'll say `train_tokenize` is equal to `train_ds.map`, pass the `tokenize` function, and say `batched` is equal to `True`. Oh, this should come out of the function, so I'll unindent it. `True` should have an uppercase T, and I'll remove these empty spaces. Similarly, let's create the tokenized validation data and the tokenized test data. Here the input should be our validation dataset, `val_ds`, and our test data was named `small_test`, right? You can also rename it to `test_ds` for consistency; I'll leave that to you. So, this one uses `small_test`. What I'm doing is this: we have created a `tokenize` function, which does the same thing that I showed earlier. 
It takes an input and converts it into token IDs, but we are doing it in a batched manner, okay? So, it's going to get the text and tokenize it. Truncation is true, and

### Segment 8 (35:00 - 40:00) [35:00]

max_length is 256. So, what I'm saying over here is this: let's say review one, after tokenization, has 500 tokens, and review two, the second data point, has about 300 tokens. We cannot work with this variable number of tokens in our input. So, here we are saying make sure that you truncate to 256 tokens. Again, this is configurable; you can increase this number as well. After this, every review has a maximum of 256 tokens. There is a chance that review three has only 200 tokens. This is where we would add, let's say, another 56 padding tokens, but that comes at a later step. I'll explain that later; for now, just remember the purpose of truncation and `max_length`. The reason is that some movie reviews can be pretty long, so we truncate them so that a review can have at most 256 tokens. It can have fewer tokens, and we will handle that later by adding padding tokens. So, that is the idea. Now we have created this function with truncation, `max_length`, etc. Just remember that if you are using a BERT model, `bert-base-uncased`, you have to use that exact model's tokenizer. That's why we are using this name. Later, when we download the model, we will use the exact same name, `bert-base-uncased`. So, we have provided the `max_length` and downloaded the tokenizer, as you can see over here, and we have created this function called `tokenize`, which takes a batch as input and converts it into token IDs, and now we are mapping it. What this does is take the training data and apply the `tokenize` function to all the rows in it, so 1,800 data points are going to go through this process of tokenization.
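The batched, truncating behavior described above can be sketched without any library. This is only an illustration: `toy_tokenize` is a hypothetical stand-in that splits on whitespace instead of using BERT's WordPiece vocabulary, and the three repeated-word "reviews" are made up to reproduce the 500, 300, and 200 token lengths from the explanation.

```python
MAX_LENGTH = 256  # same limit as in the walkthrough; configurable


def toy_tokenize(batch, max_length=MAX_LENGTH):
    """Stand-in for a batched tokenizer call with truncation enabled.

    Splits on whitespace (NOT WordPiece) and cuts each review to at most
    `max_length` tokens. Shorter reviews are left as they are; padding
    happens later, at batching time.
    """
    encoded = []
    for text in batch["text"]:
        tokens = text.lower().split()
        encoded.append(tokens[:max_length])
    return {"tokens": encoded}


# Three made-up reviews with 500, 300, and 200 whitespace tokens.
batch = {"text": ["good " * 500, "fine " * 300, "bad " * 200]}

lengths = [len(tokens) for tokens in toy_tokenize(batch)["tokens"]]
print(lengths)  # [256, 256, 200]
```

One difference worth keeping in mind: with the real tokenizer, the special tokens count toward `max_length`, so slightly fewer than 256 word pieces of the review itself survive truncation.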
With `batched=True`, we are doing this as a batch-wise process so that it happens quickly compared to one-by-one tokenization. Similarly, this happens for the validation and test data sets as well. So, we have the tokenized train, validation, and test sets. Make sure that you provide `train_ds`, `val_ds`, and `small_test` correctly. And now, let me run this. I'll run the data splitting cell first and then do the tokenization again, to make sure that I'm not messing things up. Right. So, the three mappings are done. Okay. So, overall, we downloaded the data set, split it into training, validation, and test data, and then tokenized it, which is the pre-processing step that we have to do when we are working with NLP or transformer-based models. The next step, the seventh, is to load the pre-trained BERT model for classification. Here, let me add some text about it. Right. I'll give you a very high-level overview of this BERT model. Later, we will discuss the transformer architecture, encoder-only models, decoder-only models, etc. But let's focus on BERT now. So, here I have said `AutoModelForSequenceClassification`, which we imported earlier from the Hugging Face `transformers` library. Using it, we are going to download the BERT base model, and on top of that, we are going to add a small classification head with `num_labels` as two. That means it takes the movie reviews as input and gives two labels as output, which we will later convert into positive or negative. So, basically, to this base model, we are adding a classification head for predicting positive or negative. The BERT body starts pre-trained.
That means BERT is already pre-trained on, let's say, Wikipedia data or other text that is available on the internet. The classification head starts random. So, there is a pre-trained part in the final model, which is the base body of the BERT model, and on top of that we are adding a classification head. The body is trained, but the classification head is not. During pre-training, the base BERT model learns grammar, language, and vocabulary. The classification head just teaches it how to predict positive or negative. So, one part is trained and the other part is fresh right now, and fine-tuning aligns the two so that together they work as a text classification model. So, this is the overall idea of the fine-tuning that we are going to do. Now, what is RNN and what is a transformer

### Segment 9 (40:00 - 45:00) [40:00]

based model? So, RNNs take the input sequentially and process it one token at a time, relying on memory. It remembers the first token that came up and then processes the next token, and this goes on sequentially. So, parallelization cannot happen, training takes longer, and it is not as accurate as a transformer-based model. Now, in 2017, Google, as you know, published the "Attention Is All You Need" paper. It's a pretty famous paper that introduced the transformer model. Initially, the transformer was used for sequence-to-sequence problems, for example translation, where you translate from English to French. For this translation process, they proposed the transformer architecture, which has an encoder and a decoder. Okay? So, these are its two components. Later, people started using only the encoder component, and BERT is an example of this. BERT stands for Bidirectional Encoder Representations from Transformers. So, it is an encoder-only model. The large language models that we use right now, the ones that power ChatGPT, Gemini, and Llama, are all decoder-based models. Encoder-based models can be used for text classification, named entity recognition, etc., whereas decoder-based models are used for next token prediction. So, basically, once you give a text, the model predicts the next probable token. The core idea to remember for this video is that the transformer architecture initially had an encoder plus a decoder, which was used for sequence-to-sequence conversion, like translation, etc.
After that, we had encoder-only models, and BERT is an example of this. Similarly, we had decoder-only models, and examples are the GPT family of models, the Llama family of models, Gemini, etc. So, these are the different categories of model. Encoder-only models, as I said, can be used for text classification. You can also provide two sentences and get a similarity score, or more generally understand how they are related. Then there is named entity recognition: given a paragraph, identify which words represent the name of a person, the age of a person, etc. Decoder-only models, on the other hand, are used for next token prediction, as I said. So, this is the family of models that came up after the initial transformer paper. The main component in all of them is the attention mechanism: if you have 10 tokens in a sentence, attention is the way each token works out which other tokens are important for it. Let's take a sample review, the one I think I've used before: the movie was fantastic. The attention mechanism allows the model to see that "movie" has the context of "fantastic". In other words, it tells the model that the word "fantastic" is talking about the movie. This context sharing is enabled by the attention mechanism in transformer models. In a later video, I'll dive deeper into exactly how this mechanism works. Right now, just get a high-level overview: BERT is an encoder-only model based on the transformer, which originally had an encoder and a decoder. Encoder-only models have specific use cases like text classification. Decoder-only models have this next token prediction.
Whereas an encoder plus decoder model can be used for sequence-to-sequence problems like translation. Just remember that for now; that is sufficient. As I said earlier, we have a pre-trained BERT model. To this, we are going to add a classification head and fine-tune the entire model. So, this is the overall idea. Pre-training means that the BERT model has already been trained on a huge data set, let's say millions of data points, to understand language. Now, we are just teaching it how to classify a text as positive or negative. So, fine-tuning is basically a type of transfer learning. I hope that you already know what transfer learning is. So, just remember that. Okay? With that basic understanding of the model, let's see how exactly we do the fine-tuning part. Just the way we downloaded the tokenizer by providing the model name, instead of the auto tokenizer, I have to say auto model for

### Segment 10 (45:00 - 50:00) [45:00]

text classification. So, I'll say `model` is equal to `AutoModelForSequenceClassification`. Sequence classification because we are going to use it for a text classification problem. Similarly, there are other classes that download models for other use cases. Then call `.from_pretrained` and provide the model name that we set earlier, which is `bert-base-uncased`, the base version of the BERT model. Okay. So, let me delete this. We can also provide another parameter over here: the first is the model name, and the second is the number of labels, basically how many labels you want to predict. This adds a classification head that gives predictions for two labels, whether a review belongs to class one or class two, which is positive or negative. As you can see, the download is only about 440 MB, and I think the model has about 110 million parameters. So, it's not a billion-scale model; it has only about 110 million parameters. Okay? So, we have downloaded the model. To recap: we downloaded the tokenizer, created a function to tokenize, created the tokenized versions of the data, and in the next step downloaded the model as well. Now we have to do the training, but before that we have to set a few things up. So, here I'm going to create a text cell. I'll call it metrics and data collator. Right. I'll also add some text saying what exactly we are doing. First, I'm going to define a function called `compute_metrics` that takes `eval_pred`, basically the data that we are going to evaluate with. Then say `logits, labels = eval_pred`. Now, I'll say `preds`, which stands for predictions, is `np.argmax`, with the logits and `axis=1`. And then, I can return two things.
One is my accuracy score. So, I'll say accuracy: we use the `accuracy_score` that we imported and provide the true labels and the predicted labels. Next, similarly, I'm going to provide my F1. So, I'll say F1, use `f1_score`, with the labels and then the predictions. And then, the data collator. Maybe we don't have to do this together; I'll create a separate cell for it. So, I'll say `data_collator` is equal to `DataCollatorWithPadding`, which we imported earlier, and for `tokenizer`, I'm going to pass my tokenizer. Right. So, these are the two things that we need. Let me run this. Now let's understand this. First, we have the `compute_metrics` function. Let's say we are training this model for five epochs. In training, the data goes through the model during forward propagation, the loss is computed, and then backward propagation computes the gradients that are used to update the weights of the model. This is just the way any deep learning model is trained. Just keep that in mind. Okay? So, we have forward propagation and backward propagation for every batch, and one epoch is complete when the model has gone through the entire training data once, which in this case is 1,800 training data points. After this, we want to know the accuracy of the model and the F1 score of the model. So, here we are providing a function that can be used while the model is being trained. Basically, we are saying: at the end of each epoch, give me the accuracy score and the F1 score. That's what we are trying to get over here. And what happens is, once the first

### Segment 11 (50:00 - 55:00) [50:00]

epoch has been completed, the training loop sends the validation data, the 200 data points that we split off over here, to the model and gets the predictions. These predictions are then passed on to `compute_metrics`, okay? They come with two components: the logits and the labels. Now, what are these logits and labels? One second. Right. So, let's say the first epoch has been completed, and we send review one, saying that the movie was fantastic. For this review, we know that the true label is, let's say, positive, because it's present in the labeled data, as this is a supervised learning approach, right? So, this is the true label. Now, the prediction won't come as a label; instead, it comes in the form of logits. Logits are the raw scores. It's like saying the positive label has a score of, let's say, 5.3, and negative has a score of 3.2. What this means is that the model gives positive the highest score for this review and negative a lower score. It's just the model's way of saying: I think the review has more chance of being positive and less chance of being negative. So, the model gives these scores as its predictions, okay? It won't come as positive or negative, zero or one. Whereas in the data, let's say we have given this label as one, which stands for positive, okay? So, this is the output of the model. Now, we look at these scores and convert them into zero or one using `np.argmax`. If the value is higher for positive, argmax converts the logits to one; otherwise it will be zero. Okay? So, basically, the final prediction that you get is either zero or one. If the logit for the positive class is higher, which in this case is 5.3, right? You will get the prediction as one.
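The logits-to-labels step and the two metrics can be sketched in plain Python. This is not the notebook's code, which uses `np.argmax` with scikit-learn's `accuracy_score` and `f1_score`; the helper functions and the four rows of logits below are made up for illustration, with column 0 as negative and column 1 as positive. Only the first row, 3.2 versus 5.3, comes from the explanation above.

```python
def argmax_rows(logits):
    """Index of the largest raw score in each row (0 = negative, 1 = positive)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(labels, preds):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)


def f1_binary(labels, preds, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def compute_metrics(eval_pred):
    """Same shape as the function in the walkthrough: (logits, labels) in, dict out."""
    logits, labels = eval_pred
    preds = argmax_rows(logits)
    return {"accuracy": accuracy(labels, preds), "f1": f1_binary(labels, preds)}


# Made-up validation outputs: [negative score, positive score] per review.
logits = [[3.2, 5.3], [4.0, 1.0], [0.5, 2.5], [2.0, 1.5]]
labels = [1, 0, 0, 1]

print(argmax_rows(logits))                # [1, 0, 1, 0]
print(compute_metrics((logits, labels)))  # {'accuracy': 0.5, 'f1': 0.5}
```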
If the logit, or the raw score, for negative is higher, then you get the label as zero. So, we are simply converting the predicted logits into labels that we can compare with the true labels. That is the idea. We have the evaluation prediction, which holds the true labels as well as the logits given by the model. From that, we take the logits, convert them into predictions, and then pass them to the accuracy score or the F1 score, just the way we would for any machine learning model. So, that is the core idea over here. Get the logits, convert them into predictions, compare them with the true labels, and then compute your accuracy score or F1 score. We are going to use these two; along with them, you can also use precision, recall, and other metrics. So, that is `compute_metrics`: which metrics do we want to focus on? In this case, accuracy and F1 score. Next, we have the data collator. Let's say R1 represents review one. Initially, it had about 500 tokens after tokenization, but we have set truncation with a maximum length of 256, so review one is cut to 256 tokens, right? Review two originally had, let's say, 300 tokens, and that has also been truncated to 256. Now, let's say there is a review three, which has only 200 tokens. The max length is only an upper limit that we have set over here, right? 256. If a review has fewer tokens, the tokenizer is not going to do anything. That's where `DataCollatorWithPadding` comes in. Here we have also provided the tokenizer, so it knows which padding token to use. Now, for R3, the data collator is going to add padding tokens. So, it would add something like this.
These padding tokens are basically ignored by the model when it is getting trained, but we add them for a reason: we would add 56 padding tokens over here, and now even review three has 256 tokens. So, this is the purpose of padding, okay? Just keep this in mind. Now, `DataCollatorWithPadding` does something more efficient than this, in a dynamic way. We have a concept called batch size. Let's say the batch size is eight. That means the model looks at eight reviews at a time. For easier understanding, let me put the batch size as three. That means the model looks at three reviews at a time. So, that means in each batch, we pass three reviews at a

### Segment 12 (55:00 - 60:00) [55:00]

time. Now, review one has a size of 256 after truncation. Let's assume that we have already truncated it, so we have 256 as the size. The second also had a larger size, 400 or something, and we have truncated it. And R3 has a size of, let's say, 200. To that we add 56 padding tokens, and it also comes to 256, right? Now, let's say there is a second batch where the first review has 150 tokens, the second has 120 tokens, and the third has 100 tokens. Okay? So, all of these are less than 256. To take the first review up to the 256 max length, I would have to add 106 padding tokens. Instead of doing that, I leave the first review at 150, add 30 padding tokens to the second, and add 50 padding tokens to the third. So, now all three reviews in this batch have 150 tokens. The core idea is: don't strictly focus on creating 256 tokens every time, because the requirement is only that every review in a batch has the same number of tokens. If the maximum in a batch is 150, I don't have to add an extra 106. So, the target is always the longest review in that particular batch, and the remaining reviews get padding tokens to match it, okay? So, this is the core idea. If, let's say, there is a case where one review in the batch has a size of 500, then we would truncate it to 256, and the remaining reviews would get more padding so that they also reach 256 tokens. So, we would add 136 tokens to the one with 120, and 156 tokens to the one with 100, so that in total each has 256 tokens. So, the core idea is that each batch should have the same number of tokens internally, but it's not important that this number is 256 or 512. `DataCollatorWithPadding` does this dynamically, so the padded length varies from batch to batch. The requirement is that a single batch shouldn't have varying sizes, but different batches can have different sizes.
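The two batches from this explanation can be reproduced with a small pure-Python sketch of dynamic padding. It illustrates the idea and is not the library's implementation: `pad_batch` and `fake_review` are hypothetical helpers, the token ID 7 is an arbitrary placeholder, and 0 is the `[PAD]` ID in the bert-base-uncased vocabulary.

```python
PAD_ID = 0  # [PAD] in the bert-base-uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one in THIS batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def fake_review(n_tokens):
    """Hypothetical already-truncated review: n_tokens copies of a dummy ID."""
    return [7] * n_tokens


# Batch 1: two truncated reviews (256 each) and one short review (200).
batch_1 = pad_batch([fake_review(256), fake_review(256), fake_review(200)])
# Batch 2: all reviews are short, so the batch is only padded up to 150.
batch_2 = pad_batch([fake_review(150), fake_review(120), fake_review(100)])

print([len(row) for row in batch_1["input_ids"]])  # [256, 256, 256]
print([len(row) for row in batch_2["input_ids"]])  # [150, 150, 150]

# Padding added per review in batch 2: 0, 30, and 50 tokens.
print([row.count(0) for row in batch_2["attention_mask"]])  # [0, 30, 50]
```

Padding batch 2 to 150 instead of 256 means the model processes 450 token positions for that batch rather than 768, which is where the efficiency gain comes from.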
And this is what `DataCollatorWithPadding` does. So, instead of padding everything to 256 tokens, it targets the maximum number of tokens present in that particular batch. I hope this is clear for you. If it's not, not a problem at all. Please let me know in the comments, and I'll give you a detailed written explanation in the comments or the Q&A section. So, these are the next two things that we have set up. We have defined `compute_metrics`, and we have also assigned the data collator. Now we can set up the trainer for our model. So, here I'll call this 9. Set up trainer. This is the part where we actually start the fine-tuning. For this, I'll quickly copy and paste the code, as we have several parameters to focus on. Right. Again, this is also the reason we are going with Hugging Face: the entire training loop is much easier to implement compared to, let's say, plain PyTorch or TensorFlow. Even that is important, and we will look at it later. Now, we have two things. We are using `TrainingArguments`, which we imported from Hugging Face, and the `Trainer` that we imported. Hugging Face automatically runs the training on the GPU if one is available, so we don't have to worry about which device the training happens on, CPU or GPU. But still, you can run this bit of code earlier. I'll just add it over here. To check whether you have access to a GPU or not, you can run this code: we are saying `device` is equal to `"cuda"` if `torch.cuda.is_available()`, otherwise `"cpu"`. This is basically checking whether you have an NVIDIA GPU. If CUDA is available, that means you have an NVIDIA GPU; otherwise the device is the CPU. Later, we will use that.
In PyTorch, we would manually move the data and the model to the GPU, but that's not required with the Hugging Face Trainer, as it does it automatically. Okay. With that being said, we have two components. One is `TrainingArguments`, and the other is `Trainer`. The training arguments are the configuration for our training, the hyperparameters that we set. The trainer is the actual training loop. So, to this trainer, as you can see, we are passing the training arguments, these configurations. Now, let us understand them. The first parameter that we are providing is the output directory: basically, where the checkpoints have to be saved during fine-tuning. Here, we are saying ./bert IMDB

### Segment 13 (60:00 - 65:00) [1:00:00]

output. This is where I want to save the training-related data. Then, I have `eval_strategy`, the evaluation strategy, set to epoch. That means after each epoch, run the evaluation. I'll show you exactly where we pass the evaluation function. So, basically, we are saying: after each epoch, do the specific evaluation that we will pass later. Learning rate is 2e-5, that is, 2 times 10 to the power minus five. The learning rate is basically how much you want to change your parameters on each update after forward and backward propagation, the magnitude of change that you want. Then `per_device_train_batch_size`. Per device, because for larger models, let's say larger LLMs, you would do the training on 10 GPUs, 100 GPUs. I mean, I'm talking about OpenAI or Meta scale, and they have far more than 100 GPUs, right? Here, we are going with just one GPU. This setting says what the batch size is for one GPU. So, just forget about the per device part, as we are working with only one GPU. Here, I'm saying that the training batch size is eight. That means it processes eight reviews at a time. Similarly, for evaluation also, it's eight data points at a time. Now, you have the number of training epochs. I've given this as two, but you can increase it if you want your model to be trained for more epochs. It's basically how many times the model should look at the training data. Increasing the epochs sometimes gives you higher performance, but make sure that your model is not overfitting. And then, we have weight decay, 0.01. This is a regularization technique that, again, helps make sure your model is not overfitting. What it does is penalize the model for having large weights.
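As a rough illustration of how the learning rate and weight decay interact, here is a simplified sketch of one update with decoupled weight decay, the style used by AdamW (the Trainer's default optimizer). The function `decoupled_weight_decay_step` is hypothetical, and `grad_update` stands in for whatever step the optimizer computes from the gradients; the real AdamW update also rescales gradients with running averages, which is left out here.

```python
def decoupled_weight_decay_step(weight, grad_update, lr=2e-5, weight_decay=0.01):
    """One simplified parameter update with decoupled weight decay.

    Assumption: `grad_update` is the optimizer's gradient-based step
    direction. The decay term shrinks the weight toward zero in
    proportion to its own size, independent of the gradient.
    """
    shrunk = weight * (1.0 - lr * weight_decay)  # penalize large weights
    return shrunk - lr * grad_update             # then apply the usual step


# With no gradient signal, each step only shrinks the weight slightly.
large = decoupled_weight_decay_step(10.0, grad_update=0.0)
small = decoupled_weight_decay_step(0.1, grad_update=0.0)

print(10.0 - large)  # shrinkage of the large weight
print(0.1 - small)   # shrinkage of the small weight (100x smaller)
```

With these settings each step shrinks a weight by a factor of 1 - 2e-7, so the pull is gentle, but it is proportionally stronger in absolute terms for large weights, which is the penalizing effect described above.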
So, that's what this weight decay does. And then, we have FP16, floating point 16, or 16-bit values. So, we reduce the precision. By default, the precision would be 32-bit values. Here, the training can happen in floating point 16, which GPUs are optimized for, not CPUs. So, here we are passing `torch.cuda.is_available()`, which gives either true or false. If a GPU is available, it gives true. If no GPU is available, it gives false. In other words, we are saying: use FP16 precision if you have a GPU, and if you don't have a GPU, don't use FP16. That is the idea, okay? Usually, we use this reduced precision only on a GPU, because CPUs are not optimized for it. `report_to` is none. We could send the training logs to TensorBoard or other visualization platforms, but here we don't want to do that, so I've provided this as none. For the seed, we have provided 42 to make the run reproducible. Logging steps is how often, once in how many steps, you want to log your results. So, these are the parameters in my training arguments. Now, we create the actual trainer. We create this trainer and provide the model, which is basically `bert-base-uncased` along with the classification head that we added with `num_labels` as two, right? So, we have provided that model, and then the training arguments that we created over here. We have provided the training data set and the evaluation data set. For evaluation, we are providing our validation data. Don't give your test data here; give your validation data set. We will evaluate on the test data at the very end. For `processing_class`, we pass our tokenizer, since the preprocessing happens with this tokenizer. We also pass the data collator, which, as we discussed, does the dynamic padding.
Then `compute_metrics`: this is exactly where we pass the metrics function that we defined over here. So, at the end of each epoch, it does this specific computation. That is the idea. So, now, I'll run this. We have configured our training and provided all these details. The next step is training, or to be more precise, fine-tuning the model, right? So, that is my 10th step, fine-tuning the model. And again, two epochs shouldn't take long. It should be three to five minutes on this T4 GPU. So, now I'll call the trainer that we created over here: `trainer.train()`, with open and close parentheses, and run this. As it runs, at the end of each epoch, we can look at the computed metrics. Maybe I'll pause the video now and continue once the training is done. So, the training process is now complete, and it actually took much less time than I expected. So, we can definitely increase the number of epochs and the amount of data that we provide for training. And these are the results, and they are pretty good. After the first epoch, we have an accuracy of 87.5 and an F1 score of 87.0. Similarly, in epoch

### Segment 14 (65:00 - 70:00) [1:05:00]

two, we had about 89% accuracy. So, this is still pretty good. Okay, and again, these are the validation metrics that we are looking at. Now, let's evaluate on the final test data set that we have, which is the 500 test data points that we split off earlier. So, I'll create this 11th part as evaluate on test set. For this evaluation, we can say `final_results` is equal to `trainer.evaluate`, and here I can say `eval_dataset` is equal to the tokenized version of the test data, `test_tokenized`, which we created earlier. And then, we can print the final results. We get the metrics in the `final_results` variable that we have created, and we can simply print it this way. So, it's making the predictions, 63 steps, and we got the results: evaluation loss, evaluation accuracy, which is also about 89%, and 88.93 as our evaluation F1 score, etc. I'll use this for loop to print all these things individually, line by line, so that it looks better over here. It's the same information; the loop just iterates over each key and value and prints it. So, for an individual item, let's say eval loss and its value, eval loss is my key, K, and the value is my V. And then, we are printing it with some float formatting so that it looks better. Nothing critical here. Right, so we got our eval loss, eval accuracy, an F1 score close to 90%, etc. As I said, increase the epochs and increase the amount of data in training and evaluation. That might be a good exercise. Now, we still have a few things to do. We can build a prediction function where we provide new reviews and get the output, basically the labels, for them. And then, we can save the model and build a Gradio UI.
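The 63 evaluation steps reported above follow directly from the data set sizes and the batch size of eight. Here is a short sketch of that arithmetic, assuming a single device, no gradient accumulation, and that the last partial batch is kept rather than dropped; `steps_per_pass` is a hypothetical helper name.

```python
import math


def steps_per_pass(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


BATCH_SIZE = 8   # per-device batch size from the training arguments
NUM_EPOCHS = 2

train_steps_per_epoch = steps_per_pass(1800, BATCH_SIZE)  # 1,800 training reviews
total_train_steps = train_steps_per_epoch * NUM_EPOCHS
val_steps = steps_per_pass(200, BATCH_SIZE)               # 200 validation reviews
test_steps = steps_per_pass(500, BATCH_SIZE)              # 500 test reviews

print(train_steps_per_epoch)  # 225
print(total_train_steps)      # 450
print(val_steps)              # 25
print(test_steps)             # 63 (62 full batches plus one batch of 4)
```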
So, for this, I'll quickly do a copy-paste, as this code is pretty simple. This is predict sentiment on new reviews, our 12th part. It's basically a helper function, and we are also going to return a confidence value. Confidence is nothing but the prediction probability. So, I'll print it. Right. What we are doing is this: I have three test reviews. "This movie was absolutely fantastic, brilliant acting, and a gripping story." This should be a positive review. Then we have "I hated this film, it was boring, too long", etc. So, it's a negative review. And "It was okay, some good moments, but overall, forgettable." This one is kind of in the middle, a mix of positive and negative. If we had a neutral or mixed class, this review would be suitable for it. But we know that the model only has two labels, and that's okay. So, basically, we want to create this predict sentiment function where we can pass individual raw texts, new reviews that are coming in, and our model makes a prediction. Now, let's see how we can do this. First, we have a mapping dictionary. If the final output prediction is zero, we convert it into the label negative. If it is one, we convert it into positive. And this is the function. To this predict sentiment function, we pass, let's say, one review at a time. The first step is to tokenize it, as we know that we cannot pass raw text to the model. So, first, call the tokenizer and pass your text. Here, we are saying `return_tensors="pt"`, which returns PyTorch tensors; this is what PT stands for. Then truncation is true with max length 256, as we have seen. Then `.to(model.device)`. This is important. We are getting the token IDs, and then we are moving them to the device.
In this case, if you have a GPU, this is saying move this data to the device, because we know that the model is also on the GPU. If you are working on a CPU, make sure that this is getting loaded on the CPU. So, that is important. The model and the data should be on the same device. And then, we have these two lines, model.eval() and with torch.no_grad(). These two lines are basically telling the model that it is now running in inference mode, not in training mode. Some layers work differently in training and in inference. Similarly, in training you have to compute the gradients, whereas in inference we don't need them. Gradients are computed to tell the model how it should update the weights. For inference, we are not updating any weights. So, model.eval() basically puts it in evaluation mode, or inference mode. And here we are telling PyTorch: don't calculate any gradients. And we know that we get logits as output from the model. So, we call the model and pass these inputs. These inputs are basically the token IDs that we got for the review. Pass that to the model, get the logits out of it. And now

### Segment 15 (70:00 - 75:00) [1:10:00]

this is how we would get the logits, right? So, let's say that P, positive, gets a logit score of, let's say, 8.2, and negative gets a logit score of 3.4. This is just saying there is a higher chance that this is a positive review than a negative review. Now, we are going to convert these two scores, positive and negative. We have this 8.2 and then we have this 3.4, right? I want to convert them into probabilities. Let's say for 8.2 and 3.4, and I'm just taking a random number, the probability here is 0.70, and for this one it is 0.30. When you add these two probabilities you should get one. So, that is the idea. Basically, I'm converting these two logits into two probability scores. Whichever is highest, I would take that as the probability. So, for this conversion, when we have multiple logits and they have to be converted into probabilities, we can use a softmax conversion. If you just have one score, say 8.5, then you can use a sigmoid. So, for conversion of one score to a probability, use sigmoid. If there are multiple logits, two, three, four, or any number of logits, we can use softmax. That's the conversion we are basically going to do here. Get your logits converted into probabilities using the softmax function. And then once you get the final value as either zero or one, convert it into negative or positive. So, here we are doing that. And this confidence is nothing but the rounded probability that we are getting. So, if 0.7 is the probability of P, we are saying that we are 70% confident that this review is positive. And take a case where positive is, let me put this as 0.20, and negative is 0.80. They should sum up to one. So, in this case we would say that we are 80% confident that this review is negative. So, these are the simple conversions we are doing.
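To make the logits-to-probability step concrete, here is a small pure-Python sketch of softmax, argmax, and the label mapping, using the same example logits of 8.2 and 3.4. The helper names (`softmax`, `sigmoid`, `decode`) and the assumption that the logits are ordered [negative, positive] are illustrative choices, not the notebook's code, which presumably uses PyTorch's own softmax. Note that the 0.70 and 0.30 above were arbitrary numbers picked for the explanation; an actual softmax over these two logits gives roughly 0.99 and 0.01.

```python
import math

ID2LABEL = {0: "negative", 1: "positive"}  # mapping described in the video


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Convert a single score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits):
    """Return (label, confidence) for one review's logits."""
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return ID2LABEL[pred], round(probs[pred], 4)


# Assumed order: index 0 = negative, index 1 = positive.
logits = [3.4, 8.2]
probs = softmax(logits)
print(probs)                # two probabilities that sum to 1
print(decode(logits))       # label plus rounded confidence
print(sigmoid(8.2 - 3.4))   # same value as the positive softmax probability
```

With exactly two classes, the softmax probability of the positive class equals the sigmoid of the difference between the two logits, which is how the two functions mentioned above relate to each other.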
So, convert your input text into tokens, pass that to the model, get the logits. Logits are basically the predictions that the model is making. It won't directly give you positive or negative, or zero or one. So, from these logits we get the probabilities using softmax. After that, find whether the probability is maximum for P or N. For that we are using argmax. And then convert it into positive or negative and take the probability as the confidence. That is all. So, these three reviews that are in a list, we iterate over them with this predict_sentiment function and print the final output. All right. So, for the first one we are saying that it is positive. For the next two we are saying it is negative, with this much confidence, which as I said is basically the probability. So, this is the overall process of building a predict function. For new predictions on new reviews, you can simply call this predictor. And the next step is saving the model and tokenizer. Now, the purpose of this particular step is that I don't want to train the model every time. Here, we went with a smaller amount of data and a smaller number of epochs, but what if you're working on a different use case and your training takes, let's say, several hours? You don't want to run that again and again, right? So, for that purpose we have to save the tokenizer and the model so that we can reload them later. And this is the code for this. We have already saved the training-related data in this bert_imdb_output that we configured in our training arguments over here. Now, I'm saying that I want to create another directory called ./imdb_bert, and here I want to save my tokenizer and the model. So, this line is for saving the model, and this one is to save the tokenizer.
So, the next time we want to make a prediction with this predict_sentiment function, we don't have to download the model through AutoTokenizer and the auto model class and retrain it. Rather, we can directly load the saved model. So, we have saved this model. This is writing the model shards to this particular IMDb directory. And once it is saved, I'll show you how you can reload it and use it within this Gradio UI. Again, I'm not going to go into detail on the Gradio UI code, as we could even change this to a Streamlit user interface. This is not that important. The core idea is how you can train and how you can use this model. So, let's focus on that. If you really want to understand the Gradio part, maybe put it into ChatGPT or Claude and get an understanding of it, but I would say this is not that important, as even if you are working in an organization, typically a separate front-end team would be working on it. This is just for building a simple user interface like this. So, I'll just skip this part. And I have this complete code. You can also write this code in a separate notebook. It doesn't have to be part of this notebook. What you can do is download the model artifacts. So, this imdb_bert has the saved model artifacts. You can download this directory, which is the saved model, or you can save it in your Google Drive,

### Segment 16 (75:00 - 80:00) [1:15:00]

create a new notebook and run this code alone, and this will do the trick. So, what we are doing here is, forget about all the previous code, because as I said we can do this separately. We can download this model to our local machine. Make sure the device placement is correct: CUDA, CPU, or on macOS you would have Metal acceleration. Make sure that the device is configured correctly, otherwise this code would not work. So, we are importing torch. We are importing gradio, and from transformers we are importing AutoTokenizer and AutoModelForSequenceClassification. The only difference is that instead of providing the model ID, we provide the save directory. Now transformers, the Hugging Face library, knows that I don't want to download a model, I want to use a saved model. So, provide the save directory for your reloaded tokenizer and then your reloaded model. This is going to load your tokenizer and the model. Make sure you configure the device correctly. Here we are saying CUDA if torch.cuda is available. So, if a GPU is available, make sure the device is CUDA, or else use CPU. That is all. And then here we are manually moving the model to that particular device, either CPU or GPU. So, even if you're working on this on your local machine, this would still work if you configure the device correctly. Just run this exact code. And then we have the same inference function, which is basically the earlier predict_sentiment function that we created. Exact same thing, nothing different. And this is the Gradio UI part. When we are creating this Gradio UI, I have to create a function like this, classify_review. Once we put a review in over here, it's going to strip any surrounding spaces, and if nothing is left it would say something like please enter a review. And then it would just call predict_sentiment. And then we have this Gradio interface and its function.
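The input-validation wrapper described above might look something like this sketch. The `predict_sentiment` here is a hypothetical stand-in that uses a keyword check so the wrapper can be run without the model; in the notebook it is the model-backed function from the earlier step. The exact message text and the output format are assumptions.

```python
def predict_sentiment(text):
    """Stand-in for the notebook's model-backed function (hypothetical stub).

    The real version tokenizes the text and runs the fine-tuned BERT model;
    this stub only checks for a keyword so the wrapper is runnable on its own.
    """
    label = "negative" if "boring" in text.lower() else "positive"
    return label, 0.5  # fixed dummy confidence, not a model output


def classify_review(review):
    """Validate the textbox input, then delegate to predict_sentiment."""
    review = (review or "").strip()  # drop surrounding whitespace
    if not review:
        return "Please enter a review."
    label, confidence = predict_sentiment(review)
    return f"{label} (confidence: {confidence:.2%})"


print(classify_review("   "))
print(classify_review("I hated this film, it was boring."))
```

A function of this shape, taking one string and returning one string, is what would typically be passed as the `fn` argument of `gr.Interface`.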
We have to call this classify_review, which in turn is going to call predict_sentiment and all that. There are some UI-related parameters over here. And this is important: demo.launch with share equal to True is going to generate a URL at which you can access this app. We also provide some examples that the user can try. So, I'll run this. This is going to give you the Gradio screen. You can try it out in Colab itself. You will see this in a minute. Or you can access the HTTPS URL that has been generated. Now, this is accessible to anyone. So, this HTTPS URL, you can share it with others, as I said earlier. If you're going to an interview, you can share this particular URL, which is running as a public URL. And the link expires in 1 week. So, this will be live for 1 week, but for that to happen you should keep this code running on Google Colab. In a production environment we would, let's say, put this on an EC2 instance or a different compute resource, but this is a simple, easy thing to do, as we are mainly focusing on the model building and fine-tuning part. And for the user interface we can have something simple like this. So, now go to the URL that it has generated. Again, you can type your review over here and get the output, or you can access this URL. So, the Gradio application is loading over here. Here I can type my review, or we have given some examples. So, I can just click this example, and it will fill it in over here. Click on submit and then we get the review as positive or negative along with the confidence, which is basically my probability score. So, this is the overall idea. I hope you have understood whatever we have covered today. So, please practice this code. Before wrapping up, I'll also give you a recap of what we have done. And I think I also have a wrap-up and what-you-can-do-next section that you can maybe try out.
I'll just paste it over here. So, wrap-up. I've basically listed the things that we have done and the other things that you can include to make it more rounded. Practice this. The main goal I had, as we are just starting with fine-tuning a transformer-based model, was to start simply, and that is the reason for having this code in a very simplified form using Hugging Face. On top of that we will build more use cases, try to use, let's say, PyTorch, TensorFlow, or those packages, and try to build better systems as we move on to other use cases. So, now I'll just do a quick review, as I said. Right. So, we have seen all the different steps that we performed in this video. We have installed all the libraries. Here the important ones are mainly the Hugging Face library and the Hugging Face datasets library. We have imported all the dependencies, mainly the tokenizer, auto model, training parameters, etc., and scikit-learn for evaluation metrics. And another important thing is that we have set up reproducibility. Now when you rerun this notebook, you would get the same exact results. And for that purpose, we are setting up the seeding. Here we have mentioned the exact device that we are working on, whether it is CUDA. CUDA represents Nvidia GPUs, and CPU is for, let's say, a CPU machine. And then here we have downloaded the dataset from Hugging Face. If you are working with a custom dataset, we can also process that and provide it as a Hugging Face dataset so that the workflow remains the same. So you can use this exact code for other

### Segment 17 (80:00 - 82:00) [1:20:00]

use cases. So use this as a base reference code, and you can work on other use cases as well. So we have printed the dataset and we have seen what the data looks like. We printed one sample and saw that it has a label and the actual movie review. We created a subsample over here. You can adjust this train size and test size, and maybe instead of hard-coding this test size, which is the validation split, you can also make it another configurable variable just like your train size and your test size. So we got training data, validation data, and test data: 1,800 training samples, 200 validation samples, 500 test samples. And then we have talked about the tokenizer. The main job of the tokenizer is to split a sentence or any text into smaller pieces called tokens and convert them into token IDs. And then we have downloaded our pre-trained BERT model and added a classification head, saying that I want to use it for binary classification, basically two labels. This is what we have provided here. And then we have provided the metrics that need to be computed at the end of each epoch, and then the data collator, which dynamically adds padding depending on the maximum number of tokens in each individual batch. And here we have set up the training configuration using the training arguments, and then we have created this trainer by providing the model, training arguments, training data, validation data, etc. So we did the actual fine-tuning using trainer.train and then evaluated it on our tokenized test dataset. And then we have predicted the sentiment as positive or negative. So we have created this predict_sentiment function.
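The dynamic padding idea mentioned in the recap can be illustrated without any library. The sketch below is not how the Hugging Face collator is implemented (the notebook presumably uses something like `DataCollatorWithPadding`); it only shows the behaviour: each batch is padded to the length of its own longest sequence rather than to a global maximum. The function name `pad_batch`, the pad ID of 0, and the token IDs are all assumptions made for the example.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in one batch to that batch's longest sequence."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for illustration; real IDs come from the tokenizer.
batch_a = [[101, 7592, 102], [101, 2023, 3185, 2001, 102]]
batch_b = [[101, 102], [101, 2307, 102]]

print(pad_batch(batch_a)["input_ids"])  # padded to length 5
print(pad_batch(batch_b)["input_ids"])  # padded to length 3 only
```

Padding per batch keeps short batches short, which saves computation compared with padding every review to the 256-token maximum.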
And finally, we have saved the model and built a simple Gradio UI where a user can provide a review. This in turn calls classify_review, which in turn calls the predict_sentiment function, which uses the loaded model and gives you the final output. So this is what we have seen. I hope that this is useful to you and that you have understood all the things we have covered. Please let me know in the comments if you have any doubts. I'll definitely reply and clarify them. So that is pretty much it. I'll see you in the next lesson.
