🤖 My end-to-end Machine Learning & Generative AI Course - Udemy: https://linktr.ee/siddhardhan
In this video, we do a complete hands-on tutorial on tokenization in Python using a Hugging Face tokenizer.
Tokenization - Conceptual video: https://youtu.be/aV9xvNBRYuw
Colab notebook link: https://drive.google.com/file/d/12bbB2cl1ffQXoxbcH-adQkDcBpFOMKW1/view?usp=sharing
Table of contents (5 segments)
Segment 1 (00:00 - 05:00)
Hello everyone, I'm Siddhardhan. In the previous video, we understood what tokenization is and why it plays such a crucial role in natural language processing and in generative AI models like LLMs. In this video, we take a hands-on approach: we will load a tokenizer from Hugging Face, pass text into it, and look at the token IDs it returns. By the end, you will not just understand tokenization conceptually but also be confident using a tokenizer to produce the token IDs that get passed to a large language model. That is the agenda for today.
Before getting into the coding part, let me quickly show you the Udemy courses I've created. I have two right now: a complete machine learning course and a complete generative AI course. Both start from the very basics and go all the way up to advanced topics. The machine learning course covers machine learning basics, Python basics, the intuition behind different models, and capstone projects. The generative AI course covers the basics of generative AI, transformers, building RAG applications, prompt engineering, MCP, AI agents, and capstone projects as well. Please take a look if you feel they would be useful to you.
With that said, let's move to today's video. Before the coding, let's recap the theory we covered in the previous video. A tokenizer's role is to take a text, chop it into smaller pieces, convert those pieces into token IDs, and send them to a model. Say we have a text or prompt like "I love ChatGPT". It is split into smaller pieces: "I", "love", "Chat", and "GPT".
So we now have four tokens from this three-word sentence, and each token is converted into a token ID. "I" might get the token ID 101, "love" 234, and so on; different tokens get different IDs. If "I" appears in a different sentence, it still gets the token ID 101. That is the core concept.
The flow looks like this. We start with a text, the text goes to a tokenizer, and the tokenizer splits it into tokens and returns token IDs, which is basically a list of numbers for the given sentence. Those token IDs are then passed to the model, where they are converted to embeddings, attention is applied, and so on. That is how a large language model, or any transformer-based model, works. Right now we are looking only at the initial step: passing in the text and getting the token IDs back. This is what we discussed in theory in the previous video.
Now let's do it hands-on: how to get a tokenizer from Hugging Face and how to get token IDs from it. I'm in a Google Colab environment, and for this lesson we don't need a GPU or TPU session; a CPU session is fine. First I connect to a session, which assigns me a CPU environment. Next we install the transformers library, which is the Hugging Face library. I add a comment, "installing Hugging Face library", then type pip install transformers and run it. Google Colab already has the transformers library installed by default, so you will see "Requirement already satisfied". If you run this notebook locally in VS Code or Jupyter, you may have to install it.
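The core idea from the recap, that a token keeps the same ID no matter which sentence it appears in, can be sketched with a plain dictionary. This is only a toy illustration, not how a real tokenizer is implemented: the vocabulary, the whitespace splitting, and the helper name toy_encode are all assumptions made for the example, and the IDs are invented (101 and 234 simply echo the numbers used in the explanation).

```python
# Toy vocabulary: every ID here is made up for illustration.
# 101 and 234 echo the numbers used in the explanation; 310 and 412 are invented.
TOY_VOCAB = {"i": 101, "love": 234, "chat": 310, "gpt": 412}
UNKNOWN_ID = 0  # hypothetical ID for anything missing from the toy vocabulary


def toy_encode(text):
    """Split on whitespace, lowercase each piece, and look it up in TOY_VOCAB."""
    return [TOY_VOCAB.get(piece.lower(), UNKNOWN_ID) for piece in text.split()]


first = toy_encode("I love chat GPT")
second = toy_encode("GPT I love")

print(first)   # [101, 234, 310, 412]
print(second)  # [412, 101, 234]
```

In both outputs "I" maps to 101 and "love" to 234, regardless of position or sentence. A real tokenizer works from a learned vocabulary with tens of thousands of entries, but the lookup principle is the same.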
You can also add the -q flag so that you don't get all the verbose output. Errors would still show up, but if everything is fine you won't see the long install log. That's a good way to keep your notebook clean.
So we have installed the transformers library. The next step is to import AutoTokenizer, so I write: from transformers import AutoTokenizer. Usually, when we want to download an LLM, we import AutoTokenizer and also AutoModelForCausalLM. The tokenizer takes the text as input and converts it into token IDs; we then download an LLM through AutoModelForCausalLM and pass the token IDs to it. Right now we are not interested in the model, so I'll remove that import and focus only on the tokenizer.
The next step after importing AutoTokenizer is downloading the tokenizer of our choice. I'll add a comment: loading a pre-trained
Segment 2 (05:00 - 10:00)
tokenizer. Then I write tokenizer = AutoTokenizer.from_pretrained(...), that is, from underscore pretrained, and inside the parentheses I have to provide a model ID. The important point here is that every model has its own tokenizer. A GPT-2 model has its own tokenizer, and Gemini Nano has its own. There is no single common tokenizer that can be used everywhere. Say I want to use Google's BERT base uncased: I first tokenize with the tokenizer of that specific model and then pass the result to the BERT base uncased model. That is the process we follow.
How do you find a tokenizer on Hugging Face? Say my choice is Google's BERT base uncased. Search for it with "Hugging Face" added in the search bar, and you will see the Hugging Face site. When you open it, you will see the model card details, and at the top is the model ID with a copy symbol next to it. Click it to copy the ID, then paste it inside the double quotes in from_pretrained. If you want to do the same for a Llama 3 model, search for "Llama 3 Hugging Face", open the Meta Llama 3 8B page, copy that model ID, and paste it here instead. That gives you the tokenizer for a Llama model. So the process is: use AutoTokenizer.from_pretrained and provide the exact ID of the model whose tokenizer you need.
I'll run this. The tokenizer is usually very small, whereas the model we would download through AutoModelForCausalLM is much larger, so the tokenizer shouldn't take much space or storage. Also note that this is actually not a GPT kind of model.
BERT models are encoder-only models, mainly used for text classification, sentence understanding, named entity recognition, and similar purposes. ChatGPT-style applications, on the other hand, are powered by large language models such as GPT-5, or models with a similar architecture such as Llama, Gemini, and Gemma. These are decoder-only models, and they are used for next-token prediction. Keep this in mind, because it is important and we will come back to it later: BERT is an encoder-only model, while LLMs such as GPT, Gemma, and Gemini are decoder-only models used for next-token prediction. The process of using the tokenizer doesn't change, though, which is why I'm showing it with the BERT model.
So we have BERT base uncased. "Uncased" means it was trained on lowercased text, so it doesn't distinguish between uppercase and lowercase letters. That is this specific version of the BERT model.
The next step is to take a sample text, pass it to the tokenizer, and get the tokens back. I add a comment, "convert text to tokens", and set text = "I love Machine Learning" as the sample text we are working on. Then I write tokens = tokenizer.tokenize(text), which gives me the tokens. To see them, I print the label "Tokens:" followed by the tokens variable, and run the cell. The sentence has been converted into a list of tokens: "i" is the first token, "love" the second, "machine" the third, and "learning" the fourth.
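The loading and tokenizing steps described so far could look roughly like the sketch below. AutoTokenizer.from_pretrained and tokenizer.tokenize are the calls named in the video; the wrapper functions load_tokenizer and show_tokens are hypothetical helpers added here, and the model ID google-bert/bert-base-uncased is assumed to be the one copied from the model card.

```python
def load_tokenizer(model_id="google-bert/bert-base-uncased"):
    """Fetch the tokenizer that belongs to `model_id` (assumed model ID)."""
    # Imported inside the function so this sketch can be read and imported
    # even before `pip install transformers` has been run.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id)


def show_tokens(tokenizer, text):
    """Split `text` into token strings, print them, and return them."""
    tokens = tokenizer.tokenize(text)
    print("Tokens:", tokens)
    return tokens


# Usage in a notebook (the first call downloads the tokenizer files):
# tokenizer = load_tokenizer()
# show_tokens(tokenizer, "I love Machine Learning")
```

Swapping in another model's tokenizer would only mean passing a different model ID to load_tokenizer; the rest of the code stays the same.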
So we get four tokens from this one sentence. There are two points to note. First, because this is an uncased model, the uppercase letters I, M, and L have all been converted to lowercase. Second, if I add a punctuation mark to the text, that is also converted into a separate token. That is how it works overall: we pass the sample text to a tokenizer and get the tokens in return.
As we discussed earlier, this is only the initial step, not the final output of a tokenizer. We need the token IDs, which are unique IDs for the tokens we see here. So now let's see how to convert these tokens into token IDs. I'll add a text cell here saying that
Segment 3 (10:00 - 15:00)
each token has its own unique token ID. Then I add a comment, "tokens to token IDs", and write token_ids = tokenizer.convert_tokens_to_ids(tokens). I print the label "Token IDs:" followed by token_ids and run the cell.
Looking at the output, the token "i" has been converted to the token ID 1045, "love" to 2293, "machine" to 3698, "learning" to 4083, and the exclamation mark to 999. If I use a different sentence, say "I love ChatGPT", I still get 1045 for "i". Each unique token has one unique ID. It is not the case that "i" gets 1045 in one sentence and a different token ID in another; wherever "i" appears, it has the same token ID.
But this is specific to BERT base uncased. If you use the tokenizer for Llama, you get a different token ID for the same word, because these values are tokenizer-specific. So keep two ideas in mind. First, with the BERT base uncased tokenizer, no matter how many different sentences you give it and how many times "i" appears in them, "i" always has the token ID 1045. Second, a Llama tokenizer would assign it a different ID, say 238, and with that tokenizer it would be 238 in every sentence.
So we started with the sample text "I love machine learning".
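The two-step route (text to tokens, then tokens to IDs) and the "same token, same ID" check can be sketched as follows. tokenize and convert_tokens_to_ids are the tokenizer methods used in the video; the helper names tokens_with_ids and same_id_in_both are hypothetical, and the functions assume a tokenizer that was already loaded with AutoTokenizer.from_pretrained.

```python
def tokens_with_ids(tokenizer, text):
    """Two-step route: text -> token strings -> integer IDs, returned as pairs."""
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    return list(zip(tokens, token_ids))


def same_id_in_both(tokenizer, token, text_a, text_b):
    """True if `token` appears in both texts and receives the same ID in each."""
    id_a = dict(tokens_with_ids(tokenizer, text_a)).get(token)
    id_b = dict(tokens_with_ids(tokenizer, text_b)).get(token)
    return id_a is not None and id_a == id_b


# Usage, assuming `tokenizer` is the loaded bert-base-uncased tokenizer:
# print(tokens_with_ids(tokenizer, "I love machine learning!"))
# print(same_id_in_both(tokenizer, "i", "I love machine learning!", "I love ChatGPT"))
```

Running the same check with a different model's tokenizer should still return True for a shared token, even though the ID itself differs from BERT's, since the mapping is fixed per tokenizer.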
We have seen how to convert it into tokens, which are basically a chopped-down version of the sentence, and how to pass those tokens back to the tokenizer to get the token IDs. So we used two calls: tokenize, and then convert_tokens_to_ids.
That is one way to do it, and there is an easier way. Instead of doing it in two steps, first getting the tokens and then the token IDs, we can use a direct method where we pass the text and get the token IDs straight away. I'll call this the "direct method". I copy the text assignment again; it isn't needed, but it keeps the code structured and makes it easy to see where the text comes from. Then I write encoded = tokenizer(text). Earlier we used tokenizer.tokenize and then convert_tokens_to_ids; now I call the tokenizer itself and pass the text. Then I print encoded. This converts the sentence into token IDs directly. The intermediate steps still happen, but we don't see them.
Looking at the output, 1045 is the token ID for "i", 2293 is for "love", then come "machine" and "learning", and the exclamation mark is 999. We also see two additional tokens. These are the special tokens we discussed in the previous conceptual video: 101 is the [CLS] token and 102 is the [SEP] (separator) token. [CLS] usually comes at the start of a sentence. Say we want to solve a text classification problem. In that case, only this initial classification token, 101, is passed on to the classification step.
What happens is that this first token gathers information from all the other tokens through the attention mechanism, so the initial [CLS] token, 101, acts like a summary of the sentence. It captures the overall meaning, and only this token is passed on for classification. That is the purpose of the [CLS] token, and it usually appears at the start of the sentence. Then we have the separator token, which comes at the end of the sentence; here that is 102. So this is the difference between these
Segment 4 (15:00 - 20:00)
two. A BERT model can also be used to get the similarity score between two sentences. Say I have one sentence, "I love machine learning", and another, "I love artificial intelligence", and we want a similarity score between them. In that case, we put a separator token between the two sentences and pass the whole thing to the BERT model; a BERT model fine-tuned for this task can then give you a similarity score. So that is another place where the separator token is used. Just remember that these special tokens, [CLS] and [SEP], are added automatically.
Here is how to read the output we got. First come the input IDs: we gave the text as input, and these are the IDs for it. Next come the token type IDs. I explained that we can also pass two sentences, and this field is for that case: it indicates whether a specific token comes from sentence one or sentence two. Here every entry, for 101, 1045, and all the rest, has the value zero, which means they are all part of the first sentence. We passed only one sentence, so all the values are zero. If we had passed two sentences, the tokens of the second sentence would have a token type of one. The values line up by position: the first zero corresponds to 101, the next to 1045, which is "i", and so on.
Then we have the attention mask. We talked about padding tokens in the previous video. The attention mask tells whether a particular token is an actual token from the text or just a padding token. If it is an actual token, the attention mask has the value one; if it is a padding token, the value is zero.
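To make the three output fields concrete, here is a toy sketch that assembles them by hand for a sentence pair with padding. It is not the Hugging Face implementation: build_pair_inputs is a hypothetical helper, the layout ([CLS] A [SEP] B [SEP], then padding) follows the BERT convention, and the IDs 101, 102, and 0 are the [CLS], [SEP], and [PAD] IDs of bert-base-uncased. The word IDs are the ones reported in the video.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # bert-base-uncased special-token IDs


def build_pair_inputs(ids_a, ids_b, max_length):
    """Assemble [CLS] A [SEP] B [SEP] plus padding, BERT style."""
    first = [CLS_ID] + ids_a + [SEP_ID]   # segment 0: [CLS], sentence A, [SEP]
    second = ids_b + [SEP_ID]             # segment 1: sentence B, [SEP]
    input_ids = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    attention_mask = [1] * len(input_ids)  # 1 marks a real token

    pad = max_length - len(input_ids)
    if pad < 0:
        raise ValueError("max_length is too small for this pair")
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,  # 0 marks padding
    }


# "i love machine learning" and "i love", using the IDs reported in the video.
sentence_a = [1045, 2293, 3698, 4083]
sentence_b = [1045, 2293]
features = build_pair_inputs(sentence_a, sentence_b, max_length=12)
for name, values in features.items():
    print(name, values)
```

With the real tokenizer, the equivalent call would be along the lines of tokenizer(text_a, text_b, padding="max_length", max_length=12), which produces the same three fields without any manual bookkeeping.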
Here we don't have any padding tokens, which is why all the values are one. So that is how to read the output we get from the tokenizer when it converts the text. I'll add a note in the notebook: in this case, 101 is the [CLS] token and 102 is the [SEP] token.
Now for the overall flow. We start with an input text or input prompt, which is in text (string) format. We pass it to a tokenizer, and the tokenizer gives us token IDs, just as we have seen. These token IDs are then passed to a model, which can be an LLM, a BERT base model, and so on. Say we pass them to an LLM. The LLM produces an output for the prompt, but the important point is that it doesn't give the output directly as text. It gives the output as token IDs as well. We then use the tokenizer again to decode those IDs and get the output text.
So the first step, converting text into token IDs, is the encoding process. The last step is the reverse: the output token IDs are converted back into output text, and this is called the decoding process. In short, text goes in, we get token IDs, they are passed to the model, we get output token IDs, and we convert them into output text. Now let's see how decoding works.
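The encode, model, decode flow described above can be sketched as a small round trip. Calling the tokenizer directly and calling tokenizer.decode are the real steps; the function name round_trip is hypothetical, and the "model" is only a stand-in that hands the input IDs back unchanged, which is an assumption made so the sketch needs no actual model.

```python
def round_trip(tokenizer, text, keep_special_tokens=True):
    """Encode `text`, pretend a model echoed the IDs back, then decode them."""
    encoded = tokenizer(text)            # dict-like result with "input_ids" etc.
    input_ids = encoded["input_ids"]

    # Stand-in for the model: a real LLM would return newly generated IDs here.
    output_ids = input_ids

    # skip_special_tokens=True would drop markers such as [CLS] and [SEP].
    return tokenizer.decode(output_ids, skip_special_tokens=not keep_special_tokens)


# Usage, assuming `tokenizer` is the loaded bert-base-uncased tokenizer:
# print(round_trip(tokenizer, "I love machine learning!"))
# print(round_trip(tokenizer, "I love machine learning!", keep_special_tokens=False))
```

With the BERT tokenizer, the first call should return text that starts with [CLS] and ends with [SEP], while the second returns only the sentence itself.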
Even when we use a Hugging Face model, we first use a tokenizer, pass the tokenizer's output to the model, and then decode the model's output ourselves. That is the overall flow. I'll add a comment: "decode back, after a model generates output as token IDs". So this decoding process happens after the model has
Segment 5 (20:00 - 24:00)
generated the output as token IDs. I write decoded = tokenizer.decode(...). Let's assume the IDs we have are the output coming from the model, so I pass the input IDs as if they were output IDs. I can either pass encoded["input_ids"] or simply copy the list of IDs and paste it in; it is the same thing. In other words, we pass the encoded input back in to check whether it decodes correctly. Then I print the label "Decoded text:" followed by decoded.
The first attempt fails with a "not subscriptable" error. The mistake is that the IDs have to be passed to decode within parentheses, not square brackets. With that fixed, it prints [CLS], then "i love machine learning" with the exclamation mark, then [SEP]. So 101 is decoded as [CLS], 999 is the exclamation mark, and 102 is [SEP]. This is how we decode the output coming from, say, an LLM into actual text.
We also discussed subwords. It is not always the case that every word becomes exactly one token. In this example, "i" is one token, "love" is one token, "machine" is one token, and "learning" is one token, but that doesn't always hold. With a subword approach, a word like "playing" can be split into "play" and "##ing". Let's look at that quickly. It is the same code as before: we set the text to "unhappiness", call tokenizer.tokenize on it, and print the tokens.
Here the word "unhappiness" has been split into "un", "##ha", "##pp", and "##iness". For this one longer word, we get four tokens. The double hash symbol (##) indicates that a token is a continuation of the previous one. We didn't see it earlier because no single word had been split into multiple tokens. Here, the ## tells the model downstream that "##ha", "##pp", and "##iness" all belong to a single word and have to be combined with "un". That is an example of subword tokenization behavior.
So these are the basics of how a tokenizer works. It is a simple concept, but an important one. I also wanted to cover the flow, and I hope you now understand how the input passes through the tokenizer, then the model, and how the output is decoded back. That is one important takeaway. For any model, you first get its corresponding tokenizer, then convert the input text into tokens and then into token IDs. Or you can use the direct method: pass the text to the loaded tokenizer, and it gives you the input IDs, token type IDs, and attention mask. Then you can decode the IDs back into text.
That is what we have learned so far. We will build on this understanding and see how exactly the text flows inside the model and what happens there. We will discuss all of it in a structured, ordered way. I hope you have understood everything we covered in this tokenization part.
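The ## continuation pieces come from a greedy longest-match-first split, which is how WordPiece tokenizers such as BERT's break a word into subwords. The sketch below illustrates that idea only: the function wordpiece is a simplified, hypothetical helper, and TOY_VOCAB is a tiny made-up vocabulary chosen so that the split of "unhappiness" matches the one reported in the video (the real vocabulary has roughly 30,000 entries).

```python
# Tiny made-up vocabulary; "##" marks pieces that continue a word.
TOY_VOCAB = {"un", "##ha", "##pp", "##iness", "play", "##ing", "love"}


def wordpiece(word, vocab=TOY_VOCAB, unknown="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # not at the start of the word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unknown]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


for word in ["love", "playing", "unhappiness"]:
    print(word, "->", wordpiece(word))
```

A word that is in the vocabulary stays whole, while longer or rarer words fall apart into a first piece plus ## continuations, which is the behavior seen with "unhappiness".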
Please go through this code, execute it yourself, and try different tokenizers and different texts to see how it works. That is all from my side for this lesson. I'll see you in the next one.