🔥Post Graduate Program in Generative AI and ML: https://www.edureka.co/executive-programs/pgp-generative-ai-machine-learning-certification-training
🔥Integrated MS+PGP Program in Data Science & AI:https://www.edureka.co/dual-certification-programs/ms-data-science-pgp-gen-ai-ml-birchwood
This Natural Language Processing (NLP) full course is designed to help you build a strong foundation in one of the most important fields of artificial intelligence. Starting from the basics, you’ll learn how machines understand and process human language using core NLP techniques like tokenization, stemming, and vectorization. The course progresses into advanced topics such as word embeddings, sentiment analysis, and transformer models like BERT. With hands-on coding examples in Python and practical projects using libraries like NLTK, spaCy, and Hugging Face, you’ll gain real-world experience in building powerful NLP applications.
00:00:00 Introduction
00:01:42 Natural Language Processing In 10 Minutes
00:09:24 Python NLTK Explained
01:27:20 NLP & Text Mining Using NLTK
02:04:03 Stemming And Lemmatization
02:13:54 Context Free Grammar Using NLP In Python
02:45:05 Text Classification Explained
02:47:43 What is Supervised Learning?
02:57:20 What is Unsupervised Learning?
03:04:37 Decision Tree Algorithm
03:49:29 Random Forest
04:21:33 Support Vector Machine In Python
04:35:07 What is a Neural Network?
04:42:03 Neural Network in Python
04:58:57 Artificial Neural Networks
05:31:17 Recurrent Neural Networks
06:00:13 Transformers Neural Networks
06:11:49 Transformers Explained Using Generative AI
06:19:27 What is Generative AI?
06:34:42 What is LLM?
06:52:02 Chat GPT Explained In 10 Minutes
07:01:41 Prompt Engineering For Code Generation
07:11:00 What is LangChain?
07:28:14 What is RAG?
07:51:05 Deep Learning Interview Questions and Answers
🔴 𝐋𝐞𝐚𝐫𝐧 𝐓𝐫𝐞𝐧𝐝𝐢𝐧𝐠 𝐓𝐞𝐜𝐡𝐧𝐨𝐥𝐨𝐠𝐢𝐞𝐬 𝐅𝐨𝐫 𝐅𝐫𝐞𝐞! 𝐒𝐮𝐛𝐬𝐜𝐫𝐢𝐛𝐞 𝐭𝐨 𝐄𝐝𝐮𝐫𝐞𝐤𝐚 𝐘𝐨𝐮𝐓𝐮𝐛𝐞 𝐂𝐡𝐚𝐧𝐧𝐞𝐥: https://edrk.in/DKQQ4Py
📝Feel free to share your comments below.📝
🔴 𝐄𝐝𝐮𝐫𝐞𝐤𝐚 𝐎𝐧𝐥𝐢𝐧𝐞 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐚𝐧𝐝 𝐂𝐞𝐫𝐭𝐢𝐟𝐢𝐜𝐚𝐭𝐢𝐨𝐧𝐬
🔵 DevOps Online Training: http://bit.ly/3VkBRUT
🌕 AWS Online Training: http://bit.ly/3ADYwDY
🔵 React Online Training: http://bit.ly/3Vc4yDw
🌕 Tableau Online Training: http://bit.ly/3guTe6J
🔵 Power BI Online Training: http://bit.ly/3VntjMY
🌕 Selenium Online Training: http://bit.ly/3EVDtis
🔵 PMP Online Training: http://bit.ly/3XugO44
🌕 Salesforce Online Training: http://bit.ly/3OsAXDH
🔵 Cybersecurity Online Training: http://bit.ly/3tXgw8t
🌕 Java Online Training: http://bit.ly/3tRxghg
🔵 Big Data Online Training: http://bit.ly/3EvUqP5
🌕 RPA Online Training: http://bit.ly/3GFHKYB
🔵 Python Online Training: http://bit.ly/3Oubt8M
🌕 Azure Online Training: http://bit.ly/3i4P85F
🔵 GCP Online Training: http://bit.ly/3VkCzS3
🌕 Microservices Online Training: http://bit.ly/3gxYqqv
🔵 Data Science Online Training: http://bit.ly/3V3nLrc
🌕 CEHv12 Online Training: http://bit.ly/3Vhq8Hj
🔵 Angular Online Training: http://bit.ly/3EYcCTe
🔴 𝐄𝐝𝐮𝐫𝐞𝐤𝐚 𝐑𝐨𝐥𝐞-𝐁𝐚𝐬𝐞𝐝 𝐂𝐨𝐮𝐫𝐬𝐞𝐬
🔵 DevOps Engineer Masters Program: http://bit.ly/3Oud9PC
🌕 Cloud Architect Masters Program: http://bit.ly/3OvueZy
🔵 Data Scientist Masters Program: http://bit.ly/3tUAOiT
🌕 Big Data Architect Masters Program: http://bit.ly/3tTWT0V
🔵 Machine Learning Engineer Masters Program: http://bit.ly/3AEq4c4
🌕 Business Intelligence Masters Program: http://bit.ly/3UZPqJz
🔵 Python Developer Masters Program: http://bit.ly/3EV6kDv
🌕 RPA Developer Masters Program: http://bit.ly/3OteYfP
🔵 Web Development Masters Program: http://bit.ly/3U9R5va
🌕 Computer Science Bootcamp Program : http://bit.ly/3UZxPBy
🔵 Cyber Security Masters Program: http://bit.ly/3U25rNR
🌕 Full Stack Developer Masters Program : http://bit.ly/3tWCE2S
🔵 Automation Testing Engineer Masters Program : http://bit.ly/3AGXg2J
🌕 Python Developer Masters Program : https://bit.ly/3EV6kDv
🔵 Azure Cloud Engineer Masters Program: http://bit.ly/3AEBHzH
🔴 𝐄𝐝𝐮𝐫𝐞𝐤𝐚 𝐔𝐧𝐢𝐯𝐞𝐫𝐬𝐢𝐭𝐲 𝐏𝐫𝐨𝐠𝐫𝐚𝐦𝐬
🔵 Post Graduate Program in DevOps with Purdue University: https://bit.ly/3Ov52lT
🌕 Advanced Certificate Program in Data Science with E&ICT Academy, IIT Guwahati: http://bit.ly/3V7ffrh
🔵 Advanced Certificate Program in Cloud Computing with E&ICT Academy, IIT Guwahati: https://bit.ly/43vmME8
🌕Advanced Certificate Program in Cybersecurity with E&ICT Academy, IIT Guwahati: https://bit.ly/3Pd2utG
📌𝐓𝐞𝐥𝐞𝐠𝐫𝐚𝐦: https://t.me/edurekaupdates
📌𝐓𝐰𝐢𝐭𝐭𝐞𝐫: https://twitter.com/edurekain
📌𝐋𝐢𝐧𝐤𝐞𝐝𝐈𝐧: https://www.linkedin.com/company/edureka
📌𝐈𝐧𝐬𝐭𝐚𝐠𝐫𝐚𝐦: https://www.instagram.com/edureka_learning/
📌𝐅𝐚𝐜𝐞𝐛𝐨𝐨𝐤: https://www.facebook.com/edurekaIN/
📌𝐒𝐥𝐢𝐝𝐞𝐒𝐡𝐚𝐫𝐞: https://www.slideshare.net/EdurekaIN
📌𝐂𝐚𝐬𝐭𝐛𝐨𝐱: https://castbox.fm/networks/505?country=IN
📌𝐌𝐞𝐞𝐭𝐮𝐩: https://www.meetup.com/edureka/
📌𝐂𝐨𝐦𝐦𝐮𝐧𝐢𝐭𝐲: https://www.edureka.co/community/
Please write back to us at sales@edureka.co or call us at IND: 9606058406 / US: +18885487823 (toll-free) for more information.
Introduction
Hello everyone, and welcome to this NLP full course. Natural language processing is a key area of artificial intelligence that enables machines to understand, interpret, and generate human language. It powers applications such as chatbots, language translation, sentiment analysis, and voice assistants. In this course, you will explore the fundamentals of NLP, including how text data is processed, how language models work, and how machines extract meaning from human language. You will also learn about common NLP techniques and their real-world applications across different industries. By the end of this course, you will have a solid understanding of NLP concepts and how language-based AI systems are built and used in today's technology landscape. Before we begin, please like, share, and subscribe to Edureka's YouTube channel and hit the bell icon to stay updated on the latest content from Edureka. Also check out Edureka's Generative AI Masters Program, designed to help you build strong expertise in generative AI, Python, data science, NLP, ChatGPT, LLMs, prompt engineering, and agentic AI. The program focuses on hands-on learning through real-world projects, expert-led training, and industry-relevant tools. By enrolling in this generative AI certification program, you will gain practical skills to design, build, and deploy modern Gen AI applications. You can find the course link in the description box below. Now let us get started by understanding what NLP is.
Natural Language Processing In 10 Minutes
— Human beings are the most advanced species on Earth, and our success comes largely from our ability to communicate and share information. That is where the concept of developing a language comes in. Human language is one of the most diverse and complex things about us, considering that roughly 6,500 languages exist. Coming to the 21st century, industry estimates suggest that only about 21% of the available data exists in structured form. Data is being generated as we speak, tweet, and send messages on WhatsApp or in various Facebook groups, and the majority of this data exists in textual form, which is highly unstructured in nature. In order to produce significant and actionable insights from this data, it is important to get acquainted with the techniques of text analysis and natural language processing. So what are text mining and natural language processing? Text mining, or text analytics, is the process of deriving meaningful information from natural language text. It usually involves structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output. Natural language processing, on the other hand, refers to the artificial intelligence methods for communicating with an intelligent system using natural language. Since text mining is about deriving high-quality information from text, the overall goal is to turn text into data for analysis via the application of NLP; that is why text mining and NLP go hand in hand. Let's look at some applications of text mining and NLP. One of the first and most important applications is sentiment analysis, be it Twitter sentiment analysis or Facebook sentiment analysis; it is used heavily. Next we have chatbots: you may have used the customer chat services provided by various companies, and NLP is the process behind them. Then there is speech recognition, including voice assistants like Siri, Google Assistant, and Cortana, all of which are powered by natural language processing. Machine translation is another use case; the most common example is Google Translate, which uses NLP to translate text from one language to another, and that too in real time. Other applications of NLP include spell checking, keyword search, and extracting information from documents or websites. Finally, one of the coolest applications of NLP is advertisement matching, i.e. recommending ads based on your history. NLP is divided into two major components: natural language understanding and natural language generation. Understanding refers to mapping a given natural language input into a useful representation and analyzing the different aspects of the language, whereas generation is the process of producing meaningful phrases and sentences in natural language from some internal representation.
Natural language understanding is usually harder than natural language generation, because it takes a great deal of effort to truly understand a language, especially if you are not a human being. There are several steps involved in natural language processing: tokenization, stemming, lemmatization, POS tagging, named entity recognition, and chunking. Starting with tokenization: tokenization is the process of breaking a string into tokens, which are small structures or units that can be used for further processing. If we look at the example here, the sentence can be divided into seven tokens. This is very useful in the natural language processing pipeline. The second process is stemming. Stemming refers to normalizing a word into its base or root form. If you look at the words here, affectation, affects, affections, affected, affection, and affecting, all of these words originate from a single root word, which, as you might have guessed, is "affect". A stemming algorithm works by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting can be successful on some occasions, but not always. That brings us to lemmatization. Lemmatization, on the other hand, takes into consideration the morphological analysis of the word. To do so, it needs a detailed dictionary which the algorithm can look through to link the inflected form back to its root word, also known as the lemma. Lemmatization groups the different inflected forms of a word under its lemma, and it is similar to stemming in that it maps several words onto one common root; the major difference is that the output of lemmatization is always a proper word. For example, a lemmatizer should map "gone", "going", and "went" to "go", which would not be the output of stemming. Once we have the tokens and have reduced them to their root forms, next come POS tags. Generally speaking, the grammatical type of a word, be it verb, noun, adjective, adverb, or article, is referred to as its POS tag, or part of speech. It indicates how a word functions in meaning as well as grammatically within the sentence. A word can have more than one part of speech depending on the context in which it is used. For example, take the sentence "Google something on the internet": here "Google" is used as a verb, although it is a proper noun. These are some of the ambiguities that occur while processing natural language. To detect named things specifically, we have named entity recognition, also known as NER. It is the process of detecting named entities such as person names, company names, quantities, or locations. It has three steps: noun phrase identification, phrase classification, and entity disambiguation. If you look at this particular example, "Google CEO Sundar Pichai introduced the new Pixel 3 at New York Central Mall", Google is identified as an organization, Sundar Pichai as a person, New York as a location, and Central Mall is also tagged as an organization.
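As a quick illustrative sketch of these steps with NLTK: the sentence is the one from the example above, the resource names are NLTK's standard downloads, and the exact tags can vary a little between NLTK versions.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time resource downloads (standard NLTK corpora/models)
for pkg in ["punkt", "averaged_perceptron_tagger", "wordnet",
            "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)

sentence = "Google CEO Sundar Pichai introduced the new Pixel 3 at New York Central Mall"

tokens = nltk.word_tokenize(sentence)                      # tokenization
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])                   # stemming
print([lemmatizer.lemmatize(t) for t in tokens])           # lemmatization
pos_tags = nltk.pos_tag(tokens)                            # POS tagging
print(pos_tags)
print(nltk.ne_chunk(pos_tags))                             # named entity recognition
```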
Once we have divided the sentences into tokens, performed stemming and lemmatization, added the POS tags, and applied named entity recognition, it is time to group things back together and make sense out of them. For that we have chunking. Chunking basically means picking up individual pieces of information and grouping them into bigger pieces, and these bigger pieces are known as chunks. In the context of NLP, chunking means grouping words or tokens into phrases. As you can see here, "pink" is an adjective, "panther" a noun, and "the" a determiner, and together they are chunked into a noun phrase. This helps in getting insights and meaningful information from the given text. Now you might be wondering where one can run all of these functions on a given text file. For that, Python provides NLTK, the Natural Language Toolkit, a library that is heavily used for natural language processing and text analysis.
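Here is a small sketch of chunking with NLTK's RegexpParser; the noun-phrase grammar is just one illustrative pattern, not the only possible one.

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "the pink panther"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# A simple noun-phrase chunk: optional determiner, any adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

print(tree)   # (S (NP the/DT pink/JJ panther/NN))
```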
Python NLTK Explained
So what is NLP? Natural language processing, or NLP for short, is an automated way of representing and processing human language. What I'm trying to say is that here we try to develop applications and services that can understand human language. Some practical examples of NLP are Google voice search, sentiment analysis, and many more. As I mentioned earlier, we use NLP to extract meaningful information from textual data. So, do you think NLP is a magical tool where we pass in some text and get the desired output? That isn't the case. As a matter of fact, raw text input has to go through various stages before we can perform operations on the textual dataset. As you see here in the pipeline, raw text data undergoes data cleaning, which involves steps like tokenization, stop word removal, lemmatization, and many more. The next step is vectorization, where we convert our text data into numerical format. Finally, based on the requirements, we perform the classification task. Let's look at a few of these steps in detail, starting with cleaning our data. As mentioned earlier, the goal here is to convert raw text into clean text, and this involves steps like tokenization, stop word removal, stemming, and so on. Speaking of tokenization: tokenization is essential for splitting a sentence, a paragraph, or an entire text document into smaller units such as individual words or phrases; each of these smaller units is called a token. Then we have stop word removal. Stop word removal refers to filtering out words whose presence in a sentence makes no difference to the analysis of our data. Why do we remove them? We remove stop words so that our model does not get unnecessarily complicated. In the next step, we have something called stemming. Stemming is the process of reducing a word to its root form; essentially, we chop off the suffix. For example, consider the word "giving": once stemming is performed, "giving" ends up becoming "give". Moving ahead, we have vectorization. Text vectorization is the process of converting text into a numerical representation. Here we end up creating something called a bag-of-words model, which represents a text document by describing the occurrence of words within it. Finally, coming to the classification task: text classification, also known as text tagging or text categorization, is the process of categorizing text into organized groups using natural language processing. Text classification can automatically analyze text and then assign a set of predefined tags or categories based on the content. Now let's move ahead and understand an open-source tool called NLTK. NLTK stands for Natural Language Toolkit. It is one of the most powerful NLP libraries, containing packages that help machines understand human language and respond to it appropriately. So why do we need NLTK? NLTK has many built-in packages to process our textual data at every stage; we can perform tasks like data cleaning, visualization, and vectorization that will help us classify our text. Let me now move to my code editor and show you how we can pre-process, or clean, our data using NLTK. As you can see here, I'm going to use Google Colab, although you can use any code editor such as Jupyter Notebook or Visual Studio Code.
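Before jumping into the walkthrough, here is a minimal sketch of the vectorization step described above, using scikit-learn's CountVectorizer on a couple of made-up sentences; the walkthrough itself focuses on the cleaning stages.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made-up sentences) to illustrate a bag-of-words representation
docs = [
    "NLP turns raw text into numbers",
    "Raw text must be cleaned before vectorization",
]

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
bow = vectorizer.fit_transform(docs)        # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # the vocabulary (columns)
print(bow.toarray())                        # word counts per document
```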
In the next stage we need a dataset. To get one we can use scikit-learn: from sklearn.datasets we import fetch_20newsgroups, which gives us the 20 Newsgroups dataset. To execute a cell in Colab, you just press Shift+Enter. We create our text data by calling fetch_20newsgroups(), which returns a Bunch object; if you are downloading it for the first time, it will take a little while. Checking type(text_data) confirms that it is a Bunch, so we import NumPy and work with the underlying list, because we cannot conveniently operate on the Bunch itself. We assign raw_text = text_data.data and print it: as you can see, it is a huge amount of data, a list of long paragraphs separated by commas (make sure you don't confuse this with a CSV). We don't want to take the entire dataset, because that would be computationally expensive and slow to execute, so to keep things easy to follow we take only the first four documents using the slice operation raw_text[:4]. Now that we have our text data, it is time to start cleaning it, beginning by converting everything to lower case; a compact version of these loading cells is sketched below.
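Putting the loading and slicing steps together, a compact version of those cells might look like this (variable names follow the walkthrough):

```python
from sklearn.datasets import fetch_20newsgroups

# Fetch the 20 Newsgroups corpus (downloads on first run)
text_data = fetch_20newsgroups()

raw_text = text_data.data[:4]   # keep only the first four documents
print(type(text_data))          # sklearn Bunch object
print(raw_text[0][:200])        # peek at the start of the first document
```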
Since we will need to lower-case any input we take, whether it is training or test data, rather than writing the same code again and again let's wrap it in a method, so that next time we just call it. We define a method, say to_lower(data), with a for loop: for each document in the data, we append str.lower(document) to a list called clean_text_stage1 (note the built-in is str.lower, not st). Then we call to_lower(raw_text). Executing this and comparing with the raw text, you can see that words like "From" and "WHAT car is this" have all been converted to lower case. In the next stage we have tokenization. As I mentioned earlier, tokenization converts a sentence or a paragraph into individual sentences or individual words. To split text into words we have word_tokenize, and to split it into sentences we have sent_tokenize. Let me label the next block "Stage 2: tokenize" and create an empty list, clean_text_stage2. We import the tokenizers with from nltk.tokenize import sent_tokenize, word_tokenize, and before using them we also have to download a resource called punkt: import nltk and run nltk.download('punkt'). Now we perform sentence tokenization: for each document in clean_text_stage1, we call sent_tokenize on it and append the result to a list (I'm showing sentence tokenization only for the demo; we will use word tokenization going forward). Where earlier we had a one-dimensional list in which each element was an entire paragraph, after this step each document has been split up, because within a paragraph we all know there are multiple sentences.
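A minimal version of the lower-casing and sentence-tokenization cells, assuming raw_text from the previous sketch, might look like this:

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)   # tokenizer models

def to_lower(data):
    """Stage 1: lower-case every document in the list."""
    return [str.lower(doc) for doc in data]

clean_text_stage1 = to_lower(raw_text)

# Stage 2 (demo): split each lower-cased document into sentences
sent_tokens = [sent_tokenize(doc) for doc in clean_text_stage1]
print(sent_tokens[0][:3])            # first few sentences of the first document
```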
So, as you can see, we now have a two-dimensional list: each inner list represents a document, and each element within it is one sentence of that document. What we actually need going forward, though, is word tokenization, so let's store that in clean_text_2. Rather than initializing an empty list and writing a for loop with append, as we did so far, I'll show you a simpler way: list comprehension, where the for loop is written inside the list itself. So clean_text_2 = [word_tokenize(i) for i in clean_text_stage1]. Looking at clean_text_2, every word in every sentence has been converted into a token; it is still a two-dimensional list, where each inner list represents a document, and everything inside it has now been converted into individual word tokens. In the next stage we want to remove punctuation: our dataset contains special characters, punctuation, '@' signs, and dots that we don't want. For that we use regular expressions. We import the re module, and since this is two-dimensional data we need two for loops (unlike before, when the data was one-dimensional). We create an empty list clean_text_3; then, for each document (list of words) in clean_text_2, we create an inner list, and for each word w in that document we apply a pattern: re.sub with a raw-string pattern matching anything that is not a word character or whitespace, substituting it with an empty string. If the result is not empty, we append it to the inner list, and finally we append the inner list to clean_text_3.
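A compact version of the word-tokenization and punctuation-removal cells, assuming clean_text_stage1 from the previous sketch:

```python
import re
from nltk.tokenize import word_tokenize

# Word-tokenize each lower-cased document (list comprehension form)
clean_text_2 = [word_tokenize(doc) for doc in clean_text_stage1]

# Strip punctuation / special characters, keeping only word characters
clean_text_3 = []
for words in clean_text_2:
    clean = []
    for w in words:
        s = re.sub(r"[^\w\s]", "", w)   # drop anything that is not \w or \s
        if s:                           # skip tokens that became empty
            clean.append(s)
    clean_text_3.append(clean)

print(clean_text_3[0][:20])
```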
According to our analysis, all the semicolons and special characters should have disappeared by now, and indeed the dataset no longer contains any special characters; it is still a 2D list, the only difference being that only alphanumeric values are left. That is great, so in the next stage we remove stop words. I hope you remember what stop words are: as I mentioned earlier, stop words are those most common, repetitive words whose presence adds nothing to the analysis. To remove them we first download the list: import nltk and run nltk.download('stopwords'). (You could also find a stop word list on Google, put it into a Python list, and skip any word from the dataset that appears in it, but NLTK already ships one.) Let me title this block "Stop word removal". Having downloaded the stop words, we do from nltk.corpus import stopwords. We create clean_text_4 and, similar to the previous loop, iterate over each document in clean_text_3 with an inner empty list; for each word, if the word is not in stopwords.words('english'), that means it is not a stop word, so we append it, and then we append the inner list to clean_text_4. This takes a while to execute because it has to go through a lot of data, so please be patient. Looking at the output, we have removed a number of unnecessary words and kept just the important ones; a condensed sketch of this cell follows below.
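A condensed sketch of the stop-word-removal cell, assuming clean_text_3 from the previous sketch; a set is used for the lookup because it is much faster than repeatedly calling stopwords.words inside the loop.

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))   # set lookup is O(1)

clean_text_4 = []
for words in clean_text_3:
    kept = [w for w in words if w not in stop_words]   # keep only non-stop-words
    clean_text_4.append(kept)

print(clean_text_4[0][:20])
```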
Now, moving ahead to our next stage, stemming. I hope you remember what stemming is: whatever word we have, we convert it to its root form, so, for example, "processing" becomes "process". As I mentioned earlier, we use stemming to strip off those suffixes, and to perform stemming we have various types of stemmers, such as PorterStemmer, SnowballStemmer, and LancasterStemmer. In this example we will use the Porter stemmer: from nltk.stem.porter import PorterStemmer, and then we create an instance, port = PorterStemmer(). To show what it does, we call port.stem, passing the word we want to stem, inside a list comprehension over a few example words: "reading", "washing", a word without a suffix like "wash", and "driving". The output I expect is that "reading" should be reduced to "read", "washing" to "wash", "wash" should remain the same because there is no suffix to strip, and "driving" should become "drive". Printing the result (after fixing a small typo in "driving"), you can see that all the suffixes have been removed and the stems still make sense. That is usually the case with the Porter stemmer, but it does not always hold when we use the Lancaster or Snowball stemmers. Let's now apply the stemmer to our dataset. As before, we need a loop: we create an empty list clean_text_5, loop over each document, build an inner list by appending port.stem(word) for each word, and then append the inner list to clean_text_5. Looking at clean_text_5, there are obviously a few oddities, because this stemmer does not handle every word well (that is why multiple stemmers exist), but in most places you can see that the words have been reduced to their stems. So I hope you now understand how to perform stemming. As I mentioned, we have multiple stemmers, like Porter and Lancaster, and each is unique in its own way; sometimes stemming produces words that make no sense, which can be really annoying. To overcome that, we have lemmatization. Lemmatization here uses WordNet: from nltk.stem import WordNetLemmatizer (it is a form of stemming, but it makes sure the output word is meaningful). We create an instance, say wnet = WordNetLemmatizer(), and we also have to download a package: import nltk, then nltk.download('wordnet').
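To see the difference between the two approaches side by side, here is a small sketch comparing the Porter stemmer and the WordNet lemmatizer on a few example words; exact outputs can vary slightly between NLTK versions.

```python
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)

port = PorterStemmer()
wnet = WordNetLemmatizer()

words = ["reading", "washing", "wash", "driving", "studies"]
print([port.stem(w) for w in words])       # roughly: read, wash, wash, drive, studi
print([wnet.lemmatize(w) for w in words])  # default noun POS: studies -> study,
                                           # verb forms need lemmatize(w, pos="v")
```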
Back in our pipeline, we create a list for the lemmatized words, say lem = [], and loop over clean_text_4 rather than clean_text_5, because lemmatization is an alternative to stemming, not a step on top of it. It is the same drill as before: an inner empty list, another for loop, and we append wnet.lemmatize(word) for each word; once that is done, we append the inner list to the bigger list lem and execute. Printing all of lem fails because the output is too long, so instead we slice it to get a glimpse of the data; as you can see, after lemmatization the words still make sense. To give you a quick insight, let's compare how the data looked earlier and how it looks now: we print the raw text and then our final cleaned text, clean_text_5 (the stemmed version). Since the data is pretty large, I'll slice out just the first document. In the cleaned version all the words have been tokenized and everything looks organized, unlike the raw version. We obviously want to pre-process our data, because the cleaned form makes more sense, is easier for a system to analyze, and classification on it will be far more accurate than on the raw text. I hope you now understand why we need to pre-process our data. Now that we know how to pre-process data using NLTK, let's see how text classification is done. To classify our text we use the Naive Bayes algorithm. So what is Naive Bayes? Before we get to it, let's understand what classification is: in simple words, classification means grouping data based on common characteristics. As you see here, we have a few figures, triangles, circles, and a square, and when we pass them through a classification algorithm, they get categorized into different classes based on shape, size, and whatever other features there are. The Naive Bayes algorithm works in a similar way. The principle that drives Naive Bayes is Bayes' theorem, and we use Bayes' theorem to calculate conditional probability. So let's look at the maths behind conditional probability. As I mentioned, we use the Naive Bayes algorithm to perform classification on our textual data, and it is built on Bayes' theorem, which works in terms of conditional probability: the probability that event A occurs given that event B has already occurred equals the probability of B given A, times the probability of A, normalized by the probability of B, i.e. P(A|B) = P(B|A) x P(A) / P(B).
You might be confused by the notation, so let me explain. The vertical bar represents conditioning: P(A|B) is the conditional probability. Let's take event A to be going shopping and event B to be rain: P(A|B) is then the probability that you go shopping given that it has already started raining. There are a couple of terminologies you need to know when dealing with this equation. P(A|B), the quantity we want to find, is called the posterior probability; P(B|A) is called the likelihood; P(A) is called the prior probability (as the name suggests, "prior" refers to what we already believe before seeing the evidence); and P(B), the term used only for normalization, is called the marginal likelihood. Now, speaking of probability, let's see where the concept comes from. Probability exists only because of random variables, which give rise to randomness. To get a better understanding, take an example with two bags. In bag one we have five red balls and nothing else. Does probability really exist here? No matter which ball you pick, you get a red ball, so the randomness is zero. Bag two has five red balls and four blue balls, so here probability absolutely comes into play: if I reach in and pick a ball, the probability of getting blue is the number of blue balls divided by the total number of balls, which is 4/9. That is what probability is, and the more randomness there is, the more it matters. Now let's try to derive the conditional probability equation. We start from the joint probability: P(A ∩ B) = P(A|B) x P(B); call this equation one. Similarly, P(B ∩ A) = P(B|A) x P(A); the only difference is that A and B swap roles. Because the intersection is commutative, P(A ∩ B) = P(B ∩ A), so when we equate the two we get P(A|B) x P(B) = P(B|A) x P(A).
Bringing P(B) to the other side, we get P(A|B) = P(B|A) x P(A) / P(B). This is Bayes' theorem, and as you can see it is exactly the formula we had earlier; that is how we derive it. Now you might be wondering how to use Bayes' theorem for classification problems. A quick recap: classification is categorizing data based on its characteristics. Here we will have a dataset: X, a group of values (the text data), and y, which refers to the classes; a class can be 0, 1, and so on. In our example we take 0 and 1, where 0 means not spam and 1 means spam, and X is a collection of emails. Putting this into Bayes' theorem: the probability that an email is spam given the email itself, P(y=1|X), equals the probability of that email given the spam class, P(X|y=1), times the prior probability of spam, P(y=1), all divided by P(X). Similarly, for a non-spam email, P(y=0|X) = P(X|y=0) x P(y=0) / P(X), i.e. given the non-spam label, what is the probability of that email, times the prior of y=0, divided by P(X). That is what Bayes' theorem looks like for deciding whether an email is spam or not. To better understand what each term represents, imagine a training dataset with X values (emails) and y values (the class: 0 if not spam, 1 if spam, and so on). Call this the training data, X_train with its labels. Then there is test data, X_test, where we do not know which class each email belongs to: someone hands me an email and says, find out whether this email is spam or not. Those are the question marks; our job is to train the model and figure out which class each test email belongs to. The term P(X|y=1) means: given the class y, i.e. given that we already know an email is spam, what is the probability of seeing this particular email?
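Here is a tiny worked example of Bayes' theorem with made-up counts, just to make the terms concrete:

```python
# Toy numbers: 40 of 100 training emails are spam; the word "free" appears
# in 30 of the 40 spam emails and in 6 of the 60 non-spam emails.

p_spam          = 40 / 100      # prior P(y=1)
p_not_spam      = 60 / 100      # prior P(y=0)
p_free_spam     = 30 / 40       # likelihood P("free" | spam)
p_free_not_spam = 6 / 60        # likelihood P("free" | not spam)

# Marginal likelihood P("free"), used only to normalise
p_free = p_free_spam * p_spam + p_free_not_spam * p_not_spam

p_spam_given_free = p_free_spam * p_spam / p_free
print(round(p_spam_given_free, 3))   # ~0.833 -> classify as spam
```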
Once we have computed that term, along with the prior, then when we are handed a test email X_test we simply ask: what is the probability that this particular email belongs to class 0, and what is the probability that it belongs to class 1 (0 means not spam, 1 means spam)? We basically get a numerical value for each. For example, suppose the test email says "free food". "Free" is a keyword that usually shows up in spam, so the probability of this email being spam will be high: say the spam posterior comes out around 80% and the not-spam posterior around 20%. Which of these is higher? Obviously the spam value, so this email is classified as spam. That is basically how it works. Now, in order to find the posterior probability we have to calculate the likelihood, the prior probability, and the marginal likelihood, although we can ignore the marginal likelihood because it only normalizes the result. Finding the prior is pretty simple: it is the total number of spam emails divided by the total number of emails, and similarly the number of non-spam emails divided by the total. The only tricky part is the likelihood, so let's see how to compute it. To start, say we are given 100 emails, of which 40 are spam and 60 are not spam, with not-spam represented by 0 and spam by 1 (that is our y). Picture it as a table: the rows are the 100 emails, indexed 0, 1, 2, and so on up to 100, and alongside them is y, the class, which for each email is either 0 or 1 (the exact labels are just an assumption for the example). Now we calculate the prior probability. P(y=1) is the count of all spam emails divided by the total number of emails: the total is 100 and the number of spam emails is 40, so P(y=1) = 40/100. Similarly, for P(y=0) the denominator is again the total number of emails and the numerator is the number of non-spam emails.
That gives 60 out of the 100 total emails, so P(y=0) = 60/100. To put this in mathematical form, since we will eventually turn it into a formula, the prior is just an average: P(y=c) = (1/m) Σ 1{y_i = c}, where c is 1 or 0, m is the number of training emails, and the sum runs over all of them; we are basically adding up a one for every email whose label matches. That is how we calculate the prior probability, and as I mentioned we do not have to calculate the marginal likelihood. Finally we come to the important part, the likelihood, which is the most important and the trickiest term to calculate, although it is quite simple once you understand the maths behind it. The likelihood is P(X|y=1) (or P(X|y=0)): given that we already know an email belongs to the spam (or non-spam) class, what is the probability of seeing that email in that group? Before we move ahead, note what X looks like: X is an email, so it contains multiple words; somewhere in the middle it might say "get unlimited 50% off" and so on. These words are the features, and based on these features we decide whether the email belongs to the spam class or the non-spam class. How does this probability work? It takes each feature in turn. Take the word "unlimited": P(unlimited | spam) might give some high value, say 0.9, because "unlimited" very often appears in spam email, while at the same time P(unlimited | not spam) will be low, since you are not going to use "unlimited" much in day-to-day conversation. Let's now formalize this. Capital X is the list of words, i.e. the whole email, and lowercase x denotes the individual words in it: x1, x2, x3, x4, up to xn. These are the features, and X is the entire email. What we need to find is P(X | y=0), the joint probability of all of these individual words given the non-spam class, and similarly P(x1, x2, x3, x4, ..., xn | y=1) for spam: given an entire email, what is the probability of all of its words, its whole content, under the spam class? Let's look at the expanded version of this and see how we can calculate each term.
All I am doing here is expanding P(x1, x2, x3, ..., xn | y). The comma represents "and"; it is an AND of events. So the joint probability is P(x1 | y=0), multiplied by P(x2 | y=0, x1), the probability of x2 given the class and given that x1 has also occurred, multiplied by P(x3 | y=0, x1, x2), the probability that the third word belongs to the non-spam category given that x1 and x2 do as well, and so on. In other words, each word's probability is conditioned on all of the previous words, and the whole chain holds only if the earlier terms hold. The issue is that by the time we reach xn, these conditional terms become huge and the computation becomes very expensive. To overcome this, we use the Naive Bayes assumption. The assumption says that, given the class, each word is treated as totally independent of the others: once we have P(x1 | y), the probability of the second word does not depend on the first word at all. The first word can have a high probability of being spam and the second a low one, but they are independent of each other. That is the "naive" part of Naive Bayes. Let's see how the equation looks after applying the assumption, writing the terms one below the other for clarity. For y=1, the likelihood of the email is P(x1 | y=1), the probability of the first word given spam, multiplied by P(x2 | y=1), which no longer depends on x1, multiplied by P(x3 | y=1), and so on up to the nth term, P(xn | y=1). None of the probabilities depend on each other any more, which drastically reduces the computation needed. To put it in mathematical form: just as we use Σ for a sum, we use Π for a product. So P(X | y=1), the probability of the email under the spam category, is Π over i = 1 to n of P(x_i | y=1), and in the same way P(X | y=0) = Π over i = 1 to n of P(x_i | y=0). This is the equation for our likelihood. So we have now found the likelihood and the prior, we can skip the marginal likelihood, and finally we come to the posterior probability; let's see how to substitute our values and calculate it.
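Before writing out the posterior, here is a quick numeric sketch of how the per-word likelihoods and the prior combine under the naive assumption; all of the numbers are made up, and the log-space trick is an implementation detail rather than something from the derivation above.

```python
import math

# Made-up per-word likelihoods P(word | class), e.g. estimated from word counts
likelihood_spam     = {"free": 0.75, "food": 0.20, "meeting": 0.05}
likelihood_not_spam = {"free": 0.10, "food": 0.30, "meeting": 0.40}
prior = {"spam": 0.4, "not_spam": 0.6}

email = ["free", "food"]

# Naive Bayes: multiply independent per-word likelihoods, then the prior.
# Summing logs avoids numerical underflow for long emails.
score_spam = math.log(prior["spam"]) + sum(math.log(likelihood_spam[w]) for w in email)
score_not  = math.log(prior["not_spam"]) + sum(math.log(likelihood_not_spam[w]) for w in email)

print("spam" if score_spam > score_not else "not spam")
```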
The posterior probability, in its generalized form, is P(y=c | X), where c refers to the class, and the class here is either spam or not spam. On top we have the likelihood, Π over i = 1 to n of P(x_i | y=c), which goes through each and every word in the email, with c being either 1 or 0, multiplied by the prior probability, (1/m) Σ 1{y_i = c}. Strictly we should also divide by the normalization term P(X), but we can skip it because it makes no difference to which class wins. This is the equation for the Naive Bayes algorithm, and this is how we can classify our text. Let us now go to the code editor and code the entire algorithm; this will help us understand the kind of underlying machinery that libraries like NLTK and scikit-learn rely on. I have opened Google Colab and named the notebook "classification implementation", so let's get started. There are a couple of things we have to import: import pandas as pd, import numpy as np, and then a label encoder. We use a label encoder to convert our text into numerical form; if you are wondering why, it is because a computer, no matter how advanced, cannot work directly with text, so it has to take the data in numerical format. So from sklearn.preprocessing import LabelEncoder, and finally from sklearn.model_selection import train_test_split. Let's execute this. The next stage is getting our dataset; here we will use the mushroom dataset. We read it into a DataFrame: df = pd.read_csv(path), where the path points to the file I have uploaded, and execute the cell. Checking df.shape shows that we have 8,124 rows and 23 columns. Let's look at the data with df.head(): all of our data is in textual format, the index runs from 0 to 8,123, and the features are things like cap-shape, cap-color, and cap-surface (the surface of the mushroom). The "class" column is our y, and all the other feature columns together are X. Now let's encode this, i.e. convert these text values into numbers. We create le = LabelEncoder(), and then df_encoded = df.apply(le.fit_transform, axis=0). The apply method works like a loop: it goes through the DataFrame and applies whatever function we pass, here le.fit_transform (we pass the method name, with no call). With axis=0, the default, the function is applied column by column, which is what we want, since each feature should get its own encoding.
Let's execute this and see how the data looks now. Let me call head on it and zoom out so we can compare it with the original: these are the same data, the class column, cap shape and so on, only now, after label encoding, all the values have been converted into numbers. Next we need to convert this into an array, so we take df_encoded.values. Now we define X and Y. The class column is Y, and X is all the other features, so we want every row, from 0 up to 8,000 and change, and every column except the first one, that is, from column 1 onwards, because the class is column 0. For Y it's all the rows and only the zeroth column, since the zeroth column gives us the class. Let's execute this and, just for your satisfaction, look at them: X (upper case, by convention) has everything converted into an array with all the feature columns, and Y is just the single class column. Now we split our data using train_test_split, passing X, Y, and then the proportion we want to hold out: we give test_size = 0.2, which means 20% of the data goes into the test set, along with a random_state. Let me execute this. Now that we have all the data prepared, let's jump straight into building our Naive Bayes classifier. A quick recap: to build a Naive Bayes classifier we need the posterior probability, and to get the posterior we need the likelihood and the prior probability. Let's calculate each of these, starting with the prior probability because it's the simplest. We'll create a function, def prior_probability, which takes y (let's generalize it to y rather than y_train) and a label, where the label is one of the class values, 1 or 0. The prior probability is just the count of examples with that class divided by the total number of examples. In other words, if you have 100 emails, 40 spam and 60 not spam, the probability of an email being spam is 40/100 and of it not being spam is 60/100.
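A quick sketch of this step, assuming df_encoded from above (column 0 is the class, the rest are features; the random_state value is just illustrative):

data = df_encoded.values
X = data[:, 1:]            # all rows, every column except the class
Y = data[:, 0]             # all rows, just the class column

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)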
So inside the function we first need the total size: m = y.shape[0], since shape gives us a tuple and we need the first value. Then we need the count: s = np.sum(y == label). I hope you see why this works: y contains only class values, so the comparison gives a 1 exactly where the class value matches the label, and summing counts those matches. The function then returns the prior probability, s divided by m. Now let's work towards the likelihood, and for that we need a conditional probability function: def conditional_probability, which takes x_train and y_train, then a feature column, a feature value, and finally a label. The feature column tells us which column of the table we're looking at, and the feature value is which value within that column we care about. Remember this is tabular data, a table with many rows and columns, so the question is: within this column, how often does this value appear for the given class? First we filter: x_filtered is x_train where y_train equals the label, so we keep only the rows belonging to that class (the spam or non-spam emails, in the email example). The numerator is np.sum over x_filtered, taking all rows and just the feature column, compared against the feature value. The denominator is x_filtered.shape[0], the total number of rows in that class. We return numerator divided by denominator as a float; this is exactly the conditional probability we discussed in the derivation. Next we obviously have to predict the class, so let's write def predict. It takes x_train and y_train, and since we also need something to predict on, it takes a test example as well. The classes here would be spam or not spam; in this dataset they are poisonous or not poisonous, since we're using the mushroom dataset. So classes = np.unique(y_train). What predict returns is the predicted class for whatever example we feed it. We also need the number of features, n_features = x_train.shape[1]. Then we set up an empty list for the posterior probabilities: for every class we'll get some score, and that score represents the probability of the example belonging to that class. We loop: for label in classes, which means we go class by class, spam or non-spam.
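The two helpers just described might look like this (a sketch following the walkthrough; the parameter names are only illustrative):

def prior_probability(y, label):
    # P(y = label): how many training examples carry this class, out of all of them
    m = y.shape[0]
    s = np.sum(y == label)
    return s / m

def conditional_probability(x_train, y_train, feature_col, feature_value, label):
    # P(x[feature_col] = feature_value | y = label):
    # keep only the rows of the given class, then count how often
    # that column takes the given value
    x_filtered = x_train[y_train == label]
    numerator = np.sum(x_filtered[:, feature_col] == feature_value)
    denominator = x_filtered.shape[0]
    return numerator / float(denominator)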
Here the classes are poisonous or non-poisonous, so the class values can be either 0 or 1. Inside the loop we initialize the likelihood to 1.0, and then for feature in range(n_features), where n_features is the number of columns (all the feature columns, everything except the class), we go through each and every feature, in other words each and every column, and compute the conditional probability: cond = conditional_probability, calling the function we just wrote, passing x_train, y_train, the feature index, the value of that feature in the test example, and the label, where the label says which class, 1 or 0, we're currently scoring. Then we update the likelihood: likelihood = likelihood * cond. We initialized it to 1.0, and on every iteration of the for loop it gets multiplied by another conditional probability. After the feature loop we bring in the prior: prior = prior_probability(y_train, label), which is the function we defined earlier, and the posterior for this class is likelihood times prior; I hope you remember that part of the mathematical equation. We append this value: posterior_prob.append(post). Finally we need the class with the maximum probability, so we use np.argmax, which tells us in which position the highest value sits, and we apply it to the posterior probability list; that gives us the predicted class, which the function returns. All right, that's done. Now let's also measure how accurate this is, so let's write an accuracy function: def accuracy, which takes x_train, x_test, y_train and y_test. Inside it, the predictions start as an empty list, and for i in range(x_test.shape[0]), that is, for every test example, we call p = predict(x_train, y_train, x_test[i]). Each time predict runs it tells us whether one example belongs to the poisonous mushroom class or not, and the only reason for this for loop is to collect those predictions for all the test values; the list holds the predicted value for each test example. So we append whatever we get, y_pred.append(p), and then we convert that list into a NumPy array. The sketch below puts these pieces together.
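Putting predict and accuracy together, a minimal sketch (it reuses prior_probability and conditional_probability from above; np.argmax gives the index of the winning class, which here also equals the class value since the classes are 0 and 1):

def predict(x_train, y_train, x_test_row):
    classes = np.unique(y_train)                 # e.g. [0, 1]
    n_features = x_train.shape[1]
    posterior_prob = []
    for label in classes:
        likelihood = 1.0
        for f in range(n_features):
            # multiply in P(x_f = value | y = label) for every feature column
            likelihood *= conditional_probability(x_train, y_train, f, x_test_row[f], label)
        prior = prior_probability(y_train, label)
        posterior_prob.append(likelihood * prior)
    return classes[np.argmax(posterior_prob)]    # class with the highest posterior

def accuracy(x_train, y_train, x_test, y_test):
    y_pred = np.array([predict(x_train, y_train, x_test[i]) for i in range(x_test.shape[0])])
    return np.sum(y_pred == y_test) / y_pred.shape[0]

print(accuracy(x_train, y_train, x_test, y_test) * 100)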
So we wrap the predicted list in np.array. Now, to get the accuracy, we use np.sum on a comparison between the predicted values, which are just ones and zeros, and the values we already know; this is simply how you test how accurate the model is. We check where y_pred equals y_test, every match adds one, and to turn the count into a fraction we divide by the size, y_pred.shape[0], and the function returns that accuracy. Let's call accuracy with all these values, and before I run this, a quick recap: our main agenda was to classify whether these mushrooms belong to the poisonous class or not, and we do that using the posterior probability. To find the posterior we need two things, the likelihood and the prior probability. The prior is simple, it's just one count divided by another, which is what prior_probability does, and the likelihood is built from the conditional probabilities, after which predict returns the predicted class. Let me run this and see the accuracy of the model. Okay, it's complaining that x_test is not defined, and the reason is that we used a lower-case name in one place and upper case in another; let me fix the case and rerun. To be safe I'll rerun everything from the start: sometimes when you execute multiple cells, one block gets executed before another, so I've restarted the runtime and run it from the beginning. Now there are no errors, so let's see the accuracy we got: about 0.99, and multiplying by 100, the classification comes out at roughly 99.63% accuracy. So this is how you can perform a classification task with the Naive Bayes algorithm. All right. Now that we know how NLP works, what Naive Bayes classification is and how it works, and how to pre-process our data, let's take a handful of sentences, small ones, and see if we can perform some sentiment analysis on them. Here we'll be using the scikit-learn library, and with scikit-learn we don't have to write all the lines we just wrote by hand. Let me move to the code editor and show you how to implement that. Let's rename this notebook "sentiment analysis". We'll have some text here, and we deliberately won't use a huge amount of data, because with a huge dataset it's harder to follow what's happening. Let me grab that text data.
So here in my notepad I have a small amount of data: this is going to be X_train, and then we have y_train. Let's analyze it. We have a small dataset we're supposed to train the model on, and we also have classes: the class tells us whether each sentence is positive or negative, 1 for positive and 0 for negative, and so on for all the movie-review style sentences here. Once we're done training the model, we'll test it by passing in these values: three test sentences, things like "I was happy and I loved the acting in the movie" and "the movie I saw was bad", and we can add some more examples. Let me copy this text over as our dataset and execute it. Now let's check the shape of X. I'm using upper-case X and lower-case y on purpose, because that's the standard convention in the data science community. Let's look at the training data; we have the dataset, but we won't get a shape attribute here because it's a plain Python list, not a NumPy array, and that's fine. Now we have to clean the data, so let's do the data-cleaning part, which covers things like tokenization, stemming and stop-word removal. Let's give it a heading, "Data cleaning". Rather than writing each step as an individual function, I'll write it all as one method. First the imports: from nltk.tokenize import RegexpTokenizer, then from nltk.stem.porter import PorterStemmer, and finally we need stop words, so from nltk.corpus import stopwords. Let me also download the stop-word list: import nltk and then nltk.download('stopwords'). Now let's create objects for the tokenizer, the Porter stemmer and the stop words. The tokenizer is a RegexpTokenizer, and I pass the pattern I want, which keeps only word characters so we get clean words. Then the stop words: which language am I using? English, obviously, so we take stopwords.words('english'). And then the Porter stemmer, ps = PorterStemmer(). All I've done here is create objects of these classes. Now we'll create a function, def get_clean_text, which takes the text. First we convert the text to lower case, and then we perform tokenization: tokens = tokenizer.tokenize, where tokenizer is the object we just created.
We tokenize the text, and then new_tokens is a list comprehension: token for token in tokens if token not in the stop-word list. Since tokenize gives us a list, this comprehension walks over the tokens and keeps only the ones that are not stop words. In other words, the "for token in tokens" part gives me the stream of tokens, and the "if" part compares each token against the stop-word list and keeps it only if it isn't a stop word, so I'm doing tokenization and stop-word removal at the same time. Next we perform stemming the same way, with another comprehension: ps.stem(token) for token in new_tokens, where ps is the PorterStemmer object we named earlier. Finally we join the stemmed tokens back into a single string, the clean text, and the method returns that clean text. Let me execute this. Now I'll use get_clean_text to clean the training and test data: X_clean from X_train, and similarly for the test set. When I first run it I get an error saying the list has no such attribute, and that's fair: I'm passing the whole list instead of individual strings. So instead of passing the entire X_train at once, I wrap it in a list comprehension, applying get_clean_text to each i in X_train, and do the same for X_test. We also haven't defined X_test yet, so let me grab the test sentences from the notepad and paste them in. The next error is because I defined the variable as new_tokens in one place and new_token in another, so let me fix that and run again. It still says X_test is not defined, so we fix that too and rerun. Most of the time when you write this kind of program you'll encounter a lot of small issues, and it's only by encountering them that you really learn. Now we have our clean text, so let's compare it with the original: the text has been reduced, and to make it readable I join with a space (the words were running together earlier because I hadn't added the space in the join). Let me restart the runtime and run everything from the top; now we get proper spaces. Next comes the classification task, but before that we have to vectorize the text: as I mentioned, to perform classification on text we need to vectorize it.
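Here is a compact sketch of the cleaning pipeline just described, assuming X_train and X_test are plain Python lists of sentences:

from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')

tokenizer = RegexpTokenizer(r'\w+')          # keep only word characters
sw = set(stopwords.words('english'))
ps = PorterStemmer()

def get_clean_text(text):
    text = text.lower()
    tokens = tokenizer.tokenize(text)                              # tokenization
    new_tokens = [token for token in tokens if token not in sw]    # stop-word removal
    stemmed_tokens = [ps.stem(token) for token in new_tokens]      # stemming
    return " ".join(stemmed_tokens)

X_clean = [get_clean_text(x) for x in X_train]
Xt_clean = [get_clean_text(x) for x in X_test]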
So from sklearn.feature_extraction.text import CountVectorizer, and we create an instance, cv, giving an ngram_range of (1, 2). To vectorize the input we call cv.fit_transform, passing X_clean, and convert the result into an array. Let's execute this and look at the vectorized output: basically, for every word (and word pair) we get a count in this vector. Before classifying, let's also print the feature names from the vectorizer, because these raw numbers don't tell us anything on their own; you don't know what a 0 or a 2 in a given position represents. Looking at the feature names, the first column here corresponds to "act", and the same mapping holds for all five rows of the array. What the CountVectorizer is telling us is how many times the word "act" appears in each sentence; this kind of representation is usually referred to as the bag-of-words model. We vectorize the test data the same way, calling cv.transform on the cleaned test text and converting to an array, and execute that. Finally, the classification itself: we'll use Multinomial Naive Bayes. If you didn't know, there are multiple variants of Naive Bayes available, and for text classification we use the multinomial one. So from sklearn.naive_bayes import MultinomialNB, create the instance, and fit the model: we call fit, passing the vectorized training text and the y values. Running that complains that y isn't defined; of course, it's y_train, not y, so let me fix that and execute. Now we have our Multinomial Naive Bayes model, so let's predict: y_pred is the result of calling predict on the vectorized test data. Before we look at the output, I'd like you to guess what it will be. We've trained a classifier, and by predicting on the test vectors all we're asking is whether each sentence belongs to class 1 or class 0, which for this data means positive or negative. So the prediction gives us an array of ones and zeros.
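A minimal sketch of the vectorize-and-classify step, assuming X_clean, Xt_clean and y_train from above (on newer scikit-learn the vocabulary method is get_feature_names_out; on older versions it was get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

cv = CountVectorizer(ngram_range=(1, 2))     # unigrams and bigrams
X_vec = cv.fit_transform(X_clean).toarray()
print(cv.get_feature_names_out())            # which word or word pair each column counts
Xt_vec = cv.transform(Xt_clean).toarray()    # note: transform, not fit_transform, on test data

mnb = MultinomialNB()
mnb.fit(X_vec, y_train)
y_pred = mnb.predict(Xt_vec)
print(y_pred)                                # 1 = positive sentence, 0 = negative sentence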
Looking at the test sentences: "I was happy and I loved the acting in the movie" is clearly positive, since "happy" and "loved" are positive words, which we know. Let's see what the machine identified it as: the first prediction, which corresponds to that sentence, comes out as 1, which means positive. To make this even clearer, let's put just a single sentence in the test data, one that we know is negative, full of words that carry a negative feeling, purely so you understand it better. What I expect in the output is a zero, and as you can see, the predicted value is indeed zero. So our classification is working, and we can also do the same thing with a much bigger
NLP & Text Mining Using NLTK
data set. So let's understand what natural language processing is. NLP refers to the artificial intelligence method of communicating with an intelligent system using natural language. By utilizing NLP and its components, one can organize massive chunks of textual data, perform numerous automated tasks and solve a wide range of problems such as automatic summarization, machine translation, named entity recognition, speech recognition and topic segmentation. Let's look at the basic structure of an NLP application, taking a chatbot as an example. First we have the NLP layer, which is connected to the knowledge base and the data storage. The knowledge base is where we have the source content, that is, the chat logs containing a large history of conversations used to train the algorithm, while the data storage holds the interaction history and the analytics of those interactions, which in turn helps the NLP layer generate meaningful output. Now for the various applications of NLP. First of all we have sentiment analysis, a field where NLP is used heavily. We have speech recognition, which includes the voice assistants like Google Assistant, Cortana and Siri. Next we have chatbots, as I just discussed: you might have used the customer-care chat service of some app, which uses NLP to process what you type and respond based on the input. Machine translation is another use case of natural language processing; the most common example is Google Translate, which uses NLP to translate data from one language to another, and that too in real time. Other applications of NLP include spell checking, keyword search, which is also a big field where NLP is used, extracting information from a particular website or document, and one of the coolest applications, advertisement matching, which basically means recommending ads based on your history. NLP is divided into two major components: natural language understanding, also known as NLU, and natural language generation, also known as NLG. Understanding involves tasks like mapping a given natural-language input into useful representations and analyzing different aspects of the language, whereas natural language generation is the process of producing meaningful phrases and sentences in natural language; it involves text planning, sentence planning and text realization. NLU is usually considered harder than NLG. Now you might be thinking that even a small child can understand a language, so let's see what difficulties a machine faces while understanding one. Understanding a language is genuinely hard: taking English into consideration, there is a lot of ambiguity, and at different levels. We have lexical ambiguity, syntactic ambiguity and referential ambiguity. Lexical ambiguity is the presence of two or more possible meanings within a single word; it is also sometimes referred to as semantic ambiguity. For example, consider these sentences and focus on the italicized words: "She is looking for a match." What do you infer from the word match?
Is she looking for a partner, or is she looking for a match in the sense of a cricket match or a rugby match? The second sentence is "The fisherman went to the bank." Is it the bank where we go to collect our cheques and money, or is it the river bank we're talking about? Sometimes it's obvious it's the river bank, but it might equally be true that he's actually going to a bank to withdraw some money; you never know. Coming to the second type of ambiguity, syntactic ambiguity: in English grammar, syntactic ambiguity is the presence of two or more possible meanings within a single sentence or sequence of words; it is also called structural ambiguity or grammatical ambiguity. Take these sentences: "The chicken is ready to eat." What do you infer here? Is the chicken ready to eat its food, or ready for us to eat? Similarly, "Visiting relatives can be boring": are the relatives boring, or is visiting the relatives boring? You never know. Coming to the final ambiguity, referential ambiguity: this arises when we refer to something using pronouns. "The boy told his father about the theft. He was very upset." Who is "he"? Is it the boy, the father, or the thief? Coming back to NLP, first we need to install the NLTK library, the Natural Language Toolkit. It is the leading platform for building Python programs that work with human language data, and it provides easy-to-use interfaces to a large collection of corpora and lexical resources. We can use it to perform functions like classification, tokenization, stemming, tagging and much more. Once you install the NLTK library, you will see the NLTK downloader, a pop-up window, in which you select the "all" option and press the download button; it will download all the required files, the corpora, the models and all the different packages available in NLTK. Now, when we process text there are a few terminologies we need to understand. The first one is tokenization: tokenization is the process of breaking strings into tokens, which are small structures or units that can be used in further processing. Tokenization involves three steps: breaking a complex sentence into words, understanding the importance of each word with respect to the sentence, and finally producing a structural description of the input sentence. Looking at the example sentence "Tokenization is the first step in NLP", when we divide it into tokens we get one, two, three, up to seven tokens. NLTK also allows you to tokenize phrases containing more than one word. So let's go ahead and see how to implement tokenization using NLTK. Here I'm using a Jupyter notebook to execute all my practicals and demos; you are free to use any IDE supported by Python, it's your choice. Let me create a new notebook here and rename it "text mining and NLP". First of all, let's import all the necessary libraries: here we import os, nltk and the nltk corpus. As you can see, we have various files representing different types of words and different functions: we have samples of Twitter data, a sentiment word net, product reviews, movie reviews.
We have non-breaking prefixes and many more files here. Now let's look at the Gutenberg corpus and see which files are present in it: inside we have several text files, Austen's Emma, Shakespeare's Hamlet, Moby Dick, Carroll's Alice and many more. And this is just one corpus; NLTK provides a lot of them. So let's take a document of string type and understand the significance of its tokens. If you look at the elements of the Hamlet file, it starts with "The Tragedie of Hamlet by William Shakespeare 1599", and the first 500 elements continue from there with the opening of the first act. We can use a lot of these files for text analysis and understanding, and this is where NLTK comes into the picture: it helps programmers explore the different features and applications of language processing. Here I have created a paragraph on artificial intelligence, so let me execute it. This AI variable is of string type, which makes it easy to tokenize; nonetheless, any of the corpus files could be used, I'm taking a string for simplicity. Next we import word_tokenize from the nltk.tokenize module, which helps us tokenize all the words. We run word_tokenize over the paragraph and assign the result a name, AI_tokens. Looking at the output, it has divided the input into tokens, and in total we have 273 of them. These tokens are a list of words and special characters as separate items of the list. In order to find the frequency of the distinct elements in the given AI paragraph, we import FreqDist from nltk.probability, create a frequency distribution, and basically count how often each word appears in the paragraph. So we see the comma 30 times, the full stop nine times, "accomplished" once, "according" once, "computer" five times, and so on. Here we also convert the tokens to lower case, so as to avoid a word with upper case and lower case being counted as two different words. Suppose we now select the top 10 tokens with the highest frequency: we see the comma 30 times, "the" 13 times, "of" 12 times and "and" 12 times, whereas the meaningful word "intelligence" appears six times. There is another type of tokenizer, the blankline tokenizer. Using the blankline tokenizer over the same string tokenizes the paragraph with respect to blank lines, and the output here is nine: this nine indicates how many paragraphs we have, where paragraphs are separated by a blank line. Although it might look like one paragraph, it is not, and the original structure of the data remains intact. Now, other important key terms in tokenization are bigrams, trigrams and n-grams. What do these mean? Tokens of two consecutive written words are known as a bigram.
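Before moving on to n-grams, here is a runnable sketch of the tokenization and frequency steps just covered (the AI string below is only a short stand-in for the longer paragraph used in the video):

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, blankline_tokenize
from nltk.probability import FreqDist

AI = "Artificial intelligence is intelligence demonstrated by machines. AI lets a computer learn from data."  # stand-in paragraph
AI_tokens = word_tokenize(AI)
print(len(AI_tokens))                 # number of word and punctuation tokens

fdist = FreqDist()
for word in AI_tokens:
    fdist[word.lower()] += 1          # lower-case so the same word isn't counted twice
print(fdist.most_common(10))          # the ten most frequent tokens

AI_blank = blankline_tokenize(AI)
print(len(AI_blank))                  # paragraphs separated by blank lines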
Similarly, tokens of three consecutive written words are known as a trigram, and in general we have n-grams for n consecutive written words. So let's execute a quick demo of bigrams, trigrams and n-grams. First of all we import bigrams, trigrams and ngrams from nltk.util. Now let's take a string to work on: "The best and the most beautiful things in the world cannot be seen or even touched, they must be felt with the heart." First we split the sentence into tokens, using word_tokenize, and as you can see we get the tokens. Now we create the bigrams of the token list by calling nltk.bigrams and passing the tokens, wrapping the result in list() since it comes back as a generator. In the output you can see pairs like "the best", "best and", "and the": the tokens come out two at a time, in pair form. Similarly, for trigrams we just swap in the trigrams function, and the tokens come out in groups of three. For n-grams we also pass a particular number, say four, and the output comes in groups of four tokens. Once we have the tokens, we usually need to normalize them, and for that we have stemming. Stemming refers to reducing inflected words to their base or root form, mapping a group of words to the same stem even if the stem itself is not a valid word in the language. Looking at the words "affectation", "affects", "affections", "affected", "affection" and "affecting", you might have guessed that the root word here is "affect". One thing to keep in mind is that the result may not always be the true root word: stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in inflected words. This indiscriminate cutting is successful on some occasions but not always, which is why this approach presents some limitations. So let's see how to perform stemming on a given set of words. There are quite a few types of stemmer. Starting with the Porter stemmer, we import it from nltk.stem. Stemming the word "having" gives us "have" as the output. Next we define some words to stem: "give", "giving", "given" and "gave". Running the Porter stemmer on them gives "give", "give", "given" and "gave": the stemmer removed only the "ing" and adjusted the ending. Now let's do the same with another stemmer, the Lancaster stemmer. You can see it stemmed all the words down to "giv", from which you can conclude that the Lancaster stemmer is more aggressive than the Porter stemmer. Which stemmer you use depends on the type of task you want to perform: for example, if you want to check how many times the stem "giv" occurs above, you can use the Lancaster stemmer, and for other purposes you have the Porter stemmer as well. And there are more stemmers besides.
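Here's a small sketch covering the n-gram and stemmer calls described above:

from nltk.tokenize import word_tokenize
from nltk.util import bigrams, trigrams, ngrams
from nltk.stem import PorterStemmer, LancasterStemmer

string = ("The best and most beautiful things in the world cannot be seen or even touched, "
          "they must be felt with the heart")
tokens = word_tokenize(string)

print(list(bigrams(tokens)))      # consecutive pairs of tokens
print(list(trigrams(tokens)))     # consecutive triples
print(list(ngrams(tokens, 4)))    # any n you like

pst = PorterStemmer()
lst = LancasterStemmer()
print(pst.stem("having"))         # 'have'
for w in ["give", "giving", "given", "gave"]:
    print(w, pst.stem(w), lst.stem(w))   # Lancaster is noticeably more aggressive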
There is also the Snowball stemmer, where you need to specify the language you are using and then stem with it; it's how NLTK supports non-English stemmers as well. Now, as we discussed, a stemming algorithm works by cutting off the end or the beginning of the word. Lemmatization, on the other hand, takes the morphological analysis of the word into consideration. In order to do so, it is necessary to have a detailed dictionary the algorithm can look into, to link the inflected form back to its lemma. What lemmatization does is group together the different inflected forms of a word, which are mapped to their lemma. It is somewhat similar to stemming in that it maps several words onto a common root, but one of the most important things to note is that the output of lemmatization is a proper word, unlike stemming, where we got outputs like "giv", which is not a word at all, just a stem. For example, if a lemmatizer works on "go", "going" and "went", they all map to "go", because that is the root of all three words. So let's see how lemmatization works on some input data. For that we import the WordNetLemmatizer from NLTK, and we also import WordNet itself, because, as I mentioned, lemmatization requires a detailed dictionary: the output has to be a proper word, not just any truncated string, and to find that proper word it needs a dictionary, which WordNet provides. So we pass the word "corpora" into the WordNet lemmatizer. Can you tell me what the output will be? I'll leave that up to you: I won't execute it here, so you tell me in the comments below what the lemmatization of "corpora" gives, and also what the stemming of it gives; execute both and let me know in the comment section. Now take the words "give", "giving", "given" and "gave" and see what the lemmatizer does: it has kept the words as they are, and that's because we haven't assigned any POS tags here, so it has assumed all the words are nouns. Now you might be wondering what POS tags are. I'll cover them properly later in this video, but for now just know that a POS tag tells us what exactly the given word is: a noun, a verb, or some other part of speech; POS stands for parts of speech. Now, did you know there are several words in the English language, such as "I", "at", "for", "above", "below", which are very useful in the formation of sentences, and without them a sentence wouldn't make sense, yet they do not provide any help in natural language processing? This list of words is known as stop words. NLTK has its own list of stop words, and you can use it by importing it from nltk.corpus. So the question arises: are they helpful or not? Yes, they are helpful in the creation of sentences, but they are not helpful in the processing of the language. Let's check the list of stop words in NLTK: from nltk.corpus we import stopwords and ask for the stop words of the English language. As you can see, we get the list of all the stop words defined for English, 179 of them in total.
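A short sketch of the lemmatizer and stop-word calls (the "corpora" result is left for you to run yourself, as asked above):

import nltk
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("corpora"))      # try it -- the lemma should be a proper dictionary word
for w in ["give", "giving", "given", "gave"]:
    print(lemmatizer.lemmatize(w))          # unchanged, because without a POS tag each word is treated as a noun

print(len(stopwords.words('english')))      # 179 stop words in the English list
print(stopwords.words('english')[:10])      # a peek at the first few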
As you can see, the list contains words like "few", "more", "most", "other", "some". These words are very necessary in the formation of sentences, you cannot ignore them, but for processing they are not important at all. If you remember, we had the top 10 tokens from the AI paragraph, the most-common-10 output from the frequency distribution. Looking at that list, except for "intelligent" and "intelligence", most of the entries are either punctuation or stop words and hence can be removed. Now we'll use compile from the re module to create a pattern that matches any digit or special character, and then we'll strip those out. If you look at the output after removing the punctuation, there are no stray punctuation tokens left, and the length is 233 compared with the 273 of the original AI tokens. This step is very useful in language processing, as it removes the words that don't carry much meaning. Now for another important topic of natural language processing and text mining: parts of speech. Generally speaking, the grammatical type of a word, verb, noun, adjective, adverb, article, indicates how the word functions in meaning as well as grammatically within the sentence. A word can have more than one part of speech based on the context in which it is used. For example, in the sentence "Google something on the internet", "Google" acts as a verb although it is a proper noun. There are many types of POS tags, each with its own description: CC for coordinating conjunction, CD for cardinal number, JJ for adjective, MD for modal, tags for singular and plural proper nouns, several tags for different verb forms, interjections, symbols, wh-pronouns and wh-adverbs. POS tagging is a standard statistical NLP task: it distinguishes the sense in which a word is used, which is very helpful in text realization, it is easy to evaluate (you can count how many tags are correct), and you can also infer semantic information from the given text. Let's look at some examples of POS tagging. Take the sentence "The dog killed the bat": here "the" is a determiner, "dog" is a noun, "killed" is a verb, and "the bat" is again a determiner and a noun respectively. Now consider another sentence, about a waiter clearing the plates from the table: every token corresponds to a particular part-of-speech tag, which is very helpful in text realization. Now let's take a string and check how NLTK performs POS tagging on it. Take the sentence "Timothy is a natural when it comes to drawing". First we tokenize it, and then we use the pos_tag function available in NLTK, passing it the tokens. In the output, "Timothy" is a noun, "is" a verb, "a" a determiner, "natural" an adjective, "when" a wh-adverb, "it" a pronoun, "comes" a verb, "to" is tagged as TO, and "drawing" a verb again. So this is how the POS tags are produced; the pos_tag function does all the work here. Now let's take another example, "John is eating a delicious cake", and see what the output of this one is.
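The POS-tagging calls described above look like this (exact tag values may differ slightly between NLTK versions):

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize, pos_tag

print(pos_tag(word_tokenize("Timothy is a natural when it comes to drawing")))
print(pos_tag(word_tokenize("John is eating a delicious cake")))
# Each token comes back as a (word, tag) pair, e.g. ('Timothy', 'NNP'), ('is', 'VBZ'), ...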
Here you can see that the tagger has tagged both "is" and "eating" as verbs, effectively treating "is eating" as one verbal group; this is one of the few shortcomings of POS taggers, and something important to keep in mind. After POS tagging there is another important topic, which is named entity recognition. So what does it mean? The process of detecting named entities such as person names, location names, company names, organizations, quantities and monetary values is called named entity recognition. Under named entity recognition we have three steps. First, noun phrase identification: this step deals with extracting all the noun phrases from a text using dependency parsing and parts-of-speech tagging. Then we have phrase classification, the classification step in which all the extracted noun phrases are classified into their respective categories, such as locations, organizations and much more; apart from this, one can curate lookup tables and dictionaries by combining information from different sources. Finally we have entity disambiguation: sometimes it is possible that entities are misclassified, hence creating a validation layer on top of the result is very useful, and knowledge graphs can be exploited for this purpose; the popular knowledge graphs are the Google Knowledge Graph, IBM Watson and Wikipedia. So take the sentence where Google's CEO Sundar Pichai introduced the new Pixel at the Minnesota Roy Center event: Google is tagged as an organization, Sundar Pichai as a person, Minnesota as a location, and the Roy Center event is also tagged as an organization. Now, for using NER in Python we have to import ne_chunk from the NLTK module. Let's consider a piece of text and see how to perform NER using the NLTK library. First we import ne_chunk, and consider the sentence "The US president stays in the White House". We go through the same processes again: we tokenize the sentence first, then add the POS tags, and then pass the list of tuples containing the POS tags to ne_chunk. Looking at the output, "US" here is recognized as an organization, and "White House" is clubbed together as a single entity and recognized as a facility. This is only possible because of the POS tagging; without it, it would be very hard to detect the named entities of the given tokens. Now that we have understood named entity recognition, let's go ahead and understand one of the most important topics in NLP and text mining, which is syntax. So what is syntax? In linguistics, syntax is the set of rules, principles and processes that govern the structure of sentences in a given language; the term syntax is also used to refer to the study of such principles and processes. What we have are certain rules about which part of a sentence should come at what position, and with these rules one can create a syntax tree for any input sentence. A syntax tree, in layman's terms, is basically a tree representation of the syntactic structure of a sentence or string; the same idea is used to represent the syntax of a programming language as a hierarchical tree structure.
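Going back to the NER step for a moment, here is a minimal sketch of that pipeline (tokenize, tag, then chunk):

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "The US President stays in the White House"
ner_tree = ne_chunk(pos_tag(word_tokenize(sentence)))
print(ner_tree)   # named entities such as 'White House' come out grouped into single labelled chunks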
This structure is used for generating symbol tables for compilers and, later, code generation; the tree represents all the constructs in the language and their subsequent rules. So let's consider the statement "The cat sat on the mat". The input, a sentence or phrase, is classified into a noun phrase and the rest of the structure: the prepositional phrase, within which the noun phrase is again classified into article and noun, the verb "sat", and finally the preposition "on" with the article "the" and the noun "mat". Now, in order to render syntax trees in our notebook you need to install Ghostscript, which is a rendering engine. This takes a while, and let me show you where you can download it: just search for "download Ghostscript" and select the latest version. As you can see, there are two types of licence, the general public licence and the commercial licence. Creating a syntax tree and following it is an important part of the process, but I'm not going to go much deeper into what a syntax tree is and how to render one. Now that we have understood syntax trees, let's discuss an important concept with respect to analyzing sentence structure, which is chunking. Chunking basically means picking up individual pieces of information and grouping them into bigger pieces, and these bigger pieces are also known as chunks. In the context of NLP and text mining, chunking means grouping words or tokens into chunks. Let's look at an example. The sentence under consideration is "We caught the black panther": "we" is a pronoun, "caught" is a verb, "the" a determiner, "black" an adjective and "panther" a noun. What chunking has done, as you can see, is take "black", which is an adjective, "panther", which is a noun, and "the", a determiner, and chunk them together into a noun phrase. So let's see how to implement chunking using NLTK. Take the sentence "The big cat ate the little mouse who was after the fresh cheese". We'll use the tokenizing function and the POS tagger here, so we have the tokens and the POS tags. What we'll do now is create a grammar for a noun phrase, mentioning the tags we want in our chunk phrase within the curly braces; that will be our grammar_np, a regular-expression matching string. We then create a chunk parser, pass our noun-phrase grammar to it, and parse the tagged tokens. You'll see a certain error here, and let me tell you why it occurred: it's because we did not install Ghostscript, so the syntax tree can't be drawn. But the final output still contains the tree structure, just not as a visualization. As you can see, we get an NP noun phrase for "the little mouse", and again a noun phrase for "fresh cheese": although "fresh" is an adjective and "cheese" is a noun, it has considered these two words a noun phrase. So this is how you execute chunking with the NLTK library. By now we have covered almost all the important steps in text processing, so let's apply them all by building a machine learning classifier on the movie reviews corpus from the NLTK corpora. For that, first let me import the usual libraries, pandas and numpy; these are the basic libraries needed in pretty much any machine learning workflow.
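Before the classifier, here is what the chunking step above looks like in code; the grammar says a noun phrase (NP) is an optional determiner, any number of adjectives, then a noun (it assumes the punkt and tagger models from the earlier snippets are already downloaded):

from nltk import word_tokenize, pos_tag, RegexpParser

sentence = "The big cat ate the little mouse who was after the fresh cheese"
tagged = pos_tag(word_tokenize(sentence))

grammar_np = r"NP: {<DT>?<JJ>*<NN>}"      # determiner? adjectives* noun
chunk_parser = RegexpParser(grammar_np)
chunk_result = chunk_parser.parse(tagged)
print(chunk_result)                        # the tree prints fine even without Ghostscript installed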
We are also importing the CountVectorizer; I'll explain why it is used shortly, let's just import it for now. Again, if we look at the different elements of the corpora, as we saw at the beginning of our session, there are many files in the NLTK corpora, and among them is the movie reviews corpus, so we import movie_reviews from nltk.corpus. Looking at the different categories of the movie reviews, we have two, negative and positive. Under positive we have a thousand text files, and similarly under negative we have a thousand files containing the negative feedback. Let's take one particular positive review into consideration, say the one labelled cv29590; you can take any of the files here, it doesn't matter. As you can see, the file is already tokenized. Tokenization is generally useful for us, but here it has actually increased our work, because in order to use the CountVectorizer (and TF-IDF) we must pass strings instead of tokens. To convert the tokens back into strings we could use the detokenizer within NLTK, but that has some licensing issues as of now with the conda environment, so instead we can use the join method to join all the tokens of a list into a single string, and that's what we'll use here. First we create an empty list, the review list, and append the joined reviews to it. While appending, we also remove the extra spaces and the stray commas from each list of tokens, and we perform the same for the positive and the negative reviews: this pass is for the negative reviews, and then we do the same for the positive reviews as well. If you look at the length of the negative review list, it's 1,000, and the moment we add the positive reviews the length should reach 2,000. So let me define and run the positive reviews, and checking the length of the review list again, it is 2,000. That is good. Now let's create the targets before creating the features for our classifier. While creating the targets we denote the negative reviews as 0 and the positive reviews as 1, so we create an empty list and add 1,000 zeros followed by 1,000 ones to it. Then we create a pandas Series from the target list; the type of y must come out as a pandas Series, and if we look at the output of type(y), it is pandas.core.series.Series. That is good. Let's look at the first five entries of the series: since it is a thousand zeros followed by a thousand ones, the first five entries are all zeros. Now we can start creating features using the CountVectorizer, the bag-of-words approach. For that we import the CountVectorizer, initialize it, and fit it onto the review list. Looking at the dimensions of this particular matrix, it is 2,000 by 16,228. Next we create a list with the names of all the features from the vectorizer, and as you can see, there is our vocabulary.
Now we create a pandas DataFrame by passing the sparse matrix as the values and the feature names as the column names. Checking the dimensions of this DataFrame, it is the same, 2,000 by 16,228. Looking at the top five rows of the data frame, we have 16,228 columns with five rows, and the entries shown here are all zero. Next we split the DataFrame into training and testing sets and examine them. The test size here is defined as 0.25, so the test set is 25% and the training set gets the remaining 75% of the data frame. Looking at the shape of X_train we have 1,500 rows, and the dimension of X_test is 500. So now our data is split, and we'll use a Naive Bayes classifier for text classification over the training and testing sets. Most of you might already be aware of what a Naive Bayes classifier is: it is basically a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. To know more, you can watch our Naive Bayes classifier video, the link to which is given in the description box below; if you want to pause at this moment and quickly check what a Naive Bayes classifier does and how it works, you can watch that video and come back here. Now, to implement the Naive Bayes algorithm in Python we'll use the following library and functions. We import GaussianNB from sklearn, which is scikit-learn, instantiate the classifier, and fit it with the training features and the labels. We also import the Multinomial Naive Bayes, because our features here are word counts rather than just two values, so the multinomial variant is the appropriate one. We fit it on the training set, and then we use the predict function on the test features. Now let's check the accuracy of this model: the accuracy here is 1, which is very highly unlikely; since it has given 1, that suggests it is overfitting and overly accurate, and you can also check the confusion matrix for the same. For that, you use the confusion matrix on y_test and y_predicted. So as you can see, although it has predicted with 100% accuracy here, that is very unlikely in general, and you might have got a different output for this one: I got 1.0, you might get 0.6, 0.7 or any number
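Condensing the whole movie-review walkthrough into one sketch (this keeps the sparse matrix directly rather than building the intermediate DataFrame; details such as the random_state are illustrative):

import pandas as pd
import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Re-join the pre-tokenized reviews into plain strings: 1,000 negatives, then 1,000 positives
rev_list = [" ".join(movie_reviews.words(fid)) for fid in movie_reviews.fileids("neg")]
rev_list += [" ".join(movie_reviews.words(fid)) for fid in movie_reviews.fileids("pos")]
y = pd.Series([0] * 1000 + [1] * 1000)       # 0 = negative review, 1 = positive review

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(rev_list)       # bag-of-words count matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)

clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))        # a perfect 1.0 here should make you suspicious
print(confusion_matrix(y_test, y_pred))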
Stemming And Lemmatization
in between 0 and 1. Now, the degree of inflection may be higher or lower in a language. As you have read the definition of inflection with respect to grammar, you can understand that inflected words share a common root form. Stemming and lemmatization have been studied, and algorithms have been developed, in computer science since the 1960s. In this video you will learn about stemming and lemmatization in a practical way, covering the background, some famous algorithms, applications of stemming and lemmatization, and how to stem and lemmatize words, sentences, and documents using the Python NLTK package, the Natural Language Toolkit provided for natural language processing tasks. Stemming and lemmatization are text normalization techniques in the field of natural language processing that are used to prepare text, words, and documents for further processing, and they are widely used in tagging systems, indexing, SEO, web search results, and information retrieval. For example, searching for the word "miss" on Google will also return results for "misses" and "missing", as "miss" is basically the stem of both these words. So let's start with stemming. Stemming is the process of reducing inflection in words to their root forms, such as mapping a group of words to the same stem even if the stem itself is not a valid word in the language. There are English and non-English stemmers available in the NLTK package. For the English language you can choose between the Porter stemmer and the Lancaster stemmer, the Porter stemmer being the oldest one, originally developed in 1979. The Lancaster stemmer was developed in 1990 and uses a more aggressive approach than the Porter stemming algorithm. So let's try out the Porter stemmer to stem words, and along with it you will see how it stems them. This is how the code for the Porter stemmer works. The Porter stemmer uses suffix stripping to produce stems. The Porter algorithm does not follow linguistics; rather, it applies a set of rules, in five phases, for different cases to generate stems. This is the reason why the Porter stemmer often generates stems that are not actual English words. It does not keep a lookup table of actual stems of words but applies algorithmic rules to generate stems, and it also uses rules to decide whether it is wise to strip a suffix. One can write their own set of rules for any language, which is why Python NLTK introduced Snowball stemmers, which are used to create non-English stemmers. So why do we use it? The Porter stemmer is known for its simplicity and speed. It is commonly used in information retrieval environments, known as IR environments, for fast recall and fetching of search queries, for example for words like connections, connected, connecting, or connection, all of which map to connect. The Lancaster stemmer is an iterative algorithm with its rules saved externally. The Lancaster stemmer is simple, but its iterations make the stemming heavier and over-stemming may occur. Over-stemming causes the stems to be non-linguistic, or they may have no meaning at all. So now let's have a look at the Lancaster stemmer code. For example, in the above code, "destabilized" is stemmed to a much shorter form by the Lancaster stemmer than by the Porter stemmer; Lancaster produces an even shorter stem than Porter because of its iterations, and over-stemming occurs. You can also stem sentences and documents using NLTK stemmers with the following code. As you see, the stemmer sees the entire sentence as one word, so it returns it as it is.
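A minimal sketch of the Porter and Lancaster stemmers side by side, using a few of the words mentioned above:

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

words = ["connections", "connected", "connecting", "connection", "destabilized"]
for w in words:
    # Porter is gentler; Lancaster is more aggressive and may over-stem
    print(f"{w:15} porter={porter.stem(w):12} lancaster={lancaster.stem(w)}")
```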
Now we need to stem each word in the sentence and return a combined sentence. To separate the sentence into words you can use a tokenizer; the NLTK tokenizer splits the sentence into words. So let's see how it's done. To stem a document we need to do the following steps: first, take a document as input; next, read the document line by line and tokenize each line; then stem the words; and finally output the stemmed words. So let's do some coding now. Open a file, any text file; I have a text file named deep learning, and you have to provide the complete file path in Python's open command if it is stored in another directory. You can see the contents of the file all at once using the .read() method, or you can keep the lines of the file in a Python list using the readlines() method. You can then access each line and use the tokenize-and-stem-sentence function that you created before to tokenize and stem the line. You can save the stemmed sentences to a text file using Python's writelines function: make a list first to store all the stemmed sentences and simply write the list to the file using writelines. The text file created will be as follows. Python NLTK provides not only the two English stemmers, the Porter stemmer and the Lancaster stemmer, but also a lot of non-English stemmers as part of the Snowball stemmers, the ISRI stemmer, and the RSLP stemmer. Python NLTK includes the Snowball stemmer family to create non-English stemmers; currently it supports languages such as Danish, Dutch, English, French, German, and many more. Now, lemmatization, unlike stemming, reduces the inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called a lemma. A lemma is the canonical form, dictionary form, or citation form of a set of words. It takes into consideration the morphological analysis of the words. To do so, it is necessary to have detailed dictionaries which the algorithm can look through to link the inflected form back to its lemma. For example, a lemmatizer should map gone, going, and went to go. Python NLTK provides the WordNet lemmatizer, which uses the WordNet database to look up lemmas of words. So let's have a look at the code. In the above output you may be wondering why no actual root form has been returned for some words: this is because they are given without context. You need to provide the context in which you want to lemmatize, that is, the part of speech. Remember, when we were learning about the different steps in NLP, POS tagging was an important step in the whole process. This is done by giving a value for the pos parameter of the WordNet lemmatizer's lemmatize method. So let's have a look at some of the applications of stemming and lemmatization. The first one is sentiment analysis. Sentiment analysis is the analysis of people's reviews and comments about something; it is widely used for analyzing products on online retail sites. Stemming and lemmatization are used as part of the text preparation process before the text is analyzed. Next up is document clustering. Document clustering is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction, and fast information retrieval or filtering.
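Before moving on to the remaining applications, here is a minimal sketch of stemming a whole sentence and of the WordNet lemmatizer with and without the pos argument; the helper name stem_sentence and the sample sentence are my own choices, not necessarily the ones used in the video:

```python
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download('punkt'); nltk.download('wordnet')  # first run only

porter = PorterStemmer()

def stem_sentence(sentence):
    # tokenize, stem each token, then join back into a single string
    return ' '.join(porter.stem(tok) for tok in word_tokenize(sentence))

print(stem_sentence("Pythoners are very intelligent and work very pythonly"))

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("going"))           # no context -> 'going'
print(lemmatizer.lemmatize("going", pos="v"))  # with POS 'v' (verb) -> 'go'
```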
The next application is information retrieval environments: here it is useful to use stemming and lemmatization to map documents to common topics and display search results by indexing, especially when the number of documents grows to mind-boggling numbers. Now you may be asking yourself: when should I use stemming and when lemmatization? Both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a lemma is an actual word in the language. Stemming follows an algorithm with steps to perform on the words, which makes it faster, whereas in lemmatization you use the WordNet corpus, and a corpus of stop words as well, to produce the lemma, which makes it slower than stemming. You also have to define a part of speech to obtain the correct lemma. So when to use what? The points above show that if speed is the focus, then stemming should be used, since lemmatizers scan a corpus, which consumes time and processing. It depends on the application you are working on whether stemmers or lemmatizers should be used. If you're building a language application in which the language itself is important, you should use lemmatization, as it uses a corpus to match root forms. So with this we come to the end of this video. I hope you guys understood the various steps involved in NLP, the different types of stemmers, how lemmatization works, and
Context Free Grammar Using NLP In Python
most importantly where to use which function. Now, what is a syntax tree? This is an important concept. Syntax is the study of the rules governing the way words are combined to form sentences in a language. Whenever you create a sentence there are always rules: you may need to start with a determiner like "the", then a certain noun comes into the picture, then a certain verb, maybe some adjectives. There are rules for creating a sentence; you cannot create a sentence without any rules. We have to specify some noun, then verb, then adjective, then prepositions, and these rules are called the syntax. So syntax is the study of the rules governing the way words are combined, and sentences are composed of discrete units that are combined according to those rules. Every sentence follows certain rules, whether it's past continuous, present perfect, past perfect, or a simple tense; we have certain rules for defining a sentence, and that is what syntax captures. Now, phrase structure rules: suppose a sentence can start with a noun or a verb, then we can have a noun and a verb, then a determiner and prepositions, then again nouns, prepositions, and adjectives, and finally a closing preposition. Any kind of sentence has rules about what it can start with, what it can end with, and what has to come in between; these are called phrase structure rules. In layman's terms, a syntax tree is a tree representation of the syntactic structure of a sentence or a string. For example, take the sentence "the old tree swayed in the wind": "the" is a determiner, "old" is an adjective, "tree" is a noun, "swayed" is a verb, "in" is a preposition, "the" is a determiner, and "wind" is a noun. So the rule for this sentence is determiner, adjective, noun, verb, preposition, determiner, noun, and we can also arrange it as a hierarchy: a noun phrase and a verb phrase combining to become a sentence. Any sentence has certain rules, so we can define those rules and check whether the words come in the expected order or not. Now, in order to render a syntax tree in your notebook you need to install Ghostscript, a rendering engine, from the link shown here. If you go to this link, I have already downloaded it; you need to download the relevant .exe file, the Ghostscript AGPL release installer, and then update the PATH variable. I've already downloaded it on my machine; this is the file, I ran it, and it installs like any other software into the C drive. When you go to Program Files you get a folder called gs, and within gs9.25 there is a bin directory. You need to copy this path and add it to the Path variable in Windows: go to This PC, Properties, Advanced system settings, Environment Variables, and under Path click Edit, add a new entry, and paste this path (after a semicolon in older Windows versions). I've already added it so that it can be accessed from any location.
This Ghostscript installation is what lets NLTK actually render the syntax trees, the trees that show whether a noun can come after an adjective, where the verb sits in the sentence, where the prepositions can be kept, and so on. So you need to download it first, and once it is installed, go to the folder where it is installed, open the bin folder, and add the path of the bin folder to your environment variables, so that wherever you are running NLP you can access it. You go to the Path variable, edit it, and add the path there. Once you have done that, you also notify Python of the path through a piece of code: you import os, set a variable path_to_gs to the directory where it is present, and append this path to os.environ["PATH"]. The os library exposes the environment variables, which are set automatically when we invoke Python or the notebook, so adding the path there lets NLTK find Ghostscript easily. Now that we have modified the PATH environment variable, let's discuss some important concepts with respect to analyzing sentence structure. What is chunking? Chunking basically means picking up individual pieces of information and grouping them into a bigger piece. The bigger piece is also known as a chunk in the context of NLP, so chunking means grouping words and tokens into chunks. Chunking is picking up the individual words and grouping them into a syntactically meaningful unit: we take the smaller units, called tokens, and club them together to make a bigger chunk. For example, here we have individual words with their tags: a pronoun, a verb, DT which is a determiner, JJ which is an adjective, and NN which is a noun. Here the word "dog", which is a noun, together with the determiner and the adjective before it, is chunked into a noun phrase; three things combined together become a noun phrase, a determiner, a JJ, and a noun. So there are certain patterns in English: a noun phrase has a determiner, JJ, and noun, and that becomes the rule. So let's see how we do chunking in Python. I'll show you one example. We import nltk and the os library, then nltk.corpus, then we use the word tokenizer and the regular expression tokenizer, and we import load from nltk.data, which is used because some of the data is already built in. Now we have a sentence, "Mary is driving a big car." We tokenize it into sent_tokens and print them, so for "Mary is driving a big car" we get the tokens, and then we put a POS tag on all those tokens. Once we put POS tags, we get to know what is a noun there, what is a verb, what is an adjective, all those things (a small sketch of this tokenize-and-tag step follows below).
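A small sketch of this setup and of the tokenize-and-tag step; the Ghostscript path below is only an example for a default Windows install of gs9.25 and should be adjusted to your own machine:

```python
import os
import nltk
from nltk.tokenize import word_tokenize

# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # first run only

# tell the notebook where Ghostscript lives so NLTK can render tree images
path_to_gs = r"C:\Program Files\gs\gs9.25\bin"           # example path -- adjust
os.environ["PATH"] += os.pathsep + path_to_gs

sentence = "Mary is driving a big car"
tokens = word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
print(tagged)
# e.g. [('Mary','NNP'), ('is','VBZ'), ('driving','VBG'), ('a','DT'), ('big','JJ'), ('car','NN')]
```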
Similarly, we have another sentence, "John is eating a delicious cake"; we tokenize it, POS-tag it, and get the tagged words in the same way. Similarly, for "Jim eats a banana" we tokenize it, and using the regular expression tokenizer we can tokenize it as well; we get "Jim eats a banana" split on the spaces, and if we pass this to the POS tagger we get each word tagged individually. Now, if you look inside the NLTK data under corpora, you'll see so many files there; we can look at gutenberg.fileids() to see all the file IDs, just for our knowledge. So, coming to the point, what we do here is pick up one of the files from there; all these text files are already present in the corpus. We pick one file, shakespeare-hamlet.txt, so we are using a built-in file rather than passing our own string, and we check its length: it has 37,360 words. We pick only the first 2,000 words and pass them to the POS tagger, creating a list of all these words with their POS tags. It will take some time because it's a big file. So we get the 2,000 words from this file tagged: "The" is a determiner, "Tragedy" is a noun, "Hamlet" is a noun, "by" is a preposition (IN), "William" is a proper noun, and so on; we get the complete details. Now we are going to filter only the NNP words, which are the proper nouns: only if the tag is NNP do we keep the word. We got articles, determiners, and verbs in this file, but we are interested in filtering only NNP, and we create a new list called hamlet_NNP. Then we put one more filter on top: we look for particular names like William and check the contexts in which they are used in this file. If you go down you'll see those contexts, for example where William appears as a proper noun and where another word appears as a verb as well as a noun. Now we import ne_chunk, and we take "US president stays in White House", tokenize it, POS-tag it, and chunk it. We have seen these examples: US is tagged as an organization and White House as a facility. Similarly, for "the state of New York touches the Atlantic Ocean" we learn that New York is a geographical place and Atlantic is tagged as an organization. Similarly, for a sentence about Apple we get a geographical location, a person, and an organization. So, as I told you, this is not always a very reliable way of analyzing sentences, because it will not give you the right results every time. So let me open the chunking notebook. We have imported the word tokenizer and used the regular expression tokenizer, so from nltk.tokenize we import word_tokenize and RegexpTokenizer. We have added the path C:\Program Files\gs\...\bin to the environment variable, which is the first step we have to do; we have to install that gs9.25 release because we want to render how the words form a tree for a sentence (a sketch of the corpus tagging and named-entity chunking steps we just walked through follows below).
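A hedged sketch of the corpus and named-entity chunking steps walked through above, assuming the standard NLTK data packages have been downloaded:

```python
import nltk
from nltk.corpus import gutenberg
from nltk import pos_tag, ne_chunk, word_tokenize

# nltk.download('gutenberg'); nltk.download('maxent_ne_chunker'); nltk.download('words')

print(gutenberg.fileids())                      # all built-in Gutenberg texts
hamlet = gutenberg.words('shakespeare-hamlet.txt')
print(len(hamlet))                              # ~37,360 words

hamlet_pos = pos_tag(hamlet[:2000])             # tag only the first 2,000 words
hamlet_nnp = [w for w, tag in hamlet_pos if tag == 'NNP']   # keep proper nouns only
print(hamlet_nnp[:10])

# named-entity chunking on a fresh sentence
sent = "The US president stays in the White House"
print(ne_chunk(pos_tag(word_tokenize(sent))))
```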
So let me increase the font size. Now we have the sentence "the little mouse ate the fresh cheese", and we pass it to pos_tag. The POS tag is going to categorize whether each word is a noun, an adjective, or a verb. Then we introduce a grammar. The grammar string starts with NP, the chunk label we want, and in the grammar we say we want a determiner, then optionally an adjective (JJ), then a noun; we define this kind of grammar and parse the tagged sentence through a regular expression parser. Once the parser is created, we parse the sentence and see the chunk results. What is chunking doing here? If I run this, it produces a tree-like structure. The whole thing is a sentence, S. The noun phrases are determined internally: "the little mouse" is one noun phrase, "the fresh cheese" is another, and "ate" is a verb. So from whatever grammar we have decided, it internally creates a tree-like structure showing how the two noun phrases are joined: "ate" is the verb, and we get a hierarchy of words, how the noun phrases are joined, and how one noun phrase contains a determiner, a JJ (an adjective), and a noun. Putting it as a clause, we can say that noun phrases here are made up of a determiner, an adjective, and a noun, and the two noun phrases are joined by the verb. You can define your own grammar, saying that we want a sentence in which noun phrases are joined by a verb, and a noun phrase can have a determiner, an adjective, and a noun. You can see that the rule we set, determiner, adjective, noun, is exactly what comes out in the answer: determiner, adjective, noun and determiner, adjective, noun. So depending on the grammar, the sentence has been chunked (a minimal sketch of this noun-phrase grammar follows below). Let's see the next example. Here the sentence is "she is wearing a beautiful dress". We POS-tag it and find the adverbs, adjectives, and so on, then we call chunk_parser.parse on the sentence tokens. We have not defined any grammar in this one, and we still get results: if I run this, it gives me a pronoun, a verb, and again a verb, and then a determiner, adjective, and noun grouped as a noun phrase. So a noun phrase again consists of a determiner, adjective, and noun, and the sentence has a pronoun, verbs, and that noun phrase. It determines the tree-like structure on its own, taking a default grammar as per the English language and creating the tree-like structure for us.
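As a minimal sketch, the noun-phrase chunking grammar described here could look like this; the exact pattern in the video may differ slightly:

```python
import nltk
from nltk import word_tokenize, pos_tag, RegexpParser

sentence = "the little mouse ate the fresh cheese"
tagged = pos_tag(word_tokenize(sentence))

# NP chunk = optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = RegexpParser(grammar)
tree = chunk_parser.parse(tagged)
print(tree)
# roughly: (S (NP the/DT little/JJ mouse/NN) ate/VBD (NP the/DT fresh/JJ cheese/NN))
# tree.draw() would render it graphically (Ghostscript is needed for inline notebook images)
```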
So what is the need for chunking? The question comes, why do we do this chunking? The reason is that we need to figure out how the sentences have been constructed in the whole paragraph or document, and based on those sentences we create a big tree-like structure and filter out all the noun phrases, the adjectives, the prepositions, and the verb phrases, and figure out at what level of the tree they lie, much like when we build a decision tree or a random forest. We figure out how the noun phrases depend on adjectives, how adjectives depend on determiners, how determiners relate to prepositions; we find the dependencies among them and we find the right clusters: these are the sentences where determiner, adjective, and noun come together, these are the sentences where a noun phrase, a verb, and another noun phrase occur, so we can even do clustering-style analysis internally. So chunking is dividing the whole data into individual words and looking at their context, how they come together in a tree-like structure, and the tree-like structure is a complete hierarchy: a noun phrase can consist of, as in the previous case, a determiner, adjective, and noun; most commonly a noun phrase comes as determiner, adjective, noun, as in "the little mouse". Let's take one more example: "she is walking quickly to the mall". We tokenize it and define the grammar as a pronoun (PRP), then a verb in any form (VB, VBD, VBG, VBZ), and then an adverb (RB). So we define our own grammar, saying I want to filter out the words in a tree according to this rule, then I parse this using the regular expression parser and pass the tokens to create a tree. The answer you get is first the PRP, the pronoun, as per our rule, then the verbs in any form, then the adverb, and the rest of the words that do not fit the rule, "to" (TO), a determiner (DT), and a noun (NN), stay outside the chunk. So what we are doing here is classifying a sentence as per the rule we fix and building up a hierarchy. Let's say I change this rule to some extent, say I remove one of the verb tags. Now you can see I have changed the rule: I said I want a pronoun, then a verb, then an adverb. You can see first I get the pronoun, then a verb of one form, then the next verb form such as "walking", and then the adverb "quickly"; the tree changes depending on the grammar. Like in the first example we got a determiner, an adjective, and a noun: whatever the grammar is, it branches out to categorize, the determiner gets one branch, the adjective one branch, the noun another branch, and if a word cannot be placed in a branch, it falls back to the base of the sentence. So in this case we have the pronoun as one branch and the verbs and adverb as another branch. And let's say we don't give any kind of rule: it will parse as per its own defaults. "He drives fast on highways": first the pronoun comes, then the verb, then RB, which is what?
The adverb; RB is the adverb. Then IN comes, which is a preposition, and then the noun comes. And if you had provided some rule, say the same rule that was there at the top, things would have changed; if I had given that rule, this tree would have changed. Now, this is known as chunking; we have discussed chunking in general, so let's go through it in a structured way. Let's consider a scenario: "the little mouse ate the fresh cheese". We convert the sentence into tokens and add POS tags to it. The POS tag tells us what type each word is, whether it's an adjective, a noun, or an adverb. Now we'll create a grammar for the noun phrase and mention the tags that we want in our chunk phrase. Here we have created a regular expression matching this chunk: we want a determiner, then an adjective, then a noun. Next we have to parse the chunk, hence we make the chunk parser and pass our noun phrase string to it; we pass the complete string to a regular expression parser. The parser is now ready, and we use it within our chunk parser to parse our sentence. So first we run the regular expression parser to determine which determiners, adjectives, and nouns join together, and then we pass it to the chunk parser to create a tree for us. The tokens that match our regular expression are chunked together into a noun phrase. We labelled it as a noun phrase, so we get NP here; NP is what we call a noun phrase. The label is simply whatever name you give in the grammar: if I label the pattern as a verb phrase, the chunks come out labelled VP; if I label it NP, they come out as noun phrases. So all the chunks get distinguished by the rule and the label you assign, and if my grammar says a phrase should contain a pronoun, a verb, and an adverb, that is exactly what gets chunked. It is your choice what kind of rule you want to put in, and according to the grammar you put in, the system creates a tree for you. Maybe my label is not linguistically perfect, but it is you, as the teacher, who trains the machine on what a noun phrase is, what a verb phrase is, how the sentence should look, and what it should contain; we need to define that grammar. Next, "she is wearing a beautiful dress": we tokenize it and pass it to the parser. We convert this sentence to tokens, add POS tags, create a chunk parser, and pass a noun phrase pattern into it. The chunk parser automatically figures out what the noun phrase is for us: noun phrases generally have determiners, adjectives, and nouns, and it finds the pronoun and verbs and creates the tree for us. So if you do not have a grammar of your own, you can use the default parser; leave the grammar out and the system will do it for you. Similarly, let's create a verb phrase: we define a verb phrase as a pronoun, then a verb of any form, then an adverb.
I'll create another parser and pass the verb phrase grammar through it, create another sentence, tokenize it, add POS tags, and again pass it to the parser; we get a verb phrase where a pronoun followed by two verbs followed by an adverb is chunked together. So as per the rule I define, the system chunks out that part from the whole sentence. Let's consider another sentence, "He drives fast on highways", and use the default settings: it gives a verb phrase consisting of a pronoun, a verb, and an adverb. So, the first step is to tokenize, with word_tokenize. The second step is to pass it to pos_tag to find out whether each word is a verb, adjective, noun, or pronoun. The third step is to pass it to the chunk parser to make a tree. So there are three steps. Now there's another related process called chinking. Chunking divides the whole big sentence into verb phrases or noun phrases, whatever grammar we have in mind, and splits it into independent chunks of tokens. What is chinking? It helps us define what we want to exclude from a chunk: under chinking we specify a sequence of tokens which should not be included in the chunk. It refines the chunk, a bit like the distinction we saw between stemming and lemmatization: chinking digs further into the chunk, while chunking just divides the words into phrases, verb phrases or noun phrases. Chinking is a deeper dive. Chinking in Python: let's create a chinking grammar string containing three things, the chunk name, the regular expression sequence for the chunk, and the regular expression sequence for the chink, that is, the part we want taken back out of the chunk. So let's say our chunk gathers a pronoun, a verb of any form, and an adverb, and the chink says the adverb should be removed from it. Let's see this example. We create a parser with NLTK's regular expression parser and pass the chinking grammar to it, whatever grammar we have created here. Previously we were passing the chunk grammar; here we pass the chink grammar to the regular expression parser, and we tokenize the sentence. On comparing the syntax tree of the chink parser with that of the original chunk, you can see that the token "quickly", which is the adverb, is chinked out of the chunk. In the previous case we were getting the word "quickly" as part of the chunk, right? The sentence was "she is walking quickly to the mall". When I used the grammar pronoun, verb of any form, adverb with the chunk parser, it was giving "quickly" as part of the verb phrase; you can see it was part of the verb phrase. But when I pass it to the chink parser, the adverb has been moved out of it. Why has it been moved? Because we defined in the grammar that whatever chunk you have, take the adverb out of it. You can see the curly braces that face the opposite way, closing on the outside; inside them we have put one or more adverbs.
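A hedged sketch of a chunk-plus-chink grammar of the kind described here; the exact tag pattern used in the video may differ:

```python
import nltk
from nltk import word_tokenize, pos_tag, RegexpParser

sentence = "She is walking quickly to the mall"
tagged = pos_tag(word_tokenize(sentence))

# {...} defines the chunk; }...{ defines the chink (what to pull back out of the chunk)
chink_grammar = r"""
VP:
    {<PRP><VB.*><VB.*>?<RB>?}   # chunk: pronoun + verb(s) + optional adverb
    }<RB>+{                     # chink: remove any adverbs from the chunk
"""
parser = RegexpParser(chink_grammar)
print(parser.parse(tagged))     # 'quickly' (RB) ends up outside the VP chunk
```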
So whatever adverbs there are will be chinked out from there, and we get the answer with the adverb taken out. If you compare with the previous syntax tree, the adverb was coming inside the verb phrase; now it is not part of the verb phrase, "quickly" sits outside it. We created a parser from NLTK's regular expression parser and passed the chinking grammar to it; if you pass the chinking grammar, it keeps "quickly" out of the chunk, because we gave the rule that we do not want the adverb in the chunk. Now we'll discuss another topic in NLTK for analyzing sentences, known as context-free grammar. This is a big domain, guys; there's a subject called automata theory, and if you research that subject you will find how a system understands formal and natural languages beautifully explained. I'll take one web page and explain what context-free grammar is: the TutorialsPoint page on types of grammar. So let me discuss the different types of grammar. Introduction to grammar: grammar denotes the syntactical rules for conversation in natural languages. Linguists have attempted to define grammars since the inception of natural languages like English, Sanskrit, Mandarin, etc. Noam Chomsky gave a mathematical model of grammar in 1956 which is effective for writing computer languages; he is considered the father of this formal theory of grammars, and there are books on automata theory, from Indian as well as foreign authors, that cover all these rules. What is a grammar? A grammar can be written as a 4-tuple (N, T, S, P), where N is a set of variables or non-terminal symbols, T is a set of terminal symbols, S is a special variable called the start symbol, which belongs to N (so the start symbol is also one of the non-terminal symbols), and P is the set of production rules over the terminals and non-terminals. So in order to form any grammar we should have a set of variables called non-terminal symbols, which are the non-ending ones, some terminals, which are the ending ones, a special variable known as the start symbol, and some rules defining it. A production rule has the form alpha → beta, where alpha and beta are strings over these symbols. For example, take the grammar with non-terminals S, A, B, terminals a, b, start symbol S, and productions S → AB, A → a, B → b. Here S, A, and B are the non-terminal symbols, a and b are the terminal symbols, S is the start symbol, and P is the set of production rules. The rules say that a non-terminal symbol can expand into other non-terminals, and non-terminals eventually expand into terminals. Now let me take this and analyze it more. Starting from S we get AB; in the next step I can replace A with a, and similarly I can replace B with b, so the sentence becomes ab, and that is the final answer. Now, going further down, let's say we have another grammar whose production rules are S → aAb, aA → aaAb, and A → ε, where epsilon means empty. What can you derive from the start symbol? Starting from S we get aAb, and the aA inside it can be rewritten as aaAb, giving aaAbb.
Substituting again, the aA inside aaAbb can be rewritten once more, giving aaaAbbb, and you can keep going like this, level by level; finally the remaining A can be replaced by epsilon, which means empty. So we can get a final string such as aaabbb from this rule, and the same rules generate the whole family of such strings. So let us discuss another important topic in NLTK known as context-free grammar. A context-free grammar is, in layman's terms, a simple grammar in which certain rules describe the possible combinations of words and phrases; we have just seen how such rules describe strings. Formally, a context-free grammar is a 4-tuple (N, Σ, R, S): N is a finite set of non-terminal symbols, Σ is the alphabet of terminal symbols, R is the set of rules, and S is the start symbol. We have seen all of this. It generates a language by capturing constituency and ordering. For example, a sentence consists of a noun phrase and a verb phrase; a noun phrase is a determiner followed by a nominal; a nominal is a noun; a verb phrase is a verb; the determiner is "a", the noun is "flight", and the verb is "left". You could write exactly the same structure with abstract symbols if you like: a start symbol expanding into two non-terminals, each of which eventually expands into terminal words such as "a", "flight", and "left". Either way, it is a set of rules, and based on those rules we create sentences: a sentence contains a noun phrase and a verb phrase, the noun phrase in turn contains certain things, and so on down the tree. This says that there is a set of units S, NP, and VP in the language, and that S consists of an NP followed immediately by a VP. It doesn't say that this is the only kind of S, nor that there is only one place where an NP and a VP can occur, so there can be many combinations, just as we saw with the a's and b's, where any number of strings can be produced. From one rule sheet we can create any number of sentences, from the minimal one to much longer ones. So this is the intention of a context-free grammar. So let's implement this in NLP: let's define a context-free grammar, as sketched below.
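As a rough illustration, the toy flight/left grammar just described can be written with NLTK's CFG class and even parsed; the exact rule set shown in the video may differ slightly:

```python
import nltk

# the small "a flight left" grammar described above
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det Nominal
Nominal -> Noun
VP -> Verb
Det -> 'a'
Noun -> 'flight'
Verb -> 'left'
""")

print(grammar.start())               # S
for production in grammar.productions():
    print(production)                # lists every rule of the grammar

# parse a sentence that the grammar covers
parser = nltk.ChartParser(grammar)
for tree in parser.parse("a flight left".split()):
    print(tree)   # (S (NP (Det a) (Nominal (Noun flight))) (VP (Verb left)))
```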
NLTK has a built-in CFG class, as in the sketch above, and from a string we define the rules: a sentence consists of a noun phrase and a verb phrase; a verb phrase consists of a verb and a noun; a verb can be "saw" or "met"; a noun phrase can be "John" or "Jim"; and a noun can be "dog" or "cat". Basically, with a CFG you can have almost all the permutations of sentences as long as the above conditions are met: the sentence must have a noun phrase followed by a verb phrase, the verb phrase must have a verb and a noun, the verb must be either "saw" or "met", the noun phrase either "John" or "Jim", and the noun either "dog" or "cat". So we define a rule set, and based on those rules we can create any number of sentences. The context-free grammar productions are exactly these: the verb phrase can be a verb and a noun, the verb can be "saw" or "met", the noun phrase can be "John" or "Jim", and the noun can be "dog" or "cat". The productions function lists all of these rules, and from them you can enumerate the permutations and combinations of sentences that the grammar allows. Now that we've learned most of the concepts, let's automate the entire process of text paraphrasing. We create a function that takes a sentence as input, tokenizes it, tags it, and also defines a context-free grammar for that same sentence. This cfg_paraphrase function takes a sentence from you and word-tokenizes it; for each word, if the tag is NNP it wraps the word in single quotes as a terminal, if it is a verb it again wraps it in single quotes, if it is a noun it wraps it in single quotes, and otherwise it passes. So for any word it reads after tokenizing, because we have already done POS tagging, if the tag says it's a proper noun, a verb in past or present tense, or a plain noun, we put it in single quotes, and we define a context-free grammar saying a sentence should have a noun phrase and a verb phrase, the verb phrase contains a noun and a verb, and the noun phrase, verb, and noun can be any of the quoted words, in the same S → NP VP format we defined above. Whenever those nouns and verbs come, we put them in single quotes and we define the rules
Text Classification Explained
for this grammar accordingly. Given a sentence to generate, such as "John saw a long white boat", if you pass it in, the result will be something like "John saw boat". Since the beginning of time, written texts have been a means to communicate, express, and document something of significance. Even in the modern age, it has been proven many times that an individual's writing style can be a defining aspect of their psyche. Ever since social media emerged, microblogging became the new form of writing, expressing, or documenting an event. This also gave rise to a lot of unstructured data, and with it a need to understand that data. This is where text classification can be put to our advantage. Text classification is nothing but classifying unstructured text data into various categories. To further simplify how we use text classification, let's consider an example. You have a product that was launched a while ago, and you have kept track of all the reviews that the product got on all the platforms across the internet. What you now have is unstructured text data, and to do text classification on it you can follow two approaches. One, you can make a few rules where a collection of words decides the sentiment of the input text. This approach can be useful for a handful of data, but analyzing large sets of data this way is neither efficient nor cost-effective. A better approach is making use of natural language processing and classification using machine learning. For this we train a classifier with class-labeled data, and using this model we can classify the input data. When we give input to the classifier, we get an output with the class category based on our trained model, telling us whether the review is bad, average, or good. This way, text classification on user reviews can help us improve the user experience. Data is the new fuel; thus even bad reviews can help us identify the attributes that can improve our upcoming campaigns. Businesses and organizations are following this trend to understand user sentiment and user behavior. Text classification can also be used for applications like spam detection in emails, targeting customer needs, etc. In this day and age, where data is generated every second of the
What is Supervised Learning?
day, text classification becomes an asset for any organization. Supervised learning has many applications, ranging from spam filtering and speech recognition to medical diagnosis, economic predictions, and many more. So guys, without any ado, let's learn the ropes and cover this fundamental of machine learning. Hello everyone, I welcome you all to today's session on what is supervised learning. Let's cover some basic insights about supervised learning followed by its essentials. But before we go ahead, if you haven't already, make sure to subscribe to the Edureka YouTube channel to never miss out on any updates from us. Also, if you're looking for any of the certification courses from Edureka, do check out the link given in the description below. Let's get ahead with our agenda for today. Firstly we see what supervised learning is exactly, then we look at how supervised learning works, followed by the types of supervised machine learning algorithms, and then we cover the advantages of supervised learning followed by its disadvantages. So guys, let's get ahead. The first question that arises is: what is supervised learning? It is a type of machine learning in which a model is trained using training data that has already been labeled. So basically you supervise the learning process by giving the model both the data it needs to learn from and the right answers it should come up with, and the idea is for the model to learn a function that connects the right inputs to the right outputs. This function can then be used to make predictions on data that has not yet been seen. So this is the basic gist of what supervised learning is. Let's go a bit deeper and study how supervised learning works. In order for supervised learning to be effective, a data set of real-world use is required where both the input features and the output labels are already known; learning a function that maps those inputs to those outputs is the goal of supervised learning. So let's look at a step-by-step explanation of how supervised learning functions (a small end-to-end sketch follows after these steps). Firstly, let's start off with collecting labeled data: the first step is to collect a data set with the input features and their associated output labels, so the model will be trained using annotated data points which serve as a road map. The next step is data cleaning and pre-processing: the data is split into training and validation sets, missing values are handled, features can be normalized or scaled, and categorical variables can be encoded. Next, in terms of model selection, an appropriate machine learning method is selected depending on the data and the problem to be solved; this model could be a regression model, which is used to predict continuous outputs, or a classification model, which predicts discrete categories. Fourth, we have training the model: once a model has been selected, it is trained on the training set by minimizing a loss function, which is done by iteratively altering its parameters; this loss function is the metric used when comparing the model's predictions to the true labels in the training data. Lastly, we have model evaluation: a validation set, a subset of the data that was not used for training, is used to evaluate the model; accuracy for classification problems and mean squared error for regression are two examples of evaluation metrics.
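As a minimal end-to-end sketch of these five steps (using the built-in iris data set and logistic regression purely as stand-ins, not the examples from the video):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. labeled data: input features X and known output labels y
X, y = load_iris(return_X_y=True)

# 2. hold out a validation/test set that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 3-4. choose a model and fit it by minimizing its loss on the training set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 5. evaluate on the held-out data (accuracy, since this is classification)
print(accuracy_score(y_test, model.predict(X_test)))
```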
Next, let's look at the types of supervised machine learning algorithms. Here we have two types: classification algorithms and regression algorithms. Let's study them one by one. Firstly, let's talk about classification algorithms. Here we have logistic regression: this is basically used for binary classification, and it models the probability of a binary response based on one or more predictor variables. Then let's talk about the decision tree: by using a tree-like model to make decisions, a decision tree is very helpful, and it can be used for both classification and regression. Next, random forest: it is an ensemble method which uses multiple decision trees, and it aims to improve accuracy by reducing overfitting. Next on the list is the support vector machine: an SVM finds the hyperplane that best divides the classes in the feature space, and it can be kernelized to handle nonlinear boundaries. Then the k-nearest neighbors algorithm, also referred to as the KNN algorithm: this is used to classify a sample based on the majority class of its k nearest neighbors in the feature space. Then we have Naive Bayes, based on applying Bayes' theorem with a strong independence assumption; this can be a particularly effective algorithm in text classification tasks. Lastly, let's talk about neural networks and deep learning models: these are complex models that capture intricate patterns and boundaries, and they include architectures like convolutional neural networks, also known as CNNs, for image classification. Next, let's look at regression algorithms. Firstly on the list we have linear regression: linear regression is used to model the connection between a set of features and a continuous target, and it is very clear and easy to understand. Then we have polynomial regression, a method for modeling relationships that go beyond those captured by linear regression. Next we have support vector regression: SVR tries to fit the best line within a predefined or acceptable error boundary. Next, let's talk about random forest regression and decision trees: by predicting numerical values rather than classes, a decision tree can also be utilized for regression, and combining several trees into one random forest aids more accurate forecasting. Now let's move ahead to the advantages of supervised learning. So guys, there are numerous benefits which make supervised learning a preferred machine learning paradigm. Supervised learning involves training an algorithm with input-output pairs that have already been labeled. So here is the list of advantages which are very essential to supervised learning. Firstly, in terms of anticipating outcomes, supervised learning algorithms have the ability to accurately anticipate outcomes even from incomplete data: they can generalize and anticipate outcomes for new and unknown occurrences by learning from prior data with known results. Next, let's talk about versatility: supervised learning is flexible since it may be used for many different purposes such as classification, regression, time series forecasting, and many more; it's a flexible method that can be applied in many settings. Next is human interpretation.
So guys, in the context of supervised learning algorithms, this refers to the ability to understand and explain the relationship between the input features and the model's predictions or decisions. It means that the output of certain algorithms can be analyzed in a way that provides insight into how each input feature contributes to the final output, which allows humans to make sense of why the model is making certain predictions or decisions. Next on the list we have active learning. This is a technique used by supervised learning models in which they can actively seek out answers to their questions from an oracle, which can be a human expert. Lastly, in terms of handling missing data: when missing values can be deduced from the given data, supervised learning algorithms are better able to handle them than many other machine learning techniques. Now, let's talk about the disadvantages of supervised learning. So guys, supervised learning has some difficulties and drawbacks which are built into many of its models as well. The first limitation is the need for labeled data. What does it mean? It means that each training example must have a label or target value associated with it. Obtaining accurate labels can be challenging or subjective in some domains, which can make the acquisition of labeled data costly and time-consuming. Next we have data quality and bias. The performance of the model is highly sensitive to the quality of the labeled data: it is possible for the model to learn and propagate flaws such as bias and incorrect predictions if the data is noisy, contains errors, or is prejudiced; all of this affects data quality and creates bias. Next, let's talk about overfitting. It is possible for models to overfit the training data, that is, to become too dependent on those examples rather than being able to generalize well to new information. Overfitting can occur when the training data set is too small or when the model's complexity outstrips the available data. Fourth, we have limited generalization. Such a model can generalize well only to new data that follows a distribution similar to the training data; its performance may degrade if it is exposed to data that differs greatly from the training data. Lastly, we have time and resource intensiveness: a high computational cost and a significant time commitment are often associated with training these models, especially for big data sets or complicated models. This can restrict
What is Unsupervised Learning?
their usefulness in applications where time or resources are of the essence. So firstly, let's talk about what exactly unsupervised learning is. You should know that the goal of unsupervised learning, which is a machine learning paradigm, is to have the algorithm discover patterns, structures, or relationships within a data set without any human supervision or labeled instances. Unsupervised learning makes use of data that has not been labeled in any way, and it seeks to identify hidden but present data structures, such as groups, clusters, or patterns, that could otherwise go unnoticed. Next, let's talk about why to use unsupervised learning. The capacity of unsupervised learning to uncover patterns and structure, as discussed, makes it useful for a broad variety of purposes and implementations. So here are some of the most important benefits of unsupervised learning. First on the list we have outlier and anomaly detection, a task that can be accomplished with unsupervised learning. These are out-of-the-ordinary occurrences which should be looked into further, since they could be signs of error, fraud, or anything else; this comes under outlier and anomaly detection. Next, we have clustering algorithms, which can help you find groups of similar data points; this can be beneficial for customer segmentation, product grouping, or even document categorization, and the results can be used to improve marketing and recommender systems. Next on the list is image and text analysis, which are common applications of unsupervised learning. Here, unsupervised learning makes use of dimensionality reduction techniques in image analysis, for instance, which can help in the visualization of high-dimensional image data, and of clustering in text analysis, grouping comparable articles for topic modeling. Next, let's talk about data privacy and anonymization. Unsupervised learning can be used to mask personally identifiable information in data while still preserving the underlying structure and patterns; that is its use for data privacy. Unsupervised learning can also be used for dimensionality reduction, which is useful for streamlining complicated data sets without sacrificing useful information, through feature engineering and reduction techniques. This has the potential to boost model efficiency and speed up later processing stages. Now that we are familiar with why to use unsupervised learning, let's move ahead with the types of unsupervised learning. Here we have two classifications: clustering and association. So firstly, let's talk about clustering, which is a form of unsupervised learning and a process of identifying and grouping data points with shared properties; all the patterns with shared properties are identified in this form. The objective is to discover clusters and groupings in the data that occur naturally, without the use of arbitrary labels. Finding hidden patterns or structures in your data is a common use case for clustering. Let's take, for example, a data set of customers' in-store actions: cluster analysis can be used to group clients into subsets with similar buying habits (a minimal sketch of this idea follows below).
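A minimal sketch of this customer-segmentation idea with k-means; the tiny data set here is made up purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# toy "shopping behaviour" table: [visits per month, average basket size]
customers = np.array([
    [2, 80], [3, 75], [2, 90],      # infrequent visitors with large baskets
    [12, 15], [15, 10], [11, 20],   # frequent visitors with small baskets
])

# no labels are supplied; KMeans discovers the two groups on its own
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)            # cluster id assigned to each customer
print(kmeans.cluster_centers_)   # the "typical" customer of each cluster
```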
For instance, customers who tend to buy one type of goods more often than others might form one cluster, while those who tend to buy a wide variety of things might form another. This is what cluster analysis gives you: it lets you find distinct subsets without having to label them in advance. Now let's talk about association. In unsupervised learning, association is the process of discovering relationships between variables in a data set without having access to labels. (In supervised learning, where the labels are known, the objective is instead to train a model to correctly predict the labels for fresh data.) Say you have information about grocery shoppers' habits. You might find patterns like "customers who buy bread also often buy milk"; this could be discovered by association rule mining. These connections are useful because they shed light on consumer habits and can inform advertising decisions, such as placing related products in close proximity to one another to promote cross-selling. Now that we have taken a good look at clustering and association, let's move ahead with how unsupervised learning works. For this learning to be effective, the data being studied must be completely unlabeled and unclassified, and large amounts of data are needed. Data scientists start the process by training algorithms on training data sets that do not have labeled or categorized data points. The learning objective of the method is to discover patterns in the data set and classify the points according to those patterns. For example, an unsupervised learning system may be trained to recognize characteristic traits like whiskers and long tails by analyzing photos of cats, without any requirement for labels. Unsupervised learning improves corporate insights and decision making related to customer behavior. Next, let's move on to the advantages of unsupervised learning. The first advantage is that no labels are required: since unsupervised learning doesn't need labeled data, it can be applied to a wide variety of data sets for which getting labels would be inconvenient and comparatively expensive. Next, in terms of exploring the data, unsupervised learning can unearth structures and patterns in the data that could otherwise remain concealed from human eyes. Next, we have feature discovery: by using unsupervised methods we may better understand the data, and the relevant features discovered can improve the accuracy and efficiency of subsequent machine learning models. Moving on, we have anomaly detection, which can be useful for spotting fraud, errors and out-of-the-ordinary behavior; anomaly detection seeks out and highlights these outlying data points. Lastly, in terms of reduced bias, unsupervised learning can help reduce the biases that human labeling can introduce, because it does not rely on human-labeled data. Finally, moving on to the problems of unsupervised learning. First we have the lack of ground truth: since unsupervised learning does not use labeled data, there is no absolute standard by which to judge the quality of the model's predictions, which can make it hard to evaluate the effectiveness of the discovered clusters and patterns. Next we have the challenge of subjective interpretation.
So guys, unsupervised learning algorithms discover patterns or clusters that may not lend themselves to easy, objective interpretation; the context and the individual's perspective play a role in determining what these patterns mean. Lastly, we have overfitting. Because unsupervised learning lacks the supervision of labeled data, it may produce patterns or clusters that do not transfer to unseen data. Overfitting can occur when a model
Decision Tree Algorithm
learns the training data too specifically. So guys, these were quite a few points about unsupervised learning. Now, what is classification? I hope every one of you has used Gmail. How do you think a mail gets classified as spam or not spam? Well, that's nothing but classification. So what is it? Classification is the process of dividing a data set into different categories or groups by adding labels; in other words, it is a technique for categorizing observations into different categories. Basically, you take the data, analyze it, and on the basis of some condition you finally divide it into various categories. Now, why do we classify? We classify in order to perform predictive analysis. When you get a mail, the machine predicts whether it is spam or not, and on the basis of that prediction it adds the spam mail to the respective folder. In general, classification algorithms handle questions like "does this data belong to category A or category B?" — is this a male or a female, and so on. Now the question arises: where will you use it? Well, you can use it for fraud detection, to check whether a transaction is genuine or not. Suppose I'm using a credit card here in India, and for some reason I have to fly to Dubai. If I use the credit card over there, I'll get a notification alert asking me to confirm the transaction. This too is a kind of predictive analysis: the machine predicts that something fishy is going on, because 24 hours ago the same credit card was used for a transaction in India and 24 hours later it is being used for a payment in Dubai. So in order to confirm, it sends a notification alert. This is one use case of classification. You can even use it to classify different items, like fruits, on the basis of taste, color, size or weight. A machine well trained with a classification algorithm can easily predict the class or type of fruit whenever new data is given to it. And not just fruit — it can be any item: a car, a house, a signboard or anything. Have you noticed that when you visit some sites or try to log in, you get a picture captcha, where you have to identify whether the given images are of a car or of a pole? For example, there are ten images and you select three of them; in a way you are training the machine, telling it that these three are pictures of a car and the rest are not. Who knows, you might be training it for something big. Moving on, let's discuss the types of classification. There are several different ways to perform the same task. For example, in order to predict whether a given person is male or female, the machine has to be trained first, but there are multiple ways to train it and you can choose any one of them. For predictive analytics there are many different techniques, but the most common of them all is the decision tree, which we'll cover in depth in today's session. As part of the classification algorithms we have decision trees, random forest, naive Bayes, k-nearest neighbors, logistic regression, linear regression, support vector machines and so on. There are many. Below is a tiny sketch of the Gmail spam idea from earlier, before we look at a few of these in detail.
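This is a minimal sketch using scikit-learn; the toy messages, and the choice of a naive Bayes model for the text, are assumptions for illustration, not from the course.

# A tiny spam / not-spam sketch, mirroring the Gmail example above.
# The toy messages are made up; real spam filters are far more involved.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "claim your free reward",
            "meeting at 10 am tomorrow", "project report attached"]
labels = ["spam", "spam", "not spam", "not spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)          # bag-of-words features
model = MultinomialNB().fit(X, labels)

print(model.predict(vectorizer.transform(["free prize inside"])))  # -> ['spam']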
So let me give you an idea about a few of them, starting with the decision tree. A decision tree is a graphical representation of all the possible solutions to a decision, and the decisions that are made can be explained very easily. For example, here is a task: should I go to a restaurant or should I buy a hamburger? You're confused about it, so you create a decision tree. Starting with the root node, first you check whether you are hungry or not. If you're not hungry, just go back to sleep. If you are hungry and you have $25, you decide to go to the restaurant, and if you're hungry and you don't have $25, you just go and buy a hamburger. That's it — this is a decision tree. Now moving on, let's see what a random forest is. A random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. Most of the time a random forest is trained with the bagging method, which is based on the idea that combining learning models increases the overall result: if you combine the learning from different models and club them together, it improves the overall result. One more thing: if the size of your data set is huge, a single decision tree would lead to an overfit model, in the same way that a single person might have only their own limited perspective on a very large population. However, if we implement a voting system and ask different individuals to interpret the data, we can capture the patterns in a much more meticulous way. Even from the diagram you can see that in section A we have a large training data set; we first divide the training data set into n subsamples and create a decision tree for each subsample. Then, in part B, we take the vote of every decision made by every decision tree, and finally we combine the votes to get the random forest decision. Let's move on. Next we have naive Bayes. Naive Bayes is a classification technique based on Bayes' theorem. It assumes that the presence of any particular feature in a class is completely unrelated to the presence of any other feature. Naive Bayes is a simple and easy-to-implement algorithm, and due to its simplicity it might outperform more complex models when the size of the data set is not large enough. A classical use case of naive Bayes is document classification, in which you determine whether a given text corresponds to one or more categories. In the text case, the features used might be the presence or absence of keywords. So this was about naive Bayes. From the diagram you can see that, using naive Bayes, we have to decide whether we have a disease or not. First we check the probability of having the disease and of not having the disease: the probability of having the disease is 0.1, while the probability of not having the disease is 0.9. First, let's consider the case where we have the disease and we go to the doctor. The probability of a positive test when you have the disease is 0.80, and the probability of a negative test when you actually have the disease is 0.20 — that is a false negative, as the test comes back negative but you still have the disease.
So that's a false negative. Now let's move on to the case where you don't have the disease at all. The probability of not having the disease is 0.9. If you visit the doctor and the test says yes, you have the disease, but you actually don't, that's a false positive: the probability of a positive test when there is no disease is 0.1. And the probability of a negative test when there is actually no disease is around 0.90 — the same as the probability of not having the disease — and since the test agrees with reality, that is a true negative. All right, so let's move on and discuss the KNN algorithm, or k-nearest neighbors. It stores all the available cases and classifies new cases based on a similarity measure. The K in KNN is the number of nearest neighbors we wish to take a vote from. For example, if K = 1, the object is simply assigned to the class of its single nearest neighbor. From the diagram you can see the difference in the result when K = 1, K = 3 and K = 5. Modern systems are now able to use k-nearest neighbors for visual pattern recognition, for example to scan and detect hidden packages in the bottom bin of a shopping cart at checkout. If an object is detected that exactly matches an object listed in the database, the price of the spotted product could even automatically be added to the customer's bill. While this automated billing practice is not used extensively at this time, the technology has been developed and is available for use. One more thing: k-nearest neighbors is also used in retail to detect patterns in credit card usage. Many new transaction-scrutinizing software applications use KNN algorithms to analyze register data and spot unusual patterns that indicate suspicious activity. For example, if register data indicates that a lot of customer information is being entered manually rather than through automated scanning and swiping, this could indicate that the employees using the register are stealing customers' personal information. Or if register data indicates that a particular good is being returned or exchanged multiple times, this could indicate that employees are misusing the return policy or trying to make money from fake returns. So this was about the KNN algorithm. Since our main focus for this session will be the decision tree, let me first tell you why we chose the decision tree to start with. Decision trees are really very easy to read and understand; they belong to one of the few models that are interpretable, where you can understand exactly why the classifier has made a particular decision. And let me tell you a fact: for a given data set, you cannot say in advance that one algorithm performs better than another. You cannot say that a decision tree is better than naive Bayes, or that naive Bayes performs better than a decision tree; it depends on the data set. You have to apply a hit-and-trial method with the algorithms one by one and then compare the results — the model which gives the best result is the one you should use for your data set.
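Before we move on to decision trees, here is a minimal sketch of the k-nearest-neighbors idea described above, using scikit-learn; the toy points and the choice of k are assumptions for illustration only.

# A minimal k-nearest neighbors sketch with scikit-learn.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1],      # class 0
     [6, 6], [6, 7], [7, 6]]      # class 1
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 neighbors vote
knn.fit(X, y)
print(knn.predict([[2, 2], [6, 5]]))        # -> [0 1]

All right, so let's start with: what is a decision tree?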
Well, a decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions. You might be wondering why this thing is called a decision tree. It is called so because it starts from a root and then branches off into a number of solutions, just like a tree: a tree starts from a root and grows its branches as it gets bigger and bigger. Similarly, a decision tree has a root that keeps growing with an increasing number of decisions and conditions. Now, let me give you a real-life scenario most of you must have experienced. Whenever you dial the toll-free number of your credit card company, it redirects you to an intelligent computerized assistant that asks you questions like "press one for English, press two for Hindi, press three for this, press four for that." Once you select one, it redirects you to another set of questions, and this keeps repeating until you finally get to the right person. You might think you are caught in voicemail hell, but what the company was actually doing was just using a decision tree to get you to the right person. Now, I'd like you to focus on this particular image for a moment. On this slide you can see an image where the task is: should I accept a new job offer or not? To decide that, you create a decision tree. The root node, or base condition, is that the basic salary should be at least $50,000; if it is not, you are not accepting the offer at all. If the salary is greater than $50,000, you further check whether the commute is more than an hour. If it is, you just decline the offer; if it is less than an hour, you are getting closer to accepting the job offer. Then you check whether the company offers free coffee. If it doesn't, you decline the offer, and if it does, you happily accept it. This is just an example of a decision tree. Now let's move ahead and understand a decision tree using a sample data set that I'll use to explain things. In this data set, each row is an example; the first two columns provide the features, or attributes, that describe the data, and the last column gives the label, or the class we want to predict. If you like, you can modify this data by adding additional features and more examples, and our program will work in exactly the same way. Now, this data set is pretty straightforward except for one thing. I hope you noticed that it is not perfectly separable: the second and fifth examples have the same features but different labels — both have yellow as their color and a diameter of three, but the labels are mango and lemon. Let's move on and see how the decision tree handles this case. In order to build the tree, we'll use a decision tree algorithm called CART, which stands for Classification And Regression Trees. Let's see a preview of how it works.
All right, to begin with, we'll add a root node for the tree. All nodes receive a list of rows as input, and the root receives the entire training data set. Each node asks a true/false question about one of the features, and in response to that question we split, or partition, the data set into two subsets. These subsets then become the input to the two child nodes we add to the tree. The goal of the questions is to gradually unmix the labels as we proceed down the tree, or in other words, to produce the purest possible distribution of labels at each node. For example, if the input to a node contains only a single type of label, we could say it is perfectly unmixed — there is no uncertainty about the type of label, as it consists of only grapes. On the other hand, if the labels at a node are still mixed up, we ask another question to drill down further. But before that, we need to understand which question to ask and when, and to do that we need to quantify how much a question helps to unmix the labels. We can quantify the amount of uncertainty at a single node using a metric called Gini impurity, and we can quantify how much a question reduces that uncertainty using a concept called information gain. We'll use these to select the best question to ask at each point, and then we'll iterate: we'll recursively build the tree on each of the new nodes and continue dividing the data until there are no further questions to ask, at which point we finally reach a leaf. So this is how a decision tree is built. In order to create a decision tree, first of all you have to identify the different questions you can ask of the data — like "is the color green?", "is the diameter greater than or equal to three?", "is the color yellow?" — and these questions are determined by your data set. So if the question is "is the color green?", the data is divided into two parts: on the true side we have the green mango, while on the false side we have the lemon and the other mango. Now let's move on and understand the decision tree terminologies. Starting with the root node: the root node is the base node of the tree; the entire tree starts from it. In other words, it is the first node of the tree, it represents the entire population or sample, and this entire population is further segregated or divided into two or more homogeneous sets. Next is the leaf node. A leaf node is reached at the end of the tree, when you cannot segregate any further — that is the leaf node. Next is splitting: splitting is dividing the root node, or any node, into different sub-parts on the basis of some condition. Then comes the branch, or subtree, which is formed when you split the tree: when you split a root node, it gets divided into two branches or two subtrees. Next is the concept of pruning. You can say that pruning is just the opposite of splitting: here we are removing sub-nodes of a decision tree. We'll see more about pruning later in this session. Next is the parent and child node.
Well, the root node is always a parent node, and all other nodes associated with it are known as child nodes. You can understand it this way: a node that produces a further node is a parent node, and the node it produces is its child node. Simple concept, right? Now let's use the CART algorithm and design a tree manually. First of all, you have to decide which question to ask and when. So let's first visualize the decision tree we'll be creating manually, and have a look at the data set. You have outlook, temperature, humidity and windy as your different attributes, and on the basis of those you have to predict whether you can play or not. So which one of them should you pick first? The answer: determine the best attribute, the one that best classifies the training data. So how will you choose the best attribute, or how does a tree decide where to split, or how will the tree decide its root node? Before we move on and split the tree, there are some terms you should know. First is the Gini index. The Gini index is the measure of impurity (or purity) used in building a decision tree with the CART algorithm. Next is information gain. Information gain is the decrease in entropy after a data set is split on the basis of an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain — you'll be selecting the node that gives you the highest information gain. Next is reduction in variance, a criterion used for continuous target variables, that is, regression problems: the split with the lower variance is selected as the criterion to split the population. In general terms, what do we mean by variance? Variance is how much your data varies; if your data is more pure, the variation is less, as the data points are very similar. So this is also a way of splitting a tree. Next is Chi-square, which is used to find the statistical significance of the differences between the sub-nodes and the parent node. Now, the main question is: how will you decide the best attribute? For now, just understand that you need to calculate something called information gain, and the attribute with the highest information gain is considered the best. I know your next question might be, what is information gain? But before we see exactly what information gain is, let me first introduce a term called entropy, because it is used in calculating information gain. Entropy is just a metric that measures the impurity of something; in other words, it's the first thing to compute when solving a decision tree problem. I mentioned impurity here, so let's understand what impurity is. Suppose you have a basket full of apples and a bowl full of identical labels that all say "apple." If you are asked to pick one item from the basket and one label from the bowl, the probability of getting an apple and its correct label is one, so in this case you can say the impurity is zero. All right.
Now, what if there were four different fruits in the basket and four different labels in the bowl? Then the probability of matching the fruit to the correct label is obviously not one; it's something less than that. It could well happen that I pick a banana from the basket and the label I randomly pick from the bowl says cherry — any random combination is possible. So in this case I'd say the impurity is non-zero. I hope the concept of impurity is clear. Coming back to entropy: as I said, entropy is the measure of impurity. From the graph on your left, you can see that when the probability is zero or one — that is, when the sample is completely pure, containing only one class — the value of entropy is zero, and when the probability is 0.5 the value of entropy is maximum. So what is impurity? Impurity is the degree of randomness — how random the data is. If the data contains only yes, the randomness is zero, and if it contains only no, the entropy is again zero. A question like "why is the value of entropy maximum at 0.5?" might arise in your mind, so let me derive it mathematically. As you can see on the slide, the formula for entropy is: Entropy(S) = −P(yes) × log2 P(yes) − P(no) × log2 P(no). Let's see what this graph has to say mathematically. Suppose S is our total sample space and it is divided into two parts, yes and no — like in our data set, where the result for playing is either yes or no, which is what we have to predict. For that case you define the entropy of the total sample space S with the formula above, where P(yes) is the probability of yes and P(no) is the probability of no. Now, if the number of yes equals the number of no — that is, P(yes) = P(no) = 0.5 — then the value of entropy is one; just put the values into the formula. And if the sample contains all yes or all no — that is, the probability is either one or zero — then the entropy is zero. Let's verify this mathematically, one case at a time. Start with the first condition, where the probability is 0.5: P(yes) = P(no) = 0.5, or in other words, yes plus no makes up the total sample space. When you put these values into the formula and calculate, you get the entropy of the total sample space as 1. Now the next case: all yes or all no. If you have all yes, then you have zero no, so P(yes) = 1 and yes is the total sample space. Putting that into the formula, you get Entropy(S) = −1 × log2(1), and since log2(1) equals 0,
the whole thing results in zero. Similarly for the all-no case: there too you get the entropy of the total sample space as zero. So this was all about entropy. Next: what is information gain? Information gain measures the reduction in entropy, and it decides which attribute should be selected as the decision node. If S is our total collection, then information gain = Entropy(S) minus the weighted average of the entropy of each feature value. Don't worry, we'll see how to calculate it with an example. Let's manually build a decision tree for our data set. The data set consists of 14 instances, of which 9 are yes and 5 are no. We have the formula for entropy: the probability of yes is 9/14 and the probability of no is 5/14, and when you put the values in and calculate, you get the entropy of the data set as 0.94. That was the first step: compute the entropy for the entire data set. Now you have to decide which of outlook, temperature, humidity and windy should be the root node. Big question, right? How will you decide which node should be chosen as the base node on which the entire tree is built? You have to do it one by one: calculate the entropy and information gain for each candidate node. Starting with outlook. Outlook has three different values: sunny, overcast and rainy. First, count how many yes and no there are for each value. For sunny, there are in total 2 yes and 3 no. For overcast, it is all yes — if it is overcast, we always go and play. And for rainy, there are 3 yes and 2 no. Next, we calculate the entropy for each value of the feature. Here we are assuming that outlook is our root node and calculating the information gain for it. Remember the formula: information gain = entropy of the total sample space minus the weighted average of the entropy of each feature value. So first we calculate the entropy of outlook when it is sunny: 2 yes and 3 no, so the probability of yes is 2/5 and the probability of no is 3/5, and the entropy of sunny comes out to 0.971. Next, the entropy for overcast: when it is overcast it is all yes, so the probability of yes is 1, and the entropy is zero. And when it is rainy, there are 3 yes and 2 no, so the probability of yes is 3/5 and the probability of no is 2/5, and the entropy of rainy is again 0.971. Now you have to calculate how much information you are getting from outlook, which uses the weighted average.
So what is this weighted average? Each branch is weighted by its share of the total number of yes and no. The information from outlook starts with 5/14 — where does this 5 come from? It is the number of samples for which outlook is sunny (2 yes and 3 no), so the weight for sunny is 5/14, and this is multiplied by the entropy of that branch: 5/14 × 0.971. That was the calculation for the information when outlook equals sunny, but outlook also takes the values overcast and rainy, so we do the same for those. For overcast the weight is 4/14, multiplied by its entropy, which is zero; and for rainy it is again 5/14 (3 yes and 2 no) multiplied by its entropy, 0.971. Finally we take the sum of all of them, which equals 0.693. Next we calculate the information gained. What we computed just now is the information taken from outlook; now we compute how much information we gain from outlook, which equals the total entropy minus the information taken from outlook: 0.94 − 0.693, so the information gained from outlook comes to 0.247. Next, let's assume that windy is our root node. Windy has two values, false and true. Let's see how many yes and no there are for each: when windy is false, there are 6 yes and 2 no, and when it is true, there are 3 yes and 3 no. So let's calculate the information taken from windy, and then the information gained from windy. First, the entropy of each value, starting with windy = true. For true we have an equal number of yes and no — remember the graph: when the probability is 0.5, the entropy is maximum, equal to 1 — so we can directly write the entropy of true as 1. Next is the entropy of windy = false. Put the probabilities of yes and no into the formula: there are 6 yes and 2 no, so the probability of yes is 6/8 and the probability of no is 2/8, and when you calculate it you get the entropy of false as 0.811. Now let's calculate the information from windy. The total information taken from windy is the weighted average for the false branch plus the weighted average for the true branch: 8/14 × 0.811 + 6/14 × 1. What is this 8? 8 is the total number of yes and no when windy is false: 6 yes plus 2 no, which sums to 8, so the weight is 8/14. Similarly, the weight for windy = true is 3 yes plus 3 no, that is 6 out of the total 14 samples, or 6/14, multiplied by the entropy of true, which is 1. So the information taken from windy is 8/14 × 0.811 + 6/14 × 1, which results in 0.892. Now, how much information are we gaining from windy? The information gained from windy equals the total entropy minus the information taken from windy: 0.94 − 0.892 = 0.048. You can quickly sanity-check these numbers in Python, as shown below.
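This is just a quick verification helper based on the yes/no counts above; it is not code from the course.

# Quick sanity check of the entropy / information gain numbers above.
from math import log2

def entropy(yes, no):
    total = yes + no
    e = 0.0
    for count in (yes, no):
        if count:                      # 0 * log2(0) is taken as 0
            p = count / total
            e -= p * log2(p)
    return e

total_entropy = entropy(9, 5)                      # ~0.940

# Outlook: sunny (2 yes, 3 no), overcast (4 yes, 0 no), rainy (3 yes, 2 no)
info_outlook = (5/14) * entropy(2, 3) + (4/14) * entropy(4, 0) + (5/14) * entropy(3, 2)
print(round(total_entropy - info_outlook, 3))      # gain ~0.247

# Windy: false (6 yes, 2 no), true (3 yes, 3 no)
info_windy = (8/14) * entropy(6, 2) + (6/14) * entropy(3, 3)
print(round(total_entropy - info_windy, 3))        # gain ~0.048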
In the same way, we calculate the gains for the remaining two attributes. So, for outlook the information was 0.693 and the information gain was 0.247; for temperature the information was around 0.911 and the information gain was 0.029; for humidity the information gain was 0.152; and for windy the information gain was 0.048. We select the attribute with the maximum gain. So we select outlook as our root node, and it is further subdivided into three parts: sunny, overcast and rainy. For overcast we have seen that it consists of all yes, so we can consider it a leaf node. But for sunny and rainy it's still doubtful, as they contain both yes and no, so you need to recalculate: for each of those nodes you again select the attribute with the maximum information gain. And this is how your complete tree will look. So let's see when you can play. You can play when the outlook is overcast — in that case you always play. If the outlook is sunny, you drill down further to check the humidity: if the humidity is normal, you play; if it is high, you don't. If the outlook is rainy, you further check whether it's windy: if the wind is weak, you go and play, but if the wind is strong, you don't. This is how your entire decision tree looks at the end. Okay, now comes the concept of pruning. What is pruning? Pruning is nothing but cutting down nodes in order to get the optimal solution; it reduces the complexity of the tree. As you can see on the screen, after pruning the tree shows only the paths that lead to a "yes," that is, all the results that say you can play. Before we drill down to our practical session, a common question might come to your mind: is a tree-based model better than a linear model? You might think, if I can use logistic regression for a classification problem and linear regression for a regression problem, then why is there a need for a tree at all? Many people have this question in mind, and it's a valid one. As I said earlier, you can use any algorithm; it depends on the type of problem you're solving. Let's look at some key factors that will help you decide which algorithm to use. The first point: if the relationship between the dependent and independent variables is well approximated by a linear model, then linear regression will outperform a tree-based model. Second: if there is high non-linearity and a complex relationship between the dependent and independent variables,
a tree model will outperform a classical regression model. Third: if you need to build a model that is easy to explain to people, a decision tree will always do better than a linear model, as decision tree models are simpler to interpret than linear regression. All right. Now let's move on and see how you can write a decision tree classifier from scratch in Python using the CART algorithm. For this I'll be using a Jupyter notebook with Python 3 installed. So let's open Anaconda and launch the Jupyter notebook. This is the Anaconda Navigator; I'll jump over to Jupyter Notebook and hit the launch button. I guess everyone knows that Jupyter Notebook is a web-based interactive computing environment where you can run your Python code; it opens on my localhost, and I'll be using it to write my decision tree classifier in Python. For this classifier I have already written the code, so let me explain it piece by piece. We start by initializing our training data set. This is the sample data set in which each row is an example; the last column is the label and the first two columns are the features. If you want, you can add more features and examples for practice. An interesting point is that this data set is designed so that the second and fifth examples have almost the same features but different labels — we'll see how the tree handles this case. As you can see here, the second and fifth examples have the same features; what they differ in is just their label. Next, we add some column labels; they are used only when printing the tree. We add a header for the columns: the first column is color, the second is diameter and the third is the label column. Next, we define a function, unique_vals, to which we pass the rows and a column number; it finds the unique values for that column in the data set. For example, passing the training data and column 0 gives the unique values of color, and passing the training data with column 1 gives the unique values of diameter. Next, we define a function called class_counts and pass the rows into it; it counts the number of examples of each type in the data set — in other words, the unique values of the label and how often each appears. If we pass the entire training data set to class_counts, it finds all the different labels in it; as you can see, the unique labels are mango, grape and lemon. Next, we define a function is_numeric and pass a value to it; it simply tests whether the value is numeric, returning whether the value is an integer or a float. For example, if you pass 7 to is_numeric, it is an integer, so the test passes; if you pass "Red," it is not a numeric value.
Moving on, we define a class named Question. This class is used to partition the data set: it just records a column number (for example, 0 for color) and a column value (for example, Green). Next, we define a match method, which is used to compare the feature value in an example to the feature value stored in the question. First we define an init method, passing self, the column and the value as parameters; then we define match, which compares the feature value in an example to the feature value in this question. Next, we define a repr method, which is just a helper to print the question in a readable format. Next, we define a function, partition, which is used to partition the data set: for each row in the data set it checks whether the row matches the question; if it does, it adds the row to the true rows, and if not, it adds it to the false rows. For example, let's partition the training data set based on whether the rows are red or not. We call Question with a column of 0 and the value Red; it assigns all the red rows to true_rows and everything else to false_rows. Next, we define a gini impurity function and pass it a list of rows; it calculates the Gini impurity for that list of rows. Next, we define a function called info_gain: it calculates the information gain as the uncertainty of the starting node minus the weighted impurity of the two child nodes. The next function is find_best_split, which finds the best question to ask by iterating over every feature and value and calculating the information gain. For a detailed explanation, you can find the code in the description given below. Next, we define a class called Leaf for classifying the data; it holds a dictionary mapping each class (like mango) to how many times it appears in the rows of the training data that reach this leaf. Next is the Decision_Node class: it asks a question and holds a reference to the question and to the two child nodes, and on the basis of it we decide which node to attach to which branch. Next, we define a function, build_tree, to which we pass the rows; this is the function used to build the tree, using everything we defined earlier. We start by partitioning the data set for each unique attribute value, calculate the information gain, and keep the question that produces the highest gain, and on the basis of that we split the tree. Now, if the gain equals zero, we return a Leaf containing those rows.
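Putting the pieces described so far together, here is a rough, simplified sketch of what these helpers might look like. It is an approximation for illustration only — the actual code is linked in the description below — and a plain dictionary stands in for the Leaf and Decision_Node classes; questions are kept as simple (column, value) pairs.

# A simplified sketch of the CART helpers: partition, gini, info_gain,
# find_best_split and a recursive build_tree, run on the toy fruit data.
training_data = [
    ['Green', 3, 'Mango'],
    ['Yellow', 3, 'Mango'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
]

def class_counts(rows):
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return counts

def match(question, row):
    col, value = question
    if isinstance(value, (int, float)):      # numeric feature -> ">=" test
        return row[col] >= value
    return row[col] == value                 # categorical feature -> "==" test

def partition(rows, question):
    true_rows = [r for r in rows if match(question, r)]
    false_rows = [r for r in rows if not match(question, r)]
    return true_rows, false_rows

def gini(rows):
    impurity = 1.0
    for label, count in class_counts(rows).items():
        impurity -= (count / len(rows)) ** 2
    return impurity

def info_gain(true_rows, false_rows, current_uncertainty):
    p = len(true_rows) / (len(true_rows) + len(false_rows))
    return current_uncertainty - p * gini(true_rows) - (1 - p) * gini(false_rows)

def find_best_split(rows):
    best_gain, best_question = 0.0, None
    current = gini(rows)
    for col in range(len(rows[0]) - 1):                 # skip the label column
        for value in {row[col] for row in rows}:
            true_rows, false_rows = partition(rows, (col, value))
            if not true_rows or not false_rows:
                continue
            gain = info_gain(true_rows, false_rows, current)
            if gain > best_gain:
                best_gain, best_question = gain, (col, value)
    return best_gain, best_question

def build_tree(rows):
    gain, question = find_best_split(rows)
    if gain == 0:                                       # no useful question left
        return class_counts(rows)                       # leaf: label counts
    true_rows, false_rows = partition(rows, question)
    return {'question': question,
            'true': build_tree(true_rows),
            'false': build_tree(false_rows)}

print(build_tree(training_data))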
So if you get zero for the gain, then since no further useful question can be asked, the function returns a Leaf. Otherwise, true_rows and false_rows are obtained by partitioning the rows with the question: if we have reached this point, we have already found a feature and value to partition the data set on. Then we recursively build the true branch, similarly recursively build the false branch, and return a Decision_Node, passing it the question, the true branch and the false branch. This question node records the best feature and value to ask about at this point. Now that we have built our tree, next we define a print_tree function, which is used to print the tree. Next is the classify function, which decides whether to follow the true branch or the false branch by comparing the feature values stored in the node to the example we are considering. And finally we print the prediction at the leaf. Let's execute it and see. Okay, this is our testing data, and we have printed the leaf as well. Now that we have trained the algorithm with our training data set, it's time to test it, so let's run it on the testing data and see the result. This is the output you get: the first question asked by the algorithm is "is the diameter greater than or equal to three?" If it is true, it further asks whether the color is yellow; if that is also true, it predicts mango with a count of one and lemon with a count of one, and if it is false, it just predicts mango. That was the true branch; if the diameter is not greater than or equal to three, it is false, and it just predicts grape. Okay, so this was all about the coding part. Now let's conclude this session — but before concluding, let me show you one more thing. There is a scikit-learn algorithm cheat sheet which tells you which algorithm to use and when, and it is itself built in a decision tree format. Let's see how it works. The first condition checks whether you have more than 50 samples. If your sample count is greater than 50, it moves ahead; if it is less, you need to collect more data. If your sample count is greater than 50, you then decide whether you want to predict a category or not. If you do, you check whether you have labeled data: if you do, it is a classification problem, and if you don't, it is a clustering problem. Now, if you don't want to predict a category, then what do you want to predict? If you want to predict a quantity, it is a regression problem. If you don't want to predict a quantity and you want to keep looking
Random Forest
further, then you should go for dimensionality reduction, and if you don't want to keep looking and predicting structure is not working, then you have tough luck. Okay. So now let's understand what a random forest is. A random forest is constructed using multiple decision trees, and the final decision is obtained by a majority vote of these decision trees. Let me make this very simple with an example. Suppose we have three independent decision trees — here we are taking just three — and I have an unknown fruit and I want these trees to tell me what exactly this fruit is. I pass the fruit to the first decision tree, the second decision tree and the third decision tree. A random forest is nothing but a combination of these decision trees, so their results are fed into the random forest algorithm. It sees that the first decision tree classifies the fruit as a peach, the second says it is an apple and the third says it is a peach. The random forest classifier says: I've got two votes for peach and one for apple, so the unknown fruit is a peach. This is based on majority voting among the decision trees, and that is how a random forest classifier comes to a decision when predicting an unknown value. This was a classification problem, so it took the majority vote; if it had been a regression problem, it would have taken the mean instead. Now, before going further, we should note that the building blocks of a random forest are decision trees, and that's why studying the decision tree is important: if we understand one decision tree, we can apply the same concept to the random forest. So let us understand the decision tree. A decision tree basically has three kinds of nodes, and they are important. The first is the root node: as the name suggests, the entire data set is fed in at the root node. Then there are decision nodes, where decisions are taken and splitting is performed. And then we have the leaf nodes, which are the end points of the tree where no further division takes place — we can say that the predictions are made at the leaf nodes. Another thing to note is that decision nodes provide links to the leaf nodes, and a decision tree breaks the data set into smaller and smaller subsets: splitting is done at the nodes, and at the end of the tree the final decision or prediction is made. Now let's construct a decision tree, taking the example of penguin classification. Let me walk you through this penguin classification problem. We have three species of penguin; let's get familiar with them. This is the Gentoo, this is the Adelie and this is the Chinstrap species. These are penguin species of Antarctica, found on different islands, and we have to classify these penguin species correctly. We'll be using random forest here, but for convenience let's first work with a single decision tree and see how it classifies these species. That's really interesting, so let's move forward and understand some parts of this penguin, because we'll be working with this data set.
So this is a penguin, and these are the head, bill, flippers, belly and claws — the different body parts of the penguin. We are mainly concerned with the bill, the flippers and the body mass of the penguin, because our data set mostly contains these features, so make sure you understand what the flippers and the bill of a penguin are. Now let's construct a decision tree and see how it is built. I have taken a subset of the penguin data, and here we see only two columns — island and body mass — plus, of course, the species of the penguin, which is the outcome or target variable in this subset. We construct a decision tree, taking body mass as the first feature, and the splitting is done based on one condition: is the body mass greater than or equal to 3500? If yes, then based on the other feature, island, we classify further and reach a leaf node for either Torgersen or Biscoe island: if the island is Torgersen, the species is Adelie, and if the island is Biscoe, the species is Gentoo. After Torgersen and Biscoe no further division takes place, because we get the predictions at these leaf nodes. Whereas if the body mass is less than 3500, we get the species Chinstrap, so no further decision has to be made at that node and it ends there. This was a very simple, basic example of a decision tree. If we had a huge data set, this decision tree would have grown to a huge depth, and that depth would have led to overfitting of the data — that is one of the drawbacks of decision trees that random forest overcomes. Now let's understand the important terms in random forest, which will also help consolidate what we have learned so far. We take the same small decision tree from the previous example; these terms are relevant to random forest as well. First is the root node: the entire training data is fed to the root node. Then each node asks a true-or-false question with respect to one of the features, and in response to that question it partitions the data set into different subsets. That is what happens here with the condition "is the body mass greater than or equal to 3500?": it asks yes or no, and based on that the partition is done, and if not, it simply classifies the species. Now, this is very important: the splitting takes place with the help of either Gini or entropy methods, and these help decide the optimal split — we'll discuss the splitting methods very soon. Then we have the decision nodes, which provide the links to the leaf nodes; these really matter, because only the leaf nodes tell us the actual predictions, that is, to which class a penguin belongs. And then the leaf nodes are the end points where no further division takes place and we obtain our predictions. A tiny sketch of this penguin example in code is shown below.
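As a tiny illustration of this penguin example, here is a sketch using scikit-learn's DecisionTreeClassifier. The handful of rows and the numeric encoding of the island column are made up to mirror the example above; this is not the actual penguin data set.

# A tiny penguin decision tree sketch with scikit-learn.
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [body_mass_g, island] with island encoded as 0=Torgersen, 1=Biscoe
X = [[3700, 0], [3800, 1], [3400, 0], [3300, 1], [4300, 1], [3900, 0]]
y = ['Adelie', 'Gentoo', 'Chinstrap', 'Chinstrap', 'Gentoo', 'Adelie']

tree = DecisionTreeClassifier(criterion='gini', random_state=0).fit(X, y)
print(export_text(tree, feature_names=['body_mass', 'island']))
print(tree.predict([[3600, 1]]))   # predict the species of a new penguin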
So now, coming to another important thing: the working of random forest. To understand how a random forest works, we need a few important concepts — random sampling with replacement, feature selection, and the ensemble technique used in random forest, which is bootstrap aggregation, also known as bagging. We'll understand these with the help of a simple example, and then we'll see how feature selection is done in both classification and regression problems — that is, how a random forest actually selects features for constructing its decision trees. In a random forest, the best split is chosen based on Gini impurity or information gain methods; we'll get to that as well. First, let's understand random sampling with replacement. We have a small subset of the same penguin data set, with six rows and four features (four columns), and the arrows indicate that we will be creating three subsets from this small subset; these subsets will become our decision trees — we'll construct a decision tree from each of them. So let's create the first subset. You can see that the subset is created randomly, and for convenience let me also show you the different subsets. For better understanding, notice that in the first subset we have certain random rows and certain features — we don't yet know how these features were selected — and we got island and body mass. In the second subset we got island and flipper length, and in the third subset we got body mass and flipper length. Now look at the rows. When I talk about picking these columns, that is feature selection — remember this term. Coming to the second concept, random sampling: random sampling is nothing but selecting rows randomly, so I am randomly selecting certain rows from my data and creating the subsets. And what is replacement? Replacement can be seen and understood with the second subset: we see that the row for the Gentoo species is repeated, and that is replacement. When we work with repeated rows, and a row can appear again in the second or third subset, this is random sampling with replacement — it means a random forest can use a row multiple times across multiple decision trees. So this is the basic concept of random sampling with replacement and feature selection in random forest. Another important term I would like to bring to your notice: these kinds of small subsets are also known as bootstrap data sets, and when we aggregate the results across all these data sets, it becomes bootstrap aggregation. I'm just filling in the terminology so that the concepts become clearer later. Now let's move on to drawing the decision trees for these subsets, starting with the first one. Again we take body mass as the root node and split on the condition "is the body mass greater than or equal to 3500?": if it is no, the species is Chinstrap, and if it is yes, we partition again based on island — if it is Torgersen, it is Adelie, and if it is Biscoe, it is Gentoo. This is how we will construct the two remaining decision trees from the other subsets as well.
For the second subset let us again create a decision tree. Here we take flipper length and split on the condition that the flipper length is greater than or equal to 190. If yes, the species is Gentoo; if no, we make another decision based on island: if it is Torgersen the species is Adelie, and if the island is Dream then it is the Chinstrap species. That is how the decision tree of the second subset is created, and that is how it will make decisions, depending on the depth of the tree and the features it selects. Now let's create the third decision tree from the third subset. We get a tree where, if body mass is greater than 4000, the species is clearly Gentoo; if not, we partition again with respect to flipper length, another feature here, and if that is greater than or equal to 190 the species is Adelie, else Chinstrap. That is how decision tree three will make its decisions. Let's keep these decision trees with us; we will make sense of them in a moment. But before that, let us understand how feature selection is done in a random forest, that is, how the columns are chosen. For classification, by default, the number of features considered is taken as the square root of the total number of features. Here we have four features, and since this is a classification problem the square root of four is two, so each decision tree is constructed using two features. If we had 16 features, the square root would be four, so four features would be taken in each decision tree. And if this had been a regression problem, by default the features would be selected by taking the total number of features and dividing by three. So that is how feature selection is done by default in a random forest. Now let us move on and consolidate our learning with the ensemble technique, which here takes the form of bootstrap aggregation. Random forest uses ensemble techniques, and ensembling simply means aggregating the results of the decision trees, taking the majority vote in the case of classification and the mean in the case of regression, and giving that as the output. Now we have plotted all our decision trees again, and below them is an unknown data point whose species I want to predict. We feed this data point to each decision tree and see what each one predicts. Decision tree one says the species seems to be Chinstrap. Decision tree two says that, based on its splits, the species is Adelie. Decision tree three says no, according to my splits this species is Chinstrap. All these predictions are fed to the random forest classifier, which counts two votes for Chinstrap and one vote for Adelie, so the new data point is classified as Chinstrap. That is how bootstrap aggregation works: the decisions taken by the different trees are combined by majority voting, aggregated, and we get an ensembled result from the random forest.
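A minimal sketch of the two feature-selection defaults and the majority vote described above; the three tree predictions are made-up values used only to illustrate the aggregation step.

```python
# Default feature counts per split and a majority vote over per-tree predictions.
import numpy as np
from collections import Counter

n_features = 4
max_features_classification = int(np.sqrt(n_features))   # sqrt(4) = 2
max_features_regression = n_features // 3                 # 4 // 3 = 1

tree_predictions = ["Chinstrap", "Adelie", "Chinstrap"]    # one vote per tree
majority_vote = Counter(tree_predictions).most_common(1)[0][0]
print(majority_vote)                                       # -> "Chinstrap"
```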
So that was a very simple view of the ensemble technique used in random forest. Now let's move on to splitting methods. What splitting methods do we use in random forest? There are several, such as Gini impurity, information gain and chi-square. Let's first discuss Gini impurity. Gini impurity measures the likelihood that a randomly selected example would be incorrectly classified by a specific node, and it is called an impurity metric because it shows how far the node is from a pure division. Another useful fact about Gini impurity is that it ranges from 0 to 1, with 0 indicating that all of the elements belong to a single class (a pure node) and values approaching 1 indicating that the elements are spread across many classes; a value of 0.5 indicates that the elements are uniformly distributed across two classes. Now moving on to information gain, another splitting method which random forest can use; information gain makes use of entropy, and entropy is a measure of uncertainty. With information gain, the features selected are the ones that provide the most information about a class, and this relies on the entropy concept. So let's see what entropy is: it is a measure of randomness or uncertainty in the data, and we will understand it with the help of a small example, so don't worry. Suppose there is a fruit tray with four different fruits. What do you feel about the entropy here, that is, the randomness of the data? Is it really easy to classify these fruits into their respective classes? It is quite uncertain, and the data looks messy. But what if we split the fruits into two trays, where the first tray has peaches and oranges and the second tray has apples and lemons? Now things become more certain: we get lower randomness, and this is called low entropy. So as we move down the tree, from the root node towards the leaf nodes, the entropy reduces, and we can also calculate information gain from this entropy: the difference in entropy before and after the split is the information gain. Once we move down the tree and reduce the randomness in the data, the entropy becomes lower, and that is what we want: with low entropy the predictions are likely to be more accurate and easier to make, compared to very messy data with high entropy. So that was entropy.
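A minimal sketch of the two impurity measures and information gain just discussed; the class proportions and the equal-sized children are made-up numbers, only the formulas come from the explanation above.

```python
# Gini impurity, entropy (in bits) and information gain for one binary split.
import numpy as np

def gini(p):
    """Gini impurity for a vector of class proportions p."""
    return 1.0 - np.sum(np.square(p))

def entropy(p):
    """Entropy for a vector of class proportions p."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

parent = np.array([0.5, 0.5])                               # messy node: two classes, evenly mixed
left, right = np.array([0.9, 0.1]), np.array([0.2, 0.8])    # children after a split

info_gain = entropy(parent) - 0.5 * entropy(left) - 0.5 * entropy(right)
print(gini(parent), entropy(parent), info_gain)
```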
So now let's move on to the advantages of random forest; there are several. The first is low variance: since random forest overcomes the limitations of a single decision tree by combining the results of multiple trees, and each tree is trained on a limited subset of the data, as we saw earlier, there is less overfitting and lower variance. The next advantage is reduced overfitting: because we work with multiple decision trees of reduced depth, the model is fitted well and does not try to learn even the noise in the data. We use bootstrap aggregation, or bagging, in random forest, and that is another reason we get reduced overfitting; this is also one of the reasons it is so popular, because you don't have to worry as much about overfitting. Moving on, another advantage is that normalization is not required in random forest, because it works on a rule-based approach. It also gives really good accuracy, which we will see in the hands-on: it gives very nice predictions in terms of precision and recall, and it generalizes well on unseen data compared to other machine learning classifiers such as Naive Bayes, SVM or KNN; random forest often outperforms them. A few more advantages: it is suitable for both classification and regression problems, it works well with both categorical and continuous data, so you can use it with many kinds of data sets, and it performs well on large data sets. It solves most of these problems, which is why random forest is so widely used in machine learning. Now, moving on to some disadvantages of random forest. The first disadvantage is that it requires more training time because of the multiple decision trees: if you have a huge data set you may be constructing hundreds of decision trees, and that takes a lot of training time. Another disadvantage is that interpretation becomes complex when you have multiple decision trees. A single decision tree is easy to interpret, but when you combine hundreds of trees into a random forest it becomes difficult to understand what exactly the model is predicting, where the splits occur, which features are selected, and so on. Another disadvantage is that it requires more memory: memory utilization is heavy in the case of random forest because we are working with multiple decision trees. And finally, it is computationally expensive and requires a lot of resources, both for training the multiple decision trees and for storing them. So that was all the theory of random forest; now let's move on to a practical demonstration, a hands-on on random forest. Let us import a few basic Python libraries in our Jupyter notebook and run the cell: we import pandas as pd, numpy as np and seaborn as sns. Seaborn is needed here because we want to load the penguins data set, which comes preloaded with seaborn; seaborn ships several practice data sets, which makes it a good source of data for beginners to practice on. The asterisk sign next to the cell means it is still running, so let us wait for the data set to load.
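A minimal sketch of the setup just described, using the penguins data set that ships with seaborn.

```python
# Import the basic libraries and load the preloaded penguins data set.
import pandas as pd
import numpy as np
import seaborn as sns

df = sns.load_dataset("penguins")   # sample data set shipped with seaborn
print(df.head())                    # species, island, bill/flipper measurements, body mass, sex
print(df.shape)                     # roughly (344, 7)
```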
So we have got our data in an object called df, and we can see the first five entries. The data frame is shown as a table of rows and columns, and we see species, bill length, bill depth, flipper length, body mass and the sex of the penguin. Our task is to classify these penguins into the correct species. We check the shape of our data and see that it has 344 rows and seven columns, and then we call df.info(), which gives us, along with the non-null count, the data type of each column. Species, island and sex are of the object data type, whereas bill length, bill depth, flipper length and body mass are of a floating-point data type. Moving on, we count how many null values there are with df.isnull().sum(): we get around two null values in each of the feature columns bill length, bill depth, flipper length and body mass, and 11 null values in the sex column. Since these are very few null values we can simply drop them, so in this data frame I drop the null rows, and then we check again with the same isnull().sum() call and confirm that they have been dropped. Now let us do some feature engineering with our data. We have seen that some columns are of the object data type, and before feeding the data into the algorithm, that is, random forest, we have to transform the categorical data into numeric form. We use one-hot encoding here to convert the categorical data into numbers; there are various ways to do this in Python, such as one-hot encoding or a mapping function, but here we use one-hot encoding. Let us first apply it to the sex column. It has two unique values, male and female, and we use pandas get_dummies to apply the one-hot encoding, because that is how get_dummies works: each unique value is converted into its own column in the data frame, so the two unique values, male and female, become two columns. One thing to note here is the dummy variable trap: here we only have two unique values, but suppose I had six or seven unique values and applied one-hot encoding; I would end up with many extra columns in my data frame, and that would add several complexities. So, to keep things simple, I use one-hot encoding when the number of unique values is low; since I have only two or three, it works here. Now, one of the two columns is redundant, it gives no extra information, so I drop the first column and keep only the male column. Can I still infer female from this? Yes: if the value is one the penguin is male, and if the value is zero the penguin is female. So only one column is needed for this data frame.
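A minimal sketch of the inspection, cleaning and sex-column encoding walked through above; drop_first is used here in place of creating both dummy columns and then dropping one by hand.

```python
# Inspect, drop the few missing rows, and one-hot encode the sex column.
import pandas as pd
import seaborn as sns

df = sns.load_dataset("penguins")
df.info()                             # dtypes and non-null counts per column
print(df.isnull().sum())              # a handful of missing values, mostly in "sex"

df = df.dropna()                      # very few nulls, so simply drop those rows
sex = pd.get_dummies(df["sex"], drop_first=True)   # one column is enough: 1 = male, 0 = female
print(sex.head())
```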
So I just kept one column and dropped the other one. Now we apply one-hot encoding to the island feature as well. If we check the unique values in island, we have three: Torgersen, Biscoe and Dream, and the data type is object. So again we use pd.get_dummies on the island feature and look at the head: the unique values are converted into three columns, and then we drop the first column to keep the remaining two. Here, too, we can infer the dropped value: if Torgersen is one, then the island is neither Dream nor Biscoe. That is how you read it from the data frame. Now remember that these two data frames, island and sex, are still independent; they are not yet part of the main data frame. So we now concatenate the two data frames with the original one: we create a new data frame called new_data and use pd.concat on df, island and sex with axis equal to one, that is, along the columns. When we run this and look at the head, everything is concatenated into a single data frame, which is what we need before splitting the data into training and test sets. This new data frame still has some repeated columns that need to be deleted: we drop the original sex and island columns, since we already have the male, Dream and Torgersen columns, using new_data.drop with the column names, axis equal to one and inplace equal to true, and then we look at the head of the data frame again. Now it is time to create a separate target variable. We store only the species in a variable called y, taking it from new_data.species, and y.head() shows the first five species, which means the target variable has been created. Looking at y.values we see the three unique penguin species, Adelie, Chinstrap and Gentoo, and the data type is object, so again we need to convert this into a numeric data type. This time we use the map function in Python and map Adelie to 0, Chinstrap to 1 and Gentoo to 2, so all the values are mapped to numbers; this is another way to convert a categorical value into a numeric value in Python. Next we drop the target column species from the main data frame, and when we look at the new data frame it no longer contains the target. We store this new data frame in X and perform the split: from sklearn.model_selection we import train_test_split and split the data into 70% training data and 30% test data. The random_state is set to 0, which fixes the seed and makes the code reproducible: if I run this code again I will get the same split and the same result.
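A minimal sketch of the island encoding, concatenation, target mapping and split walked through above, rebuilt from a fresh copy of the penguins data so it runs on its own.

```python
# Encode island, merge everything, map species to numbers, and split 70/30.
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

df = sns.load_dataset("penguins").dropna()
sex = pd.get_dummies(df["sex"], drop_first=True)          # 1 = male, 0 = female
island = pd.get_dummies(df["island"], drop_first=True)    # keeps Dream and Torgersen columns

new_data = pd.concat([df, island, sex], axis=1)
new_data = new_data.drop(["island", "sex"], axis=1)        # original text columns are now redundant

y = new_data["species"].map({"Adelie": 0, "Chinstrap": 1, "Gentoo": 2})
X = new_data.drop("species", axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)                   # 70/30 split with a fixed seed
print(X_train.shape, X_test.shape)                         # roughly (233, 7) and (100, 7)
```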
You can set this random state to any number of your choice, and the resulting split would differ. Now let us print the shapes of X_train, y_train, X_test and y_test. We see the data has been split 70/30: X_train has 233 rows and seven features, X_test has 100 rows and seven features, and similarly y_train has 233 values and y_test has 100 values, the species labels. So the data has been split properly into 70% and 30%. Next we train the random forest classifier on the training set. How do we do it? We import RandomForestClassifier from sklearn.ensemble (we have already discussed what an ensemble is) and store it in a variable called classifier. The n_estimators parameter is simply the number of decision trees, so we create five decision trees here, the criterion is entropy, and random_state is again set to 0. Then we fit the classifier on X_train and y_train, so it has been fitted with the entropy criterion. Now let's make some predictions: we create a variable y_pred and predict on X_test, and we also print these predictions. Then we print the confusion matrix to check the accuracy of the random forest algorithm: from sklearn.metrics we import classification_report, confusion_matrix and accuracy_score, and in the variable cm we store the confusion matrix of y_test and the predictions. When we print it we also see the accuracy score, which is 98%, so our random forest classifier is giving us a very good accuracy of 98%, and in the confusion matrix you can see that only two cases have been misclassified; all the remaining cases have been correctly classified. Next we print the classification report of y_test and the predictions: the precision is 96%, meaning 96% of the positive predictions made by the algorithm were correct; the recall, the true positive rate, is 100%, which is very nice; and the F1 score is 98%, which is also good. So the model is giving us a good result. But what if we change the criterion from entropy to Gini? Let's experiment with that too, and also try a different number of trees along with the Gini criterion. So again we import RandomForestClassifier from sklearn.ensemble and fit it; this time we use seven trees instead of five, the Gini criterion, and random_state 0. Let's run this, make the predictions and check the accuracy score for this random forest classifier with seven trees: we get 99% accuracy after changing the criterion and the number of trees. You can keep experimenting with different numbers of decision trees; let's try 12 and see what happens. You can see the accuracy drops back to 98%, whereas with seven trees we were getting 99%, so let's keep seven because it is giving us really good accuracy.
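A minimal sketch of the two classifier runs compared above, continuing from the X_train/X_test/y_train/y_test variables in the previous sketch.

```python
# Train with 5 trees and entropy, evaluate, then retrain with 7 trees and Gini.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

clf = RandomForestClassifier(n_estimators=5, criterion="entropy", random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))                 # about 0.98 in the walkthrough
print(classification_report(y_test, y_pred))

clf = RandomForestClassifier(n_estimators=7, criterion="gini", random_state=0)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))    # about 0.99 in the walkthrough
```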
Support Vector Machine In Python
So that is the random forest classifier and how it works with several trees and different criteria to give us very good accuracy on our training and test data. Now, without any further ado, let's get a short introduction to machine learning. Machine learning is the process of feeding a machine enough data to train on and predict a possible outcome using the algorithms at hand; the more data is fed to the machine, the more efficient the machine becomes. Let us try to understand this with a real-life example. I'm sure most of you are aware of the predictions made before any major sports match; in this case I'm going to talk about a football penalty shootout. Let's say the data of previous performances is considered: the goalkeeper has saved all penalties to his right out of the last 50 penalties he has saved. This data will be crucial for predicting whether he will or will not save the next penalty he faces; of course there are other factors to consider as well. Another example is the suggestions we get while surfing the internet: the data of our previous choices is processed to show us the content we are most likely to watch. Anyhow, machine learning is not just feeding the machine an ample amount of data; there are a lot of processes, algorithms and decisive factors involved in getting optimal results. So in this session we will go through one such algorithm, the support vector machine, to understand how it works with Python. Before that, let us also take a look at the types of machine learning. There are three types: supervised learning, unsupervised learning and reinforcement learning. In supervised learning, training happens in a controlled way so the outcome can be overseen; it is, as the name suggests, supervised, in the sense that the machine learns what the user wants it to learn. Coming to unsupervised learning, the machine simply explores the data given to it; the data is unlabeled and uncategorized, and the machine makes possible inferences and predictions without any supervision. And talking about reinforcement learning, it basically means enforcing a pattern of behavior: the machine needs to establish a systematic pattern of approach through reinforcement. So those are the types of machine learning we have. Let's move on to the next topic: what is a support vector machine? A support vector machine, or SVM, was first introduced in the 1960s and later improved in the 1990s. An SVM is a supervised machine learning classification algorithm that has become extremely popular owing to its efficient results. An SVM is implemented in a slightly different way than other machine learning algorithms; it is capable of performing classification, regression and outlier detection as well. A support vector machine is a discriminative classifier formally defined by a separating hyperplane. It is a representation of examples as points in space, mapped so that the points of different categories are separated by a gap as wide as possible. In addition to this, an SVM can also perform nonlinear classification. Now I'm going to tell you a few advantages and disadvantages of SVM, the support vector machine. Talking about the advantages: it is effective in high-dimensional spaces, and it is still effective in cases where the number of dimensions is greater than the number of samples.
One more advantage is that it uses a subset of training points in the decision function, which makes it memory efficient, and the last advantage is that different kernel functions can be specified for the decision function, which also makes it versatile. Coming to the disadvantages: if the number of features is much larger than the number of samples, avoiding overfitting when choosing the kernel function and the regularization term becomes crucial. The next disadvantage is that SVMs do not directly provide probability estimates; these are calculated using five-fold cross-validation. So those are the advantages and disadvantages of SVM. Now let's take a look at the next topic: how does an SVM work? The main objective of a support vector machine is to segregate the given data in the best possible way. When the segregation is done, the distance between the nearest points is known as the margin, and the approach is to select a hyperplane with the maximum possible margin between the support vectors in the given data set. To select the maximum-margin hyperplane, the support vector machine does the following: it generates hyperplanes which segregate the classes, and then it selects the hyperplane with the maximum separation from the nearest data points of either class. Now let me tell you how we can deal with inseparable and nonlinear data as well. In some cases a hyperplane cannot be very efficient, and in those cases the support vector machine uses a kernel trick to transform the input into a higher-dimensional space, where it becomes easier to segregate the points. Now let us talk about SVM kernels. An SVM kernel is basically used to add more dimensions to a lower-dimensional space to make it easier to segregate the data; it converts an inseparable problem into a separable problem by adding more dimensions using the kernel trick. A support vector machine is always implemented in practice using a kernel, and the kernel trick helps build a more accurate classifier. Let me talk about the different types of kernels we have in support vector machines. First of all, we have the linear kernel: a linear kernel can be used as a normal dot product between any two given observations, so the product between the two vectors is the sum of the multiplication of each pair of input values. Then we have the polynomial kernel, which is a more generalized form of the linear kernel; it can distinguish curved and nonlinear input spaces as well. The next kernel is the radial basis function kernel: the radial basis function, or RBF, kernel is commonly used in SVM classification, and its advantage is that it can map the input space into infinite dimensions. So those are the kernels we have in SVM, guys. Now I'm going to talk about a few support vector machine use cases; these are a few I have listed here. We can use SVM for face detection, for text and hypertext categorization, for classification of images, for bioinformatics tasks such as protein fold and remote homology detection, for handwriting recognition, and for generalized predictive control as well. Now that we are done with the SVM use cases, let me tell you how we can implement SVM. There are a few steps that we have to follow to implement a support vector machine in machine learning.
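Before the full implementation, here is a minimal sketch of the three kernels named above on a small synthetic data set; the data and parameters are illustrative, not taken from the video.

```python
# Compare linear, polynomial and RBF kernels on synthetic data.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel)
    clf.fit(X, y)
    print(kernel, clf.score(X, y))   # training accuracy for each kernel choice
```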
In any machine learning model we have to follow specific steps. First of all, we load the data on which we are going to perform the classification. After that, we explore the data: we see how many labels there are and what the target variables are. Then we split the data into training and test sets, generate the model by implementing the support vector machine, and after that we evaluate the model as well. I'm going to show you all of this when we work in PyCharm. After this I'm going to show you a simple use case, character recognition using a support vector machine, where we get an image recognition output something like this. So let's take it up to PyCharm, guys, and I'll show you how you can implement a support vector machine in Python. We are in PyCharm now; let me switch to presentation mode for better visibility. First of all, make sure you have the required library installed on your system, that is scikit-learn (imported as sklearn), and then import all these modules: svm, metrics, train_test_split from model_selection, and datasets to load the data. First we have the cancer_data object, into which we load the breast cancer data set; this is the way we can load the data, and I'll just write a comment over here, "loading the data". After this we can check what is inside the data, for example how many records there are. I'm just going to remove all this for now and print cancer_data to see what we have; it takes a while because it is loading the data set. You can see all the data points inside the data set: we have data, and then we also have target over here, which is a NumPy array. Now I'll check the shape of the data, and let's look at the target variable as well. This is our target variable, guys, which is binomial, 0 and 1, so we have only two possible classes here. Now that we are done exploring the data, let me show you how you can split the data into training and test sets. I'll just remove this again and bring back the previous code. These are my training and testing variables, X_train, X_test, y_train and y_test, and I have used train_test_split to split my data. As you can see, I am passing the data, then the target variable; we are using a test size of 0.4 (we could use 0.3 as well, which would keep 30% for testing and the rest for training), and we have random_state as 209. After this I generate the model, that is, my classifier, using SVC from the svm module with the linear kernel; I'll just write a comment, "generating the model". After this we train the model: for training we use the fit method and pass X_train and y_train, so basically the fit method is used to train the model on the data. Then we predict the response using the predict method, passing the test variable; we kept the test set separate for exactly this purpose, to predict the outcome on unseen data.
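A minimal sketch of the breast cancer walkthrough above; the 0.4 test size, linear kernel and random_state 209 follow the narration, while the variable names are illustrative.

```python
# Load the breast cancer data, split it, and fit a linear-kernel SVM.
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

cancer_data = datasets.load_breast_cancer()           # loading the data
X_train, X_test, y_train, y_test = train_test_split(
    cancer_data.data, cancer_data.target, test_size=0.4, random_state=209)

clf = svm.SVC(kernel="linear")                         # generating the model
clf.fit(X_train, y_train)                              # training the model
y_pred = clf.predict(X_test)                           # predicting the response on the test set
```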
So now that we are done with the training and testing, we will print the accuracy (I'll just fix a small typo in the variable name first). For this we use the metrics module from sklearn, which gives us the accuracy score; we pass y_test and the predictions, and it gives us the accuracy report. After this we get the precision score using the precision_score method in metrics, then the recall as well, and to make it easier I'm also printing a classification report. Let's run this program, guys, and see what happens; we'll get the full report of how efficient our model actually is. As you can see, we have an accuracy of 0.92, which is close to 92%, then a precision of 93% and a recall of 94%, and after that we have the classification report, which also contains the accuracy, the F1 score and the recall. So this is a simple example of implementing a support vector machine, guys. Now moving on, let me show you one more example; before that, let me just exit presentation mode. I'm going to show you a use case of character recognition using a support vector machine. For this we also use one more library, matplotlib, to plot the image. We have datasets, which we import from sklearn, and we have svm and metrics, though we are not going to use metrics for accuracy in this one, for a reason I'll explain later, guys. First of all we load the data set, load_digits, which gives us a data set with a bunch of handwritten digits, and then we generate the model directly. After this we have X and y variables, which take the image data and the target data: we take all the data up to the last 10 samples for training, and the same range for the target variable, then we fit the model to train it, and we predict on the digit data that is left over. After this I show one of the images; we use interpolation set to nearest and the imshow method from matplotlib, and if you are not familiar with matplotlib or how to plot graphs, you can check out our other tutorials on matplotlib. Let's run this program, guys; let me exit presentation mode, and we get a figure something like this. It looks like a 9, but it is quite distorted, because we have taken gamma as 0.01 and the C value as 100; if I increase or decrease these values the output will change, the accuracy can become better, but the training will become a little slower. That is the reason, and instead of the ninth image from the end I can just pick, let's say, the sixth. The prediction is actually right, but the image looks a little distorted because the accuracy is not quite as high as we might want. Still, this is a very simple example of building a character recognition model using a support vector machine.
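A minimal sketch of the digit-recognition example narrated above; the gamma and C values follow the narration, the particular index and plotting details are illustrative.

```python
# Train an SVM on the digits data, predict one held-out digit, and display it.
import matplotlib.pyplot as plt
from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.01, C=100)

X, y = digits.data[:-10], digits.target[:-10]                # hold back the last 10 images
clf.fit(X, y)

print(clf.predict(digits.data[-6].reshape(1, -1)))           # predict one held-out digit
plt.imshow(digits.images[-6], cmap="gray", interpolation="nearest")
plt.show()
```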
What is a Neural Network?
Also, to increase the accuracy, we can change the gamma or C values in the SVC parameters, but that affects the speed too: tuning gamma and C trades accuracy against training speed, so you adjust them according to what you need. From speech recognition and face recognition to healthcare and marketing, neural networks have been used in a varied set of domains. Hi all, I'm Zelka from Edureka and I welcome you to this session on what is a neural network. An artificial neural network is the functional unit of deep learning. Deep learning uses artificial neural networks, which mimic the behavior of the human brain, to solve complex data-driven problems. Deep learning is itself a part of machine learning, which falls under the larger umbrella of artificial intelligence. Artificial intelligence, machine learning and deep learning are interconnected fields: machine learning aids artificial intelligence by providing a set of algorithms to solve data-driven problems, and deep learning makes use of artificial neural networks that behave similarly to the neural networks in our brain. A neural network functions when some input data is fed to it; this data is then processed via layers of perceptrons to produce a desired output. So let's understand neural networks with a small example. Consider a scenario where you have been given a set of labeled images and you have to classify them into two classes, one class containing images of non-diseased leaves and the other diseased leaves. How would you create a neural network that classifies the leaves into diseased and non-diseased? The process always begins with processing and transforming the input in such a way that it can be easily consumed. In our case, each leaf image will be broken down into pixels depending on the dimensions of the image; for example, if the image is composed of 30x30 pixels, then the total number of pixels will be 900. These pixels are represented as matrices, which are then fed into the input layer of the neural network. Just as our brains have neurons that help in building and connecting thoughts, an artificial neural network has perceptrons that accept inputs and process them by passing them on from the input layer to the hidden layers and finally the output layer. As the input is passed from the input layer to the hidden layer, an initial random weight is assigned to each input. The inputs are then multiplied by their corresponding weights, and their sum is processed further through the network. Next, a numerical value called the bias is added at each perceptron. Furthermore, each perceptron's result is passed through an activation (or transformation) function that determines whether that particular perceptron gets activated or not; an activated perceptron transmits data to the next layer. In this manner, the data is propagated forward through the neural network until it reaches the output layer. At the output layer, a probability is derived which decides whether the data belongs to class A or class B. Now let's assume a case where the predicted output is wrong. In such a situation, we train the neural network using the backpropagation method. Initially, while designing the neural network, we initialize the weights of each input with some random values, and these weights denote the importance of each input variable.
Therefore, if we propagate backward through the neural network and compare the actual output to the predicted output, we can readjust the weights of each input in such a way that the error is minimized. This results in a more accurate output, and that is exactly what backpropagation means. Now let's discuss a few real-world applications of neural networks. With the help of deep learning techniques, Google can instantly translate between more than 100 different human languages. Visual translation is an interesting application of deep learning: it can be used to identify images that contain letters; once identified, they can be turned into text, translated, and the images recreated with the translated text. In fact, Google has an app for this purpose, the Google Translate app. Let's not forget to mention automated self-driving cars: deep learning has played a huge role in this field, and from Tesla to Google-owned Waymo, self-driving cars are being perfected with the help of neural networks. Then of course we have the virtual assistants like Siri, Alexa and Cortana that can practically read your mind; these assistants are purely based on technologies including deep learning, machine learning and natural language processing. Apart from this, deep learning has also made its way into the gaming industry. So, all you Dota fans out there might have already heard of the famous OpenAI Five, the first AI to beat world champions in an esports game after defeating the reigning Dota 2 world champions. After the victory, Bill Gates tweeted: "AI bots just beat humans at the video game Dota 2. That's a big deal, because their victory required teamwork and collaboration, a huge milestone in advancing artificial intelligence." Now guys, the applications of deep learning are not restricted to just games and machine translation; in fact, deep learning has found its way into the creative arts and music as well. An AI-based system called MuseNet can now compose classical music that echoes classical legends like Bach and Mozart. MuseNet is a deep neural network capable of generating 4-minute musical compositions with 10 different instruments, and it can combine styles from country to Mozart to the Beatles. Another creative product of artificial intelligence is a content automation tool called Wordsmith, a natural language generation platform that can transform your data into insightful narratives. Tech giants such as Yahoo, Microsoft and Tableau are using Wordsmith to generate around 1.5 billion pieces of content every day. I could go on and on about the applications of deep learning. In the long term, we are hoping to see the
Neural Network in Python
use of advanced AI techniques like deep learning for the betterment of humanity, rather than the threat that AI is supposedly going to pose to humans; I'm a firm believer that AI will only benefit us in the long run. Moving on, let's look at how a neural network is put together. At the end of the day, we have a network with layers, some computation happening inside them, and a desired output coming out. So think of it as a network where everything is connected: there are layers and neurons (we'll call them neurons now), the data points flow through each layer, and there are activation functions and summation computations happening, and at the end of the day you get the desired output from that particular network. In hindsight, there is input going into the network and output coming out; it is as simple as that. In the figure, too, you can see a network where nodes are interconnected and arranged in layers, which roughly depicts how a neural network is structured, similar to the brain; that's why most people talk about this analogy. Now, the importance of neural networks in deep learning is quite immense, guys: they narrow human intervention down to a bare minimum and are pretty efficient with multi-dimensional data, more about that later on. Speeding up processing with high efficiency is just one of the advantages of artificial neural networks. And I forgot to tell you, guys: neural networks are also known as artificial neural networks. To understand why we needed neural networks, let's look at how deep learning emerged from machine learning. Machine learning helped in solving complex problems and making smart decisions, but there are a few drawbacks in machine learning that led to deep learning. The first is that it was unable to process high-dimensional data: machine learning can process only small dimensions of data, a small set of variables, and if you want to work with data containing hundreds of variables, that is where deep learning comes in. The next drawback is feature extraction in machine learning: consider a use case where you have 100 predictor variables and you need to narrow them down to only the significant ones; to do this you have to manually study the range of the variables and figure out which ones are important, and this task is extremely tedious and time-consuming for anyone, including a developer. The next point is that it was not ideal for performing object detection and image processing: object detection requires high-dimensional data, the images and frames, the cascade classifiers, the XML files, and machine learning cannot be used to process such image data sets; it is only ideal for a limited number of features. That is why a fraud risk manager and data scientist at PayPal once said, roughly, that what we enjoy from more modern, advanced machine learning is the ability to consume a lot more data and layers, and to see patterns that simpler techniques, and even humans, might not be able to see. So clearly a simple linear model is capable of consuming around 20 variables, let's say; however, with deep learning technology one can run thousands of data points. That is exactly what we need, and neural networks are the center point of deep learning. Now, a neural network as a whole is a complicated concept. We can simplify it as input data going into a bunch of layers, computations happening using some functions, and output coming out. But it is not quite as simple as that.
To understand how it works, we have to take a look at the components: how many layers there are, what kind of activation functions we use, how we update the weights, what the biases are, and so on. These are the components we have to talk about. The key components that build a neural network include an input layer; a bunch of hidden layers where the computations or summations happen; an output layer where we get the output; the weights and biases applied to the input values; an activation function; and a loss function. Let's discuss each of them one by one. First of all, layers: a layer in a neural network is basically a collection of neurons that holds values and passes them on to the next layer, and each neuron takes its input, multiplies it by the weight associated with it and then passes the result through an activation function to the other neurons. There are basically three types of layers, as I've told you: an input layer, hidden layers and an output layer. The input layer accepts all the inputs provided by the programmer or the user; between the input and the output layer is a set of layers known as hidden layers, and in these layers computations are performed which result in the output: the inputs go through a series of transformations via the hidden layers, and the final result is delivered by the output layer. Now we're going to move on and learn what exactly weights and biases are and why we use them in a neural network. Weights and biases are among the main components of a neural network. So what exactly are weights? A weight is a value associated with an input that basically decides how much importance that particular input has in calculating the desired output, basically the priority of the input, and the weights are optimized during the training phase, which we will discuss later on while training the model. To understand this with an example, let's say we have a vintage car. To calculate the price of a vintage car there would be two essential factors: first, how old the car is, that is, what year the model was made, and second, how much it has been driven, how many miles are on the car. The weight would have a negative relationship with respect to the year it was made, because the older the model, the higher the price would be, and similarly for the number of miles on the car: the fewer the miles, the more the price goes up. Then comes the bias: it is simply a constant value that is added to the weighted sum of the inputs. We will see the bias again when we take a look at the neural network in Python, guys, so don't worry. Next we're going to take a look at the activation function and understand why it is used in a neural network. So what exactly is an activation function? An activation function basically normalizes the computed input to produce an output, and there can be various activation functions.
There can be a linear function, softmax, ReLU or sigmoid; these are some of the activation functions we can use in our model. We are going to use the sigmoid function here, guys, mainly because its output is squashed between 0 and 1, so we can threshold it to get the output we want. And if you want to learn more about the sigmoid function, we have a tutorial on our YouTube channel that you can check out; you can also check out the gradient descent tutorial we have there. Okay, let's try to understand the whole process in a systematic way. Say the input values are I1, I2, I3 and so on up to In, and there are corresponding weights associated with them, W1, W2 and so on up to Wn. The weighted sum will be I1*W1 + I2*W2 + ... + In*Wn, and after adding the bias it becomes I1*W1 + I2*W2 + ... + In*Wn + bias. The output is then computed by applying the activation function to this sum; with the sigmoid it looks like output = 1 / (1 + e^(-x)), where x is the weighted sum plus the bias. We will learn more about this when we train the neural network. Now let's talk about the next step. So far we have discussed the layers, the weights associated with the inputs and how the output is calculated using the activation function. To implement a neural network there are two processes involved: feed forward and backpropagation. In feed forward, the weights are initialized randomly and the output is calculated using the activation function; these weights, taken at random, are going to be optimized later on during backpropagation. The entire process of the input going through all the layers and producing the output is the feed forward pass. On the other hand, backpropagation is the process where the weights are updated to minimize the calculated error. The error is nothing but the difference between the actual output and the predicted output, and to reduce that error we use backpropagation. To do this, the weights are updated using the gradient descent algorithm, which is why I mentioned that you can check it out on our YouTube channel. So the flow is set like this: we take the inputs, assign the bias and the associated weights, make a prediction after applying the activation function, and then, to minimize the cost function, we run the gradient descent algorithm and repeat the training with updated weights until we get the lowest possible error in the predictions.
So that is how the whole process works, and to understand how we compute the error we can take an example: we look at how the mean squared error is calculated and how the weights are updated using the gradient descent algorithm, where x is the input, f(x) is the output, and lr is the learning rate; we find the derivative with respect to x, and for that we use the chain rule, which gives three derivatives to multiply together. We will take a look at the formula and how the calculation is done as we go. So let's take it to the actual implementation of the neural network I have discussed so far; let's go to Jupyter and see the practical implementation. In this notebook I have written down the logic: to implement a neural network we are going to need a few components, namely the inputs (the input features), the output values, the weights, the bias, and, since we are using the gradient descent algorithm, the learning rate and the derivative of the sigmoid function, and of course the sigmoid function itself as the activation function. Initially I imported the numpy library, and guys, I must tell you before we begin: this implementation is just to make you understand how it works. You don't necessarily have to build neural networks like this; we have TensorFlow and scikit-learn, where you can just import a module, and with TensorFlow you can design sequential models with lots of layers, with activation functions built in, calculate the loss, and track metrics such as accuracy. That is what you will normally use, because it makes things easier; this is just to show you how it really works underneath. So we have the input values, an array with four rows in shape 4x2. After that we have the output values in the output array, four target values (zeros and ones) reshaped into 4x1. Then we have the weights, which are 0.1 and 0.2, the two weights associated with this particular program, and then the bias, which is 0.3. The activation function is the sigmoid function, which we have implemented using numpy's exponential of minus x; that is how you implement a sigmoid function using numpy. Then there is the derivative of the sigmoid function, which we are going to use in the chain rule; it helps us in the gradient descent update. Then comes the part where we update the weights: the epochs are in the range of 10,000, and for each epoch we take the input array and compute the weighted sum. This follows the flow I have already described: first you take the input values, then you calculate the weighted sum and add the bias to it, so this is the weighted sum with the bias and the weights included. After that we have the first output: this is the feed forward step, the first output comes from the sigmoid function, and once we have this set of values we can compare it with the desired output. From that we get an error, basically the difference between the two.
So we calculate the error using the mean squared error; the numpy statement we have computes that mean squared error. After that we compute the derivatives; we calculate these derivative values explicitly so we can plug them into the gradient descent algorithm. The first derivative is of the error with respect to the predicted output; this is the first term in the chain rule. Then we have the second derivative, which is basically nothing but the derivative of the sigmoid function. Then we have the final derivative, which is the dot product with the input values, where we have transposed the input array: before it was in shape 4x2, and now it becomes 2x4. After that we update the weights using the update equation, which is nothing but x = x - lr * df(x)/dx, with the learning rate lr we set earlier; that is exactly why we calculated all of these derivatives. After that we update the bias as well, using the same kind of update. Now if I run it, the learned weights and bias are printed; in this run the weights come out around 4 and 1.8 and the bias around 5.6. Now, to check the prediction, we test it on a new value: the input is 0 and 1, and for this 0 and 1 the target output is 1, so we will see how far off our result is. It comes out to be about 0.9979.
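A minimal sketch of the single-neuron NumPy network walked through above. The initial weights (0.1, 0.2), bias (0.3) and 10,000 epochs follow the narration; the exact input array, target array and learning rate are not fully spelled out in the video, so the values used here are assumptions.

```python
# One-neuron network trained by gradient descent on mean squared error.
import numpy as np

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])       # shape (4, 2), assumed pattern
targets = np.array([[0], [1], [1], [1]])                   # shape (4, 1), assumed labels
weights = np.array([[0.1], [0.2]])                         # two weights, as narrated
bias = 0.3
lr = 0.05                                                   # assumed learning rate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    return sigmoid(x) * (1.0 - sigmoid(x))

for epoch in range(10000):
    weighted_sum = np.dot(inputs, weights) + bias           # feed forward
    output = sigmoid(weighted_sum)

    error = output - targets                                # predicted minus desired output
    d_error = 2 * error / len(inputs)                       # derivative of the mean squared error
    d_sigmoid = sigmoid_derivative(weighted_sum)            # derivative of the activation
    gradient = np.dot(inputs.T, d_error * d_sigmoid)        # chain rule, inputs transposed

    weights -= lr * gradient                                # gradient descent update
    bias -= lr * np.sum(d_error * d_sigmoid)

print(weights, bias)
print(sigmoid(np.dot(np.array([0, 1]), weights) + bias))    # prediction for a new point
```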
Artificial Neural Networks
So it's pretty accurate, guys. Now we'll check other values as well: let's check 0 and 0, for which the output should be near zero, and the result is indeed very small, almost equal to zero again. So this is how we can implement a neural network in Python. And guys, instead of using a sigmoid function as the activation function, we could use a linear function, and there are other options as well, such as a softmax function. So this is it, guys; we have come to the end of this part, and I just want to tell you that this is about as basic an example of neural networks as you can find. If you really want to work with neural networks, this was just a very simple illustration; you can go through the TensorFlow official documentation or check out our YouTube channel, where you will find a lot of tutorials, for example basic image classification, where you get data, pre-process the images and feed the input into a network that classifies the images into different classes. That is a very basic example again. So this is the problem statement, guys: we need to figure out whether bank notes are real or fake, and for that we'll be using an artificial neural network, and obviously we need some data in order to train our network. So let us see how the data set looks. Over here I've taken a screenshot of the data set with a few of the rows: the data was extracted from images taken of bank-note-like specimens, these are the features that I'm highlighting with my cursor, and the final column, the last column, actually represents the label. The label tells us which class that pattern belongs to, that is, whether it represents a genuine note or a forged one. Now let's see how to implement this use case. Over here we'll first encode the dependent variable, and what is a dependent variable? It is nothing but the label column. Then we are going to divide the data set into two parts, one for training and one for testing. After that we'll use data structures for holding the features, labels and so on, and a Python deep learning library to build the model. We'll train our model on the training data and calculate the error: the error is nothing but the difference between the predicted output and the actual output, and we'll try to reduce it. Finally we'll evaluate on the test data and calculate the accuracy. So guys, let me quickly open my PyCharm and show you how the output looks. We're in PyCharm, guys; over here I've already written the code in order to execute the use case, so I'll go ahead and view the output. As you can see, with every iteration the accuracy is increasing, so let me just stop it here. Any questions or doubts with respect to the use case or the data set? Any questions, guys? You can go ahead and ask me. There's a question from Arpan; he's asking whether I can explain the code. Definitely, Arpan, I'll be doing that at the end of this class: once we are done with all the fundamentals of neural networks, I'll explain the entire code, how I have written it and how I've used it to implement the neural network.
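The video does not show its code on screen here, so the following is only a minimal sketch of how such a bank-note classifier might look, assuming TensorFlow/Keras and the UCI banknote-authentication CSV; the file name, column names, layer sizes and epoch count are all illustrative assumptions.

```python
# Small dense network that learns to separate genuine and forged bank notes.
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow import keras

cols = ["variance", "skewness", "curtosis", "entropy", "label"]   # assumed column names
data = pd.read_csv("banknote_authentication.csv", names=cols)      # assumed local file

X_train, X_test, y_train, y_test = train_test_split(
    data[cols[:-1]], data["label"], test_size=0.2, random_state=0)

model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),                    # genuine vs forged
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=20, verbose=1)                   # accuracy climbs each epoch
print(model.evaluate(X_test, y_test))
```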
I hope that answers it. Okay, any other questions, guys? Just go ahead; if not right now, you can ask later. So what I'll do now is explain why we need neural networks, and for that we'll compare them with conventional computers. A conventional computer follows a set of instructions in order to solve a problem: unless we know the specific steps, that is, the instructions, the computer cannot solve the problem. That's pretty much how traditional computing works. Now let us see what neural networks, which are modeled on what the brain basically does, do instead. Rather than programming them to perform a specific task, they learn from examples, from their experience. So you don't need to provide all the instructions to perform a specific task; your network will learn on its own from its own experience. All right. So this is what a neural network basically does. So even if you don't know exactly how to solve a problem, you can train your network in such a way that with experience it can actually learn. So that was a major reason why neural networks came into existence.

Now we'll go forward and understand the motivation behind neural networks. These neural networks are basically inspired by neurons, which are nothing but your brain cells, though the exact working of the human brain is still a mystery. As I've told you earlier, neural networks work like the human brain, hence the name, and similar to a newborn human baby who learns from his or her experience, we want a network to learn that way as well, but we want it to do so very quickly. So here's a diagram of a neuron. Basically, a biological neuron receives input from other sources, combines it in some way, performs a generally nonlinear operation on the result, and then outputs the final result. So here, if you notice these dendrites, the dendrites receive signals from other neurons and transfer them to the cell body. The cell body will perform some function; it can be summation, it can be multiplication. After performing that summation on the set of inputs, the result is transferred via the axon to the next neuron.

Now let's understand what an artificial neural network is. It is basically a computing system that is designed to simulate the way the human brain analyzes and processes information. Artificial neural networks have self-learning capabilities that enable them to produce better results as more data becomes available: if you train your network on more data, it will be more accurate. These neural networks can be configured for specific applications; it can be pattern recognition, it can be data classification, anything like that. All right. Because of neural networks we see a lot of things, from translating web pages to having virtual assistants online to conversing with chatbots. All of these things are possible because of neural networks. In a nutshell, an artificial neural network is nothing but a network of artificial neurons. All right.

So let me show you the importance with two scenarios, before and after neural networks. Over here we have a machine, and we have trained this machine on four dog images, the ones I'm highlighting with my cursor. Once the training is done, we provide a random image to this particular machine which has a dog in it, but this dog does not look like the other dogs on which we have trained our system. Without neural networks, our machine cannot identify the dog in the picture, as you can see over here; basically our machine cannot figure out where the dog is.
With neural networks, even if we have not trained our machine on this specific dog, it can still take the features of the dogs we have trained it on, match those features in this particular image, and identify that dog. This all happens because of neural networks, and it's just an example to show you how important neural networks are.

Now I know you all must be thinking about how neural networks work, so to understand that, over here I'll begin with a single artificial neuron, which is called a perceptron. A perceptron has multiple inputs x1, x2, and so on, and we have corresponding weights as well: w1 for x1, w2 for x2, and similarly wn for xn. Then what happens is we calculate the weighted sum of these inputs, and after doing that we pass it through an activation function. The activation function is nothing but something that provides a threshold value: above that value the neuron fires, else it won't fire. This is basically an artificial neuron, and when I talk about a neural network, it involves a lot of these artificial neurons, each with its own activation function and its own processing element; there's a small code sketch of this right after the beer festival example below.

Now we'll move forward and actually understand the various modes of this perceptron, or single artificial neuron. There are two modes in a perceptron: one is training mode and the other is using mode. In training mode, the neuron can be trained to fire for particular input patterns, which means we'll actually train our neuron to fire on a certain set of inputs and to not fire on the other set of inputs. When I talk about using mode, it means that when a taught input pattern is detected at the input, the associated output becomes the current output; in other words, once the training is done and we provide an input on which the neuron has been trained, it will detect that input and produce the associated output. So those are the two modes. As for activation functions: with a step function, if the weighted sum crosses the threshold the neuron will fire, else it won't; similarly for the sign function and the sigmoid function. These are three activation functions; there are many more, as I've told you earlier, but these are the three most widely used.

What we are going to do now is understand how a neuron learns from experience, and I'll give you a way to understand that. Later, when we talk about multiple neurons in a network, I'll explain the math behind it, the math behind how learning actually happens. Right now I'll explain it with an analogy, which is pretty interesting, and I know all of you must have guessed it: these are two beer mugs, and all of you who love beer can actually relate to this analogy. And I know most of you actually love beer; that's why I've chosen this particular analogy, so that all of you can relate to it. All right. So fine, guys. There's a beer festival happening near your house and you badly want to go there, but your decision actually depends on three factors. First, how is the weather, is it good or bad? Second, is your wife or husband going with you or not? And third, is any public transport available? On these three factors your decision will depend, whether you will go or not. We'll consider these three factors as inputs to our perceptron and our decision of going or not going to the beer festival as our output. So let us move forward with that. The first input is how is the weather; we'll consider it as x1. When the weather is good it will be one, and when it is bad it will be zero. Similarly for your wife going: if she is going, then it's one; if she's not, then it's zero. For public transport, if it is available, then it is one; else it is zero.
So now let's see the output. The output will be one when you are going to the beer festival and zero when you are staying at home; maybe you want to have beer at home instead. So these are the two outputs: whether you are going or not. Now, what does a human brain do? Okay, fine, I need to go to the beer festival, but there are three things that I consider. Will I give them equal importance? Definitely not. There will be certain factors of high priority for me, and I'll focus on those factors; a few factors won't affect the decision that much. All right. So let's prioritize our inputs or factors. Here our most important factor is weather. If the weather is good, I love beer so much that I don't care whether my wife is going with me or whether there is public transport available; I love beer that much, so if the weather is good I'm definitely going there. That means when x1 is high, the output will definitely be high. So how do we do that? How do we actually prioritize our factors, or give more importance to one particular input and less to another, in a perceptron or in a neuron? We do that by using weights. We assign high weights to the more important factors or inputs and low weights to those particular inputs which are not that important for us.

So let's assign weights, guys. Weight w1 is associated with input x1, w2 with x2, and similarly w3 with x3. Now, as weather is a very important factor, I'll assign it a pretty high weight, say six. Similarly, w2 and w3 are not that important, so I'll keep each of them as two. After that I've defined a threshold value of five: only when the weighted sum of my inputs is greater than five will my neuron fire, and only then will I be going to the beer festival. All right, so I'll use my pen and we'll see what happens when the weather is good. When the weather is good, our x1 is one and our weight is six, so we multiply it by six. If my wife decides that she is going to stay at home, maybe she'll be busy with cooking and doesn't want to drink, then that input becomes zero, and zero times two makes no difference because it will just be zero. Then again, suppose there is no public transport available either, so that term is also zero times two. So what output do I get here? I get six. Notice the threshold: six is definitely greater than five, which means my output will be one, or you can say my neuron will fire, or I'll actually go to the beer festival. So even if these two inputs are zero for me, meaning my wife is not willing to go with me and there is no public transport available, the weather is good, which has a very high weight value and matters a lot to me, so the neuron fires regardless of whether the other two inputs are high or not. On the other hand, if the weather is bad, then even if my wife is going and public transport is available, the weighted sum is only two plus two, which is four; that is not greater than the threshold of five, so my neuron will not fire and I'll stay at home. So these are the two scenarios that I have discussed with you. All right.

So there can be many other ways in which you can assign weights to your problem or to your learning algorithm, but these are the ways in which you can assign weights and prioritize the inputs or factors on which your output will depend. Obviously, in real life, not all the inputs or factors are equally important, so you prioritize them, and the way you do that in a perceptron is by giving a higher weight to the important ones. This is just an analogy so that you can relate a perceptron to real life. We'll discuss the math behind it later in the session, as to how a neuron learns; how the weights are actually updated and how the output is computed, all those things we'll be discussing later in this session.
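To make the analogy concrete, here is a tiny Python sketch of the beer-festival perceptron. The weights (6, 2, 2) and the threshold (5) come from the example above; everything else, such as the function name, is just illustrative.

```python
# A minimal perceptron for the beer-festival example above.
# Inputs: x1 = weather is good, x2 = wife is going, x3 = public transport available.
def perceptron(x1, x2, x3, w1=6, w2=2, w3=2, threshold=5):
    weighted_sum = x1 * w1 + x2 * w2 + x3 * w3
    # step activation: fire (output 1) only if the weighted sum exceeds the threshold
    return 1 if weighted_sum > threshold else 0

# Scenario 1: weather good, wife stays home, no transport -> 6 > 5, neuron fires
print(perceptron(1, 0, 0))   # 1 -> going to the beer festival

# Scenario 2: weather bad, wife going, transport available -> 4 <= 5, neuron doesn't fire
print(perceptron(0, 1, 1))   # 0 -> staying at home
```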
But my aim is to make you understand that you can actually relate a real-life problem to a perceptron. All right? And in real life, problems are not that easy; the problems we actually face are very complex. In order to solve those problems, a single neuron is definitely not enough, so we need networks of neurons, and that's where the artificial neural network, or you can say the multi-layer perceptron, comes into the picture. Let's discuss the multi-layer perceptron, or artificial neural network. This is how an artificial neural network actually looks: over here we have multiple neurons present in different layers. The first layer is always your input layer; this is where you actually feed in your input. Then we have the first hidden layer.
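As a quick illustration of that layered structure, here is a minimal Keras sketch of a multi-layer perceptron with an input layer, one hidden layer, and an output layer. The layer sizes are assumptions for illustration, not values from the session.

```python
# A small multi-layer perceptron (artificial neural network) sketch.
# Layer sizes are illustrative assumptions, not values from the session.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(3,)),                       # input layer: 3 features fed into the network
    keras.layers.Dense(4, activation="sigmoid"),   # first hidden layer: 4 neurons
    keras.layers.Dense(1, activation="sigmoid"),   # output layer: a single neuron
])
model.summary()   # prints the layer-by-layer structure
```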