How to implement Decision Trees from scratch with Python
Duration: 37:24


AssemblyAI · 15.09.2022 · 102,153 views · 1,995 likes


Video description
In the fourth lesson of the Machine Learning from Scratch course, we will learn how to implement Decision Trees. This one is a bit longer due to all the details we need to implement, but we will go through it all in less than 40 minutes.
Code: https://github.com/AssemblyAI-Examples/Machine-Learning-From-Scratch
Previous lesson: https://youtu.be/YYEJ_GUguHw
Next lesson: https://youtu.be/kFwe2ZZU7yw
Welcome to the Machine Learning from Scratch course by AssemblyAI. Thanks to libraries like Scikit-learn we can use most ML algorithms with a couple of lines of code, but knowing how these algorithms work inside is very important, and implementing them hands-on is a great way to achieve this. Mostly, they are easier to implement than you'd think. In this course we will learn how to implement these 10 algorithms: we will quickly go through how each algorithm works and then implement it in Python with the help of NumPy.

Table of contents (14 segments)

Introduction

Hey, and welcome! In this lesson we're going to learn how to implement decision trees from scratch. First, let's take a look at how decision trees work.

Say we have a dataset of whether we can afford a house or not, with information on which neighborhood the house is in, how many rooms it has, and the affordability label. To build a decision tree, we start with a root node that holds all of our data points, eight of them in this example. First we split on the neighborhood feature: houses in the east go to one side, houses in the west to the other. Then we can split again on the rooms column, but because it's a numerical value we don't branch on every number. Instead we pick a threshold, for example "is the number of rooms higher than 5?": if it is higher, the data point goes to one side; if lower, to the other. After these splits we reach leaf nodes that each contain a single class, either yes or no.

To quickly recap the terminology: the first node, where all the data points are located, is the root node; a terminal node is called a leaf node, the end of a path; and the left or right path out of a node is called a branch, so every split creates a left and a right branch.

In a tree like that, a couple of things need to be decided. Which feature do we split a branch with? Right now we're at the root node: how did we decide to divide on the neighborhood feature and not the rooms feature first? When splitting on a numerical column, at which point should we split: are the rooms more or fewer than five, or three, or ten? And lastly, when do we stop splitting? Here we split at most two times and reached leaf nodes with a pure class, either yes or no, nothing mixed. But sometimes you have a very big dataset and you would have to build a very big tree to reach pure leaf nodes, so you may have to decide when to stop beforehand.

Let's quickly go over how trees are built and used. During training, given the whole dataset, we calculate the information gain of each possible split: every feature and, if the feature is numerical, every possible threshold inside that feature. Then we divide the dataset using the feature and value that give the most information gain, creating two branches, and repeat the whole procedure for all newly created nodes until a stopping criterion is reached. During inference we follow the tree, based on the feature values of our data point, until we reach a leaf node. If the leaf node is pure, we return its class; if it is not pure, we do a majority vote and return the most common class label.

We used two terms that still need defining. Information gain is calculated as the entropy of the parent minus the weighted average of the entropy of the children. That introduces another term: entropy, which is basically the lack of order. In our example, the entropy of the root node is higher than the entropy of the leaf nodes. A leaf node containing only "no" values has entropy 0 because there is no disorder, and the same goes for the other pure leaf; the root contains both yes and no labels, so its entropy is higher. If the classes are split 50/50, we get the highest possible entropy, which is 1, so entropy ranges from 0 to 1. The equation is E = -Σ p(x) · log2(p(x)), where p(x) is the number of times class x occurs in this node divided by the total number of samples in it.

Lastly, the stopping criteria. As mentioned, a tree might grow too big, and you might not be able to reach all possible pure leaf nodes, so you may want to stop the tree early. You can use a maximum depth (how many layers of nodes you allow the tree to grow), a minimum number of samples a node must have (if it has fewer, you don't divide that node anymore), and a minimum impurity decrease (the minimum entropy change that needs to take place for a split to happen). Now let's start implementing decision trees.
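As a quick sanity check of the definitions above, here is a minimal sketch (my own illustration, not yet the code built later in the lesson) that computes entropy and information gain for an 8-point example where a 50/50 parent splits into two pure children:

```python
import numpy as np

def entropy(y):
    # p(x) = occurrences of each class / total samples in the node
    ps = np.bincount(y) / len(y)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 4 yes / 4 no, so entropy is 1.0
left = np.array([1, 1, 1, 1])                # pure child, entropy 0
right = np.array([0, 0, 0, 0])               # pure child, entropy 0

# information gain = entropy(parent) - weighted average entropy of children
gain = entropy(parent) - (len(left) / len(parent) * entropy(left)
                          + len(right) / len(parent) * entropy(right))
print(gain)  # a perfect split removes all disorder: the gain is 1.0
```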

Decision Trees

All right, let's get started with the implementation of decision trees. There are two different classes I'm going to create this time: a Node class and a DecisionTree class, because there are some things I want to keep in the nodes themselves. If you remember, one of these circles is a node, and the whole structure is the tree, and I want to keep the node and the tree separate. The things I want to include in a Node are: which feature it was divided with, which threshold, the left node it points to, the right node, and also the value of the node if it is a leaf node; otherwise the value is just None. Feature, threshold, left, and right all default to None, though we will usually pass them something. Lastly, for the value we can use a Python trick I saw from Python Engineer: adding an asterisk before the parameter makes it keyword-only, so from now on, if I want to pass the value parameter, I have to pass it by name, as in Node(value=...).

Node

This makes it a bit more obvious in our code when we're creating a leaf node, because the value will only be passed to leaf nodes. So let me just write these assignments: self.feature = feature, self.threshold = threshold, self.left = left, and self.right = right. We can also include a little check, is_leaf_node, to see whether a specific node is a leaf: it returns self.value is not None, which automatically tells me that if the value exists, the node is a leaf node. With that, I'm basically done with the Node.
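Put together, the Node class described above might look like this (a sketch following the walkthrough; the attribute names are the ones spoken in the lesson):

```python
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, *, value=None):
        self.feature = feature      # index of the feature this node splits on
        self.threshold = threshold  # split threshold for that feature
        self.left = left            # left child node
        self.right = right          # right child node
        self.value = value          # class label; only set for leaf nodes (keyword-only)

    def is_leaf_node(self):
        # a node is a leaf exactly when it carries a value
        return self.value is not None
```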

Decision Tree

Now we can start with the DecisionTree class. As before we're going to need a fit function and a predict function, plus a bunch of helper functions of course, so let's start building. The initialization method takes some stopping criteria: min_samples_split, for example, and max_depth. Then we can take a number of features, n_features, from whoever is creating the tree. This is a way to add some randomness: instead of using all of the features in your tree you use just a subset of them, which will be specifically useful when we build random forests on top of decision trees. And of course we want to keep access to the root node, because that's where we're going to start traversing the tree at inference time. For defaults, let's say min_samples_split=2, max_depth=100 to begin with (that can change), and n_features=None for now. All right, let's start fitting our tree. We need to pass fit the X and y values, and since it would become a very big function, it's better to start dividing it up: I'll create a helper function called _grow_tree that returns the root of the tree at the end. We're going to build the tree recursively, as you'll see in a second, and we pass this _grow_tree function the X and y values.
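The constructor and fit skeleton just described might be sketched as follows; the `_grow_tree` body is built up over the next sections, so here it is left as a stub:

```python
import numpy as np

class DecisionTree:
    def __init__(self, min_samples_split=2, max_depth=100, n_features=None):
        # stopping criteria
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        # optional feature subsampling (adds randomness; useful later for random forests)
        self.n_features = n_features
        self.root = None

    def fit(self, X, y):
        # never use more features than X actually has
        self.n_features = X.shape[1] if not self.n_features else min(X.shape[1], self.n_features)
        self.root = self._grow_tree(X, y)

    def _grow_tree(self, X, y, depth=0):
        raise NotImplementedError  # filled in step by step in the next sections
```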

Number of Features

One thing we need to check in fit is that n_features does not exceed the number of actual features we have. X.shape first gives us the number of samples and then the number of features, as you might remember from previous lessons, so we set self.n_features to X.shape[1] if n_features was not defined at initialization, and otherwise to the minimum of X.shape[1] and n_features. This just makes sure we don't get an error later.

From here we can actually start building our _grow_tree function. As we said, it takes self, X, and y. What we do inside is: first check the stopping criteria, then find the best split, then create the child nodes, and for those we basically call _grow_tree again. When we create child nodes we're creating new subtrees, in a way, which is why this is a recursive function. Let's start by getting some information: the number of samples, n_samples, and number of features, n_feats, from X.shape. We'd also like the number of labels, which we get by taking all the unique values of y (let me import NumPy here): np.unique(y), whose length is the number of labels. I also need to pass a depth to _grow_tree: at first it's zero, and as we create child nodes we increase it by one. The checks are: if depth is greater than or equal to self.max_depth, we can't grow further; if the number of labels we just found equals 1, the node is pure (a leaf node) and we don't want to split it again; and if the number of samples in this current place is smaller than self.min_samples_split, we don't grow the tree anymore either.

In that case, what we do is create a leaf node and return it, passing only a value parameter. But first we need to work out what the value is. If there is only one label, it's easy; otherwise we need to calculate it, so I'll create another helper function, _most_common_label, and pass it y. It's similar to something from the previous lessons, so you might remember it: I first import the Counter data structure from collections (with a capital C), create a counter from the y values, call most_common(1) on it to get only the single most common entry, and then take the first element of that most common tuple, which holds the value. I showed how this works in a previous lesson, logistic regression I think, so you can go check that out or just read the documentation; it's also there. That gives us the leaf value, we pass it to the Node, and this part of the tree is basically concluded.
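The majority-vote helper described above can be sketched like this (shown as a free function here so it runs on its own; in the class it is `self._most_common_label`):

```python
from collections import Counter

def most_common_label(y):
    counter = Counter(y)
    # most_common(1) returns [(label, count)]; take the label of that top pair
    return counter.most_common(1)[0][0]
```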

Best Split

But if this is not the case, if we don't get caught by the stopping criteria, we need to find the best split, and for that, why not just use another helper function? We'll create _best_split and pass it X, y, and the features I want to include in the consideration of creating a new split. This is basically part of where we create the randomness in decision trees, and the helper is going to return the best threshold and the best feature. So I also need to compute the feature indices, and that depends on self.n_features, how many features we want to consider at a time. For that I use np.random.choice: we pass it the total number of features we have, n_feats, the number we want to select, self.n_features, and also set replace=False. By default it's True, and setting it to False means that when we select the random group we don't get duplicates, only unique features.

Helper Function

Now we can create the helper function that selects the best split. We pass it what it needs: X, y, and the feature indices. What we do here is find, among all possible thresholds and all possible splits, which one is best. First we keep a best_gain variable, set to -1 at first, and the split_idx and split_threshold that we're going to return, both None at first. Then we traverse all the options: for each feature index in feat_idxs, we first get the column, X[:, feat_idx], so the X_column variable holds only that column. We get all the possible thresholds with np.unique on the column, giving every value we could split with, and then traverse the thresholds (I'll call the loop variable thr to make it easier). For each one we calculate the information gain, again with a helper function, and if the gain is better than the best gain acquired so far, best_gain becomes gain, split_idx becomes the index we're working on right now, and split_threshold becomes the threshold. Once done we return the split index and the threshold. I briefly had them switched here, so make sure they're assigned to the correct ones: first the feature, then the threshold.
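The search just described can be sketched as a free function; the gain function is passed in as a parameter purely so this sketch is self-contained (in the class it would be `self._information_gain`):

```python
import numpy as np

def best_split(X, y, feat_idxs, information_gain):
    best_gain = -1                       # any real gain beats this
    split_idx, split_threshold = None, None
    for feat_idx in feat_idxs:
        X_column = X[:, feat_idx]        # all samples, one feature
        for thr in np.unique(X_column):  # every distinct value is a candidate threshold
            gain = information_gain(y, X_column, thr)
            if gain > best_gain:
                best_gain = gain
                split_idx = feat_idx     # feature first...
                split_threshold = thr    # ...then threshold
    return split_idx, split_threshold

# the random feature subset mentioned above would be drawn like:
# feat_idxs = np.random.choice(n_total_features, n_features, replace=False)
```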

Information Gain

And now we need to calculate the information gain, so I'm going to create another helper function: _information_gain, taking self, y, the X column, and the threshold we're working with, thr. Let's check how we're supposed to calculate it: it is the entropy of the parent minus the weighted average of the entropy of the children, so it looks like we're going to need yet another helper function for entropy. The plan is: first get the parent entropy, then create the children, then calculate the weighted entropy of the children, and finally calculate the information gain using all of these pieces. The parent entropy is self._entropy(y), and well, now that we're here, why don't we just define _entropy? It takes self and a y. Let's see how we are supposed to

Entropy

calculate entropy. It is the negative sum of p(x) times log2 of p(x), where p(x) is the number of times x has occurred divided by the number of values. I'm going to use a NumPy trick for this, np.bincount, which works kind of like a histogram; I'll show you over here. Say I have an array y with values [1, 2, 3, 1, 2]. If I create a bin count of y, I get [0, 2, 2, 1], meaning 0 has occurred zero times, 1 has occurred two times, 2 has occurred two times, and 3 has occurred one time: basically a nice little histogram. The next thing I do is divide this histogram by the number of values I have, and why? Because that is exactly p(x), the number of occurrences divided by the total count, but computed for all classes at once. The last thing to do, written in one line: for each p in ps, if p is bigger than zero, accumulate p times log2(p); sum these up, and, nearly forgot, add a minus sign in front. Return that, and that will be our entropy.

Now we have the parent entropy; let's create the children and calculate their entropies too. For that I have to create yet another helper function, which I'll just call _split for now. It gets the X column and the splitting threshold, and in it I need to find which indices go to the left and which to the right. For that I'm going to use np.argwhere; I'll show you how it works. Say X is my column with some values in it: if I say np.argwhere(X < 3), it tells me that the zeroth, first, and second elements are smaller than 3, but it does so as a list of lists, and I want just one flat list, so I call .flatten(). That's what we do here: left_idxs is the flattened array of indices where the value of the X column is smaller than or equal to the split threshold, and right_idxs the same but with bigger than. Let's separate these onto their own lines so it's easier to read, and then return the left indices and the right indices; that's all, actually. Back in the information gain, after this we should check whether one of these is empty: if the left indices are empty or the right indices are empty, we just return that the information gained from this split is 0. Okay, now it's time to calculate the
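The two helpers from this section, sketched with the NumPy tricks just described (`np.bincount` as a histogram, `np.argwhere(...).flatten()` for flat index arrays):

```python
import numpy as np

def entropy(y):
    hist = np.bincount(y)  # histogram of class counts, e.g. [1,2,3,1,2] -> [0,2,2,1]
    ps = hist / len(y)     # p(x) for every class at once
    return -np.sum([p * np.log2(p) for p in ps if p > 0])

def split(X_column, split_thresh):
    # argwhere returns a column of index lists; flatten to a plain 1-D array
    left_idxs = np.argwhere(X_column <= split_thresh).flatten()
    right_idxs = np.argwhere(X_column > split_thresh).flatten()
    return left_idxs, right_idxs
```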

Weighted entropy

weighted entropy of the children. The first thing I want to get is n, the length of y, how many samples we have; then the number of samples in the left and in the right, n_l and n_r, which are the lengths of the left and right index arrays (mind the plural s on those variable names). Then we calculate the entropy of the left and the entropy of the right by passing them to the _entropy helper, using the left indices and the right indices. But of course we need to pass the actual values, not just the indices, so we need to find the

Child entropy

y values that belong to these indices. Finally we calculate the child entropy, and let's see how: it is the weighted average, where "weighted" means we account for how many samples fall on each side. So it is the number of samples on the left divided by the total number of samples, times the left entropy, plus the number of samples on the right divided by the total, times the right entropy. That gives me the entropy of the children, and we already have the parent's, so the only thing left is the information gain, which is the parent entropy minus the child entropy, and then we return it. There you go, we also finished the information gain function.

Let's go back and see where we left off. What happens is: when we call fit, we check that the number of features is not more than the actual number of features and call _grow_tree. _grow_tree first checks the stopping criteria and then calls _best_split to find the best split; _best_split calls _information_gain; _information_gain calls _entropy, and _split is run to tell it what the left and right subtrees are. (I'm checking for typos at the same time; the child entropy call needed to look like this.) The information gain is calculated, passed back, the best split is selected in _best_split, and control goes back to growing the tree. So the last thing that we need
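Assembled, the gain computation from the last few sections might read as follows (the entropy helper is repeated so this sketch runs on its own):

```python
import numpy as np

def entropy(y):
    ps = np.bincount(y) / len(y)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])

def information_gain(y, X_column, threshold):
    parent_entropy = entropy(y)
    # create children
    left_idxs = np.argwhere(X_column <= threshold).flatten()
    right_idxs = np.argwhere(X_column > threshold).flatten()
    if len(left_idxs) == 0 or len(right_idxs) == 0:
        return 0  # a degenerate split gains nothing
    # weighted average of the children's entropy
    n, n_l, n_r = len(y), len(left_idxs), len(right_idxs)
    e_l, e_r = entropy(y[left_idxs]), entropy(y[right_idxs])
    child_entropy = (n_l / n) * e_l + (n_r / n) * e_r
    return parent_entropy - child_entropy
```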

Child nodes

to do is to create the child nodes, and we already did part of the work. We get the left indices and right indices by calling, guess what, _split, because _split is the thing that calculates which indices go left and right. It requires an X column and a split threshold: we know the split threshold, it's best_thresh, and the X column in this case is the column that belongs to the best feature, X[:, best_feature]. That gives me the left and right indices, and now I call _grow_tree on each side using those index values: the left child is self._grow_tree(X[left_idxs, :], y[left_idxs], depth + 1), and don't forget to increase the depth by one; the right child is more or less the same, self._grow_tree with the right indices and y[right_idxs], again with depth plus one. What we return here is a new Node created with the information we divided on: the best feature, the threshold we divided with, and of course the left and right children, which we also need to pass.
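Everything up to this point can be condensed into one self-contained sketch of the recursive growth. This is a simplified standalone version: feature subsampling is omitted, and I've added one extra guard (not in the lesson) that returns a leaf when no split improves purity, to keep the recursion safe on degenerate data:

```python
import numpy as np
from collections import Counter

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, *, value=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.value = left, right, value

    def is_leaf_node(self):
        return self.value is not None

def entropy(y):
    ps = np.bincount(y) / len(y)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])

def information_gain(y, X_column, thr):
    left = np.argwhere(X_column <= thr).flatten()
    right = np.argwhere(X_column > thr).flatten()
    if len(left) == 0 or len(right) == 0:
        return 0
    n = len(y)
    child = len(left) / n * entropy(y[left]) + len(right) / n * entropy(y[right])
    return entropy(y) - child

def grow_tree(X, y, depth=0, max_depth=10, min_samples_split=2):
    n_samples, n_feats = X.shape
    n_labels = len(np.unique(y))
    # stopping criteria: depth limit, pure node, or too few samples -> make a leaf
    if depth >= max_depth or n_labels == 1 or n_samples < min_samples_split:
        return Node(value=Counter(y).most_common(1)[0][0])
    # exhaustive best-split search over every feature and threshold
    best_gain, best_feat, best_thr = -1, None, None
    for feat in range(n_feats):
        for thr in np.unique(X[:, feat]):
            gain = information_gain(y, X[:, feat], thr)
            if gain > best_gain:
                best_gain, best_feat, best_thr = gain, feat, thr
    if best_gain <= 0:  # extra guard: no split helps, so stop here
        return Node(value=Counter(y).most_common(1)[0][0])
    # create the child nodes recursively, one level deeper
    left_idxs = np.argwhere(X[:, best_feat] <= best_thr).flatten()
    right_idxs = np.argwhere(X[:, best_feat] > best_thr).flatten()
    left = grow_tree(X[left_idxs], y[left_idxs], depth + 1, max_depth, min_samples_split)
    right = grow_tree(X[right_idxs], y[right_idxs], depth + 1, max_depth, min_samples_split)
    return Node(best_feat, best_thr, left, right)
```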

Predict function

All right, so it looks like the training part has ended; the next thing we want to do is write the predict function. We pass it self and the X test set. For every x in the set that is passed to me, I want to traverse the tree to find the result, so: one last helper function, which returns one result per sample; I collect these in a list, parse it into a NumPy array, and return it. The traversing of the tree will also happen recursively. _traverse_tree takes x and the node we're starting from, which at first is self.root. First I check whether this node is a leaf node (that's what the is_leaf_node function was for); if so, we return the value of the node. Otherwise, remember that when we created the left and right nodes we passed them the feature they were divided with, so we just look up that feature's value in x: if it is smaller than or equal to the node's threshold, we pass the left side of the tree to be traversed next, returning self._traverse_tree(x, node.left); if that is not the case, we pass it the right. And that's all we need to do to traverse the tree: all the paths we follow end up in a leaf node anyway, we get one value for each sample in our test set, turn the list into a NumPy array, and return
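Traversal can be tried out on a tiny hand-built tree, without any training (the Node class is repeated so this sketch stands alone):

```python
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, *, value=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.value = left, right, value

    def is_leaf_node(self):
        return self.value is not None

def traverse_tree(x, node):
    if node.is_leaf_node():
        return node.value  # every path ends in a leaf
    if x[node.feature] <= node.threshold:
        return traverse_tree(x, node.left)
    return traverse_tree(x, node.right)

# hand-built tree encoding "feature 0 <= 2.5 -> class 0, else class 1"
root = Node(feature=0, threshold=2.5, left=Node(value=0), right=Node(value=1))
```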

Accuracy test

it. Awesome, it looks like we did everything, so let's run it and see whether we made any mistakes and what the accuracy is going to be like. For that I'm importing the breast cancer dataset from scikit-learn again, importing our DecisionTree, and creating one; for the first run I'll just keep the default values we already decided on. Then we fit the decision tree with X_train and y_train, and get predictions using the predict method, passing it X_test. To see how good the results are we of course need a metric; we'll use accuracy for now. It takes y_test and y_pred, and it's basically the number of times y_test equals y_pred divided by the length of y_test, which we just return. Let's calculate the accuracy, acc = accuracy(y_test, predictions), and print it afterwards.

All right, I'm excited, let's see... oh, an error: on line 51 I put a little comma accidentally. Let's try again. It took me a second, but I found what I did wrong: in the initialization of the Node, instead of self.value = None it should have been self.value = value. And while figuring that out I found another problem, in the _traverse_tree function: I was returning node.value(), calling the value, but I'm not supposed to call it, just return the value of this node. So let's run this once again and see what we get. We get an accuracy of 0.91, which is pretty good for something we wrote in a couple of minutes. Let's try passing some other arguments, for example max_depth=10: we get a bit of a better accuracy here, and you can play around with this and see what you can get on this breast cancer dataset.

I hope everything was clear; I know this was a bit of a longer one, because there's a bit more involved in creating decision trees. If you have any questions, don't forget to leave them as a comment; we'll be happy to help you there. And if you want this code, you can find it in our GitHub repository as always; the link is in the description. I'm looking forward to seeing you in the next lesson!
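The accuracy metric used in this final check is just the fraction of matching labels; as a sketch:

```python
import numpy as np

def accuracy(y_test, y_pred):
    # number of correct predictions over total number of test samples
    return np.sum(y_test == y_pred) / len(y_test)
```

With the breast cancer dataset it would be used as `accuracy(y_test, clf.predict(X_test))` after fitting.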
