Decision Trees - VisuallyExplained


Table of Contents (2 segments)

Segment 1 (00:00 - 05:00)

Decision trees are a machine learning model used for data classification. Unlike most modern machine learning techniques, such as neural networks, which are complex and often act like a black box, decision trees rely on simple decision rules based on a series of yes-or-no questions that are easy to understand and interpret. This makes them an amazing tool to have in your machine learning toolkit.

Let's walk together through an example of how to use decision trees in Python, and feel free to follow along; you can find the link to the code in the description. To set the stage, make sure you have the libraries pandas and scikit-learn installed. Then download the Pokémon dataset, and finally fire up Python, or in my case IPython, which I find slightly more convenient.

The first thing we'll do is load the Pokémon dataset into memory with the help of the pandas library. Taking a closer look at this data, we can see that it has 800 rows, and each row gives the stats for a given Pokémon. For example, the first row gives the stats of Bulbasaur: this Pokémon has an HP of 45, an Attack of 49, a Defense of 49, is of type Grass, and so on. For the sake of simplicity, let's focus our attention on only two types of Pokémon: Grass, like Bulbasaur, and Electric, like Pikachu.

Now, here is the problem statement. Suppose the type of each Pokémon is unknown to us, and we want to guess it based on the Pokémon's other stats, or what we call features. This is a classification problem: we are trying to classify each Pokémon into one of two types, or labels, based on a set of features. To solve it, we will build a decision tree that takes a Pokémon's stats as input and outputs the correct type for that Pokémon. In order to build such a tree, we need training data that is already labeled correctly. This is why decision trees are called a supervised learning algorithm.
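The loading and filtering steps above might look like the sketch below. The real dataset is a CSV linked in the video description; since it is not available here, a tiny stand-in DataFrame is built inline, with column names assumed from the stats mentioned in the transcript (HP, Attack, Defense, Speed, Type 1).

```python
import pandas as pd

# Stand-in for the Pokémon dataset; in practice you would load the real
# file instead, e.g.: df = pd.read_csv("Pokemon.csv")
df = pd.DataFrame({
    "Name":    ["Bulbasaur", "Pikachu", "Oddish", "Voltorb", "Charmander"],
    "Type 1":  ["Grass", "Electric", "Grass", "Electric", "Fire"],
    "HP":      [45, 35, 45, 40, 39],
    "Attack":  [49, 55, 50, 30, 52],
    "Defense": [49, 40, 55, 50, 43],
    "Speed":   [45, 90, 30, 100, 65],
})

# Keep only the two types we focus on: Grass and Electric
two_types = df[df["Type 1"].isin(["Grass", "Electric"])].reset_index(drop=True)
print(two_types[["Name", "Type 1", "Speed"]])
```

On the real 800-row dataset the same `isin` filter would keep all Grass and Electric Pokémon and drop the rest.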
They require a training dataset with the correct labels to learn from; then they can be used to predict the labels of new data. Back in Python, let's define the training dataset. Then we create a DecisionTreeClassifier from scikit-learn with the max_depth argument set to one (we will revisit this argument in a moment) and call its fit method. Believe it or not, that's all it takes to train a decision tree in Python. Seriously, Python is awesome.

To see what the tree we just learned looks like, we can call the plot_tree function to get a visual representation. What it shows is: given a Pokémon, we check whether its Speed is less than 85.5. If the answer is yes, we predict the type of this Pokémon to be Grass; if the answer is no, we predict Electric. Now we see the appeal of decision trees. A decision tree is nothing but a simple yes-or-no question, or, as we will see later, a series of yes-or-no questions that lead to a conclusion at the end, and those questions can be learned automatically from the data.

Now, if you have a new Pokémon and you want to know its type, you can manually follow the tree and see where it leads you. But this gets tedious if you want to know the type of a large number of Pokémon, or if the tree grows in size. A simpler way is to call the predict method from scikit-learn: it takes as input a dataset of Pokémon stats and outputs predictions for their types.

Let's now see how the fit method learns to predict the types of Pokémon from our training data. The fit method looks at each feature in the data, for example the Speed feature. It lines up all the Pokémon according to this feature, looks at their types, and finds the cutoff that best separates the data, in the sense that all Grass Pokémon fall on one side and all Electric Pokémon fall on the other. Of course, it's not always possible to cleanly separate the data.
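The fit-and-predict workflow described above can be sketched as follows. The feature values are illustrative stand-ins, not the real dataset, and the feature order (HP, Attack, Defense, Speed) is an assumption; the depth-1 tree learns a single yes-or-no question from them.

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative training data; assumed feature order: HP, Attack, Defense, Speed
X_train = [
    [45, 49, 49, 45],   # Bulbasaur-like, slow   -> Grass
    [35, 55, 40, 90],   # Pikachu-like, fast     -> Electric
    [45, 50, 55, 30],   # slow and sturdy        -> Grass
    [40, 30, 50, 100],  # very fast              -> Electric
]
y_train = ["Grass", "Electric", "Grass", "Electric"]

# max_depth=1 limits the tree to a single yes-or-no question
clf = DecisionTreeClassifier(max_depth=1, random_state=0)
clf.fit(X_train, y_train)

# Predict the type of a new, unseen Pokémon from its stats
print(clf.predict([[60, 60, 60, 20]]))  # a slow Pokémon

# To visualize the learned question:
#   from sklearn.tree import plot_tree; plot_tree(clf)
```

On this toy data the single learned split cleanly separates the two types, so the slow query Pokémon comes out as Grass, mirroring the Speed < 85.5 rule learned on the real dataset.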
So the fit method finds the cutoff with the best accuracy, where accuracy is the ratio of correctly labeled Pokémon over the total number of Pokémon. In this case, the best cutoff is 85.5, and it leads to an accuracy of 73%. The fit method then loops through all the features in the data and picks the one with the best accuracy; in our case, it picked the Speed feature. In Python, there is a convenient way to compute the accuracy of a decision tree: we can call the accuracy_score function, which takes as input the correct labels of our Pokémon and the predictions that our tree makes.
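Computing the accuracy with accuracy_score can be sketched like this. The stand-in data below is cleanly separable by one cutoff, so the toy score is 1.0; on the real dataset the same call yields the 73% quoted above.

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data; assumed feature order: [HP, Speed]
X = [[45, 45], [35, 90], [45, 30], [40, 100], [50, 55], [30, 85]]
y = ["Grass", "Electric", "Grass", "Electric", "Grass", "Electric"]

clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
predictions = clf.predict(X)

# Ratio of correctly labeled Pokémon over the total number of Pokémon
print(accuracy_score(y, predictions))  # 1.0 on this cleanly separable toy data
```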

Segment 2 (05:00 - 08:00)

To improve the accuracy, we can increase the depth of the decision tree, which controls how many features the tree looks at before reaching a conclusion, or in other words, how many yes-or-no questions we ask before we reach a prediction. This is done through the max_depth argument that we pass to the DecisionTreeClassifier. Increasing it to two leads to the following tree. We first split the data according to the Speed feature, just like before. If the Speed is less than 85.5, we then look at the Defense feature and check whether it is less than 34.5: if the answer is yes, we predict Electric; otherwise, we predict Grass. A similar reasoning applies to the other branch of the tree: if the answer to the first question is no, we look at the HP feature and check whether it is less than 90.5. Again, note that behind the scenes the fit method automatically picks both the features and the cutoff values at which to split the data.

By increasing the depth to two, the accuracy jumps to 78%. And as you continue to increase the depth, the accuracy will monotonically increase, which might look amazing, but there is a big danger associated with it: overfitting. Overfitting happens when our model simply memorizes the data instead of learning the underlying patterns in it. This is a problem because it hurts generalization: if we are presented with a new Pokémon that is not in our training data, it is very likely that the model will not predict its type correctly. To illustrate this point, instead of using our entire dataset to train the decision tree, let's randomly split it into two parts: we keep 20% of the data to test how well our model generalizes, and use the remaining 80% for training.
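The 80/20 split and the deeper tree can be sketched with scikit-learn's train_test_split. The data below is synthetic, built on the assumption (consistent with the tree above) that Grass Pokémon tend to be slower than Electric ones, with some overlap so the split is not trivial.

```python
import random

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: [Speed, Defense] per Pokémon
random.seed(0)
X, y = [], []
for _ in range(50):
    X.append([random.randint(20, 75), random.randint(20, 80)])   # slower
    y.append("Grass")
    X.append([random.randint(65, 120), random.randint(20, 80)])  # faster
    y.append("Electric")

# Hold out 20% of the data for testing, train on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# A depth-2 tree asks up to two yes-or-no questions before predicting
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)
print(len(X_train), len(X_test))  # 80 20
```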
We can then train our decision tree on the training data and keep track of both the training accuracy, which measures how well our model fits the training data, and the test accuracy, which measures how well our model generalizes to new data. Typically the test accuracy will be slightly lower than the training accuracy, which makes sense, since our model has never seen the test data before. But now watch what happens as we increase the depth of the decision tree: the training accuracy keeps increasing, but the test accuracy plateaus or even decreases. This is a well-known trade-off in machine learning between how complex a model should be to fit the training data and how well it generalizes to new data it has never seen before.

There are many ways to mitigate this. One way we have already seen is to limit the depth of the tree, but that's not the only one. For instance, a technique called pruning is a post-processing step that finds parts of the decision tree that are not important and do not contribute much to the decision process, and simply cuts them out. Another technique is random forests: instead of learning one big tree, you learn multiple small trees and make predictions based on, for example, a majority vote.

There is a lot more to say about decision trees, and we have only covered the basics here, but we will stop at this short introduction. I hope you enjoyed this video and found it useful. Please like and subscribe if you liked this video, and see you next time.
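The trade-off described above can be observed directly by sweeping the depth and comparing training and test accuracy; a random forest is shown as one mitigation. The data is synthetic and deliberately overlapping so that a perfect split is impossible, and the exact numbers are illustrative.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic one-feature (Speed) data with heavy overlap between the classes
random.seed(0)
X = [[random.randint(20, 75)] for _ in range(60)] + \
    [[random.randint(60, 120)] for _ in range(60)]
y = ["Grass"] * 60 + ["Electric"] * 60
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Training accuracy keeps rising with depth; test accuracy plateaus or drops
train_accs, test_accs = [], []
for depth in (1, 2, 4, 8):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    train_accs.append(accuracy_score(y_train, clf.predict(X_train)))
    test_accs.append(accuracy_score(y_test, clf.predict(X_test)))
    print(f"depth={depth}  train={train_accs[-1]:.2f}  test={test_accs[-1]:.2f}")

# One mitigation: a random forest averages many small trees (majority vote)
forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
forest.fit(X_train, y_train)
print(f"forest test={accuracy_score(y_test, forest.predict(X_test)):.2f}")
```

Because each extra split can only refine the fit to the training set, the training accuracies form a non-decreasing sequence, while the held-out test accuracy is free to stall or fall, which is exactly the overfitting pattern the transcript warns about.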
