# (ML 2.5) Generalizations for trees (CART)

## Метаданные

- **Канал:** mathematicalmonk
- **YouTube:** https://www.youtube.com/watch?v=UMtBWQ2m04g
- **Дата:** 06.06.2013
- **Длительность:** 13:20
- **Просмотры:** 35,036

## Описание

Other "impurity" quantities (entropy and Gini index), and generalizations of decision trees for classification and regression using the CART approach.


A playlist of these Machine Learning videos is available here:
http://www.youtube.com/my_playlists?p=D0F06AA0D2E8FFBA

## Содержание

### [0:00](https://www.youtube.com/watch?v=UMtBWQ2m04g) Segment 1 (00:00 - 05:00)

we just saw how to grow a classification tree using the cart methodology and in this video we'll take a look at some generalizations of this type of procedure so the first type of generalization regards this the error measure that we use remember we had this e subr which was the this where is that so D subr was the fraction of points in this region r that was a subset of D dimensional real space the fraction that would be misclassified if we use a majority boat in that region and we used this this sort of metric this esar as our criteria for splitting and this is what is referred to well in this context at least they refer to as an impurity measure so let me say different impurity measures or quantities so the first one was this e subr and that's the misclassification rate in the region R another one which can be used is the entropy H subr and this has a very nice interpretation actually if we have this redraw this little picture for our reference we had some points oops let me make that blue so we got some Blue Points here and we've got some red points and if for some region R maybe this is our region R there's a there's a if we look at the empirical probability distribution on this region that was this thing that we were talking about earlier we had this fraction of points this is the defined an empirical probability distribution over the classes in this region so in this case the this empirical distribution has let's call the blues zeros and the Reds ones so the probability of zero in R maybe we should Den noty subscript R something like that is so we said blue zero so that's three out of four and the probability of ones is4 and if we had other classes you know we could have probability of two and so on so another option here instead of using the misclassification rate we could use the entropy of this distribution which is the sum over classes so maybe I'll right see well we were using Y before for classes let stick with that so this the sum let me just say I'll just say y over all this is some finite set of classes and I should mention this is these are all generalizations for the case of classification so the possible classes are some finite set and we would sum over these and we have this empirical distribution times the log and it's minus this quantity that makes the entropy positive so instead of using the misclassification rate you could instead use the entropy and this has a very actually a very appealing interpretation which is we want to choose a split let's go back to one of these earlier examples here so in our very first example when we were talking about decision trees we had these ones Reds and the blue zeros and we said well let's make our first split here and this was appealing because that made this whole region be all have the same class and therefore the empirical distribution would be the probability of one would be one probability of zero would be zero and so the entropy would be zero for this region so right if we computed the

### [5:00](https://www.youtube.com/watch?v=UMtBWQ2m04g&t=300s) Segment 2 (05:00 - 10:00)

entropy wherever it is so the probability by convention we take Z log0 to be zero and 1 time log of one is also zero because zero so the entropy for that left hand region go way back up to our example here is zero and we would get some entropy over here which is not great but you can think intuitively at least that this using the entropy you're trying to be as certain as possible about what's going on in each of the regions that you're splitting into so you want to make these each of the region as each of the regions as homogeneous as possible that's what the entropy is doing if you used the entropy instead of the M classification rate and if you use this you would just plug it directly in back here wherever it is in these criteria you would just use the entropy of the region instead of the misclassification another generalization another option people use which I won't go into too much detail I'll write it as G subr it's called the genie index sometimes people use it often used for this type of imp SEC measure in this setting of cart and the genie index is the sum again over Y's just like here and this time it's probability of Y times 1 minus the probability of Y so that's all the genie index is doing and so these are some ways to generalize that previous these actually are recommended the entropy and the genie index uh tend to work better than the misclassification rate so probably is better to use one of those and also the genie index has um has some nice analytic properties that make it work better for that make it easier to work with let me mention now I'm going to follow the book I'm going to mention a few other issues associated with this cart model and I'm G to follow the so Hasty tipani and Freedman have a book called what's it called the elements of statistical learning and I believe it's a aable freely online and um I'm just going to mention a few other issues that they talk about well they talk about they I'll follow their the order that they talk about them here so categorical predictors in the sense that the predictors the features x i j each of the different JS may not be in it may not be a real number so you might have one of your features might be just from F from some finite set some arbitrary set of possibilities and so you can't do that trick it doesn't make sense to talk about well I mean I numbered them here but if they weren't ordered you know it does it wouldn't make sense to talk about splitting when this is like less or equal to some value that that's not really meaningful in the setting where you just have some AR some abstract set but it turns out there's a very clever trick that one can you use at least in the case of binary classification to very efficiently find the optimal split even for categorical predictors so that's a very nice thing another possibility something that might come up is if you have a loss Matrix in other words it's not you know getting predicting one class when you should have predicted another it's not NE you know the error is not necessarily the same for different types of errors and this can also be incorporated into the growing the tree procedure you can incorporate the loss Matrix in a smart way another issue that might come up is if you have missing values so we were assuming that we had these right we had these X1 y On's Etc

### [10:00](https://www.youtube.com/watch?v=UMtBWQ2m04g&t=600s) Segment 3 (10:00 - 13:00)

and it might turn out that some of the values like maybe X1 J for some J is missing it's just not available it's Na and it's just blank so it turns out that uh there's well one way that this can be handled the the recommended way is to use what's called surrogate variables I won't go into that but I just want to mention so you're aware another thing which you might be sort of wondering about is we only were considering binary uh well B not only binary splits but also we were only considering splits that were sort of aligned with one of the axes which seems like a very very rigid very restricted type of split I mean why not allow splitting along something like this linear combinations at least and there are models where you can do this there are generalizations in which you can split on linear combinations the near combinations instead of just along one coordinate but in fact it turns out that for whatever reason these well one thing is that it's less efficient it's much less efficient there's not a very efficient way because there's many more possible linear splits well there's infinitely many and another reason why you might not want to do this is because it doesn't at least empirically people have found that it doesn't tend to give much better performance which is kind of surprising but that's uh especially when you combine this the cart model with the aggregation techniques that we're going to talk about very briefly and so the one last thing which will lead to these aggregation techniques is that is there's this instability in these trees in the sense that a very small change in your data set could result in a very drastic change in the resulting tree so it's very sensitive to the data and this instability gives R this makes the resulting tree have what we you we refer to as high variance the it's the estimator in statistical speak has high variance and so as a result it has generally it has lower performance but there's a fantastic way to get around this or at least to mitigate it to some degree using aggregation and that's what we're going to talk about next

---
*Источник: https://ekstraktznaniy.ru/video/52873*