How to implement Random Forest from scratch with Python

AssemblyAI · 16.09.2022 · 41,057 views · 816 likes

Video description
In the fifth lesson of the Machine Learning from Scratch course, we will learn how to implement Random Forests. Thanks to all the code we developed for Decision Trees, this implementation will be quite a bit shorter. You can find the code here: https://github.com/AssemblyAI-Examples/Machine-Learning-From-Scratch Previous lesson: https://youtu.be/NxEHSAfFlK8 Next lesson: https://youtu.be/TLInuAorxqE Welcome to the Machine Learning from Scratch course by AssemblyAI. Thanks to libraries like Scikit-learn we can use most ML algorithms with a couple of lines of code. But knowing how these algorithms work inside is very important. Implementing them hands-on is a great way to achieve this. And mostly, they are easier than you’d think to implement. In this course, we will learn how to implement these 10 algorithms. We will quickly go through how the algorithms work and then implement them in Python using the help of NumPy. ▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬▬ 🖥️ Website: https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=scratch05 🐦 Twitter: https://twitter.com/AssemblyAI 🦾 Discord: https://discord.gg/Cd8MyVJAXd ▶️ Subscribe: https://www.youtube.com/c/AssemblyAI?sub_confirmation=1 🔥 We're hiring! Check our open roles: https://www.assemblyai.com/careers ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ #MachineLearning #DeepLearning

Table of contents (7 segments)

Introduction

Welcome to another lesson of Machine Learning from Scratch. Today we're going to learn about random forests, but a lot of the theory in the next couple of minutes depends on the decision trees we learned before. So if you haven't watched the decision trees lesson, the previous lesson, go ahead and watch that first and then come back here.

Random Forest

All right. A random forest is, very simply, a forest: it consists of a lot of different trees. The reason it is called random is that we introduce some randomness into the equation when creating these trees. You create as many trees as you decide before you start the algorithm, and once the trees are created, at inference time we get their votes on what the class label for a data point should be and take a majority vote (if it's regression we do something else, but we'll learn about that in a second). It's basically as simple as that, and the thing that makes it random is that we sample the data set into randomly created subsets.

Recap

To quickly recap: when training the random forest, we take a random subset of the data set, create a decision tree with that random subset, and then repeat this process as many times as the number of trees we want.
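The "random subset" step above can be sketched with NumPy's `random.choice`; this is a minimal sketch, assuming `X` and `y` are NumPy arrays (the standalone helper name `bootstrap_samples` mirrors the `_bootstrap_samples` method written later in the lesson):

```python
import numpy as np

def bootstrap_samples(X, y):
    n_samples = X.shape[0]
    # draw n_samples indices WITH replacement: some rows repeat,
    # others are left out of this tree's training set entirely
    idxs = np.random.choice(n_samples, n_samples, replace=True)
    return X[idxs], y[idxs]

X = np.arange(10).reshape(5, 2)
y = np.arange(5)
X_s, y_s = bootstrap_samples(X, y)
# X_s keeps the original shape (5, 2) but may contain duplicate rows
```

Setting `replace=True` is the whole trick: it is what turns a plain subset into a bootstrap sample.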

Inference

During inference time, we get the prediction from each tree: if it's classification, we take a majority vote; if it's regression, we take the mean of all the predictions. That's basically it; as I said, it depends a lot on decision trees. So let's reuse the implementation we prepared in the previous lesson for decision trees, which, as you may remember, was a little bit long, and build the random forest on top of it. One piece of good news: because we're going to use the DecisionTree class we created before and everything we already have in it, the implementation is actually going to be quite fast.

Implementing Random Forest

Let's start with the implementation. First, of course, I'm going to create a RandomForest class and initialize it. In the constructor we'll have the number of trees first, then again max_depth, min_samples_split, and n_features, and something to keep all the trees in, which starts as an empty list. For defaults, n_trees can start at 10, max_depth can be 10 again, min_samples_split 2 for now, and n_features None, because otherwise, if you remember from the decision tree implementation, when n_features is None we just take however many features the data has. We pass these into the initializer, not forgetting self.

As before, we're going to have a fit function and a predict function, and that's pretty much it; we're not going to need a lot of helper functions this time. We also need to import the DecisionTree class before we forget.

In the fit function, once the list of trees is initialized as an empty list, we create as many trees as n_trees, and for every single one of them we literally create a DecisionTree. What do we need to pass to create a decision tree? min_samples_split, max_depth, and n_features; the root is created automatically, so that's not something we pass. Luckily we have all three: max_depth is self.max_depth, min_samples_split is self.min_samples_split, and n_features is self.n_features. I'll call the result tree.

We're going to fit this tree, but not with all the samples we have; we first need to pass it a sample of X and y. To sample them, I'm going to create the one and only helper function, _bootstrap_samples, which takes self, X, and y. First we get the number of samples from X; as I've said before, on a NumPy array the first piece of shape information is the number of samples and the second is the number of features. Then, using NumPy (let me import numpy here), we randomly choose n_samples values out of n_samples, but this time with replace=True, so after a sample has been selected from the data set it can be selected again. Effectively, you drop some of your samples and select others more than once. We select the indices this way and return the data set indexed by them, and that way we have our sample.

Back in fit, we call _bootstrap_samples, passing X and y to it, fit our tree with this X_sample and y_sample, and append the new tree to the list of trees we keep. That's all we need for training our random forest: we create a forest full of different trees, each trained on a different subset of the data set.

The next thing we want to do is predict, that is, use this random forest at inference time; of course we pass it an X. For each tree in self.trees, we ask the tree to predict on our X, turn the result into a NumPy array, and call it predictions. Once we write this, what we get in predictions is basically a list of lists, one list per tree, and each inner list contains that tree's predictions for every sample: first sample, second sample, third sample, fourth sample, say; and for the second tree the same again, first sample, second sample, third sample, fourth sample. Instead, what we want is to regroup these predictions.

Making predictions

We want a list of lists where the first inner list holds the prediction for the first sample from the first tree, the prediction for the first sample from the second tree, and so on: all of the predictions for the same sample from different trees need to end up in the same inner list. To do that, there is a nice little NumPy function called swapaxes: I pass it the predictions and swap axes 0 and 1, and call the result tree_preds.

Lastly, we want to get the most common label. I guess I lied: we need one more helper function, but we've written it before, so I'm just going to copy and paste it here as a helper. We're going to use the Counter data structure from collections (from collections import Counter). Counter makes it really easy to get the most common occurrence of a value in an array: we create a Counter from the array, use its most_common method, and then index into the result to get the first tuple and the first value in that tuple. I explained this in more detail in some of the previous lessons, so if you want to find out about that you can go watch the linear regression or logistic regression video, or check out the documentation, of course.

Once we have that, we call this helper, but tree_preds holds the predictions grouped per sample, so we loop: for pred in tree_preds (naming it that way to make it less confusing), compute the most common label, collect the results in a list, turn it into a NumPy array, and return it as the final predictions. So that's it: because we're using the DecisionTree class we already created, it's actually quite simple; we just create the trees and then traverse them to get the results. Let's try this out and see how much accuracy we're going to get.

Testing

I'm using the breast cancer data set again. All I need to do here is import RandomForest from its file. I kept the accuracy definition, where we count how many times the true value and the predicted value were the same, divided by the number of true values. Let's create a RandomForest; we don't need to pass anything to it immediately, we can just fit it with X_train and y_train and then calculate the predictions with clf.predict, passing it X_test. Let's see if we made any mistakes, but first let me calculate the accuracy, passing it y_test and the predictions, and print it. All right, let's see. Awesome, it looks like it worked: we got an accuracy of 0.91 here with the random forest, and we didn't have any typos, which is surprising because I tend to make typos. As we've seen before in the decision tree lesson too, if you want, you can change the number of trees, the max depth, and the min_samples_split. For example, let's try with a higher number of trees and see if that gives us better accuracy. All right, we got something slightly higher; I'm not really sure if that's the randomness in it or because we increased the number of trees, but you can play around with it and explore it yourself. Once again, you can find the code we prepared on our GitHub repository; the link is in the description. If you have any questions, don't forget to ask them in the comment section below. This was it from me; from now on, Patrick is going to take you through the rest of the algorithms, developing ML from scratch. Thanks for watching, and I will see you, well, I will not see you, but Patrick will see you in the next lesson.
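Putting the lesson together, the flow described above can be sketched as one class. This is a hedged sketch, not the video's exact code: the real version plugs in the DecisionTree class from the previous lesson and sklearn's breast cancer data set, while here a tiny majority-class stub tree and a toy data set stand in so the snippet runs on its own.

```python
import numpy as np
from collections import Counter

class StubTree:
    """Stand-in for the previous lesson's DecisionTree: it just
    predicts the majority class of the data it was fitted on."""
    def __init__(self, max_depth=10, min_samples_split=2, n_features=None):
        self.label = None

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]

    def predict(self, X):
        return np.full(X.shape[0], self.label)

class RandomForest:
    def __init__(self, n_trees=10, max_depth=10, min_samples_split=2, n_features=None):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.n_features = n_features
        self.trees = []

    def _bootstrap_samples(self, X, y):
        # sample n_samples row indices WITH replacement
        n_samples = X.shape[0]
        idxs = np.random.choice(n_samples, n_samples, replace=True)
        return X[idxs], y[idxs]

    def _most_common_label(self, y):
        # first tuple of most_common, then its first value: the label
        return Counter(y).most_common(1)[0][0]

    def fit(self, X, y):
        self.trees = []
        for _ in range(self.n_trees):
            tree = StubTree(max_depth=self.max_depth,
                            min_samples_split=self.min_samples_split,
                            n_features=self.n_features)
            X_s, y_s = self._bootstrap_samples(X, y)
            tree.fit(X_s, y_s)
            self.trees.append(tree)

    def predict(self, X):
        # (n_trees, n_samples) -> (n_samples, n_trees), then majority
        # vote per sample; a regression forest would take the mean instead
        predictions = np.array([t.predict(X) for t in self.trees])
        tree_preds = np.swapaxes(predictions, 0, 1)
        return np.array([self._most_common_label(p) for p in tree_preds])

def accuracy(y_true, y_pred):
    # fraction of matching labels, as defined in the video
    return np.sum(y_true == y_pred) / len(y_true)

# toy data: every label is 1, so the forest must predict 1 everywhere
X = np.arange(20).reshape(10, 2)
y = np.ones(10, dtype=int)
clf = RandomForest(n_trees=5)
clf.fit(X, y)
preds = clf.predict(X)
accuracy(y, preds)  # -> 1.0
```

Swapping in the real DecisionTree and `sklearn.datasets.load_breast_cancer` with a train/test split reproduces the experiment from the video.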
