# How to implement SVM (Support Vector Machine) from scratch with Python

## Metadata

- **Channel:** AssemblyAI
- **YouTube:** https://www.youtube.com/watch?v=T9UcK-TxQGw
- **Date:** 20.09.2022
- **Duration:** 14:44
- **Views:** 44,309
- **Source:** https://ekstraktznaniy.ru/video/12942

## Description

In the 9th lesson of the Machine Learning from Scratch course, we will learn how to implement the SVM (Support Vector Machine) algorithm.

You can find the code here: https://github.com/AssemblyAI-Examples/Machine-Learning-From-Scratch

Previous lesson: https://youtu.be/aOEoxyA4uXU
Next lesson: https://youtu.be/6UF5Ysk_2gk

Welcome to the Machine Learning from Scratch course by AssemblyAI.
Thanks to libraries like scikit-learn, we can use most ML algorithms with a couple of lines of code. But knowing how these algorithms work on the inside is very important, and implementing them hands-on is a great way to achieve that.

Most of them are easier to implement than you'd think.

In this course, we will learn how to implement these 10 algorithms.
We will quickly go through how the algorithms work and then implement them in Python using the help of NumPy.

▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬▬

🖥️ Website: https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=scratch09

## Transcript

### <Untitled Chapter 1>

Welcome to another video of the Machine Learning from Scratch course, presented by AssemblyAI. In this series we implement popular machine learning algorithms using only built-in Python functions and NumPy. In this lecture we learn about the Support Vector Machine, or SVM for short. As always, we start with a short theory section and then jump to the code.

The idea behind SVM is to use a linear model and try to find a linear decision boundary, also called a hyperplane, that best separates the data. The best hyperplane is the one that yields the largest separation, or margin, between both classes: we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. That is the whole idea, so let's look at an example to make it clearer. Here we have our feature vectors in 2D with two classes, blue and green, and we want to find a linear decision boundary, in this case just a line, that has the largest margin between both classes, so that the distance from the decision boundary to the nearest point on each side is maximized. Those nearest points are called the support vectors.

Described mathematically, the boundary is given by the line equation w·x − b, which should be 0 at the decision boundary, +1 on the side where the class is +1, and −1 on the side where the class is −1. The margin, this distance, can be calculated as 2 / ‖w‖, where w are the weights we have to learn during training. Putting this into a mathematical condition: w·x − b should be greater than or equal to +1 if y = 1, so samples with that label lie on one side of the decision boundary, and less than or equal to −1 for the class −1 on the other side. We can combine both into a single inequality by multiplying with the label:

y_i · (w·x_i − b) ≥ 1,

where the labels y must be −1 and +1, not 0 and 1.
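The combined constraint and the margin width can be checked numerically. A minimal sketch, using a hypothetical separating hyperplane and two toy points (none of these values appear in the video):

```python
import numpy as np

# Hypothetical hyperplane w·x - b = 0
w = np.array([1.0, 1.0])
b = 0.0

X = np.array([[2.0, 2.0],     # labeled +1
              [-2.0, -2.0]])  # labeled -1
y = np.array([1, -1])

# Combined constraint: y_i * (w·x_i - b) >= 1 for correctly placed points
margins = y * (X @ w - b)
print(margins)                 # [4. 4.] -> both points satisfy the constraint

# Margin width between the +1 and -1 hyperplanes
print(2 / np.linalg.norm(w))   # 2/||w|| ≈ 1.414
```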
Now we have to find the weights, and for this we define a loss function. One part of it is the hinge loss, calculated as max(0, 1 − y_i · (w·x_i − b)). In other words, the loss is 0 if y_i · (w·x_i − b) ≥ 1, and 1 − y_i · (w·x_i − b) otherwise. So if we fulfill the condition, meaning we are on the correct side of the decision boundary, our loss is zero; otherwise the further we are from the decision boundary on the incorrect side, the higher the loss gets (plotted, this is the blue line). This is one part, and then we also add a
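The hinge loss just described can be written in a few lines of NumPy. The weights and sample values below are made-up toy numbers for illustration:

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Mean hinge loss: max(0, 1 - y_i * (w·x_i - b))."""
    margins = y * (X @ w - b)
    return np.mean(np.maximum(0.0, 1.0 - margins))

# Toy example: hyperplane w·x - b with w = [1, 0], b = 0
w = np.array([1.0, 0.0])
b = 0.0
X = np.array([[2.0, 0.0],    # correct side, outside the margin -> contributes 0
              [0.5, 0.0],    # correct side, inside the margin  -> contributes 0.5
              [-1.0, 0.0]])  # wrong side                        -> contributes 2.0
y = np.array([1, 1, 1])
print(hinge_loss(w, b, X, y))  # (0 + 0.5 + 2.0) / 3 ≈ 0.833
```

Note how the loss grows linearly with the distance on the incorrect side, exactly the blue line from the plot.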

### Add Regularization [3:42]

regularization term, and then this is our cost function: λ‖w‖² plus the hinge loss. Basically this is a trade-off between minimizing the loss and maximizing the distance to both sides. Looking back, the margin is 2 / ‖w‖ and should be maximized; in the cost function we switch this around and minimize ‖w‖² instead, and the λ parameter controls how important this part is. We can then differentiate between two cases: if we are on the correct side of the decision boundary, the cost is only the first part, λ‖w‖²; in the other case it is λ‖w‖² plus the hinge term 1 − y_i · (w·x_i − b). Writing it for a single component i lets us get rid of the sum. Now we have to find the weights and the bias, so we calculate the derivative of the cost function with respect to w and to b, so
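The full cost, regularizer plus hinge loss, might be sketched like this (the helper name `svm_cost` and the toy values are assumptions, not from the video):

```python
import numpy as np

def svm_cost(w, b, X, y, lam):
    # lambda * ||w||^2  +  mean hinge loss
    hinge = np.maximum(0.0, 1.0 - y * (X @ w - b))
    return lam * np.dot(w, w) + np.mean(hinge)

w = np.array([1.0, 0.0])
# Point outside the margin: only the regularizer contributes
print(svm_cost(w, 0.0, np.array([[2.0, 0.0]]), np.array([1]), 0.01))  # 0.01
# Point inside the margin: regularizer plus hinge term
print(svm_cost(w, 0.0, np.array([[0.5, 0.0]]), np.array([1]), 0.01))  # 0.51
```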

### Gradients [5:11]

let's calculate the derivatives, or gradients. Again we make the same case distinction and check whether we are on the correct side. If so, the gradient of the cost function with respect to w is 2λw, and the gradient with respect to b is 0, so no change in that case. In the other case, the gradient with respect to the weights is 2λw − y_i · x_i, and with respect to b it is y_i. Please double-check the math for yourself. Now that we have the gradients, we want to put
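The two gradient cases can be expressed directly as a per-sample function (a sketch; the function name and test values are mine):

```python
import numpy as np

def gradients(w, b, x_i, y_i, lam):
    """Per-sample gradients of the cost, following the two cases above."""
    if y_i * (np.dot(x_i, w) - b) >= 1:
        dw = 2 * lam * w            # only the regularizer contributes
        db = 0.0                    # no change to the bias
    else:
        dw = 2 * lam * w - y_i * x_i
        db = y_i
    return dw, db

w = np.array([1.0, 0.0])
print(gradients(w, 0.0, np.array([2.0, 0.0]), 1, 0.01))   # margin satisfied
print(gradients(w, 0.0, np.array([-1.0, 0.0]), 1, 0.01))  # margin violated
```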

### Update rule [5:47]

this into our update rules. Here the rule is: parameter ← parameter − α · gradient, where α is the learning rate. Plugging in the gradients: in the first case w ← w − α · 2λw, and for the bias there is no change; in the other case w ← w − α · (2λw − y_i · x_i) and b ← b − α · y_i. This is what we need, and now, to summarize the
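One stochastic update step, combining the case check with the rule parameter ← parameter − α · gradient, might look like this (function name and numbers are illustrative):

```python
import numpy as np

def sgd_step(w, b, x_i, y_i, lam, lr):
    """One stochastic gradient step for a single sample (x_i, y_i)."""
    if y_i * (np.dot(x_i, w) - b) >= 1:
        w = w - lr * (2 * lam * w)            # b unchanged: its gradient is 0
    else:
        w = w - lr * (2 * lam * w - y_i * x_i)
        b = b - lr * y_i
    return w, b

# From w = 0, b = 0 every sample violates the margin, so both parameters move
w, b = sgd_step(np.zeros(2), 0.0, np.array([1.0, 2.0]), 1, lam=0.01, lr=0.1)
print(w, b)  # [0.1 0.2] -0.1
```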

### Steps [6:13]

steps: in the training part, given the training data, we first initialize the weights, make sure that our class labels are −1 and +1, and then apply the update rules we've just seen for the number of iterations that we specify as a parameter; this learns the weights. For prediction we simply apply the linear function w·x − b and decide by the sign: greater than zero means class +1, smaller than zero means class −1. And that is all we have to do, so let's jump to the code.

First we import numpy as np, then create our class SVM. It gets an __init__ function with the parameters: a learning rate, which we can initialize to 0.001 for example; the lambda parameter, for which 0.01 is a good starting point; and the number of iterations, say 1000 by default. We store these: self.lr = learning_rate, self.lambda_param = lambda_param, and self.n_iters = n_iters. We also want to store the weights, self.w = None in the beginning, and the bias, self.b = None. Then we define the fit method, which gets the training data X and y, and the predict method, which gets only the test samples X.

Let's start with fit. We read the number of samples and features from X.shape, assuming X and y are already NumPy ndarrays. To make sure the classes have the values −1 or +1, we write y_ = np.where(y <= 0, -1, 1). Then we initialize the weights, self.w = np.zeros(n_features), and the bias, self.b = 0. This is the simplest way to initialize and it will work, but it is actually not the best way; it would be better to randomly initialize the values here, and I challenge you to do that on your own.

Now we learn the weights with the update rule. We loop for _ in range(self.n_iters) (we don't need the loop variable), and inside it for idx, x_i in enumerate(X), so over all the samples. Then we check the condition from the formula, y_i · (w·x_i − b) ≥ 1, using the NumPy dot product: condition = y_[idx] * (np.dot(x_i, self.w) - self.b) >= 1. If the condition is true, we apply the corresponding update rule: self.w -= self.lr * (2 * self.lambda_param * self.w); the bias has no update in this case because its gradient is zero. In the other case, the else branch, we write self.w -= self.lr * (2 * self.lambda_param * self.w - np.dot(x_i, y_[idx])) and self.b -= self.lr * y_[idx]. This is all we need in the fit method; we have now learned the weights.

In the predict method we compute the approximation approx = np.dot(X, self.w) - self.b and return np.sign(approx), which is either +1 or −1. And this is all that we need, so now we can test it.

For testing I already prepared the code, and you can also find all the code on GitHub. We import train_test_split and datasets from sklearn, plus matplotlib. We create an example dataset with two blobs, 50 samples and two features, make sure the classes are −1 and +1, and split into training and testing sets. Then we set up our SVM, fit it with the training data, call classifier.predict with the test data, and calculate and print the accuracy. I also prepared some helper code to plot the decision boundary together with the two hyperplanes at the offsets −1 and +1. If we run this, we see our two blobs perfectly separated: here is the decision boundary, and here are the two hyperplanes at +1 and −1. The printed accuracy is 100% in this case, so this worked perfectly and our own SVM class is working. And that is how we can implement SVM from scratch. I hope you enjoyed this lesson, and I hope to see you in the next one. Bye!
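Assembling the walkthrough above, the class might look like the sketch below; the authoritative version is in the linked GitHub repo. The tiny hand-made dataset at the bottom stands in for sklearn's blobs so the example is self-contained:

```python
import numpy as np

class SVM:
    def __init__(self, learning_rate=0.001, lambda_param=0.01, n_iters=1000):
        self.lr = learning_rate
        self.lambda_param = lambda_param
        self.n_iters = n_iters
        self.w = None
        self.b = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        y_ = np.where(y <= 0, -1, 1)      # map labels to {-1, +1}
        self.w = np.zeros(n_features)     # simplest init; random would be better
        self.b = 0.0
        for _ in range(self.n_iters):
            for idx, x_i in enumerate(X):
                if y_[idx] * (np.dot(x_i, self.w) - self.b) >= 1:
                    self.w -= self.lr * (2 * self.lambda_param * self.w)
                else:
                    self.w -= self.lr * (2 * self.lambda_param * self.w - y_[idx] * x_i)
                    self.b -= self.lr * y_[idx]

    def predict(self, X):
        approx = np.dot(X, self.w) - self.b
        return np.sign(approx)

# Tiny linearly separable toy set (made up; the video uses sklearn's make_blobs)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-3.0, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVM()
clf.fit(X, y)
print(clf.predict(X))  # all six training points classified correctly
```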
