# How to implement PCA (Principal Component Analysis) from scratch with Python

## Metadata

- **Channel:** AssemblyAI
- **YouTube:** https://www.youtube.com/watch?v=Rjr62b_h7S4
- **Date:** 18.09.2022
- **Duration:** 12:16
- **Views:** 24,245
- **Source:** https://ekstraktznaniy.ru/video/12956

## Description

In the 7th lesson of the Machine Learning from Scratch course, we will learn how to implement the PCA (Principal Component Analysis) algorithm.

You can find the code here: https://github.com/AssemblyAI-Examples/Machine-Learning-From-Scratch

Previous lesson: https://youtu.be/TLInuAorxqE
Next lesson: https://youtu.be/aOEoxyA4uXU

Welcome to the Machine Learning from Scratch course by AssemblyAI.
Thanks to libraries like Scikit-learn we can use most ML algorithms with a couple of lines of code. But knowing how these algorithms work inside is very important. Implementing them hands-on is a great way to achieve this. 

And most of them are easier to implement than you’d think.

In this course, we will learn how to implement these 10 algorithms.
We will quickly go through how the algorithms work and then implement them in Python using the help of NumPy.

▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬▬

🖥️ Website: https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=scratch07


## Transcript

### Introduction [0:00]

Welcome to another video of the Machine Learning from Scratch course, presented by AssemblyAI. In this series we implement popular machine learning algorithms using only built-in Python functions and NumPy. In this lesson we talk about principal component analysis, or PCA for short. As always, we start with a short theory section and then we jump to the code. So let's get started. So, PCA is

### PCA: Goal [0:22]

an unsupervised learning method that is often used to reduce the dimensionality of a data set, by transforming a large set of variables into a lower-dimensional set that still contains most of the information of the large set. Unsupervised learning means that we can do this transformation without knowing the class labels, and this is very important here. In other words, we try to find a new set of dimensions such that all the dimensions are orthogonal, and hence linearly independent, and ranked according to the variance of the data along them. So we try to find a transformation such that the transformed features are linearly independent. The dimensionality can then be reduced by taking only the dimensions with the highest importance. The newly found dimensions should minimize the projection error, and the projected points should have maximum spread, which means maximum variance.

This gets much clearer when we look at an example. In this case we want to map the points from 2D into 1D, so onto one line. On the left side we have a good transformation, which is what PCA will do, and on the right side we have a bad transformation. Let's say we take these two new axes as our principal components. When we map the points onto the first axis, the first principal component, we end up with a good projection: here the data has maximum spread, so maximum variance. On the other hand, on the right side, if we take this axis and map the points onto this line, then a lot of points end up on the same point, so we lose a lot of information. That's why maximizing the variance is very important here.

So this is the concept of PCA, and for this we need some math. First we need the variance, which measures how much variation or spread the data has. The formula is Var(X) = (1/n) Σᵢ (xᵢ − x̄)², where x̄ is the mean value. Then we also need the covariance
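The variance formula above can be checked directly in NumPy. A minimal sketch, with an illustrative data array that is not from the video:

```python
import numpy as np

# Var(X) = (1/n) * sum((x_i - x_bar)^2), the population variance
# formula stated above, computed by hand.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_bar = x.mean()
var_manual = np.sum((x - x_bar) ** 2) / len(x)

# np.var uses the same population formula (ddof=0) by default.
print(var_manual)   # 2.0
print(np.var(x))    # 2.0
```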

### Covariance Matrix [2:44]

matrix, which indicates the level to which two variables vary together. The formula is Cov(X, Y) = (1/n) Σ (x − x̄)(y − ȳ)ᵀ. We can also take the covariance of X with X itself, which gives the covariance matrix. Now, if we calculate the eigenvectors and eigenvalues of this covariance matrix, the eigenvectors point in the directions of maximum variance, and the corresponding eigenvalues indicate the importance of each eigenvector. If you look at the example on the left side, these two axes are the first two eigenvectors: this axis belongs to eigenvalue one, with the highest importance, and this one to eigenvalue two. So basically this all boils down to an eigenvector/eigenvalue problem. I'm not going into detail here about what eigenvectors are; if you want to learn more about this, I will put a resource in the description below. But basically, an eigenvector has to fulfill the equation A v = λ v: multiplying an eigenvector v by the matrix A is just a scaling by a scalar value λ. So this is what eigenvectors do. And now the steps we have to do is first
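The eigenvector equation A v = λ v can be verified numerically. A small sketch using a toy symmetric matrix (standing in for a covariance matrix; the matrix is not from the video):

```python
import numpy as np

# Toy symmetric 2x2 matrix playing the role of a covariance matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# np.linalg.eig returns the eigenvectors as COLUMNS:
# eigenvectors[:, i] belongs to eigenvalues[i].
# Check that A v = lambda * v holds for each pair.
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    assert np.allclose(A @ v, eigenvalues[i] * v)

print(np.sort(eigenvalues))  # [1. 3.]
```

The column layout of the returned eigenvectors is why the video transposes them later, so that each row is one eigenvector and row indexing works naturally.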

### Steps [4:15]

we subtract the mean from X. Then we calculate the covariance of X, and the eigenvectors and eigenvalues of the covariance matrix. Then we sort the eigenvectors according to their eigenvalues in decreasing order, and choose only the first k eigenvectors; these will be the new k dimensions. Finally, we transform the original n-dimensional data points into k dimensions, and this transformation is basically just a projection with the dot product. So this is all we have to do. Let's jump to the code.

First, let's import NumPy as np. Then let's create our class PCA. This gets an __init__ function with self, and as a parameter we give it the number of components, so the number of dimensions we want to have after the transformation. We store this and say self.n_components = n_components. We also want to calculate the mean in a moment, but for now we simply say self.mean = None.

Then we have our fit method, which gets self and only X, not y, because remember, this is an unsupervised learning method, so we don't need the class labels. For the second method, we don't call it predict; here we call it transform, and it also gets self and X. This could be the same data, but it could also be new testing data.

Now let's start with fit. The first thing we do is mean centering: we subtract the mean, so we say self.mean = np.mean(X, axis=0), and then X = X - self.mean. Next we calculate the covariance, and for this we say cov = np.cov(X.T). We could also give it a second argument Y, but here we want the covariance of X with X itself. The input needs to be transposed because this function expects the samples as columns; simply check out the documentation for this.

Now, if you watch carefully, you might say that the covariance formula already subtracts the mean, since it contains x̄ as well, so why did we subtract it here again? This is because, for example, if we fit on the training data and then transform different data, we also first want to subtract this same mean, so in transform we say X = X - self.mean. That's why we do this as a separate step there as well; otherwise the result here in the fit method would be the very same.

Let's move on. Now we want to calculate the eigenvectors and eigenvalues, and we can simply do this in one line by saying eigenvalues, eigenvectors = np.linalg.eig(cov). Then, for easier calculations, we want to transpose the eigenvectors, so we say eigenvectors = eigenvectors.T. This is because the eigenvector v = eigenvectors[:, i] is a column vector, all the rows at column i, and we want to transpose this for easier calculations later.

Then we want to sort the eigenvectors according to the eigenvalues. For this we say idxs = np.argsort(eigenvalues), and since we want decreasing order, we add [::-1], so from start to end with step minus one. Then we sort the eigenvalues, eigenvalues = eigenvalues[idxs], and the same for the eigenvectors, eigenvectors = eigenvectors[idxs].

Now we only want to keep the first n_components, so the first k dimensions, and store them. So we also add self.components = None in the beginning, and then say self.components = eigenvectors[:self.n_components], so only the first n_components eigenvectors. Again, this indexing is easier since we transposed the eigenvectors.

This is all for the fit method. In transform we want to project the data, and projection, after the mean centering, is just the dot product: we return np.dot(X, self.components.T), and again we have to transpose here. And this is all that we need for PCA.

Now we can test this. For testing, I already prepared the code; you can find it on GitHub. Let's go over it briefly: we import matplotlib and datasets from sklearn, and in this example we load the Iris dataset. Then we create our PCA instance, and here we want to keep only two dimensions. We call pca.fit and pca.transform, and this gives our projected data. If we print the shape of X and of X_projected, you will see the difference in a moment. Then we extract the first two dimensions of the projected data and plot them: the first axis is principal component 1 and the second one is principal component 2.

If we run this, it works: this is the projected data, now in 2D. And if we close the plot and look at the printed shapes, we see the original data and the projected data; we reduced the number of dimensions from four to two. So this works, and this is how we can implement PCA from scratch. I hope you enjoyed this lesson, and I hope to see you in the next one.
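The walkthrough above can be sketched as a single runnable script. The reference version is in the linked GitHub repository; this sketch follows the steps described in the transcript, and the random test data stands in for the Iris dataset used in the video:

```python
import numpy as np

class PCA:
    def __init__(self, n_components):
        self.n_components = n_components
        self.components = None
        self.mean = None

    def fit(self, X):
        # Mean centering.
        self.mean = np.mean(X, axis=0)
        X = X - self.mean
        # Covariance; np.cov expects samples as columns, hence the transpose.
        cov = np.cov(X.T)
        # Eigendecomposition of the covariance matrix.
        eigenvalues, eigenvectors = np.linalg.eig(cov)
        # Transpose so each row is one eigenvector, for easier indexing.
        eigenvectors = eigenvectors.T
        # Sort eigenvectors by eigenvalue in decreasing order.
        idxs = np.argsort(eigenvalues)[::-1]
        eigenvectors = eigenvectors[idxs]
        # Keep only the first n_components eigenvectors.
        self.components = eigenvectors[: self.n_components]

    def transform(self, X):
        # Project the data: mean-center, then take the dot product.
        X = X - self.mean
        return np.dot(X, self.components.T)

# Usage on random data with the same shape as Iris (150 samples, 4 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
pca = PCA(2)
pca.fit(X)
X_projected = pca.transform(X)
print(X.shape, X_projected.shape)  # (150, 4) (150, 2)
```

As in the lesson, fit stores the training mean so that transform can subtract the same mean from any new data before projecting it.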
