# Principal Component Analysis (PCA) | Dimensionality Reduction Techniques  (2/5)

## Метаданные

- **Канал:** DeepFindr
- **YouTube:** https://www.youtube.com/watch?v=ne6vnKoTHwk
- **Дата:** 08.12.2023
- **Длительность:** 26:17
- **Просмотры:** 24,030
- **Источник:** https://ekstraktznaniy.ru/video/52961

## Описание

▬▬ Papers / Resources ▬▬▬
Colab Notebook: https://colab.research.google.com/drive/1n_kdyXsA60djl-nTSUxLQTZuKcxkMA83?usp=sharing

Peter Bloem PCA Blog: https://peterbloem.nl/blog/pca

PCA for DS book: https://pca4ds.github.io/basic.html 

PCA Book: http://cda.psych.uiuc.edu/statistical_learning_course/Jolliffe%20I.%20Principal%20Component%20Analysis%20(2ed.,%20Springer,%202002)(518s)_MVsa_.pdf

Lagrange Multipliers: https://ekamperi.github.io/mathematics/2020/11/01/principal-component-analysis-lagrange-multiplier.html

PCA Mathematical derivation #1: https://www.quora.com/Why-does-PCA-choose-covariance-matrix-to-get-the-principal-components-of-features-X

PCA Mathematical derivation #2: https://towardsdatascience.com/principal-component-analysis-part-1-the-different-formulations-6508f63a5553

PCA Mathematical derivation #3: https://rich-d-wilkinson.github.io/MATH3030/4.2-pca-a-formal-description-with-proofs.html

PCA Mathematical derivation #4: https://stats.stackexchange.com/questions/

## Транскрипт

### Introduction []

welcome back to the second part of this dimensionality reduction series the method we discuss today is PCA the well-known principle component analysis which was first introduced in 1901 by KL Pearson this video is structured as follows we start with some intuition about the technique then we'll look into the algorithmic details and finally we apply PCA on a data set and do some visualizations let's go there are many

### Used Literature [0:26]

great resources that vent into the cre of this video such as Peter blam's amazing blog post on PCA but also the PCA for data science book for example as always you find everything in the video description let's start with building a

### Example dataset [0:41]

conceptual understanding of how PCA works for this let's assume we have a bunch of observations and here for Simplicity we only have two Dimensions X1 and X2 the corresponding data in tular format could look like this the central idea of PCA is to reduce the number of Dimensions by finding so-called principal components these principal components are uncorrelated and we only keep the most informative ones to reduce the dimensionality we will learn more about them soon here we don't have many choices so we will reduce the 2D data into one dimension if you've read other PCA tutorials you most likely heard that this technique tries to retain as much variance as possible in the data I always found this a bit abstract and therefore constructed a little example to better understand it this data set represents the relative size of a city in X1 versus the cost of living in X2 we immediately observe that this data is kind of redundant because in most cases the size of a city correlates with the cost of living so basically one value can be computed from the other one we will highlight two data points namely Rome which is much cheaper than other big cities and on top we have New York City which is pretty expensive but also pretty big these are just madeup examples how would you squeeze this data set into one single Dimension while preserving as much information as possible and information is mainly the relationship between data points here you could only use the axis X1 which reflects the size of a city but you could also use X2 which reflects the cost of living in fact there are many options we could ALS use this line as a new Axis or we could use this line so which one is better obviously the green line somehow captures more of the data this spread can also be calculated namely by measuring the variance the formula of variance for a single feature

### Variance [2:47]

is defined as follows we substract the mean from all values and square them to make the distances positive the result is then summed up and divided by the number of samples overall variance is a measure for the spread of data when data points are far away from the center of the distribution highlighted in red here we have a larger variance if they're closer to the center we have a smaller variance now going back to our data set let's have a look at the variance of our two new axes the longer axis has a pretty large variance and the smaller one has also a smaller variance because there's less spread along this axis now when we take a look at how the data looks like along these axes we can see the variance even more clearly certainly this distribution plots are not perfectly precise it's just a sketch at the beginning we said that we want to reduce the 2D data into a single Dimension so let's do that now this can be done by projecting the data onto the

### Projecting data [3:52]

new axis we will later see how such a projection is performed in practice now all of our data points lie along one single Dimension each of our originally 2D data points is assigned to a single value along these axes in tabular form this would look like this and when we

### Variance as measure of information [4:14]

remind ourselves of the Cities from the beginning which were Rome and New York City we see that the left axis captures much more of the relationship between the data points and the right Axis is kind of cramped in fact the extreme case which is a variance of zero would mean that all points lie at the same value on one axis certainly when all data points look the same it's harder to distinguish between them and therefore variance can serve as a measure of information in data what we can see here are the two principal components of our data set both of them capture information about the size of a city as well as the cost of living therefore you can see these axes as a combination of our original features but the interesting piece is that these combinations capture even more information than the individual features would that's why they are called principle they are the main axis in our data set it's no coincidence that

### Scree Plot [5:15]

I chose exactly those two axes as they together capture all of the information in the data this can be displayed in a so-called scre plots which shows the explained variance of each principal components and now dimensionality reduction comes into play typically we only keep the principal components that retain most of the variance in the data so here we would just keep the first principle component and still have 90% of the information of the 2D case in other words we drop less informative principle components in practice this means all of the points are projected

### Principal Components [5:53]

onto this new axis which looks like this then we can get rid of the two features and only use the new axis and our selected principle components summarizes our data set in a lower dimensional space of course we can extend this principle to many dimensions

### PCA on images [6:14]

such as high-dimensional images here's an example for the principal components of a face image data set think of them like aess through the pixel space that capture most of the variation between faces usually these are face characteristics like the shape of the nose and so on remember in the introductory video when I said that faces can be reduced to their lower intrinsic dimensionality that's exactly what PCA is doing here the principal components for face images are called igen faces because they correspond to the igen vectors of the face image space we will learn more about that in a second the cool thing is that you can reconstruct any phase based on these spaces vectors here's an example of

### Reconstruction based on eigenfaces [7:00]

exactly that the image is reconstructed as linear combination of the IG faces of course reconstructing data based on the principal components has an inherent loss of information if the number of principal components is smaller than the original number of features I also like this visualization by Peter blur it shows how the face characteristics change when walking along a principal component or igen face this is just like walking along our newly created AIS in the previous example principal components are

### Orthogonal Basis [7:35]

orthogonal to each other because otherwise they would capture correlated information more specifically they form an orthogonal basis which we will learn more about later in this video technically speaking PCA is projecting data into a new coordinate system where the A's are the principal components and this is called a basis transformation because it translates rotates or scales observations this also means that for a n dimensional data set we will end up with n principal components of which we just keep the most valuable ones typically one last remark the classical

### Kernel PCA [8:12]

PCA is a linear method because we are only allowed to draw straight lines for this data set we will capture more variants using such a curved axis but this is not possible but there is an extension called kernel PCA which also allows to perform nonlinear dimensionality reduction but we won't cover it in this video now that we've gained a comprehensive understanding of PCA let's delve into the details of identifying these principal axes geometrically we are looking for

### Finding principal components [8:45]

the line with the minimal distance to all of the points this is nicely shown in this animation when we project the data onto this best fitting line like we've done before we preserve most of the information in higher Dimensions these lines become planes or hyperplanes the first idea that might come to your mind is that we need to perform some sort of optimization such as minimizing the sum of square distances like it's done in linear regression we could do this in principle for example fitting a line with gradient descent it turns out however that there exists a more efficient closed form solution to finding principal components when explaining PCA that there are many

### Distance minimization vs. Variance maximization [9:28]

different ways to approach this technique some explanations are geometric interpretations we try to find the line with the minimal distances others involve lots of linear algebra and some also talk about finding the direction of Maximum variance at the beginning it might be difficult to accept that all of these views lead to the same way of finding principal components in fact minimizing the squared distances is equivalent to maximizing the variance in the video description I've put a nice blog post that mathematically deres this result in detail it's based on L multipliers but I won't further discuss it in this video instead of deriving it mathematically we will take a look at this intuitive visualization by Alex Williams the spreads or variance along the principal axis is displayed in blue here the distance between a data point and the axis is displayed in red using the Pythagoras Theorem we realiz I that the maximum variance and minimum distance are reached at the same time and therefore it doesn't really matter which perspective we choose in fact all optimization problems arrive at the same solution in the end okay so what is this solution the efficient

### Covariance Matrix [10:45]

closed form solution to finding the principal components starts with The co-variance Matrix this just naturally arises in the mathematical solution The covariance Matrix shows the variances of variables on the diagonal and co-variances between variables on of diagonal positions the variables could for example be the cost of living and the size of a city like in the example from the beginning the co-variance between two variables is the sum of the product of their deviations from their respective mean divided by the number of observations like shown in this formula it tells us about the relationship between features and generally expresses the coari iation of two variables meaning if they move together in the same direction just for completeness the

### Correlation vs. Covariance [11:35]

difference to correlation is that correlation measures the intensity and direction of a relationship whereas covariance only measures the direction correlation is also called normalized covariance but it's not so important at this point as we are interested in the

### Covariance examples [11:50]

direction of the largest variance it absolutely makes sense to look at the variability in our data set expressed by this Matrix let's have a look at a few examples to get a feeling for the covariance Matrix for this data set we have a slightly higher variance of 1. 3 for the feature X2 and the two variables move in the same direction therefore the co-variance between the variables is also high and as you can see the covariance Matrix is symmetric along the diagonal this example shows a slightly lower variance for each of the variables and no clear Trends how they would move together therefore the covariance is quite low this means knowing the value of one feature doesn't tell us much about the other feature finally here we have the same variances like in the first example but the co-variance is negative as we have a joint negative Trend so the takeaway of this is that the co-variance Matrix tells us about the variance and Joint directions of the variables before

### Linear Algebra Basics [12:50]

we move on and discuss why the co-variance Matrix is relevant for finding the principal components let's discuss some linear algebra Basics which will be needed to understand the following points in a coordinate system are typically described with a value for each axis if we write 1 comma 2 it means nothing else but 1 * the first basis Vector plus 2 * the second basis Vector this makes data points nothing else but vectors themselves now every Matrix can be associated with a linear transformation that for example translates or rotates the vectors which are data points if we for example have this Matrix and multiply it with our data point vectors we can use the general matrix multiplication rule to calculate the transformation what this means is that each row is Multiplied with the coordinates of the data points and this moves the data point to a new position so let's do that for the points on the left when we plug them into the expression on top we end up at these new positions the original data space was spanned by unit vectors and looks like this when we visualize the target space we see what kind of transformation The Matrix on top is performing it's stretching and scaling all of our data points from the unit sphere into the ellipse space on the right now linear

### Eigenvectors and Eigenvalues [14:22]

transformation have some unique characteristics and one crucial characteristic can be described by so-called igen vectors these are the vectors that don't change the direction under a transformation let's say we have this red and blue vector and apply a matrix transformation and thereby move the vectors to these positions here the red Vector still points into the exactly same direction like before the transformation while the blue Vector points somewhere else the red Vector is what we call an igen Vector because it's not changing direction under transformation furthermore the length of the vector corresponds to the so-called igen value it's a measure for the intensity of the stretch in the vector's direction a very intuitive example of an igen Vector is when you point your arm into the air and spin in circles then your arm is the IG Vector of the rotation transformation as it doesn't change the direction more formally the igen vector equation

### Eigenvector Equation [15:30]

looks like this all it says is that multiplying an igen vector v with the Matrix a leads to the same result as multiplying the igen vector with the corresponding igen value therefore the vector's direction is unaffected the only thing that happens is that it is multiplied by some scaler just like we've seen in a previous example usually this is called diagonalizing a matrix because instead of our Matrix on top we can just multiply with a diagonal matrix of igen values later we will also quickly talk about how to calculate igen vectors and igen values equipped with this knowledge

### Spectral Theorem [16:10]

we can now discuss how the principal components are found based on the covariance Matrix according to a theorem called the spectral theorem any symmetric Matrix like the covariance Matrix always has real igen vectors and igen values the set of all igen values of a matrix is also called The Matrix Spectrum that's where the theorems name comes from this means that every covariance Matrix can be diagonalized and the amazing thing for

### Connection between Eigenvectors and Principal Components [16:40]

PCA is that igen vectors of the covariance Matrix correspond to the principal components so the Direction with the highest variance and the igen values correspond to the explained variance of each principal component now where is this surprising fact coming from The co-variance Matrix formulation naturally arises from the mathematical optimization problem that we want to solve I won't go further into the details here but if you're interested in finding out how the mathematical formulation looks like have a look at the links I've added to the video description now we should have everything in place to quickly go over each step of the PCA algorithm the first

### [STEP 1]: Centering the Data [17:23]

step is to Center the data centering the data means that the data is exactly at the origin of the coordinate system this can be achieved by substracting the mean of each feature some implementations additionally scale the data such that it has unit variance the reason for doing all of this is to have the same contribution of each variable as you can imagine all of the projections only work if we have the same ranges for each of the axes step two is to calculate the

### [STEP 2]: Calculate Covariance Matrix [17:54]

covariance Matrix as we now know the co-variance Matrix helps us to find a closed form solution to identifying the Direction with the highest variance it tells us about the variances of variables and between variables meaning to which degree the variables vary together the formula of variance is shown on top co-variance on the bottom in both Expressions xar represents the mean Vector of the corresponding variable the third step is the igen

### [STEP 3]: Eigenvalue Decomposition [18:25]

value decomposition as we've learned due to the co-variance being a symmetric Matrix and applying the spectral theorem The co-variance Matrix can be diagonalized the igen vectors correspond to the principal components and the igon values to the variance along the vectors this is an example of how it could look like the two vectors are the two principal components as well as the igen vectors of the covariance Matrix the blue values are the igen values and as you can see the spread of principal component one is about three times as big as principal component 2 but how do we actually calculate igen

### How to find eigenvectors? [19:05]

vectors this igen decomposition is performed with some more linear algebra starting with solving the characteristic polinomial for the Matrix finally I have

### The truth :O [19:17]

to be honest with you when you calculate PCA using your favorite programming Library most likely no icon decomposition is performed in practice singular value decomposition I is

### Singular value decomposition [19:27]

computed the probably most crucial reason for this is the runtime complexity of the two methods IG decomposition has a cubic complexity while a singular value decomposition has a much smaller one depending on the shape of the data Matrix SVD is a more General Matrix decomposition technique that can also handle rectangular matrices remember igen decomposition only works on squared matrices it also has many other applications Beyond PCA it calculates singular values and singular vectors just like igen values and igen vectors when using SVD on a data set we don't have to calculate the covariance Matrix as we can apply it directly to the data set that's another reason why SVD is typically faster the results of SVD and igen decomposition are of course the same okay so does this

### Why eigendecomposition at all? [20:21]

mean we don't need igen composition at all not quite remember when I talked about kernel PCA at the beginning to learn nonlinear principal components for this technique the data is mapped using a kernel function in this scenario the single of value decomposition doesn't naturally extend to the nonlinear mappings and we need to make use of the igen decomposition on The covariance Matrix okay the last step is to use the

### [STEP 4]: Projection onto PCs [20:45]

igen vectors to project the data onto the new axis for our example from the beginning this projection is performed as follows we take our two-dimensional data Matrix multiply it with the calculated igen vectors or principal components and get the transformed data on the principal component axis so that's exactly what we see here and that's the corresponding data symmetric matrices always have

### Orthogonal Eigenvectors [21:12]

orthogonal igen vectors that's why the principal components are orthogonal to each other so they're at right angles because they are orthogonal it also means that they have no correlation that's why PCA is set to project the data into a decorrelated basis it's also quite easy to prove this mathematically again for this you find another Link in the video description obviously we want to reduce the dimensionality therefore we only project onto the principal components with the highest variance this can be determined using igen values so again we take the data Matrix

### Dimensionality Reduction Projection [21:49]

select only the igen vector with the highest IG value and obtain the projections onto the first PR iple components and we finally end up with these projections and that's the corresponding data so that's the full picture and now let's go ahead and apply this on a real data set okay so this is

### [CODE] [22:11]

the notebook and you can find the link in the description I selected a wine data set from Psychic learn and this wine data set has a bunch of features that describe different characteristics of a wine and we have three classes of wines I don't know exactly what they stand for but overall we have around 200 data points with 13 features now here I Define two functions one is building a simple Seaborn Scatter Plots the other one is using IPI volume a very nice library to create 3D visualizations so let's run that and it will also automatically install IPI volume if it's not yet installed now let's start with the first technique and I just wanted to point out here that you can use numai to also calculate igen values and igen vectors using this linear algebra package and you can use NY COV to calculate the covariance Matrix and here are examples for the first uh three igen vectors and igen values of the covariance Matrix okay now the 2D projection is pretty simple of course we use the easiest package you can use which is psychic learn and here we import PCA and as you can see it gets imported from decomposition because it performs a decomposition in detail it performs singular value decomposition as you can read in the documentation we will also time the execution so that we can compare later the different dimensionality reduction techniques and now we just call fit transform and this will perform the singular value decomposition and then we visualize this and it will show this picture so we have the three classes in a two-dimensional space and what you can do with um this PCA object is you can also extract the igen values and igen vectors using explained variance ratio and components so that's quite nice and then secondly just because we have three dimensions let's also use them so we can also make a 3D plot I like this Library it's very nice for visualization and this might in some cases help to even better um represent data and that's pretty much it so we will continue this notebook in the next video looking at the other techniques to finish this video I

### Summary Table [24:52]

quickly want to go over this table which we will extend with the other dimensionality reduction techniques and for PCA we can say it's a global technique because we incorporate all of the data at the same time it's linear because we project onto linear principal components of course there are also extensions to nonlinearity it's a production based method because of this decomposition approach the main purpose is to use the principal components for data analysis and visualization it's deterministic when we rerun it is will lead to the same result which is not true for all of the techniques the computational complexity depends on the decomposition approach and ranges from cubic for igen Value decomposition to squared in the case of singular value decomposition the only real hyperparameter is the number of principal components which can be identified with a scre plot then the key idea is to decompose The covariance Matrix into igen vectors and igen values and finally some other applications are Den noising compression or correlation analysis awesome that's it for the first technique in this video in the next video we will have a look at multi-dimensional scaling thanks for watching and see you soon in the next