# t-distributed Stochastic Neighbor Embedding (t-SNE) | Dimensionality Reduction Techniques  (4/5)

## Метаданные

- **Канал:** DeepFindr
- **YouTube:** https://www.youtube.com/watch?v=U-s8q6HshZw
- **Дата:** 19.02.2024
- **Длительность:** 31:20
- **Просмотры:** 17,473
- **Источник:** https://ekstraktznaniy.ru/video/52959

## Описание

To try everything Brilliant has to offer—free—for a full 30 days, visit https://brilliant.org/DeepFindr. The first 200 of you will get 20% off Brilliant’s annual premium subscription.

(Video sponsered by Brilliant.org)

▬▬ Papers / Resources ▬▬▬
Colab Notebook: https://colab.research.google.com/drive/1n_kdyXsA60djl-nTSUxLQTZuKcxkMA83?usp=sharing

Entropy: https://gregorygundersen.com/blog/2020/09/01/gaussian-entropy/
Attractive / Repulsive Forces Gradient: https://jmlr.org/papers/volume23/21-0055/21-0055.pdf
t-SNE Parameters distill: https://distill.pub/2016/misread-tsne/

Other great resources:
- By the t-SNE author: https://lvdmaaten.github.io/tsne/
- A good view on probability: https://siegel.work/blog/tSNE/
- CalTech tutorial: http://bebi103.caltech.edu.s3-website-us-east-1.amazonaws.com/2016/tutorials/aux8_tsne.html 
- Great visuals: https://newsletter.theaiedge.io/p/formulating-and-implementing-the
- SNE vs T-SNE: https://www.linkedin.com/pulse/visualization-method-sne-vs-t-sne-

## Транскрипт

### Intro []

welcome back to part four of our dimensionality reduction Series in this video we will dive into the details of tne which is one of the most commonly used dimensionality reduction techniques it also seems that my viewers mostly use tney as their method of choice for you guys I think it also makes sense to familiarize with umap which will be discussed in the next video as you might know tne is considered a manifold learning technique so let's recap what this exactly means Disney is a nonlinear

### Manifold learning [0:30]

technique and often produces a better low dimensional embedding compared to linear methods like PCA here you can see an example of the embedding space for the amness data set which contains images of handwritten digits we can see that the 2D representation is much more organized in the case of tney but what is the reason for this in a nutshell it's because PCA tries to map the global ukian distances between all points into the embedding space and apparently ukian distance is not a great choice in higher Dimensions this problem was already mentioned in the first video of the series which relates to the curse of dimensionality trying to preserve pairwise distances in the low dimensional space is impossible because the less Dimensions the less space we have left to position the points consequently the distances will converge towards zero which means we need to place all the points close to each other that effectively leads to one big cluster like we've seen in this image the same applies for MDS which we've discussed in the last video a good way to visually understand why ukian distance is not ideal is to look at the manifold of the Swiss roll data set as PCA makes use of ukian distances the points on the top will be close to each other in the embedding space as they have a small ukian distance even though we would not consider them as direct neighbor the right distance for properly unrolling the Swiss roll data set is called geodesic distance and is shown on the image below it is defined as the shortest curve on a surface that connects two points and the surface is the manifold of the Swiss roll in our case it turns out that we can approximate the geodesic Distance by building a path from neighbor to neighbor and that's also a fundamental idea of tne as also mentioned in the name it's a neighbor embedding technique having that said in the following I will guide you through the basic ideas some improvements of the algorithm and finally how to make it scale to millions

### Relevant Papers & Agenda [2:40]

of data points we will mainly look at three important contributions to build the basis for how tne is implemented in most libraries today the first paper was proposed in 2002 by Jeffrey Hinton and Sam RIS and is called stochastic neighbor embedding or short sne knee in a follow-up paper by Lawrence fander Martin and Jeffrey Hinton a couple of improvements were proposed that led to tney and finally Barnes hsne presented by Lawrence fer Martin provides the toolbox to scale tne to millions of data points and this historical outline will also serve as agenda for this video so let's start with the first paper stochastic neighbor

### Stochastic Neighbor Embedding (SNE) [3:25]

embedding the foundational paper called sne presents the basic idea how to map high-dimensional data into a lower dimensional space in the following we will quickly discuss each of the steps of the algorithm and summarize the full thing at the end of this chapter following the notation of previous videos the high dimensional data is called X and lower dimensional data such as a single Dimension here is denoted with Y as always our goal is to preserve the structure of the original data let's

### Pairwise distances [3:56]

take this simple 2D data set as an example the first step of sne is to calculate pawise distances typically using the ukian distance function and this leads to this distance Matrix but wait we said we can't successfully preserve Global distances in the lower dimensions and that is right and because of that sne will introduce a way to only look at the local neighborhood of each point before that however it is still required to know all of the distances between the points in order to decide which ones are actually neighbors so we won't get around this step what's happening next is that sne converts the distances to

### Distance to Probability [4:35]

probability distributions for each point on the right you can see a gausian distribution that corresponds to the red points in the paper such a distribution is called a conditional probability because we condition on the distance to a specific point like the red Point here the Y values of this probability distribution are now expressed expressing the similarity between points okay wait so this came out of nowhere we converted the distances to a point to a probability distribution such that closer points have a higher value and further away points have a lower value a good way to justify this is to think of these distributions as a waiting of all data points as a result far away points become very irrelevant and therefore we sort of relax our Dimension scaling problem as we don't have to pay too much attention to them in this example the green orange and yellow points have very small similarities and the blue point has a rather High similarity now to be fair introducing probabilities is not too well motivated in the paper and I'm also not sure if probability is the right term as there're not really stochastic events here but we will just accept this notion in The Following also there is no need to draw both sides of the gaussian distribution here because distances are typically positive for visualization reasons I will however always use the full gaan Bell this conditional

### Conditional Probability Math [6:06]

probability can also be expressed in an equation like shown here the red point is highlighted as XY remember we are still operating in the high dimensional Space X in the equation we find the ukan distance between the red points and a neighbor Point J this distance is then plugged into the formul of the gaussian normal distribution where we also find the variance Sigma you might remember that gion distributions have two parameters the mean and the variance the mean is simply zero here because the distance of the red point to itself is zero finally the whole term is normalized over all points to get a probability value between Zer and one which makes sure that the data is represented equally so regardless of the density so that's basically how we can calculate the similarity value from the

### Adjustment of Variance [7:05]

distance okay so it looks pretty well for our initial example where we have a gaan centered around the red point and the blue point is mapped to 0. 3 which is relatively High the other three points are mapped to very low similarities like the Green Point to 0. 01 overall we are happy let's have a look at another Point's perspective such as the Green Point here we realize that the distribution is somehow not perfectly suited the blue point is still mapped to a relatively high probability value even though we don't really want it to be included in our cluster because the blue point was not really a neighbor so we're not happy with that it turns out that it's unlikely that the same variance value is optimal for all points in dense regions a smaller value is appr appropriate ideally we would like to see something more in the direction of this distribution as you know variance controls the width of a gaan and by making it smaller we assign a smaller value to the points at the tals of the distribution this adjustment is also happening systematically in sne which brings us to the term perplexity

### Perplexity [8:20]

is a hyperparameter and can be seen as a measure for how many neighbors we want to include for each point if we set it very high it will lead to very compact clusters as all points are in the neighborhood of each other if it's a lower value the low dimensional space will be more distributed as only a few neighbors are included under the hood perplexity simply controls what we just adjusted manually namely the variance of the Gan distribution of each point as you can see a high perplexity value is expressed by a flat Gan distribution and a low value corresponds to a peak distribution perplexity is mathematically defined as follows for a conditional probability distribution which is the neighborhood probability of a point I it equals to two to the power of the Shannon entropy of this distribution the Shannon entropy is a popular tool in information Theory and measures the average amount of information generally speaking the entropy increases if the variance increases because the flatter the distribution the more uncertainty we have I've put a nice blog post in the video description if you want to go deeper into entropy by putting this entropy into the exponents the perplexity tells us approximately how many neighbors our gausian kernel covers and therefore perplexity balances between preserving Global and local structure so at this point we know two

### How to find the variance [9:55]

things first we should adjust the width of the distribution for each point because it is not optimal to use the same variance every time and secondly we can use perplexity to measure how many neighbors a distribution currently captures so ideally we would like to capture the same number of neighbors for each point so what's left is we need to find the optimal variance for each point such that we have the same complexity value for all of the points in order to find the optimal variance for each Point most of the libraries perform a binary search that finds a perplexity as close as possible to what we specify as hyperparameter such as 30 in this example in a nutshell binary search goes through a list of audited values and evaluates one variance at a time by calculating the perplexity then the maximum or minimum bound is moved until a suitable approximation is found this algorithm is faster than trying all of the values in the list for example here we could find a close to 30 value already for the second Sigma and can terminate the search and this search can be easily parallelized for each data point which you can also see in the psychic learn

### KL-divergence [11:15]

documentation the last step is to apply the same principles to the low dimensional mapping this means for each point we have two probability distributions one for high dimensional data denoted with pji and one for the low dimensional embedding space denoted with qji the last step of sne is to compare these two distributions to evaluate how similar the neighborhood is this means we evaluate the Y values for p and Q and need to compare them how close their similarities are a popular measure for comparing distributions is cack liab Divergence or short K Divergence it simply sums over all of the points and all of the neighbors of each point and compares the function values according to this formula if you want to get a feeling for this loss function I highly recommend to pause the video and plug in the values of the green and blue point here trust me this will be very useful for understanding this video you will realize that this loss function weighs values close to the center more than values at the tals of the distribution and that's exactly what we want to put more emphasis on close neighbors so if a close neighbor in the original P distribution is embedded far away in the Q distribution then we receive a large penalty for that as the name implies this loss function measures the Divergence between the distributions and therefore we want to minimize it also note that we can only move the points in the Q distribution and P is fixed just a side note a good way to verify the quality of the tne fit is to

### Shepard Diagram [12:55]

use the shepher diagrams that we've learned about in the last video it's simply a scatter plot between the high-dimensional distances and the low-dimensional ones I've not seen many people doing this but it's an excellent way to build more trust into the tne results finally for minimizing the cost

### Gradient and it's interpretation [13:15]

function gradient descent is used the gradient of our cost function which is kld Divergence with respect to why has this simple form in the tne papers the author provide an intuitive explanation how to interpret this gradients it's simply the resulting Force created by a set of Springs between a specific point such as the red points and all other points of the low dimensional mapping these Springs repel or attract points depending on whether the distance is too small or too large compared to the high-dimensional similarities in the gradients the first term measures this mismatch so if the distance is too large or too small and the second term expresses the force of the spring so how strongly should we repel or attract you can even reformulate this gradient to separate the attractive and repulsive forces interestingly this

### N-body simulation [14:15]

corresponds to a so-called nbody simulation which is relevant in physics so our sne or tne mapping can be seen as a set of particles that pull and push each other and we update the positions until a steady state is reached so let's quickly summarize the steps of the traditional sne algorithm we start with

### Full SNE Algorithm [14:35]

initializing the low dimensional points randomly or running PCA for a starting configuration then we compute the pairwise ukan distances between all points after that we introduce locality by applying gaussian distributions to the distances then we find the sigma values of each Point's distribution using a binary search such that each point has roughly the same number of neighbors and finally we optimize their positions by improving the KL Divergence between the low and high dimensional probability distributions for each individual data point so that's it for sne now let's have a look at tne in a

### t-distributed Stochastic Neighbor Embedding (t-SNE) [15:15]

nutshell it's just sne with a couple of improvements but the overall algorithm stays the same in order to understand why we even need Improvement let's quickly discuss the crowding problem

### Crowding Problem and how to solve it [15:28]

which motivates the most important change namely the T in tne imagine we have a data set that is arranged in a two-dimensional space like this in order to map this to the one-dimensional space we can simply arrange the points in the line now how well does this mapping preserve the distances of the original data for this we can have a look at the neighbor probability distribution of a specific point such as green again the distances correspond to similarity values and the blue point is furthest away if we compare this with the distribution of the high dimensional green points we realize that the blue point has been placed too far away in the low-dimensional embedding the orange and red points seem to have more or less the right values but the blue point has definitely a too small similarity the issue here is that no matter how we arrange the points in embedding space we will always have the situation that some of the points are not as close as they should be the reason for this is simply that the intrinsic dimensionality of the data is too large for the number of Dimensions we have chosen this situation is exactly what the crowding problem describes when we can't preserve the right distance to all neighbors because there is simply not enough space for it because the blue point is not perfectly placed yet there will be a small attractive force between blue and green as indicated by this spring this Force stems from the last term of the K Divergence loss which punishes deviations between the distributions let's see what the results of these forces between points looks like as you can see sne on the left which suffers from the crowding problem has a hard time to form clusters because during optimization the small forces will make all points Clump together thisne on the right alleviates this problem and is able to form clear clusters so how is tne doing that the basic idea is to use a probability distribution that simply assigns a higher value to these problematic points and this simply reduces The Unwanted forces as the deviations between the similarity values is much smaller in a sense you can think about this as having a broader neighborhoods in the low dimensional space which allows for a bit of error the distribution that is used

### Gaussian vs. Student's t Distribution [17:58]

for this is called students te distribution and that's where the T and tne is coming from as you can see in this comparison the T distribution has heavier tails and this will assign higher similarity values this allows the points to be further away and still have a relatively high probability so here you can see the situation even more clearly the blue point from before gets a much higher value assigned which reduces the deviation to the original distance and hence reduces the forces between points so the takeaway of this is there is not much we can do about the missing space in low Dimensions but we can allow for some Arrow which we will do by using the T distribution using this distribution also brings some additional advantages Beyond alleviating the croing problem first of all we don't have to perform binary search to find the variances like for the gaussian distribution and instead can always use the same distribution for All Points more specifically this is the T distribution with one degree of Freedom secondly the calculation of the probabilities is computationally more efficient because we don't have to calculate exponentials like in the Gan distribution so this was definitely the most significant change in tne but let's now also quickly discuss other things that were adjusted to make the

### Symmetric Probabilities [19:21]

optimization more efficient symmetric probabilities have been introduced here you can see an example of the similarity of blue given the Green Point which is 0. 15 now what if we turn things around what is the similarity of green given blue somehow we would expect that the similarities between them are equal we realize however that this is not the case besides being a bit unsatisfying this also has a negative effect on outliers in our data the blue outlier here will always have a small probability in the neighborhood of each other points this not only makes the blue point very sad but most importantly doesn't include it into the cost function in other words it doesn't matter where we place the blue points in the low dimensional space because it will never be really considered in the loss function therefore the authors proposed to average the conditional probabilities to get joint probabilities this allows the outliers to contribute to the cost function another nice side effect of this is that the gradient calculation is simplified which leads to computational benefits finally the third

### Early Exaggeration [20:35]

most important Improvement is early exaggeration let's first discuss what the underlying problem is that the authors try to solve here assume we have two-dimensional data that is forming two clearly defined clusters when initializing the low dimensional space randomly we might have a starting configuration like this so each of these 2D points is mapped to one point of the one-dimensional line after optimization we would expect two separated clusters on this axis when we start the optimization now the KL Divergence loss Will activates attractive forces between similar points and repulsive forces between points that should be distant from each other so differently colored points will repel each other and points with the same color will move together the result without early exaggeration would look like this but wait that's not quite what we expected what about the blue guy on the left well because of all of the repulsive forces by the orange cluster in the middle the blue point has no chance to move to the other side the simple way to solve this is to increase attractive forces for the first couple of iterations as indicated by these thick green arrows this allows the separated blue point to overcome the repulsive forces in order to move to the other side after some iterations the attractive forces return to normal such that clear clusters can be formed all of these things can be controlled in the gradient if you're interested in more details I can recommend to take a look at this paper which performs a comprehensive analysis of the attractive and repulsive forces expressed by the gradient of the loss function for completeness early exaggeration simply multiplies the attractive forces of the gradient with a factor greater than one applied to a real data set like the genomic Mouse data set on this image the differences look like this on the right tne forms more densely packed clusters

### SNE vs. t-SNE [22:50]

so as a wrap up early exaggeration symmetric probabilities and the introduction of the T distribution for low dimensional space were the main improvements presented in the tney paper there were some additional minor things that I didn't address here such as adaptive learning rates to avoid local Minima now let's give ourselves a

### Brilliant.org Sponsoring [23:08]

short interactive Break by clicking through one of the CES of the sponsor of this video series which is brilliant. org brilliant offers plenty of interactive courses for data science programming and Tech in general to give you some more insights I thought makes sense to click through a few lessons from a very recent topic namely large language models I know that my viewers have advanced knowledge so I already jumped a bit ahead and completed the first few sessions here you can see the lessons on tokenization so how language models convert text to a dictionary used for model training you can go through these lessons on your phone whenever you have time and therefore have the chance to stay up toate every day you can get started for free for a full 30 days using the link brilliant. org deepinder and the first 200 to sign up using this link will get 20% off the annual premium subscription now it's time to go through

### Code [24:14]

some code and after that we will shortly discuss the Barnes Hut sne so originally I wanted to go a bit deeper into the programming stuff but this video is probably already two packed for this so we will just have a short look at the tne implementation in psychic learn so as you can see here it says tne minimizes the KL Divergence of the gaussian and the student T distribution que so that's exactly what we just talked about and you can see that there is the Barnes Hut method which we will discuss after this coding section and if you don't use Barnes hat it will use the exact computation and here you can also see that we perform or the method performs gradient descent and this gives you the KL diversions and here we also have the options to Define early exaggeration which we just discussed and when we move up a little bit then we can find the fit function right here and here are a couple of things so the learning rate and that kind of stuff for The gradients Descent and in addition to that we can it will use barns Hut or for the exact computation and you can also pass it another pre-computed metrix so typically it's calculating ukian distance so this is exactly the pawise parse distance part which we discussed at the beginning of this video and based on these distances it's then converting them into joint probabilities so we've also discussed that these are these symmetric probabilities and for that it will perform the binary search all right based on the implementation let's now run these first couple of cells to get the data set and now let's have a look at tne um using psychic learn again I won't do anything fancy here this is just to quickly go over the parameters so nend components is how many dimensions we want let's choose three because we have this 3D plot again perplexity defines how many neighbors we want to have in the neighborhood of each point then early exageration defines how strongly we want to scale the attractive forces when running early exaggeration for the first couple of EPO and then we have here some properties for the gradient descent then in it we can specify PCA to start with a PCA configuration let's run that and the final thing is end jobs um and end jobs is because for performing the binary search we can parallelize that as mentioned before and so we will just choose four workers here to do that and this is what the embedding looks like and just like before we can also do the

### Distill.pub Blogpost [27:15]

same thing in 2D if you need a great guides to better understand the parameters and the effects of each of the parameters of tne I recommend to check out this till. pop which has a great blog post regarding this also helps to understand or to realize that tne is actually not useful for clustering because it doesn't guarantee that any of the information within or between clusters makes sense the last thing we will discuss here only on a

### Barnes-Hut t-SNE [27:49]

high level is Barnes hard tne which actually is very important because it allows to scale TIY to m millions of data points so we've just had a look at the documentation and we saw that there are two ways to compute the embeddings either you use the exact way which has a squared computational complexity or you use Barnes Huts which has a lock linear complexity so it's much more efficient remember when I previously talked about how sne is connected to nbody simulations from physics well it turns out that the runtime also used to be a big issue in that domain especially when you deal with millions of particles Joshua Barnes and Pete Hut came up with an approximation algorithm for solving and body simulations more efficiently the basis for this approximation is to divide the space into cubic cells which summarize the particles in the cell using the cell's Center the exciting idea of Lawrence funder Martin was to also transfer this to our good old T algorithm imagine we have millions of points and need to compute pawise distances probabilities and all of that for every single point using Barn hatut approximation we can alleviate the problem by summarizing the points belonging to the Clusters so instead of using all points of each cluster we can just take the center the second step of barn Hut is to build a so-called quad tree which summarizes all of the points belonging to a specific cell in a tree structure by traversing this tree we can efficiently calculate the repulsive forces between the Clusters this allows to approximate the repulsive part of the gradients by using only the centers of each cell So eventually there's much more we could talk about when it comes to accelerations of TIY but this was just a quick Peak to give you some intuition about what is happening under

### Comparison [29:54]

the hood finally let's go over our dimensionality reduction comparison table first of all tne is a local technique as the embeddings are based on each Point's neighborhood secondly this method is nonlinear as we can approximate arbitrary shapes and that's why it also belongs to the manifold learning techniques the main purpose is visualization to gain intuition about the data as the name implies tne is stochastic and that's because the objective function is non-convex and different initializations lead to different Minima we learned that Barnes Hut is an efficient lock linear approximation of tne because tne itself has quadratic complexity the most important parameters are perplexity early exageration and everything related to gradient descent the key idea is to use each Point's neighborhood to create embeddings finally it's important to note that tne is an iterative algorithm and we can't apply it to new data like in the case of PCA that's why there are also not many other applications other than visualization all right this was a

### Outro [31:06]

lot and that's it for today thanks again for watching and see you soon in the final part when we look at umap
