# Uniform Manifold Approximation and Projection (UMAP) |  Dimensionality Reduction Techniques (5/5)

## Метаданные

- **Канал:** DeepFindr
- **YouTube:** https://www.youtube.com/watch?v=iPV7mLaFWyE
- **Дата:** 03.05.2024
- **Длительность:** 28:55
- **Просмотры:** 13,677

## Описание

▬▬ Papers / Resources ▬▬▬
Colab Notebook: https://colab.research.google.com/drive/1n_kdyXsA60djl-nTSUxLQTZuKcxkMA83?usp=sharing

Sources:
- TDA Introduction: https://www.frontiersin.org/articles/10.3389/frai.2021.667963/full
- TDA Blogpost: https://chance.amstat.org/2021/04/topological-data-analysis/
- TDA Applications Blogpost: https://orbyta.it/tda-in-a-nutshell-how-can-we-find-multidimensional-voids-and-explore-the-black-boxes-of-deep-learning/
- TDA Intro Paper: https://arxiv.org/pdf/2006.03173.pdf
- Mathematical UMAP Blogpost: https://topos.site/blog/2024-04-05-understanding-umap/
- UMAP Author Talk: https://www.youtube.com/watch?v=nq6iPZVUxZU&ab_channel=Enthought
- UMAP vs. t-SNE Global preservation paper: https://dkobak.github.io/pdfs/kobak2021initialization.pdf
- Fuzzy Topology Slidedeck: https://speakerdeck.com/lmcinnes/umap-uniform-manifold-approximation-and-projection-for-dimension-reduction?slide=39
- Short UMAP Tutorial: https://jyopari.github.io/umap.html


Image Sources:
- Thumbnail Image: https://johncarlosbaez.wordpress.com/2020/02/10/the-category-theory-behind-umap/
- Persistent Homology: https://orbyta.it/tda-in-a-nutshell-how-can-we-find-multidimensional-voids-and-explore-the-black-boxes-of-deep-learning/


▬▬ Support me if you like 🌟
►Link to this channel: https://bit.ly/3zEqL1W
►Support me on Patreon: https://bit.ly/2Wed242
►Buy me a coffee on Ko-Fi: https://bit.ly/3kJYEdl
►E-Mail: deepfindr@gmail.com

▬▬ Used Music ▬▬▬▬▬▬▬▬▬▬▬
Music from #Uppbeat (free for Creators!):
https://uppbeat.io/t/sulyya/weather-compass
License code: ZRGIWRHMLMZMAHQI

▬▬ Used Icons ▬▬▬▬▬▬▬▬▬▬
All Icons are from flaticon: https://www.flaticon.com/authors/freepik

▬▬ Timestamps ▬▬▬▬▬▬▬▬▬▬▬
00:00 Introduction
00:32 Local vs. Global Technqiues
1:25 Is UMAP better?
02:08 The Paper
02:40 Topological Data Analysis Primer
04:04 Simplices
05:04 Filtration
06:22 Persistent Homology
07:02 UMAP Overview
07:40 Step 1: Graph construction
08:25 Uniform distribution
09:44 Non-uniform real-world data
10:48 Enforcing uniformity
12:05 Exponential decay
12:43 Local connectivity constraint
14:24 Distance function
16:19 Local metric spaces
17:00 Fuzzy simplicial complex
18:38 The full picture of step 1
19:10 Step 2: Graph layout optimization
19:55 Comparing graphs
21:15 Cross entropy loss
22:14 Attractive and repulsive forces
22:56 More details
24:04 Code
26:28 t-SNE vs. UMAP
27:24 Outro


▬▬ My equipment 💻
- Microphone: https://amzn.to/3DVqB8H
- Microphone mount: https://amzn.to/3BWUcOJ
- Monitors: https://amzn.to/3G2Jjgr
- Monitor mount: https://amzn.to/3AWGIAY
- Height-adjustable table: https://amzn.to/3aUysXC
- Ergonomic chair: https://amzn.to/3phQg7r
- PC case: https://amzn.to/3jdlI2Y
- GPU: https://amzn.to/3AWyzwy
- Keyboard: https://amzn.to/2XskWHP
- Bluelight filter glasses: https://amzn.to/3pj0fK2

## Содержание

### [0:00](https://www.youtube.com/watch?v=iPV7mLaFWyE) Introduction

hello everyone and welcome back to the final video of this dimensionality reduction Series so far we've had a look at PCA MDS and tne and now it's finally time to talk about one of the most recent methods which is umap according to some memes on Reddit umap appears to be superior over tney but is this actually true to answer this question we should begin with discussing if there is at all anything that we would like to improve in tney the techniques we

### [0:32](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=32s) Local vs. Global Technqiues

discussed so far are somewhat located at the extremes when it comes to what kind of information is preserved MDS for example is trying to preserve all Global distances and we learned that this is not ideal because it's simply not possible in a low-dimensional space Disney on the other hand restricts itself to only look at the local neighborhoods and therefore might miss the overall structure in the data that's also the reason why this distances between clusters of tney points are meaningless the main argument for this behavior is typically that the K Divergence loss used in tney doesn't care about the global structure now umap claims to preserve both local and most of the global structure in the data so it sits somewhere in between this comes from how the algorithm is designed by paying more attention to the overall geometry of the data well but there is

### [1:25](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=85s) Is UMAP better?

some discussion around it a paper by kobak and lindaman has for example shown that these differences mainly arise from how the algorithms are initialized so there is no evidence per se that umap has any advantage over TIY in terms of preserving Global structure one clear Advantage however is the computational complexity which is close to linear and therefore faster than the quadratic complexity of the traditional tne as this plot from the documentation shows umap is clearly faster when the data set sizes become larger and the comparison also includes approximations of tney like Barnes Huts which is indicated by the green line having cleared up these rumors let's now talk about umap

### [2:08](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=128s) The Paper

stands for uniform manifold approximation and projection the technique was published in 2018 by Leland mckin John Healey and James mwell which are researchers from the Canadian Tut Institute for mathematics and Computing a historical side note William T was a British Canadian mathematician reading this paper was very refreshing as it is one of the most comprehensive papers that I've read in the last years in order to understand umap we will have to make a short journey into the world

### [2:40](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=160s) Topological Data Analysis Primer

of topological data analysis in the paper there's a complete section on the theoretical motivations drawn from this field and therefore we will first build the foundations to understand the rest of the technique as you know data usually corresponds to a set of points which are arranged in a high-dimensional space also called the ambient space each feature corresponds to a new dimension now topological data analysis or short TDA is interested in the shape of the data in the space or in other words the manifold in the TDA toolbox there are different methods that help us to extract topological and geometric features of the data this mathematical field is also called homology such features can for example be connections Loops or holes in the data set and they can reveal interesting details about our data so the set of points on the right doesn't give us any topological information per se so how do these methods calculate topological features all of this is based on triangles as you might know surfaces are often approximated using triangles and this is for example used in computer graphics and video games triangles are the simplest polygons and a nice property of triangles is that we can construct all other polygons with them which is called triangle lay what's even nicer we can generalize triangles into higher

### [4:04](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=244s) Simplices

Dimensions the triangle we just saw is a 2d example for 3D we call it a tetrahedron these generalizations are also called simplies one of the most basic ideas in topological data analysis is to Simply clue simplies together along faces to approximate the geometry in the data the result of that is a so-called simplicial complex which we can see on the right don't take this sketch too seriously however because it's just based on zero and one simplies the takeaway is that we can resemble the topological structure in our data by cluing together simplies later we will see that a similar approaches used in umap to learn the local structure of the manifolds for TDA we can now calculate all of our geometric features on this graph directly which is quite convenient so far so good the last question is how do we actually build such a simpli complex so how do we glue these things

### [5:04](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=304s) Filtration

together the general approach used in TDA is called filtration we simply put balls of a fixed radius Epsilon around each point and whenever these balls overlap in certain ways we connect the points this automatically generates a graph for each radius Epsilon in simplical homology all of this is defined much more concretely and in fact there are different way ways to build these structures for umap we mainly care about the vorus rips complex which is determined by the zero and one Simplex a great python Library called Giotto TDA has a nice visualization of the filtration procedure as you can see with each new radius Epsilon the shape is evolving and we get a pretty good feeling for how the shape of the data looks like which is of course also true for higher dimensions for umap this little introduction is probably sufficient at this point but I will talk about one more detail because I simply find it important given this evolving structure we now have different levels of topological features there's a large hole in the data set at some point which gets destroyed again once the radi become too large so what level of granularity is the right one to make

### [6:22](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=382s) Persistent Homology

this even more concrete if we perform filtration on such a shape we will get information about many small holes but also a big hole so how do we know which hole is the feature we are interested in this is where persistent homology comes into play it simply tracks the beginning and end of these features throughout the filtration procedure that's also where the name comes from it checks which topological features are persistent over the radii thinking about umap again topological Theory provides guarantees about how the manifold of the data can be captured in a min meaningful way and that's why we started this video with this quick

### [7:02](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=422s) UMAP Overview

primer we can now discuss the umap algorithm in two steps first we construct a topological representation using the vorus rips complex which effectively gives us a k nearest neighbor graph secondly we find a low dimensional representation that has a similar topological structure and this in the end is a graph layout optimization in the ukian 2D or 3D space case the ACT implementation is quite simple but the authors emphasized that the mathematical framework serves as a foundation to justify the

### [7:40](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=460s) Step 1: Graph construction

approach let's discuss the first step in more detail in the TDA introduction we learned about the vorus rips complex and how it can be built using filtration by putting a radius around each points we will do something similar in umap an immediate question now is which r Rus should we pick to construct a simpal complex that approximates the manifold on which the data lies there is no one size fits all radius for all data sets and therefore umap has a hyperparameter which is the number of Neighbors in order to understand why it's not a radius anymore we will have to dig a Little Deeper all of this has to do with the U in um map which stands for uniform

### [8:25](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=505s) Uniform distribution

let's assume our data lies on this manifold if the dat is uniformly distributed on a manifold the consequence is that we always roughly capture the same number of neighbors if we position a ball with a radius somewhere in this example we have six neighbors for a given radius this would make it quite easy for us as we could always choose the same radius Epsilon and capture the topology at the right resolution for the entire manifold there's even a theorem the Nerf theorem which proves this observation from a mathematical point of view if the data is uniformly distributed on the manifolds a simplish complex of the cover of the data will approximate the manifold correctly all of what we've just seen is also formulated mathematically in the paper in LMA 1 given a ranan manifolds in an ambient space and given a bunch of locally constant points P we can approximate the geodesic distance using balls centered at each points this means if the data is uniformly distributed like in this example the geodesic distance on the manifold is simply 1 divided by the radius of these balls which is simply the average distance between points the only problem is that real

### [9:44](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=584s) Non-uniform real-world data

world data doesn't behave that nicely the data might be scattered around like this if we place a ball with the same radius on this collection of points we might get a different number of neighbors for each of the points and one first issue we observe is that the two points on the bottom are isolated from the rest of the points in the end we however want to have one Global simplicial complex that covers all of the data the problem here is that the previously mentioned Nerf theorem doesn't apply anymore because the data doesn't cover the manifold sufficiently another problem is that at some places the balls cover too much which results in very high dimensional simplies that's also not ideal because it simply doesn't reflect the global manifold structure in that area all of this would be much nicer if the data was uniformly distributed but what can we do about it umap is called uniform manifold approximation for a reason so there must be a way to introduce uniformity and that's exactly

### [10:48](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=648s) Enforcing uniformity

what the authors do looking at LMA one once again the author simply decide to give each point an individual distance function such that the data is approximately uniformly distributed with regard to that metric this way the Assumption of uniform distribution still holds and we are able to approximate the geodesic distance according to this Lama I found this sketch by Tom hosgood which nicely visualizes the effect of these individual distance functions to force the data to be uniform the geometry changes sparely clustered regions will shrink and densely clustered ones will be in en larged practically this means that the balls around each point stretch to the K nearest neighbor of the point choosing k equals 2 it could look like this in dense and sparse regions instead of choosing the radius we now simply choose a value for the number of neighbors this is more convenient because choosing actual distances is very data set dependent whereas the default value for example 10 neighbors Works quite well for most of the data sets if we would build a simplicial complex out of this information it would result in a simple graph that looks like this

### [12:05](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=725s) Exponential decay

with our local metric we can now also measure distance between the points and therefore we can also include information about how close to neighbors are this means our radius can be interpreted as the probability that a connection exists in practice an exponential decay is used around each point to give more weight to points that are closer in the neighborhood this information can also be used in our graph namely by putting weights on the edges here the first neighbor receives a higher weight because it's closer to our Center points unfortunately using this

### [12:43](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=763s) Local connectivity constraint

approach it can still happen that points become isolated because they are too far away simply because the exponential decay assigns a very small value to them the authors decide to add a local connectivity constraint such that the Decay starts beyond the first neighbor this eventually looks more or less like this as you can see we have no exponential decay until the first neighbor and then the exponential decay starts beyond the first neighbor in our graph this leads to a connection weight of one for the first neighbor and then to smaller weights for all other nearest neighbors except for having no isolated points this also has further beneficial implications as distances in the high-dimensional space tend to be larger to the first neighbor the distance distributions are typically cramped together we've also seen this in the previous videos of the series as you can see when introducing the local connectivity constraint the distributions become a bit wider because it stretches out to the first neighbor this result can also be demonstrated ated mathematically and shows that umap is not simply building a k nearest neighbor graph but many considerations went into the design of the method in the documentation for the umap python Library there's also a nice visualization of the whole graph construction as you can see the circles around each point stretch to the first neighbor and then start to Decay exponentially all we've just seen is

### [14:24](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=864s) Distance function

mathematically expressed in the paper using the following distance function this formula eventually defines how large the radi around each individual Point are and as we know by now this is controlled by the number of nearest neighbors K let's begin with the term on the left here we find the exponential decay around the center point that weighs the neighbors according to their distance in the exponent we find this complex looking term which is nothing else but the local connectivity constraint we just discussed just visually by substracting row I which is defined as the minimum distance among the nearest neighbors we make sure that the exponential decay starts only beyond the first neighbor that's because substracting the first neighbor's distance leads to negative values which are replaced by zero because of the maximum operator and anything to the power of zero is one so we will always have a connection probability of one it's also possible to visualize this distance function which you can see here finally Sigma I is a normalization factor that allows to control the intensity of the Decay if we now sum over the Decay distances to all nearest neighbors we roughly want the ball to capture the probability mass of log 2 of K there's not much information in the paper why log 2 is used but I read on GitHub that empirically this leads to better results case simply acts as a global constant that defines how locally we want to estimate the metric if we set it to a higher value we will have larger balls around each point and capture more of the global structure at this point one or the other might spot some similarities with tne I will talk more about this later in this video so given

### [16:19](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=979s) Local metric spaces

this new distance function we now have a local view for each individual point in the data set for a specific number of nearest neighbors such as two here we get a radius around each point that connects the point to all neighbors within the radius this gives us a graph for each data point more specifically a weighted graph as we can also put the connection probability on the edges to indicate which points are closer than others note that the edge weights between nodes are not a symmetric relationship one node might be a nearest neighbor of another nodes but not vice versa the real challenge now is to combine all

### [17:00](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=1020s) Fuzzy simplicial complex

of these individual weighted graphs into one large simplicial complex that represents the entire manifolds how would you come up with for example this subgraph given the individual weighted nearest neighbor graphs do we take the max of the edge weights or the average or what do we do mathematically speaking we have different local metric spaces for each data points and want to merge them into a joint simpal complex the math section of the paper tells us exactly how this can be done namely by converting the graphs to fuzzy simplicial sets in fuzzy simplicial sets the membership is not binary but instead a probability between zero and one and that's exactly the situation we have with our Edge weights and in fuzzy topology there exists already a clearly defined operation for merging representations which can be found in definition n in the paper the basic idea behind it is that the combined Edge weight is the probability that an edge exists minus the probability that no Edge exists this combined weights then represents the probability that at least one of the edges exist in this formula a corresponds to The adjacency Matrix of the graph we won't get too deep into the math here but if you want to read more about it I've added a more mathematical slide deck to the video description following this merging approach we have a mathematical guarantee that we obtain an appropriate topological representation of the whole data set now

### [18:38](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=1118s) The full picture of step 1

the full picture we started with drawing balls around each point to build the simpal complex then we made these balls stretch to the case nearest neighbor to ensure local uniformity over the manifolds then we added exponential decay beyond the first neighbor to have Edge weights that correspond to the distance finally we found a mathematically grounded way to merge all of the local views of the topology and eventually we have a weighted K nearest neighbor graph that represents the structure in our data set so now let's

### [19:10](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=1150s) Step 2: Graph layout optimization

talk about the Second Step which is how do we produce a low dimensional embedding that preserves exactly this structure we start with our high-dimensional weighted nearest neighbor graph let's call it h for the low dimensional 2D or 3D space we initialize the points somehow later we will also discuss different ways to do this but let's focus on more important things first for this set of points we apply the same procedure as in the high dimensional space which gives us another K nearest neighbor graph with weighted edges which we call L now we want to change this low-dimensional graph such that it approximates the high dimensional one as good as possible for this we need to compare the two graphs somehow I don't

### [19:55](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=1195s) Comparing graphs

know if you ever tried to compare graphs but it's a challenging problem here we luckily have a one toone correspondence for the nodes because we can simply say this high-dimensional point belongs to this low dimensional Point as a result the only thing we actually need to compare are the edges or more specifically The Edge weights when taking a look at the umap python implementation we can see that the comparison between these graphs is nothing else but the comparison of their adjacency matrices these are the adjacency matrices of our two graphs and the entries can be simply interpreted as Edge probabilities I highlighted the edges in different colors to show that the graphs can have a different number of edges and therefore it doesn't suffice to only compare the existing ones but also the non-existing edges in other words The adjacency Matrix will consider everything at the same time and therefore it's the only reasonable Choice here by moving the lowest lower dimensional adjacency Matrix closer to the high dimensional one we end up with a graph that approximates the original graph H okay so how do we actually compare the two adjacency matrices do we substract them from each other or what is the right way the edges are simply beri

### [21:15](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=1275s) Cross entropy loss

variables and therefore it makes sense to use the cross entropy over all existing edges let's inspect this formula closer cross entropy simply compares The Edge probabilities for the high and low dimensional graph for both cases when edges exist and when edges don't exist on the left are all of the existing edges and on the right The adjacency Matrix shows one minus the edges which gives values for the non-existing edges the first term then simply makes sure that the low dimensional points are close to each other when they have a high Edge probability and the second term makes sure that the points withow low probabilities will be far away in the embedding space using this loss function we can simultaneously optimize the position of all points you probably already guessed it we can adjust the coordinates of the low dimensional points using stochastic radient descent

### [22:14](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=1334s) Attractive and repulsive forces

using this loss function the authors mentioned that corresponds to a force directed graph layout algorithm this simply means that the points push and pull each other until a steady state is reached we've also seen this in the tne video in case you're interested it is even possible to derive the exact attractive and repulsive forces given this optimization function Yi and YJ correspond to the coordinates of two low dimensional points for edges between these points with higher weights the attractive forces are stronger whereas lower weighted edges have stronger repulsive forces you can find more details about this in the umap paper so those were the

### [22:56](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=1376s) More details

most important aspect effects of umap and before we do a little bit of Hands-On coding let me quickly talk about four more minor things that are used in the actual implementation we haven't talked about how the K nearest neighbors are actually found as this needs to be done many times and therefore should be fast the efficient kous neighbor descent algorithm is used it has an empirical complexity that is close to linear it's an approximation by only looking at neighbors of neighbors instead of the full data set next the actual loss function is slightly simplified as some of the terms can be removed additionally negative sampling is used to not deal with all negative edges in The adjacency Matrix this speeds up the optimization finally the initialization of the low dimensional points is not random but the authors use a technique called spectral embedding I won't go into the details here but you can simply think of it as a technique designed for graphs or most specifically their adjacency Matrix and this leads to a better starting position for the low dimensional points now let's

### [24:04](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=1444s) Code

do a little bit of Hands-On coding and before that I can recommend to take a look at this website which is built by different researchers from Google it's an excellent playground to experiment with umap hyperparameters the certainly most important hyperparameter of umap is the number of neighbors this interactive visualization helps to get a feeling for it the second most important parameter is minist which controls how tightly the points Clump together okay so just like before we run the data set cell in order to have access to our data set so as you can see we import umap from umap so not from Psychic learn and the reason for that is that umap has not been added to Cy learn so far and there have been some tickets where people ask if umap will be added in the future which would be of course quite convenient but the justification also makes sense uh which is simply that the umap package itself is already nicely maintained and there are not many reasons to include it in cycle learn okay so we can use umap using the umap class and then we Define the number of components in this example we have three because we have three dimensions down here and then number of neighbors is how many nearest neighbors we want to use in that algorithm and then mindis defines how closely the points can be placed next to each other in the low dimensional embedding and then we can also use metric which uh defines the metric we want to use to calculate the distance between the points and if you want or if you need more help with the parameters the umap documentation is excellent and there is also a lot of examples and those four are basically the main parameters of um map so it's quite simple and when we run this we can also call fit transform like before and this leads to Arrangement like this and we can also do it for the two-dimensional space

### [26:28](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=1588s) t-SNE vs. UMAP

to finalize this video Let's quickly revisit the connection between tne and umap and tne are quite similar in many aspects as they both create low dimensional embeddings based on the nearest Neighbors when we look at the distance functions we spot a few similarities both use exponential decay and have a sigma to control the intensity umap in addition has the local connectivity constraint which has beneficial effects for maintaining the global structure in tne we used perplexity to adjust the resolution scale between more local and more Global views in umap we can use the number of nearest Neighbors which is not just more intuitive but also easier to pick there's literature that discusses more connections between those two and it is argued that umap is simply a special case of tney let's

### [27:24](https://www.youtube.com/watch?v=iPV7mLaFWyE&t=1644s) Outro

complete this series by filling out the comparison table that we used throughout the videos umap is a local and nonlinear manifold learning technique that's because we try to approximate the manifold using the nearest neighbors of each point the main purpose is visualization and data analysis it is not deterministic because there are stochastic components that come from the approximations used to speed up performance the computational complexity of umap is n to the power of 1. 14 and this bound comes from the nearest neighbor descent algorithm and was estimated empirically we also discussed the two most important hyperparameters the number of neighbors and mindist the key idea is motivated mathematically and aims to build a fuzzy simpal complex and this in the end leads to a canorous neighbor graph umap has become very popular in bioinformatics but beyond that I haven't looked into any other applications congratulations with this you have successfully completed the dimensionality reduction Series in total this summed up to around 2 hours of videos and I hope it served as a solid foundation to get you started with these four techniques we have however just scratched the surface and there are plenty of other methods out there maybe I will add more videos to this playlist in the future so thank you for watching and see you soon

---
*Источник: https://ekstraktznaniy.ru/video/52958*