# Multidimensional Scaling (MDS) | Dimensionality Reduction Techniques  (3/5)

## Метаданные

- **Канал:** DeepFindr
- **YouTube:** https://www.youtube.com/watch?v=VKSJayDi_lQ
- **Дата:** 16.01.2024
- **Длительность:** 30:12
- **Просмотры:** 16,505
- **Источник:** https://ekstraktznaniy.ru/video/52960

## Описание

To try everything Brilliant has to offer—free—for a full 30 days, visit https://brilliant.org/DeepFindr . The first 200 of you will get 20% off Brilliant’s annual premium subscription

▬▬ Papers / Resources ▬▬▬
Colab Notebook: https://colab.research.google.com/drive/1n_kdyXsA60djl-nTSUxLQTZuKcxkMA83?usp=sharing

Kruskal Paper 1964: http://cda.psych.uiuc.edu/psychometrika_highly_cited_articles/kruskal_1964a.pdf
Very old MDS Website: http://www.analytictech.com/borgatti/mds.htm
Rearrange Scalar Product: 
- https://www.sjsu.edu/faculty/guangliang.chen/Math250/lec10mds.pdf
- https://github.com/drewwilimitis/Manifold-Learning/blob/master/Multidimensional_Scaling.ipynb
Metrics Blog Post: https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa
Proof that MDS has no unique solution: https://github.com/drewwilimitis/Manifold-Learning/blob/master/Multidimensional_Scaling.ipynb

Best MDS Tutorials:
- https://www.jstatsoft.org/article/download/v073i08/1052
- https://rich-d-

## Транскрипт

### Introduction []

hi everyone welcome back to part three of this dimensionality reduction Series today we will have a look at multi-dimensional scaling or short MDS it might not be too popular in machine learning but it's definitely popular in statistics and has its roots in the 1950s in a publication by Tuson since then there have been many significant improvements of the technique and therefore MDS can be viewed as a class of techniques instead of a single method in this video we will discuss three variants of MDS just like before we look into the algorithmic details and also some code for this video I've been

### Why learning old stuff? [0:36]

digging into details on some pretty old websites and you might think why should I even bother looking at these old methods many of the ideas from last century are still relevant in even the basis for machine learning today I came across this quote by Shephard one of the MDS authors which says the idea of representing objects as points in space had a occurred to me while I was an undergraduate student in 1951 the embeddings we use in large neuron networks today are nothing else but a modern version of this idea therefore I think it's important to familiarize with some of the classical methods because many ideas of today have their inspiration from the past let's

### Basics of MDS [1:19]

start with a short introduction what multi-dimensional scaling is and what we need it for most of the dimensionality reduction techniques the starting point is a data Matrix with features along the columns and observations in the rows for MDS it's a bit different because we assume to start with a distance Matrix this Matrix contains the distances between all points based on some distance function just like in the literature we can use Delta to refer to the entries of this Matrix of course we can always calculate the distance Matrix from the data but MDS also works if we don't have access to the raw features we will see in a second where this is assumption comes from then the idea is like for all dimensionality reduction techniques to map the data into a lower dimensional space but with the goal of preserving the distances between the observations so the spatial configuration on the right is supposed to reflect the distances in our distance Matrix this is also where the name comes from we want to scale our distances into the multi-dimensional space on the right

### Example from Psychometrics [2:26]

let's use an example to understand why we start from distance MDS has its origins in psychometrics where it was proposed as a method that should help to better understand the human Judgment of similarity we can for example collect rankings of the similarity of different pairs of emotions here we have a surprised happy and sad face the numbers represent if people find two emotions similar or not for instance surprised and happy seem to be very similar with the value of 0. 7 we don't have a data Matrix for these observations which means we don't have numeric features describing the emotions this means that the actual data space is latent to us using MDS we can now visualize the emotional space only based on these distances and this visualization might for example reveal that joy and surprise are often perceived as similar similarities can always be converted to dissimilarity and this similarities reflect distances which are used in MDS nowadays there are some mathematical differences between similarity and dissimilarity and also numerically similarity tends to have more close to zero values but we won't really discuss this further and just keep in mind that MDS typically operates

### Symmetry of Distance Matrix [3:50]

on dissimilarities finally distance matrices always need to be symmetric for this method therefore we can just use the upper or lower triangular part of this Matrix also the diagonal can be ignored as the distance of a point to itself is always zero so these entries are what we end up with the number of distances for n observations can also be calculated using the gausian sum formula on the right now that we have a rough

### Mathematical Framework [4:17]

idea what MDS is all about let's take a look at the mathematical framework for this we will use the standard example for all MDS tutorials flight distance between cities this is what Well Suited because it allows us to intuitively define the relationships as distances or dissimilarities here we have five cities and their distance in miles in a symmetric distance Matrix given such a distance Matrix the task is to find the optimal distance preserving low dimensional map finding such a map means that we need to find a set of vectors in a p dimensional space for the orange point we would for example look for these coordinates the challenging part is that we need to maintain the distance to all other points according to our distance Matrix and that's exactly what's meant by scaling the distances in MDS as mentioned at the beginning each of our distances in The Matrix is typically denoted with Lambda i j representing the distance from point I to J the low dimensional distances can also be displayed in the distance Matrix these distances are denoted with d IJ what we want to do in MDS is to approximate the distance Matrix such that each individual distance is approximately the distance of the original Matrix we can use an arrow measure like the square distances to optimize the approximation how to solve these approximations will be explained in the next minutes as mentioned in the beginning MDS is not one single algorithm but instead a collection of methods there

### 3 MDS Variants [5:58]

are different ways to categorize them but on a high level there are these three subcategories the first one was independently published by Togos and go and is called the classical MDS it shares a lot with principal component analysis and that's why it's also termed PCA principal coordinate analysis under the hood it uses igen decomposition to find the optimal mapping this projection only works if we assume the distances to be ukian a more general form is called metric MDS mainly published by Shephard and koscal which uses an iterative algorithm to find a solution here the distances might be non- uclean and in the following we will also discuss different metrics and what a metric is in general classical MDS can be seen as a special case of metric MDS finally non-metric MDS can be used in the case of qualitative relationships this is the case when we have no direct distance information but just an ordinal ranking for this a monotonic function like ascending or descending is applied to the distances just like with metric MDS an iterative optimization algorithm is used to find the solution so this is the high level overview of what we will discuss in this video just one more

### Symbolic notations [7:20]

remark if you look through the literature the notations can be quite confusing sometimes and I realize that they're not always consistent therefore here the explicit explanation for this video Lambda i j are the original distances DJ the distances in the mapping space and dij head are called disparities which will become relevant

### Classical MDS (potentially boring) [7:45]

in the last part of this video let's begin with the classical multi-dimensional scaling which is also known as principal coordinate analysis or tus and go MDS to fully understand this first method it's recommended to watch the previous PCA video as there will be some overlaps just one note if you want to learn what is happening under the hood when using the good old psychic learn MDS implementation feel free to skip to the metric MDS part instead of the classical decomposition approach explained in the next minutes metric MDS uses an iterative algorithm therefore the following explanation of classical MDS will be rather high level and we will focus on the psychic learning implementation the classical MDS algorithm consists of three steps first we construct a so-called gram Matrix then we perform igen decomposition on this gram Matrix and finally we project the distances onto new coordinates let's understand this step by step what is a gram Matrix and why do we need it for classical MDS we assume that the distances are ukian distances for two vectors XI and XJ the ukian norm corresponds to the inner or scalar product of the distances as an exercise you can easily calculate the ukian distance for these two two-dimensional vectors now the twist is that our situation is exactly the other way around we have the distance and want to find the vectors so the numbers in Orange here that generate our distance Matrix in order to find these vectors we need to formulate this expression such that we can obtain the coordinates from it the inner product can be rearranged such that we have this equation and for details how to rearrange the inner product have a look at the links in the description here we however still have some dependencies on XI and XJ these dependencies are however the scalea product of XI and XJ with themselves this is nothing else than the distance to the origin of the coordinate system consequently we only have distances on the right side of the equation and are completely independent of the cordinates of x i and XJ with this simple trick of linear algebra we can find the coordinates by calculating the elements on the left if we do this for all points we end up with the inner product Matrix which is also called gram Matrix named after the Danish mathematician Jorgen Graham so here we have the close form equation that defines the gram Matrix S technically this is nothing else but the previous term written in Matrix representation the D squ is the result of the element wise multiplication of all distance elements additionally we can see that the so-called centering Matrix C is Multiplied twice it's simply removing the row and column averages such that the rows and columns are centered this centering is needed to incorporate transl ational invariance this means we want to be invariant to shifts of the points meaning that the distances should stay the same if the coordinate system moves as a summary we have to go in the reverse direction to recover the data Matrix from the distances because the gram Matrix is symmetric and positive definite it can be diagonalized through igen decomposition so the next step should be familiar to you we can use the igen vectors to project the data just like in the last video keep however in mind that this classical PCA approach only works if we have ukian distances because otherwise we can't do the reformulations we just did the final step is to use the igen vectors to project the gram Matrix to find the points of our configuration for this we multiply the IG vectors with the square root of the gram Matrix the square root naturally arises when we reformulate the igen problem in order to reduce dimension ality we can just keep the first and igen vectors belonging to the largest igen values here it's just like with PCA the first coordinate contains the largest variation and explains most of the distance relationship one remark about this solution it's not unique because the coordinates of the configuration can be rotated or reflected and still lead to the same distance Matrix so in this reverse process all information about location and orientation is lost when calculating the distances after this

### Metric MDS [12:31]

quick PCA refresher let's move on with the most widely used version of MDS called metric MDS it's based on a least squares loss function and sometimes also called least squares MDS previously we had the assumption that the distance Matrix is ukian but it turns out that there are plenty of other distance matric available so metric MDS is a more General algorithm that finds a mapping for any distance function the downside is that we can't make use of any closed form solution like in the previous section and therefore metric MDS uses an iterative optimization algorithm you can see a couple of common distance metrics beyond the ukian distance as we will see later we can pass the algorithm any distance Matrix and get a good low dimensional approximation before we however look at the details let's first Define the term distance more precisely mathematically a distance metric needs to satisfy three axum first distance needs to be positive and the distance to a point itself zero secondly distance needs to be symmetric so the distance from I to J is the same as from J to I and finally a distance matric needs to satisfy the triangle inequality which means that the sum of the shorter distances between three points always needs to be larger than the longest distance like visualized in this triangle importantly a distance function is called metric distance if it satisfies the triangle inequality if not it's called a non-metric distance that's also where the names metric MDS and nonmetric MDS come from if you want to learn more about metrics especially which metrics exist I've linked the great blog post in the video description now let's talk about the metric MDS algorithm as already mentioned it's an iterative method the steps can be outlined like this first we need to initialize the low dimensional points and this can happen randomly but there are also more advanced strategies then we need to calculate a distance Matrix for the low dimensional map and don't mix this up the original distances can still have any distance metric but the low dimensional approximation is usually displayed as ukian distances because this is simply most intuitive for humans in the next step a loss function called stress is calculated which compares the low dimensional distances with the provided distances and finally some optimization procedure like gradient descent is used to minimize the error step one and two don't need any further elaboration so let's talk about the loss function of this optimization procedure what is the stress function the raw stress function is defined through this equation it's a simple Square distance between the original distance Matrix and the distance Matrix of the low dimensional configuration the summation in the stress formula implies that we can exploit the symmetry so visually the above term is simply the square distance between the leftover Matrix elements pretty straightforward so far this stress function is one of the most significant improvements of MDS Shephard first presented the idea and his colleague cuscal finally proposed this loss function the word stress simply stands for standardized residual sum of squares and is a variance measure an improvement suggested by cuscal is to additionally normalize the stress this raw stress function is sensitive to the scaling of distances if another scale is applied the stress value changes to work around that the stress function has a normalization term which reduces the distances to the same unit the denominator is simply a scaling factor that makes the whole term invariant under uniform stretching or shrinking so this is another Improvement that makes the method more robust another convenient property is that normalized stress is between zero and one KCAL also provided an empirical evaluation that suggests which values of normalized stress are acceptable generally speaking every stress value higher than 20% indicates a very bad fit of the real distances there are several variants and extensions of the stress function the lve introduced weights WJ that indicates the importance of a distance a nice side effect is that these weights can be used to handle missing data for all missing entries the weights are simply set to zero salmon stress is another MDS Arrow function presented by John salmon in 1969 and this loss function is better with preserving small distances by giving them more weights in the fitting procedure besides the empirical orientation using the stress values from koscal there's another way to assess the goodness of the MDS fit the Shephard diagram named after its inventor Shephard is a scatter plot between the actual distances and the distances in the low dimensional space an ideal fit would have all points on the diagonal let's discuss the last step of metric MDS how do we minimize the stress function in general pretty much any optimization algorithm of choice can be used KCAL originally proposed a gradient descent like algorithm similar like the one used in your networks gradient descent however doesn't guarantee a good solution as the stress objective function is non-convex one of the most important contributions of the to MDS is the smack of algorithm this optimization approach stands for scaling by majorizing a complicated function this class of methods can also be found under the name minimization by majorization and that's exactly what psychic learn uses when you call the fit function so what is this majorization in a nutshell this approach exploits the convexity of a function in order to find the minimum as stress is a non-convex function a convex surrogate function is defined to minimize the objective this means that stress is wrapped in a simpler function the algorithm then iteratively minimizes the majorized stress function through supporting points the advantage of smackoff is that it guarantees a monotone convergence you can think of it like sandwiching the stress function in order to minimize it without actually touching it the exact details of this method won't fit into this video but there are plenty of websites to read more about it there's also a visualization of metric MDS in kar's original paper from 1958 on the left is the starting configuration and on the right the final configuration after 50 iterations as with all optimization algorithms local Minima are a problem and is recommended to run MDS several times with different starting configurations one last open question for metric MDS is how do we choose the optimal dimension for the low dimensional conf configuration here we find an answer again in the cuscal paper we simply run the algorithm several times and plot the stress values versus the dimension effectively this is a scre lot like in PCA and we can now make use of the elbow Criterion to find the dimension with a sufficient approximation it's natural that higher Dimension leads to smaller stress values because we have more space to position the points before we go ahead and

### Brilliant.org Sponsoring [20:25]

Implement MDS for our custom data set let's have another look at the sponsor of this video series which is brilliant. org brilliant is a free and easy way to learn math computer science statistics coding and much more interactively let's say for example you need a refresher on linear algebra then you can simply go to the foundational math section and try the vectors course as brilliant is also available on your phone you can integrate these courses into your routine and become better day by day I'm a big fan of interactive courses as they simply maintain a much higher level of attention and as you can see here it's actually a lot of fun to go through the exercises another thing I really like about brilliant is that they have a variety of courses for very rare and interesting topics including Quantum Computing computational biology and many other Concepts in technology brilliant offers a free 30-day trial which is a great way to try it out you can use the link brilliant. org deepinder to get 20% off the premium

### Code [21:31]

subscription one thing I wanted to recommend is to take a look at the implementations of algorithms you are using for MDS for instance you can precisely see what is happening under the hood if you scroll up a bit you can see that there is a function called smackoff and also one called smackoff single and here you can see what exactly is happening when you fit this function so we will talk about non-metric MDS in a second but if you use metric MDS you simply use the similarities or the distances and additionally you calculate the distances as ukian distances for the low dimensional configuration and down here you can see that the stress value is computed and we have a couple more steps optimizing the values and that's pretty much it so all that is happening is mainly what I've just described now as always let's have a look at this in action and again let's load the wine data set and the visualization functions that such that we can see what the embeddings look like going to the MDS section there's one detail which is quite useful you can pass the MDS algorithm precomputed um distances and this is what we will also do so we won't let the algorithm automatically calculate distances but instead we want to start with a distance Matrix so using our wi data set this calculates distance Matrix and this has a shape of 178 * 178 because that's the number of samples we have and typically Manhattan distance which is the L1 Norm works better in higher Dimensions than the ukian distance and here is just um a heat map of the distances only visualizing the lower triangular and given that we can now pass that into the MDS implementation and for this we instantiate MDS and we say first we want a 3D um embedding and we say we don't want to use normalized stress but we use metric MDS and we pass in pre-computed because we want to give it the distance Matrix we've just calculated here and then what you give it when you fit the Matrix is exactly this distance Matrix so there are two modes either you pass it a pre-computed distance Matrix or you pass it the raw features then it will automatically calculate the metric you specify here for example you can put in ukle in here and it will calculate that for you okay so we when we run this it will give us this visualization and I think for higher dimensional data MDS is not perfectly suited because distances are simply a problem in higher spaces but this is still acceptable and as I put here Manhattan distance typically works better than ukan distance and there you can't use normalized stress and Metric MDS at the same time I'm not entirely sure why that is the case but because of that we have to print the raw stress it would be better if we had the normalized stress which is between 0 and one but still this is the stress value so that's the overall deviation between the low dimensional mapping and the original distances and then we can run this for two Dimensions as well and we get this plot the last technique discussed in

### Non-metric MDS [25:20]

this video is the non-metric variant of MDS non-metric distances are distances that violate the triangle inequality we can easily construct such an example by defining the distance as zero whenever the same animal occurs on two images and one if the animal doesn't occur this certainly violates the inequality because the two upper distances are zero while the one on the bottom is one another common example is ordinal data whenever we rank data according to similar ilarity it can also happen easily that the triangle inequality is violated that's where the name non-metric MDS comes from ordinal data is however quite useful because it allows us to express distances simply based on comparisons and comparisons are most of the time easier to calculate than actual distance values the goal of ordinal MDS is therefore to find a configuration that preserves the order of the distances like in the original distance Matrix this order can be simply defined by arranging all distances in an increasing order in this example the distances between blue and yellow are the smallest in The Matrix and therefore also need to be the smallest in the low dimensional map the important question now is how do we ensure that the distances order is preserved for this KCAL has introduced a monotonic function that Maps the original distances to so-called disparities denoted by dad generally speaking a monotonic function is a function that preserves order given the original distances the function output should also yield the same order therefore the function f is always non-increasing or always non decreasing on the right you can see a collection of some monotonic functions and we are particularly interested in the ordinal one these functions of course have some par that need to be learned in the case of the ordinal function these parameters can be learned using least squares monotonic regression or also called isotonic regression fitting such an ordinal function looks like the red line in this image the monotonic regression can be solved using an iterative algorithm called pool adjacent violators algorithm but we won't go into further details here using the fitted line we can now compare the low dimensional distances against the disparities such that we optimize for the rank order and not the actual distance values the optimization procedure of MDS only changes slightly by using the disparities predicted from the monotonic regression model the rest basically stays the same meaning that we also use smackoff to optimize the stress function consequently in ordinal MDS we learn both the parameters of the monotonic function as well as the coordinates of the data points to finish

### Summary Table [28:26]

this video Let's further extend our comparison table MDS is a global technique as we consider all points at the same time classical MDS is just like PCA a linear method the iterative metric and non-metric MDS are both nonlinear classical MDS or PCA is a projection based method the others are manifold learning techniques the main purpose is visualization of the distance Matrix but also mapping to the ukian space classical MDS is deterministic the other variants however aren't because of the randomly chosen starting points classical MDS has a cubic complexity as it needs to find the igen pairs of the gram Matrix the other variants are roughly quadratic most of the time MDS is just used on smaller data sets there are however also approximate versions of the algorithms available the main hyperparameters are the used metrics The Chosen dimensionality the number of iterations and the stress tolerance Epsilon the key ideas are igen composition in the case of classical MDS and an iterative algorithm that minimizes the square distances in the other cases finally MDS has many applications among others also Gene clustering in computational biology that's it for this video hopefully this helped you in some way to shed some light on multi-dimensional scaling of course I just scratched the surface in this video and therefore feel free to have a look at the links in the description if you want to learn more thanks for watching and see you soon in the next video where we will discuss tne