# Introduction to Clustering and K-means Algorithm

## Metadata

- **Channel:** Hedu AI by Batool Haider
- **YouTube:** https://www.youtube.com/watch?v=7Qv0cmJ6FsI
- **Source:** https://ekstraktznaniy.ru/video/44650

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

Hello everybody, welcome to the Introduction to Clustering Using K-means course. You're about to embark on a journey that will teach you one of the most powerful techniques in the world of data science. In this course you will be exposed to the science of clustering, and you will then learn the working principles behind a very powerful clustering algorithm called K-means. Finally, in the video that follows this one, you will learn how to write code that makes a computer distinguish between various species of a flower entirely on its own. So let's get started.

So what is clustering? In the simplest words, clustering is grouping similar objects together, and with this powerful technique you can do machine learning magic. This includes training your computer to, for instance, differentiate between different species of flowers, to cluster articles on similar subjects together, as search engines do, or to distinguish between different stages of a cancer tumor, provided the relevant information. Though all of these seem like bizarrely different projects, they all rely on clustering.

But before we dive deeper into the mathematical details of clustering, or indeed of any machine learning algorithm, there are two very important points to keep in mind. Let's discuss them using an interesting example. Say you are given various characters from The Simpsons cartoon, and your job is to divide these characters into two groups based on how similar they appear to each other, and to explain the reason behind the similarity. You can do that simply by dragging the characters into each group, like so; once you're done, click the "proceed to discussion" button to continue. While doing this exercise you must have realized that there is no unique, fixed grouping; in fact, you could group these characters in several different ways. For instance, one possible grouping divides the characters into Simpsons family members and school employees. A second possible division could be females and males. You could even group them as kids and adults, or separate them by hairstyle, posture, and so on.

With this we come to the first important note. Belle writes in his book that you need to know the question you're trying to answer, and the same point has been reiterated by several other machine learning groups. That is to say: on what basis do you want to cluster your data? Doing this automatically is the job of a very famous and powerful tool we call K-means clustering. Something quite interesting about this technique is that it is based on a simple formula. If you look closely, the D you see here is a distance; in our example we will use simple geometric distance, also termed Euclidean distance. The second thing to observe are the two summation signs and the two index variables i and j. Upon carefully expanding this formula, we arrive at the following expression, which shows that this D is actually a combination of two different distances. Let me make this simpler for you: the complex middle expression can be better explained in terms of two distances, W, which represents the distances within each cluster, and B, which

### Segment 2 (05:00 - 10:00) [5:00]

represents the distances between the clusters. Now, what does that even mean? Let's assume we are given data points belonging to three different clusters. One of the objectives of the K-means algorithm is to assign each cluster its respective centroid; centroids are the centers of the clusters, and here they are marked as these dotted hexagonal objects. B represents the distances between the centroids, while W represents the distances within each cluster, between its centroid and its data points.

Let's try solidifying our understanding with an example. Suppose we are looking at the heart rates and ages of various patients suspected of having certain heart defects, and suppose this is the data set we have, where each data point represents a patient. Assume we already know that the patients who fall in this blue cluster are normal people without any heart defect, while the ones with higher-than-normal heart rates fall in this red cluster; these are the patients with some sort of cardiac arrhythmia. Now, while we could group these data points into two clusters visually, given some domain knowledge, in reality we have thousands of data points with multiple attributes, so it becomes nearly impossible to do this by visual inspection. We thus need a clustering algorithm such as K-means, which can do it for us automatically.

K-means works only with distances and does not need the axes, so let's remove them. Also remember that the number of clusters you want to form is your choice; in this case we will choose to form two clusters, so K-means will use two centroids. At first the algorithm places these centroids randomly, simply because it does not know where the center of each cluster is. Next, the distances B come into play: a straight line is drawn between the two centroids, and then a perpendicular bisector divides this line into two halves. This is called the boundary line, and its purpose is to demarcate the regions of the two clusters. Any data point that lies to the left of this boundary line, that is, closer to the blue centroid, is marked as a member of the blue cluster, while any data point closer to the red centroid is marked as a member of the red cluster. This is the clustering of the data set you will see after this first step.

But as you can see, some data points don't seem to belong in the red zone; they are closer to the blue cluster and should have been placed there. This is where K-means iterations come into play. Iterating simply means repeating the same steps with the intention of getting closer to the desired result with every pass. In iteration one, the distances within each cluster are calculated, like so. Once these distances are computed, the centroid is pulled towards the center of its cluster using the mean of these distances. The same happens with the red centroid: all the distances are computed as shown, and the red centroid too moves closer to the center of its respective cluster. The step of computing the distance between the two centroids is then repeated and a new boundary line is drawn; note that since the centroids have shifted, so has the boundary line. The two data points that were previously in the red zone now lie to the left of the boundary line, so they will be placed in the blue cluster. This is how the clustering looks after the first iteration. As you can see, there is still one data point that can shift the positions of the two cluster centroids, so the algorithm proceeds to a second iteration, where the distances are recomputed, the centroids are moved once again, the boundary line is reconstructed, and the points are reclustered. This is how the result looks after the second iteration. So when do the iterations of K-means stop? Click the option that you think should
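The steps just described, random starting centroids, assignment of each point to its nearest centroid (the boundary line), and pulling each centroid to the mean of its cluster, can be sketched in a few lines of NumPy. This is an illustrative toy, not the course's actual code; the "heart rate vs age" numbers below are made up for the example.

```python
import numpy as np

def kmeans(points, k=2, n_iter=10, seed=0):
    """Minimal K-means sketch: pick k random data points as the initial
    centroids, then repeat the assign/update steps a fixed number of times."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest
        # centroid (for k=2 this is exactly the perpendicular-bisector
        # boundary line described above).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid is pulled to the mean of its cluster.
        centroids = np.array([points[labels == j].mean(axis=0)
                              for j in range(k)])
    return labels, centroids

# Made-up "heart rate, age" data: three normal patients followed by
# three with elevated heart rates.
data = np.array([[60.0, 30.0], [62.0, 35.0], [65.0, 40.0],
                 [95.0, 50.0], [100.0, 55.0], [98.0, 45.0]])
labels, centroids = kmeans(data, k=2)
print(labels)
```

With well-separated data like this, the first three patients end up in one cluster and the last three in the other; which cluster happens to get label 0 depends on the random initialization.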

### Segment 3 (10:00 - 10:00) [10:00]

apply. With every movement of the centroids, the position of the boundary line changes, and this keeps happening until there are no more changes in the positions of the centroids or the boundary line. The iterations stop when the boundary's position changes by no more than a small tolerance value over the following iterations; when this happens, the algorithm is said to have converged, and this is how the final clustering looks. Well, now that you're well aware of the working principles of K-means clustering, in the next video we will get our hands dirty with some actual code and real data, so stay tuned.
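The stopping rule can be made concrete by tracking how far the centroids move between iterations and halting once that movement falls below a tolerance. The sketch below is illustrative; the tolerance value `1e-4` and the toy patient data are arbitrary choices, not from the video.

```python
import numpy as np

def kmeans_until_converged(points, k=2, tol=1e-4, max_iter=100, seed=0):
    """K-means that iterates until the centroids (and therefore the
    boundary line) shift by less than `tol`, i.e. until convergence."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for iteration in range(1, max_iter + 1):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break  # converged: the centroids have effectively stopped moving
    return labels, centroids, iteration

# Made-up "heart rate, age" patient data for illustration.
data = np.array([[60.0, 30.0], [62.0, 35.0], [65.0, 40.0],
                 [95.0, 50.0], [100.0, 55.0], [98.0, 45.0]])
labels, centroids, n_iter = kmeans_until_converged(data)
print(n_iter)
```

Library implementations work the same way; for instance, scikit-learn's `KMeans` estimator exposes a similar `tol` parameter for declaring convergence.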
