# How to implement K-Means from scratch with Python

## Metadata

- **Channel:** AssemblyAI
- **YouTube:** https://www.youtube.com/watch?v=6UF5Ysk_2gk
- **Date:** 21.09.2022
- **Duration:** 23:41
- **Views:** 23,776

## Description

In the 10th lesson of the Machine Learning from Scratch course, we will learn how to implement the K-Means algorithm.

You can find the code here: https://github.com/AssemblyAI-Examples/Machine-Learning-From-Scratch

Previous lesson: https://youtu.be/T9UcK-TxQGw
First lesson: https://youtu.be/rTEtEy5o3X0

Welcome to the Machine Learning from Scratch course by AssemblyAI.
Thanks to libraries like Scikit-learn we can use most ML algorithms with a couple of lines of code. But knowing how these algorithms work inside is very important. Implementing them hands-on is a great way to achieve this. 

And mostly, they are easier to implement than you'd think.

In this course, we will learn how to implement these 10 algorithms.
We will quickly go through how the algorithms work and then implement them in Python with the help of NumPy.

▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬▬

🖥️ Website: https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=scratch10
🐦 Twitter: https://twitter.com/AssemblyAI
🦾 Discord: https://discord.gg/Cd8MyVJAXd
▶️  Subscribe: https://www.youtube.com/c/AssemblyAI?sub_confirmation=1
🔥 We're hiring! Check our open roles: https://www.assemblyai.com/careers

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

#MachineLearning #DeepLearning

## Contents

### [0:00](https://www.youtube.com/watch?v=6UF5Ysk_2gk) Introduction

Welcome to the last video of the Machine Learning from Scratch course, presented by AssemblyAI. In this series, we implement popular machine learning algorithms using only built-in Python functions and NumPy. In this lesson, we learn about K-Means.

### [0:16](https://www.youtube.com/watch?v=6UF5Ysk_2gk&t=16s) Theory

We start with a short theory section and then jump to the code. K-Means is an unsupervised learning method, which means we work with unlabeled data. It clusters a dataset into k different clusters: each sample is assigned to the cluster with the nearest mean, and then the means (also called centroids) and the cluster assignments are updated in an iterative optimization process.

### [0:39](https://www.youtube.com/watch?v=6UF5Ysk_2gk&t=39s) Iterative Optimization Process

The iterative optimization process is simple. First, we initialize the cluster centers randomly. Then we repeat two steps until convergence, i.e., until there is no more change: we update the cluster labels by assigning each point to the nearest cluster center, and then we update the cluster centers by setting each center to the mean of its cluster.

Let's look at an example to make this clearer. We start with unlabeled data, and say we want to find three clusters, so we randomly initialize three centroids (shown as black "x" markers). Now we assign each point the label of its nearest centroid, which gives us three clusters, and then we update the cluster centers by calculating the mean of each cluster: the blue center moves toward the middle of the blue points, the orange one moves down, and the green one ends up in the middle of the green points. Then we again update the cluster labels by assigning each point to the nearest center; for example, the upper group of points now becomes orange because the orange centroid is closest to it. We keep alternating: update the centroids, update the labels, update the centroids again, until at some point nothing changes anymore. At that point we can stop the algorithm, and we are done. So all we have to do is initialize the centroids and then repeat these two steps. To decide which centroid is nearest, we need a distance measure.
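The two alternating steps above can be sketched as standalone NumPy functions. The function names here are my own (the video builds a class instead, later in the lesson); this is just a minimal illustration of the idea:

```python
import numpy as np

def assign_labels(X, centroids):
    # Step 1: label each sample with the index of its nearest centroid.
    # Broadcasting gives a (n_samples, k) matrix of distances.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(distances, axis=1)

def update_centroids(X, labels, k):
    # Step 2: move each centroid to the mean of the samples assigned to it.
    return np.array([X[labels == c].mean(axis=0) for c in range(k)])
```

Repeating these two steps until the centroids stop moving is the entire algorithm.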

### [3:33](https://www.youtube.com/watch?v=6UF5Ysk_2gk&t=213s) The Euclidean Distance

For that, we use the Euclidean distance: the square root of the sum of the squared differences between two feature vectors, `sqrt(sum((x1 - x2)^2))`.

Now let's jump to the code. First, `import numpy as np`, and then we create a class `KMeans`. Its `__init__` function takes `K`, the number of clusters (by default 5), `max_iters` (by default 100), and `plot_steps`, in case we want to plot the clusters and centroids at each step of the optimization. We store these as `self.K`, `self.max_iters`, and `self.plot_steps`. We also initialize empty clusters: `self.clusters` is a list comprehension that puts an empty list in place for each of the K clusters; later, these lists will hold the sample indices belonging to each cluster. Finally, we initialize `self.centroids` as an empty list; it will hold the centers, i.e., the mean feature vector of each cluster.

Then we define the `predict` function with `self` and `X`. There is no `fit` method, so we can do the prediction right away, and there is no `y`, because this is an unsupervised learning technique with unlabeled data. First we save `self.X = X`, so the other helper functions can access it, and then we set `self.n_samples, self.n_features = X.shape` (we assume `X` is already a NumPy ndarray).
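A sketch of this skeleton, using the names from the video (the `plot_steps=False` default is my assumption, since the video doesn't state one):

```python
import numpy as np

class KMeans:
    def __init__(self, K=5, max_iters=100, plot_steps=False):
        self.K = K
        self.max_iters = max_iters
        self.plot_steps = plot_steps
        # list of sample indices for each cluster
        self.clusters = [[] for _ in range(self.K)]
        # the centers (mean feature vector) of each cluster
        self.centroids = []

    def predict(self, X):
        # no fit method and no y: K-Means is unsupervised,
        # so prediction starts directly from the raw samples
        self.X = X
        self.n_samples, self.n_features = X.shape
        # ... centroid initialization and optimization loop follow
```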

### [6:35](https://www.youtube.com/watch?v=6UF5Ysk_2gk&t=395s) Initialize the Centroids

Next we initialize the centroids by choosing K random samples from all the samples: `random_sample_idxs = np.random.choice(self.n_samples, self.K, replace=False)`, where `replace=False` ensures we get K different indices. Then we store `self.centroids = [self.X[idx] for idx in random_sample_idxs]`, giving us K different samples as our initial centroids.
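As a self-contained sketch (the function name is mine; in the video this code lives inside `predict`):

```python
import numpy as np

def init_centroids(X, k):
    # pick K distinct rows of X as the starting centroids;
    # replace=False guarantees K different indices
    n_samples = X.shape[0]
    random_sample_idxs = np.random.choice(n_samples, k, replace=False)
    return [X[idx] for idx in random_sample_idxs]
```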

### [7:46](https://www.youtube.com/watch?v=6UF5Ysk_2gk&t=466s) Optimization Loop

Then comes the optimization loop. After a comment, `# optimize clusters`, we write `for _ in range(self.max_iters):`. Of course we can stop earlier if there is no more change, but this caps the number of iterations. Inside the loop, we first assign the samples to the closest centroids, which creates the clusters: `self.clusters = self._create_clusters(self.centroids)`. Below the loop, we add this helper function, `_create_clusters(self, centroids)`, with just `pass` for now.

### [9:05](https://www.youtube.com/watch?v=6UF5Ysk_2gk&t=545s) Calculate the New Centroids from the Clusters

As the next step, we calculate the new centroids from the clusters. First we store the old centroids, `centroids_old = self.centroids`, so we can later compare them to check for convergence, and then we set `self.centroids = self._get_centroids(self.clusters)`. This is another helper function, `_get_centroids(self, clusters)`, again just `pass` for now. Once we have the new centroids, we check `if self._is_converged(centroids_old, self.centroids):` and break out of the loop if so. The stub `_is_converged(self, centroids_old, centroids)` also gets `pass` for now. Finally, I want a helper function to plot the centroids and the clusters.
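The loop structure, including the early exit on convergence, can be sketched generically. Here the `assign` and `recompute` callables stand in for the `_create_clusters` and `_get_centroids` helpers (the function name `optimize` and this callback style are my own framing, not the video's):

```python
import numpy as np

def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

def optimize(X, centroids, assign, recompute, max_iters=100):
    # repeat: assign samples to clusters, recompute the centroids,
    # and stop early once no centroid has moved
    for _ in range(max_iters):
        clusters = assign(X, centroids)
        centroids_old = centroids
        centroids = recompute(X, clusters)
        total_shift = sum(euclidean_distance(centroids_old[i], centroids[i])
                          for i in range(len(centroids)))
        if total_shift == 0:
            break
    return clusters, centroids
```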

### [10:43](https://www.youtube.com/watch?v=6UF5Ysk_2gk&t=643s) Plot the Centroids and the Clusters

The plot helper takes only `self` and is also `pass` for now. Back in the loop, we check `if self.plot_steps:` and call `self.plot()` after creating the clusters, and again after calculating the new centroids if we haven't converged yet. That completes the optimization loop. At the end of `predict`, we return the cluster labels, i.e., we classify each sample by the index of its cluster: `return self._get_cluster_labels(self.clusters)`. We add this last helper stub, `_get_cluster_labels(self, clusters)`, with `pass` for now. That finishes the `predict` method; now we only have to implement the helpers.

### [12:13](https://www.youtube.com/watch?v=6UF5Ysk_2gk&t=733s) Helper Functions

To recap the flow through the helper functions: first we initialize the centroids, then we repeatedly update the clusters and the centroids by calculating the means, then we check for convergence and stop if there is no more change, and in the end we return the cluster labels.

### [12:30](https://www.youtube.com/watch?v=6UF5Ysk_2gk&t=750s) Cluster Labels

For the cluster labels, each sample gets the label of the cluster it was assigned to. We start with `labels = np.empty(self.n_samples)`, then iterate `for cluster_idx, cluster in enumerate(clusters):`. Remember, `clusters` is a list containing, for each cluster, the list of its sample indices. So we iterate over each cluster with `for sample_idx in cluster:` and assign `labels[sample_idx] = cluster_idx`. Then we return the labels.

Next, `_create_clusters`, where we assign the samples to the closest centroids. We start with an empty list for each cluster, just as in `__init__`: `clusters = [[] for _ in range(self.K)]`. Then we assign the indices: we iterate `for idx, sample in enumerate(self.X):` (note `self.X`, not `self.K` — an easy typo to make), which gives us both the index and the sample. We get the centroid index with `centroid_idx = self._closest_centroid(sample, centroids)` and append the current sample index to that centroid's list: `clusters[centroid_idx].append(idx)`. In the end, we return the clusters.

That requires another helper, `_closest_centroid(self, sample, centroids)`. Here we compute the distance of the current sample to each centroid and pick the closest one: `distances = [euclidean_distance(sample, point) for point in centroids]`, then `closest_idx = np.argmin(distances)`, which we return. The `euclidean_distance` function itself we define as a global function at the top of the file.
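Taken together, the helpers described in this chapter look roughly like this, written as standalone functions rather than methods so the sketch is self-contained:

```python
import numpy as np

def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

def closest_centroid(sample, centroids):
    # index of the centroid with the smallest distance to this sample
    distances = [euclidean_distance(sample, point) for point in centroids]
    return np.argmin(distances)

def create_clusters(X, centroids):
    # one list of sample indices per centroid
    clusters = [[] for _ in range(len(centroids))]
    for idx, sample in enumerate(X):
        centroid_idx = closest_centroid(sample, centroids)
        clusters[centroid_idx].append(idx)
    return clusters

def get_cluster_labels(clusters, n_samples):
    # each sample gets the index of the cluster it was assigned to
    labels = np.empty(n_samples)
    for cluster_idx, cluster in enumerate(clusters):
        for sample_idx in cluster:
            labels[sample_idx] = cluster_idx
    return labels
```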

### [17:13](https://www.youtube.com/watch?v=6UF5Ysk_2gk&t=1033s) Euclidean Distance

The Euclidean distance between two points, or two feature vectors, `x1` and `x2`, is a one-liner: `return np.sqrt(np.sum((x1 - x2) ** 2))`.

Next is `_get_centroids`, where we assign the mean value of each cluster to its centroid. First we initialize `centroids = np.zeros((self.K, self.n_features))`, then iterate `for cluster_idx, cluster in enumerate(clusters):` over the input parameter. For each one, we compute the cluster mean, `cluster_mean = np.mean(self.X[cluster], axis=0)`, i.e., the mean over only the samples in this cluster along `axis=0`, and set `centroids[cluster_idx] = cluster_mean`. In the end we return the centroids.

For `_is_converged`, we check the distances between the old and new positions of all centroids; if they add up to zero, nothing moved and we return `True`: `distances = [euclidean_distance(centroids_old[i], centroids[i]) for i in range(self.K)]`, then `return sum(distances) == 0` (Python's built-in `sum` works here because `distances` is a list). The only missing piece now is the plot function.
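These last helpers, again sketched as standalone functions rather than methods:

```python
import numpy as np

def euclidean_distance(x1, x2):
    # square root of the sum of squared differences
    return np.sqrt(np.sum((x1 - x2) ** 2))

def get_centroids(X, clusters, k, n_features):
    # the new centroid of each cluster is the mean of its samples
    centroids = np.zeros((k, n_features))
    for cluster_idx, cluster in enumerate(clusters):
        centroids[cluster_idx] = np.mean(X[cluster], axis=0)
    return centroids

def is_converged(centroids_old, centroids, k):
    # converged when no centroid moved between iterations
    distances = [euclidean_distance(centroids_old[i], centroids[i])
                 for i in range(k)]
    return sum(distances) == 0
```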

### [20:42](https://www.youtube.com/watch?v=6UF5Ysk_2gk&t=1242s) The Plot Function

For the plot function, I already prepared the code; you can also find it on GitHub, the link is in the description. We `import matplotlib.pyplot as plt`, create a figure, scatter each cluster's points in a different color, and then scatter every centroid as a black "x" marker. That is all we need for our `KMeans` class, so now we can test the code.
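A sketch of that plot helper (as a standalone function, and using the Agg backend so it also runs headless; the exact styling in the video's version may differ):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

def plot_clusters(X, clusters, centroids):
    fig, ax = plt.subplots(figsize=(12, 8))
    # scatter each cluster's points in its own color
    for cluster in clusters:
        points = X[cluster].T
        ax.scatter(*points)
    # scatter each centroid as a black "x" marker
    for point in centroids:
        ax.scatter(*point, marker="x", color="black", linewidth=2)
    return fig
```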

### [21:17](https://www.youtube.com/watch?v=6UF5Ysk_2gk&t=1277s) Test the Code

For the testing, I also prepared a script in advance, which we simply copy and paste. Before running it, I noticed two errors: first, in `np.zeros` the shape has to be a tuple, and second, in `_get_cluster_labels` I had a typo in `for cluster_idx, cluster in enumerate(clusters)`. With those fixed, it should work. One more thing: we set `np.random.seed(42)` to make the run reproducible. Then we create blobs with `make_blobs` from `sklearn.datasets`, specifying three centers, get the number of clusters as the length of the unique labels in `y`, create our `KMeans` instance with that many clusters, a `max_iters` value, and `plot_steps=True`, call `predict`, and plot the final result.

Since `plot_steps=True`, the script plots every step: first the initial centroids, then the updated centroids, then the cluster labels, then centroids and labels alternating again, until there is no more change. Then we are done, and it works: we have implemented K-Means from scratch using only NumPy and built-in Python.

I really hope you enjoyed this lesson and the whole course. If you haven't watched the other videos, you can find the links to them and to the playlist in the description below. As mentioned, the code is on GitHub, so please check that out as well. I hope to see you in another video on the AssemblyAI channel. Bye!
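Putting all the pieces together, here is a compact end-to-end sketch. It mirrors the video's class but generates the blobs with plain NumPy instead of sklearn's `make_blobs` (so it runs without scikit-learn) and omits the plotting; the blob centers are arbitrary choices of mine:

```python
import numpy as np

def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

class KMeans:
    def __init__(self, K=5, max_iters=100):
        self.K = K
        self.max_iters = max_iters

    def predict(self, X):
        self.X = X
        self.n_samples, self.n_features = X.shape
        # initialize centroids as K random distinct samples
        idxs = np.random.choice(self.n_samples, self.K, replace=False)
        self.centroids = [self.X[i] for i in idxs]
        for _ in range(self.max_iters):
            clusters = self._create_clusters(self.centroids)
            centroids_old = self.centroids
            self.centroids = self._get_centroids(clusters)
            if self._is_converged(centroids_old, self.centroids):
                break
        return self._get_cluster_labels(clusters)

    def _create_clusters(self, centroids):
        clusters = [[] for _ in range(self.K)]
        for idx, sample in enumerate(self.X):
            distances = [euclidean_distance(sample, p) for p in centroids]
            clusters[np.argmin(distances)].append(idx)
        return clusters

    def _get_centroids(self, clusters):
        centroids = np.zeros((self.K, self.n_features))
        for cluster_idx, cluster in enumerate(clusters):
            centroids[cluster_idx] = np.mean(self.X[cluster], axis=0)
        return centroids

    def _is_converged(self, centroids_old, centroids):
        distances = [euclidean_distance(centroids_old[i], centroids[i])
                     for i in range(self.K)]
        return sum(distances) == 0

    def _get_cluster_labels(self, clusters):
        labels = np.empty(self.n_samples)
        for cluster_idx, cluster in enumerate(clusters):
            for sample_idx in cluster:
                labels[sample_idx] = cluster_idx
        return labels

# reproducible toy data: three well-separated blobs
np.random.seed(42)
X = np.vstack([np.random.randn(50, 2) + center
               for center in ([0, 0], [10, 0], [5, 10])])
labels = KMeans(K=3, max_iters=150).predict(X)
```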

---
*Source: https://ekstraktznaniy.ru/video/12935*