Finding groups, or clustering, is one of the most common machine learning tasks, and it's often the first thing we do when we start working with a dataset. This video is separated into four chapters to make it easier to follow.

Chapter 1: Neighbors

I introduced the student grades dataset in the previous video of the Introduction to Machine Learning series. This data contains 16 students, each profiled by their grades in 7 subjects. Take Bill, for example. His grades in English and French are high, but he struggles in physics. Are there other students like him in the class? Let's find the three students whose grades are most similar to Bill's. Here they are: Maya, Lea, and Eve, who also excel in languages but struggle in physics. Note, however, that when these three were selected as Bill's neighbors, all their other grades were also considered in the computation, so there are other similarities in their scores. A meta-feature called "distance" has been added to our table. Maya is the most similar to Bill, as her computed distance to him is the smallest, followed by Lea and Eve. Finding the closest neighbors of selected data instances is useful. However, we would also like to get an overall picture of the groups of students. For this, we can use a technique called hierarchical clustering.

Chapter 2: Hierarchical Clustering

For hierarchical clustering, we start by computing the distances between all pairs of data instances - in this case, the students - and then create a visual representation of the groups. Here it is. The cluster with Bill includes Maya and Eve, but also Lea and George. If we look at them closely, they all do well in languages and often fail in science and math. Their cluster differs from the cluster of Cynthia, Jana, Nash, and the others, who generally do better in algebra, biology, and physics but fail in languages. The visualization we are looking at is called a dendrogram, which nicely depicts the clusters in the data.
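The two steps described above - finding a student's nearest neighbors by distance, and clustering all students hierarchically - can be sketched in Python with NumPy and SciPy. The grade matrix below is made up for illustration (it is not the actual dataset from the video), chosen so that the language-oriented students sit close together:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical grades: rows are students, columns are subjects.
names = ["Bill", "Maya", "Lea", "Eve", "George", "Cynthia"]
grades = np.array([
    [5, 5, 2, 1, 2, 1, 3],   # Bill: strong in languages, weak in physics
    [5, 4, 2, 1, 2, 1, 3],   # Maya
    [5, 5, 2, 2, 2, 2, 3],   # Lea
    [4, 5, 2, 2, 2, 2, 3],   # Eve
    [4, 4, 1, 2, 3, 2, 3],   # George
    [1, 2, 5, 5, 5, 4, 3],   # Cynthia: strong in science and math
], dtype=float)

# Euclidean distance from Bill (row 0) to every student, over all subjects.
dist_to_bill = np.linalg.norm(grades - grades[0], axis=1)
order = np.argsort(dist_to_bill)[1:4]        # three smallest, skipping Bill himself
neighbors = [names[i] for i in order]
print("Bill's three nearest neighbors:", neighbors)

# Hierarchical clustering over all pairwise distances, then cut into two groups.
Z = linkage(pdist(grades), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(names, labels)))
```

A dendrogram of `Z` can be drawn with `scipy.cluster.hierarchy.dendrogram`; the `fcluster` call plays the role of slicing it into a chosen number of groups.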
We can also slice the dendrogram to reveal different groups visually. Here, my cut has divided the data into three groups. Finally, what are Ana and Henry good at? Let's check: they're good at sports. Clustering becomes even more fun when we explore the clustering results in various ways, say, on a map.

Chapter 3: Explaining Clusters

Here's an example using the Human Development Index data. I'm working with data on 188 countries, each described by 50 socioeconomic indicators, such as life expectancy, average years of schooling, national income, and more. Let's do the clustering. As before, we compute the distances between countries and then apply hierarchical clustering. Here's the result. I have a cluster of African countries, a cluster of Central European countries, including Germany, Belgium, and Luxembourg, and another cluster of mostly Eastern European countries with Ireland thrown in. Let me zoom out and split the dendrogram into three groups. Now, I'd like to see these clusters on a map. Here it is. Wow! The clustering shows a clear socioeconomic divide running from north to south. Sometimes, the data are simply too large to fit into a dendrogram. Fortunately, other techniques besides hierarchical clustering can help reveal groups. For example, we can embed the data into a two-dimensional plane and visually search for groups. Let me show you an example.

Chapter 4: Data Maps

I will look at data on 1470 employees, profiled with 32 attributes that report employees' age, travel frequency, income, department, distance from home, and other characteristics. I will use a technique called t-SNE, which embeds this data into a two-dimensional plane. In this visualization, each data instance - an employee - is represented by a point. Two points are close to each other if the two employees are similar according to their data profiles. The visualization reveals several clusters.
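A t-SNE embedding of the kind described here can be sketched with scikit-learn. The data below are synthetic stand-ins for the employee table - three artificial groups in 32 dimensions - since the point is only to show the API, not to reproduce the video's result:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in for the employee table: three groups of 50 rows,
# each with 32 numeric attributes, centered at different locations.
groups = [rng.normal(loc=c, scale=0.5, size=(50, 32)) for c in (0.0, 3.0, 6.0)]
X = np.vstack(groups)

# Embed into two dimensions; perplexity roughly sets the neighborhood size
# t-SNE tries to preserve.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)
```

Plotting `emb` as a scatter plot (one point per employee) gives the kind of data map shown in the video, with similar rows landing near each other.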
For example, here is one in the top left, another on the right, a cluster in the middle, and a large cluster at the bottom. Clustering the data is only the beginning. We need to dig deeper and understand the characteristics of each cluster. Cluster exploration falls under the domain of explainable machine learning. Let me show you how it works. I will select the cluster in the top left corner and examine the differences between the employees in that cluster and everyone else. The employees in the cluster I selected all work in human resources. The cluster on the right is mostly salespeople, and many of them are trained in marketing. And the small cluster in the middle contains managers. We can already get so much out of the data with clustering! Visualizations such as dendrograms and t-SNE maps are great tools for exploring data. Clustering treats all data features equally, which means no specific criteria guide the calculation of distances. For this reason, clustering falls into the category of machine learning approaches known as "unsupervised learning". There are other techniques in which the development of data models is guided by specific goals, such as predicting whether an employee will leave the company. These techniques are called "supervised learning," and I'll introduce them in my next video.
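The cluster-vs-rest comparison used above to explain clusters can be sketched with pandas. The tiny table below is made up (the column names and values are illustrative): for a selected cluster, we ask which category is over-represented inside it relative to everyone else, and how a numeric feature differs on average:

```python
import pandas as pd

# Hypothetical employees with a cluster assignment already computed.
df = pd.DataFrame({
    "department": ["HR", "HR", "HR", "Sales", "Sales", "R&D", "R&D", "R&D"],
    "income":     [40,   42,   41,   55,      60,      70,    72,    68],
    "cluster":    [0,    0,    0,    1,       1,       2,     2,     2],
})

selected = df[df["cluster"] == 0]
rest = df[df["cluster"] != 0]

# Which department is over-represented in the selected cluster?
in_share = selected["department"].value_counts(normalize=True)
out_share = rest["department"].value_counts(normalize=True)
diff = in_share.subtract(out_share, fill_value=0).sort_values(ascending=False)
print(diff.head(1))   # most over-represented category in the cluster

# Numeric features: compare the mean inside vs. outside the cluster.
print(selected["income"].mean() - rest["income"].mean())
```

This is the simplest form of the idea; explainable-ML tools refine it with statistical tests and rankings over all features at once.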