# Intro to Machine Learning: Making Predictions

## Metadata

- **Channel:** Orange Data Mining
- **YouTube:** https://www.youtube.com/watch?v=EtUZd5YeQtU

## Contents

### [0:00](https://www.youtube.com/watch?v=EtUZd5YeQtU) Segment 1 (00:00 - 05:00)

In the last video, I showed you clustering with a t-SNE map using the employee data. As a reminder, the data profiles 1,470 employees, including information on their age, travel frequency, income, education, department, and more. There is also an important feature called attrition, which indicates whether an employee has left the company. This feature is not an input variable and was not used for clustering. However, it appears in the t-SNE graph, where the dots representing each employee are red for those who left and blue for those who stayed.

In the previous video, we showed that the t-SNE map effectively identifies groups of employees with different backgrounds and characteristics. However, it's not very useful for predicting attrition: the red and blue dots are mixed and don't seem to correspond to the clusters. Clustering only identifies groups in the data; it doesn't build predictive models. There are other machine learning techniques designed specifically for tasks like attrition prediction; they fall under the umbrella of supervised learning. Let me show you one in action.

Say that the human resources department, HR, would like to know how likely it is that certain employees will leave the company. They could then have a conversation with the employees who are at risk of leaving, understand their concerns, and see if anything can be done to change their minds.

**Chapter 1: Making predictions**

Here's the list of employees. Note that HR only provided a few characteristics from the attrition dataset, while the rest of the features are missing. One comment: dealing with missing data is an important aspect of machine learning. In cases like ours, we should consider modeling techniques that can work with incomplete information. Back to our attrition problem: we have Maya, a manager in the later stages of her career; Chris, a young salesperson; Bill, a new hire in research; and Ana, a more senior employee in the same department.
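The remark about incomplete information can be made concrete. Naive Bayes combines one independent likelihood term per feature, so a feature whose value is missing can simply be left out of the product. Below is a minimal hand-rolled Gaussian Naive Bayes sketch that does exactly that; the class priors and per-feature means and variances are made-up illustrative numbers, not values estimated from the attrition dataset.

```python
import math

def gaussian_logpdf(x, mean, var):
    # Log-density of a univariate Gaussian.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def predict_proba(x, classes):
    # classes: {label: (prior, [(mean, var) per feature])}
    log_post = {}
    for label, (prior, params) in classes.items():
        lp = math.log(prior)
        for xi, (mean, var) in zip(x, params):
            if xi is not None:          # a missing feature is simply skipped
                lp += gaussian_logpdf(xi, mean, var)
        log_post[label] = lp
    # Normalize log-posteriors into probabilities (numerically stable).
    m = max(log_post.values())
    exp = {k: math.exp(v - m) for k, v in log_post.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

# Hypothetical parameters: feature 0 is age, feature 1 is monthly income.
classes = {"stay":  (0.84, [(40.0, 80.0), (6500.0, 4e6)]),
           "leave": (0.16, [(33.0, 70.0), (4800.0, 3e6)])}

# Income is missing (None), yet the model still produces a prediction.
p = predict_proba([35.0, None], classes)
print(p)
```

Because each feature contributes independently, dropping a term only widens the posterior rather than breaking the computation, which is why Naive Bayes is a natural fit for the partially filled-in employee profiles HR provided.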
Who is most likely to consider leaving the company? We'll use our large attrition dataset of 1,470 employees to build a Naive Bayes classifier, one of the simplest supervised machine learning models. Once we've developed the model, we can use it to make predictions. First, we read the data from our Excel spreadsheet, and then feed the data and the model to the crystal ball. Here are the results: Maya, the manager, is least likely to leave; her chance of leaving is only 3%. However, it's Chris, our sales rep, that we need to talk to: his probability of leaving is a high 74%. Bill and Ana are both likely to stay, although we suggest that HR checks in with Bill, who has a 22% chance of leaving, to see how he's doing.

This looks great, right? We used a large dataset, what we call a training dataset in machine learning, to develop a predictive model, and we applied it to a set of employees to help HR identify potential candidates for review. But if we are to trust these models, how accurate are they? Machine learning researchers have developed several methods for estimating the accuracy of classification models. One common approach, called cross-validation, involves dividing our large dataset into a training set and a test set. We build the model on the training set and evaluate its accuracy on the test set. This process can be repeated many times to avoid relying on a single random split of the data. Let me show you how this works on the employee data.

**Chapter 2: Accuracy**

We will test our modeling technique on the attrition dataset and estimate the accuracy of the Naive Bayes classifier. We ran the training and testing cycle five times, which is the default in our software. Accuracy is reported in several ways, one of which is called "classification accuracy". With an accuracy of about 79%, the model predicts the wrong class only 21% of the time.
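The workflow just described — train a Naive Bayes model, score new employees, then estimate accuracy with repeated train/test cycles — can be sketched in code. The video builds this visually in Orange; the scikit-learn version below is an equivalent sketch on synthetic stand-in data, not the actual attrition spreadsheet, so the printed numbers will not match the video's.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1470                                   # employees, as in the video
X = rng.normal(size=(n, 5))                # stand-in features (age, income, ...)
# Synthetic attrition label, loosely dependent on the first two features.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.5, size=n)) > 1.2

model = GaussianNB()
# Five train/test cycles, mirroring the five repetitions in the video.
scores = cross_val_score(model, X, y, cv=5)
print(f"classification accuracy: {scores.mean():.2f}")

# "Feed the data and the model to the crystal ball": probability of leaving
# for one new (synthetic) employee profile.
model.fit(X, y)
p_leave = model.predict_proba(rng.normal(size=(1, 5)))[0, 1]
print(f"probability of leaving: {p_leave:.2f}")
```

The cross-validated mean plays the role of the "classification accuracy" figure quoted above, and `predict_proba` is what produces per-employee chances like Maya's 3% or Chris's 74%.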
We can better understand the model by examining the confusion matrix, which compares the predictions to the actual results. The dataset contains 1,470 employees, of which 237 left the company. The model correctly predicted that 131 employees would leave

### [5:00](https://www.youtube.com/watch?v=EtUZd5YeQtU&t=300s) Segment 2 (05:00 - 07:00)

but incorrectly predicted that 106 would stay. Accuracy was higher for employees who stayed: of the 1,233 employees who stayed, the Naive Bayes classifier misclassified only 195.

There are many other modeling techniques we can try on classification data. One of the most popular is logistic regression, which is often used for binary classification problems, such as predicting whether or not an employee will leave the company. On the attrition dataset, logistic regression performs better overall, with an accuracy of 88%. Interestingly, however, it misclassified more of the employees who left the company. If HR prefers to focus on identifying those who are likely to leave, the Naive Bayes classifier may be the better choice for this specific task, as it is more accurate at predicting attrition. In any case, it would be up to HR to decide which model to use, why, and how. Remember that machine learning is about building these models; applying them to decision-making adds another layer of complexity.

I've left out a lot of details in this short video on predictive modeling. If we want to trust these models, we must understand how the predictions are made. Understanding predictions falls under explainable AI, a topic we will cover in another video series. I also haven't covered how predictive models work; please see the link to another video series where we explore these topics in depth. Finally, we've used only one example dataset here. Predictive models are used today in science, marketing, healthcare, finance, and almost any other field. They can also be applied to images and text. I'll show you how in our next video.
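The model comparison above can be reproduced in outline. The sketch below (scikit-learn on synthetic data, since the actual attrition file is not included here) computes a cross-validated confusion matrix for both Naive Bayes and logistic regression, and reports how many of the "leavers" each model catches — the per-class view that matters to HR, as opposed to overall accuracy alone.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
n = 1470
X = rng.normal(size=(n, 5))                # synthetic stand-in features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.5, size=n)) > 1.2

for name, model in [("naive bayes", GaussianNB()),
                    ("logistic regression", LogisticRegression())]:
    # Out-of-fold predictions from five train/test cycles.
    pred = cross_val_predict(model, X, y, cv=5)
    cm = confusion_matrix(y, pred)         # rows: actual, cols: predicted
    tn, fp, fn, tp = cm.ravel()
    print(f"{name}: accuracy {(tn + tp) / n:.2f}, "
          f"caught {tp} of {tp + fn} leavers")
```

On the real attrition data, this is exactly the trade-off the video describes: one model can win on overall accuracy while the other catches more of the employees who actually leave.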

---
*Source: https://ekstraktznaniy.ru/video/29389*