Cost-sensitive learning in scikit-learn

Segment 1 (00:00 - 04:00)

Now that we know that our next step is to improve our model's AUC, how do we actually do that? The good news is that we can use any technique I've covered in the course, such as hyperparameter tuning, feature selection, trying non-linear models, and so on. All of those techniques have the potential to improve the model's AUC. However, in this lesson, I want to focus on one particular technique that we haven't covered in the course that is particularly useful in cases of class imbalance. That technique is called cost-sensitive learning. The insight behind cost-sensitive learning is that not all prediction errors have the same cost. This can refer to an actual dollar cost of one type of error versus another, or in our case, the real-world implications of a certain type of error. When there's severe class imbalance, it's usually the case that False Negatives, in which positive samples are identified as negative, have a higher cost than False Positives, in which negative samples are identified as positive. This makes sense because the positive samples are rare occurrences, and thus we're more interested in locating the positive samples than the negative samples. In simple terms, we would prefer a False Positive to a False Negative.

So how does cost-sensitive learning actually work? In scikit-learn, this is implemented using the class weight parameter for some models, such as logistic regression and Random Forests. By setting class weight to balanced, scikit-learn will give more weight to the samples from the minority class than samples from the majority class. More specifically, the model is penalized more for making mistakes on the minority class, meaning False Negatives, than it is for making mistakes on the majority class, meaning False Positives. Because the model's goal is to minimize the total cost, the model may show increased bias toward predicting the minority class.
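As an aside, the balanced weighting can be inspected directly: scikit-learn computes each class's weight as n_samples / (n_classes * count of that class), and exposes this calculation through `compute_class_weight`. A minimal sketch on toy labels (the 90/10 split here is made up for illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)

# "balanced" weights are n_samples / (n_classes * class_count):
# class 0 -> 100 / (2 * 90), class 1 -> 100 / (2 * 10) = 5.0
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)  # the minority class receives the larger weight
```

Since the minority class is 9 times rarer here, its errors are weighted 9 times more heavily, which is exactly the penalty asymmetry described above.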
Let's try this out by creating a logistic regression instance that uses class weight equals balanced. This specifies a class weighting that is inversely proportional to the class frequencies in the input data, though you can specify custom weights for each class if you like. We'll fit our logistic regression model on the training set, and use the fitted model to make class predictions as well as predicted probabilities on the testing set. Then we'll calculate the AUC, and it has increased from 0.93 to 0.94 simply by setting class weights. Keep in mind that class weighting is not guaranteed to improve your AUC, and thus it should be tuned like any other parameter, as we'll see in the next chapter.

Let's take a look at the classification report to see how our rates have changed: The True Positive Rate, which was 43%, is up to 88%. The True Negative Rate, which was nearly 100%, is down to 89%, which means that the False Positive Rate has increased from around 0% to 11%. I do want to point out that even though this model might match our priorities better, its accuracy is down from 98% to 89%. This illustrates how sometimes a useful classifier has a lower accuracy than null accuracy.
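The workflow in this lesson can be sketched end to end. The sketch below substitutes a synthetic imbalanced dataset from `make_classification` for the course dataset, so the AUC and the rates in the report will differ from the 0.94 and the percentages quoted above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the course dataset: roughly 98% negative class
X, y = make_classification(n_samples=5000, weights=[0.98], flip_y=0.02,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=1)

# class_weight="balanced" penalizes minority-class mistakes more heavily
logreg = LogisticRegression(class_weight="balanced", max_iter=1000)
logreg.fit(X_train, y_train)

# Class predictions for the report, probabilities for the AUC
y_pred = logreg.predict(X_test)
y_score = logreg.predict_proba(X_test)[:, 1]  # P(positive class)

auc = roc_auc_score(y_test, y_score)
print("AUC:", auc)
print(classification_report(y_test, y_pred))
```

Note that AUC is computed from the predicted probabilities, while the classification report (and its recall per class, i.e. the True Positive and True Negative Rates) is computed from the hard class predictions.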
