
The 3 Most Important Composite Classification Metrics

source link: https://towardsdatascience.com/the-3-most-important-composite-classification-metrics-b1f2d886dc7b?gi=8f4db8be1e05

This is the third and final article in a series to help you understand, use, and remember the seven most popular classification metrics. In the first article in the series I explained the confusion matrix and the most common evaluation term: accuracy. In the second article I shined a light on the three most common basic metrics: recall (sensitivity), precision, and specificity. If you don’t have those terms down cold, I suggest you spend some more time with them before proceeding. :+1:

Each of the composite metrics in this article is built from basic metrics. Let’s look at some beautiful composite metrics!

Balanced Accuracy

As you saw in the first article in the series, when outcome classes are imbalanced, accuracy can mislead.

Balanced accuracy is a better metric to use with imbalanced data. It accounts for both the positive and negative outcome classes and doesn’t mislead with imbalanced data.

Here’s the formula:

Balanced Accuracy = ((TP / (TP + FN)) + (TN / (TN + FP))) / 2

Thinking back to the last article, which metric is TP/(TP+FN) the formula for? That’s right, recall, also known as sensitivity and the true positive rate!

And which metric is TN/(TN+FP) the formula for? That’s right, specificity, also known as the true negative rate!

So here’s a shorter way to write the balanced accuracy formula:

Balanced Accuracy = (Sensitivity + Specificity) / 2

Balanced accuracy is just the average of sensitivity and specificity. It’s great to use when they are equally important. :point_up:

Let’s continue with an example from the previous articles in this series. Here are the results from our model’s predictions of whether a website visitor would purchase a shirt at Jeff’s Awesome Hawaiian Shirt store. :hibiscus::shirt:

                  Predicted Positive    Predicted Negative
Actual Positive          80  (TP)            20 (FN)
Actual Negative          50  (FP)            50 (TN)

Our sensitivity is .8 and our specificity is .5. Average those scores to get our balanced accuracy:

(.8 + .5) / 2 = .65

In this case our accuracy is 65%, too: (80+50) / 200.

When the outcome classes are the same size, accuracy and balanced accuracy are the same! :grinning:

[Image: rocks balancing. Source: pixabay.com]

Now let’s see what happens with imbalanced data. Let’s look at our previous example of disease detection with more negative cases than positive cases.

                 Predicted Positive    Predicted Negative
Actual Positive           1 (TP)                 8 (FN)
Actual Negative           2 (FP)               989 (TN)

Our accuracy is 99%: (990/1,000).

But our balanced accuracy is 55.5%!

((1 / (1 + 8)) + (989 / (2 + 989))) / 2 = 55.5%

:astonished:

Do you think balanced accuracy of 55.5% better captures the model’s performance than 99.0% accuracy?

Balanced accuracy bottom line

Balanced accuracy is a good measure when you have imbalanced data and you are indifferent between correctly predicting the negative and positive classes. :grinning:

The scikit-learn function name is balanced_accuracy_score.
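Here’s a minimal sketch that reproduces the disease detection numbers above with scikit-learn. The label lists are just a reconstruction of the confusion matrix counts, not real data.

from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Reconstruct the disease detection confusion matrix as label lists
# (1 = has the disease, 0 = does not): 1 TP, 8 FN, 2 FP, 989 TN.
y_true = [1] * 9 + [0] * 991
y_pred = [1] * 1 + [0] * 8 + [1] * 2 + [0] * 989

print(accuracy_score(y_true, y_pred))           # 0.99
print(balanced_accuracy_score(y_true, y_pred))  # ~0.555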

Another, even more common composite metric is the F1 score.

[Image: Formula 1. Source: pixabay.com]

F1 Score

The F1 score is the harmonic mean of precision and recall. If you care about precision and recall roughly equally, the F1 score is a great metric to use. Note that while any of the metrics you’ve seen can be followed by the word score, F1 always is. :point_up:

Remember that recall is also known as sensitivity or the true positive rate.

Here’s the formula for the F1 score, using P and R for precision and recall, respectively:

F1 = 2 * (P * R) / (P + R)

Let’s see how the two examples we’ve looked at compare in terms of F1 score. In our Hawaiian shirt example, our model’s recall is 80% and its precision is 61.5%.

The model’s F1 score is:

2 * (.615 * .80) / (.615 + .80) = .695

That doesn’t sound so bad.

Let’s calculate the F1 for our disease detection example. There the model’s recall is 11.1% and the precision is 33.3%.

The model’s F1 is:

2 * (.111 * .333) / (.111 + .333) = .167

That is not so hot. ☹

[Image: much hotter than our model’s F1 score. Source: pixabay.com]

The F1 score is popular because it combines two metrics that are often very important — recall and precision — into a single metric. If either is low, the F1 score will also be quite low.
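As a quick check on the two calculations above, here’s a minimal sketch with scikit-learn; the label lists are reconstructed from the confusion matrices, not real data.

from sklearn.metrics import f1_score

# Hawaiian shirt example: 80 TP, 20 FN, 50 FP, 50 TN (1 = bought a shirt).
y_true_shirts = [1] * 100 + [0] * 100
y_pred_shirts = [1] * 80 + [0] * 20 + [1] * 50 + [0] * 50
print(f1_score(y_true_shirts, y_pred_shirts))    # ~0.696

# Disease detection example: 1 TP, 8 FN, 2 FP, 989 TN.
y_true_disease = [1] * 9 + [0] * 991
y_pred_disease = [1] * 1 + [0] * 8 + [1] * 2 + [0] * 989
print(f1_score(y_true_disease, y_pred_disease))  # ~0.167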

The scikit-learn function name is f1_score. Let’s look at a final popular composite metric, ROC AUC.

ROC AUC

ROC AUC stands for Receiver Operating Characteristic, Area Under the Curve. It is the area under the curve of the true positive rate vs. the false positive rate. Remember that the true positive rate also goes by the names recall and sensitivity.

The false positive rate isn’t a metric we’ve discussed in this series.

False Positive Rate

The false positive rate (FPR) is a bonus metric. :+1: It’s calculated by dividing the false positives by all the actual negatives, which also makes it equal to 1 minus the specificity.

FPR = FP / (FP + TN)

The false positive rate is the only metric we’ve seen where a lower score is better. :arrow_down:=:grinning:

The FPR is rarely used on its own. It’s important because it’s one of the two metrics that go into the ROC AUC.
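For a concrete number, here’s the FPR for the disease detection example, computed directly from the confusion matrix counts:

# Disease detection example: 2 FP, 989 TN.
fp, tn = 2, 989
fpr = fp / (fp + tn)
print(round(fpr, 4))                 # 0.002
print(round(1 - tn / (tn + fp), 4))  # same value: FPR = 1 - specificity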

To visualize the ROC curve, you can plot it using sklearn’s plot_roc_curve. The function signature matches the plot_precision_recall_curve function you saw in the second article in this series.

plot_roc_curve(estimator, X_test, y_test)

Here’s an example of a ROC curve:

[ROC curve plot: blue is the model’s performance; orange is the baseline.]

The ROC curve is a popular plot that can help you decide where to set a decision threshold so that you can optimize other metrics.

In practice, the AUC (area under the curve) ranges from .5 to 1, and a higher score is better. A score of .5 is no bueno: it means the model does no better than random guessing, and it’s represented by the orange line in the plot above. ☹️

You want your model’s curve to be as close to the top left corner as possible. You want a high TPR with a low FPR.

Our model does okay, but there’s room for improvement. :neutral_face:

The ROC AUC is not a metric you want to compute by hand. ✍ Fortunately, the scikit-learn function roc_auc_score can do the job for you. Note that you need to pass the predicted probabilities as the second argument, not the predictions. :point_up:

roc_auc_score(y_test, y_predicted_probabilities)
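Here’s a minimal, self-contained sketch of that pattern. The synthetic dataset and logistic regression model are stand-ins for illustration, not the examples from this series.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Pass the positive-class probabilities, not the 0/1 predictions.
y_predicted_probabilities = model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_predicted_probabilities))

If your model exposes decision_function instead of predict_proba (an SVM, say), you can pass those non-thresholded scores to roc_auc_score as well.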

ROC AUC is a good summary statistic when classes are relatively balanced. However, with imbalanced data it can mislead. For a good discussion see this Machine Learning Mastery post.

[Image: balancing act. Source: pixabay.com]

Summary

In this article you learned about balanced accuracy, F1 score, and ROC AUC.

Recap

Here are the formulas for all the evaluation metrics you’ve seen in this series:

  • Accuracy = (TP + TN) / All
  • Recall (Sensitivity, TPR) = TP / (TP + FN)
  • Precision = TP / (TP + FP)
  • Specificity (TNR) = TN / (TN + FP)
  • Balanced Accuracy = (Sensitivity + Specificity) / 2
  • F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
  • ROC AUC = Area under TPR vs. FPR

ROC AUC stands for Receiver Operating Characteristic Area Under the Curve. It does NOT stand for Receiver Operating Curve. :+1:

Results Summary

Here are the results from the Hawaiian shirt example:

  • Accuracy = 65%
  • Recall (Sensitivity, TPR) = 80%
  • Precision = 61.5%
  • Specificity (TNR) = 50%
  • Balanced Accuracy = 65%
  • F1 Score = .695

Here are the results from the disease detection example:

  • Accuracy = 99%
  • Recall (Sensitivity, TPR) = 11.1%
  • Precision = 33.3%
  • Specificity (TNR) = 99.8%
  • Balanced Accuracy = 55.5%
  • F1 Score = .167

As the results of our two examples show, with imbalanced data, different metrics paint a very different picture.
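If you want to double-check these tables yourself, here’s a minimal sketch that reproduces them with scikit-learn (ROC AUC is left out because it needs predicted probabilities rather than confusion matrix counts):

from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score)

def report(y_true, y_pred):
    # recall_score with pos_label=0 is recall for the negative class,
    # which is the specificity (true negative rate).
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Recall (Sensitivity, TPR)": recall_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Specificity (TNR)": recall_score(y_true, y_pred, pos_label=0),
        "Balanced Accuracy": balanced_accuracy_score(y_true, y_pred),
        "F1 Score": f1_score(y_true, y_pred),
    }

# Hawaiian shirt example: 80 TP, 20 FN, 50 FP, 50 TN.
shirts = report([1] * 100 + [0] * 100,
                [1] * 80 + [0] * 20 + [1] * 50 + [0] * 50)

# Disease detection example: 1 TP, 8 FN, 2 FP, 989 TN.
disease = report([1] * 9 + [0] * 991,
                 [1] * 1 + [0] * 8 + [1] * 2 + [0] * 989)

for name in shirts:
    print(f"{name}: shirts {shirts[name]:.3f}, disease {disease[name]:.3f}")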

[Image: paints. Source: pixabay.com]

Wrap

There are many, many other classification metrics, but mastering these seven should make you a pro! :grinning:

The seven metrics you’ve seen are your tools to help you choose classification models and decision thresholds for those models. Your job is to use these metrics sensibly when selecting your final models and setting your decision thresholds.

I should mention one other common approach to evaluating classification models. You can attach a dollar value or utility score for the cost of each false negative and false positive. You can use those expected costs in your determination of which model to use and where to set your decision threshold.
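For example, here’s a minimal sketch of that idea applied to the disease detection model, assuming made-up costs of $500 per false negative and $50 per false positive; the dollar amounts are purely illustrative.

# Hypothetical costs, purely for illustration.
COST_PER_FALSE_NEGATIVE = 500  # e.g., a missed disease case
COST_PER_FALSE_POSITIVE = 50   # e.g., an unnecessary follow-up test

def expected_cost(fp, fn):
    return fp * COST_PER_FALSE_POSITIVE + fn * COST_PER_FALSE_NEGATIVE

# Disease detection model above: 2 FP, 8 FN.
print(expected_cost(fp=2, fn=8))  # 4100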

I hope you found this introduction to classification metrics to be helpful. If you did, please share it on your favorite social media so other folks can find it, too. :grinning:

I write about Python, SQL, Docker, and other tech topics. If any of that’s of interest to you, sign up for my mailing list of data science resources and read more to help you grow your skills here. :+1:


Happy choosing! :grinning:

