source link: https://pkghosh.wordpress.com/2021/08/25/find-out-how-well-your-machine-learning-model-is-calibrated/

Find Out How Well Your Machine Learning Model is Calibrated

If your machine learning model predicts the probability of a target, which is common for classification tasks, how much confidence do you have in the predicted probability? If you need to make a critical decision based on the prediction, in the medical domain for example, will you feel confident about making such a decision? Calibration is a metric that tells you whether your model's predicted probabilities are trustworthy. Although it's a critical metric, it's not discussed that often. It has been observed that large and complex neural networks, while being more accurate, have worse calibration.

In this post we will walk through an example of a neural network model for heart disease prediction and find out whether the model is well calibrated. Calibrating a poorly calibrated model is a separate issue and won't be the topic of this post, but you can go through the citations in the post to learn about it. The Python code is in my GitHub repository avenir.

Model Calibration

For a perfectly calibrated model, the prediction probability, aka confidence, of a sample should be the same as the local accuracy of the model in the vicinity of the sample. In other words, p_n(x)(y = 1 | p_th) = p_x(y = 1), where n(x) is some population of samples around x and p_th is the probability threshold for non-probabilistic prediction. The definition is always with respect to some target value; here a target value of 1 is used for a classification problem.

In reality no model will be perfectly calibrated. It will be either overconfident, i.e. the confidence is greater than the local accuracy, or underconfident, i.e. the confidence is lower than the local accuracy. One common way to represent calibration is with a reliability diagram. Here are the steps, using predictions for a batch of data (a code sketch follows the list).

  • Decide on the number of bins and bin the confidence values
  • For each bin calculate accuracy and the average confidence for some target value
  • Plot accuracy against confidence.
  • The higher the deviation of accuracy from the 45 degree line, the weaker the calibration
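
Here is a minimal sketch of those steps, assuming you already have arrays of predicted probabilities for the positive class and the true 0/1 labels. The function and variable names are illustrative, not taken from the post's code.

import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(y_true, y_prob, num_bins=10):
    # bin the confidence values into equal width bins
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    bin_ids = np.digitize(y_prob, bins[1:-1])
    accuracies, confidences = [], []
    for b in range(num_bins):
        mask = bin_ids == b
        if mask.sum() == 0:
            continue
        # accuracy for target value 1 is the fraction of positives in the bin
        accuracies.append(y_true[mask].mean())
        # average confidence in the bin
        confidences.append(y_prob[mask].mean())
    # plot accuracy against confidence, with the 45 degree reference line
    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.plot(confidences, accuracies, "o-", label="model")
    plt.xlabel("confidence")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()
    return np.array(confidences), np.array(accuracies)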

Heart Disease Model Calibration

The feed forward neural network model with one hidden layer uses these features, based on personal and health data, and predicts the probability of getting heart disease. The prediction data is used to plot the reliability diagram. The data has the following fields.

  • sex
  • age
  • weight
  • systolic blood pressure
  • diastolic blood pressure
  • smoker
  • diet
  • physical activity per week
  • education
  • ethnicity
  • has heart disease

Let's find out how well the model is calibrated with the calibration diagram. The calibration calculation is based on positive values of the target.

The model is underconfident at the higher range of confidence values and overconfident at the lower range. Class imbalance exacerbates the calibration problem, which is the case for this model. Here are some aggregate calibration values, with a code sketch after the table.

expected calibration error	0.170
maximum calibration error	0.472
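
The following is a minimal sketch of how these two aggregates can be computed from the same bins as the reliability diagram; the names are illustrative and this is not necessarily the exact implementation in the repository.

import numpy as np

def calibration_errors(y_true, y_prob, num_bins=10):
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    bin_ids = np.digitize(y_prob, bins[1:-1])
    n = len(y_prob)
    ece, mce = 0.0, 0.0
    for b in range(num_bins):
        mask = bin_ids == b
        if mask.sum() == 0:
            continue
        # gap between accuracy and average confidence in the bin
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        # expected calibration error weights the gap by bin population
        ece += (mask.sum() / n) * gap
        # maximum calibration error takes the worst bin
        mce = max(mce, gap)
    return ece, mce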

Calibration can also be computed locally for a given record. The accuracy is calculated based on a specified number of nearest neighbors. Here is some sample output, followed by a code sketch.

conf	accu	record
0.623	1.000	1.000,0.000,0.700,0.796,0.557,0.720,0.000,1.000,0.000,1.000,0.000,0.000,0.111,0.286,0.000,0.000,0.000,1.000
0.633	1.000	1.000,0.000,0.525,0.827,0.400,0.980,0.000,1.000,0.000,1.000,0.000,0.000,0.111,0.214,0.000,1.000,0.000,0.000
0.582	0.900	1.000,0.000,0.900,0.969,0.886,0.800,0.000,1.000,0.000,0.000,1.000,0.000,0.167,0.286,0.000,0.000,1.000,0.000
0.586	0.900	1.000,0.000,0.200,0.653,0.386,0.740,0.000,1.000,0.000,1.000,0.000,0.000,0.056,0.286,0.000,0.000,0.000,1.000
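
Here is a minimal sketch of local calibration for a single record, under the assumption that local accuracy for target value 1 is the fraction of positives among the record's k nearest neighbors in the normalized feature space. Names are illustrative, not from the post's code.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_calibration(x, X_train, y_train, y_prob_x, k=10):
    # find the k nearest neighbors of the record in feature space
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x.reshape(1, -1))
    # local accuracy: fraction of neighbors with target value 1
    local_acc = y_train[idx[0]].mean()
    # return (confidence, local accuracy) for comparison, as in the sample output
    return y_prob_x, local_acc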

Another related and important metric is sharpness. Sharpness is the difference between the local accuracy for a given target value and the global accuracy. Higher contrast is desirable. Here is the contrast diagram for the same model.
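
A minimal sketch of that contrast measure, building on the same nearest-neighbor local accuracy as above; the assumption that local accuracy is the neighborhood rate of the target value is mine, not spelled out in the post.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def contrast(x, X_train, y_train, k=10):
    # global accuracy for target value 1 is the overall positive rate
    global_acc = y_train.mean()
    # local accuracy around x from its k nearest neighbors
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x.reshape(1, -1))
    local_acc = y_train[idx[0]].mean()
    # sharpness / contrast: local accuracy minus global accuracy
    return local_acc - global_acc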

To try them out, please follow the tutorial document. The Python code for model calibration is also available.

Improving Model Calibration

At first glance it may seem that changing the probability threshold for non-probabilistic prediction will improve calibration. However, that's not the case. Changing the probability threshold will improve calibration in one area of the reliability diagram while making it worse in others. The expected calibration error is likely to remain the same.

The preferred way to recalibrate is to tune the model with calibration in mind. However, sometimes that won't go far enough to meet your objective and you have to resort to post hoc techniques. The calibration criterion can be set in different ways, e.g. keeping the expected calibration error or the maximum calibration error below some predefined threshold.

The paper cited earlier has various calibration improvement techniques. These are post hoc techniques, i.e. for a deployed model the predicted probability goes through additional processing to produce the final predicted probability. Here is a post on recalibration using the scikit-learn library.
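
As one common example of such a post hoc technique, here is a minimal sketch using scikit-learn's CalibratedClassifierCV on synthetic data. The logistic regression base model and the synthetic data are stand-ins for illustration; this is not necessarily the approach taken in the linked post.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in data with 18 features, like the heart disease records
X, y = make_classification(n_samples=2000, n_features=18, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

base_model = LogisticRegression(max_iter=1000)
# method can be "sigmoid" (Platt scaling) or "isotonic"
calibrated = CalibratedClassifierCV(base_model, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
# recalibrated probabilities for the positive class
recalibrated_prob = calibrated.predict_proba(X_test)[:, 1]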

Recalibration can also be done online for deployed models. This technique will also handle any potential concept drift after deployment.

Tracking Calibration for Deployed Model

Calibration properties might change in deployed models due to drift. You can track this using a baseline reliability diagram created before deployment. The baseline reliability diagram might include any recalibration applied to the predicted probability. After deployment, using online prediction data, you can create another reliability diagram over a sliding window. Any significant difference between the two reliability plots indicates potential drift.
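
Here is a minimal sketch of that comparison: build binned accuracies for the baseline data and for a sliding window of online predictions, then flag drift when the curves diverge. The drift threshold and all names are illustrative assumptions.

import numpy as np

def binned_accuracy(y_true, y_prob, num_bins=10):
    # binned accuracy values, i.e. the reliability curve as an array
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    bin_ids = np.digitize(y_prob, bins[1:-1])
    acc = np.full(num_bins, np.nan)
    for b in range(num_bins):
        mask = bin_ids == b
        if mask.sum() > 0:
            acc[b] = y_true[mask].mean()
    return acc

def calibration_drift(baseline_acc, window_acc, threshold=0.1):
    # maximum absolute gap between the two reliability curves over shared bins
    shared = ~np.isnan(baseline_acc) & ~np.isnan(window_acc)
    gap = np.max(np.abs(baseline_acc[shared] - window_acc[shared]))
    return gap, gap > threshold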

Calibration can be tracked locally as well, using the predicted probabilities of a baseline data set. As predictions are made for new samples in production, we accumulate and maintain a data set of all the predictions. For each new production sample, we find the predicted probability of the nearest sample in the baseline data set along with the average accuracy in a neighborhood of that sample. This is repeated for the production data. We end up with two calibration error values, one from the baseline data set and one from the production data set.
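
A minimal sketch of that local tracking, under the assumption that the predicted probability and neighborhood accuracy have been precomputed for every baseline sample; all names and the comparison at the end are illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_calibration_error(X_query, X_base, prob_base, acc_base):
    # prob_base / acc_base: predicted probability and neighborhood accuracy
    # precomputed for each baseline sample
    nn = NearestNeighbors(n_neighbors=1).fit(X_base)
    _, idx = nn.kneighbors(X_query)
    nearest = idx[:, 0]
    # average gap between confidence and local accuracy at the nearest baseline samples
    return np.mean(np.abs(prob_base[nearest] - acc_base[nearest]))

# one error value from the baseline data, one from the production data
# baseline_err = local_calibration_error(X_base, X_base, prob_base, acc_base)
# production_err = local_calibration_error(X_prod, X_base, prob_base, acc_base)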

Final Thoughts

Model calibration is an important metric, especially for critical applications. In this post our discussion has focused on finding out how well calibrated a model is. Recalibrating a model so that confidence is in line with accuracy is a separate topic.

