
My Deep Learning Model Says: “sorry, I don’t know the answer”. That’s Absolutely...

source link: https://towardsdatascience.com/my-deep-learning-model-says-sorry-i-dont-know-the-answer-that-s-absolutely-ok-50ffa562cb0b?gi=bfaf0bbb5282

Motivation

Although deep learning works, it is often unclear why it works. This makes it tricky to deploy artificial intelligence in high-risk areas like aviation, the judiciary, and medicine.

A neural network identifies that a cell biopsy is cancerous, but it does not tell us why.

Typically, a classifier model is forced to decide between two possible outcomes even when it has no real clue; it has effectively flipped a coin. In real life, a model for medical diagnosis should care not only about accuracy but also about how certain its prediction is. If the uncertainty is too high, a doctor should take this into account in their decision process.

A deep learning model should be able to say: “sorry, I don’t know”.

A model for self-driving cars that has learned from an insufficiently diverse training set is another interesting example. If the car is unsure whether there is a pedestrian on the road, we would expect it to let the driver take charge.

Networks with greater generalization are less interpretable. Interpretable networks don't generalize well. (source)

Some models may not require explanations because they are used in low-risk applications, such as a product recommender system. Nevertheless, integrating critical models into our daily lives requires interpretability to increase the social acceptance of AI. This is because people like to attribute beliefs, desires, and intentions to things (source).

Understanding and explaining what a neural network does not know is crucial for end-users. Practitioners also seek better interpretability to build more robust models that are resistant to adversarial attacks.


Images by Goodfellow et al., ICLR 2015, Explaining and Harnessing Adversarial Examples. Adding a little noise to a photo of a panda causes incorrect classification as a gibbon.

In the following sections, we take a closer look at the concept of uncertainty. We also introduce simple techniques for assessing uncertainty in deep learning models.

Types of uncertainty

There are two major types of uncertainty in deep learning: epistemic uncertainty and aleatoric uncertainty. Neither term rolls off the tongue easily.

Epistemic uncertainty describes what the model does not know because the training data was not appropriate. It is due to limited data and knowledge, and it arises in regions where there are few training samples. Given enough training samples, epistemic uncertainty decreases.

Aleatoric uncertainty is the uncertainty arising from the natural stochasticity of observations. It cannot be reduced even when more data is provided. When it comes to measurement errors, we call it homoscedastic uncertainty because it is constant for all samples. Input-dependent uncertainty is known as heteroscedastic uncertainty.

The illustration below represents a real linear process (y=x) that was sampled around x=-2.5 and x=2.5.


An exhibit of the different kinds of uncertainty in a linear regression context (Image by Michel Kana).

A sensor malfunction introduced noise in the left cloud. These noisy measurements of the underlying process lead to high aleatoric uncertainty in the left cloud. This uncertainty cannot be reduced by additional measurements, because the sensor keeps producing errors around x=-2.5 by design.

High epistemic uncertainty arises in regions where there are few observations for training, because too many plausible model parameters can be suggested to explain the underlying ground-truth phenomenon. This is the case in the regions to the left and right of our two clouds, where we are not sure which model parameters describe the data best. Given more data in those regions, uncertainty would decrease. In high-risk applications, it is important to identify such regions.

How to assess uncertainty using Dropout

Bayesian statistics allow us to derive conclusions based on both data and our prior knowledge about the underlying phenomenon. One of the key distinctions is that parameters are distributions instead of fixed weights.

If instead of learning the model’s parameters, we could learn a distribution over them, we would be able to estimate uncertainty over the weights.

How can we learn the weights' distribution? Deep ensembling is a powerful technique where multiple copies of a model are trained on their respective datasets, and their resulting predictions collectively build a predictive distribution.

Because ensembling can require plentiful computing resources, an alternative approach was suggested: Dropout as a Bayesian Approximation of a model ensemble. This technique was introduced by Yarin Gal and Zoubin Ghahramani in their 2016 paper.

Dropout is a widely used practice in deep learning, where it serves as a regularizer to avoid overfitting. It consists of randomly sampling network nodes and dropping them out during training. Dropout zeros out neurons randomly according to a Bernoulli distribution.
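To make the Bernoulli sampling concrete, here is a toy NumPy sketch (shapes, names, and the drop rate are illustrative, not taken from the article) of how a dropout mask is applied. Most frameworks implement the "inverted dropout" variant shown here, which rescales the surviving activations so the expected output stays unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# a toy batch of activations from one hidden layer
activations = rng.normal(size=(4, 8))

drop_rate = 0.05                 # fraction of units to zero out
keep_prob = 1.0 - drop_rate

# Bernoulli mask: 1 = keep the unit, 0 = drop it
mask = rng.binomial(1, keep_prob, size=activations.shape)

# inverted dropout: rescale survivors to preserve the expected activation
dropped = activations * mask / keep_prob
```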

In general, there seems to be a strong link between regularization and prior distributions in Bayesian models. Dropout is not the only example. The frequently used L2 regularization is essentially a Gaussian prior.

In their paper, Yarin and Zoubin showed that a neural network with dropout applied before every weight layer is mathematically equivalent to a Bayesian approximation of the Gaussian process.


Image By Yu Ri Tan on yuritan.nl — Dropout changes model architecture at different forward passes allowing Bayesian approximation. (Authorized citation of the image obtained from Yu Ri Tan)

With dropout, each subset of nodes that is not dropped out defines a new network. The training process can be thought of as training 2^m different models simultaneously, where m is the number of nodes in the network. For each batch, a randomly sampled set of these models is trained.

The key idea is to apply dropout at both training and testing time. At test time, the paper suggests repeating the prediction a few hundred times with random dropout. The average of all predictions is the estimate. For the uncertainty interval, we simply calculate the variance of the predictions. This gives the ensemble's uncertainty.
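A minimal sketch of this Monte Carlo dropout procedure, assuming a tf.keras model that contains Dropout layers (the function name and sample count are ours, not from the article). Passing training=True at call time keeps dropout active during inference:

```python
import numpy as np
import tensorflow as tf

def mc_dropout_predict(model, x, n_samples=300):
    """Monte Carlo dropout: run the model repeatedly with dropout kept
    active (training=True) and aggregate the stochastic predictions."""
    preds = np.stack([model(x, training=True).numpy() for _ in range(n_samples)])
    return preds.mean(axis=0), preds.std(axis=0)  # estimate and its uncertainty
```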

Predicting Epistemic Uncertainty

We will assess epistemic uncertainty on a regression problem using data generated by adding normally distributed noise to the function y=x as follows:

  • 100 data points are generated in the left cloud between x=-3 and x=-2,
  • 100 data points are generated in the right cloud between x=2 and x=3.
  • Noise is added to the left cloud with 10 times higher variance than the right cloud (see the sketch below).
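A minimal sketch of this data generation in NumPy. The absolute noise scales and the random seed are assumptions; only the 10x variance ratio and the sampling ranges come from the description above.

```python
import numpy as np

rng = np.random.default_rng(42)

# two clouds of 100 points each, sampled from the linear process y = x
x_left = rng.uniform(-3.0, -2.0, 100)
x_right = rng.uniform(2.0, 3.0, 100)

# the left cloud gets noise with 10x the variance of the right cloud
y_left = x_left + rng.normal(0.0, np.sqrt(1.0), 100)
y_right = x_right + rng.normal(0.0, np.sqrt(0.1), 100)

x_train = np.concatenate([x_left, x_right]).reshape(-1, 1)
y_train = np.concatenate([y_left, y_right]).reshape(-1, 1)
```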


Below we design two simple neural networks: one without dropout layers, and a second one with a dropout layer between the hidden layers. The dropout layer randomly disables 5% of neurons during each training and inference batch. We also include L2 regularizers to apply penalties on layer parameters during optimization.


network without dropout layers


network with dropout layers

The rmsprop optimizer is used to train on batches of 10 points by minimizing the mean squared error. Convergence is very fast for both models. The model with dropout exhibits a slightly higher loss with more stochastic behavior. This is because random regions of the network are disabled during training, causing the optimizer to jump across local minima of the loss function.
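The following is a sketch of both architectures and the training setup in tf.keras, under stated assumptions: the layer sizes, L2 strength, and epoch count are ours; the 5% dropout rate, the single dropout layer between hidden layers, the rmsprop optimizer, the MSE loss, and the batch size of 10 come from the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_model(use_dropout: bool, dropout_rate: float = 0.05, l2: float = 1e-3):
    """Two hidden Dense layers with L2 penalties; the dropout variant adds a
    Dropout layer between the hidden layers."""
    inputs = tf.keras.Input(shape=(1,))
    x = layers.Dense(128, activation="relu",
                     kernel_regularizer=regularizers.l2(l2))(inputs)
    if use_dropout:
        x = layers.Dropout(dropout_rate)(x)   # disables 5% of neurons per batch
    x = layers.Dense(128, activation="relu",
                     kernel_regularizer=regularizers.l2(l2))(x)
    outputs = layers.Dense(1)(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop", loss="mse")
    return model

model_plain = build_model(use_dropout=False)
model_dropout = build_model(use_dropout=True)

# x_train / y_train come from the data-generation sketch above
model_plain.fit(x_train, y_train, batch_size=10, epochs=200, verbose=0)
model_dropout.fit(x_train, y_train, batch_size=10, epochs=200, verbose=0)
```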


Below, we show how the models perform on test data. The model without dropout predicts a straight line with a perfect R2 score. Including dropout results in a nonlinear prediction line with an R2 score of 0.79. Although the dropout model overfits less, it has higher bias and lower accuracy; in return, it highlights uncertainty in its predictions in the regions without training samples. The prediction line has higher variance in those regions, which can be used to compute epistemic uncertainty.


The model with dropout exhibits predictions with high variance in regions without training samples. This property is used to approximate epistemic uncertainty.

Below, we evaluate both models (with and without dropout) on a test dataset, keeping the dropout layers active at evaluation time and repeating the forward pass a few hundred times. This is equivalent to simulating a Gaussian process. Each time, we obtain a range of output values for each input scalar from the test data. This allows us to compute the standard deviation of the posterior distribution and display it as a measure of epistemic uncertainty.


The model without dropout predicts fixed values with apparent 100% certainty, even in regions without training samples.


The model with dropout estimates high epistemic uncertainty in regions without training samples.

As expected, data for x < -3 and x > 3 have high epistemic uncertainty, as no training data is available at these points.

Dropout allows the model to say: “all my predictions for x < -3 and x > 3 are just my best guess.”


Polynomial Regression

In this section, we investigate how to assess epistemic uncertainty via dropout for more complex tasks, such as polynomial regression.

For this purpose, we generate a synthetic training dataset by randomly sampling from a sinusoidal function and adding noise of different amplitudes.
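A minimal sketch of such a dataset in NumPy. The sampling ranges, noise amplitudes, and seed are assumptions for illustration; the article only states that the data comes from a sinusoid with noise of different amplitudes.

```python
import numpy as np

rng = np.random.default_rng(7)

# sample x in two disjoint ranges so that a gap without training data remains
x = np.concatenate([rng.uniform(-4.0, -1.0, 150), rng.uniform(1.0, 4.0, 150)])

# sinusoidal ground truth with region-dependent noise amplitude (assumed values)
noise_scale = np.where(x < 0, 0.3, 0.05)
y = np.sin(x) + rng.normal(0.0, 1.0, x.shape) * noise_scale

x_train, y_train = x.reshape(-1, 1), y.reshape(-1, 1)
```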

The results below suggest that including dropout provides a way to assess epistemic uncertainty in regions where there is no data, even for nonlinear data. Although dropout affects model performance, it clearly shows that predictions are less certain in regions where there were not enough training samples.


The model without dropout overfits to the training samples and shows over-confidence when predicting in regions without training data.


The model with dropout has a high bias, but is less confident in regions without training data. Epistemic uncertainty is higher where training samples are missing.

Predicting Aleatoric Uncertainty

While epistemic uncertainty is a property of the model, aleatoric uncertainty is a property of the data. Aleatoric uncertainty captures our uncertainty concerning information that our data cannot explain.

When aleatoric uncertainty is constant and not dependent on the input data, it is called homoscedastic uncertainty; otherwise, the term heteroscedastic uncertainty is used.

Heteroscedastic uncertainty depends on the input data and therefore can be predicted as a model output. Homoscedastic uncertainty can be estimated as a task-dependent model parameter.

Learning heteroscedastic uncertainty is done by replacing the mean squared error loss function with the following (source):

Loss = (1/N) Σᵢ [ ‖yᵢ − ŷᵢ‖² / (2σᵢ²) + ½ log σᵢ² ]

The model predicts both a mean ŷ and a variance σ². If the residual is very large, the model will tend to predict a large variance. The log term prevents the variance from growing infinitely large. An implementation of this aleatoric loss function in Python is provided below.
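The original code snippet is not reproduced in this copy of the article, so here is a sketch of the standard log-variance formulation of this loss for tf.keras (the function name is ours). Predicting s = log σ² instead of σ² directly keeps the loss numerically stable and guarantees a positive variance.

```python
import tensorflow as tf

def aleatoric_loss(y_true, y_pred):
    """Heteroscedastic regression loss. y_pred carries two columns:
    the predicted mean and the predicted log variance s = log(sigma^2)."""
    y_hat = y_pred[:, 0:1]
    log_var = y_pred[:, 1:2]
    squared_residual = tf.square(y_true - y_hat)
    # first term: residual weighted by 1/(2*sigma^2); second term: 0.5*log(sigma^2)
    return tf.reduce_mean(0.5 * tf.exp(-log_var) * squared_residual
                          + 0.5 * log_var)
```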

The aleatoric loss can be used to train a neural network. Below, we illustrate an architecture that is similar to the one used for epistemic uncertainty in the previous section, with two differences:

  • there is no dropout layer between the hidden layers,
  • the output is a 2D tensor instead of a 1D tensor. This allows the network to learn not only the response ŷ, but also the variance σ².
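A sketch of this architecture in tf.keras, under the same assumptions about layer sizes as before; the two-column output and the absence of dropout come from the list above, and aleatoric_loss is the custom loss sketched in the previous section.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(1,))
h = layers.Dense(128, activation="relu")(inputs)   # no dropout between hidden layers
h = layers.Dense(128, activation="relu")(h)
outputs = layers.Dense(2)(h)   # column 0: mean y_hat, column 1: log variance

aleatoric_model = tf.keras.Model(inputs, outputs)
# aleatoric_loss is the custom heteroscedastic loss defined above
aleatoric_model.compile(optimizer="rmsprop", loss=aleatoric_loss)
```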


The learned loss attenuation forces the network to find weights and variances that minimize the loss during training.

Inference for aleatoric uncertainty is done without dropout. The result below confirms our expectation: the aleatoric uncertainty is higher for data on the left than on the right. The left region has noisy data due to a sensor error around x=-2.5. Adding more samples wouldn't fix the problem; noise would still be present in that region. By including aleatoric uncertainty in the loss function, the model predicts with less confidence for test data falling in regions where the training samples were noisy.


The model trained with the aleatoric loss detects regions with noisy training data. This helps in predicting with higher aleatoric uncertainty in these regions.

Measuring aleatoric uncertainty can become crucial in computer vision. Such uncertainty in images can be attributed to occlusions when cameras can’t see through objects. Aleatoric uncertainty can also be caused by over-exposed regions of images or the lack of some visual features.

Epistemic and aleatoric uncertainty can be summed to give the total uncertainty. Including the total level of uncertainty in the predictions of a self-driving car can be very useful.
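A sketch of one common way to combine the two, following the decomposition used in the uncertainty literature (e.g., Kendall and Gal). It assumes a model that predicts a mean and a log variance per input and is run with Monte Carlo dropout; the function and argument names are ours.

```python
import numpy as np

def decompose_uncertainty(mc_means, mc_log_vars):
    """Combine uncertainties from T stochastic forward passes of a model
    that predicts (mean, log variance) for every input.
    mc_means, mc_log_vars: arrays of shape (T, N)."""
    epistemic = mc_means.var(axis=0)               # spread of the MC-dropout means
    aleatoric = np.exp(mc_log_vars).mean(axis=0)   # average predicted sigma^2
    return epistemic, aleatoric, epistemic + aleatoric
```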


Image by Alex Kendall, University of Cambridge, on arXiv: aleatoric and epistemic uncertainty for semantic segmentation in computer vision. Aleatoric uncertainty (d) captures object boundaries where labels are noisy due to occlusion or distance. Epistemic uncertainty (e) highlights regions where the model is unfamiliar with image features, such as an interrupted footpath.

Conclusion

In this article, we demonstrated how using dropout at inference time is equivalent to doing a Bayesian approximation for assessing uncertainty in deep learning predictions.

Knowing how confident a model is with its predictions is important in a business context. Uber has been using this technique to assess uncertainty in time-series predictions .

Properly including uncertainty in machine learning can also help to debug models and make them more robust against adversarial attacks. TensorFlow Probability offers probabilistic modeling as an add-on for deep learning models.

You can read further through my article about responsible data science and see what can go wrong when we trust our machine learning models a little too much. This comprehensive introduction to deep learning and practical guide to Bayesian inference can help deepen and challenge classical approaches to deep learning.

Thanks to Anne Bonner from Towards Data Science for her editorial notes.

Stay safe in uncertain times.

