
Bias-Variance Tradeoff — Fundamentals of Machine Learning

source link: https://towardsdatascience.com/bias-variance-tradeoff-fundamentals-of-machine-learning-64467896ae67?gi=6c779c927f6b

The bias-variance tradeoff is the fundamental design decision of machine learning. It is the tradeoff between model capacity and the variance of predictions: we have to decide in which way our model will be wrong. This article explains how this tradeoff relates to the No Free Lunch theorem and shows how it arises from probability distributions.

All models are wrong, but some are useful — George Box

[Image: Playa Hermosa, Costa Rica. Source: the author.]

No free lunch

Whenever an optimization, program, or algorithm gains information or specialization in one region of interest, it loses ability elsewhere. The idea that you cannot gain anything without cost is the No Free Lunch theorem. It is crucial to machine learning in general and to the bias-variance tradeoff in particular.

[Figure: Possible distributions over a target space. No model can achieve perfect accuracy everywhere; the capacity is spread differently. From most to least specialized: green, yellow, red, blue.]

Model Accuracy as Distributions

Mapping model accuracy to a probability density function over the target space is a great mental model. It gives us the fact that the integral of any model, like any distribution, must equal 1, or some constant fixed by the design space. The figure above sketches how different models can predict in a target space.

Capacity In Neural Networks

The No Free Lunch theorem applies to neural networks as well. A network with finite capacity (finite layers and neurons) has finite predictive power. Changing the training method on a fixed network changes where the outputs are focused. A network has a finite total capacity: the integral of its accuracy curve is constant. A model can gain accuracy in one area (the specific testing distribution), but it will lose ability in other areas.

When training networks, engineers normally search for the minimum of the test error, the point at which predictions on unseen data are most accurate. Maximizing accuracy on unseen elements amounts to flattening the distribution of prediction accuracy (larger coverage, less specificity).
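As a sketch of this search for minimal test error, the snippet below fits polynomials of increasing capacity to noisy samples of a sine curve (an illustrative target and noise level, not from the article) and evaluates each on held-out data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative setup: noisy samples of sin(2*pi*x) with a held-out test set
f = lambda x: np.sin(2 * np.pi * x)
x_train = rng.uniform(0, 1, 30)
y_train = f(x_train) + rng.normal(0, 0.2, 30)
x_test = rng.uniform(0, 1, 200)
y_test = f(x_test) + rng.normal(0, 0.2, 200)

test_err = {}
for degree in (0, 3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    resid = np.polyval(coeffs, x_test) - y_test
    test_err[degree] = np.mean(resid**2)  # generalization error per capacity

print(test_err)
```

Too little capacity underfits and too much overfits; the test error is typically lowest at an intermediate capacity.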

The design decision is specificity versus generalizability. How does this play out numerically?

Bias Variance Tradeoff

Again, core to useful machine learning models is the inverse tradeoff between the underlying structure of a model and the resulting variation in downstream predictions. Numerically, this relationship we have introduced is known as the bias-variance tradeoff.

Numerically

Notation for the predictive model: f̂(x) is the fitted model, which estimates the true function f(x) from observations y = f(x) + ε.

If we look at the accuracy of a predictor f̂ ("f-hat") over a dataset, a useful equation emerges. Consider how the model fit differs from the true data, f, plus noise, ε (y = f + ε).

Mean squared error of a predictor: MSE = E[(y − f̂(x))²].

Via the rules and axioms of probability, this error decomposes into the numerical bias-variance tradeoff. First, the definitions of the bias and variance of a model.

Bias: how well the model's structure can match the true underlying function.

Bias measures structure. It is the difference between the mean value of the model over the dataset and the true values: Bias[f̂(x)] = E[f̂(x)] − f(x). Think of fitting a line to noisy data from an unknown y = mx + b: we can add terms such as an offset or slope to get closer to the data. A model of the form y = cx or y = c will always be far from the truth (high bias), while adding terms, as in y = cx + d + ex², may lower the bias at a cost.
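To make this concrete, here is a minimal sketch (assuming a hypothetical ground truth y = 2x + 1 with Gaussian noise) comparing the bias of a constant model against a line fit by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: y = 2x + 1 plus Gaussian noise
x = np.linspace(0, 1, 200)
y_true = 2 * x + 1
y = y_true + rng.normal(0, 0.1, x.size)

# Constant model y = c: its structure cannot follow the trend (high bias)
c = y.mean()
bias_constant = np.mean(np.abs(c - y_true))

# Linear model y = mx + b: structure matches the truth (low bias)
m, b = np.polyfit(x, y, 1)
bias_linear = np.mean(np.abs(m * x + b - y_true))

print(bias_constant, bias_linear)
```

The constant model's mean deviation from the truth stays large no matter how c is chosen, while the line can drive it close to zero.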

Variance: how much the model's predictions change from a small change in the data used to train it.

Variance carries the uncertainty in the model: how much will a small change in the data change the prediction? It is Var[f̂(x)] = E[f̂(x)²] − E[f̂(x)]². Consider again the last example. The biased solutions y = cx or y = c will change very little under a perturbation of the data. But when we add higher-order terms to further lower the bias, the risk of variance increases. Intuitively, one can see this in the equation: the left term takes the expectation of the square of the model, while the right term subtracts the squared mean to account for offsets.
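The same kind of toy setting can show the variance side. The sketch below (again with an illustrative sine target) refits a low-capacity and a high-capacity polynomial on many resampled datasets and compares how much their predictions at a single point scatter:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_predict(degree, x0=0.5, n=20):
    # Resample a small noisy dataset and predict at x0
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x0)

# Scatter of predictions across 200 resampled datasets
preds_low = [fit_predict(degree=1) for _ in range(200)]
preds_high = [fit_predict(degree=9) for _ in range(200)]

print(np.var(preds_low), np.var(preds_high))
```

The degree-9 model tracks the noise in each resample, so its predictions vary far more between refits.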

After a few steps (omitted), we arrive at the equation we want; the derivation is on Wikipedia. Note that σ is the standard deviation of the original function noise ε.

The bias-variance tradeoff: E[(y − f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ².

No matter how the model is changed, the bias and variance terms have an inverse relationship. There is a point where error is minimized over the training dataset D, but there is no guarantee that the dataset perfectly mirrors the real world.
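The decomposition can be checked numerically. The Monte Carlo sketch below (illustrative target and noise level, not from the article) estimates the bias, variance, and noise terms separately and compares their sum to the measured mean squared error:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.3                          # std of the noise eps
f = lambda x: np.sin(2 * np.pi * x)  # illustrative true function
x0 = 0.5                             # point where the predictor is evaluated

# Refit a cubic on many independent noisy datasets
preds = []
for _ in range(5000):
    x = rng.uniform(0, 1, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    preds.append(np.polyval(np.polyfit(x, y, 3), x0))
preds = np.array(preds)

bias_sq = (preds.mean() - f(x0)) ** 2
var = preds.var()
# Measured error against fresh noisy observations y0 = f(x0) + eps
mse = np.mean((preds - (f(x0) + rng.normal(0, sigma, preds.size))) ** 2)

print(bias_sq + var + sigma**2, mse)
```

The two printed numbers agree up to Monte Carlo error, matching E[(y − f̂)²] = Bias² + Var + σ².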

An Example

Consider an example, from Wikipedia, where we are trying to fit data points to a model.

[Figure: Sampled data and the true function.]

The underlying function (red) is sampled with noise. We then want to fit the sampled points with approximations. The model is constructed from radial basis functions (blue, below). From left to right, the model gains terms and capacity (there are multiple lines because multiple models are trained on different subsets of the data). The models on the left clearly have higher bias: similar structure, but little variation between refits. Towards the right, the variance rises.

[Figure: Different model fits. From left to right, an increasing number of terms in the model is used. Each model is trained on a different subset of the sampled points. This is a visualization of the bias-variance tradeoff. Source: Wikipedia.]

This change across models is the bias-variance tradeoff.
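A rough version of this experiment can be reproduced in code. The sketch below (a stand-in for the article's figures, with an assumed sine target and Gaussian RBF width) refits models with 2, 6, and 20 basis functions on different noisy samples and measures how much the fitted curves spread between refits:

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf_design(x, centers, width=0.1):
    # Gaussian radial basis features, one column per center
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width**2))

f = lambda x: np.sin(2 * np.pi * x)  # stand-in for the true function
grid = np.linspace(0, 1, 100)

spreads = {}
for n_centers in (2, 6, 20):
    centers = np.linspace(0, 1, n_centers)
    fits = []
    for _ in range(50):
        # Each model sees a different noisy sample of the target
        x = rng.uniform(0, 1, 40)
        y = f(x) + rng.normal(0, 0.2, 40)
        w, *_ = np.linalg.lstsq(rbf_design(x, centers), y, rcond=None)
        fits.append(rbf_design(grid, centers) @ w)
    # Average pointwise standard deviation across the 50 refits
    spreads[n_centers] = np.std(fits, axis=0).mean()

print(spreads)
```

As capacity grows, the refitted curves diverge from each other: the variance half of the tradeoff, just as in the figures above.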

