
How to do Bias-Variance Tradeoff the Right Way

Source: https://towardsdatascience.com/how-to-do-bias-variance-tradeoff-the-right-way-in-machine-learning-a892e8b5d7aa

Learn how to evaluate the performance of your models

[Cover image source: https://unsplash.com/photos/zBsXaPEBSeI]

One of the most common decisions that data scientists and machine learning experts have to face daily is how to go about validating their models.

Ask any data engineer about validation and they will instantly start dropping names like overfitting, underfitting, and the bias-variance tradeoff. If you are one of those people, you probably want to skip directly to the section Optimizing your validation workflow.

For those of you who are new to data science, bear with me for a second.

In general, when we want to make predictions based on sampled data, our goal is to build a model that is as accurate as possible. However, when testing our model’s predictions against real data, we will always find some error. That error typically comes in two flavors: overfitting and underfitting. Let’s see:

  • Overfitting: caused by an excess of complexity in your model. The model captures your sample data very well but doesn’t generalize to real data that was not contained in your training set. These models have low bias and high variance.
  • Underfitting: caused by a lack of complexity in your model. It occurs when the error is mostly due to high bias, while the variance is low.

To explain the nature of these two concepts it is helpful to remember this simple definition of the total error:

The total error is the sum of the squared bias, the variance, and the noise of the model. Expressed as a formula:

$$\mathrm{Err}(x) = \mathbb{E}\big[(f(x) - \hat{f}(x))^2\big] = \mathrm{Bias}\big(\hat{f}(x)\big)^2 + \mathrm{Var}\big(\hat{f}(x)\big) + \sigma^2$$

Where:

Noise (σ²) is error from an unknown source; it is irreducible, so there is nothing we can do to decrease it.

Bias is error caused by overly simple assumptions in your model, as opposed to variance, which is caused by overly complex assumptions, i.e. paying “too much attention” to the sample data.

This is quite a simple formula, but it tells us a lot.

For example, if we hold the total error constant and cannot act on the noise, it follows that we cannot decrease both bias and variance at the same time. So, no free lunch here. Unless you obtain new, high-quality training data that brings fresh information to the model, it is not possible to shrink the total error. You are faced with the popular bias-variance dilemma: you can pick one of the two to decrease, not both.

[Figure: The intersection of variance and squared bias minimizes the total error. Source: Author.]
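
If you want to see the decomposition in action, here is a minimal sketch (hypothetical, not taken from the article) that estimates the squared bias and the variance empirically by refitting a model on many resampled training sets. The true function sin(x), the noise level, and the polynomial degrees are all illustrative assumptions:

```python
# A hypothetical simulation of the bias-variance decomposition.
# Assumption: we know the true function f(x) = sin(x), so bias^2 and
# variance can be measured directly by refitting on many training sets.
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                          # noise level, so sigma^2 = 0.09 is irreducible
x_test = np.linspace(0, np.pi, 50)

for degree in (1, 10):               # a too-simple and a too-complex model
    preds = []
    for _ in range(200):             # resample many training sets
        x = rng.uniform(0, np.pi, 30)
        y = np.sin(x) + rng.normal(0, sigma, x.size)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    preds = np.asarray(preds)
    bias2 = np.mean((preds.mean(axis=0) - np.sin(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree={degree:2d}  bias^2={bias2:.4f}  variance={variance:.4f}")
```

The degree-1 fit should show a larger bias² term and the degree-10 fit a larger variance term, matching the two curves in the figure above.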

Good, so far we have laid out the basic concepts we need for model validation. But enough of the basics.

How does a true expert make a difference to improve the performance of the model?

Optimizing your validation workflow

The best practice is to use the following workflow:

Data Preparation -> Algorithm Selection -> Model Fine-Tuning

This process is well ingrained in the mind of any data engineer.

Give any data scientist some supervised data and they will quickly split it into three datasets: training, test, and validation (a common percentage allocation is 50–25–25, respectively).
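
A minimal sketch of that 50–25–25 split, assuming scikit-learn and using the built-in iris data purely as a stand-in for your own X and y:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

# First carve out 50% for training, then split the remaining half evenly.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(f"{len(X_train)} train, {len(X_test)} test, {len(X_val)} validation")
```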

After that, the most common procedure is to do some resampling with cross-validation: split the data into k folds, then iteratively train your algorithm on k−1 folds while using the remaining fold as the test set, so that every fold gets a turn as the test set.
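
In scikit-learn this is a one-liner; a minimal sketch, where the choice of k=10, the iris data, and logistic regression are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cross_val_score trains on k-1 folds and scores on the held-out fold, k times.
scores = cross_val_score(model, X, y, cv=10)
print(f"mean accuracy over 10 folds: {scores.mean():.3f} (+/- {scores.std():.3f})")
```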

Now, you might wonder: why split your data 50–25–25 and not 60–20–20, or 40–30–30? Or, for instance, how do you pick k=10 folds over k=100 folds? You can gain some intuition about what works well with experience, but nothing will save you from experimenting with different values.

So, if you are going to do something a thousand times, I would opt for automation. Write a piece of code in your language of choice that runs experiments with several partition sizes and k values, and choose the combination that best suits your problem.
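
A minimal sketch of that kind of automation, sweeping a few candidate fold counts (the values, dataset, and model are arbitrary illustrative choices; the same loop extends naturally to partition sizes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Compare how stable the performance estimate is for each fold count.
for k in (3, 5, 10, 15):
    scores = cross_val_score(model, X, y, cv=k)
    print(f"k={k:2d}  mean={scores.mean():.3f}  std={scores.std():.3f}")
```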

Once you have handled your supervised data the right way, it is time to select the optimal algorithm for this particular problem. Most of the time, data scientists will see a problem and quickly come up with one or several algorithms to build a model.

But what if you overlooked another algorithm that could perform better?

Again, this is something you are going to do many times, so keep some code around that runs the several algorithms that usually apply to the same kind of problem and selects the one with optimal complexity for your particular case. You want to be as close as possible to the above-mentioned golden point, where only noise is left and the total error is minimized.
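
A minimal sketch of such a benchmark, evaluating a shortlist of candidates under identical cross-validation (the shortlist and dataset are illustrative assumptions, not a recommendation):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}

# Score every candidate on the same folds, then rank them.
results = {name: cross_val_score(m, X, y, cv=10).mean() for name, m in candidates.items()}
for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {score:.3f}")
```

Passing an integer cv keeps the folds deterministic, so every candidate is scored on exactly the same splits and the comparison stays fair.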

Congratulations, you already have your error-optimized model and some results to show.

Now, it’s time to remember the bias-variance dilemma above and ask yourself: is my error mostly caused by variance or by bias? Is my model overfitting or underfitting?

Depending on which of the two is causing the error, you will want to tune your model’s parameters in different ways. For example, here are some common levers for the most popular algorithms:

  • If you are using decision trees, you can create a random forest. By building many trees, in effect a “forest”, and averaging them, the variance of the final model can be greatly reduced compared with that of a single tree.
  • If you are using K-Nearest Neighbors, try increasing K to decrease variance and increase bias, and vice versa (illustrated in the sketch after this list).
  • If you are using Support Vector Machines, the C parameter controls how many violations of the margin are allowed in the training data. In the textbook formulation where C is a budget for margin violations, increasing it increases bias but decreases variance; the reverse is also true. (Careful: in libraries such as scikit-learn, C is the inverse of the regularization strength, so the effect goes the other way.)
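
A minimal sketch of the K-Nearest Neighbors lever, assuming scikit-learn and the iris stand-in data; the candidate K values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Small K: flexible model (low bias, high variance).
# Large K: smoother model (high bias, low variance).
for k in (1, 5, 15, 45):
    model = KNeighborsClassifier(n_neighbors=k)
    train_acc = model.fit(X, y).score(X, y)              # optimistic, in-sample
    cv_acc = cross_val_score(model, X, y, cv=10).mean()  # honest, out-of-sample
    print(f"K={k:2d}  train={train_acc:.3f}  cv={cv_acc:.3f}")
```

A large gap between training and cross-validation accuracy at small K is a variance (overfitting) symptom; low scores on both at large K point to bias (underfitting).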

And that’s it: you have now learned how to optimize your validation process like a true expert machine learning engineer.

Thanks for reading!

Happy coding :)

