
Improving Deep Neural Networks

source link: https://towardsdatascience.com/improving-deep-neural-networks-d5d096065276?gi=a99705c0f015

Andrew Ng’s advice for Hyperparameter Tuning and Regularisation from his Deep Learning Specialisation course.

Photo by Pawel Kadysz

I have recently been going through Coursera’s Deep Learning Specialisation, designed and taught by Andrew Ng. The second sub-course is Improving Deep Neural Networks: Hyperparameter Tuning, Regularisation, and Optimisation. Before I started this sub-course I had already done all of those steps for traditional machine learning algorithms in my previous projects. I’ve tuned hyperparameters for decision trees such as max_depth and min_samples_leaf, and for SVMs tuned C, kernel, and gamma. For regularisation I have applied Ridge (L2 penalty), Lasso (L1 penalty), and ElasticNet (L1 and L2) to regression models. So I thought it would not be much more than translating those concepts over to neural networks. Well, I was somewhat right, but given how Andrew Ng explains the mathematics and visually represents the inner workings of these optimisation methods, I now have a much greater understanding at a fundamental level.

In this article I want to go over some of Andrew’s explanations of these techniques, accompanied by some mathematics and diagrams.

Hyperparameter Tuning

Here are a few popular hyperparameters that are tuned for deep networks:

  • α (alpha): learning rate
  • β (beta): momentum
  • number of layers
  • number of hidden units
  • learning rate decay
  • mini-batch size

There are others specific to particular optimisation techniques; for instance, the Adam optimiser has β1, β2, and ε.

Grid Search vs Random Search

Suppose we are tuning more than one hyperparameter for a model. One hyperparameter will probably have more influence on train/validation accuracy than another. In that case we want to try a wider variety of values for the more impactful hyperparameter, but at the same time we don’t want to run too many models, as that is time consuming.

For this example, let’s say we are tuning two different hyperparameters, α and ε. We know α is more important and should be tuned by trying out as many different values as possible, but we still want to try 5 different ε values as well. With a grid search, choosing 5 α values means running 25 different models, one for every combination of the 5 α and 5 ε values.

But we want to try more α values without increasing the number of models. Here is Andrew’s solution:

For this, we use a random search: we choose 25 random values for each of α and ε and pair them up, one pair per model. We still only run 25 models, but we get to try 25 different values of α instead of the 5 we tried in the grid search.


Left: Grid Search, Right: Random Search
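
As a rough sketch of the difference (the ranges and the train_and_evaluate call below are made up for illustration, not taken from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
n_models = 25

# Grid search: 5 alpha values x 5 epsilon values = 25 runs,
# but only 5 distinct values of alpha are ever tried.
grid_pairs = [(a, e)
              for a in np.linspace(0.001, 0.1, 5)
              for e in np.linspace(1e-8, 1e-6, 5)]

# Random search: still 25 runs, but 25 distinct values of each hyperparameter.
random_pairs = [(rng.uniform(0.001, 0.1), rng.uniform(1e-8, 1e-6))
                for _ in range(n_models)]

for alpha, eps in random_pairs:
    pass  # train_and_evaluate(alpha, eps)  # hypothetical training/validation run
```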

Bonus: using a coarse-to-fine approach can improve tuning further. This involves zooming in on the smaller region of hyperparameter values that performed best and then training more models within that region to tune those hyperparameters more precisely.
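
A minimal sketch of coarse-to-fine, continuing the same hypothetical setup (the window sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_models = 25

# Coarse stage: 25 random (alpha, epsilon) pairs and their validation scores.
# The scores here are random placeholders standing in for real results.
coarse_pairs = [(rng.uniform(0.001, 0.1), rng.uniform(1e-8, 1e-6))
                for _ in range(n_models)]
scores = rng.random(n_models)

# Zoom in on the best-performing pair...
best_alpha, best_eps = coarse_pairs[int(np.argmax(scores))]

# ...then fine stage: resample only within a narrow window around it.
fine_pairs = [(rng.uniform(0.5 * best_alpha, 2.0 * best_alpha),
               rng.uniform(0.5 * best_eps, 2.0 * best_eps))
              for _ in range(n_models)]
```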

Choosing a Scale

When trying out different hyperparameter values, choosing the right scale can be difficult, especially when you want to make sure you search thoroughly across a range of very large numbers and a range of very small numbers at the same time.

The learning rate is a hyperparameter that can vary enormously from model to model: it might lie between 0.000001 and 0.000002, or between 0.8 and 0.9. It is very hard to search fairly across such different ranges at once on a linear scale, but we can solve this by using a log scale.

Let’s say we are looking at values between 0.0001 and 1 for α. Using a linear scale means 10% of the attempted α values fall between 0.0001 and 0.1 and 90% between 0.1 and 1. This is bad, as we are not searching thoroughly across such a wide range of values. Using a log (base 10) scale instead, 25% of α values fall between 0.0001 and 0.001, 25% between 0.001 and 0.01, 25% between 0.01 and 0.1, and the final 25% between 0.1 and 1. This gives a thorough search over α: the range 0.0001 to 0.1 received 10% of the values on a linear scale but 75% on a log scale.


Left: Linear Scale, Right: Log Scale

Here is a little bit of mathematics with a numpy function to demonstrate how this works for a random value of α.

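A minimal sketch of the idea, for the 0.0001-to-1 range used above:

```python
import numpy as np

# Draw the exponent uniformly at random: r lies in [-4, 0].
r = -4 * np.random.rand()

# alpha = 10^r then lies in [0.0001, 1], with each decade
# (0.0001-0.001, 0.001-0.01, 0.01-0.1, 0.1-1) equally likely.
alpha = 10 ** r
```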

Regularisation

Overfitting, caused by high variance, can be a huge problem for models. It can be addressed by getting more training data, but that is not always possible, so a great alternative is regularisation.

L2 Regularisation (‘Weight Decay’)

Regularisation utilises one of two penalty terms, L1 or L2; with neural networks, L2 is predominantly used.

We must first look at the cost function for a neural network:

Cost Function
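
In the notation Andrew uses, for an L-layer network trained on m examples the (unregularised) cost is:

J(w^{[1]}, b^{[1]}, \ldots, w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)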

And then add the L2 penalty term, which includes the Frobenius Norm:


L2 penalty term, which includes the Frobenius Norm
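
Written out, the regularised cost adds the squared Frobenius norm of every weight matrix, i.e. the sum of the squares of all its entries:

J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2m} \sum_{l=1}^{L} \lVert w^{[l]} \rVert_F^2,
\qquad
\lVert w^{[l]} \rVert_F^2 = \sum_{i} \sum_{j} \big(w_{ij}^{[l]}\big)^2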

With L2 regularisation the weight is reduced not only by the learning rate times the gradient from backpropagation but also by the middle term, which includes the regularisation hyperparameter λ (lambda). The larger λ is, the smaller w becomes.


Weight Decay
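
The gradient-descent update then multiplies each weight matrix by a factor slightly less than 1 on every step, which is where the name “weight decay” comes from (here dw^{[l]}_{\text{backprop}} denotes the gradient from backpropagation alone):

w^{[l]} := w^{[l]} - \alpha \left( dw^{[l]}_{\text{backprop}} + \frac{\lambda}{m} w^{[l]} \right) = \left( 1 - \frac{\alpha \lambda}{m} \right) w^{[l]} - \alpha \, dw^{[l]}_{\text{backprop}}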

How does regularisation prevent overfitting?

We see that L2 regularisation uses the λ penalty to reduce the weights w, but how does this reduce variance and prevent overfitting of the model?

As λ rises, w falls, changing the magnitude of z.

Since z = wa + b, if w is small the magnitude of z will drop too: a large positive z becomes smaller and a large negative z becomes larger, both moving closer to 0. Passing z through the activation function then has a more linear effect, because the tanh curve is close to linear near 0.
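
A quick numerical check of that last point (assuming a tanh activation, as in the example):

```python
import numpy as np

z_small = np.array([-0.10, -0.05, 0.05, 0.10])
z_large = np.array([-3.0, -2.0, 2.0, 3.0])

# Near 0, tanh(z) is almost exactly z, i.e. the unit behaves linearly.
print(np.tanh(z_small))  # approx. [-0.0997, -0.0500, 0.0500, 0.0997]

# For large |z| it saturates, so the unit behaves highly non-linearly.
print(np.tanh(z_large))  # approx. [-0.995, -0.964, 0.964, 0.995]
```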

