A, B, Cs… of Deep Learning Hyperparameters

Deep learning is in the news because of its accuracy and the control we have over our models. Programming libraries such as TensorFlow, Keras, Caffe, and many others have simplified the work of programming for deep learning. We no longer have to worry about backpropagation steps, weight updates, and so on; we just have to tune hyperparameters.

However, people in deep learning know it is not a ‘just’ thing. Along with high prediction accuracy, deep learning comes with lots of hyperparameters, which practitioners set according to their task or the kind of problem on which they are applying deep learning models.

AI is a boon to the world but can be a curse to you if you do not use it wisely. And to use it wisely, you need to understand it precisely.

So to “understand them precisely” we need to answer three Ws for each hyperparameter, i.e. when, where, and in what context. Before that, let us list them for a smooth start.

We are going to discuss the three Ws of the following hyperparameters:

Learning Rate
Number of hidden units and Number of hidden layers
β (Gradient Descent with momentum)
β1, β2, and ϵ (Adam optimizer)
Learning rate decay
Mini-batch size (Mini-Batch Gradient descent)

1. Learning Rate

When you write the most fundamental form of gradient descent and update the parameters, this is where the learning rate appears:

Steps of Gradient Descent (Published by Author)

The learning rate decides how long a jump you take in each iteration of gradient descent. Generally, the learning rate is between 0.001 and 1, and it can vary as you progress towards the minimum of the function.
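To make the role of the learning rate concrete, here is a minimal sketch of the update rule on a toy one-parameter cost. The cost function, the starting point, and the three learning rates are assumptions chosen for illustration; they are not taken from the figure above.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeat the update w = w - learning_rate * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# Hypothetical toy cost: J(w) = (w - 3)^2, so dJ/dw = 2 * (w - 3)
def grad(w):
    return 2.0 * (w - 3.0)

w_small = gradient_descent(grad, w0=0.0, learning_rate=0.01, steps=50)  # short jumps
w_good = gradient_descent(grad, w0=0.0, learning_rate=0.1, steps=50)    # moderate jumps
w_large = gradient_descent(grad, w0=0.0, learning_rate=1.1, steps=50)   # jumps too long

print(w_small, w_good, w_large)
```

On this toy cost, the small rate is still far from the minimum at w = 3 after 50 steps, the moderate rate gets very close, and the large rate overshoots further on every step and moves away from the minimum.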

2. Number of hidden layers and hidden units

Another important hyperparameter is the number of hidden layers and the number of hidden units in each hidden layer. It decides how complex a function the model can fit to the given data points. The more hidden units and hidden layers, the more complex the function outlined by the model, and hence the greater the chance of overfitting.

Frequently, people use lots of hidden layers and hidden units to build a deep neural network and apply techniques such as L2 or dropout regularization, which reduce the risk of overfitting the data.

There is no prescribed way of determining the correct or optimal number of layers; you have to start with some minimum number and increase it until you reach a desirable predictive accuracy. That is why applied machine learning is a highly iterative process.

Another factor is your hardware, CPU or GPU, which limits the number of hidden layers and units you can use. Because it is a highly iterative process, you want the results of each iteration promptly, and for that you need high computational power.
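One rough way to see how layers and units add complexity and computational cost is to count the trainable parameters of a fully connected network. The sketch below assumes plain dense layers with one bias per unit; the layer sizes are hypothetical examples.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the units per layer, input first and output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total

small = count_parameters([20, 8, 1])            # one hidden layer of 8 units
large = count_parameters([20, 64, 64, 64, 1])   # three hidden layers of 64 units

print(small, large)
```

The second network has far more parameters to fit and to update in every iteration, which is why it both overfits more easily and needs more compute.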

3. Gradient Descent with a Momentum

What is a Weighted Average?

Suppose we recorded the temperature over 200 days of summer (scatter distribution in yellow). The additional curves represent the weighted average of the temperature for different weights (beta).

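## CODE FOR WEIGHTED AVERAGE
## temp: list of temperatures of 200 days (assumed to be defined)
## beta: weight between 0 and 1 (assumed to be defined)
v = [0.0] * len(temp)     ## one averaged value per day
v[0] = temp[0]            ## start from the first reading, not from zero
for i in range(1, len(temp)):
    v[i] = beta * v[i-1] + (1 - beta) * temp[i]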

Temperature distribution of 200 days of Delhi, India (Published by Author)

Note: we have initialized v[0] = temp[0] to ensure that the weighted average temperature remains well within the domain of actual distribution.

How can we use this?

Suppose we have the following type of cost function and our gradient descent is working well with it.

Batch Gradient Descent on a Cost function (Published by Author)

But when we have ample data to train on, gradient descent spends a large amount of time oscillating on its way to the minimum of the cost function. When we apply a weighted average to the gradients, it averages out the values in both directions. Consequently, the vertical components largely cancel each other out and we obtain more momentum in the horizontal direction.

Gradient Descent with a Momentum on a Cost function (Published by Author)

This is called gradient descent with momentum. Generally, β ranges from 0.9 to 0.99 and we use a log scale to find the optimal value for this hyperparameter.

## UPDATE STEP OF GRADIENT DESCENT
W = W - learning_rate * dW
b = b - learning_rate * db

## CHANGE IN UPDATE STEP OF GRADIENT DESCENT
## WEIGHTED AVERAGE form of GRADIENT DESCENT
VdW = beta * VdW + (1 - beta) * dW    ## weighted avg of the weight gradients
Vdb = beta * Vdb + (1 - beta) * db    ## weighted avg of the bias gradients

## Updating weights and biases
W = W - learning_rate * VdW
b = b - learning_rate * Vdb
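The snippet above needs gradients from a real model, so here is a small self-contained sketch that applies the same weighted-average update to a toy two-parameter cost. The cost function, starting point, learning rate, and step count are all assumptions chosen so that plain gradient descent zigzags along the steep direction.

```python
def grad(w):
    # Hypothetical cost J = 0.5 * (w[0]**2 + 25 * w[1]**2):
    # shallow along w[0], steep along w[1]
    return [w[0], 25.0 * w[1]]

def descend(start, learning_rate, beta, steps):
    """Return the list of points visited; beta = 0 gives plain gradient descent."""
    w = list(start)
    v = [0.0, 0.0]                      # weighted average of past gradients
    path = [list(w)]
    for _ in range(steps):
        g = grad(w)
        v = [beta * vi + (1 - beta) * gi for vi, gi in zip(v, g)]
        w = [wi - learning_rate * vi for wi, vi in zip(w, v)]
        path.append(list(w))
    return path

plain = descend([10.0, 1.0], learning_rate=0.07, beta=0.0, steps=5)
momentum = descend([10.0, 1.0], learning_rate=0.07, beta=0.9, steps=5)

# Steep coordinate w[1] over the first few steps
print([round(p[1], 3) for p in plain])
print([round(p[1], 3) for p in momentum])
```

With these assumed values, the plain run flips the sign of w[1] on every step, while the momentum run changes sign only once in the same five steps. That is the smoothing of the zigzag described above.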

4. Adam Optimizer

Batch Normalization

It is observed that when we normalize the input data, training becomes faster. So why not normalize the input to every hidden layer? This technique is called batch normalization. Usually, we normalize the values before passing them into the activation function.

But if we train on pictures of cats like those shown below, there is a chance that our model will not predict correctly.

(1) Photo by Kazuky Akayashi on Unsplash, (2) Photo by Lamna The Shark on Unsplash, (3) Photo by Daria Shatova on Unsplash, (4) Photo by Tran Mau Tri Tam on Unsplash

This is because all the cats in the training dataset are black and the picture in the test dataset is of a white cat. Mathematically, the data distributions of the test and training datasets are different.

Normalization is a technique by which we obtain the same data distribution each time we provide input. The general steps of normalization are:

Steps for normalizing input data (Published by Author)

These steps give the data zero mean and unit standard deviation. This particular distribution may not always work. There may be situations where we need a distribution with a different mean and spread. Hence, we need to be able to adjust the distribution parameters, i.e. the mean and the deviation.

Additional steps for adjusting the central tendency for normalized data (Published by Author)

Step 4 allows us to adjust the distribution parameters. Also, in step 3 we have added an epsilon term just to ensure that the denominator is never equal to zero.
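The four steps can be sketched in a few lines. This is a minimal sketch that assumes the usual batch normalization layout (mean, variance, normalize with epsilon, then scale and shift); the names gamma and beta for the adjustable parameters are assumptions, and this beta is unrelated to the momentum β above.

```python
import math

def batch_norm(z, gamma=1.0, beta=0.0, epsilon=1e-8):
    """Normalize a list of pre-activation values, then rescale and shift."""
    m = len(z)
    mean = sum(z) / m                                   # step 1: batch mean
    variance = sum((zi - mean) ** 2 for zi in z) / m    # step 2: batch variance
    z_norm = [(zi - mean) / math.sqrt(variance + epsilon)
              for zi in z]                              # step 3: normalize
    return [gamma * zn + beta for zn in z_norm]         # step 4: scale and shift

standard = batch_norm([1.0, 2.0, 3.0, 4.0])                       # mean 0, unit deviation
shifted = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=5.0)   # mean 5, deviation 2
constant = batch_norm([3.0, 3.0, 3.0])                            # zero variance, still safe
```

In a real network gamma and beta are learned during training rather than fixed by hand, and the constant-input case shows why epsilon is there: without it the division would fail.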

5. Mini Batch Size

Process of one iteration of gradient descent:

Process of Batch-gradient Descent (Published by Author)

When the training set is very large, say around 5 million examples, the time required to update the parameters once will be large. So, as a remedy, we divide our large dataset into smaller datasets, and after each iteration over one of these smaller datasets we update our parameters.

Process of Mini-batch-gradient Descent (Published by Author)

Here, we have divided the training set into three smaller sets, but there are some norms for doing that. People recommend making the batch size a power of two, e.g. 64, 128, or 1024 examples in a mini-batch. This somewhat optimizes memory allocation and hence the performance of the model.

The completion of one cycle through the whole dataset is called one epoch. When we use batch gradient descent (simple gradient descent) we update the parameters once per epoch, but in mini-batch gradient descent the parameters are updated multiple times during an epoch.
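The difference in the number of updates per epoch is easy to check with a small sketch. The dataset size of 1000 and the helper name make_mini_batches are hypothetical; shuffling before splitting, which is common in practice, is left out to keep the example short.

```python
def make_mini_batches(examples, batch_size):
    """Split a list of examples into consecutive mini-batches."""
    return [examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)]

examples = list(range(1000))      # stand-in for 1000 training examples
batches = make_mini_batches(examples, batch_size=64)

updates_per_epoch_batch_gd = 1                 # one update from the full set
updates_per_epoch_mini_batch = len(batches)    # one update per mini-batch

print(updates_per_epoch_batch_gd, updates_per_epoch_mini_batch, len(batches[-1]))
```

Note that the last mini-batch is smaller than the others whenever the dataset size is not a multiple of the batch size.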

6. Learning Rate decay

Taking large values of the learning rate saves time, but there is a chance that we never reach a minimum. Conversely, if we take a small learning rate, learning is slow. So, why can’t we vary the learning rate while training the model?

As training approaches convergence (the minimum of the cost function), the learning rate can be slowed down for better results. A very general form of the learning rate, in this case, may be shown as

Implementing Learning rate decay (Published by Author)

In the example, we can observe how the learning rate varies. It now becomes important to choose the “decay_rate” wisely, so it can be called another hyperparameter.
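One common schedule that uses a decay_rate is inverse-time decay, sketched below. The exact formula in the original figure is not reproduced here, so treat this form and the initial rate of 0.2 as assumptions for illustration.

```python
def decayed_learning_rate(initial_rate, decay_rate, epoch):
    """Inverse-time decay: the rate shrinks as the epoch number grows."""
    return initial_rate / (1.0 + decay_rate * epoch)

# Learning rate for the first four epochs with hypothetical settings
schedule = [decayed_learning_rate(0.2, decay_rate=1.0, epoch=e) for e in range(4)]
print(schedule)
```

A larger decay_rate makes the learning rate fall faster, so training takes shorter steps sooner; a value of zero keeps the learning rate constant.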

Conjectures…

It is very important to understand the details of every parameter and hyperparameter when you train a model with a dream that it will bring some change to the world and society. Even a small thing can lead to a large innovation.

With that, I hope this article adds something for you. Please share your thoughts because

“Criticism is an indirect form of self-boasting.” – Emmet Fox
