
Introduction to Artificial Neural Networks


Practical Aspects Of Back-Propagation

Activation Function

The main purpose of an activation function is to convert the input signal of a node in an ANN into an output signal. A neural network without an activation function is just a linear regression model, so to learn complex, non-linear relationships we need activation functions.
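
As a minimal illustration (a sketch in NumPy, which the article itself does not use), composing two linear layers without an activation collapses into a single linear model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass without any activation function.
h = W1 @ x + b1
y = W2 @ h + b2

# The same result from a single equivalent linear model.
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
y_linear = W_eq @ x + b_eq

print(np.allclose(y, y_linear))  # True: two linear layers are still just one linear map
```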

Properties that an activation function should have:

  • Non-Linear: To produce non-linear mappings from input to output, the activation function itself must be non-linear.
  • Saturating: A saturating activation function squeezes the input into a limited output range, so no single weight can have an outsized impact on the final output (illustrated in the sketch after this list).
  • Continuous and Smooth: Smoother functions generally give better results with gradient-based optimization. Since the input to a node takes a continuous range of values, the output should as well.
  • Differentiable: As we saw while deriving back-propagation, the derivative of f must be defined.
  • Monotonic: If the activation function is not monotonically increasing, an increase in a neuron's weight might cause it to have less influence, which is precisely the opposite of what we want.
  • Linear for small values: If the function is non-linear near zero, we have to be careful about the constraints on weight initialization, since otherwise we can face the vanishing or exploding gradient problem.
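
To make the saturation and near-linearity properties concrete, here is a small sketch (NumPy assumed) comparing the derivatives of sigmoid, tanh, and ReLU at small and large inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):          # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), at most 0.25
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):             # tanh'(x) = 1 - tanh(x)^2, maximum 1 at x = 0
    return 1.0 - np.tanh(x) ** 2

def d_relu(x):             # ReLU'(x) = 1 for x > 0, else 0 (scalar version)
    return 1.0 if x > 0 else 0.0

for x in (0.1, 2.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={d_sigmoid(x):.4f}  "
          f"tanh'={d_tanh(x):.4f}  ReLU'={d_relu(x):.1f}")

# Near zero, sigmoid and tanh behave almost linearly; for large |x| they
# saturate and their derivatives shrink towards 0, while ReLU's stays 1.
```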

Vanishing And Exploding Gradient

Vanishing gradient: As more layers using certain activation functions are added to a neural network, the gradient of the loss function tends towards zero, making the network hard to train.

As discussed earlier, saturating activation functions squeeze the input into a small range, so even a substantial change in the input produces only a small change in the output, and hence a small derivative.

ReLU is one activation function that does not suffer from the vanishing gradient problem, which is why most deep learning models use it.

But if you are still adamant about using tanh or sigmoid, you can go for batch normalization, which keeps each layer's inputs in the non-saturated region where the derivative is not small.
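
A rough numerical sketch of the effect (a hypothetical 50-layer chain, NumPy assumed): back-propagation multiplies one activation derivative per layer, and since each sigmoid derivative is at most 0.25 the product shrinks towards zero, while active ReLU units each contribute a factor of exactly 1:

```python
import numpy as np

def d_sigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
layers = 50
pre_activations = rng.normal(size=layers)              # toy pre-activation per layer

# Back-propagation multiplies one activation derivative per layer into the gradient.
sigmoid_factor = np.prod(d_sigmoid(pre_activations))   # each factor is at most 0.25
relu_factor = np.prod(np.ones(layers))                 # active ReLU units contribute exactly 1

print(f"sigmoid chain factor over {layers} layers: {sigmoid_factor:.3e}")  # vanishingly small
print(f"ReLU chain factor over {layers} layers:    {relu_factor:.1f}")     # stays 1.0
```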

Exploding gradient: Error gradients can accumulate during updates and result in very large gradients. In extreme cases, the weight values can overflow and become NaN. These NaN weights cannot be updated further, bringing the learning process to a halt.

There are many ways to deal with exploding gradients, such as gradient clipping (clip the gradient if its norm exceeds a particular threshold) and weight regularization (penalize the loss function for large weight values).
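
As an illustration, a minimal sketch of gradient clipping by global norm (NumPy, with a hypothetical threshold of 5.0):

```python
import numpy as np

def clip_by_norm(grads, max_norm=5.0):
    """Scale the list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# An "exploded" gradient gets rescaled; a small one is left alone.
print(clip_by_norm([np.array([300.0, -400.0])]))   # norm 500 -> rescaled to norm 5
print(clip_by_norm([np.array([0.3, 0.4])]))        # norm 0.5 < 5 -> unchanged
```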

Loss Functions

The cost function, or loss function, essentially measures the difference between the neural network's output and the target variable. Loss functions can be grouped into three categories:

Regressive Loss Function: When the target variable is continuous, regressive loss functions are used. The most commonly used is mean squared error; other examples include the absolute error and the smooth absolute error.
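
For example, a small NumPy sketch of these regression losses (the smooth absolute, i.e. Huber-style, error uses a hypothetical delta of 1.0):

```python
import numpy as np

y_true = np.array([2.0, 0.5, -1.0])
y_pred = np.array([2.5, 0.0, -3.0])

mse = np.mean((y_true - y_pred) ** 2)          # mean squared error
mae = np.mean(np.abs(y_true - y_pred))         # absolute error

# Smooth absolute (Huber) error: quadratic near 0, linear for large residuals.
delta = 1.0
residual = np.abs(y_true - y_pred)
huber = np.mean(np.where(residual <= delta,
                         0.5 * residual ** 2,
                         delta * (residual - 0.5 * delta)))

print(mse, mae, huber)
```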

Classification Loss Function: When the output variable is a probability over classes, we use a classification loss function. Most classification losses tend to maximize the margin between classes. Notable examples include categorical cross-entropy, negative log-likelihood, and the margin classifier loss.
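
A small sketch of categorical cross-entropy on softmax outputs (NumPy assumed):

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])       # raw network outputs for 3 classes
probs = softmax(logits)
target = np.array([1.0, 0.0, 0.0])       # one-hot encoded true class

# Categorical cross-entropy = negative log-likelihood of the true class.
cross_entropy = -np.sum(target * np.log(probs))
print(probs, cross_entropy)
```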

Embedding Loss Function: When we have to measure the similarity between two or more inputs, we use an embedding loss function. Widely used examples are the L1 hinge error and the cosine error.
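
And a brief sketch of the cosine error between two embedding vectors (NumPy assumed):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.9])

cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_error = 1.0 - cosine_similarity   # 0 when the embeddings point in the same direction

print(cosine_error)
```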

Optimization Algorithms

Optimization algorithms are responsible for updating the weights and biases of the neural network to reduce the loss. They can be primarily divided into two categories:

Constant Learning Rate Algorithms: The learning rate η is fixed for all parameters and weights. The most common example is stochastic gradient descent (SGD).
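
A minimal sketch of one SGD step with a constant learning rate (NumPy, with the gradient assumed to be already computed):

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """Vanilla SGD: every parameter uses the same constant learning rate."""
    return w - lr * grad

w = np.array([0.5, -1.2])
grad = np.array([0.1, -0.4])      # gradient of the loss w.r.t. w (assumed given)
w = sgd_step(w, grad)
print(w)
```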

Adaptive Learning Rate Algorithms: Adaptive optimizers like Adam maintain a per-parameter learning rate, which provides a more straightforward heuristic than tuning the learning rate hyper-parameter manually.
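
And a hedged sketch of the Adam update rule, showing how the per-parameter moment estimates adapt each parameter's effective step size (default hyper-parameters assumed; the gradient is again assumed given):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates give each parameter its own step size."""
    m = beta1 * m + (1 - beta1) * grad              # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2         # second moment (uncentred variance)
    m_hat = m / (1 - beta1 ** t)                    # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)     # per-parameter effective learning rate
    return w, m, v

w = np.array([0.5, -1.2])
m = np.zeros_like(w)
v = np.zeros_like(w)
grad = np.array([0.1, -0.4])                        # gradient of the loss w.r.t. w (assumed)
w, m, v = adam_step(w, grad, m, v, t=1)
print(w)
```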

