
Introduction to Artificial Neural Networks


Practical Aspects Of Back-Propagation

Activation Function

The main purpose of an activation function is to convert the input signal of a node in an ANN into an output signal. A neural network without an activation function is just a linear regression model, so to learn complex, non-linear relationships we need activation functions.
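
As a minimal illustration (a sketch in NumPy, which the article itself does not use), composing two linear layers without an activation collapses into a single linear model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass without any activation function.
h = W1 @ x + b1
y = W2 @ h + b2

# The same result from a single equivalent linear model.
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
y_linear = W_eq @ x + b_eq

print(np.allclose(y, y_linear))  # True: two linear layers are still just one linear map
```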

Properties that an activation function should have:

  • Non-Linear: To produce non-linear mappings from input to output, the activation function itself must be non-linear.
  • Saturating: A saturating activation function squeezes the input into a limited output range, so no single weight can have an outsized impact on the final output (illustrated in the sketch after this list).
  • Continuous and Smooth: Smoother functions generally give better results with gradient-based optimization. Since the input to a node takes a continuous range of values, the output should as well.
  • Differentiable: As we saw while deriving back-propagation, the derivative of f must be defined.
  • Monotonic: If the activation function is not monotonically increasing, an increase in a neuron's weight might cause it to have less influence, which is precisely the opposite of what we want.
  • Linear for small values: If the function is non-linear near zero, we have to be careful about the constraints on weight initialization, since otherwise we can face the vanishing or exploding gradient problem.
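
To make the saturation and near-linearity properties concrete, here is a small sketch (NumPy assumed) comparing the derivatives of sigmoid, tanh, and ReLU at small and large inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):          # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), at most 0.25
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):             # tanh'(x) = 1 - tanh(x)^2, maximum 1 at x = 0
    return 1.0 - np.tanh(x) ** 2

def d_relu(x):             # ReLU'(x) = 1 for x > 0, else 0 (scalar version)
    return 1.0 if x > 0 else 0.0

for x in (0.1, 2.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={d_sigmoid(x):.4f}  "
          f"tanh'={d_tanh(x):.4f}  ReLU'={d_relu(x):.1f}")

# Near zero, sigmoid and tanh behave almost linearly; for large |x| they
# saturate and their derivatives shrink towards 0, while ReLU's stays 1.
```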

Vanishing And Exploding Gradient

Vanishing gradient: As more layers using certain activation functions are added to a neural network, the gradient of the loss function tends towards zero, making the network hard to train.

As discussed earlier, saturating activation functions squeeze the input into a small range, so even a substantial change in the input produces only a small change in the output, and hence a small derivative.

ReLU is one activation function that does not suffer from the vanishing gradient problem, which is why most deep learning models use it.

But if you are still adamant about using tanh or sigmoid, you can go for batch normalization, which keeps each layer's inputs in the non-saturated region where the derivative is not small.
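
A rough numerical sketch of the effect (a hypothetical 50-layer chain, NumPy assumed): back-propagation multiplies one activation derivative per layer, and since each sigmoid derivative is at most 0.25 the product shrinks towards zero, while active ReLU units each contribute a factor of exactly 1:

```python
import numpy as np

def d_sigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
layers = 50
pre_activations = rng.normal(size=layers)              # toy pre-activation per layer

# Back-propagation multiplies one activation derivative per layer into the gradient.
sigmoid_factor = np.prod(d_sigmoid(pre_activations))   # each factor is at most 0.25
relu_factor = np.prod(np.ones(layers))                 # active ReLU units contribute exactly 1

print(f"sigmoid chain factor over {layers} layers: {sigmoid_factor:.3e}")  # vanishingly small
print(f"ReLU chain factor over {layers} layers:    {relu_factor:.1f}")     # stays 1.0
```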

Exploding gradient: Error gradients can accumulate during updates and result in very large gradients. In extreme cases, the weight values can overflow and become NaN. These NaN weights cannot be updated further, bringing the learning process to a halt.

There are many ways to deal with exploding gradients, such as gradient clipping (clip the gradient if its norm exceeds a particular threshold) and weight regularization (penalize the loss function for large weight values).
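
As an illustration, a minimal sketch of gradient clipping by global norm (NumPy, with a hypothetical threshold of 5.0):

```python
import numpy as np

def clip_by_norm(grads, max_norm=5.0):
    """Scale the list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# An "exploded" gradient gets rescaled; a small one is left alone.
print(clip_by_norm([np.array([300.0, -400.0])]))   # norm 500 -> rescaled to norm 5
print(clip_by_norm([np.array([0.3, 0.4])]))        # norm 0.5 < 5 -> unchanged
```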

Loss Functions

The cost function, or loss function, essentially measures the difference between the neural network's output and the target variable. Loss functions can be grouped into three categories:

Regressive Loss Function: When the target variable is continuous, regressive loss functions are used. The most commonly used is mean squared error; other examples include the absolute error and the smooth absolute error.
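
For example, a small NumPy sketch of these regression losses (the smooth absolute, i.e. Huber-style, error uses a hypothetical delta of 1.0):

```python
import numpy as np

y_true = np.array([2.0, 0.5, -1.0])
y_pred = np.array([2.5, 0.0, -3.0])

mse = np.mean((y_true - y_pred) ** 2)          # mean squared error
mae = np.mean(np.abs(y_true - y_pred))         # absolute error

# Smooth absolute (Huber) error: quadratic near 0, linear for large residuals.
delta = 1.0
residual = np.abs(y_true - y_pred)
huber = np.mean(np.where(residual <= delta,
                         0.5 * residual ** 2,
                         delta * (residual - 0.5 * delta)))

print(mse, mae, huber)
```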

Classification Loss Function: When the output variable is a probability over classes, we use a classification loss function. Most classification losses tend to maximize the margin between classes. Notable examples include categorical cross-entropy, negative log-likelihood, and the margin classifier loss.
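
A small sketch of categorical cross-entropy on softmax outputs (NumPy assumed):

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])       # raw network outputs for 3 classes
probs = softmax(logits)
target = np.array([1.0, 0.0, 0.0])       # one-hot encoded true class

# Categorical cross-entropy = negative log-likelihood of the true class.
cross_entropy = -np.sum(target * np.log(probs))
print(probs, cross_entropy)
```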

Embedding Loss Function: When we have to measure the similarity between two or more inputs, we use an embedding loss function. Widely used examples are the L1 hinge error and the cosine error.
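
And a brief sketch of the cosine error between two embedding vectors (NumPy assumed):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.9])

cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_error = 1.0 - cosine_similarity   # 0 when the embeddings point in the same direction

print(cosine_error)
```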

Optimization Algorithms

Optimization algorithms are responsible for updating the weights and biases of the neural network to reduce the loss. They can be primarily divided into two categories:

Constant Learning Rate Algorithms: The learning rate η is fixed for all parameters and weights. The most common example is stochastic gradient descent (SGD).
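
A minimal sketch of one SGD step with a constant learning rate (NumPy, with the gradient assumed to be already computed):

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """Vanilla SGD: every parameter uses the same constant learning rate."""
    return w - lr * grad

w = np.array([0.5, -1.2])
grad = np.array([0.1, -0.4])      # gradient of the loss w.r.t. w (assumed given)
w = sgd_step(w, grad)
print(w)
```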

Adaptive Learning Rate Algorithms: Adaptive optimizers like Adam maintain a per-parameter learning rate, which provides a more straightforward heuristic than tuning the learning rate hyper-parameter manually.
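
And a hedged sketch of the Adam update rule, showing how the per-parameter moment estimates adapt each parameter's effective step size (default hyper-parameters assumed; the gradient is again assumed given):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates give each parameter its own step size."""
    m = beta1 * m + (1 - beta1) * grad              # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2         # second moment (uncentred variance)
    m_hat = m / (1 - beta1 ** t)                    # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)     # per-parameter effective learning rate
    return w, m, v

w = np.array([0.5, -1.2])
m = np.zeros_like(w)
v = np.zeros_like(w)
grad = np.array([0.1, -0.4])                        # gradient of the loss w.r.t. w (assumed)
w, m, v = adam_step(w, grad, m, v, t=1)
print(w)
```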

