
Deep Learning Basics

source link: https://towardsdatascience.com/deep-learning-basics-1d26923cc24a?gi=5b706ccc3043

Basic Concepts for Deep Reinforcement Learning Beginners


This is the third post in the series “Deep Reinforcement Learning Explained”. If you already have previous knowledge of Deep Learning, you can skip this post and go straight to the next one.

In this post, I will review the main concepts of neural networks so that the reader understands enough Deep Learning basics to use it later to program an Agent for a Reinforcement Learning problem. To ease the explanation, I will build the post around an example that introduces the theoretical concepts.

Classification of handwritten digits

As a case study, we will create a mathematical model that allows us to identify handwritten digits like those in the MNIST dataset.

The goal is to create a neural network that, given an image, identifies the number it represents. For example, if we feed the model the first image of the dataset, we would expect it to answer that it is a 5, the next one a 0, the next one a 4, and so on.

Classification problem with uncertainty

Actually, we are dealing with a classification problem: given an image, the model classifies it as one of the digits from 0 to 9. But sometimes even we can have doubts; for example, does the first image represent a 5 or a 3?

[Figure: a handwritten digit ambiguous enough to be read as either a 5 or a 3]

For this purpose, the neural network that we will create returns a vector with 10 positions indicating the likelihood of each of the ten possible digits:

[Figure: output vector with 10 positions, one likelihood value per digit]

Data format and manipulation

In the next post, we will explain how to code this example using PyTorch. For now, it is enough to mention that we will use the MNIST dataset, which contains 60,000 images of handwritten digits to train the model. This dataset of grayscale images has been normalized to 28×28 pixels.

To facilitate feeding the data into our basic neural network, we will transform the input (image) from 2 dimensions (2D) into a 1-dimensional (1D) vector. That is, the matrix of 28×28 numbers can be represented by a vector (array) of 784 numbers (concatenating row by row), which is the format accepted as input by a densely connected neural network like the one we will see in this post.

We need to represent each label with a vector of 10 positions as we presented before, where the position corresponding to the digit that the image represents contains a 1 and the rest contain 0s. This process of transforming the labels into a vector with as many zeros as there are different labels, and putting a 1 in the index corresponding to the label, is known as one-hot encoding. For example, the number 7 will be encoded as:

[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
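To make these two transformations concrete (the flattening and the one-hot encoding), here is a minimal sketch using PyTorch, the framework announced for the next post; the image is a random tensor standing in for a real MNIST digit, and the variable names are only illustrative:

```python
import torch

# A random 28x28 tensor standing in for one grayscale MNIST image.
image = torch.rand(28, 28)

# Flatten the 2D image into a 1D vector of 784 values (row by row).
flat = image.reshape(-1)      # shape: (784,)

# One-hot encode the label 7 as a vector with 10 positions.
label = 7
one_hot = torch.zeros(10)
one_hot[label] = 1.0          # tensor([0., 0., 0., 0., 0., 0., 0., 1., 0., 0.])
```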

Neural Network components

Now we are ready to start explaining the minimum set of basic neural network concepts.

A plain artificial neuron

In order to show what a basic neuron looks like, let’s suppose a simple example where we have a set of points in a two-dimensional plane, and each point is already labeled “square” or “circle”:

[Figure: points in a two-dimensional plane, each labeled as a square or a circle]

Given a new point “×”, we want to know which label corresponds to it:

[Figure: the new point “×” plotted among the labeled points]

A common approach is to draw a line that separates the two groups and use this line as a classifier:

[Figure: a straight line separating the squares from the circles]

In this case, the input data will be represented by vectors of the form (x1, x2) that indicate their coordinates in this two-dimensional space, and our function will return ‘0’ or ‘1’ (above or below the line) to indicate whether the point should be classified as “square” or “circle”. The line can be defined, in slope-intercept form, by

x2 = w·x1 + b

More generally, we can express the line as:

w1·x1 + w2·x2 + b = 0

To classify input elements X, which in our case are two-dimensional, we must learn a weight vector W of the same dimension as the input vectors, that is, the vector (w1, w2), and a bias b.

With these calculated values, we can now construct an artificial neuron to classify a new element X. Basically, the neuron applies this vector W of learned weights to the values in each dimension of the input element X, and at the end adds the bias b. The result will then be passed through a non-linear “activation” function to produce a result of ‘0’ or ‘1’. The function of the artificial neuron that we have just defined can be expressed more formally as:

z = w1·x1 + w2·x2 + b

Now, we will need a function that applies a transformation to the variable z so that it becomes ‘0’ or ‘1’. Although there are several such functions (“activation functions”), for this example we will use one known as the sigmoid function, which returns a real output value between 0 and 1 for any input value:

y = 1 / (1 + e^(-z))
If we analyze the previous formula, we can see that it always tends to give values close to 0 or 1. If the input z is reasonably large and positive, e^(-z) approaches zero and, therefore, y takes a value close to 1. If z is large and negative, e^(-z) becomes a very large number, so the denominator of the formula grows and the value of y approaches 0. Graphically, the sigmoid function presents this form:

[Figure: the S-shaped curve of the sigmoid function]
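As a minimal illustration of such a neuron, the following sketch (PyTorch, with made-up weight and bias values rather than learned ones) computes z = w1·x1 + w2·x2 + b and passes it through the sigmoid; the 0.5 threshold used to pick a label is also just for illustration:

```python
import torch

# Made-up parameters of an already "learned" separating line.
w = torch.tensor([0.8, -0.5])     # weights (w1, w2)
b = torch.tensor(0.1)             # bias

def neuron(x):
    """One artificial neuron: weighted sum plus bias, then sigmoid."""
    z = torch.dot(w, x) + b
    return torch.sigmoid(z)       # real value between 0 and 1

point = torch.tensor([1.0, 2.0])  # a new point (x1, x2)
prob = neuron(point)
print("circle" if prob > 0.5 else "square")
```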

So far we have presented how to define an artificial neuron, the simplest architecture a neural network can have. In particular, this architecture is known in the literature as the Perceptron, invented in 1957 by Frank Rosenblatt, and visually summarized in a general way with the following scheme:

[Figure: general scheme of the Perceptron, with inputs, weights, a weighted sum plus bias, an activation function, and the output]

A second, simplified visual representation of the previous neuron (the one we will use from now on) can be:

[Figure: simplified representation of the neuron]

Multi-Layer Perceptron

In the literature of the area, we refer to a Multi-Layer Perceptron (MLP) when we find neural networks that have an input layer, one or more layers composed of perceptrons, called hidden layers, and a final layer with several perceptrons, called the output layer. In general, we refer to Deep Learning when the model based on neural networks is composed of multiple hidden layers. Visually it can be presented with the following scheme:

[Figure: scheme of a Multi-Layer Perceptron with an input layer, hidden layers, and an output layer]

MLPs are often used for classification, and specifically when classes are exclusive, as in the case of the classification of digit images (in classes from 0 to 9). In this case, the output layer returns the probability of belonging to each one of the classes, thanks to a function called softmax . Visually we could represent it in the following way:

[Figure: MLP for digit classification, with a softmax output layer returning one probability per class]

As we mentioned, there are several activation functions in addition to the sigmoid, each with different properties. One of them is the one we just mentioned, the softmax activation function, which will be useful for presenting an example of a simple neural network that classifies into more than two classes. For the moment, we can consider the softmax function as a generalization of the sigmoid function that allows us to classify into more than two classes.

Softmax activation function

We will solve the problem in such a way that, given an input image, we obtain the probability that it is each of the 10 possible digits. In this way, we will have a model that, for example, could predict a five in an image, but only be 70% sure that it is a five. Due to the stroke in the upper part of the digit in this image, it might assign a 20% chance that it is a three, and it could even give some probability to other digits. Although in this particular case we will consider that the prediction of our model is a five, since it is the class with the highest probability, this approach of using a probability distribution gives us a better idea of how confident we are in our prediction. This is useful here, where the digits are handwritten, and surely many of them cannot be recognized with 100% certainty.

Therefore, for this classification example, we will obtain, for each input example, an output vector with a probability distribution over a set of mutually exclusive labels. That is, a vector of 10 probabilities, each corresponding to a digit, whose sum is 1 (each probability will be a value between 0 and 1).

As already mentioned, this is achieved through the use of an output layer in our neural network with the softmax activation function, in which each neuron in this softmax layer depends on the outputs of all the other neurons in the layer, since the sum of their outputs must be 1.

But how does the softmax activation function work? The softmax function is based on calculating “the evidence” that a certain image belongs to a particular class and then these pieces of evidence are converted into probabilities that it belongs to each of the possible classes.

One approach to measuring the evidence that a certain image belongs to a particular class is to compute a weighted sum of its pixels, where each pixel contributes evidence for or against that class. To explain the idea, I will use a visual example.

Let’s suppose that we already have the model learned for the number zero. For the moment, we can consider a model as “something” that contains information to know if a number is of a certain class. In this case, for the number zero, suppose we have a model like the one presented below:

[Figure: learned weight model for the digit zero, with blue pixels for positive weights, red pixels for negative weights, and white for neutral]

In this case, the model is a matrix of 28×28 pixels in which the pixels in red represent negative weights (i.e., they reduce the evidence of belonging to the class), while the pixels in blue represent positive weights (they increase the evidence). The white color represents the neutral value.

Let’s imagine that we trace a zero over it. In general, the trace of our zero would fall on the blue zone (remember that we are talking about images that have been normalized to 20×20 pixels and later centered on a 28×28 image). It is quite evident that if our stroke goes over the red zone, it is most likely that we are not writing a zero; therefore, using a metric based on adding if we pass through the blue zone and subtracting if we pass through the red zone seems reasonable.

To confirm that it is a good metric, let’s imagine now that we draw a three; it is clear that the red zone at the center of the model we used for the zero will penalize the aforementioned metric, since, as we can see in the left part of the following figure, when writing a three we pass over it:

[Figure: left, a three traced over the weight model for the zero; right, a three traced over the weight model for the three]

But on the other hand, if the reference model is the one corresponding to the number 3, as shown in the right part of the previous figure, we can see that, in general, the different possible traces that represent a three mostly remain in the blue zone.

I hope that the reader, seeing this visual example, already intuits how the weighted-sum approach indicated above allows us to estimate which digit it is.

Once the evidence of belonging to each of the 10 classes has been calculated, it must be converted into probabilities whose components sum to 1. For this, softmax takes the exponential of each piece of calculated evidence and then normalizes the results so that they sum to one, forming a probability distribution. The probability of belonging to class i is:

probability_i = e^(evidence_i) / Σ_j e^(evidence_j)
Intuitively, the effect obtained with the use of exponentials is that one more unit of evidence has a multiplier effect and one unit less has the inverse effect. The interesting thing about this function is that a good prediction will have a single entry in the vector with a value close to 1, while the remaining entries will be close to 0. In a weak prediction, there will be several possible labels, which will have more or less the same probability.
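To make the evidence-to-probability step tangible, here is a small sketch (PyTorch; the weight matrix, biases, and image are random stand-ins, not a trained model) that computes the evidence for each class as a weighted sum of the pixels plus a bias, and then converts it into probabilities with softmax:

```python
import torch

# Random stand-ins for a trained model: one weight vector of 784 values
# and one bias per class (10 classes), plus a flattened 28x28 image.
W = torch.randn(10, 784)
b = torch.randn(10)
x = torch.rand(784)

# Evidence for each class: weighted sum of the pixels plus the bias.
evidence = W @ x + b                  # shape: (10,)

# Softmax: exponentiate the evidence and normalize so the 10 values sum to 1.
probabilities = torch.softmax(evidence, dim=0)
print(probabilities.sum())            # tensor(1.) up to rounding
print(int(probabilities.argmax()))    # the class with the highest probability
```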

Neural Network model for Handwritten digits

For this example, we will define a very simple neural network as a sequence of two layers. Visually we could represent it in the following way:

[Figure: two-layer network with 784 inputs, a first layer of 10 sigmoid neurons, and a softmax output layer of 10 neurons]

In the visual representation, we explicitly show that the model has 784 input features (28×28). The first layer, with 10 neurons and a sigmoid activation function, “distills” the input data to obtain the 10 outputs required as input to the next layer. The second layer will be a softmax layer of 10 neurons, which means it will return a vector of 10 probability values representing the 10 possible digits (as we presented before, where each value is the probability that the current image belongs to each one of them).
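As a preview of the next post, a two-layer network like the one just described could be sketched in PyTorch roughly as follows; the layer sizes come from the description above, but the exact code of the next post may differ:

```python
import torch
import torch.nn as nn

# 784 inputs -> 10 sigmoid neurons -> 10 softmax outputs (one per digit).
model = nn.Sequential(
    nn.Linear(784, 10),   # first layer: weights and biases
    nn.Sigmoid(),
    nn.Linear(10, 10),    # second layer: weights and biases
    nn.Softmax(dim=1),    # probabilities that sum to 1 for each image
)

# A batch of 64 flattened images maps to a 64x10 matrix of probabilities.
probs = model(torch.rand(64, 784))
print(probs.shape, probs[0].sum())    # torch.Size([64, 10]), sums to ~1
```

Note that PyTorch’s nn.CrossEntropyLoss, which we will meet below, expects raw scores (logits) and applies the softmax internally, so in practice the final Softmax layer is often omitted when that loss is used.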

Learning Process

Training our neural network, that is, learning the values of our parameters (weights W and biases b), is the most genuine part of Deep Learning, and we can see this learning process in a neural network as an iterative process of “going and returning” through the layers of neurons. The “going” is a forward propagation of the information and the “returning” is a backpropagation of the information.

Training loop

The first phase, forward propagation, occurs when the network is exposed to the training data and the data crosses the entire neural network so that its predictions (labels) are calculated. That is, the input data is passed through the network in such a way that all the neurons apply their transformation to the information they receive from the neurons of the previous layer and send it to the neurons of the next layer. When the data has crossed all the layers, and all the neurons have made their calculations, the final layer is reached with a label prediction for those input examples.

Next, we will use a loss function to estimate the loss (or error) by comparing and measuring how good or bad our prediction was in relation to the correct result (remember that we are in a supervised learning setting and we have the label that tells us the expected value). Ideally, we want our loss to be zero, that is, no divergence between the estimated and the expected value. Therefore, as the model is being trained, the weights of the interconnections between neurons will gradually be adjusted until good predictions are obtained.

Once the loss has been calculated, this information is propagated backwards. Hence its name: backpropagation. Starting from the output layer, the loss information propagates to all the neurons in the hidden layer that contribute directly to the output. However, the neurons of the hidden layer only receive a fraction of the total loss signal, based on the relative contribution that each neuron made to the original output. This process is repeated, layer by layer, until all the neurons in the network have received a loss signal that describes their relative contribution to the total loss. Now that we have propagated this information backwards, we can adjust the weights of the connections between neurons.

Visually, we can summarize what we have explained with this visual scheme of the stages (based on the previous visual representation of our neural network):

[Figure: the stages of the training loop, namely forward propagation, loss calculation, backpropagation, and weight update]

What we are doing is adjusting the parameters so that the loss will be as close as possible to zero the next time we use the network for a prediction.

In general, we can see the learning process as a global optimization problem where the parameters (weights and biases) must be adjusted in such a way that the loss function presented above is minimized. In most cases, these parameters cannot be solved for analytically, but in general they can be approximated well with iterative optimization algorithms (optimizers), such as the technique called gradient descent. This technique changes the weights in small increments by computing the derivative (or gradient) of the loss function, which tells us in which direction “to descend” towards the minimum; in practice, this is done on batches of data over the successive passes (epochs) through the whole dataset that we feed to the network.
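Putting forward propagation, the loss, backpropagation, and gradient descent together, a training loop along these lines could be sketched as follows (PyTorch; `model` and `train_loader` are placeholders for the network and the MNIST data loader, and the learning rate and number of epochs are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                         # loss function (expects raw scores)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # gradient descent

for epoch in range(10):                  # passes (epochs) over the whole dataset
    for images, labels in train_loader:  # batches of training data
        images = images.reshape(-1, 784) # flatten each 28x28 image

        outputs = model(images)          # forward propagation
        loss = criterion(outputs, labels)

        optimizer.zero_grad()            # reset gradients from the previous batch
        loss.backward()                  # backpropagation of the loss
        optimizer.step()                 # adjust weights and biases
```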

The reader can visit the post Learning Process of a Deep Neural Network for more details about this training loop, although for the purpose of this section I do not think it is necessary.

Cross-Entropy Loss Function

We can choose from a wide range of loss functions for our neural network model. For instance, the reader is surely familiar with the Mean Squared Error (MSE) loss function commonly used for regression. For a classification problem like the one presented in this post, the loss function that is usually used is Cross-Entropy, which measures the difference between two probability distributions.

Cross-entropy loss, or log loss , measures the performance of a classification model whose output is a probability value between 0 and 1. Both are slightly different depending on context, but in Deep Learning when calculating error rates between 0 and 1 they resolve to the same thing.

Cross-entropy loss increases as the predicted probability diverges from the actual label. A perfect model would have a log loss of 0. In binary classification, where the number of classes is 2, cross-entropy can be calculated as:

-(y·log(p) + (1-y)·log(1-p))
In our example, a multiclass classification, we calculate a separate loss for each class label per observation and sum the result (a small numerical sketch follows the symbol definitions below):

-Σ_{c=1}^{C} y_{o,c}·log(p_{o,c})
where

  • C — number of classes (10 in our case)
  • log — the natural log
  • y_{o,c} — binary indicator (0 or 1) of whether class label c is the correct classification for observation o
  • p_{o,c} — predicted probability that observation o is of class c
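As a small numerical check of this formula (plain Python; the predicted probability vector is invented for the example), here is the cross-entropy loss for one observation whose true label is the digit 7:

```python
import math

# One-hot encoded true label (digit 7) and an invented predicted distribution.
y = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
p = [0.01, 0.01, 0.02, 0.05, 0.01, 0.02, 0.03, 0.70, 0.10, 0.05]

# Multiclass cross-entropy: minus the sum over classes of y_c * log(p_c).
loss = -sum(y_c * math.log(p_c) for y_c, p_c in zip(y, p))
print(loss)   # about 0.357: only the probability assigned to the true class matters
```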

In this post, I reviewed the main concepts of neural networks so that the reader understands enough Deep Learning basics to program an Agent for a Reinforcement Learning problem. See you in the next post, where we will program the example presented here using PyTorch and introduce the reader to the basic features of that framework, which we will use throughout this series.

