
Deep Reinforcement Learning: Pong from Pixels — Keras Version


This is a follow-on from Andrej Karpathy's (AK) blog post on reinforcement learning (RL). What I'm hoping to do with this post is simplify Karpathy's post and take out the maths (thanks to Keras). This article should be self-contained even if you haven't read his post.

This post is meant to accompany the video tutorial, which goes into more depth (the code is linked in the YouTube video description).


What is Reinforcement Learning (RL)?

Unlike other problems in machine learning/deep learning, reinforcement learning suffers from the fact that we do not have a proper 'y' variable. The input 'X', however, is no different. In this particular example we will be using the Pong environment from OpenAI Gym: the input X is the image of the current state of the game, and the output y is the move to play.
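
To make this concrete, here is a minimal sketch of that loop, assuming the classic Gym environment ID 'Pong-v0' and the older 4-tuple step API (newer Gym/ALE releases use different IDs and return signatures):

import gym

env = gym.make('Pong-v0')
observation = env.reset()           # X: a 210x160x3 RGB image of the screen
action = env.action_space.sample()  # y: the move to play (here chosen at random)
observation, reward, done, info = env.step(action)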

Reward Function

The task in RL is, given the current state (X) of the game/environment, to take the action that maximises the **future** expected discounted reward. A proxy for this is to play the game to the end and sum up all the rewards from the current time step onwards, discounted by a factor gamma (a number between 0 and 1), as shown here:

R_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + … = Σ_{k≥0} γ^k · r_{t+k}   (the R variable: the discounted sum of future rewards from time step t)

Why not simply use the current reward r_t? Because of delayed rewards. When an action is taken, its implications affect not only the current state but subsequent states too, at a decaying rate. Therefore, the current action is responsible for the current reward and for future rewards, but with less and less responsibility the further we move into the future.
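
To make the R variable concrete, here is a small helper that computes the discounted sum above by scanning the episode's rewards backwards. It is a sketch, assuming the rewards are stored as one scalar per time step; resetting the running sum whenever a point is scored is a Pong-specific trick borrowed from Karpathy's original post.

import numpy as np

def discount_rewards(rewards, gamma=0.99):
    """R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for every time step t."""
    discounted = np.zeros_like(rewards, dtype=np.float32)
    running_sum = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running_sum = 0.0  # a point was scored, so start the sum afresh (Pong-specific)
        running_sum = running_sum * gamma + rewards[t]
        discounted[t] = running_sum
    return discounted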

The Model

The model is used to generate the actions. In our case we feed in the image and the model outputs the probability of moving the paddle up or down.

In relation to the R variable mentioned above, notice how the actions generated by our model lead to the rewards that are then used to train it. This is very much a case of the blind leading the blind, but as more iterations are done, we converge to better actions.

The model we will be using is different to the one in AK's blog in that we use a Convolutional Neural Network (CNN), as outlined below. The advantage of using a CNN is that the number of parameters we have to deal with is significantly smaller: a dense network with one hidden layer of 100 neurons would need roughly 640,000 parameters (since we have 6400 = 80×80 input pixels), whereas the CNN shown below has only around 4,700.

Also note that the final layer is a two-way softmax (for two classes this is equivalent to a sigmoid), so the model predicts the probability of moving the paddle up or down.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense

model = Sequential()
# Three small conv + pooling blocks take the 80x80x1 difference frame down to 10x10x16
model.add(Conv2D(4, kernel_size=(3, 3), padding='same', activation='relu', input_shape=(80, 80, 1)))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Conv2D(8, kernel_size=(3, 3), padding='same', activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Conv2D(16, kernel_size=(3, 3), padding='same', activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Flatten())
# Two-way softmax: probability of moving the paddle up vs down
model.add(Dense(2, activation='softmax'))
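
To actually generate a move during play, we can sample from the predicted probabilities rather than always taking the most likely action, so the agent keeps exploring. A rough sketch, where the mapping of index 1/0 to the Gym actions 2 ('up') and 3 ('down') is an assumption about how the two moves are encoded:

import numpy as np

# x is the preprocessed 80x80 difference frame (see Preprocessing below)
prob = model.predict(x.reshape(1, 80, 80, 1))[0]  # [p(down), p(up)]
action_index = np.random.choice(2, p=prob)        # sample, so the agent explores
gym_action = 2 if action_index == 1 else 3        # assumed mapping: 2 = up, 3 = down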

Preprocessing

We crop the top and bottom of the image and subsample every second pixel both horizontally and vertically, giving an 80×80 input image. The paddles and the ball are set to a value of 1 while the background is set to 0. What is actually fed into the network, however, is the difference between two consecutive frames.
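
A sketch of that preprocessing, closely following the prepro function in Karpathy's original post (the crop range 35:195 and the background colour values 144 and 109 are Pong-specific):

import numpy as np

def preprocess(frame):
    """Turn a 210x160x3 Pong frame into an 80x80 binary image."""
    frame = frame[35:195]              # crop the score bar at the top and the strip at the bottom
    frame = frame[::2, ::2, 0].copy()  # subsample every second pixel, keep one colour channel
    frame[frame == 144] = 0            # erase background (colour 1)
    frame[frame == 109] = 0            # erase background (colour 2)
    frame[frame != 0] = 1              # paddles and ball become 1
    return frame.astype(np.float32)

# What the network actually sees is the difference of two consecutive frames:
# x = preprocess(current_frame) - preprocess(previous_frame)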

Loss function

The last piece of the puzzle is the loss function. We have our input, which is the X variable mentioned above; the target y variable is the action that was taken at that time step, i.e. 1 for going up and 0 for going down.

But wait, wasn't the y variable whatever the model dictated it to be? Yes, you are absolutely right. So we cannot simply use the usual cross-entropy loss, since the predicted probability p(X) and the label y are generated by the same model. What we do instead is weight each sample's loss by the expected future reward at that point in time.

Thus at the end of each episode we run the following code to train:

model.fit(x, y, sample_weight=R, epochs=1)

while the actual loss function remains the standard cross-entropy,

model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

The key takeaway is that we use the sample_weight functionality above to weight each action by how good its eventual outcome was.
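
Putting it all together, one training episode might look like the sketch below, reusing the env, preprocess, discount_rewards and model pieces from above. Normalising R to zero mean and unit variance is a common stabilising trick (also used in Karpathy's post), and the up/down action mapping is the same assumption as before.

import numpy as np

xs, ys, rewards = [], [], []
prev_frame, done = None, False
observation = env.reset()

while not done:
    cur_frame = preprocess(observation)
    x = cur_frame - prev_frame if prev_frame is not None else np.zeros((80, 80))
    prev_frame = cur_frame

    prob = model.predict(x.reshape(1, 80, 80, 1))[0]
    action_index = np.random.choice(2, p=prob)   # 1 = up, 0 = down
    xs.append(x.reshape(80, 80, 1))
    ys.append(action_index)                      # the label is simply the action we took

    observation, reward, done, info = env.step(2 if action_index == 1 else 3)
    rewards.append(reward)

# Weight each (x, y) pair by its discounted future reward, normalised for stability
R = discount_rewards(np.array(rewards))
R = (R - R.mean()) / (R.std() + 1e-8)
model.fit(np.array(xs), np.array(ys), sample_weight=R, epochs=1)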

Summary

To wrap things up, policy gradients are a lot easier to understand when you don't concern yourself with the actual gradient calculations. All current deep learning frameworks take care of any derivatives you need.

Policy gradients are one of the more basic reinforcement learning algorithms. If you wish to learn more about reinforcement learning, subscribe to my YouTube channel. This playlist contains tutorials on more advanced RL algorithms such as Q-learning.

