Deep Deterministic Policy Gradient (DDPG): Theory and Implementation

May 31 · 6 min read

Deep Deterministic Policy Gradient (DDPG) is a reinforcement learning technique that combines Q-learning and policy gradients. As an actor-critic technique, DDPG consists of two models: an Actor and a Critic. The actor is a policy network that takes the state as input and outputs the exact (continuous) action, rather than a probability distribution over actions. The critic is a Q-value network that takes a state and an action as input and outputs the corresponding Q-value. DDPG is an off-policy method designed for continuous action spaces; the "deterministic" in its name refers to the fact that the actor computes the action directly instead of a probability distribution over actions.

DDPG is used in a continuous action setting and is an improvement over the vanilla actor-critic.

There are plenty of DDPG implementations available online; however, they come with significant overhead and aren't easy for a beginner to understand. I decided to write a simple TF2 implementation that covers the important bits of the DDPG method.

Let’s dive into the theory of the technique.

Theory

In DQN the optimal action is obtained by taking the argmax over the Q-values of all actions. In DDPG the actor is a policy network that replaces this argmax: it outputs the (possibly continuous) action directly.

Policy Network
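
To make the contrast concrete, here is a minimal sketch of the difference in action selection, using tiny stand-in Keras networks (the layer sizes, the 5-action discrete space for DQN, and the 3-dimensional state are illustrative assumptions, not the article's code):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Purely illustrative stand-ins: a state of size 3 (as in Pendulum-v0),
# a Q-network over 5 hypothetical discrete actions, and a 1-D continuous actor.
q_net = tf.keras.Sequential([layers.Dense(32, activation='relu', input_shape=(3,)),
                             layers.Dense(5)])
actor = tf.keras.Sequential([layers.Dense(32, activation='relu', input_shape=(3,)),
                             layers.Dense(1, activation='tanh')])

state = np.random.randn(1, 3).astype(np.float32)

# DQN: argmax over the Q-values of all (discrete) actions.
dqn_action = tf.argmax(q_net(state), axis=-1)

# DDPG: the actor outputs the continuous action directly -- no argmax needed.
ddpg_action = actor(state)
```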

Models

The policy is deterministic since it directly outputs the action. To promote exploration, Gaussian noise is added to the action determined by the policy. To calculate the Q-value of a state, the actor's output is fed into the Q-network; this chaining is done when computing the training losses (the actor objective and the TD-error), which we describe later.

Actor and Critic chained together
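
As a small sketch of the exploration noise described above, a hypothetical get_action helper could add Gaussian noise to the actor's deterministic output during data collection (the noise scale, the action_max of 2.0 for Pendulum-v0, and the assumption that the actor ends in a tanh are mine, not the article's):

```python
import numpy as np

def get_action(actor, state, noise_std=0.1, action_max=2.0):
    # Deterministic action from the policy network, scaled to the env range
    # (the actor is assumed to end in tanh; Pendulum-v0 actions lie in [-2, 2]).
    a = action_max * actor(state[None, :].astype(np.float32)).numpy()[0]
    # Gaussian exploration noise, clipped back to the valid action range.
    a = a + np.random.normal(0.0, noise_std, size=a.shape)
    return np.clip(a, -action_max, action_max)
```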

To stabilize learning we create target networks for both the critic and the actor. These target networks are updated with "soft" updates based on the main networks; we will discuss these updates later.

Loss Functions

Now that we have described the model architecture, we proceed to how the models are trained, or rather what the loss function for each of them is. The losses for the Critic (Q) and the Actor (μ) are:

$$J(\theta^{\mu}) = \frac{1}{N}\sum_{i} Q\big(s_i,\ \mu(s_i \mid \theta^{\mu}) \mid \theta^{Q}\big)$$

$$L(\theta^{Q}) = \frac{1}{N}\sum_{i} \Big(r_i + \gamma\, Q'\big(s_{i+1},\ \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big) - Q(s_i, a_i \mid \theta^{Q})\Big)^{2}$$

Losses for the Actor (maximized) and the Critic (minimized)

We first analyze the Actor (policy network) loss. It is simply the average of the Q-values over the sampled states, where the Q-values are computed by the Critic for the actions produced by the Actor-network. We want to maximize this quantity, as we wish to obtain maximum returns/Q-values.

The Critic loss is a simple TD-error, in which we use the target networks to compute the Q-value of the next state. We want to minimize this loss.

To propagate the error backwards we need the derivatives of the Q-function. For the critic loss the derivative is straightforward, since μ is treated as a constant; for the actor loss, however, the μ-function sits inside the Q-value, so we use the chain rule:

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\,a=\mu(s_i)}\ \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}$$

Chain rule for the Actor loss gradient

We now have all the losses and derivatives.

Target Update

To increase stability during training we use target critic and actor networks to calculate the Q-value of the next state in the TD-error computation. The target networks are delayed copies of the main/current networks, and their weights are updated periodically from the main networks. In DQN the main network weights are copied over to the target periodically; this is known as a "hard" update. In DDPG we instead perform a "soft" update, in which only a fraction of the main weights is transferred at each step, in the following manner:

$$\theta' \leftarrow \tau\,\theta' + (1-\tau)\,\theta$$

Target network update rule (applied to both the Actor and Critic weights)

Tau is a parameter that is typically chosen to be close to 1 (e.g., 0.999), so that the target networks change slowly.

With the theory in hand we can now have a look at the implementation.

Implementation

Model

Now we come to the model-creation bit. In this implementation we use a simple gym environment (Pendulum-v0). Our actor and critic networks are composed only of dense layers. We remind the reader that the actor is the policy network: it takes the state as input and outputs the action. The critic takes both the state and the action as input and outputs the Q-value of the state-action pair. In the original DDPG paper the action enters the critic network at a middle layer instead of at the input; this is done only to improve performance/stability, and we do not resort to this trick. For us, both the action and the state enter the critic network at the input layer. We write a single function that generates both the actor and the critic:

Model generator function
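
The original gist is not reproduced here, so the following is a minimal sketch of what such a generator could look like in TF2; the exact signature of the article's ANN2 may differ (the hidden_activation/output_activation arguments are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def ANN2(input_shape, layer_size, hidden_activation='relu', output_activation=None):
    """Build a simple dense network; layer_size lists the widths of all layers,
    the last entry being the output dimension."""
    inp = layers.Input(shape=input_shape)
    x = inp
    for width in layer_size[:-1]:
        x = layers.Dense(width, activation=hidden_activation)(x)
    out = layers.Dense(layer_size[-1], activation=output_activation)(x)
    return tf.keras.Model(inputs=inp, outputs=out)
```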

The function ANN2 generates both the critic and the actor using the input_shape and layer_size parameters. The hidden layers of both networks use 'relu' activations. The output layer of the actor uses 'tanh' (to map the continuous action to the range -1 to 1), while the output layer of the critic has no activation, since it outputs the Q-value. The actor's output can then be scaled by a factor to make the action correspond to the environment's action range.

Model Initialization

We initialize four networks: the main Actor and Critic, and the target Actor and Critic:

Model Initializations
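
A possible sketch of this initialization, reusing the ANN2 sketch above (the variable names mu, q, mu_target, q_target and the hidden-layer sizes are assumptions; Pendulum-v0 has a 3-dimensional state and a 1-dimensional action in [-2, 2]):

```python
state_dim, action_dim, action_max = 3, 1, 2.0  # Pendulum-v0

# Main actor (policy) and critic (Q) networks. The actor ends in tanh and is
# scaled by action_max when acting and inside the losses.
mu = ANN2((state_dim,), [64, 64, action_dim], output_activation='tanh')
q  = ANN2((state_dim + action_dim,), [64, 64, 1])

# Target networks, initialized as copies of the main networks.
mu_target = ANN2((state_dim,), [64, 64, action_dim], output_activation='tanh')
q_target  = ANN2((state_dim + action_dim,), [64, 64, 1])
mu_target.set_weights(mu.get_weights())
q_target.set_weights(q.get_weights())
```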

Replay Buffer

As with other deep reinforcement learning techniques, DDPG relies on a replay buffer for stability. The replay buffer needs to maintain a balance of old and new experiences.

Simple replay buffer implementation
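
A minimal sketch of such a buffer, here a fixed-size ring buffer with uniform sampling (the class and field names and the default sizes are assumptions):

```python
import numpy as np

class ReplayBuffer:
    def __init__(self, state_dim, action_dim, size=100_000):
        self.s  = np.zeros((size, state_dim),  dtype=np.float32)
        self.a  = np.zeros((size, action_dim), dtype=np.float32)
        self.r  = np.zeros(size, dtype=np.float32)
        self.s2 = np.zeros((size, state_dim),  dtype=np.float32)
        self.d  = np.zeros(size, dtype=np.float32)
        self.ptr, self.n, self.size = 0, 0, size

    def store(self, s, a, r, s2, d):
        i = self.ptr
        self.s[i], self.a[i], self.r[i], self.s2[i], self.d[i] = s, a, r, s2, d
        self.ptr = (self.ptr + 1) % self.size   # overwrite the oldest entry when full
        self.n = min(self.n + 1, self.size)

    def sample(self, batch_size=64):
        idx = np.random.randint(0, self.n, size=batch_size)
        return self.s[idx], self.a[idx], self.r[idx], self.s2[idx], self.d[idx]
```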

Training

Now we directly use the loss functions defined above to train our networks. To compute losses and gradients in TF2, the computations need to be performed inside a tf.GradientTape() block, and TF2 recommends using separate gradient tapes for different networks. Our training step looks like this:

Training loop
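
Since the gist is not shown here, the following sketch of one training step follows the losses above and reuses the names from the earlier sketches (mu, q, mu_target, q_target, action_max); the learning rates and discount factor are assumptions:

```python
import tensorflow as tf

gamma = 0.99
mu_optimizer = tf.keras.optimizers.Adam(1e-3)
q_optimizer  = tf.keras.optimizers.Adam(1e-3)

def train_step(X, A, R, X2, D):
    # Actor update: maximize Q(s, mu(s)); the minus sign turns this into a
    # minimization for the optimizer. Only the actor's variables receive
    # gradients, so the critic stays constant in this step.
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(q(tf.concat([X, action_max * mu(X)], axis=-1)))
    actor_grads = tape.gradient(actor_loss, mu.trainable_variables)
    mu_optimizer.apply_gradients(zip(actor_grads, mu.trainable_variables))

    # Critic update: TD-error against the target networks. Only the critic's
    # variables receive gradients, so the actor stays constant in this step.
    with tf.GradientTape() as tape:
        q_next = q_target(tf.concat([X2, action_max * mu_target(X2)], axis=-1))
        y = R + gamma * (1.0 - D) * tf.squeeze(q_next, axis=-1)
        q_pred = tf.squeeze(q(tf.concat([X, A], axis=-1)), axis=-1)
        critic_loss = tf.reduce_mean(tf.square(y - q_pred))
    critic_grads = tape.gradient(critic_loss, q.trainable_variables)
    q_optimizer.apply_gradients(zip(critic_grads, q.trainable_variables))
    return actor_loss, critic_loss
```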

Let us quickly run through this code block. We first sample a batch from our replay buffer. For the Actor, we compute actions for the sampled states (X) and then pass both the computed actions and the states (X) to the Critic to obtain Q-values. During backpropagation the Critic remains constant, since we only differentiate with respect to the Actor's variables. The negative sign in front of the loss is there because the optimizer minimizes, whereas we want to maximize this quantity.

For the Critic, we use the target networks to calculate the Q-target for the TD-error computation. The Q-values of the current states (X) are computed with the main Critic network. In this process the Actor remains constant.

Model Updates

The target models need to be updated based on the main models. We update them using the equation described earlier,
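
A sketch of this soft update, following the τ-close-to-1 convention of the equation above (the helper name soft_update is illustrative):

```python
tau = 0.999  # fraction of the old target weights retained at each update

def soft_update(target_net, main_net):
    # theta' <- tau * theta' + (1 - tau) * theta
    new_weights = [tau * w_t + (1.0 - tau) * w_m
                   for w_t, w_m in zip(target_net.get_weights(), main_net.get_weights())]
    target_net.set_weights(new_weights)

# Called after every training step:
# soft_update(mu_target, mu)
# soft_update(q_target, q)
```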

This completes our implementation of the DDPG method.

Results

We run DDPG for 100 episodes on the Pendulum-v0 environment and obtain the following returns.

Training Returns

The full code implementation can be found here.

References

  1. Continuous control with deep reinforcement learning (Lillicrap et al., 2015).
  2. DDPG implementation: GitHub.
  3. OpenAI DDPG.
