Unreliable? The Problem with Deep Deterministic Policy Gradients (DDPG)
source link: https://mc.ai/unreliable-the-problem-with-deep-deterministic-policy-gradients-ddpg-2/
The Deadlock Cycle
But it doesn’t stop there. We can see this saturation as a doorway to deadlock. Once our agent’s actor stabilizes at a suboptimal state, DDPG perpetuates a cycle that is difficult to recover from. Here, we take a look at each of the cycle’s components; if you would like to see the rigorous mathematical derivations, feel free to take a look here.
1. Q Tends to Q Conditioned on Policy
As our critic continually updates its parameters, its output doesn’t converge to the true, optimal Q-value, but rather to the Q-value conditioned on our policy.
This intuitively makes sense. Looking at the critic update equation, we directly feed in our policy’s actions to calculate the target value. But taken by itself, this doesn’t seem to be much of an issue. There are many methods like SARSA that use on-policy updates similar to this, so what’s wrong?
This part of the cycle is problematic because our actor is already saturated. Our policy is stagnant. As a result, our algorithm keeps feeding our critic the same actions whenever updating, making the estimated Q-value stray from its true value.
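To make this concrete, here is a minimal 1-D sketch of the standard DDPG bootstrap target. The function and variable names are illustrative, not from the original article; the point is only that the target always plugs in the policy’s own action, so a saturated policy feeds the critic the same action every update.

```python
# Hypothetical 1-D sketch of the DDPG critic target.
# y = r + gamma * Q(s', mu(s')) always uses the *policy's* action mu(s');
# if the actor is saturated, mu(s') never changes, so the critic is
# repeatedly regressed toward Q conditioned on that fixed policy.

def critic_target(r, s_next, mu, q, gamma=0.99):
    """Standard DDPG bootstrap target: r + gamma * Q(s', mu(s'))."""
    return r + gamma * q(s_next, mu(s_next))

# A saturated policy: its tanh output is pinned at +1 for every state.
mu_saturated = lambda s: 1.0
# A toy critic: any fixed function of (state, action) works for the demo.
q_toy = lambda s, a: 0.5 * s + a

# Two different next states, yet the same stagnant action mu(s') = 1.0.
y1 = critic_target(0.0, 2.0, mu_saturated, q_toy)
y2 = critic_target(0.0, 4.0, mu_saturated, q_toy)
print(y1, y2)
```

Because `mu_saturated` ignores its input, the critic only ever sees the value surface along one action slice, which is exactly how the estimate drifts toward the policy-conditioned Q-value.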
2. Estimated Q is Piece-Wise Constant
Looking at equation 2, we notice that, in sparse environments, the reward term takes on a constant value very often. Without loss of generality, we can set this constant to zero, since the rewards of all transitions can be shifted accordingly.
So, we’re left with the second term. Notice how this term can be replaced with the value function conditioned on our policy. In sparse environments, this value function is dependent on two things: the number of steps until a rewarded state and the value of that reward. This value in itself is piece-wise constant, making the overall Q-value piece-wise constant as well.
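A toy chain environment makes this plateau structure visible. The setup below is an assumption for illustration: a sparse reward only at a goal state, and a fixed policy that always steps right, so the value depends only on the integer number of steps to the goal.

```python
import math

# Toy sparse chain: reward R only at the goal state, zero elsewhere.
# Under a fixed step-right policy, V_pi(s) = gamma^(steps to goal) * R,
# which depends only on an integer step count — so V (and hence Q)
# is piece-wise constant in the state s.

def v_pi(s, goal=10, reward=1.0, gamma=0.99):
    steps = math.ceil(goal - s)          # steps until the rewarded state
    return (gamma ** steps) * reward

# States inside the same "plateau" (same step count) share one value...
plateau = [v_pi(s) for s in (3.2, 3.5, 3.9)]
# ...and the value only jumps when the step count changes.
next_plateau = v_pi(4.1)
print(plateau, next_plateau)
```

Every state between two jumps maps to the identical value, which is precisely the piece-wise constant shape the critic is being asked to fit.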
3. Critic Gradients Approach Zero
As our Q-value tends toward the Q-value conditioned on our policy, it, too, becomes approximately piece-wise constant.
This is an issue.
Because the function is flat almost everywhere, its local gradients are roughly zero. Neural-network function approximators are continuous, so the learned critic is never exactly discontinuous and its gradients never perfectly equal zero. Regardless, our critic is being trained to match this piece-wise constant function, making near-zero gradients a valid approximation. Most importantly, the flatness prevents our agent from receiving any information on how to improve its policy.
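A quick finite-difference check shows what the actor sees on such a surface. Here `q_flat` is an illustrative stand-in (not a trained network) for a critic fitted to a piece-wise constant target: flat around typical actions, with a distant step.

```python
# Finite-difference sketch: if the learned Q-surface is flat around the
# policy's action, the action-gradient the actor relies on is ~0.

def q_flat(a):
    # Stand-in for a critic fitted to a piece-wise constant target:
    # constant on a wide plateau, with one distant jump.
    return 1.0 if a > 5.0 else 0.0

def grad_a(q, a, eps=1e-3):
    """Central finite difference dQ/da."""
    return (q(a + eps) - q(a - eps)) / (2 * eps)

g = grad_a(q_flat, 0.7)   # the policy's action sits on the plateau
print(g)                  # 0.0: no signal for the actor
```

Unless the evaluation point happens to straddle a jump, the measured slope is exactly zero, which mirrors the near-zero gradients of a critic trained on this target.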
4. Our Agent’s Policy Barely Changes
Then, we come full circle. Because DDPG is a deterministic algorithm, our Q-value is always differentiated exactly at the current state s and the policy’s chosen action. Coupled with the fact that our Q-value gradients are very close to zero, this prevents our actor from updating its policy properly, regardless of whether the reward is found regularly in future transitions. Then, we loop back to step one.
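The loop above can be sketched with the deterministic policy gradient chain rule in one dimension. The specific numbers here are assumptions chosen to illustrate the scale of the problem, not measurements: grad_theta J = dQ/da (evaluated at a = mu_theta(s)) times dmu/dtheta.

```python
# Illustrative 1-D deterministic policy gradient:
#   grad_theta J = dQ/da |_{a = mu_theta(s)} * dmu/dtheta
# No matter how sensitive the actor network is (large dmu/dtheta),
# a near-zero critic gradient kills the update and the policy stays put,
# closing the loop back to step 1.

dq_da = 1e-8          # nearly flat critic surface (from step 3)
dmu_dtheta = 50.0     # healthy, highly sensitive actor network
lr = 1e-3             # actor learning rate

theta = 0.3
actor_update = lr * dq_da * dmu_dtheta
theta_new = theta + actor_update
print(actor_update, theta_new)
```

The resulting parameter change is on the order of 1e-10, so the actor’s action barely moves, the critic keeps seeing the same actions, and the deadlock cycle repeats.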