Proximal Policy Optimization (ongoing)
Introduction to Proximal Policy Optimization (PPO) in Reinforcement Learning. In the previous blog posts, we went through Policy Gradient methods (REINFORCE and Actor-Critic) and DQN. In this blog post we'll go through another Policy Gradient method called Proximal Policy Optimization, and we'll also talk about GRPO. To get there, we'll first study Trust Region Policy Optimization (TRPO): why these algorithms were developed and what problem they try to solve.
The main issue with REINFORCE is that the gradient estimate is only valid locally. If you take too large a step, you might overshoot into a bad policy region where performance collapses, destroy the policy you've built so far (catastrophic forgetting), or violate the assumptions that made your gradient estimate valid in the first place.
REINFORCE will give you the direction to improve, but it tells you nothing about how far to step. TRPO fixes this by constraining how far the new policy can deviate from the old one.
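To see why step size matters, here is a minimal numerical sketch (not from any specific paper, just a toy softmax policy over three actions with a made-up gradient direction): the same gradient direction, scaled by different learning rates, produces wildly different policies as measured by KL divergence.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Toy softmax policy over 3 actions, initialized uniform.
theta = np.zeros(3)
grad = np.array([1.0, -0.5, -0.5])  # a hypothetical policy-gradient estimate

old = softmax(theta)
for lr in [0.1, 1.0, 10.0]:
    new = softmax(theta + lr * grad)
    # KL grows rapidly with step size: the gradient gives a direction,
    # but says nothing about how far the policy moves in distribution space.
    print(f"lr={lr:5.1f}  KL(old||new)={kl(old, new):.4f}")
```

A small step barely changes the action distribution, while a large step in the exact same direction collapses the policy onto a single action. TRPO's constraint is precisely a bound on this KL divergence.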
TRPO starts from the following identity (the performance difference lemma):

$$J(\pi') = J(\pi) + \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t)\right]$$

This equation says that the performance of a new policy $\pi'$ equals the old policy's performance plus the expected advantages (computed under the old policy) accumulated along trajectories of the new policy. The problem is that we can't optimize this directly, because $\pi'$ appears both in the expectation (we would need to sample states from the new policy) and in the policy itself.
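The identity can be checked numerically. Below is a sketch on an arbitrary toy MDP (two states, two actions, my own made-up rewards and policies, nothing from TRPO itself): we compute the old policy's advantages exactly, weight them by the new policy's discounted state occupancy, and confirm the sum recovers the new policy's performance.

```python
import numpy as np

# Toy 2-state, 2-action MDP: action a moves deterministically to state a.
gamma = 0.9
n_s, n_a = 2, 2
P = np.zeros((n_s, n_a, n_s))   # P[s, a, s'] transition probabilities
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
R = np.array([[0.0, 1.0],       # R[s, a]: arbitrary rewards
              [0.5, 0.0]])

def policy_matrices(pi):
    """State-to-state transition matrix and expected reward under pi[s, a]."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, R)
    return P_pi, r_pi

def value(pi):
    """Exact state values: solve V = r_pi + gamma * P_pi V."""
    P_pi, r_pi = policy_matrices(pi)
    return np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)

pi_old = np.full((n_s, n_a), 0.5)             # uniform old policy
pi_new = np.array([[0.2, 0.8], [0.6, 0.4]])   # some new policy
s0 = 0                                         # start state

V_old = value(pi_old)
Q_old = R + gamma * np.einsum("sat,t->sa", P, V_old)
A_old = Q_old - V_old[:, None]                 # advantages of the OLD policy

# Discounted state occupancy d(s) = sum_t gamma^t P(s_t = s) under the NEW policy.
P_new, _ = policy_matrices(pi_new)
d = np.linalg.solve(np.eye(n_s) - gamma * P_new.T, np.eye(n_s)[s0])

lhs = value(pi_new)[s0]                                        # J(pi')
rhs = V_old[s0] + d @ np.einsum("sa,sa->s", pi_new, A_old)     # J(pi) + E[advantages]
print(lhs, rhs)  # the two sides agree
```

Note where the difficulty shows up in code: the occupancy `d` is built from `P_new`, i.e. from states visited by the new policy. We only ever have samples from the old policy, which is exactly why TRPO needs a surrogate objective.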