Proximal Policy Optimization

A family of policy gradient methods that alternate between sampling data through interaction with the environment and optimizing a trust-region-inspired surrogate objective function.

There are two primary variants of Proximal Policy Optimization: one in which large KL divergences between the new and old policies are penalized in the objective function, and one in which the objective function is clipped to remove the incentive for the new policy to move too far from the old one.

The clipped surrogate objective to be optimized, governed by a small hyperparameter $\epsilon$, is given as:

$$ L(\theta, \theta_{old}) = \mathbb{E} \left[ \min \left( \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t, \ \operatorname{clip} \left( \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) \right] $$
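
As a concrete sketch, the clipped objective can be computed from the log-probabilities assigned to the sampled actions by the new and old policies. The PyTorch snippet below is illustrative only; the tensor names (`logp_new`, `logp_old`, `advantages`) and the default `epsilon=0.2` are assumptions, not prescribed by the text above.

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Negative clipped surrogate objective (a loss to minimize with SGD).

    logp_new   -- log pi_theta(a_t | s_t), differentiable w.r.t. theta
    logp_old   -- log pi_theta_old(a_t | s_t), detached / fixed
    advantages -- advantage estimates A_hat_t
    """
    ratio = torch.exp(logp_new - logp_old)            # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Elementwise minimum of the two terms, averaged over the batch;
    # negated so that minimizing the loss maximizes the objective.
    return -torch.min(unclipped, clipped).mean()
```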

The adaptive penalty version incorporates the KL divergence directly into the objective function.

$$ L(\theta, \theta_{old}) = \mathbb{E} \left[ \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t - \beta \, \mathrm{KL}\left(\pi_{\theta_{old}}(\cdot | s_t), \pi_{\theta}(\cdot | s_t)\right) \right] $$
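
A matching sketch of the penalized objective, with the same caveats: the per-sample KL estimate `kl` and the variable names are assumed inputs, and the adaptation of $\beta$ is handled separately (see below).

```python
import torch

def kl_penalized_loss(logp_new, logp_old, advantages, kl, beta):
    """Negative KL-penalized surrogate objective.

    kl -- per-sample estimate of KL(pi_theta_old(.|s_t) || pi_theta(.|s_t))
    """
    ratio = torch.exp(logp_new - logp_old)            # pi_theta / pi_theta_old
    return -(ratio * advantages - beta * kl).mean()
```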

The penalty coefficient $\beta$ on the KL term is adaptively adjusted according to:

$$ \beta \leftarrow \begin{cases} \beta / 2 & \text{if } \mathbb{E}\left[\mathrm{KL}\left(\pi_{\theta_{old}}(\cdot | s_t), \pi_{\theta}(\cdot | s_t)\right)\right] < \text{target} / 1.5 \\ \beta \times 2 & \text{if } \mathbb{E}\left[\mathrm{KL}\left(\pi_{\theta_{old}}(\cdot | s_t), \pi_{\theta}(\cdot | s_t)\right)\right] > \text{target} \times 1.5 \end{cases} $$
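
In code, this rule is a comparison of the measured mean KL against a target after each round of policy optimization; the function name and the `kl_target` argument below are illustrative.

```python
def update_beta(beta, mean_kl, kl_target):
    """Adapt the KL penalty coefficient after a policy update."""
    if mean_kl < kl_target / 1.5:
        beta /= 2.0    # policy changed too little: relax the penalty
    elif mean_kl > kl_target * 1.5:
        beta *= 2.0    # policy changed too much: strengthen the penalty
    return beta
```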