Trust Region Policy Optimization

When using policy gradients, small changes in parameter space can sometimes lead to very large differences in performance. This makes large step sizes dangerous, since a single bad step can collapse policy performance.

The trust region policy optimization algorithm aims to update policies by taking the largest step possible while satisfying a constraint on how close the new and old policies are.

$$\underset{\theta}{\text{maximize}} \;\; L(\theta, \theta_{\text{old}}) \quad \text{subject to} \quad \bar{D}_{\mathrm{KL}}(\theta \,\|\, \theta_{\text{old}}) \le \delta$$

The surrogate advantage function $L$ measures how a new policy $\pi_{\theta}$ performs relative to the old policy $\pi_{\theta_{\text{old}}}$, using data collected from the old policy.

$$L(\theta, \theta_{\text{old}}) = \underset{s,a \sim \pi_{\theta_{\text{old}}}}{\mathrm{E}} \left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \, A^{\theta_{\text{old}}}(s,a) \right]$$
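As a rough illustration, the expectation can be approximated by a sample mean over a batch of transitions collected with the old policy. The sketch below is a minimal NumPy version; the names `new_probs`, `old_probs`, and `advantages` are assumed inputs, not part of any particular library.

```python
import numpy as np

def surrogate_advantage(new_probs, old_probs, advantages):
    """Monte Carlo estimate of L(theta, theta_old).

    new_probs:  pi_theta(a|s) for each sampled (s, a) pair
    old_probs:  pi_theta_old(a|s) for the same pairs
    advantages: advantage estimates A(s, a) computed under the old policy
    """
    ratio = new_probs / old_probs        # importance sampling ratio
    return np.mean(ratio * advantages)   # sample mean approximates the expectation

# Small illustrative batch (numbers are made up for the example)
new_probs = np.array([0.30, 0.55, 0.20])
old_probs = np.array([0.25, 0.60, 0.25])
advantages = np.array([1.2, -0.4, 0.7])
print(surrogate_advantage(new_probs, old_probs, advantages))
```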

Updates are constrained by the KL divergence, which measures how different a probability distribution P is from another distribution Q.

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x \in X} P(x) \log\!\left(\frac{P(x)}{Q(x)}\right)$$
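For discrete distributions given as probability vectors, this sum is straightforward to compute. A minimal sketch (the function name and inputs are illustrative):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms where P(x) = 0 contribute nothing to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q))  # nonnegative, and zero only when p == q
```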

In this case, we measure the KL divergence between the new and old policies' action distributions, averaged over the states visited by the old policy.

$$\bar{D}_{\mathrm{KL}}(\theta \,\|\, \theta_{\text{old}}) = \underset{s \sim \pi_{\theta_{\text{old}}}}{\mathrm{E}} \left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \,|\, s) \,\|\, \pi_{\theta}(\cdot \,|\, s) \right) \right]$$
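In practice this expectation can again be estimated with a sample mean. The sketch below assumes discrete action spaces, with each policy represented by a batch of per-state action probability rows; the array names are illustrative.

```python
import numpy as np

def mean_policy_kl(old_action_probs, new_action_probs):
    """Average D_KL(pi_theta_old(.|s) || pi_theta(.|s)) over a batch of sampled states.

    Both arguments have shape (num_states, num_actions); each row is the action
    distribution of the corresponding policy at one sampled state.
    """
    per_state_kl = np.sum(
        old_action_probs * np.log(old_action_probs / new_action_probs), axis=1
    )
    return np.mean(per_state_kl)  # expectation over states visited by the old policy

old = np.array([[0.6, 0.4], [0.8, 0.2]])
new = np.array([[0.5, 0.5], [0.7, 0.3]])
print(mean_policy_kl(old, new))
```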

The bound $\delta$ on the KL divergence constraint is a small constant chosen as a hyperparameter of the algorithm.
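To make the "largest step satisfying the constraint" idea concrete, here is a simplified backtracking line search sketch. The full TRPO update computes the step direction with a second-order (natural gradient / conjugate gradient) method, which is omitted here; the function names, `delta` default, and stand-in objectives below are assumptions for illustration only.

```python
import numpy as np

def backtracking_update(theta_old, step_direction, surrogate_fn, kl_fn,
                        delta=0.01, backtrack_coeff=0.8, max_backtracks=10):
    """Return the largest step along step_direction that keeps the mean KL
    divergence to theta_old below delta and improves the surrogate advantage.

    surrogate_fn(theta) and kl_fn(theta) are assumed to evaluate L(theta, theta_old)
    and the mean KL divergence to the old policy on the collected batch of data.
    """
    baseline = surrogate_fn(theta_old)
    for i in range(max_backtracks):
        step = backtrack_coeff ** i                    # shrink the step geometrically
        theta_new = theta_old + step * step_direction
        if kl_fn(theta_new) <= delta and surrogate_fn(theta_new) > baseline:
            return theta_new                           # largest feasible improving step
    return theta_old                                   # no acceptable step; keep the old policy

# Toy usage with stand-in objectives (not a real policy):
theta = np.zeros(2)
direction = np.array([1.0, 0.5])
improved = backtracking_update(
    theta, direction,
    surrogate_fn=lambda th: -np.sum((th - 1.0) ** 2),  # stand-in surrogate objective
    kl_fn=lambda th: 0.5 * np.sum(th ** 2),            # stand-in KL measure
    delta=0.1,
)
print(improved)
```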