Trust Region Policy Optimization

When using policy gradients, small changes in parameter space can sometimes lead to very large differences in performance. This makes large step sizes dangerous, since a single bad step can collapse policy performance.

The trust region policy optimization algorithm aims to update policies by taking the largest step possible while satisfying a constraint on how close the new and old policies are.

$$\underset{\theta}{\text{maximize}} \;\; L(\theta, \theta_{\text{old}}) \quad \text{subject to} \quad \bar{D}_{\mathrm{KL}}(\theta \,\|\, \theta_{\text{old}}) \le \delta$$

The surrogate advantage function $L$ measures how a new policy $\pi_{\theta}$ performs relative to the old policy $\pi_{\theta_{\text{old}}}$, using data collected from the old policy.

$$L(\theta, \theta_{\text{old}}) = \underset{s,a \sim \pi_{\theta_{\text{old}}}}{\mathrm{E}} \left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \, A^{\theta_{\text{old}}}(s,a) \right]$$
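As a rough illustration, the expectation can be approximated by a sample mean over a batch of transitions collected with the old policy. The sketch below is a minimal NumPy version; the names `new_probs`, `old_probs`, and `advantages` are assumed inputs, not part of any particular library.

```python
import numpy as np

def surrogate_advantage(new_probs, old_probs, advantages):
    """Monte Carlo estimate of L(theta, theta_old).

    new_probs:  pi_theta(a|s) for each sampled (s, a) pair
    old_probs:  pi_theta_old(a|s) for the same pairs
    advantages: advantage estimates A(s, a) computed under the old policy
    """
    ratio = new_probs / old_probs        # importance sampling ratio
    return np.mean(ratio * advantages)   # sample mean approximates the expectation

# Small illustrative batch (numbers are made up for the example)
new_probs = np.array([0.30, 0.55, 0.20])
old_probs = np.array([0.25, 0.60, 0.25])
advantages = np.array([1.2, -0.4, 0.7])
print(surrogate_advantage(new_probs, old_probs, advantages))
```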

Updates are constrained by the KL divergence, which measures how different a probability distribution P is from another distribution Q.

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x \in X} P(x) \log\!\left(\frac{P(x)}{Q(x)}\right)$$
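For discrete distributions given as probability vectors, this sum is straightforward to compute. A minimal sketch (the function name and inputs are illustrative):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms where P(x) = 0 contribute nothing to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q))  # nonnegative, and zero only when p == q
```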

In this case, we measure the KL divergence between the new and old policies' action distributions, averaged over the states visited by the old policy.

$$\bar{D}_{\mathrm{KL}}(\theta \,\|\, \theta_{\text{old}}) = \underset{s \sim \pi_{\theta_{\text{old}}}}{\mathrm{E}} \left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \,|\, s) \,\|\, \pi_{\theta}(\cdot \,|\, s) \right) \right]$$
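In practice this expectation can again be estimated with a sample mean. The sketch below assumes discrete action spaces, with each policy represented by a batch of per-state action probability rows; the array names are illustrative.

```python
import numpy as np

def mean_policy_kl(old_action_probs, new_action_probs):
    """Average D_KL(pi_theta_old(.|s) || pi_theta(.|s)) over a batch of sampled states.

    Both arguments have shape (num_states, num_actions); each row is the action
    distribution of the corresponding policy at one sampled state.
    """
    per_state_kl = np.sum(
        old_action_probs * np.log(old_action_probs / new_action_probs), axis=1
    )
    return np.mean(per_state_kl)  # expectation over states visited by the old policy

old = np.array([[0.6, 0.4], [0.8, 0.2]])
new = np.array([[0.5, 0.5], [0.7, 0.3]])
print(mean_policy_kl(old, new))
```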

The bound $\delta$ on the KL divergence constraint is a small constant chosen as a hyperparameter of the algorithm.
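To make the "largest step satisfying the constraint" idea concrete, here is a simplified backtracking line search sketch. The full TRPO update computes the step direction with a second-order (natural gradient / conjugate gradient) method, which is omitted here; the function names, `delta` default, and stand-in objectives below are assumptions for illustration only.

```python
import numpy as np

def backtracking_update(theta_old, step_direction, surrogate_fn, kl_fn,
                        delta=0.01, backtrack_coeff=0.8, max_backtracks=10):
    """Return the largest step along step_direction that keeps the mean KL
    divergence to theta_old below delta and improves the surrogate advantage.

    surrogate_fn(theta) and kl_fn(theta) are assumed to evaluate L(theta, theta_old)
    and the mean KL divergence to the old policy on the collected batch of data.
    """
    baseline = surrogate_fn(theta_old)
    for i in range(max_backtracks):
        step = backtrack_coeff ** i                    # shrink the step geometrically
        theta_new = theta_old + step * step_direction
        if kl_fn(theta_new) <= delta and surrogate_fn(theta_new) > baseline:
            return theta_new                           # largest feasible improving step
    return theta_old                                   # no acceptable step; keep the old policy

# Toy usage with stand-in objectives (not a real policy):
theta = np.zeros(2)
direction = np.array([1.0, 0.5])
improved = backtracking_update(
    theta, direction,
    surrogate_fn=lambda th: -np.sum((th - 1.0) ** 2),  # stand-in surrogate objective
    kl_fn=lambda th: 0.5 * np.sum(th ** 2),            # stand-in KL measure
    delta=0.1,
)
print(improved)
```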