When using policy gradients, small changes in parameter space can sometimes lead to very large changes in performance. This makes large step sizes dangerous, since a single bad step can collapse policy performance.
The Trust Region Policy Optimization (TRPO) algorithm updates the policy by taking the largest step possible while satisfying a constraint on how close the new and old policies are allowed to be.
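Written out, the update solves a constrained optimization problem (this is the standard theoretical TRPO update; practical implementations approximate it):

$$ \theta_{k+1} = \arg\max_{\theta} \; \mathcal{L}(\theta_k, \theta) \quad \text{s.t.} \quad \bar{D}_{\mathrm{KL}}(\theta \,\|\, \theta_k) \le \delta $$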
The surrogate advantage measures how well the new policy performs relative to the old policy, using data collected under the old policy.
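In standard notation, with $\pi_{\theta}$ the new policy, $\pi_{\theta_k}$ the old policy, and $A^{\pi_{\theta_k}}$ the old policy's advantage function, the surrogate advantage is:

$$ \mathcal{L}(\theta_k, \theta) = \mathop{\mathbb{E}}_{s, a \sim \pi_{\theta_k}} \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_k}(a \mid s)} \, A^{\pi_{\theta_k}}(s, a) \right] $$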
Updates are constrained by the KL divergence, which measures how different one probability distribution is from another. In this case, we measure the KL divergence between the action distributions of the new and old policies.
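Concretely, the constraint is usually stated as an average KL divergence over states visited by the old policy (conventions differ on the argument order; the form below follows the new-versus-old ordering used in the sentence above):

$$ \bar{D}_{\mathrm{KL}}(\theta \,\|\, \theta_k) = \mathop{\mathbb{E}}_{s \sim \pi_{\theta_k}} \left[ D_{\mathrm{KL}}\!\left( \pi_{\theta}(\cdot \mid s) \,\big\|\, \pi_{\theta_k}(\cdot \mid s) \right) \right] \le \delta $$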
The bound on the KL divergence constraint, usually denoted $\delta$, is a hyperparameter chosen by the practitioner.
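A minimal sketch of how these quantities could be estimated from a sampled batch, assuming discrete-action (categorical) policies; the array names and the value of `delta` are illustrative, not taken from any particular implementation:

```python
import numpy as np

def surrogate_advantage(new_probs, old_probs, advantages):
    """Sample estimate of the surrogate advantage L(theta_k, theta).

    new_probs, old_probs: probabilities the new/old policy assigns to the
    actions actually taken in the batch, shape (batch,).
    advantages: advantage estimates for those state-action pairs.
    """
    ratio = new_probs / old_probs  # importance-sampling ratio between policies
    return np.mean(ratio * advantages)

def mean_kl(new_dists, old_dists, eps=1e-8):
    """Average KL divergence between new and old action distributions.

    new_dists, old_dists: full action distributions at each sampled state,
    shape (batch, num_actions).
    """
    kl_per_state = np.sum(
        new_dists * (np.log(new_dists + eps) - np.log(old_dists + eps)), axis=1
    )
    return np.mean(kl_per_state)

# Illustrative usage: a candidate step is acceptable only if the average KL
# divergence stays within the trust region defined by the hyperparameter delta.
delta = 0.01  # illustrative value; delta is tuned by the practitioner
# accept_step = mean_kl(new_dists, old_dists) <= delta
```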