When using policy gradients, small changes in parameter space can sometimes produce large differences in performance. This makes large step sizes dangerous, since a single bad step can collapse policy performance.
The Trust Region Policy Optimization (TRPO) algorithm updates the policy by taking the largest step possible while satisfying a constraint on how far the new policy can move from the old one.
$$ \underset{\theta}{\text{maximize}} \ \mathcal{L}(\theta, \theta_{old}) \quad \text{subject to} \quad D_{KL}(\theta \,||\, \theta_{old}) \leq \delta $$
The surrogate advantage $\mathcal{L}$ measures how a new policy $\pi_{\theta}$ performs relative to the old policy $\pi_{\theta_{old}}$, using data collected with the old policy.
$$ \mathcal{L}(\theta, \theta_{old}) = \mathbb{E}_{s, a \sim \pi_{\theta_{old}}}\left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} A_{\theta_{old}}(s,a) \right] $$
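As a concrete illustration, here is a minimal sketch of how this objective is typically estimated from samples. It assumes PyTorch, with log-probabilities and advantage estimates already computed from rollouts of the old policy.

```python
import torch

def surrogate_advantage(new_logp, old_logp, advantages):
    """Sample estimate of the surrogate objective L(theta, theta_old).

    new_logp:   log pi_theta(a|s) for each sampled (s, a) pair
    old_logp:   log pi_theta_old(a|s) for the same pairs (treated as constant)
    advantages: advantage estimates A_theta_old(s, a)
    """
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s), computed in log space
    ratio = torch.exp(new_logp - old_logp.detach())
    # The expectation is approximated by the mean over collected transitions
    return (ratio * advantages).mean()
```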
Updates are constrained by the KL divergence, which measures how different a probability distribution $P$ is from another distribution $Q$.
$$ D_{KL}(P \,||\, Q) = \sum_{x \in X} P(x) \log\left( \frac{P(x)}{Q(x)} \right) $$
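For intuition, a small sketch of the discrete KL divergence, assuming PyTorch tensors of probabilities:

```python
import torch

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as probability tensors."""
    # Terms with P(x) = 0 contribute nothing to the sum
    mask = p > 0
    return torch.sum(p[mask] * torch.log(p[mask] / q[mask]))

p = torch.tensor([0.9, 0.1])
q = torch.tensor([0.5, 0.5])
print(kl_divergence(p, p))  # 0: identical distributions do not diverge
print(kl_divergence(p, q))  # ~0.368: divergence grows as Q drifts from P
```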
In this case, we measure the KL divergence between the old and new policy's action distributions, averaged over the states visited by the old policy.
$$ D_{KL}(\theta \,||\, \theta_{old}) = \mathbb{E}_{s \sim \pi_{\theta_{old}}}\left[ D_{KL}(\pi_{\theta_{old}}(\cdot | s) \,||\, \pi_{\theta}(\cdot | s)) \right] $$
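A minimal sketch of this quantity for a categorical policy, assuming PyTorch tensors of strictly positive action probabilities evaluated at a batch of sampled states:

```python
import torch

def mean_policy_kl(old_probs, new_probs):
    """Average KL(pi_theta_old(.|s) || pi_theta(.|s)) over a batch of states.

    old_probs, new_probs: tensors of shape (batch_size, num_actions) holding
    the action distributions of the old and new policy at each sampled state.
    """
    # Per-state KL between the two categorical action distributions
    per_state_kl = torch.sum(
        old_probs * (torch.log(old_probs) - torch.log(new_probs)), dim=-1
    )
    # The expectation over states is approximated by the batch mean
    return per_state_kl.mean()
```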
The bound $\delta$ on the KL divergence is a small hyperparameter that sets the size of the trust region.
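In practice, implementations often enforce the bound with a backtracking line search: propose a step, then shrink it until the average KL stays below $\delta$ and the surrogate objective improves. The sketch below shows only that acceptance loop; `surrogate_fn`, `kl_fn`, and `full_step` are assumed placeholders for quantities computed elsewhere (for example, via the conjugate gradient step TRPO uses).

```python
import torch

def kl_constrained_update(theta_old, full_step, delta, surrogate_fn, kl_fn,
                          backtrack_coeff=0.8, max_backtracks=10):
    """Backtracking line search that keeps the update inside the trust region.

    theta_old:    flat tensor of current policy parameters
    full_step:    proposed update direction, scaled to the trust-region boundary
    delta:        the KL bound
    surrogate_fn: evaluates L(theta, theta_old) at a candidate parameter vector
    kl_fn:        evaluates the average KL against the old policy
    """
    old_surrogate = surrogate_fn(theta_old)
    for i in range(max_backtracks):
        # Shrink the step geometrically until both conditions hold
        candidate = theta_old + (backtrack_coeff ** i) * full_step
        if kl_fn(candidate) <= delta and surrogate_fn(candidate) > old_surrogate:
            return candidate  # largest tested step satisfying the constraint
    return theta_old  # no acceptable step found; keep the old policy
```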