Optimal Brain Damage

Optimal Brain Damage is an approach for identifying and removing unimportant weights from a neural network. The procedure is to train the network to a reasonable solution, compute the second derivatives for each parameter, compute the saliencies, sort the parameters by saliency, delete some low-saliency parameters, and then retrain, iterating as desired.
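
A minimal sketch of this loop in NumPy: `train_to_convergence` and `hessian_diagonal` are assumed helper functions, `prune_fraction` and `rounds` are illustrative choices, and the saliency formula $s_k = \frac{1}{2} h_{kk} u_k^2$ used here is derived below.

```python
import numpy as np

def obd_prune(params, train_to_convergence, hessian_diagonal,
              prune_fraction=0.1, rounds=3):
    """Sketch of the Optimal Brain Damage loop.

    params               -- 1-D array of network parameters u_k
    train_to_convergence -- assumed helper: trains the network, returns updated params
    hessian_diagonal     -- assumed helper: returns h_kk for every parameter
    prune_fraction       -- fraction of parameters to delete per round (illustrative)
    """
    for _ in range(rounds):
        params = train_to_convergence(params)   # 1. train until E reaches a minimum
        h = hessian_diagonal(params)            # 2. diagonal second derivatives h_kk
        saliency = 0.5 * h * params ** 2        # 3. saliencies s_k = h_kk * u_k^2 / 2
        order = np.argsort(saliency)            # 4. sort parameters by saliency
        n_delete = int(prune_fraction * params.size)
        params[order[:n_delete]] = 0.0          # 5. delete low-saliency parameters
        # (a fuller implementation would also freeze deleted parameters at zero)
    return params                               # 6. iterate by looping back to training
```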

The saliency of a parameter is defined as the change in the objective function caused by deleting that parameter. Using second derivatives of the objective function with respect to the parameters avoids the prohibitive labor of evaluating saliency directly, i.e. temporarily deleting each parameter and reevaluating the objective function. The authors approximate the change in the objective function $E$ by a Taylor series: a perturbation $\delta U$ of the parameter vector changes the objective by

\begin{gather*} \delta E = \sum_i g_i \delta u_i + \frac{1}{2} \sum_i h_{ii} \delta u_i^2 + \frac{1}{2} \sum_{i \neq j} h_{ij} \delta u_i \delta u_j + O(|| \delta U ||^3) \\ h_{ij} = \frac{\partial^2 E}{\partial u_i \partial u_j} \\ g_i = \frac{\partial E}{\partial u_i} \end{gather*}

where $\delta u_i$ is a component of the perturbation $\delta U$, $g_i$ is a component of the gradient $G$ of $E$ with respect to $U$, and $h_{ij}$ are the elements of the Hessian matrix $H$ of $E$ with respect to $U$. Second-order methods are typically impractical for neural networks because of the enormous size of the Hessian matrix. Optimal Brain Damage therefore introduces a simple diagonal approximation: the change in $E$ caused by deleting several parameters is assumed to be the sum of the changes caused by deleting each parameter individually. The cross terms of the Hessian are neglected, so the third term of $\delta E$ is discarded. An extremal approximation further assumes that parameter deletion is performed after training has converged; the parameter vector is then at a local minimum of $E$, so the first term of $\delta E$ vanishes. These simplifications reduce $\delta E$ to:

$$ \delta E = \frac{1}{2} \sum_i h_{ii} \delta u_i^2 $$
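
As a quick sanity check (not from the paper), consider a toy quadratic objective whose Hessian is known exactly. At its minimum the gradient term vanishes, and deleting a parameter $u_k$ means perturbing it by $\delta u_k = -u_k$, so the predicted change is $\frac{1}{2} \sum_k h_{kk} u_k^2$ summed over the deleted parameters; any residual mismatch comes from the neglected cross terms. The objective, `A`, and `b` below are arbitrary illustrative choices.

```python
import numpy as np

# Toy objective E(u) = 0.5 (u - b)^T A (u - b), minimised at u = b.
# The Hessian of this E is exactly A.
rng = np.random.default_rng(0)
n = 6
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)              # positive-definite Hessian
b = rng.normal(size=n)

def E(u):
    return 0.5 * (u - b) @ A @ (u - b)

u = b.copy()                         # "trained to convergence": u sits at the minimum
h_diag = np.diag(A)                  # diagonal second derivatives h_kk

deleted = [0, 2, 5]                  # delete several parameters at once (set to zero)
u_pruned = u.copy()
u_pruned[deleted] = 0.0

predicted = 0.5 * np.sum(h_diag[deleted] * u[deleted] ** 2)
actual = E(u_pruned) - E(u)
print(predicted, actual)             # equal up to the neglected cross terms
```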

Deleting parameter $u_k$ corresponds to the perturbation $\delta u_k = -u_k$, so its saliency is $s_k = \frac{1}{2} h_{kk} u_k^2$. The diagonal terms of the second derivatives are obtained by summing over $V_k$, the set of connections $(i, j)$ controlled by parameter $u_k$ (a single connection unless weights are shared):

$$ h_{kk} = \sum_{(i,j) \in V_k} \frac{\partial^2 E}{\partial w_{ij}^2} $$
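
When weights are shared, the per-connection terms must be accumulated into per-parameter totals. A small sketch of that accumulation, assuming a `param_index` array (a hypothetical name) that records which parameter $u_k$ each connection $w_{ij}$ is tied to:

```python
import numpy as np

def accumulate_shared(d2E_dw2, param_index, n_params):
    """h_kk = sum over all connections (i, j) in V_k of d^2E/dw_ij^2.

    d2E_dw2     -- per-connection second derivatives (any shape)
    param_index -- integer array of the same shape: the parameter k each connection uses
    n_params    -- number of distinct parameters u_k
    """
    h = np.zeros(n_params)
    np.add.at(h, param_index.ravel(), d2E_dw2.ravel())   # unbuffered scatter-add
    return h
```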

Using the standard expressions for the network state, $x_i = f(a_i)$ and $a_i = \sum_j w_{ij} x_j$, where $x_i$ is the state of unit $i$, $a_i$ is its weighted input sum, $w_{ij}$ is the weight of the connection from unit $j$ to unit $i$, and $f$ is the activation function, the summand expands to

$$ \frac{\partial^2 E}{\partial w_{ij}^2} = \frac{\partial^2 E}{\partial a_i^2} x_j^2 $$
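
For a single fully connected layer this is just an outer product. A vectorised sketch (the array names are illustrative), which also yields the per-weight saliencies from the formula above:

```python
import numpy as np

def layer_weight_saliencies(d2E_da2, x, W):
    """Per-connection second derivatives and saliencies for one layer.

    d2E_da2 -- d^2E/da_i^2 for each unit i of the layer
    x       -- inputs x_j feeding the layer
    W       -- weight matrix, W[i, j] = w_ij
    """
    h = np.outer(d2E_da2, x ** 2)   # h[i, j] = d^2E/dw_ij^2 = d^2E/da_i^2 * x_j^2
    saliency = 0.5 * h * W ** 2     # s_ij = h_ij * w_ij^2 / 2
    return h, saliency
```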

The per-unit terms $\frac{\partial^2 E}{\partial a_i^2}$ are in turn back-propagated from layer to layer, again keeping only diagonal terms, with a boundary condition at the output layer. For a unit $i$ feeding into units $l$, and for an output unit under the squared-error objective $E = \sum_j (d_j - x_j)^2$, respectively:

\begin{align*} \frac{\partial^2 E}{\partial a_i^2} &= f'(a_i)^2 \sum_l w_{li}^2 \frac{\partial^2 E}{\partial a_l^2} + f''(a_i) \frac{\partial E}{\partial x_i} \\ \frac{\partial^2 E}{\partial a_i^2} &= 2 f'(a_i)^2 - 2(d_i - x_i)f''(a_i) \end{align*}
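
Putting the pieces together, here is a minimal sketch of the diagonal second-derivative backward pass for a one-hidden-layer network, following the recurrence and boundary condition above. The network shape, the helper name `obd_hessian_diagonal`, and the choice of sigmoid activations are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def obd_hessian_diagonal(W1, W2, x, d):
    """Diagonal second derivatives h = d^2E/dw^2 for a one-hidden-layer network.

    Squared-error objective E = sum_j (d_j - x_j)^2, sigmoid activations.
    W1 has shape (hidden, in), W2 has shape (out, hidden); returns arrays with
    the same shapes as W1 and W2.
    """
    # Forward pass.
    a1 = W1 @ x                  # weighted sums, hidden layer
    x1 = sigmoid(a1)
    a2 = W2 @ x1                 # weighted sums, output layer
    x2 = sigmoid(a2)

    # Activation derivatives: f' = s(1 - s), f'' = f'(1 - 2s) for the sigmoid.
    f1p, f2p = x1 * (1 - x1), x2 * (1 - x2)
    f1pp, f2pp = f1p * (1 - 2 * x1), f2p * (1 - 2 * x2)

    # Boundary condition at the output layer: d2E/da^2 = 2 f'(a)^2 - 2(d - x) f''(a).
    d2E_da2_out = 2 * f2p ** 2 - 2 * (d - x2) * f2pp

    # Ordinary back-propagation of dE/dx to the hidden layer (needed for the f'' term).
    dE_da_out = -2 * (d - x2) * f2p
    dE_dx_hidden = W2.T @ dE_da_out

    # Recurrence for hidden units:
    # d2E/da_i^2 = f'(a_i)^2 sum_l w_li^2 d2E/da_l^2 + f''(a_i) dE/dx_i.
    d2E_da2_hidden = f1p ** 2 * ((W2 ** 2).T @ d2E_da2_out) + f1pp * dE_dx_hidden

    # Per-weight second derivatives: d2E/dw_ij^2 = d2E/da_i^2 * x_j^2.
    h_W2 = np.outer(d2E_da2_out, x1 ** 2)
    h_W1 = np.outer(d2E_da2_hidden, x ** 2)
    return h_W1, h_W2

# Example usage (shapes and data are illustrative):
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x, d = rng.normal(size=3), np.array([0.0, 1.0])
h_W1, h_W2 = obd_hessian_diagonal(W1, W2, x, d)
saliency_W1 = 0.5 * h_W1 * W1 ** 2   # per-weight saliencies s = h w^2 / 2
```

The returned second derivatives feed directly into the saliency computation $s_k = \frac{1}{2} h_{kk} u_k^2$ used to rank and delete weights; in practice they would be accumulated over the training set before pruning.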