Optimal Brain Damage

Optimal Brain Damage is an approach for identifying and removing unimportant weights from a neural network. The procedure is to train the network to a reasonable solution, compute the second derivatives for each parameter, compute the saliencies, sort the parameters by saliency, delete some low-saliency parameters, and then retrain, iterating as desired.
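
A minimal sketch of this loop in NumPy: `train_to_convergence` and `hessian_diagonal` are assumed helper functions, `prune_fraction` and `rounds` are illustrative choices, and the saliency formula $s_k = \frac{1}{2} h_{kk} u_k^2$ used here is derived below.

```python
import numpy as np

def obd_prune(params, train_to_convergence, hessian_diagonal,
              prune_fraction=0.1, rounds=3):
    """Sketch of the Optimal Brain Damage loop.

    params               -- 1-D array of network parameters u_k
    train_to_convergence -- assumed helper: trains the network, returns updated params
    hessian_diagonal     -- assumed helper: returns h_kk for every parameter
    prune_fraction       -- fraction of parameters to delete per round (illustrative)
    """
    for _ in range(rounds):
        params = train_to_convergence(params)   # 1. train until E reaches a minimum
        h = hessian_diagonal(params)            # 2. diagonal second derivatives h_kk
        saliency = 0.5 * h * params ** 2        # 3. saliencies s_k = h_kk * u_k^2 / 2
        order = np.argsort(saliency)            # 4. sort parameters by saliency
        n_delete = int(prune_fraction * params.size)
        params[order[:n_delete]] = 0.0          # 5. delete low-saliency parameters
        # (a fuller implementation would also freeze deleted parameters at zero)
    return params                               # 6. iterate by looping back to training
```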

The saliency of a parameter is defined as the change in the objective function caused by deleting that parameter. Using second derivatives of the objective function with respect to the parameters avoids the prohibitive labor of evaluating saliency directly, i.e. temporarily deleting each parameter and reevaluating the objective function. The authors approximate the change in the objective function $E$ by a Taylor series: a perturbation $\delta U$ of the parameter vector changes the objective by

\begin{gather*} \delta E = \sum_i g_i \delta u_i + \frac{1}{2} \sum_i h_{ii} \delta u_i^2 + \frac{1}{2} \sum_{i \neq j} h_{ij} \delta u_i \delta u_j + O(|| \delta U ||^3) \\ h_{ij} = \frac{\partial^2 E}{\partial u_i \partial u_j} \\ g_i = \frac{\partial E}{\partial u_i} \end{gather*}

where $\delta u_i$ is a component of the perturbation $\delta U$, $g_i$ is a component of the gradient $G$ of $E$ with respect to $U$, and $h_{ij}$ are the elements of the Hessian matrix $H$ of $E$ with respect to $U$. Second-order methods are typically impractical for neural networks because of the enormous size of the Hessian matrix. Optimal Brain Damage therefore introduces a simple diagonal approximation: the change in $E$ caused by deleting several parameters is assumed to be the sum of the changes caused by deleting each parameter individually. The cross terms of the Hessian are neglected, so the third term of $\delta E$ is discarded. An extremal approximation further assumes that parameter deletion is performed after training has converged; the parameter vector is then at a local minimum of $E$, so the first term of $\delta E$ vanishes. These simplifications reduce $\delta E$ to:

$$ \delta E = \frac{1}{2} \sum_i h_{ii} \delta u_i^2 $$
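
As a quick sanity check (not from the paper), consider a toy quadratic objective whose Hessian is known exactly. At its minimum the gradient term vanishes, and deleting a parameter $u_k$ means perturbing it by $\delta u_k = -u_k$, so the predicted change is $\frac{1}{2} \sum_k h_{kk} u_k^2$ summed over the deleted parameters; any residual mismatch comes from the neglected cross terms. The objective, `A`, and `b` below are arbitrary illustrative choices.

```python
import numpy as np

# Toy objective E(u) = 0.5 (u - b)^T A (u - b), minimised at u = b.
# The Hessian of this E is exactly A.
rng = np.random.default_rng(0)
n = 6
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)              # positive-definite Hessian
b = rng.normal(size=n)

def E(u):
    return 0.5 * (u - b) @ A @ (u - b)

u = b.copy()                         # "trained to convergence": u sits at the minimum
h_diag = np.diag(A)                  # diagonal second derivatives h_kk

deleted = [0, 2, 5]                  # delete several parameters at once (set to zero)
u_pruned = u.copy()
u_pruned[deleted] = 0.0

predicted = 0.5 * np.sum(h_diag[deleted] * u[deleted] ** 2)
actual = E(u_pruned) - E(u)
print(predicted, actual)             # equal up to the neglected cross terms
```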

Deleting parameter $u_k$ corresponds to the perturbation $\delta u_k = -u_k$, so its saliency is $s_k = \frac{1}{2} h_{kk} u_k^2$. The diagonal terms of the second derivatives are obtained by summing over $V_k$, the set of connections $(i, j)$ controlled by parameter $u_k$ (a single connection unless weights are shared):

$$ h_{kk} = \sum_{(i,j) \in V_k} \frac{\partial^2 E}{\partial w_{ij}^2} $$
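
When weights are shared, the per-connection terms must be accumulated into per-parameter totals. A small sketch of that accumulation, assuming a `param_index` array (a hypothetical name) that records which parameter $u_k$ each connection $w_{ij}$ is tied to:

```python
import numpy as np

def accumulate_shared(d2E_dw2, param_index, n_params):
    """h_kk = sum over all connections (i, j) in V_k of d^2E/dw_ij^2.

    d2E_dw2     -- per-connection second derivatives (any shape)
    param_index -- integer array of the same shape: the parameter k each connection uses
    n_params    -- number of distinct parameters u_k
    """
    h = np.zeros(n_params)
    np.add.at(h, param_index.ravel(), d2E_dw2.ravel())   # unbuffered scatter-add
    return h
```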

Using the standard expressions for the network state, $x_i = f(a_i)$ and $a_i = \sum_j w_{ij} x_j$, where $x_i$ is the state of unit $i$, $a_i$ is its weighted input sum, $w_{ij}$ is the weight of the connection from unit $j$ to unit $i$, and $f$ is the activation function, the summand expands to

$$ \frac{\partial^2 E}{\partial w_{ij}^2} = \frac{\partial^2 E}{\partial a_i^2} x_j^2 $$
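
For a single fully connected layer this is just an outer product. A vectorised sketch (the array names are illustrative), which also yields the per-weight saliencies from the formula above:

```python
import numpy as np

def layer_weight_saliencies(d2E_da2, x, W):
    """Per-connection second derivatives and saliencies for one layer.

    d2E_da2 -- d^2E/da_i^2 for each unit i of the layer
    x       -- inputs x_j feeding the layer
    W       -- weight matrix, W[i, j] = w_ij
    """
    h = np.outer(d2E_da2, x ** 2)   # h[i, j] = d^2E/dw_ij^2 = d^2E/da_i^2 * x_j^2
    saliency = 0.5 * h * W ** 2     # s_ij = h_ij * w_ij^2 / 2
    return h, saliency
```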

The per-unit terms $\frac{\partial^2 E}{\partial a_i^2}$ are in turn back-propagated from layer to layer, again keeping only diagonal terms, with a boundary condition at the output layer. For a unit $i$ feeding into units $l$, and for an output unit under the squared-error objective $E = \sum_j (d_j - x_j)^2$, respectively:

\begin{align*} \frac{\partial^2 E}{\partial a_i^2} &= f'(a_i)^2 \sum_l w_{li}^2 \frac{\partial^2 E}{\partial a_l^2} + f''(a_i) \frac{\partial E}{\partial x_i} \\ \frac{\partial^2 E}{\partial a_i^2} &= 2 f'(a_i)^2 - 2(d_i - x_i)f''(a_i) \end{align*}
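
Putting the pieces together, here is a minimal sketch of the diagonal second-derivative backward pass for a one-hidden-layer network, following the recurrence and boundary condition above. The network shape, the helper name `obd_hessian_diagonal`, and the choice of sigmoid activations are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def obd_hessian_diagonal(W1, W2, x, d):
    """Diagonal second derivatives h = d^2E/dw^2 for a one-hidden-layer network.

    Squared-error objective E = sum_j (d_j - x_j)^2, sigmoid activations.
    W1 has shape (hidden, in), W2 has shape (out, hidden); returns arrays with
    the same shapes as W1 and W2.
    """
    # Forward pass.
    a1 = W1 @ x                  # weighted sums, hidden layer
    x1 = sigmoid(a1)
    a2 = W2 @ x1                 # weighted sums, output layer
    x2 = sigmoid(a2)

    # Activation derivatives: f' = s(1 - s), f'' = f'(1 - 2s) for the sigmoid.
    f1p, f2p = x1 * (1 - x1), x2 * (1 - x2)
    f1pp, f2pp = f1p * (1 - 2 * x1), f2p * (1 - 2 * x2)

    # Boundary condition at the output layer: d2E/da^2 = 2 f'(a)^2 - 2(d - x) f''(a).
    d2E_da2_out = 2 * f2p ** 2 - 2 * (d - x2) * f2pp

    # Ordinary back-propagation of dE/dx to the hidden layer (needed for the f'' term).
    dE_da_out = -2 * (d - x2) * f2p
    dE_dx_hidden = W2.T @ dE_da_out

    # Recurrence for hidden units:
    # d2E/da_i^2 = f'(a_i)^2 sum_l w_li^2 d2E/da_l^2 + f''(a_i) dE/dx_i.
    d2E_da2_hidden = f1p ** 2 * ((W2 ** 2).T @ d2E_da2_out) + f1pp * dE_dx_hidden

    # Per-weight second derivatives: d2E/dw_ij^2 = d2E/da_i^2 * x_j^2.
    h_W2 = np.outer(d2E_da2_out, x1 ** 2)
    h_W1 = np.outer(d2E_da2_hidden, x ** 2)
    return h_W1, h_W2

# Example usage (shapes and data are illustrative):
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x, d = rng.normal(size=3), np.array([0.0, 1.0])
h_W1, h_W2 = obd_hessian_diagonal(W1, W2, x, d)
saliency_W1 = 0.5 * h_W1 * W1 ** 2   # per-weight saliencies s = h w^2 / 2
```

The returned second derivatives feed directly into the saliency computation $s_k = \frac{1}{2} h_{kk} u_k^2$ used to rank and delete weights; in practice they would be accumulated over the training set before pruning.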