Optimal Brain Damage

Optimal Brain Damage is an approach for identifying and removing unimportant weights from a neural network. The procedure involves training the network, computing the second derivatives for each parameter, computing the saliencies, sorting the parameters by saliency, deleting some low-saliency parameters, and then retraining (the process can be iterated).
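The sort-and-delete step can be sketched directly, assuming the diagonal second derivatives h_kk have already been computed; the saliency used is ½·h_kk·u_k², the estimated increase in E when parameter u_k is set to zero (the function name and the fixed pruning fraction are illustrative choices, not from the paper):

```python
import numpy as np

def obd_prune(weights, hessian_diag, fraction=0.2):
    """Zero out the lowest-saliency fraction of the parameters,
    ranking by the OBD saliency s_k = 0.5 * h_kk * u_k**2."""
    u = weights.ravel()
    s = 0.5 * hessian_diag.ravel() * u**2   # saliencies
    k = int(fraction * u.size)              # number of parameters to delete
    pruned = u.copy()
    pruned[np.argsort(s)[:k]] = 0.0         # delete the k least salient
    return pruned.reshape(weights.shape)
```

After pruning, the network is typically retrained before another round of deletion.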

The saliency of a parameter is defined as the change in the objective function caused by deleting that parameter. Using the second derivatives of the objective function with respect to the parameters avoids the prohibitive labor of evaluating each saliency directly, i.e. temporarily deleting each parameter and reevaluating the objective function. The method approximates the objective function $E$ by a Taylor series, in which a perturbation $\delta U$ of the parameter vector changes the objective function by

$$\delta E = \sum_i g_i\,\delta u_i + \frac{1}{2}\sum_i h_{ii}\,\delta u_i^2 + \frac{1}{2}\sum_{i \neq j} h_{ij}\,\delta u_i\,\delta u_j + O\!\left(\|\delta U\|^3\right)$$

$$g_i = \frac{\partial E}{\partial u_i} \qquad h_{ij} = \frac{\partial^2 E}{\partial u_i\,\partial u_j}$$

where $\delta u_i$ is a component of the perturbation $\delta U$, $g_i$ is a component of the gradient $G$ of $E$ with respect to $U$, and $h_{ij}$ are the elements of the Hessian matrix $H$ of $E$ with respect to $U$. Second-order methods are typically difficult to apply to neural networks because of the enormous size of the Hessian matrix. Optimal Brain Damage therefore introduces a simple diagonal approximation: the change in the objective function $E$ caused by deleting several parameters is assumed to be the sum of the changes caused by deleting each parameter individually. The cross terms of the Hessian are neglected, so the third term of $\delta E$ is discarded. An extremal approximation further assumes that parameter deletion is performed after training has converged; the parameter vector is then at a local minimum of $E$, so the first term of $\delta E$ can also be neglected. These simplifications reduce $\delta E$ to:

$$\delta E = \frac{1}{2}\sum_i h_{ii}\,\delta u_i^2$$

Deleting a parameter $u_k$ sets it to zero ($\delta u_k = -u_k$), so its saliency under this approximation is $s_k = \tfrac{1}{2} h_{kk} u_k^2$. The diagonal terms are sums of second derivatives over the set $V_k$ of connections $(i,j)$ whose weight $w_{ij}$ is controlled by the (possibly shared) parameter $u_k$:

$$h_{kk} = \sum_{(i,j)\in V_k} \frac{\partial^2 E}{\partial w_{ij}^2}$$
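The effect of these approximations can be checked on a toy quadratic objective (an illustrative example, not from the paper): at a local minimum the gradient term vanishes, so for a single deleted parameter the estimate $\tfrac{1}{2} h_{kk} u_k^2$ recovers the change in $E$ exactly.

```python
import numpy as np

# Toy objective E(u) = 0.5 (u - c)^T A (u - c), minimized at u = c.
A = np.array([[3.0, 0.5],
              [0.5, 2.0]])   # Hessian of E; h_kk are its diagonal entries
c = np.array([1.0, -2.0])

def E(u):
    return 0.5 * (u - c) @ A @ (u - c)

u = c.copy()                          # "trained" parameters sit at the minimum
for k in range(len(u)):
    deleted = u.copy()
    deleted[k] = 0.0                  # delete parameter k (set it to zero)
    true_change = E(deleted) - E(u)   # actual increase in the objective
    saliency = 0.5 * A[k, k] * u[k]**2
    print(k, true_change, saliency)   # equal for a quadratic at its minimum
```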

Given the standard formulas for computing network state, $x_i = f(a_i)$ and $a_i = \sum_j w_{ij} x_j$, where $x_i$ is the state of unit $i$, $a_i$ is its weighted input sum, $w_{ij}$ is the connection from unit $j$ to unit $i$, and $f$ is the activation function, the summand can be expanded as

$$\frac{\partial^2 E}{\partial w_{ij}^2} = \frac{\partial^2 E}{\partial a_i^2}\,x_j^2$$

These second derivatives are then back-propagated from layer to layer, starting from a boundary condition at the output layer. For an objective of the form $E = \sum_i (d_i - x_i)^2$, where $d_i$ is the desired state of output unit $i$:

$$\frac{\partial^2 E}{\partial a_i^2} = f'(a_i)^2 \sum_l w_{li}^2\,\frac{\partial^2 E}{\partial a_l^2} + f''(a_i)\,\frac{\partial E}{\partial x_i}$$

$$\frac{\partial^2 E}{\partial a_i^2} = 2 f'(a_i)^2 - 2\,(d_i - x_i)\,f''(a_i)$$
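Putting the recursion and the boundary condition together, a minimal sketch for a one-hidden-layer sigmoid network (layer shapes, function names, and the use of NumPy are my choices, not the paper's; with a single output unit the diagonal recursion is exact, so it can be verified against finite differences):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def obd_diagonal_hessian(W1, W2, x, d):
    """Diagonal terms d2E/dw_ij^2 for a one-hidden-layer sigmoid network
    with objective E = sum_i (d_i - x_i)^2, via the OBD backward recursion."""
    # Forward pass: a_i = sum_j w_ij x_j, x_i = f(a_i)
    a1 = W1 @ x;  x1 = sigmoid(a1)
    a2 = W2 @ x1; x2 = sigmoid(a2)
    # Sigmoid derivatives: f' = f(1 - f), f'' = f'(1 - 2f)
    f1p, f1pp = x1 * (1 - x1), x1 * (1 - x1) * (1 - 2 * x1)
    f2p, f2pp = x2 * (1 - x2), x2 * (1 - x2) * (1 - 2 * x2)
    # Boundary condition at the output layer
    h_a2 = 2 * f2p**2 - 2 * (d - x2) * f2pp
    # First derivatives dE/dx_i at the hidden layer, needed for the f'' term
    dE_dx1 = W2.T @ (f2p * (-2 * (d - x2)))
    # Recursion: d2E/da_i^2 = f'(a_i)^2 sum_l w_li^2 d2E/da_l^2 + f''(a_i) dE/dx_i
    h_a1 = f1p**2 * ((W2**2).T @ h_a2) + f1pp * dE_dx1
    # d2E/dw_ij^2 = (d2E/da_i^2) x_j^2
    return np.outer(h_a1, x**2), np.outer(h_a2, x1**2)
```

The returned matrices have the same shapes as W1 and W2, so their entries can be fed straight into a saliency-based pruning step.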