Knowledge Distillation

Knowledge distillation is a classic transfer learning technique in which a small model is trained to mimic the outputs of a larger model or an ensemble of models. This allows the small model to learn complex representations that would have been difficult or impossible for it to learn from the raw data alone.

The simplest form of distillation uses the class probabilities produced by the large model as soft targets for the small model, in place of the hard class labels. When the soft targets have high entropy, they provide much more information per training sample, and much less variance in the gradient, than hard targets. When the large model is very confident, however, much of the information about the learned function lies in the ratios of very small probabilities in the soft targets. These ratios define a similarity structure over the data, even though they have almost no influence on the cross-entropy loss. Distillation recovers this information by raising the temperature $T$ of the softmax that produces the probabilities:

$$ q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} $$
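As a concrete illustration, here is a minimal NumPy sketch of the temperature-scaled softmax; the example logits are made up:

```python
import numpy as np

def softened_probs(logits, T=1.0):
    """Temperature-scaled softmax: larger T flattens the distribution."""
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([6.0, 2.0, -2.0])
print(softened_probs(logits, T=1.0))  # ~[0.98, 0.02, 0.00]: ratios nearly invisible to the loss
print(softened_probs(logits, T=4.0))  # ~[0.67, 0.24, 0.09]: small-probability ratios exposed
```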

The small model can be trained on the original dataset or even on an unlabeled transfer set. When true labels are available, the objective can also be improved by combining two terms: the cross-entropy with the soft targets and the cross-entropy with the true labels. The small model typically cannot match the soft targets exactly, and erring in the direction of the correct predictions is helpful.
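A sketch of this combined objective, assuming PyTorch; the temperature `T=4.0` and weight `alpha=0.5` are arbitrary illustration values. The soft term is scaled by $T^2$ so that its gradient magnitude, which shrinks as $1/T^2$ per the derivation below, stays comparable to the hard term as $T$ varies.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    alpha weights the soft term; T is the distillation temperature.
    """
    # KL divergence to the teacher's softened distribution; this differs from
    # the soft-target cross-entropy only by a constant, so it has the same
    # gradient with respect to the student logits.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale: soft-target gradients shrink as 1/T^2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# usage with made-up shapes: batch of 8 samples, 10 classes
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```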

Each sample in the transfer set contributes a gradient $\partial C / \partial z_i$ with respect to each logit $z_i$ of the distilled model. If the larger model has logits $v_i$, the gradient of the soft-target cross-entropy $C$ at temperature $T$ is:

$$ \frac{\partial C}{\partial z_i} = \frac{1}{T} \left( \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} - \frac{\exp(v_i / T)}{\sum_j \exp(v_j / T)} \right) $$
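This can be checked numerically. A small autograd sketch (PyTorch, with made-up logits) compares the analytic gradient against backpropagation through the soft-target cross-entropy:

```python
import torch
import torch.nn.functional as F

T = 2.0
v = torch.tensor([5.0, 1.0, -1.0])                      # large-model logits (made up)
z = torch.tensor([3.0, 2.0, 0.0], requires_grad=True)   # distilled-model logits

# C: cross-entropy between the two temperature-softened distributions
p = F.softmax(v / T, dim=-1)
C = -(p * F.log_softmax(z / T, dim=-1)).sum()
C.backward()

analytic = (F.softmax(z.detach() / T, dim=-1) - p) / T
print(torch.allclose(z.grad, analytic))  # True
```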

If the temperature is high compared to the magnitude of the logits, each exponential can be expanded to first order using $\exp(x) \approx 1 + x$. If, in addition, the logits are zero-meaned for each transfer sample so that $\sum_j z_j = \sum_j v_j = 0$, the gradient simplifies further (with $N$ the number of classes):

\begin{gather*}
\frac{\partial C}{\partial z_i} \approx \frac{1}{T} \left( \frac{1 + z_i / T}{N + \sum_j z_j / T} - \frac{1 + v_i / T}{N + \sum_j v_j / T} \right) \\
\frac{\partial C}{\partial z_i} \approx \frac{1}{N T^2} (z_i - v_i)
\end{gather*}
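In this regime, distillation therefore amounts to matching logits: the gradient above is that of the squared error $\frac{1}{2}(z_i - v_i)^2$ up to the $1/(NT^2)$ scale. A quick numerical check of the limit (NumPy, with random zero-mean logits) shows the relative error of the approximation vanishing as $T$ grows:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N = 10
z = rng.normal(size=N); z -= z.mean()   # zero-mean distilled-model logits
v = rng.normal(size=N); v -= v.mean()   # zero-mean large-model logits

for T in (1.0, 10.0, 100.0):
    exact = (softmax(z / T) - softmax(v / T)) / T
    approx = (z - v) / (N * T**2)
    rel_err = np.abs(exact - approx).max() / np.abs(exact).max()
    print(f"T={T:>5}: relative error {rel_err:.2e}")
```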