Lottery Ticket Hypothesis

It's known that a large fraction of the parameters can be removed from a fully trained network with little discernible loss in predictive power after a few epochs of continued training. However, it has long been difficult to train these sparse subnetworks from scratch to an equivalent accuracy. We know the smaller subnetwork has the capacity to approximate the full-size model, yet the learning dynamics appear to benefit significantly from overparameterization.

The Lottery Ticket Hypothesis introduced a methodology for finding sparse subnetworks that train as well as the dense networks they reside in, as sketched below:

  1. Randomly initialize a neural network $f(x; \theta_0)$ (where $\theta_0 \sim \mathcal{D}_{\theta}$).
  2. Train the network for $j$ iterations, arriving at parameters $\theta_j$.
  3. Prune $p\%$ of the parameters in $\theta_j$, creating a mask $m$.
  4. Reset the remaining parameters to their values in $\theta_0$, creating the winning ticket $f(x; m \circ \theta_0)$.
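
A minimal PyTorch sketch of these four steps, assuming a `train_fn` helper that runs the training loop; the per-layer quantile-based magnitude pruning here is an illustrative choice, not necessarily the exact criterion used in the original work.

```python
import copy
import torch
import torch.nn as nn

def magnitude_mask(model: nn.Module, prune_fraction: float) -> dict:
    """Step 3: mask out the prune_fraction smallest-magnitude weights in each weight tensor."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() > 1:  # prune weight matrices; leave biases and norm parameters dense
            threshold = torch.quantile(param.detach().abs(), prune_fraction)
            masks[name] = (param.detach().abs() > threshold).float()
    return masks

def find_winning_ticket(model: nn.Module, train_fn, prune_fraction: float = 0.2):
    theta_0 = copy.deepcopy(model.state_dict())    # Step 1: remember the random init theta_0
    train_fn(model)                                # Step 2: train for j iterations (assumed helper)
    masks = magnitude_mask(model, prune_fraction)  # Step 3: build the mask m from theta_j
    model.load_state_dict(theta_0)                 # Step 4: rewind surviving weights to theta_0
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])            # apply m ∘ theta_0
    return model, masks
```

Note that when the returned ticket is subsequently trained, the mask has to be re-applied after each optimizer step (or the corresponding gradients zeroed) so that pruned weights stay at zero; that bookkeeping is omitted from the sketch.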

Effective winning tickets are found with Iterative Magnitude Pruning (IMP): the algorithm above is repeated over several rounds, pruning a small percentage of the remaining smallest-magnitude weights in each round and rewinding the survivors to $\theta_0$, until the desired target sparsity is reached.
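
A hedged sketch of that iterative loop, reusing the helpers from the block above; `per_round` and `target_sparsity` are illustrative parameter names, and mask enforcement during training is again left out for brevity.

```python
def iterative_magnitude_pruning(model, train_fn, per_round=0.2, target_sparsity=0.9):
    """Repeat train -> prune -> rewind until the target sparsity is reached."""
    theta_0 = copy.deepcopy(model.state_dict())
    # Start with an all-ones mask over the prunable (multi-dimensional) tensors.
    masks = {name: torch.ones_like(p) for name, p in model.named_parameters() if p.dim() > 1}
    sparsity = 0.0
    while sparsity < target_sparsity:
        train_fn(model)  # train the current subnetwork (a full implementation would also enforce the mask here)
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name not in masks:
                    continue
                # Prune per_round of the *surviving* weights with the smallest magnitude.
                surviving = param.abs()[masks[name].bool()]
                threshold = torch.quantile(surviving, per_round)
                masks[name] *= (param.abs() > threshold).float()
            model.load_state_dict(theta_0)  # rewind remaining weights to theta_0
            for name, param in model.named_parameters():
                if name in masks:
                    param.mul_(masks[name])
        kept = sum(m.sum().item() for m in masks.values())
        total = sum(m.numel() for m in masks.values())
        sparsity = 1.0 - kept / total
    return model, masks
```

Pruning a fixed fraction of the surviving weights each round (rather than a fraction of all weights) is what makes the sparsity compound gradually toward the target.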