Gradient Boosting

Boosting is an approach in ensemble learning where models are trained in succession, each one correcting the errors of the previous models; classical boosting algorithms do this by adaptively reweighting the training set. Gradient boosting generalizes this idea: at its core, it applies the principle of gradient descent in the function space of predictive models, fitting each new model to the negative gradient of the loss.
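The connection to gradient descent is easiest to see with squared-error loss, where the negative gradient of the loss with respect to the current prediction is simply the residual. The following is a minimal sketch of that idea, not a reference implementation; it assumes scikit-learn's `DecisionTreeRegressor` as the base learner, and the function names and hyperparameters are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, learning_rate=0.1, max_depth=2):
    """Gradient boosting for squared-error loss (illustrative sketch)."""
    y = np.asarray(y, dtype=float)
    init = y.mean()                      # start from a constant prediction
    F = np.full(len(y), init)            # current ensemble prediction F(x_i)
    trees = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient of the loss w.r.t. F
        # is the residual y - F, so each round takes a functional
        # gradient step by fitting a shallow tree to these residuals.
        residuals = y - F
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        F += learning_rate * tree.predict(X)
        trees.append(tree)
    return init, trees

def gradient_boost_predict(X, init, trees, learning_rate=0.1):
    return init + learning_rate * sum(tree.predict(X) for tree in trees)
```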

Boosting algorithms typically combine large populations of weak learners such as shallow decision trees. Many variations of boosting have been introduced, differing mainly in how they reweight the training data, with AdaBoost perhaps being the most popular. Consider a dataset $\{(x_1, y_1), \dots, (x_n, y_n)\}$ in which each label $y_i \in \{1, \dots, C\}$, and a set of weak classifiers $\{f_1, \dots, f_M\}$ that are combined to predict a label for each item. The boosted ensemble makes predictions using a linear combination of the individual classifiers, each weighted by a coefficient $\alpha_m$:

$$ F(x_i) = \alpha_1 f_1(x_i) + \dots + \alpha_M f_M(x_i) $$
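For discrete class labels, this combination is commonly evaluated as a weighted vote: each classifier adds its weight $\alpha_m$ to the score of the class it predicts, and the ensemble returns the highest-scoring class. A minimal sketch of that reading (the function name and the 0-indexed class encoding are our assumptions):

```python
import numpy as np

def ensemble_predict(x, classifiers, alphas, n_classes):
    """Weighted vote over weak classifiers.

    Each f_m adds alpha_m to the score of the class it predicts for x;
    classes are assumed to be encoded as 0..C-1 here.
    """
    scores = np.zeros(n_classes)
    for f, alpha in zip(classifiers, alphas):
        scores[f(x)] += alpha
    return int(np.argmax(scores))
```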

Observation weights are initialized to $w_i = \frac{1}{n}$ for each sample in the training set. Then, at each boosting round $m = 1, \dots, M$, a classifier $f_m$ is fit to the training data using the current observation weights, and its weighted error is computed:

$$ err(f_m) = \frac{\sum_{i=1}^n w_i \, \mathbb{I} (y_i \neq f_m(x_i))}{\sum_{i=1}^n w_i} $$

The classifier weight $\alpha_m$ is then computed:

$$ \alpha_m = \log \frac{1 - err(f_m)}{err(f_m)} + \log(C - 1) $$

The $\log(C - 1)$ term handles the multi-class case; for binary classification ($C = 2$) it vanishes and the formula reduces to the standard two-class AdaBoost classifier weight.

The observation weights are then updated to assign more importance to the samples that $f_m$ misclassified, and re-normalized so that they sum to one, $\sum_{i=1}^n w_i = 1$:

$$ w_i \leftarrow w_i \cdot \exp \left( \alpha_m \cdot \mathbb{I} (y_i \neq f_m(x_i)) \right) $$
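These update rules correspond to the multi-class SAMME variant of AdaBoost. The sketch below puts the steps together with decision stumps as weak learners; the function name, the depth-1 base learner, and the early-exit check are our assumptions, not part of the description above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Multi-class AdaBoost (SAMME-style) sketch with decision stumps."""
    y = np.asarray(y)
    n = len(y)
    C = len(np.unique(y))
    w = np.full(n, 1.0 / n)                     # w_i = 1/n
    classifiers, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        miss = (pred != y).astype(float)        # indicator of misclassification
        err = np.dot(w, miss) / w.sum()         # weighted error err(f_m)
        if err <= 0 or err >= 1 - 1.0 / C:      # stop if the weak-learner condition fails
            break
        alpha = np.log((1 - err) / err) + np.log(C - 1)
        w *= np.exp(alpha * miss)               # up-weight misclassified samples
        w /= w.sum()                            # re-normalize so the weights sum to 1
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas
```

The returned classifiers and their weights can then be combined with a weighted vote, as in the earlier prediction sketch.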