Gradient Boosting Machine

ML

  • gradient boosting (GB) is an ensembling technique that gradually turns an approximate model into a good one by adding weak models to a composite model
  • GB itself does not specify how to choose the weak models, but in practice decision trees are almost always used
  • GB adds one weak model at a time
  • each new weak model is fit to the residuals of the current composite model (a minimal usage sketch follows this list)
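
As a quick illustration of trees-as-weak-models in practice, here is a minimal sketch using scikit-learn's GradientBoostingRegressor; the synthetic data and hyperparameter values are arbitrary choices for the example:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data: y = sin(x) + noise (arbitrary example)
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# n_estimators = M (number of weak models), learning_rate = nu,
# max_depth controls the size of each weak decision tree
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=2)
model.fit(X, y)
print(model.predict(X[:5]))
```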


Gradient Boosting for Regression

We are given an $n$-element data set $\{(x_i, y_i)\}_{i=1}^{n}$ and a differentiable loss function $L(y_i, F(x_i))$. $F_m$ denotes the (still imperfect) model at stage $m$.
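
For concreteness, the squared-error loss used below and its negative gradient might be written like this (a minimal sketch; the $1/2$ factor is a common convention that makes the negative gradient equal the plain residual):

```python
import numpy as np

def loss(y, f):
    """Per-observation squared-error loss: L(y_i, F(x_i)) = (y_i - F(x_i))^2 / 2."""
    return 0.5 * (y - f) ** 2

def negative_gradient(y, f):
    """-dL/dF(x_i) = y_i - F(x_i): the pseudo-residual for this loss."""
    return y - f
```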

  1. Let $F_0(x)$ be the constant value $\gamma$ that minimizes the loss across all observations: \[ \begin{equation} F_0(x) = \arg\min_{\gamma}\sum_{i=1}^{n} L(y_i, \gamma) \end{equation} \] If our loss function is the sum of squared residuals (SSR), then $\gamma$ is just the mean of $y_1,\dots, y_n$.

  2. For $m = 1 \dots M$:
    1. Compute the pseudo-residuals \[ \begin{equation} r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}, \quad i = 1\dots n \end{equation} \] For SSR loss, $r_{im}$ is (up to a constant factor) just the ordinary residual $y_i - F_{m-1}(x_i)$, which is why GB is often described as fitting each new model to the residuals

    2. Fit a regression tree to predict the pseudo-residuals $r_{im}$, giving terminal regions $R_{jm}, j = 1 \dots J_m$, where $j$ indexes the leaves of the tree

    3. For $j = 1 \dots J_m$ compute \[ \begin{equation} \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma) \end{equation} \] This step is similar to step 1: we determine the output value $\gamma_{jm}$ for each leaf
    4. Update \[ \begin{equation} F_m(x) = F_{m-1}(x) + \nu\sum_{j=1}^{J_m} \gamma_{jm}I(x \in R_{jm}) \end{equation} \] where $0 <\nu \leq 1$ is the learning rate
  3. Output $F(x) = F_M(x)$.
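
The whole procedure above fits in a few dozen lines of NumPy plus a stock decision tree. The following is a minimal sketch for SSR loss, where closed forms exist both for $\gamma$ in step 1 (the mean) and for $\gamma_{jm}$ in step 2.3 (the per-leaf mean residual, which DecisionTreeRegressor already outputs); all names and hyperparameter values here are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, M=100, nu=0.1, max_depth=2):
    """Gradient boosting for regression with SSR loss, following steps 1-3."""
    # Step 1: F_0(x) is the constant minimizing SSR, i.e. the mean of y
    f0 = y.mean()
    prediction = np.full(len(y), f0)
    trees = []
    for m in range(M):
        # Step 2.1: pseudo-residuals; for SSR this is y_i - F_{m-1}(x_i)
        r = y - prediction
        # Step 2.2: fit a regression tree to the pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)
        # Steps 2.3-2.4: for SSR the optimal leaf value gamma_jm is the mean
        # residual in each leaf, which is exactly what the tree predicts,
        # so the update is F_m(x) = F_{m-1}(x) + nu * tree(x)
        prediction += nu * tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_gbm(f0, trees, X, nu=0.1):
    """Step 3: output F(x) = F_M(x). nu must match the value used in fit_gbm."""
    prediction = np.full(X.shape[0], f0)
    for tree in trees:
        prediction += nu * tree.predict(X)
    return prediction
```

Note that the explicit per-leaf optimization of step 2.3 collapses into the tree's own leaf outputs only because the loss is SSR; for other losses (e.g. absolute error) each leaf's $\gamma_{jm}$ would have to be computed separately.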

Insights

  1. There is no reason to use a linear model as the base model of GB, because a linear combination of linear models is still a linear model (see the worked equation after this list).
  2. But we can use logistic regression as the base model of GB: a linear combination of sigmoids is a nonlinear function.
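
To make insight 1 concrete, write each linear base model as $h_m(x) = w_m^\top x + b_m$ (notation introduced here just for the illustration) and sum $M$ of them with learning rate $\nu$: \[ \begin{equation} \sum_{m=1}^{M} \nu\,(w_m^\top x + b_m) = \Big(\nu \sum_{m=1}^{M} w_m\Big)^{\!\top} x + \nu \sum_{m=1}^{M} b_m \end{equation} \] The result is a single linear model again, so boosting adds no expressive power. A sigmoid base model $\sigma(w_m^\top x + b_m)$ does not collapse this way, which is why a sum of sigmoids can represent nonlinear functions.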