- gradient boosting (GB) is an ensembling technique that gradually turns an approximate model into a good one
by adding weak models to a composite model
- GB itself does not specify how to choose the weak models, but in practice decision trees are almost always used
- GB adds one weak model at a time
- each new weak model is fit to the residuals (more precisely, the negative gradient of the loss) of the current composite model
Gradient boosting for Regression
We are given an $n$-element data set $\{(x_i, y_i)\}, i = 1 \dots n$ and a differentiable loss function $L(y_i, F(x_i))$.
$F_m$ is the (imperfect) composite model at stage $m$.
- Let $F_0(x)$ be the constant value $\gamma$ that minimizes the total loss $L(y_i, \gamma)$ across all observations:
\[
\begin{equation}
F_0(x) = \arg\min_{\gamma}\sum_{i=1}^{n} L(y_i, \gamma)
\end{equation}
\]
If our loss is the sum of squared residuals (SSR), then $\gamma$ is just the mean of $y_1, \dots, y_n$.
- For $m = 1 \dots M $:
- Compute the pseudo-residuals
\[
\begin{equation}
r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}, i = 1\dots n
\end{equation}
\]
where $r_{im}$ is the pseudo-residual of observation $i$ at iteration $m$
- Fit a regression tree to the pseudo-residuals $r_{im}$, giving terminal regions (leaves) $R_{jm}, j = 1 \dots J_m$,
where $j$ indexes the leaves and $J_m$ is the number of leaves of the $m$-th tree
- For $j = 1 \dots J_m$ compute
\[
\begin{equation}
\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)
\end{equation}
\]
This step is similar to the very first step: we just need to determine the output value for each leaf (with SSR loss, $\gamma_{jm}$ is simply the mean of the residuals that fall into leaf $j$)
- Update
\[
\begin{equation}
F_m(x) = F_{m-1}(x) + \nu\sum_{j=1}^{J_m} \gamma_{jm}I(x \in R_{jm})
\end{equation}
\]
where $0 <\nu \leq 1$ is the learning rate
- Output $F(x) = F_M(x)$ (a minimal code sketch of the whole procedure is given below).
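To make the algorithm concrete, here is a minimal sketch, assuming the (half) squared-error loss: the initial prediction $F_0$ is then the mean of the targets, the pseudo-residuals are ordinary residuals, and the leaf values $\gamma_{jm}$ coincide with the leaf means the tree already predicts. It uses scikit-learn's `DecisionTreeRegressor` as the weak learner; the class name `GBRegressionSketch` and the hyperparameter defaults are illustrative choices, not taken from the text above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class GBRegressionSketch:
    """Gradient boosting for regression with (half) squared loss.

    For L = 1/2 * (y - F)^2 the pseudo-residuals equal the ordinary
    residuals y - F_{m-1}(x), and the optimal leaf value gamma_jm is the
    mean of the residuals in that leaf -- exactly what a fitted
    regression tree predicts, so the per-leaf optimization step
    collapses into the tree fit itself.
    """

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators    # M, number of boosting rounds
        self.learning_rate = learning_rate  # nu, with 0 < nu <= 1
        self.max_depth = max_depth          # size of each weak tree
        self.trees = []

    def fit(self, X, y):
        # F_0(x): the constant minimizing the total squared loss -> mean of y
        self.f0 = float(np.mean(y))
        current = np.full(len(y), self.f0)
        for _ in range(self.n_estimators):
            residuals = y - current          # r_im = -(dL/dF) evaluated at F_{m-1}
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)           # terminal regions R_jm with leaf means gamma_jm
            # F_m(x) = F_{m-1}(x) + nu * gamma_jm for the leaf containing x
            current = current + self.learning_rate * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict(self, X):
        pred = np.full(X.shape[0], self.f0)
        for tree in self.trees:
            pred += self.learning_rate * tree.predict(X)
        return pred


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
    model = GBRegressionSketch(n_estimators=200, learning_rate=0.05).fit(X, y)
    print("train MSE:", np.mean((model.predict(X) - y) ** 2))
```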
Insights
- There is no reason to use a linear model as the base model of GB,
because a linear combination of linear models is still a linear model (see the check below).
- But we can use logistic regression as the base model of GB:
a linear combination of sigmoids is a nonlinear function.
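A quick sanity check of the first claim (the notation here is mine: each weak learner is $h_m(x) = w_m^{\top} x + b_m$, and $\gamma_0$ is the initial constant): the boosted sum of linear models collapses back into a single linear model,
\[
F_M(x) = \gamma_0 + \nu \sum_{m=1}^{M} \left( w_m^{\top} x + b_m \right)
= \left( \nu \sum_{m=1}^{M} w_m \right)^{\top} x + \left( \gamma_0 + \nu \sum_{m=1}^{M} b_m \right),
\]
which is again of the form $w^{\top} x + b$. By contrast, $\gamma_0 + \nu \sum_{m=1}^{M} \sigma(w_m^{\top} x + b_m)$ does not simplify this way (it has the shape of a one-hidden-layer network), so sigmoid base learners do add expressive power.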
Links