- gradient boosting (GB) is an ensembling technique that gradually turns an approximate model into a good one
by adding weak models to a composite model
- GB itself does not specify how to choose the weak models, but in practice decision trees are almost always used
- GB adds one weak model at a time
- each new weak model is fit to the residuals (more precisely, the negative gradient of the loss) of the current composite model
Gradient boosting for Regression
We are given an $n$-element data set $\{(x_i, y_i)\}, i = 1 \dots n$ and a differentiable loss function $L(y_i, F(x_i))$.
$F_m$ is the (imperfect) composite model at stage $m$.
- Let $F_0(x)$ be the constant value $\gamma$ that minimizes the total loss $L(y_i, \gamma)$ across all observations:
\[
\begin{equation}
F_0(x) = \arg\min_{\gamma}\sum_{i=1}^{n} L(y_i, \gamma)
\end{equation}
\]
If our loss is the sum of squared residuals (SSR), then $\gamma$ is just the mean of $y_1, \dots, y_n$.
- For $m = 1 \dots M $:
- Compute the pseudo-residuals
\[
\begin{equation}
r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}, i = 1\dots n
\end{equation}
\]
where $r_{im}$ is the pseudo-residual of observation $i$ at iteration $m$
- Fit a regression tree to the pseudo-residuals $r_{im}$, giving terminal regions (leaves) $R_{jm}, j = 1 \dots J_m$,
where $j$ indexes the leaves and $J_m$ is the number of leaves of the $m$-th tree
- For $j = 1 \dots J_m$ compute
\[
\begin{equation}
\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)
\end{equation}
\]
This step is similar to the very first step: we just need to determine the output value for each leaf (with SSR loss, $\gamma_{jm}$ is simply the mean of the residuals that fall into leaf $j$)
- Update
\[
\begin{equation}
F_m(x) = F_{m-1}(x) + \nu\sum_{j=1}^{J_m} \gamma_{jm}I(x \in R_{jm})
\end{equation}
\]
where $0 <\nu \leq 1$ is the learning rate
- Output $F(x) = F_M(x)$ (a minimal code sketch of the whole procedure is given below).
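To make the algorithm concrete, here is a minimal sketch, assuming the (half) squared-error loss: the initial prediction $F_0$ is then the mean of the targets, the pseudo-residuals are ordinary residuals, and the leaf values $\gamma_{jm}$ coincide with the leaf means the tree already predicts. It uses scikit-learn's `DecisionTreeRegressor` as the weak learner; the class name `GBRegressionSketch` and the hyperparameter defaults are illustrative choices, not taken from the text above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class GBRegressionSketch:
    """Gradient boosting for regression with (half) squared loss.

    For L = 1/2 * (y - F)^2 the pseudo-residuals equal the ordinary
    residuals y - F_{m-1}(x), and the optimal leaf value gamma_jm is the
    mean of the residuals in that leaf -- exactly what a fitted
    regression tree predicts, so the per-leaf optimization step
    collapses into the tree fit itself.
    """

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators    # M, number of boosting rounds
        self.learning_rate = learning_rate  # nu, with 0 < nu <= 1
        self.max_depth = max_depth          # size of each weak tree
        self.trees = []

    def fit(self, X, y):
        # F_0(x): the constant minimizing the total squared loss -> mean of y
        self.f0 = float(np.mean(y))
        current = np.full(len(y), self.f0)
        for _ in range(self.n_estimators):
            residuals = y - current          # r_im = -(dL/dF) evaluated at F_{m-1}
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)           # terminal regions R_jm with leaf means gamma_jm
            # F_m(x) = F_{m-1}(x) + nu * gamma_jm for the leaf containing x
            current = current + self.learning_rate * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict(self, X):
        pred = np.full(X.shape[0], self.f0)
        for tree in self.trees:
            pred += self.learning_rate * tree.predict(X)
        return pred


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
    model = GBRegressionSketch(n_estimators=200, learning_rate=0.05).fit(X, y)
    print("train MSE:", np.mean((model.predict(X) - y) ** 2))
```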
Insights
- There is no reason to use a linear model as the base model of GB,
because a linear combination of linear models is still a linear model (see the check below).
- But we can use logistic regression as the base model of GB:
a linear combination of sigmoids is a nonlinear function.
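A quick sanity check of the first claim (the notation here is mine: each weak learner is $h_m(x) = w_m^{\top} x + b_m$, and $\gamma_0$ is the initial constant): the boosted sum of linear models collapses back into a single linear model,
\[
F_M(x) = \gamma_0 + \nu \sum_{m=1}^{M} \left( w_m^{\top} x + b_m \right)
= \left( \nu \sum_{m=1}^{M} w_m \right)^{\top} x + \left( \gamma_0 + \nu \sum_{m=1}^{M} b_m \right),
\]
which is again of the form $w^{\top} x + b$. By contrast, $\gamma_0 + \nu \sum_{m=1}^{M} \sigma(w_m^{\top} x + b_m)$ does not simplify this way (it has the shape of a one-hidden-layer network), so sigmoid base learners do add expressive power.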
Links