In simple words, Random Forest uses bagging to build a collection of decision trees and then aggregates their predictions.
Let us denote by $B$ the number of trees we are going to create.
$p$ is the number of features.
Random Forest algorithm:
- For $b = 1$ to $b = B$:
- create a bootstrapped data set $Z_b$ from the training data
- grow a decision tree on this new data set by recursively repeating the following steps
for each terminal node, until the minimum node size is reached:
- Randomly select $m$ variables from the $p$ variables
- Pick the best variable and split point among the $m$
- Split the node into two child nodes
- output the ensemble of trees $\{T_1, T_2, \dots, T_B\}$
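A minimal sketch of this training loop in Python, assuming scikit-learn's DecisionTreeClassifier takes care of the per-split random choice of $m$ variables through its max_features parameter (function and variable names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, B=100, m="sqrt", random_state=0):
    """Grow B trees, each on its own bootstrap sample, considering m variables per split."""
    rng = np.random.RandomState(random_state)
    n = X.shape[0]
    trees = []
    for b in range(B):
        # bootstrapped data set Z_b: draw n rows from the training data with replacement
        idx = rng.randint(0, n, size=n)
        # max_features=m makes the tree pick the best split among m randomly chosen variables
        tree = DecisionTreeClassifier(max_features=m, random_state=rng.randint(2**31 - 1))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees  # the ensemble {T_1, ..., T_B}
```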
Prediction for Regression trees:
\[
f(x) = \frac{1}{B}\sum_{b=1}^{B}T_b(x)
\]
Prediction for Classification trees:
\[
C(x) = \text{majority vote}\left(\{C_b(x)\}_{b=1}^{B}\right)
\]
where $C_b(x)$ is the class predicted by the $b$-th tree.
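A hedged sketch of both prediction rules for an ensemble like the one returned above:

```python
import numpy as np

def predict_regression(trees, X):
    # average of the B individual tree predictions
    return np.mean([t.predict(X) for t in trees], axis=0)

def predict_classification(trees, X):
    # majority vote over the B individual class predictions
    votes = np.array([t.predict(X) for t in trees])   # shape (B, n_samples)
    winners = []
    for column in votes.T:
        classes, counts = np.unique(column, return_counts=True)
        winners.append(classes[np.argmax(counts)])
    return np.array(winners)
```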
Out-of-bag error
Since each bootstrapped data set is sampled with replacement (and so contains duplicate entries), some samples
are left out of it. For each tree, these left-out samples form its out-of-bag set.
To estimate the out-of-bag error we compute the average prediction error over all
samples $x_i$, using for each $x_i$ only the trees that did not have it in their training data.
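scikit-learn reports this estimate directly; a minimal sketch, with the iris data set used only as a placeholder:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
# oob_score_ is the accuracy computed, for each sample, only from trees that never saw it
print("OOB accuracy:", rf.oob_score_, "OOB error:", 1 - rf.oob_score_)
```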
Typical parameter values
For classification the typical value is $m = \lfloor\sqrt p \rfloor$ and the minimum node size is $1$.
For regression the typical value is $m = \lfloor\frac{p}{3} \rfloor$ and the minimum node size is $5$.
Since we know how to estimate the accuracy of a Random Forest, we can use this to find the optimal parameter $m$:
simply compare the accuracies (e.g., the out-of-bag errors) of forests trained with different values of $m$, as in the sketch below.
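A minimal sketch of that comparison, assuming max_features plays the role of $m$ and the out-of-bag score is used as the accuracy estimate:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
p = X.shape[1]
for m in range(1, p + 1):
    rf = RandomForestClassifier(n_estimators=300, max_features=m,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    print(f"m = {m}: OOB error = {1 - rf.oob_score_:.3f}")
```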
Missing data in the training dataset
Probably the easiest way to fill in missing values is to use the median, mean, or mode of the corresponding feature.
That is a reasonable initial guess, and RF lets us iteratively refine it by determining which samples are
similar to the one with missing data. The procedure:
- First we need to fill in our data table with initial guesses
- We repeatedly do the following until the missing values converge:
- Build a Random Forest
- Run all of the samples down all of the trees
- Fill in the proximity matrix. The proximity matrix is an $M \times M$ matrix $P$ (where $M$ is the number of samples) with a row and a column for each sample.
Each cell contains a similarity measure: two samples are considered similar if running them down a tree ends up in the same terminal node.
Whenever the $i$-th and $j$-th samples are similar, $P[i,j]$ is incremented. Finally we divide each proximity value by the total number of
trees in the RF.
- Recalculate the missing values using the proximity matrix (as a weighted average for numeric features, or a weighted frequency for categorical ones); see the sketch after this list
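A rough sketch of the proximity computation, assuming scikit-learn's apply() method, which returns the leaf index of every sample in every tree; the refilling step itself is only outlined in the comments:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def proximity_matrix(rf, X):
    """P[i, j] = fraction of trees in which samples i and j end up in the same leaf."""
    leaves = rf.apply(X)                         # shape (n_samples, n_trees)
    n_trees = leaves.shape[1]
    P = np.zeros((X.shape[0], X.shape[0]))
    for b in range(n_trees):
        # increment P[i, j] whenever samples i and j share a leaf in tree b
        P += leaves[:, b][:, None] == leaves[:, b][None, :]
    return P / n_trees

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
P = proximity_matrix(rf, X)

# imputation sketch: a missing numeric value of sample i would be refilled as the
# proximity-weighted average of the other samples' values,
#   X[i, f] = sum_j P[i, j] * X[j, f] / sum_j P[i, j],
# then the forest is rebuilt and the step repeated until the values stop changing.
```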
Variable importance
- The naive approach is to simply count the number of times each variable is selected across all individual trees
- Mean Decrease in Impurity: the improvement in the splitting criterion produced by each variable, summed over all trees
- Permutation feature importance: the estimate is based on the decrease in the model score when a single feature
variable is randomly shuffled
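Both of the non-naive measures are exposed by scikit-learn; a short sketch:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean Decrease in Impurity: impurity improvement attributed to each feature over all trees
print("MDI importances:", rf.feature_importances_)

# Permutation importance: drop in score when a single feature column is shuffled
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print("Permutation importances:", result.importances_mean)
```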
Key features
- there is no need for a separate validation data set to estimate the quality of the trained model (the out-of-bag error serves this purpose)
- additional use case: RF can be used to identify the most informative features
Links
StatQuest: Random Forests Part 1 - Building, Using and Evaluating
Sklearn. Permutation feature importance