Random Forests
Given {$n$} observations with {$p$} predictors.
Input:
- {$m \ll p$}, the number of predictors to sample at each split (often {$m = \sqrt{p}$})
- {$f$}, the fraction of the data to sample for each tree's training set
- {$k$}, the number of trees in the forest.
Repeat {$k$} times:
- Choose a training set by sampling {$f \cdot n$} training cases with replacement (this is called bagging, short for bootstrap aggregating)
- Build a decision tree as follows
- For each node of the tree, randomly choose {$m$} variables and find the best split from among those {$m$} variables
- Repeat until the full tree is built (no pruning)
To predict, take the modal classification ('majority vote') over all the trees for classification, or the average of the trees' predicted values for regression.
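A minimal sketch of the procedure above, assuming scikit-learn's DecisionTreeClassifier (its max_features option does the per-node sampling of {$m$} predictors) and NumPy for the bootstrap sampling; the variable names mirror the notation in these notes.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, k=100, f=1.0, m="sqrt", seed=None):
    """Build k unpruned trees, each on a bootstrap sample of f*n cases."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(k):
        # bagging: sample f*n training cases with replacement
        idx = rng.integers(0, n, size=int(f * n))
        # each split considers only m randomly chosen predictors; no pruning
        tree = DecisionTreeClassifier(max_features=m,
                                      random_state=int(rng.integers(0, 2**31)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Majority vote over the individual trees' classifications."""
    votes = np.array([t.predict(X) for t in trees])   # shape (k, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```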
See also Wikipedia.
Variation: Extremely Randomized Trees (Extra-Trees)
“As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule.”
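A rough sketch of the split rule described in the quote, under the assumption that split quality is measured by Gini impurity; the helper names here (gini, extra_trees_split) are illustrative, not any particular library's API.

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def extra_trees_split(X, y, candidate_features, rng):
    """Pick the best of one randomly drawn threshold per candidate feature."""
    best = None
    for j in candidate_features:
        lo, hi = X[:, j].min(), X[:, j].max()
        t = rng.uniform(lo, hi)              # threshold drawn at random
        left = X[:, j] <= t
        if left.all() or not left.any():
            continue                         # degenerate split, skip
        # weighted impurity of the two children; lower is better
        score = (left.sum() * gini(y[left]) +
                 (~left).sum() * gini(y[~left])) / len(y)
        if best is None or score < best[0]:
            best = (score, j, t)
    return best                              # (impurity, feature index, threshold)
```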