# Decision Trees

After the Nearest Neighbor approach to classification/regression, perhaps the second most intuitive model is the decision tree. Below is an example of a two-level decision tree for classification of 2D data. Given an input x, the classifier starts at the root and follows the branch whose condition x satisfies, repeating until it reaches a leaf, which specifies the prediction. Here's a more complicated example courtesy of The Slate. Here are decision trees (forests, actually) in action!

## An example with data

Consider trying to predict Miles per Gallon (MPG) for a set of cars using the following dataset.

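| mpg | cylinders | displacement | horsepower | weight | acceleration | model year | maker |
|---|---|---|---|---|---|---|---|
| OK | 4 | 121 | 98 | 2945 | 14.5 | 75 | Asia |
| OK | 6 | 232 | 90 | 3085 | 17.6 | 76 | America |
| OK | 4 | 120 | 97 | 2506 | 14.5 | 72 | Europe |
| OK | 4 | 151 | 85 | 2855 | 17.6 | 78 | America |
| OK | 4 | 116 | 75 | 2158 | 15.5 | 73 | Asia |
| OK | 4 | 119 | 97 | 2545 | 17 | 75 | Europe |
| OK | 6 | 146 | 120 | 2930 | 13.8 | 81 | Europe |
| OK | 4 | 116 | 81 | 2220 | 16.9 | 76 | Asia |
| OK | 4 | 156 | 92 | 2620 | 14.4 | 81 | America |
| OK | 4 | 140 | 88 | 2870 | 18.1 | 80 | America |
| OK | 4 | 97 | 60 | 1834 | 19 | 71 | Asia |
| OK | 4 | 134 | 95 | 2560 | 14.2 | 78 | Europe |
| OK | 4 | 97 | 75 | 2171 | 16 | 75 | Europe |
| OK | 4 | 97 | 78 | 1940 | 14.5 | 77 | Asia |
| OK | 4 | 98 | 83 | 2219 | 16.5 | 74 | Asia |
| Good | 4 | 79 | 70 | 2074 | 19.5 | 71 | Asia |
| Good | 4 | 91 | 68 | 1970 | 17.6 | 82 | Europe |
| Good | 4 | 89 | 71 | 1925 | 14 | 79 | Asia |
| Good | 4 | 83 | 61 | 2003 | 19 | 74 | Europe |
| Good | 4 | 112 | 88 | 2395 | 18 | 82 | America |
| Good | 4 | 81 | 60 | 1760 | 16.1 | 81 | Europe |
| Good | 4 | 135 | 84 | 2370 | 13 | 82 | America |
| Good | 4 | 105 | 63 | 2125 | 14.7 | 82 | America |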

It's useful to consider the decision tree as a way to organize this data. Let's first split the dataset according to how heavy the car is:

The root node represents the entire dataset, which has 19 bad, 15 OK, and 8 good examples (note that this is a subset of a much larger dataset that we also supply). As we consider each subcase at the leaves, the distribution of the subset of the data of bad/OK/good changes. Several leaves are "pure" -- they are all bad, OK, or good. For the mixed ones, we can further split them along different attributes as follows:

This might still not be enough, so let's continue until all leaves are pure. Note that this might not actually be possible, as you can see in the un-splittable leaves below. In the un-splittable case, we predict whatever the majority of the outcomes is. In the un-splittable leaves below, 1 out of 2 of the examples was "Bad", so the vote is tied; we could have chosen the other result as well, but the code that generated the tree happened to choose "Bad" both times.

Now we can use this organization of the data as a classifier. Given a new car with a set of attributes, we follow the splits from the root down to the leaf and return the majority label at that leaf as our prediction. In general, a decision tree maps an input {$\textbf{x}$} to a leaf of the tree {$leaf(\textbf{x})$} by following the path determined by the splits on individual features down to the leaf, where a distribution {$P(y\mid leaf(\textbf{x}))$} or simply the decision {$y(leaf(\textbf{x}))$} is defined.
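The mapping from an input to a leaf can be sketched in a few lines of Python. The node layout below (`Split`, `Leaf`, the `<=` convention) and the thresholds are assumptions made for illustration; they are not taken from the trees pictured in these notes.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # the decision y(leaf(x)) stored at this leaf


@dataclass
class Split:
    feature: str                 # name of the feature tested at this node
    threshold: float             # hypothetical numeric threshold
    left: Union["Split", Leaf]   # followed when x[feature] <= threshold
    right: Union["Split", Leaf]  # followed when x[feature] > threshold


def predict(node, x):
    """Follow the splits from the root down to a leaf and return its label."""
    while isinstance(node, Split):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label


# Illustrative tree; the thresholds are made up, not learned from the data.
tree = Split("weight", 2500.0,
             Split("horsepower", 70.0, Leaf("Good"), Leaf("OK")),
             Leaf("Bad"))

print(predict(tree, {"weight": 2074, "horsepower": 70}))
```

Prediction cost is proportional to the depth of the tree, since each step inspects a single feature of the input.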

## Learning decision trees

There are many possible trees we can use to organize (i.e., classify) our data. It is also possible to get the same classifier with two very different trees. If we have a lot of features, trees can get very complex. Intuitively, the more complex the tree, the more complex and high-variance our classification boundary. Just like in linear models, we would like to control this complexity.

Let's assume for simplicity that there exists a tree that splits the data perfectly (all leaves are pure). A tree that splits the data with all pure leaves is called consistent with the data. This is always possible when no two samples {$(\textbf{x},y),(\textbf{x}',y')$} have different outcomes {$y\ne y'$} but identical features {$\textbf{x}=\textbf{x}'$}. Suppose we want to find the smallest (shortest) consistent tree. It turns out that this problem is NP-hard (Hyafil & Rivest, 76). We resort instead to a greedy heuristic algorithm:

Greedy Decision Tree Learning:

• Start from empty decision tree
• Split on next best feature (we'll define best below)
• Recurse on each leaf

## Choosing what feature to split on

In order to choose what feature is best to split on for the above algorithm, we need to quantify how predictive a feature is for our outcome at the current node in the tree (which corresponds to the appropriate subset of the data). A standard measure of how much information a feature carries about the outcome is called information gain, and it's based on the notion of entropy. Recall that entropy of a discrete distribution {$P(Y)$} is a measure of uncertainty defined as:

{$\textbf{Entropy:}\;\; H(Y) = - \sum_y P(Y=y)\log_2 P(Y=y)$}.

A note about the notation: we write {$H(Y)$} when {$P(Y)$} is understood and unambiguous from the context. When it isn't, we will also need to specify which distribution on {$Y$} we mean.

For binary {$Y$}, the entropy is a function of {$p = P(Y=1)$}, {$H(Y) = - p\log_2 p - (1 - p)\log_2 (1-p)$} so we can plot it:

Note that entropy is zero when p = 0 or p = 1, and 1 when p = 0.5. One interpretation of entropy is the expected number of bits needed to encode Y, or the expected number of yes/no questions needed to guess Y. E.g., an unbiased coin requires 1 bit, an eight-sided die requires 3 bits, etc. However, if a coin is biased to always come up heads, it requires no bits to encode; regardless of what you do, it will always come up heads, so you don't need to encode anything.
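These values are easy to check numerically. The helper below is a minimal sketch of the entropy definition; the function name `entropy` and the list-of-probabilities input are choices made here for illustration.

```python
import math


def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with p == 0 are skipped: p * log2(p) tends to 0 as p -> 0.
    return sum(-p * math.log2(p) for p in probs if p > 0)


fair_coin = entropy([0.5, 0.5])          # unbiased coin
eight_sided_die = entropy([1 / 8] * 8)   # fair eight-sided die
always_heads = entropy([1.0, 0.0])       # coin that always comes up heads

print(fair_coin, eight_sided_die, always_heads)
```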

Here is a heat map showing entropy of distributions over a ternary variable {$Y$}.

Each point in the triangle corresponds to a distribution {$(p_1,p_2,p_3)$}, where {$p_1+p_2+p_3=1$}. The corners correspond to the distributions that put all the weight on one of 3 outcomes (the entropy is zero there) and the center corresponds to the uniform distribution {$(\frac{1}{3},\frac{1}{3},\frac{1}{3})$} which achieves the entropy of {$1.585$}.
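As a quick check of the value quoted for the center of the triangle, the three outcomes of the uniform distribution contribute equally to the sum:

{$H(Y) = -3\cdot\frac{1}{3}\log_2\frac{1}{3} = \log_2 3 \approx 1.585$}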

In order to quantify the predictiveness of a feature X for Y, we consider the conditional entropy: the expected number of bits needed to encode Y, or questions needed to guess Y, once X is known. The conditional entropy of Y given X, H(Y|X), again assuming Y and X are discrete, is defined as:

{$\textbf{Conditional Entropy:}\;\; \begin{aligned} H(Y|X) & = \sum_x P(X=x)H(Y|X=x) \\ & = -\sum_x P(X=x)\sum_y P(Y=y|X=x)\log_2 P(Y=y|X=x) \\ & = -\sum_{x,y} P(X=x,Y=y) \log_2 P(Y=y|X=x) \end{aligned}$}

So to measure the reduction in entropy of Y from knowing X, we use information gain:

{${\rm \textbf{Information Gain}:}\;\;\; IG(X) = H(Y)-H(Y|X)$}
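Conditional entropy and information gain can be estimated from paired samples by replacing probabilities with empirical frequencies. The sketch below assumes discrete features and labels; the function names and the six-row toy dataset are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy_of_labels(labels):
    """Empirical entropy H(Y) in bits of a list of labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(xs, ys):
    """Empirical H(Y|X) = sum_x P(X=x) H(Y|X=x) from paired samples."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n = len(ys)
    return sum(len(g) / n * entropy_of_labels(g) for g in groups.values())


def information_gain(xs, ys):
    """IG(X) = H(Y) - H(Y|X)."""
    return entropy_of_labels(ys) - conditional_entropy(xs, ys)


# Toy data (made up): two of the three makers determine the label exactly.
maker = ["Asia", "Asia", "Europe", "Europe", "America", "America"]
mpg = ["Good", "Good", "OK", "Good", "OK", "OK"]

print(information_gain(maker, mpg))
```

Here {$H(Y)$} is 1 bit, only the Europe group remains uncertain, so {$H(Y|X) = \frac{1}{3}$} and the gain is {$\frac{2}{3}$} of a bit.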

For continuous variables, we can create splits by discretizing via introducing a thresholded feature {$X' = \mathbf{1}(X>\alpha)$} for some {$\alpha$} (this is not the only way, but perhaps the most common).
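One way to pick {$\alpha$} is to try a finite set of candidate thresholds and keep the one with the highest information gain. The sketch below assumes the candidates are midpoints between consecutive distinct values, which is a common choice but not one stated in these notes; the weights are taken from a few OK and Good rows of the dataset above.

```python
import math
from collections import Counter


def entropy_of_labels(labels):
    """Empirical entropy in bits of a list of labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (alpha, gain) maximizing the information gain of 1(X > alpha)."""
    n = len(labels)
    base = entropy_of_labels(labels)
    distinct = sorted(set(values))
    best_alpha, best_gain = None, 0.0
    # Assumed candidate set: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        alpha = (lo + hi) / 2
        below = [y for v, y in zip(values, labels) if v <= alpha]
        above = [y for v, y in zip(values, labels) if v > alpha]
        cond = (len(below) * entropy_of_labels(below)
                + len(above) * entropy_of_labels(above)) / n
        if base - cond > best_gain:
            best_alpha, best_gain = alpha, base - cond
    return best_alpha, best_gain


weight = [2945, 3085, 2506, 2855, 2074, 1970, 1925, 2003]
mpg = ["OK", "OK", "OK", "OK", "Good", "Good", "Good", "Good"]

alpha, gain = best_threshold(weight, mpg)
print(alpha, gain)
```

On this small subset the two classes are separated by weight, so the best threshold falls in the gap between the heaviest Good car and the lightest OK car.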

So our algorithm from above can be fleshed out now as:

Greedy Decision Tree Learning (based on the ID3 algorithm by Quinlan, 1986):

• Start from empty decision tree
• Case 1: If all records in current data subset have the same output {$y$} then return leaf with decision {$y$}
• Case 2: If all records have exactly the same set of input attributes then return leaf with majority decision {$y$}
• Case 3 (Recursion): Split dataset on feature with best information gain and recurse on each branch with appropriate subset of the dataset
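The three cases above translate fairly directly into a recursive function. This is a minimal sketch for discrete features, not Quinlan's original implementation: the dict-based tree representation, the `default` fallback for unseen feature values, and the restriction to features that still vary in the current subset are all assumptions made here. The toy rows are made up.

```python
import math
from collections import Counter


def entropy_of_labels(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """H(Y) - H(Y | feature) on the current subset of the data."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feature], []).append(y)
    n = len(labels)
    cond = sum(len(g) / n * entropy_of_labels(g) for g in groups.values())
    return entropy_of_labels(labels) - cond


def build_tree(rows, labels):
    """rows: list of dicts mapping feature name -> discrete value."""
    majority = Counter(labels).most_common(1)[0][0]
    # Case 1: all records have the same output.
    if len(set(labels)) == 1:
        return labels[0]
    # Case 2: all records have identical inputs, so no split can separate them.
    if all(row == rows[0] for row in rows):
        return majority
    # Case 3: split on the feature with the best information gain and recurse.
    # Only features that still vary here are candidates, so every branch is
    # strictly smaller and the recursion terminates even if the best gain is 0.
    candidates = [f for f in rows[0] if len({row[f] for row in rows}) > 1]
    best = max(candidates, key=lambda f: information_gain(rows, labels, f))
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        branches[value] = build_tree([r for r, _ in subset],
                                     [y for _, y in subset])
    return {"feature": best, "branches": branches, "default": majority}


def predict(tree, row):
    """Internal nodes are dicts; anything else is a leaf label."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["feature"]], tree["default"])
    return tree


# Toy rows (made up for illustration).
rows = [{"cylinders": 4, "maker": "Asia"}, {"cylinders": 4, "maker": "Europe"},
        {"cylinders": 4, "maker": "America"}, {"cylinders": 6, "maker": "America"},
        {"cylinders": 8, "maker": "America"}]
labels = ["Good", "Good", "OK", "Bad", "Bad"]

tree = build_tree(rows, labels)
print([predict(tree, r) for r in rows])
```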

What if the best information gain is zero? Should we stop? Consider the XOR function below.

| y | {$\mathbf{x_1}$} | {$\mathbf{x_2}$} |
|---|---|---|
| 1 | 1 | 0 |
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 0 | 1 | 1 |

which has the following graphical representation:

The information gain from either feature is zero, but the correct tree is:

The problem is that the information gain measure is myopic, since it only considers one variable at a time, so we cannot stop even if best IG=0.
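The zero gain is easy to verify on the four XOR rows. The sketch below uses a hypothetical `information_gain` helper written for this check, and also computes the gain of the pair {$(x_1, x_2)$} treated as a single feature, which is what a two-level tree effectively uses.

```python
import math
from collections import Counter, defaultdict


def entropy_of_labels(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(xs, ys):
    """Empirical H(Y) - H(Y|X) from paired samples."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n = len(ys)
    cond = sum(len(g) / n * entropy_of_labels(g) for g in groups.values())
    return entropy_of_labels(ys) - cond


# The four XOR rows from the table above.
x1 = [1, 0, 0, 1]
x2 = [0, 0, 1, 1]
y = [1, 0, 1, 0]

gain_x1 = information_gain(x1, y)                   # one variable at a time
gain_x2 = information_gain(x2, y)
gain_both = information_gain(list(zip(x1, x2)), y)  # both variables jointly

print(gain_x1, gain_x2, gain_both)
```

Each feature alone leaves both branches split 50/50, while the two together determine y exactly.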

## Sample code

We will use the scikit-learn (sklearn) decision tree package. If anyone wants MATLAB code, it is here.
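As a minimal sketch of what using the package looks like, the snippet below fits `DecisionTreeClassifier` with the entropy criterion on eight rows (weight and horsepower only) taken from the OK and Good cars in the dataset above. The choice of features, the query point, and `random_state=0` are assumptions made for illustration, not the course's own sample code.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# [weight, horsepower] for four OK and four Good cars from the dataset above.
X = [[2945, 98], [3085, 90], [2506, 97], [2855, 85],
     [2074, 70], [1970, 68], [1925, 71], [2003, 61]]
y = ["OK", "OK", "OK", "OK", "Good", "Good", "Good", "Good"]

# criterion="entropy" selects splits by information gain.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Text rendering of the learned splits.
print(export_text(clf, feature_names=["weight", "horsepower"]))

# Classify a hypothetical new car.
print(clf.predict([[2100, 65]]))
```

Note that scikit-learn trees use binary threshold splits on numeric features, so categorical attributes such as maker would need to be encoded numerically first.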

## Overfitting

Learning the shortest tree consistent with the data is one way to help avoid overfitting, but it may not be sufficient when the outcome labels are noisy (have errors). The lower we are in the tree, the less data we're using to make the decision (since we have filtered out all the examples that do not match the tests in the splits above) and the more likely we are to be trying to model noise. One simple and practical way to control data sparsity is to limit the maximum depth or maximum number of leaves. Another, more refined method is to decide for each split whether it is justified by the data. There are several methods for doing this, often called pruning methods, with the chi-squared independence test being the most popular. Nilsson's chapter talks about several pruning methods, but we will not discuss these in detail.
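To give a flavor of the chi-squared idea, the sketch below computes Pearson's chi-squared statistic for the independence of a candidate split and the labels at a node. It is only an illustration under assumptions: the function name, the toy data, and the rule of comparing against the 5% critical value are choices made here, not a specific pruning procedure from Nilsson's chapter.

```python
from collections import Counter


def chi_squared_statistic(xs, ys):
    """Pearson chi-squared statistic for independence of split value X and label Y."""
    n = len(ys)
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    observed = Counter(zip(xs, ys))
    stat = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n  # expected count if X and Y were independent
            stat += (observed[(x, y)] - expected) ** 2 / expected
    dof = (len(x_counts) - 1) * (len(y_counts) - 1)
    return stat, dof


# 5% critical value of the chi-squared distribution with 1 degree of freedom.
CRITICAL_5PCT_DOF1 = 3.841

# Made-up node with 20 examples and a binary split.
split = ["light"] * 10 + ["heavy"] * 10
informative = ["Good"] * 10 + ["Bad"] * 10   # split separates the labels
uninformative = ["Good", "Bad"] * 10         # labels unrelated to the split

stat_inf, dof_inf = chi_squared_statistic(split, informative)
stat_unf, dof_unf = chi_squared_statistic(split, uninformative)

print(stat_inf > CRITICAL_5PCT_DOF1)   # evidence that the split is justified
print(stat_unf > CRITICAL_5PCT_DOF1)   # no evidence; candidate for pruning
```

The chi-squared approximation becomes unreliable when expected counts are very small, which is exactly the regime deep in a tree, so thresholds like this should be treated with care.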

## Comparison to Nearest Neighbor

So why bother with decision trees at all when we have nearest neighbors? The answers are speed and size. When the dataset is large and has many dimensions, it can take a long time to decide which neighbor is closest (naively, linear time in the size of the training data, or faster using approximate methods). Although there are many methods for speeding up this search, they are usually a lot slower than following a tree from root to leaf. This makes decision trees very attractive for large datasets. In addition to running time, nearest neighbor methods require storing the training data, which can be prohibitive for embedded systems. Another advantage of decision trees is their relative robustness to noisy or irrelevant features (assuming pruning or depth constraints).