VariableImportance
Variable importance measures can either be based entirely on input/output behavior (e.g., the effect on prediction accuracy of removing the feature) or be specific to the model in question.
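As a minimal sketch of the input/output approach, the following refits a model with each feature removed and records how much the held-out error grows. The least-squares model, the synthetic data, and the names `fit_predict` and `drop_feature_importance` are all assumptions made for illustration; any model with a fit/predict step could be substituted.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept; returns predictions for X_test."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def drop_feature_importance(X_train, y_train, X_test, y_test):
    """Increase in held-out MSE when each feature is removed and the model refit."""
    base = np.mean((y_test - fit_predict(X_train, y_train, X_test)) ** 2)
    scores = []
    for j in range(X_train.shape[1]):
        keep = [k for k in range(X_train.shape[1]) if k != j]
        pred = fit_predict(X_train[:, keep], y_train, X_test[:, keep])
        scores.append(np.mean((y_test - pred) ** 2) - base)
    return np.array(scores)

# Synthetic data (an assumption for illustration): y depends strongly on
# feature 0, weakly on feature 1, and not at all on feature 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)
drop_scores = drop_feature_importance(X[:300], y[:300], X[300:], y[300:])
print(drop_scores)
```

Because the model is refit for every feature, this costs one extra fit per feature, and correlated features can mask each other: removing one of two near-duplicates barely changes the error.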
Linear Models
Importance is usually a scaled measure of the regression coefficients, e.g.:
- Standardize x and y; then look at coefficient size
- The absolute value of the t-statistic for each coefficient
- The univariate correlation of each feature with y.
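The three measures in the list above can be sketched with NumPy alone. The synthetic data and all variable names are assumptions for illustration; the features are deliberately put on very different scales, so raw coefficients would not be comparable.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Three features on very different scales (hypothetical data).
X = rng.normal(size=(n, 3)) * np.array([1.0, 10.0, 0.1])
y = 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=n)

# 1. Standardize x and y, then compare coefficient sizes.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()
beta_std, *_ = np.linalg.lstsq(Xs, ys, rcond=None)  # centered, so no intercept

# 2. Absolute t-statistic of each coefficient in the unstandardized fit.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
sigma2 = resid @ resid / (n - A.shape[1])   # unbiased estimate of noise variance
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
t_abs = np.abs(beta / se)[1:]               # drop the intercept

# 3. Univariate correlation of each feature with y.
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

print(beta_std, t_abs, corr)
```

The first two measures account for the other features in the model, while the univariate correlation looks at each feature in isolation, so the rankings can disagree when features are correlated.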
Decision Trees
From the CART package: "To calculate a variable importance score, CART looks at the improvement measure attributable to each variable in its role as either a primary or a surrogate splitter. The values of ALL these improvements are summed over each node and totaled, and are then scaled relative to the best performing variable. The variable with the highest sum of improvements is scored 100, and all other variables will have lower scores ranging downwards toward zero. A variable can obtain an importance score of zero in CART only if it never appears as either a primary or a surrogate splitter. Because such a variable plays no role anywhere in the tree, eliminating it from the data set should make no difference to the results."
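The sum-and-rescale step described in the quote can be sketched as follows. The function name `cart_style_importance` and the improvement values are hypothetical; the actual CART software may weight surrogate improvements differently, and this sketch assumes at least one improvement is positive.

```python
from collections import defaultdict

def cart_style_importance(node_improvements):
    """Sum improvements per variable and rescale so the top variable scores 100.

    node_improvements: iterable of (variable_name, improvement) pairs, one per
    role a variable plays at a node (primary or surrogate splitter).
    """
    totals = defaultdict(float)
    for variable, improvement in node_improvements:
        totals[variable] += improvement
    best = max(totals.values())
    return {v: 100.0 * total / best for v, total in totals.items()}

# Hypothetical improvements for a three-node tree (made-up numbers).
improvements = [
    ("age", 0.30), ("income", 0.12),   # node 1: primary, surrogate
    ("income", 0.18), ("age", 0.05),   # node 2: primary, surrogate
    ("tenure", 0.07),                  # node 3: primary only
]
scores = cart_style_importance(improvements)
print(scores)
```

A variable that never appears in `improvements` simply has no entry, which corresponds to the zero score mentioned in the quote.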
Random Forest and XGBoost
From The Elements of Statistical Learning, p. 368: At each such node {$t$}, one of the input variables {$X_{v(t)}$} is used to partition the region associated with that node into two subregions; within each a separate constant is fit to the response values. The particular variable chosen is the one that gives maximal estimated improvement {$\hat{\imath}_t^2$} in … risk over that for a constant fit over the entire region. The squared relative importance of variable {$X_\ell$} is the sum of such squared improvements over all internal nodes for which it was chosen as the splitting variable.
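In symbols, the passage can be sketched as below. Here {$T$} is a single tree with {$J-1$} internal nodes, {$v(t)$} is the index of the variable split on at node {$t$}, {$\hat{\imath}_t^2$} is the squared improvement at that node, and {$I(\cdot)$} is the indicator function. The second expression, the average over {$M$} trees {$T_1, \dots, T_M$} of an ensemble, is how the per-tree measure is usually extended to boosted or bagged trees; the notation is a reconstruction, not a verbatim copy of the book.

```latex
\mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^{\,2} \, I\bigl(v(t) = \ell\bigr),
\qquad
\mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m)
```

Since these are relative measures, a common convention is to report the square roots scaled so that the largest value is 100.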
From the R package: “For each tree, the prediction accuracy on the out-of-bag portion of the data is recorded. Then the same is done after permuting each predictor variable. The difference between the two accuracies is then averaged over all trees, and normalized by the standard error. For regression, the MSE is computed on the out-of-bag data for each tree, and then the same computed after permuting a variable. The differences are averaged and normalized by the standard error. If the standard error is equal to 0 for a variable, the division is not done.”
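A simplified sketch of the permutation idea follows. It is not the R package's procedure: a single least-squares model stands in for the forest, a held-out set stands in for the out-of-bag data, and repeated permutations stand in for the per-tree differences. The names `permutation_importance` and `predict` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE after permuting each column, raw and normalized."""
    rng = np.random.default_rng(seed)
    base = np.mean((y - predict(X)) ** 2)
    raw = np.zeros(X.shape[1])
    scaled = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        diffs = np.empty(n_repeats)
        for r in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the link between feature j and y
            diffs[r] = np.mean((y - predict(Xp)) ** 2) - base
        raw[j] = diffs.mean()
        se = diffs.std(ddof=1) / np.sqrt(n_repeats)
        # As in the quoted description, skip the division when the standard error is 0.
        scaled[j] = raw[j] / se if se > 0 else raw[j]
    return raw, scaled

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=600)
X_fit, y_fit, X_held, y_held = X[:400], y[:400], X[400:], y[400:]

# Stand-in model: least squares with an intercept, fit once and then held fixed.
A = np.column_stack([np.ones(len(X_fit)), X_fit])
coef, *_ = np.linalg.lstsq(A, y_fit, rcond=None)

def predict(Z):
    return coef[0] + Z @ coef[1:]

raw_imp, scaled_imp = permutation_importance(predict, X_held, y_held)
print(raw_imp, scaled_imp)
```

Unlike removing a feature and refitting, permutation reuses the already-fitted model, so it needs no retraining.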
"A second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares."
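The impurity decrease for a single classification split can be sketched as below. The helper names `gini` and `gini_decrease` and the toy data are assumptions for illustration; a forest would add up such decreases for every node that splits on a given variable and then average over trees.

```python
import numpy as np

def gini(labels):
    """Gini index of a set of class labels: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(labels, go_left):
    """Impurity decrease from splitting `labels` by the boolean mask `go_left`."""
    n = len(labels)
    left, right = labels[go_left], labels[~go_left]
    children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return gini(labels) - children

labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
perfect = gini_decrease(labels, x < 4.5)      # separates the two classes completely
useless = gini_decrease(labels, x % 2 == 1)   # odd vs. even values, no separation
print(perfect, useless)
```

For regression, the same pattern applies with the residual sum of squares around the node mean in place of the Gini index.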
Inputs to Neural Nets
A common way to interpret neural nets is to determine which input maximizes the output of a given neuron (node). This can be done either by:
- Trying each of the training data points to see which one maximizes the output of the neuron. (e.g., which image in the training set does the neuron most respond to?)
- Using gradient ascent over the input space to find which of all possible inputs maximizes the output. Note that this is not gradient descent on the weights, which are held fixed, but optimization over the inputs; i.e., the gradient is {$d Output/d{\bf x}$}, not {$d Error/d{\bf w}$}.
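Both options can be sketched for a single tanh unit with fixed weights. The weights, the data, and the unit-norm constraint on the input are assumptions for illustration; some constraint or regularizer is needed in practice, because otherwise the maximizing input grows without bound.

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.normal(size=5)   # fixed, already-trained weights (hypothetical values)
b = 0.1

def neuron(x):
    """Output of a single tanh unit; works on one input or a batch of rows."""
    return np.tanh(x @ w + b)

# Option 1: pick the training point that the neuron responds to most strongly.
X_train = rng.normal(size=(200, 5))
best_train = X_train[np.argmax(neuron(X_train))]

# Option 2: gradient ascent on the input x, with the weights held fixed.
# For this unit, d Output / d x = (1 - tanh(w.x + b)^2) * w.
x = np.zeros(5)
for _ in range(200):
    grad_x = (1.0 - neuron(x) ** 2) * w
    x = x + 0.1 * grad_x
    # Project back onto the unit ball (an assumed constraint); without it
    # the maximizer runs off to infinity along w.
    x = x / max(1.0, np.linalg.norm(x))
best_found = x

print(neuron(best_train), neuron(best_found))
```

For a deep network the gradient with respect to the input would come from backpropagation through the fixed weights instead of a closed-form expression.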