The False Discovery Rate (FDR) is the expected fraction of the hypotheses we accept (our "discoveries") that are actually false. In our case, the FDR is the fraction of the feature weights that we declare non-zero, and hence include in the model, even though they really are, in fact, zero.
For a single hypothesis test, we can compute a p-value: the probability, if the null hypothesis were true, of seeing evidence at least as extreme as what we actually observed. People often use a threshold such as requiring the p-value to be less than 0.05 to reject the null and accept the hypothesis; with that rule, a true null hypothesis is incorrectly rejected only 5% of the time.
If we test, say, m = 1,000 hypotheses (e.g., check for each of 1,000 different features whether or not it is correlated with some y), then with a p-value threshold of α = 0.05 we would expect, on average, 0.05*1,000 = 50 hypotheses to be falsely selected even if none of the features actually matter. If we want to avoid having, in expectation, more than one false discovery, then we need a far more stringent threshold. The Bonferroni correction requires that the p-value be less than α/m, where m is the number of hypotheses, such as the number of candidate features being considered for inclusion in the model. Thus, the Bonferroni-adjusted threshold in our example would be p-value < 0.05/1,000.
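To see those numbers in a quick simulation (a minimal sketch added here; rather than fitting anything to real features, it draws the test statistics directly from the null, so every selection below is by construction a false discovery):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, alpha = 1000, 0.05

# Under the global null every test statistic is pure noise: z ~ N(0, 1),
# so anything we "select" below is a false discovery.
z = rng.normal(size=m)
pvals = 2 * norm.sf(np.abs(z))   # two-sided p-values, Uniform(0, 1) under the null

print("raw threshold (p < alpha):  ", np.sum(pvals < alpha))      # around 50
print("Bonferroni (p < alpha / m): ", np.sum(pvals < alpha / m))  # usually 0
```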
The Bonferroni correction ends up being exactly the same as RIC: we're expecting 1 out of m (or p) features to come into the model, so we need a p-value that is p times smaller, which amounts to a penalty (in entropy, i.e., in log-probability space) of log(p).
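To make the arithmetic concrete (a worked example added here, reusing p = 1,000 and α = 0.05 from above): requiring the p-value to beat {$\alpha/p$} rather than {$\alpha$} means the evidence has to be p times stronger, since {$ \log\frac{1}{\alpha/p} = \log\frac{1}{\alpha} + \log p $}; with {$p=1000$}, that extra {$\log p \approx 6.9$} nats (about 10 bits) is the per-feature log(p) penalty just described.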
A more generous but still very sensible approach is to add features sequentially, increasing the threshold each time you add a feature (i.e., accept a hypothesis). If you're an empirical Bayesian, then before any features have proven significant you might expect one of them to be significant (or each one to enter with probability 1/p). Once you've seen one significant feature, it's more likely that you'll see another significant one, and if you've seen two significant features, then a third is even more likely. More precisely, your prior is a 1/p probability of seeing a significant feature; after you've seen one significant feature, your new 'prior' is 2/p; after two significant features, your new prior is 3/p, etc.
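In threshold terms (a small illustration with the same assumed numbers, p = 1,000 candidate features and α = 0.05), the growing 'prior' translates into a growing p-value cutoff for each successive feature:

```python
p, alpha = 1000, 0.05

# Cutoff for the k-th feature you consider adding: (k / p) * alpha.
for k in range(1, 4):
    print(f"feature #{k}: p-value must be below {k / p * alpha:.5f}")

# feature #1: p-value must be below 0.00005   (the Bonferroni / RIC threshold)
# feature #2: p-value must be below 0.00010
# feature #3: p-value must be below 0.00015
```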
This is the Simes procedure, or in modern terms the Benjamini–Hochberg (BH) step-up procedure.
Benjamini-Hochberg controls the FDR at level α, and works as follows:
For a given α: sort the features by the p-values of the tests that their weights are non-zero, so that {$ P_{(1)} \le P_{(2)} \le \ldots \le P_{(p)} $}. Find the largest q (the number of the p features we're putting into the model) such that {$ P_{(q)} \le \frac{q}{p}\alpha $}. Then reject the null hypothesis for (i.e., accept as non-zero and add into the model) the corresponding weights {$ w_{(j)} $} for {$ j=1,\ldots,q $}.
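A minimal sketch of the BH step-up rule in code (the helper name, the made-up p-values, and the default α are illustrative assumptions, not from the text above):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return the indices of the features selected by the BH step-up procedure."""
    pvals = np.asarray(pvals)
    p = len(pvals)
    order = np.argsort(pvals)                      # sort features by p-value
    sorted_p = pvals[order]
    thresholds = alpha * np.arange(1, p + 1) / p   # (q / p) * alpha for q = 1..p
    passing = np.nonzero(sorted_p <= thresholds)[0]
    if passing.size == 0:
        return np.array([], dtype=int)             # nothing passes: select no features
    q = passing[-1] + 1                            # largest q with P_(q) <= (q / p) * alpha
    return order[:q]                               # reject the null for the q smallest p-values

# Tiny example with made-up p-values for 10 candidate features.
pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.20, 0.35, 0.41, 0.60, 0.90]
print(benjamini_hochberg(pvals, alpha=0.05))       # -> [0 1]
```

With these made-up numbers, only the two smallest p-values clear their {$\frac{q}{p}\alpha$} lines, so BH selects just those two features even though two more are individually below 0.05; that is the FDR control at work.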