RealML

  • Before you do anything:
    • What question are you trying to answer?
    • How can you measure the answer?
    • What data can you get?
    • How well do you think you can do?
  • Overfitting is your worst enemy
    • Train, Validate (Quiz), Test
    • Out-of-sample in the real world is subtle
      • new people, products, words, time periods, countries, ...
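As a rough sketch of what "out-of-sample" can mean when the test set should contain new people, the split below holds out whole groups rather than individual rows. The function name `group_split`, the `user_ids` array, and the 40% test fraction are illustrative assumptions, not part of the notes.

```python
import numpy as np

def group_split(groups, test_fraction=0.2, seed=0):
    """Return boolean (train, test) masks such that no group is in both."""
    groups = np.asarray(groups)
    unique = np.unique(groups)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique)  # random order of group ids
    n_test = max(1, int(round(test_fraction * len(unique))))
    test_groups = unique[:n_test]
    test_mask = np.isin(groups, test_groups)
    return ~test_mask, test_mask

# Hypothetical data: two rows per user, five users.
user_ids = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
train_mask, test_mask = group_split(user_ids, test_fraction=0.4)
```

The same idea applies to time periods or countries: choose the split key to match the kind of novelty expected at deployment.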
  • Loss functions
    • L2 vs. L1 vs. L0 vs. Hinge vs. Exponential vs. cost
    • Classification problems often have asymmetric costs
    • precision/recall, sensitivity/specificity, ROC
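To make the asymmetric-cost point concrete, here is a small sketch that computes precision, recall, and a total misclassification cost at two thresholds. The labels, scores, and the 5:1 cost ratio for false negatives versus false positives are made-up values chosen only for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def total_cost(y_true, y_pred, cost_fp=1.0, cost_fn=5.0):
    """Assumed costs: a missed positive is 5x worse than a false alarm."""
    _, fp, fn, _ = confusion_counts(y_true, y_pred)
    return cost_fp * fp + cost_fn * fn

def threshold(scores, t):
    return [1 if s >= t else 0 for s in scores]

# Hypothetical labels and classifier scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]

cost_default = total_cost(y_true, threshold(scores, 0.5))   # 1 FP + 1 FN
cost_lowered = total_cost(y_true, threshold(scores, 0.35))  # 1 FP + 0 FN
```

With these assumed costs, lowering the threshold is cheaper because it trades a false negative away at no extra false positives; with a different cost ratio the preferred threshold may differ.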
  • Feature generation is critical
    • Think about the problem!!
    • How might you transform the features?
      • Do you want a scale-invariant method or not?
    • What else could you measure?
    • Is semi-supervised learning possible?
    • Are there surrogate labels you might use?
      • 'distant supervision'
  • Feature Blocks
    • Different feature sets need different regularization
    • One solution: block-stagewise regression
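One plausible reading of block-stagewise regression, sketched under assumptions: fit a ridge regression on the first feature block, then fit the next block to the remaining residual with its own penalty, and so on. The notes do not spell out the procedure, so the helper names (`ridge_fit`, `block_stagewise_fit`), the use of ridge penalties, and the toy data are all assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: (X'X + lam*I)^{-1} X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def block_stagewise_fit(blocks, y, lams):
    """Fit each block in turn to the residual left by earlier blocks."""
    residual = np.asarray(y, dtype=float).copy()
    weights = []
    for X, lam in zip(blocks, lams):
        w = ridge_fit(X, residual, lam)
        weights.append(w)
        residual = residual - X @ w
    return weights

def block_stagewise_predict(blocks, weights):
    return sum(X @ w for X, w in zip(blocks, weights))

# Toy example with two (orthogonal) feature blocks and different penalties.
X1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
X2 = np.array([[0.0], [0.0], [1.0], [1.0]])
y = np.array([2.0, -1.0, 3.0, 3.0])
weights = block_stagewise_fit([X1, X2], y, lams=[0.1, 1.0])
y_hat = block_stagewise_predict([X1, X2], weights)
```

In practice each block's penalty would presumably be chosen on validation data; the values 0.1 and 1.0 here are arbitrary.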
  • Ensemble methods
    • Combinations of multiple methods are almost always the most accurate
    • Averaging methods (or experts)
      • equal weighting
        • {$ \hat{y} = (1/K) \sum_{k=1}^{K} \hat{y}_k$}
      • inverse variance-based weighting
        • {$ \hat{y} = (\sum_k \hat{y}_k / \sigma_k^2 ) /( \sum_k 1/ \sigma_k^2)$}
      • regression-based weighting
        • {$ \hat{y} = \sum_k w_k \hat{y}_k$}
    • Boosting
    • Random Forests
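The three averaging schemes listed above can be sketched with NumPy as follows. Here `preds` holds one row of predictions per model, `variances` holds an assumed error variance per model, and the regression weights are obtained by ordinary least squares; the numbers are invented for illustration.

```python
import numpy as np

def equal_weight(preds):
    """Simple mean over models; preds has shape (n_models, n_points)."""
    return preds.mean(axis=0)

def inverse_variance_weight(preds, variances):
    """Weight each model by 1 / sigma_k^2, normalized to sum to one."""
    w = 1.0 / np.asarray(variances, dtype=float)
    return (w[:, None] * preds).sum(axis=0) / w.sum()

def regression_weights(preds, y):
    """Least-squares weights w such that sum_k w_k * preds[k] approximates y."""
    w, *_ = np.linalg.lstsq(preds.T, y, rcond=None)
    return w

# Two hypothetical models predicting three points.
preds = np.array([[1.0, 2.0, 3.0],
                  [3.0, 2.0, 1.0]])
y_equal = equal_weight(preds)
y_ivw = inverse_variance_weight(preds, variances=[1.0, 4.0])

# Targets constructed as 0.7 * model 1 + 0.3 * model 2.
y = np.array([1.6, 2.0, 2.4])
w = regression_weights(preds, y)
```

Regression-based weights are usually fit on held-out predictions, since fitting them on the training data tends to favor whichever model overfits most.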
  • Missing data
    • Data that are missing at random (MAR) and data that are not require different handling
      • Imputation works well when data are MAR, but most data are not, so it is usually best to add a new variable indicating whether or not the feature is missing.
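A minimal sketch of the missing-indicator idea, assuming missing values are encoded as NaN and using column-mean imputation (one of several possible imputation choices). The function name `impute_with_indicator` is illustrative.

```python
import numpy as np

def impute_with_indicator(X):
    """Fill NaNs with column means and append one 0/1 'was missing' column per feature."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)  # assumes no column is entirely NaN
    X_filled = np.where(missing, col_means, X)
    return np.hstack([X_filled, missing.astype(float)])

# Hypothetical data with two features and two missing entries.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])
X_aug = impute_with_indicator(X)
```

For a real train/test workflow the column means would be computed on the training set and reused at test time; this sketch computes them from the array it is given.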
  • Explanation/Insight is often important
    • Look at the data!
      • posts, images scoring highest in some feature or outcome
      • error analysis
    • variable importance
      • How "important" is each feature for the prediction?
    • visualization: word clouds, PCA, MDS
      • MDS: given an {$n \times n$} matrix of distances between points, find a new (usually 2-D) representation of each point that preserves those distances as closely as possible
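As an illustration of MDS, here is a sketch of the classical (Torgerson) variant, which double-centers the squared distance matrix and uses its top eigenvectors. This is only one form of MDS; other variants minimize a stress function iteratively. The function name and the example points are assumptions.

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed n points in `dim` dimensions from an n x n distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:dim]    # indices of the largest ones
    lam = np.clip(eigvals[idx], 0.0, None)   # guard against tiny negatives
    return eigvecs[:, idx] * np.sqrt(lam)

# Hypothetical points; build their distance matrix, then re-embed from it.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
Y = classical_mds(D, dim=2)
D_new = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
```

The recovered coordinates `Y` generally differ from `points` by a rotation, reflection, or translation, but for Euclidean input distances the pairwise distances are preserved.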
  • Feature representation
    • reduce dimension using PCA, Autoencoding, or clustering like LDA
    • Think about how to best handle categorical variables
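For the dimension-reduction bullet, a short sketch of PCA computed through the singular value decomposition of the centered data. The function name `pca_reduce` and the toy data (points lying on a line in 3-D) are illustrative assumptions.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project centered data onto its top principal directions."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, components, mean

# Hypothetical data: five points along the direction (1, 2, 2).
t = np.arange(5.0)
X = t[:, None] * np.array([1.0, 2.0, 2.0])
Z, components, mean = pca_reduce(X, n_components=1)
X_rebuilt = Z @ components + mean
```

Because the toy data are exactly one-dimensional, a single component reconstructs them; real data would lose some variance at this step.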
  • Correlation is not causality?
