# RealML

**Before you do anything**

- What question are you trying to answer?
- How can you measure the answer?
- What data can you get?
- How well do you think you can do?
- Are the numbers of observations and features reasonable?

**Overfitting is your worst enemy**

- Train, Validate (Quiz), Test
- Out-of-sample in the real world is subtle
  - new people, products, words, time periods, countries, ...
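
A minimal sketch of one way this plays out in practice: when observations are ordered in time, a random split leaks future information into training, while a chronological split does not. The data and split fractions here are made up for illustration.

```python
# Sketch: an out-of-sample split that respects time ordering.
# A random shuffle would let the model "see the future"; slicing by time does not.

def time_split(records, train_frac=0.6, val_frac=0.2):
    """Split records (assumed sorted by time) into train / validate / test."""
    n = len(records)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = records[:n_train]
    validate = records[n_train:n_train + n_val]   # the "quiz" set
    test = records[n_train + n_val:]              # touched once, at the very end
    return train, validate, test

records = list(range(10))            # stand-in for 10 time-ordered observations
train, validate, test = time_split(records)
```

The same idea extends to the other subtle cases listed above: split by person, product, or country rather than by row, so the test set contains genuinely unseen units.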

**Loss functions**

- What is the end-use loss function (utility)?
- What is a good loss function for selecting a model?
- What is a good training loss function?
- L2 vs. L1 vs. L0 vs. hinge vs. exponential vs. cost
- Classification problems often have asymmetric costs
  - precision/recall, sensitivity/specificity, ROC
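
The losses and the precision/recall metrics above can be sketched in a few lines; the inputs here are made-up single examples, with `y` the true label and `f` the model's prediction (labels in {-1, +1} for the margin-based losses).

```python
import math

# Sketch: the per-example loss functions named above.

def l2_loss(y, f):
    return (y - f) ** 2                # squared error

def l1_loss(y, f):
    return abs(y - f)                  # absolute error

def l0_loss(y, f):
    return 0 if y == f else 1          # misclassification (0/1) loss

def hinge_loss(y, f):
    # y in {-1, +1}; zero once the margin y*f exceeds 1 (SVM loss)
    return max(0.0, 1.0 - y * f)

def exp_loss(y, f):
    # y in {-1, +1}; the loss minimized by AdaBoost
    return math.exp(-y * f)

# Sketch: precision and recall from hard 0/1 predictions,
# the natural summaries when false positives and false negatives cost differently.

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)         # of the things flagged, how many were right?
    recall = tp / (tp + fn)            # of the true positives, how many were found?
    return precision, recall
```

Note that the training loss (e.g., hinge), the model-selection loss (e.g., recall at fixed precision), and the end-use cost need not be the same function.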

**Feature generation is critical**

- Think about the problem!
- How might you transform the features?
- Do you want a scale-invariant method or not?
- What else could you measure?
- Is semi-supervised learning possible?
- Are there surrogate labels you might use?
  - 'distant supervision'
- Feature blocks
  - Different feature sets need different regularization
  - One solution: block-stagewise regression
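
Two common transforms, sketched on made-up feature values: standardizing matters when the method is not scale-invariant (k-NN, SVMs, PCA), and a log transform tames heavy-tailed counts.

```python
import math

def zscore(xs):
    """Standardize a feature so scale-sensitive methods treat columns comparably."""
    mean = sum(xs) / len(xs)
    sd = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / sd for x in xs]

def log1p_all(xs):
    """Compress heavy-tailed nonnegative counts (word counts, follower counts, ...)."""
    return [math.log1p(x) for x in xs]
```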

**Ensemble methods**

- Combinations of multiple methods are almost always the most accurate
- Averaging methods (or experts)
  - equal weighting
    - {$ \hat{y} = (1/k) \sum_k \hat{y}_k $}
  - inverse-variance weighting
    - {$ \hat{y} = (\sum_k \hat{y}_k / \sigma_k^2 ) / ( \sum_k 1/ \sigma_k^2) $}
  - regression-based weighting
    - {$ \hat{y} = \sum_k w_k \hat{y}_k $}
- Boosting
- Random Forests
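
The three averaging formulas above can be written directly; the predictions and variances below are made-up numbers for a single observation.

```python
# Sketch: the three averaging schemes, given k model predictions for one observation.

def equal_weight(preds):
    # y_hat = (1/k) * sum_k y_hat_k
    return sum(preds) / len(preds)

def inverse_variance_weight(preds, variances):
    # y_hat = (sum_k y_hat_k / sigma_k^2) / (sum_k 1 / sigma_k^2)
    num = sum(p / v for p, v in zip(preds, variances))
    den = sum(1.0 / v for v in variances)
    return num / den

def regression_weight(preds, weights):
    # y_hat = sum_k w_k * y_hat_k, where the w_k would be fit by
    # regressing held-out outcomes on the models' predictions
    return sum(w * p for w, p in zip(weights, preds))

preds = [1.0, 2.0, 3.0]
equal_weight(preds)
inverse_variance_weight(preds, [1.0, 1.0, 0.5])   # trusts the low-variance model more
```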

- Missing data
  - Missing at random (MAR) or not requires different handling
  - Imputation works well for MAR, but most data are not MAR, so it is best to add a new variable that indicates whether or not the feature is missing.
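
A minimal sketch of the recommendation above: impute with the column mean, but also emit an indicator column so the model can learn from the missingness itself. The column values here are made up, with `None` marking a missing entry.

```python
# Sketch: mean imputation plus a missing-indicator feature.

def impute_with_indicator(column):
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    imputed = [mean if x is None else x for x in column]      # filled-in feature
    indicator = [1 if x is None else 0 for x in column]       # "was it missing?"
    return imputed, indicator

col = [2.0, None, 4.0]
imputed, indicator = impute_with_indicator(col)
```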
**Explanation/Insight is often important**

**Look at the data!**

- posts, images scoring highest in some feature or outcome
- error analysis
- variable importance
  - How "important" is each feature for the prediction?
- visualization: word clouds, PCA, MDS
  - MDS: given an {$n \times n$} matrix of distances between points, find a new (usually 2-D) representation of each of the points that as closely as possible preserves that distance matrix
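
A sketch of classical (metric) MDS, assuming NumPy is available: double-center the squared distance matrix, eigendecompose it, and keep the top-k directions. The four test points are made up; when the points truly live in 2-D, the 2-D embedding preserves the distance matrix exactly (up to rotation).

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed n points in k dimensions so pairwise distances approximate D (n x n)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)            # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]          # take the top-k eigenpairs
    L = np.clip(vals[idx], 0, None)           # guard against tiny negative eigenvalues
    return vecs[:, idx] * np.sqrt(L)
```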

**Feature representation**

- Reduce dimension using PCA, autoencoders, or clustering-style methods like LDA
- Think about how best to handle categorical variables
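
For the categorical-variable point, the most common handling is one-hot encoding; a minimal sketch on made-up category labels:

```python
# Sketch: one-hot encoding of a categorical feature.

def one_hot(values):
    categories = sorted(set(values))          # fixed, ordered category vocabulary
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

rows, cats = one_hot(["red", "blue", "red"])
```

For high-cardinality categoricals (many distinct values), alternatives such as hashing or learned embeddings are often preferable to a very wide one-hot matrix.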

**Correlation is not causality?**