FinalReview

Review of key concepts and take-aways

ML algorithms are characterized by the model form used, the loss function being minimized, and the optimization method used to minimize the loss function.

ML can to parametric or non-parametric

Non-parametric methods like K-NN, decision trees, and random forests do not assume any particular model form, and allow the model to be made as complex as the data warrant. They have no parameters to be learned.

Parametric models like linear regression, logistic regression, and belief nets (including Naive Bayes and HMMs) make strong assumptions about the form of the model. They are particularly useful when the number of observations is small (or, more precisely, the number of observations is not much larger than the number of features).

Semi-parametric is sometimes used as a label for models such as neural nets, which have so many parameters that they can flexibly fit almost any function. Overfitting is then controlled via regularization.

ML often models probability distributions

Probabilistic reasoning is central to many machine learning tasks. Probability is an extremely useful way of quantifying our beliefs about the state of the world.

Generative vs. discriminative models. Many of the methods discussed in this course model one of the following probability distributions:

 Generative (Unsupervised)GenerativeDiscriminative
 {$P(X)$}{$P(X,Y)$}{$P(Y \mid X)$}
ExampleGMMNaive BayesLogistic Regression

Neural nets also often model probability distributions, either outputting a distribution over class labels using a softmax function, or building a generative model with a GAN.

Remember that using the basic rules of probability, we can compute a discriminative posterior {$P(Y \mid X) = P(X,Y) / \sum_{Y'} P(X,Y')$} for any generative model in order to make decisions. In lecture, we showed that NB and LR can both lead to the same form of posterior, although the parameters estimated will generally be different.

Also, note that any generative model can be used as an unsupervised method. Since {$P(X) = \sum_Y P(X,Y)$}, a generative model defines a distribution over {$X$} by marginalizing out the class labels. If the class labels are unknown, we can use EM to estimate them! In this sense, a GMM is just a Gaussian Naive Bayes model with unknown classes.

Graphical models cover most of the probability models we discussed. Naive Bayes, LDA, HMM, MDP. Graphical models are an efficient and intuitive way of encoding a set of independence assumptions about a set of random variables.

ML is (often) optimization

While probability is the language by which we express our beliefs about the state of the universe, in order to talk about achieving goals we need to introduce the language of optimization.

Objective (Loss) functions. The standard machine learning paradigm is to (1) define an objective function that captures performance of a model on a given task and (2) optimization that objective with respect to some parameters. We saw a variety of loss functions in this course: (here, {$f(x)$} is a predictive model that returns a real-valued number)

0-1 (binary)SquaredExponentialLogisticHinge
{$\mathbf{1} [y \ne \textrm{sign} (f(x))]$}{$(y-f(x))^2$}{$\exp\{-yf(x)\}$}{$\log(1 + \exp\{-yf(x)\}) $}{$\max\{0, 1-yf(x)\} $}

The best way to understand the behavior of an algorithm generally by first determining what loss function it optimizes, and then how it optimizes that loss.

Regularization: MLE vs. MAP. We learned that the phenomenon of overfitting can occur if we optimize our objective on training data too tightly. To explicitly control the extent to which we "trust" the training data, we introduced the concept of priors or regularization. If {$\mathcal{D}_X$}, {$\mathcal{D}_Y$} are the dataset, and {$\theta$} our parameters, then probabilistically we have

{$ - \log P(\mathcal{D}_X,\mathcal{D}_Y,\theta) = - \log P(\mathcal{D}_X,\mathcal{D}_Y \mid \theta) - \log P(\theta) = loss(\theta) + regularizer(\theta) $}

In practice, we can substitute whatever loss function we like and whatever regularization function we like, provided we can still solve the resulting optimization problem.

Convex: Gradient descent. How do we solve optimization problems? We saw a very simple, powerful method for optimization convex objectives: gradient descent (or ascent for concave functions.) All we need to do is be able to compute a gradient (or any sub-gradient, if the function is not differentiable) at every point. Gradient descent is widely used for non-convex problems as well. Sometimes we need to alternate between different descent steps (e.g. the E and M steps or EM, or learning a policy and a Q function, or fitting a model and finding its minimum.

Lagrange Multipliers. Often we want to optimize subject to some constraints. The standard way to pull constraints into to optimization function is through the use of Lagrange multipliers. (not really covered)

Primal vs. dual problems. We saw that for the any of the linear methods we discussed in the course, we can represent a hypothesis {$h(x) = \mathbf{w}^\top \mathbf{x}$} in two equivalent ways:

PrimalDual
{$ \mathbf{w}^\top \mathbf{x} $}{$ \sum_i \alpha_i \mathbf{x}_i^\top\mathbf{x}$}

In the case of SVMs, we saw how to obtain the dual by computing the Lagrangian, but this sort of dual representation can be used in almost all of the methods we discussed in the course so far. Using the dual representation lets us kernelize a method by substituting {$\mathbf{x}^\top\mathbf{x'} = k(\mathbf{x},\mathbf{x'}) $}. The dual for regression is covered in the notes on sampling.

Non-Convex: Expectation Maximization. Many of the loss functions we covered are not convex. When we write EM in objective form, letting {$\mathcal{D}_Z$} be the set of unknown data labels, it is a function with two inputs:

{$ \; \begin{align*} \log P(\mathcal{D}_X \mid \theta) \ge F(Q, \theta) & = \mathbf{E}_Q \left[\log P(\mathcal{D}_X, \mathcal{D}_Z \mid \theta) \right] + H(Q) \\ & = -KL(Q || P(\mathcal{D}_Z \mid \mathcal{D}_X, \theta)) + \log P (\mathcal{D}_X \mid \theta) \end{align*} \; $}

Note that we've written the EM objective in two different forms, highlighting how it can be written to make solving for either {$\theta$} or {$Q$} easy to derive. If we fix {$Q$}, {$F$} becomes convex in {$\theta$}, and visa versa. This is known as a Bi-Convex function, and an easy way to get to a local optima is to alternate between solving for {$Q \mid \theta$} and {$\theta \mid Q$}. K-means is another example of this sort of optimization.

PCA: Non-convex optimization through SVD. Unlike general EM, PCA finds the optimal solution of a bi-convex problem by taking advantage of the orthonormal property of the PCA basis. It's also possible to find local minima of the PCA objective through EM-like procedures when computing the SVD isn't possible, but that's outside the scope of this course.

ML can be analyzed theoretically

Bias, Variance trade-off. The Bias-Variance trade-off characterizes the trade-off between underfitting and overfitting.

Minimum description length (MDL). (not covered) One way of addressing the bias/variance trade-off is to look at how many bits it takes to code a model (e.g., which variables are in the regression, and how accurately to code each of the coefficients) plus how many bits it takes to code the residual. When the number of bits to specify this two part model (code plus residual) is minimized, one does not (in expectation) under- or over-fit.

Occam's Razor, VC-Dimension. (not covered) The VC-dimension of a class is a measure of the complexity of that hypothesis class. Whereas bias-variance analyzed the error of algorithms over many random experiments, with generalization bounds we attempt to bound the test error of the algorithm given any single instance of a learning problem. And what can one show? That all else being equal, the algorithm/hypothesis class with smaller VC-dimension has a better bound on generalization error. This should sound familiar to everyone: it is the machine learning version of the famous Occam's Razor.

Online learning regret bounds. We introduced another learning setting that is different from the batch setting mostly considered in the class. The key differences are (1) we make predictions one at a time in order to receive the next training example and (2) we do not assume an underlying probability distribution {$P(X,Y)$} in our analysis, i.e. we do not make the i.i.d. assumption. Again, we didn't cover the theory, but by making further assumptions on can prove limits on the error rates; e.g. for linear classifiers (Perceptrons), if one assumes linear separability, then the one can prove nice things about convergence.

ML cares about linear vs. non-linear

Linear separability. Linearly separable datasets make like easier: Hard Margin SVMs work, Perceptrons converge, etc.

Non-linearity through kernels. Why do we use the dual and kernelize? Firstly kernels let us transform the data to a new space that is closer to being linearly separable. Kernel also offer a computational advantage (not covered): kernels let us compute the dot product of a vast array (possibly infinite) of features efficiently. Without a cheap analytic form, we might as well just compute all the features and use the primal form explicitly.

Neural networks Another way to achieve non-linearity is through "deep" model architectures, such as neural networks. These lead to highly non-convex, difficult to solve optimization problems, yet they show amazingly promising results on large datasets. Deep learning, CNNs and related architectures are increasingly dominating machine learning.

Reinforcement Learning

Problems from game play to robot control can be framed as reinforcement learning, where an optimal policy is learned by experimentation. Learning can be "on policy" or "off policy", can use a model (as in an MDP) or can be model-free (as in V and Q-learning). There can be an explicit representation of the policy or it can be represented implicitly (e.g. in a Q-function). For game-play, Deep Q-learning currently dominates.

ML in practice

Problem is determined by the labels (or lack thereof). What makes a machine learning problem? The simplest way of phrasing many of the ML problems we've discussed is "given some computed features {$X$}, predict something {$Y$}." Depending on whether or not {$Y$} is given to us, we have unsupervised (not given) vs supervised (given) vs semi-supervised (some {$Y$}'s are given) problems. Depending on the domain of {$Y$}, we end up with binary classification, regression, multi-class, etc.

One take-away from this course should be knowing how to approach a real-world problem and formulate it as a machine learning task: what model form is likely to be good? what loss function? how can you regularize?

Cross-validation. Cross-validation is critical for selecting hyperparameters, selecting between different models, and 'honestly' evaluating their performance. Overfitting is the most common mistake that beginning ML practitioners make, often by overfitting to a "test" set, that is not held out until all hyperparamters have been selected. Good auto-ML methods can reduce this danger, as can simply hiding a test set until your are *really* done tuning your model.

Tabular Overview

This table compares some of the algorithms we covered this semester. It is not exhaustive, so don't hesitate to add your own rows and columns!

 Inputs (X)Outputs (Y)Supervised?Generative / Discrim.Misc.
AlgorithmContinuousDiscreteClassificationRegressionSUGDLinearOptimization (optimal?)
KNN    
Kernel Regression    
Kernel Ridge Regression    (yes, closed-form)
Decision Trees     (IG, not optimal) 
Linear Regression    (yes, closed-form)
Naive Bayes {$(\mu,\Sigma)$} {$(\theta)$}    (yes, iclosed-form)
Logistic Regression {$(\mu,\Sigma)$} {$(\theta)$}    (yes)
SVM    (yes)
Boosting     
K-means     (no)
EM for Mix. of Gaussians {$(\mu,\Sigma)$}    ( if shared variance) (no)