Final ReviewOn this page… (hide) Review of key concepts and take-awaysML is (often) modeling a probability distributionWe learned in this course that probabilistic reasoning is central to many machine learning tasks. Probability is an extremely useful way of quantifying our beliefs about the state of the world. Generative vs. discriminative models. Many of the methods discussed in this course model one of the following probability distributions:
Remember that using the basic rules of probability, we can compute a discriminative posterior {$P(Y \mid X) = P(X,Y) / \sum_{Y'} P(X,Y')$} for any generative model in order to make decisions. In lecture, we showed that NB and LR can both lead to the same form of posterior, although the parameters estimated will generally be different. Also note that any generative model can be used as an unsupervised method. Since {$P(X) = \sum_Y P(X,Y)$}, a generative model defines a distribution over {$X$} by marginalizing out the class labels. If the class labels are unknown, we can use EM to estimate them! In this sense, a GMM is just a Gaussian Naive Bayes model with unknown classes. Graphical models cover all probability models we discussed. Graphical models are an efficient and intuitive way of encoding a set of independence assumptions about a set of random variables. As an exercise, try to draw the models to represent Naive Bayes and Logistic Regression. Once you have learned about graphical models, it’s rare to ever talk about probabilities again without them! ML is (often) optimizationWhile probability is the language by which we express our beliefs about the state of the universe, in order to talk about achieving goals we need to introduce the language of optimization. Objective (Loss) functions. The standard machine learning paradigm is to (1) define an objective function that captures performance of a model on a given task and (2) optimization that objective with respect to some parameters. We saw a variety of loss functions in this course: (here, {$f(x)$} is a predictive model that returns a real-valued number)
The key thing to remember is that we can often understand the behavior of an algorithm by figuring out and analyzing the behavior of the loss function that it optimizes and how it optimizes it. Regularization: MLE vs. MAP. We learned that the phenomenon of overfitting can occur if we optimize our objective on training data too tightly. To explicitly control the extent to which we “trust” the training data, we introduced the concept of priors or regularization. If {$\mathcal{D}_X$}, {$\mathcal{D}_Y$} are the dataset, and {$\theta$} our parameters, then probabilistically we have {$ - \log P(\mathcal{D}_X,\mathcal{D}_Y,\theta) = - \log P(\mathcal{D}_X,\mathcal{D}_Y \mid \theta) - \log P(\theta) = loss(\theta) + regularizer(\theta) $} In practice, we can substitute whatever loss function we like and whatever regularization function we like, provided we can still solve the resulting optimization problem. Convex: Gradient ascent. How do we solve optimization problems? We saw a very simple, powerful method for optimization convex objectives: gradient descent (or ascent for concave functions.) All we need to do is be able to compute a gradient (or any sub-gradient, if the function is not differentiable) at every point. Lagrange Multipliers. Often we want to optimize subject to some constraints. The standard way to pull constraints into to optimization function is through the use of Lagrange multipliers. Primal vs. dual problems. We saw that for the any of the linear methods we discussed in the course, we can represent a hypothesis {$h(x) = \mathbf{w}^\top \mathbf{x}$} in two equivalent ways:
In the case of SVMs, we saw how to obtain the dual by computing the Lagrangian, but this sort of dual representation can be used in almost all of the methods we discussed in the course so far. Using the dual representation lets us kernelize a method by substituting {$\mathbf{x}^\top\mathbf{x'} = k(\mathbf{x},\mathbf{x'}) $}. The dual for regression is covered in the notes on sampling. Non-Convex: Expectation Maximization. Some of the objectives we have covered are not convex. When we write EM in objective form, letting {$\mathcal{D}_Z$} be the set of unknown data labels, it is a function with two inputs: {$ \; \begin{align*} \log P(\mathcal{D}_X \mid \theta) \ge F(Q, \theta) & = \mathbf{E}_Q \left[\log P(\mathcal{D}_X, \mathcal{D}_Z \mid \theta) \right] + H(Q) \\ & = -KL(Q || P(\mathcal{D}_Z \mid \mathcal{D}_X, \theta)) + \log P (\mathcal{D}_X \mid \theta) \end{align*} \; $} Note that we’ve written the EM objective in two different forms, highlighting how it can be written to make solving for either {$\theta$} or {$Q$} easy to derive. If we fix {$Q$}, {$F$} becomes convex in {$\theta$}, and visa versa. This is known as a Bi-Convex function, and an easy way to get to a local optima is to alternate between solving for {$Q \mid \theta$} and {$\theta \mid Q$}. K-means is another example of this sort of optimization. PCA: Non-convex optimization through SVD. Unlike general EM, PCA finds the optimal solution of a bi-convex problem by taking advantage of the orthonormal property of the PCA basis. We didn’t cover the proof of this in the general case, but it’s important for you to recognize this fact. Note that it’s also possible to find local minima of the PCA objective through EM-like procedures when computing the SVD isn’t possible, but that’s outside the scope of this course. ML can be analyzed theoreticallyBias, Variance trade-off. The Bias-Variance trade-off was the first analysis of the error of algorithms that we saw, and is key to steering between underfitting and overfitting. Minimum description length (MDL). We showed that one way of addressing the bias/variance trade-off is to look at how many bits it takes to code a model (e.g., which variables are in the regression, and how accurately to code each of the coefficients) plus how many bits it takes to code the residual. When the number of bits to specify this two part model (code plus residual) is minimized, one does not (in expectation) under- or over-fit. Occam’s Razor, VC-Dimension. The VC-dimension of a class is a measure of the complexity of that hypothesis class. Whereas bias-variance analyzed the error of algorithms over many random experiments, with generalization bounds we attempt to bound the test error of the algorithm given any single instance of a learning problem. And what can one show? That all else being equal, the algorithm/hypothesis class with smaller VC-dimension has a better bound on generalization error. This should sound familiar to everyone: it is the machine learning version of the famous Occam’s Razor. Online learning regret bounds. We introduced another learning setting that is different from the batch setting mostly considered in the class. The key differences are (1) we make predictions one at a time in order to receive the next training example and (2) we do not assume an underlying probability distribution {$P(X,Y)$} in our analysis, i.e. we do not make the i.i.d. assumption. Again, we didn’t cover the theory, but by making further assumptions on can prove limits on the error rates; e.g. for linear classifiers (Perceptrons), if one assumes linear separability, then the one can prove nice things about convergence. ML cares about linear vs. non-linearLinear separability. At this point it should be apparent to everybody that a linearly separable dataset makes life 1000 times easier; Hard Margin SVMs work, Perceptrons converge, etc. Non-linearity through kernels. Why do we use the dual and kernelize? The “trick” behind the kernel is computational: kernels let us compute the dot product of a vast array (possibly infinite) of features efficiently. Without a cheap analytic form, we might as well just compute all the features and use the primal form explicitly. Neural networks Another way to achieve non-linearity is through “deep” model architectures, such as neural networks. These lead to highly non-convex, difficult to solve optimization problems, yet they show amazingly promising results on large datasets. ML in practiceProblem is determined by the labels (or lack thereof). What makes a machine learning problem? The simplest way of phrasing many of the ML problems we’ve discussed is “given some computed features {$X$}, predict something {$Y$}.” Depending on whether or not {$Y$} is given to us, we have unsupervised (not given) vs supervised (given) vs semi-supervised (some {$Y$}’s are given) problems. (Note that we didn’t discuss semi-supervised much in this course.) Depending on the domain of {$Y$}, we end up with binary classification, regression, multi-class, etc. In this sense, one take-away from this course should be knowing how to approach a real-world problem and formulate it as a machine learning task, with some idea of algorithms you could use to begin to tackle the problem. Cross-validation. As described in the video, cross validation is an incredibly important tool in an ML practitioner’s toolbox. It’s critical for both training, evaluating, and selecting between different models. At this point, you should be know how to approach a problem, divide the data into training and test, and compare different algorithms on that problem. Lessons from the project. The project is arguably the most practical learning experience of the course. As we saw, overfitting is a real problem, and occurs even when the overfitting happens at the human designer level (i.e. overfitting to the Quiz set.) Tabular OverviewThis table compares some of the algorithms we covered this semester. It is not exhaustive, so don’t hesitate to add your own rows and columns!
|