Here’s ML about 50 years ago: from the vault.
Your new camera detects faces in images for better focusing and light-metering. This is probably the most-fielded example of classification. Photo management software clusters the faces in your images to index photos by the people in them.
You can dictate your blog or search the web by voice. Speech recognition systems are built using (mostly) supervised learning.
Google clusters news stories into groups on the same topic and classifies them into sections like World, Sports, Entertainment, Tech, etc. This is primarily an example of unsupervised learning.
Netflix uses your movie ratings to predict what other movies you might like. This is a very useful (and profitable) example of supervised learning: regression (although it uses unsupervised techniques like dimensionality reduction).
Kinect uses supervised learning to detect joint positions from depth images (decision trees, actually, and a lot of them).
Complex control policies can be learned using reinforcement learning.
Probabilistic reasoning is central to many machine learning tasks. Probability is an extremely useful way of quantifying our beliefs about the state of the world.
Generative vs. discriminative models. Many of the methods discussed in this course model one of the following probability distributions:
Model type | Generative (Unsupervised) | Generative | Discriminative |
---|---|---|---|
Distribution | {$P(X)$} | {$P(X,Y)$} | {$P(Y \mid X)$} |
Example | GMM | Naive Bayes | Logistic Regression |
Using the basic rules of probability, we can compute a discriminative posterior {$P(Y \mid X) = P(X,Y) /\sum_{Y'} P(X,Y')$} for any generative model in order to make decisions. In lecture, we showed that NB and LR can both lead to the same form of posterior, although the parameters estimated will generally be different.
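As a small illustration of the normalization above, here is a minimal sketch that turns a joint table into a posterior. The joint distribution, its variable names, and the helper `posterior_from_joint` are hypothetical and chosen only for this example.

```python
def posterior_from_joint(joint, x):
    """Return P(Y | X = x) given a table joint[(x, y)] = P(X = x, Y = y)."""
    # Numerators: P(X = x, Y = y) for every label y seen with this x
    scores = {y: p for (xi, y), p in joint.items() if xi == x}
    # Denominator: P(X = x) = sum over y' of P(X = x, Y = y')
    evidence = sum(scores.values())
    if evidence == 0:
        raise ValueError("x has zero probability under the joint")
    return {y: p / evidence for y, p in scores.items()}

# Hypothetical joint over X in {rain, sun} and Y in {umbrella, none}
joint = {
    ("rain", "umbrella"): 0.30,
    ("rain", "none"): 0.10,
    ("sun", "umbrella"): 0.05,
    ("sun", "none"): 0.55,
}

post = posterior_from_joint(joint, "rain")
prediction = max(post, key=post.get)  # decide by picking the most probable label
print(post, prediction)
```

For {$X = \textrm{rain}$} the posterior puts roughly 0.75 on "umbrella", since {$0.30 / (0.30 + 0.10) = 0.75$}.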
Any generative model can be used as an unsupervised method. Since {$P(X) = \sum_Y P(X,Y)$}, a generative model defines a distribution over {$X$} by marginalizing out the class labels. If the class labels are unknown, we can use EM to estimate them! In this sense, a GMM is just a Gaussian Naive Bayes model with unknown classes.
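To make the "Gaussian Naive Bayes with unknown classes" view concrete, here is a minimal sketch of EM for a one-dimensional, two-component GMM. The synthetic data, the starting means, the unit starting variances, and the variance floor are all assumptions made for this example, not settings from the course.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, init_means, n_iters=50):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k      # assumed starting variances
    weights = [1.0 / k] * k    # assumed uniform starting class prior P(Y)
    for _ in range(n_iters):
        # E-step: soft labels P(Y = c | x) = P(x, c) / sum_c' P(x, c')
        resp = []
        for x in data:
            joint = [weights[c] * gaussian_pdf(x, means[c], variances[c])
                     for c in range(k)]
            evidence = sum(joint)  # P(x), the class label marginalized out
            resp.append([j / evidence for j in joint])
        # M-step: the usual Gaussian Naive Bayes estimates, weighted by soft labels
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            weights[c] = n_c / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / n_c
            variances[c] = max(var, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances

# Synthetic data: two well-separated clusters whose labels we pretend not to know
rng = random.Random(0)
data = ([rng.gauss(-2.0, 0.5) for _ in range(100)]
        + [rng.gauss(3.0, 0.5) for _ in range(100)])

weights, means, variances = em_gmm_1d(data, init_means=[-1.0, 1.0])
print(weights, means, variances)
```

If the labels were observed, the E-step would be unnecessary and the M-step alone would give the supervised estimates. EM simply replaces the missing labels with their posterior under the current parameters.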
Graphical models cover all probability models we discussed. Graphical models are an efficient and intuitive way of encoding a set of independence assumptions about a set of random variables. As an exercise, try to draw the models to represent Naive Bayes and Logistic Regression. Once you have learned about graphical models, it’s rare to ever talk about probabilities again without them!
While probability is the language by which we express our beliefs about the state of the universe, in order to talk about achieving goals we need to introduce the language of optimization.
Objective (Loss) functions. The standard machine learning paradigm is to (1) define an objective function that captures the performance of a model on a given task and (2) optimize that objective with respect to some parameters. We saw a variety of loss functions in this course (here, {$f(x)$} is a predictive model that returns a real-valued number and {$y \in \{-1,+1\}$} is the label):
0–1 (binary) | Squared | Exponential | Log | Hinge |
---|---|---|---|---|
{$\mathbf{1}(y \ne \textrm{sign}(f(x)))$} | {$(y-f(x))^2$} | {$\exp\{-yf(x)\}$} | {$\log(1 + \exp\{-yf(x)\}) $} | {$\max\{0, 1-yf(x)\} $} |
The key thing to remember is that we can often understand the behavior of an algorithm by figuring out and analyzing the behavior of the loss function that it optimizes and how it optimizes it.
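One way to compare the losses in the table is to evaluate them side by side on a few predictions. The sketch below assumes labels {$y \in \{-1,+1\}$} and treats {$\textrm{sign}(0)$} as {$+1$}; both are conventions chosen for this example.

```python
import math

def zero_one_loss(y, fx):
    pred = 1 if fx >= 0 else -1  # assumption: sign(0) is treated as +1
    return 0.0 if y == pred else 1.0

def squared_loss(y, fx):
    return (y - fx) ** 2

def exponential_loss(y, fx):
    return math.exp(-y * fx)

def log_loss(y, fx):
    return math.log1p(math.exp(-y * fx))  # log(1 + exp(-y f(x)))

def hinge_loss(y, fx):
    return max(0.0, 1.0 - y * fx)

losses = [zero_one_loss, squared_loss, exponential_loss, log_loss, hinge_loss]

# Evaluate every loss for a positive example at several prediction values
y = 1
for fx in [-2.0, 0.0, 0.5, 2.0]:
    row = ", ".join(f"{loss.__name__}={loss(y, fx):.3f}" for loss in losses)
    print(f"f(x)={fx:+.1f}: {row}")
```

Note how the squared loss penalizes a confident correct prediction ({$f(x) = 2$}, {$y = 1$}) while the hinge loss is exactly zero there, and how the exponential loss grows much faster than the log loss on confident mistakes.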
Regularization: MLE vs. MAP. We learned that overfitting can occur if we optimize our objective on the training data too tightly. To explicitly control the extent to which we “trust” the training data, we introduced the concept of priors or regularization. If {$\mathcal{D}_X$}, {$\mathcal{D}_Y$} make up the dataset, and {$\theta$} are our parameters, then probabilistically we have
{$ \log P(\mathcal{D}_X,\mathcal{D}_Y,\theta) = \log P(\mathcal{D}_X,\mathcal{D}_Y \mid \theta) + \log P(\theta) = -\textrm{loss}(\theta) - \textrm{regularizer}(\theta) $}
In practice, we can substitute whatever loss function we like and whatever regularization function we like, provided we can still solve the resulting optimization problem.
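As a minimal sketch of how a regularizer changes the estimate, consider fitting a single slope {$w$} by minimizing the squared loss plus an L2 penalty {$\lambda w^2$}. Setting {$\lambda = 0$} gives the unregularized (MLE-style) fit, while {$\lambda > 0$} corresponds to a MAP estimate under a zero-mean Gaussian prior on {$w$} with Gaussian noise. The data and the value of {$\lambda$} below are made up for illustration.

```python
def fit_slope(xs, ys, lam=0.0):
    """Minimize sum_i (y_i - w * x_i)^2 + lam * w^2 over a scalar w.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical data lying roughly on the line y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

w_mle = fit_slope(xs, ys)           # lam = 0: trust the training data fully
w_map = fit_slope(xs, ys, lam=5.0)  # lam > 0: shrink the slope toward zero
print(w_mle, w_map)
```

The regularized slope is always closer to zero than the unregularized one, and a larger {$\lambda$} means a stronger prior and less trust in the training data.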
Convex: Gradient descent. How do we solve optimization problems? We saw a very simple, powerful method for optimizing convex objectives: gradient descent (or ascent, for concave functions). All we need is to be able to compute a gradient (or any subgradient, if the function is not differentiable) at every point.
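The update rule can be sketched in a few lines. The example below assumes a fixed step size and two hypothetical objectives: a smooth convex quadratic, and the non-differentiable function {$|w - 2|$}, for which we supply a subgradient.

```python
def gradient_descent(grad, w0, step=0.1, n_iters=200):
    """Repeatedly step against the (sub)gradient, starting from w0."""
    w = list(w0)
    for _ in range(n_iters):
        g = grad(w)
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w

# Smooth convex example: f(w) = (w0 - 3)^2 + 2 * (w1 + 1)^2, minimized at (3, -1)
def quad_grad(w):
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]

# Non-differentiable convex example: f(w) = |w0 - 2|, minimized at w0 = 2
def abs_subgrad(w):
    if w[0] > 2.0:
        return [1.0]
    if w[0] < 2.0:
        return [-1.0]
    return [0.0]  # any value in [-1, 1] is a valid subgradient at the kink

w_quad = gradient_descent(quad_grad, [0.0, 0.0])
w_abs = gradient_descent(abs_subgrad, [0.0])
print(w_quad, w_abs)
```

On the quadratic, the iterates converge to the minimizer. On {$|w - 2|$}, a fixed step only brings the iterates to within about one step size of the minimizer, where they oscillate; a decaying step size is the usual remedy for subgradient methods.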