What is Machine Learning?

Definition 1. A funny thing happened on the way to AI.

Definition 2. Statistics + Algorithms

Definition 3. Fancy Function Fitting

Here’s ML about 50 years ago: from the vault.

Types of ML

Supervised Learning: classification, regression, ranking
In a nutshell: observe n training examples {$D = \{x_i,y_i\}$}, and learn a function mapping any input x to output y: {$h(x) \approx y$}.
Unsupervised Learning: clustering, dimensionality reduction
In a nutshell: observe n examples {$D = \{x_i\}$}, find a well-separated partition of examples {$D = \{ D_1 \cup D_2 \cup \ldots \cup D_k\}$} or low-distortion, low-dimensional projection {$y=Px$}, where the number of dimensions of {$y$} is less than (sometimes much less than) the number of dimensions of {$x$}. The goal is to replace the high-complexity description with a lower-complexity one.
Reinforcement Learning: learning to act from delayed feedback.
In a nutshell: a policy maps system states to actions: {$\pi(s)=a$}. Learning consists of trying control policies {$\{\pi_i\}$} on a system, observing system states {$\{\mathbf{S}_i\}$} and receiving rewards {$\{\mathbf{R}_i\}$} to come up with a policy {$\pi$} with high expected reward. A simple example is learning a new video game (without reading instructions): you try a bunch of moves and actions and see what leads to earning points and moving on to the next level.
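
To make the supervised case concrete, here is a minimal Python sketch, assuming a one-dimensional input and a linear hypothesis {$h(x) = wx + b$} fit by least squares. The function name `fit_linear` and the toy data are illustrative assumptions, not part of the course material.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of h(x) = w*x + b to training pairs (x_i, y_i)."""
    # Design matrix: one column for the input, one column of ones for the bias.
    A = np.column_stack([x, np.ones_like(x)])
    (w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda x_new: w * x_new + b

# Toy training set D = {(x_i, y_i)}, made up for illustration (y = 2x + 1).
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

h = fit_linear(x_train, y_train)
prediction = h(4.0)  # apply the learned function to an unseen input
```

Any other hypothesis class (trees, nearest neighbors, and so on) fits the same pattern: training data in, a function {$h$} out.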

Some examples

Vision: detecting faces in photographs

Face detection on a cell phone

Your new camera detects faces in images for better focusing and light-metering. This is probably the most-fielded example of classification. Photo management software clusters faces in your images to index photos by people in them.

Speech: recognizing dictation

Google Search via Speech

You can dictate your blog or search the web by voice. Speech recognition systems are built using (mostly) supervised learning.

Text: News digest

Google News

Google clusters news stories into groups on the same topic and classifies them into sections like World, Sports, Entertainment, Tech, etc. This is primarily an example of unsupervised learning.

Recommendation systems: Netflix

Netflix

Netflix uses your movie ratings to predict what other movies you might like. This is a very useful (and profitable) example of supervised learning: regression (although it uses unsupervised techniques like dimensionality reduction).

Games: Microsoft Xbox Kinect

Kinect

Kinect uses supervised learning to detect joint positions from depth images (decision trees, actually, a lot of them).

Learning to control robots: Pancake flippin’

YouTube Video. Learning to Flip Pancakes

Complex control policies can be learned using reinforcement learning.

ML is (often) modeling a probability distribution

Probabilistic reasoning is central to many machine learning tasks. Probability is an extremely useful way of quantifying our beliefs about the state of the world.

Generative vs. discriminative models. Many of the methods discussed in this course model one of the following probability distributions:

	Generative (Unsupervised)	Generative	Discriminative
	{$P(X)$}	{$P(X,Y)$}	{$P(Y \mid X)$}
Example	GMM	Naive Bayes	Logistic Regression

Using the basic rules of probability, we can compute a discriminative posterior {$P(Y \mid X) = P(X,Y) /\sum_{Y'} P(X,Y')$} for any generative model in order to make decisions. In lecture, we showed that NB and LR can both lead to the same form of posterior, although the parameters estimated will generally be different.
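
As a small illustration of that normalization, the sketch below assumes a discrete {$X$} and {$Y$} whose joint distribution is stored as a table. The numbers in `joint` are hypothetical; only the computation matters.

```python
import numpy as np

# Hypothetical joint table P(X, Y): rows index 3 values of X, columns 2 classes of Y.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.10],
                  [0.25, 0.05]])

def posterior(joint):
    """P(Y | X) = P(X, Y) / sum_{Y'} P(X, Y'), computed for every value of X."""
    marginal_x = joint.sum(axis=1, keepdims=True)  # P(X), summing over classes
    return joint / marginal_x

post = posterior(joint)
decision = post.argmax(axis=1)  # most probable class for each value of X
```

Each row of `post` sums to one, and `decision` picks the class with the highest posterior for each input value.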

Any generative model can be used as an unsupervised method. Since {$P(X) = \sum_Y P(X,Y)$}, a generative model defines a distribution over {$X$} by marginalizing out the class labels. If the class labels are unknown, we can use EM to estimate them! In this sense, a GMM is just a Gaussian Naive Bayes model with unknown classes.
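
The marginalization can be sketched in a few lines, assuming a one-dimensional Gaussian Naive Bayes model with two classes. The priors, means, and standard deviations below are made-up values; the point is only that summing the joint over {$Y$} yields a mixture-of-Gaussians density over {$X$}.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Hypothetical model parameters: P(Y) and the class-conditional Gaussians P(X | Y).
priors = np.array([0.3, 0.7])
mus = np.array([-2.0, 1.0])
sigmas = np.array([0.5, 1.0])

def joint(x, y):
    """P(X = x, Y = y) = P(Y = y) * P(X = x | Y = y)."""
    return priors[y] * gaussian_pdf(x, mus[y], sigmas[y])

def marginal(x):
    """P(X = x) = sum_y P(X = x, Y = y): a two-component Gaussian mixture."""
    return sum(joint(x, y) for y in range(len(priors)))

# Sanity check: the marginal should integrate to about 1 (simple Riemann sum).
grid = np.linspace(-10.0, 10.0, 20001)
area = np.sum(marginal(grid)) * (grid[1] - grid[0])
```

Estimating `priors`, `mus`, and `sigmas` without labels is the job of EM, which this sketch does not implement.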

Graphical models cover all probability models we discussed. Graphical models are an efficient and intuitive way of encoding a set of independence assumptions about a set of random variables. As an exercise, try to draw the models to represent Naive Bayes and Logistic Regression. Once you have learned about graphical models, it’s rare to ever talk about probabilities again without them!

ML is (often) optimization

While probability is the language by which we express our beliefs about the state of the universe, in order to talk about achieving goals we need to introduce the language of optimization.

Objective (Loss) functions. The standard machine learning paradigm is to (1) define an objective function that captures performance of a model on a given task and (2) optimize that objective with respect to some parameters. We saw a variety of loss functions in this course: (here, {$f(x)$} is a predictive model that returns a real-valued number)

0–1 (binary)	Squared	Exponential	Log	Hinge
{$\mathbf{1}(y \ne \textrm{sign}(f(x)))$}	{$(y-f(x))^2$}	{$\exp\{-yf(x)\}$}	{$\log(1 + \exp\{-yf(x)\}) $}	{$\max\{0, 1-yf(x)\} $}

The key thing to remember is that we can often understand the behavior of an algorithm by figuring out and analyzing the behavior of the loss function that it optimizes and how it optimizes it.
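
For reference, the five losses in the table can be written directly in Python. This is a sketch assuming labels {$y \in \{-1, +1\}$} and real-valued scores {$f(x)$}; the function names are our own.

```python
import numpy as np

def zero_one(y, f):
    # 1 when the sign of the score disagrees with the label (a score of 0 counts as an error).
    return np.asarray(y != np.sign(f), dtype=float)

def squared(y, f):
    return (y - f) ** 2

def exponential(y, f):
    return np.exp(-y * f)

def log_loss(y, f):
    return np.log1p(np.exp(-y * f))  # log(1 + exp(-y f))

def hinge(y, f):
    return np.maximum(0.0, 1.0 - y * f)

# Evaluate each loss for a positive example at a few scores.
y = 1.0
f = np.array([-1.0, 0.0, 0.5, 2.0])
losses = {fn.__name__: fn(y, f) for fn in (zero_one, squared, exponential, log_loss, hinge)}
```

Comparing the rows of `losses` shows how differently the losses treat a confident correct prediction ({$f = 2$}): hinge and 0–1 charge nothing, exponential and log charge a little, and squared loss penalizes it as much as the undecided score {$f = 0$}.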

Regularization: MLE vs. MAP. We will learn that the phenomenon of overfitting can occur if we optimize our objective on training data too tightly. To explicitly control the extent to which we “trust” the training data, we introduce the concept of priors or regularization. If {$\mathcal{D}_X$}, {$\mathcal{D}_Y$} are the dataset, and {$\theta$} our parameters, then probabilistically we have

{$ \log P(\mathcal{D}_X,\mathcal{D}_Y,\theta) = \log P(\mathcal{D}_X,\mathcal{D}_Y \mid \theta) + \log P(\theta) = -\textrm{loss}(\theta) - \textrm{regularizer}(\theta) $}

In practice, we can substitute whatever loss function we like and whatever regularization function we like, provided we can still solve the resulting optimization problem.
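
One familiar instance, sketched here under stated assumptions: squared loss (a Gaussian likelihood) combined with a Gaussian prior on {$\theta$}, whose negative log is a squared-norm penalty. The resulting problem has a closed-form solution. The weight `lam`, the function name `fit_map`, and the synthetic data are illustrative choices.

```python
import numpy as np

def fit_map(X, y, lam):
    """Minimize ||X theta - y||^2 + lam * ||theta||^2; lam = 0 recovers the MLE."""
    d = X.shape[1]
    # Setting the gradient to zero gives (X^T X + lam I) theta = X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 3.0]) + 0.1 * rng.normal(size=20)

theta_mle = fit_map(X, y, lam=0.0)   # trusts the training data completely
theta_map = fit_map(X, y, lam=10.0)  # pulled toward zero by the prior
```

Increasing `lam` shrinks the norm of the estimate, which is the sense in which the prior limits how much we trust the training data.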

Convex: Gradient descent. How do we solve optimization problems? We will see a very simple, powerful method for optimizing convex objectives: gradient descent (or ascent, for concave functions). All we need is to be able to compute a gradient (or any sub-gradient, if the function is not differentiable) at every point.
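
A minimal sketch of the idea, assuming a fixed step size and a fixed number of iterations (both arbitrary choices here; practical implementations usually tune the step size and test for convergence). The objective is a simple convex quadratic chosen so the answer is known in advance.

```python
import numpy as np

def gradient_descent(grad, theta0, step=0.1, n_steps=200):
    """Repeatedly step against the gradient, starting from theta0."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - step * grad(theta)
    return theta

# Convex objective f(theta) = ||theta - target||^2, with gradient 2 * (theta - target).
target = np.array([3.0, -1.0])

def grad_f(theta):
    return 2.0 * (theta - target)

theta_hat = gradient_descent(grad_f, theta0=[0.0, 0.0])
```

For a non-differentiable convex function, `grad` can return any sub-gradient instead, though a decreasing step size is then typically needed.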

ML in practice

Problem is determined by the labels (or lack thereof). What makes a machine learning problem? The simplest way of phrasing the ML problems we’ve discussed is: “given some computed features {$X$}, predict something {$Y$}.” Depending on whether or not {$Y$} is given to us, we have unsupervised (not given) vs supervised (given) vs semi-supervised (some {$Y$}’s are given) problems. (Note that we didn’t discuss semi-supervised much in this course, though it came up on the project.) Depending on the domain of {$Y$}, we end up with binary classification, regression, multi-class, etc. In this sense, one take-away from this course should be knowing how to approach a real-world problem and formulate it as a machine learning task, with some idea of algorithms you could use to begin to tackle the problem.

Cross-validation. As described in the video, cross-validation is an incredibly important tool in an ML practitioner’s toolbox. It’s critical for training, evaluating, and selecting between different models. At this point, you should know how to approach a problem, divide the data into training and test sets, and compare different algorithms on that problem.
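
The procedure can be sketched as follows, assuming k-fold cross-validation with mean squared error as the evaluation metric. The helper names (`k_fold_indices`, `cross_val_error`, `fit_ridge`), the candidate values of `lam`, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cross_val_error(fit, predict, X, y, k=5):
    """Average held-out squared error over k folds."""
    folds = k_fold_indices(len(y), k)
    errors = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, then evaluate on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        errors.append(np.mean((predict(model, X[test_idx]) - y[test_idx]) ** 2))
    return float(np.mean(errors))

def fit_ridge(lam):
    """Return a fit function for L2-regularized least squares with weight lam."""
    def fit(X, y):
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return fit

def predict(theta, X):
    return X @ theta

# Synthetic regression data, made up for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=40)

# Compare candidate models by their cross-validated error.
cv_errors = {lam: cross_val_error(fit_ridge(lam), predict, X, y) for lam in (0.0, 1.0, 100.0)}
best_lam = min(cv_errors, key=cv_errors.get)
```

The same loop works for comparing entirely different algorithms: only `fit` and `predict` change. A final test set, untouched during model selection, is still needed for an honest estimate of performance.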

Lessons from the project. The final project is arguably the most practical learning experience of the course. You will see that overfitting is a crucial problem, and occurs both with algorithmic estimation and by human designers.