# Fun history:

Here's ML about 50 years ago: from the vault.

# Types of ML

• Supervised Learning: classification, regression, ranking
In a nutshell: observe n training examples {$D = \{x_i,y_i\}$}, and learn a function mapping any input x to output y: {$h(x) \approx y$}.
• Unsupervised Learning: clustering, dimensionality reduction
In a nutshell: observe n examples {$D = \{x_i\}$} and find either a well-separated partition of the examples, {$D = D_1 \cup D_2 \cup \ldots \cup D_k$}, or a low-distortion, low-dimensional projection {$y=Px$}, where the number of dimensions of {$y$} is less than (sometimes much less than) the number of dimensions of {$x$}. The goal is to replace a high-complexity description of the data with a lower-complexity one.
• Reinforcement Learning: learning to act from delayed feedback.
In a nutshell: a policy maps system states to actions: {$\pi(s)=a$}. Learning consists of trying control policies {$\{\pi_i\}$} on the system, observing the resulting system states {$\{\mathbf{S}_i\}$} and rewards {$\{\mathbf{R}_i\}$}, and using them to come up with a policy {$\pi$} with high expected reward. A simple example is learning a new video game (without reading the instructions): you try a bunch of moves and actions and see what leads to earning points and moving on to the next level.
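To make the supervised case concrete, here is a minimal sketch of learning {$h(x) \approx y$} from a handful of examples. It assumes the simplest possible hypothesis class, a line {$h(x) = ax + b$} fit by least squares; the function name `fit_line` and the toy data are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of h(x) = a * x + b to paired examples (x_i, y_i)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # intercept
    return a, b


# Hypothetical training set D = {(x_i, y_i)}, generated from y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

a, b = fit_line(train_x, train_y)


def h(x):
    """The learned function: maps any input x to a predicted output y."""
    return a * x + b
```

After training, `h` can be applied to inputs that never appeared in the training set, such as `h(4.0)`, which is the whole point of learning a function rather than memorizing the examples.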

# Some examples

## Vision: detecting faces in photographs

Face detection on a cell phone

Your new camera detects faces in images for better focusing and light-metering. This is probably the most widely deployed example of classification. Photo management software clusters the faces in your images to index photos by the people in them.

## Speech: recognizing dictation

You can dictate your blog or search the web by voice. Speech recognition systems are built using (mostly) supervised learning.

## Text: News digest

Google clusters news stories into groups on the same topic and classifies them into sections like World, Sports, Entertainment, Tech, etc. This is primarily an example of unsupervised learning.

## Recommendation systems: Netflix

Netflix uses your movie ratings to predict what other movies you might like. This is a very useful (and profitable) example of supervised learning (regression), although it also uses unsupervised techniques such as dimensionality reduction.

## Games: Microsoft Xbox Kinect

Kinect uses supervised learning to detect joint positions from depth images (using decision trees, and a lot of them).

## Learning to control robots: Pancake flippin'

YouTube Video. Learning to Flip Pancakes

Complex control policies can be learned using reinforcement learning.

### ML is (often) modeling a probability distribution

Probabilistic reasoning is central to many machine learning tasks. Probability is an extremely useful way of quantifying our beliefs about the state of the world.

Generative vs. discriminative models. Many of the methods discussed in this course model one of the following probability distributions:

| | Generative (Unsupervised) | Generative | Discriminative |
|---|---|---|---|
| Distribution | {$P(X)$} | {$P(X,Y)$} | {$P(Y \mid X)$} |
| Example | GMM | Naive Bayes | Logistic Regression |

Using the basic rules of probability, we can compute a discriminative posterior {$P(Y \mid X) = P(X,Y) /\sum_{Y'} P(X,Y')$} for any generative model in order to make decisions. In lecture, we showed that NB and LR can both lead to the same form of posterior, although the parameters estimated will generally be different.
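The normalization above is easy to see on a small discrete example. The following is a sketch only: the joint table, its variable names, and the helper `posterior` are invented for illustration and assume {$X$} and {$Y$} each take a few discrete values.

```python
def posterior(joint, x):
    """Return P(Y | X = x) from a joint table {(x, y): P(X = x, Y = y)}."""
    # Keep only the entries that agree with the observed x.
    scores = {y: p for (xv, y), p in joint.items() if xv == x}
    # Denominator: sum over Y' of P(x, Y'), which is the marginal P(X = x).
    evidence = sum(scores.values())
    return {y: p / evidence for y, p in scores.items()}


# Hypothetical joint distribution P(X, Y); the four entries sum to 1.
joint = {
    ("rain", "late"): 0.20,
    ("rain", "on_time"): 0.10,
    ("sun", "late"): 0.10,
    ("sun", "on_time"): 0.60,
}

post = posterior(joint, "rain")
```

Here `post["late"]` is 0.20 / (0.20 + 0.10), about 0.67, so a decision rule that picks the most probable label would predict "late" when it rains.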

Any generative model can be used as an unsupervised method. Since {$P(X) = \sum_Y P(X,Y)$}, a generative model defines a distribution over {$X$} by marginalizing out the class labels. If the class labels are unknown, we can use EM to estimate them! In this sense, a GMM is just a Gaussian Naive Bayes model with unknown classes.
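As a rough illustration of "use EM to estimate the unknown classes", here is a sketch of EM for a one-dimensional mixture of Gaussians. The function names, the toy data, the initial parameters, and the variance floor are all assumptions made for the example, not a reference implementation.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(xs, means, variances, weights, iters=50, var_floor=1e-6):
    """Run EM on a 1-D Gaussian mixture, starting from the given parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iters):
        # E-step: posterior over the hidden class, P(Y = j | x_i),
        # obtained by normalizing the joint P(x_i, Y = j).
        resp = []
        for x in xs:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: weighted maximum-likelihood estimates, with the
        # responsibilities standing in for the unknown labels.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, var_floor)  # guard against collapse
            weights[j] = nj / len(xs)
    return means, variances, weights


# Hypothetical unlabeled data drawn around -2 and +2.
data = [-2.1, -1.9, -2.0, -2.2, -1.8, 2.0, 2.1, 1.9, 2.2, 1.8]
means, variances, weights = em_gmm_1d(data, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
```

With labels observed, the M-step alone would be the usual Gaussian Naive Bayes fit; EM simply replaces the hard labels with soft responsibilities.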

Graphical models cover all probability models we discussed. Graphical models are an efficient and intuitive way of encoding a set of independence assumptions about a set of random variables. As an exercise, try to draw the models to represent Naive Bayes and Logistic Regression. Once you have learned about graphical models, it's rare to ever talk about probabilities again without them!

### ML is (often) optimization

While probability is the language by which we express our beliefs about the state of the universe, in order to talk about achieving goals we need to introduce the language of optimization.

Objective (Loss) functions. The standard machine learning paradigm is to (1) define an objective function that captures the performance of a model on a given task and (2) optimize that objective with respect to some parameters. We saw a variety of loss functions in this course (here, {$f(x)$} is a predictive model that returns a real-valued number):

| 0-1 (binary) | Squared | Exponential | Log | Hinge |
|---|---|---|---|---|
| {$\mathbf{1}(y \ne \textrm{sign}(f(x)))$} | {$(y-f(x))^2$} | {$\exp\{-yf(x)\}$} | {$\log(1 + \exp\{-yf(x)\})$} | {$\max\{0, 1-yf(x)\}$} |

The key thing to remember is that we can often understand the behavior of an algorithm by figuring out and analyzing the behavior of the loss function that it optimizes and how it optimizes it.
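The five losses listed above can be written out directly, which makes it easy to compare how each one penalizes the same prediction. This is a sketch that assumes labels {$y \in \{-1, +1\}$} and the convention {$\textrm{sign}(0) = +1$}; the function names are chosen here for illustration.

```python
import math


def sign(z):
    """Sign with the (assumed) convention sign(0) = +1."""
    return 1 if z >= 0 else -1


def zero_one_loss(y, fx):
    return 1.0 if y != sign(fx) else 0.0


def squared_loss(y, fx):
    return (y - fx) ** 2


def exponential_loss(y, fx):
    return math.exp(-y * fx)


def log_loss(y, fx):
    return math.log(1.0 + math.exp(-y * fx))


def hinge_loss(y, fx):
    return max(0.0, 1.0 - y * fx)


# A confidently wrong prediction: true label +1, score f(x) = -0.5.
y, fx = 1, -0.5
losses = {
    "0-1": zero_one_loss(y, fx),
    "squared": squared_loss(y, fx),
    "exponential": exponential_loss(y, fx),
    "log": log_loss(y, fx),
    "hinge": hinge_loss(y, fx),
}
```

Note that the exponential, log, and hinge losses depend on the prediction only through the margin {$yf(x)$}, while the squared loss also penalizes predictions that are correct but far from {$y$}.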

Regularization: MLE vs. MAP. Overfitting can occur if we optimize our objective on the training data too tightly. To explicitly control the extent to which we "trust" the training data, we introduce the concept of priors, or regularization. If {$\mathcal{D}_X$} and {$\mathcal{D}_Y$} together make up the dataset and {$\theta$} are our parameters, then probabilistically we have

{$\log P(\mathcal{D}_X,\mathcal{D}_Y,\theta) = \log P(\mathcal{D}_X,\mathcal{D}_Y \mid \theta) + \log P(\theta) = -\textrm{loss}(\theta) + \textrm{regularizer}(\theta)$}

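Here the loss is the negative log-likelihood and the regularizer is the log-prior, so MLE maximizes the first term alone while MAP maximizes the sum. In practice, we can substitute whatever loss function we like and whatever regularization function we like, provided we can still solve the resulting optimization problem.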
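A small worked case shows the effect of the prior. The sketch below assumes a one-parameter linear model {$y \approx \theta x$} with squared loss and an L2 penalty {$\lambda \theta^2$}, which corresponds (up to constants and scaling) to Gaussian noise and a zero-mean Gaussian prior on {$\theta$}. The function name, the data, and the value of {$\lambda$} are made up for illustration.

```python
def fit_slope(xs, ys, lam=0.0):
    """Minimize sum_i (y_i - theta * x_i)^2 + lam * theta^2 over a scalar theta.

    Setting the derivative to zero gives theta = sum(x*y) / (sum(x*x) + lam).
    lam = 0 is the MLE; lam > 0 is a MAP estimate under a Gaussian prior.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical training data, roughly y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

theta_mle = fit_slope(xs, ys)            # trusts the data completely
theta_map = fit_slope(xs, ys, lam=14.0)  # shrinks the estimate toward 0
```

Increasing `lam` expresses less trust in the training data: the estimate is pulled further toward the prior mean of zero.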

Optimization. How do we solve optimization problems? We will see a very simple, powerful method for optimizing convex objectives: gradient descent (or gradient ascent for concave functions). All we need is to be able to compute a gradient (or any sub-gradient, if the function is not differentiable) at every point. We'll see lots of variations on this.
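The basic loop is short enough to sketch in full. This version assumes a fixed step size and a fixed number of iterations, both chosen by hand; the function names and the quadratic test objective are illustrative only.

```python
def gradient_descent(grad, theta0, step_size=0.1, iters=200):
    """Repeatedly step against the gradient from theta0 and return the final point.

    `grad` maps a parameter vector (list of floats) to its gradient
    (or to any sub-gradient, for a non-differentiable convex objective).
    """
    theta = list(theta0)
    for _ in range(iters):
        g = grad(theta)
        theta = [t - step_size * gi for t, gi in zip(theta, g)]
    return theta


def grad_f(theta):
    """Gradient of the convex objective f(theta) = (theta_0 - 3)^2 + 2 * (theta_1 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)]


theta_star = gradient_descent(grad_f, [0.0, 0.0])
```

For this objective the iterates converge to the minimizer (3, -1). With a step size that is too large the updates would overshoot and diverge, which is one reason so many variations on the basic method exist.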