
TA Notes - Introduction to machine learning

The idea of these TA notes is to help clarify points that the TAs found confusing when they were learning the material. Our hope is that these are the kinds of things that a fellow grad student might tell you, but that there may not be time to cover in lectures. Please let us know if you find this helpful or if you think it is confusing or has mistakes.

As we've talked about in class, machine learning is about trying to learn to predict an "output" from some sort of "input." For example, we talked about learning to predict whether or not a user will like a movie based on input from other users, learning to predict control movements for a helicopter based on past flight data, or learning to predict the identity of a person in an image based on a database of labelled images. The goal of machine learning (and this course) is not merely to succeed at these applications but to develop a rich set of theorems and algorithms that can handle broad classes of problems. Each of the examples we have considered so far falls into one of these broad classes. This way, improving one of the theorems or algorithms we'll talk about in class leads to advances beyond the particular application that a researcher might be studying.

Broad classes of problems

We can write down these broad classes using some basic mathematical notation. In general, we use {$X$} to represent the inputs to the problem, and {$Y$} to represent the outputs. Regardless of {$X$}, we can then distinguish problems by the domain of {$Y$}:

  • Classification: {$Y \in \{0,1\}$} or {$Y \in \{-1,+1\}$}. Predicting which of two classes a given input belongs to, typically labelled true/false or positive/negative. (e.g., face detection).
  • Regression: {$Y \in \mathbb{R}$}. Predicting a real number. (e.g., predicting gas prices).
  • Multi-class Classification: {$Y \in \{1, \dots, K\}$}. Predicting which of {$K$} classes an input belongs to. (e.g., face recognition).
  • Density estimation: There is no {$Y$}; instead, we are trying to estimate a probability distribution over {$X$}, {$P(X)$}. (e.g., the coin-flipping example.)

This is just a small sampling of the many machine learning problems that are out there.
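
To make the label spaces concrete, here is a minimal sketch (in Python, with made-up toy data) of what {$Y$} looks like for each of the classes above; the variable names and numbers are purely for illustration.

[@
# Toy illustration of the label space Y for each problem class; all data are made up.
import numpy as np

X = np.array([[5.1, 3.5], [6.2, 2.9], [4.7, 3.2]])   # three inputs, two features each

y_binary     = np.array([0, 1, 0])                    # classification: Y in {0, 1}
y_regression = np.array([3.25, 4.10, 2.87])           # regression: Y is a real number (e.g., a price)
y_multiclass = np.array([2, 7, 5])                    # multi-class: Y in {1, ..., K} (e.g., identities)
# density estimation: there is no Y; we model P(X) itself,
# e.g., by fitting a Gaussian to the inputs.
mu, sigma = X.mean(axis=0), X.std(axis=0)
@]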

Different methods for solving these problems

How do we go about solving one of these problems? In class so far, we've been talking about maximum likelihood estimation and Bayesian methods. To zoom out for a moment, let's look at how this fits into the larger picture of machine learning. Since we're trying to predict the output {$Y$} from an input {$X$}, let's see how we could use probabilistic reasoning:

  • We can try to solve the problem directly and compute a formula for {$P(Y|X),$} the probability of {$Y$} given a value of {$X.$} How do we "compute a formula" for {$P(Y|X)?$} We need to create a model with parameters that we can tune until we think we have a good estimate of {$P(Y|X).$} For example, we've talked about a probabilistic linear regression model: {$p(Y|\mathbf{X}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(Y - \mathbf{Xw})^2}{2\sigma^2} \right)$} and we've talked about using MLE and MAP techniques to find a value for the parameters {$\mathbf{w}$} that we think will be useful for prediction in the future. (A short code sketch of the MLE for this model appears after this list.)
  • We can solve the problem indirectly, using Bayes' rule: {$P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}.$} This means that if we find it too difficult to figure out an equation for {$P(Y|X),$} we can instead write down a model of how the inputs {$X$} are influenced by the output {$Y.$} We haven't talked about this yet in class, but why might it be useful? Suppose you're trying to detect spam from email text. We have a binary decision, and a simple, intuitive model for {$P(X|Y)$} is that, given that an email is spam, we would expect a very different distribution over words than if the email is not spam. In other words, our reasoning is "if {$Y$} is spam, then {$X$} will be different," not "if {$X$} changes, we expect {$Y$} to change like so." (A rough code sketch of this idea appears below.)

The first approach above is called discriminative, and the second generative. In class so far we have only worked with discriminative approaches, but we will see generative approaches soon.
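
As a concrete illustration of the discriminative route, here is a minimal sketch (in Python/NumPy, with made-up data) of maximum likelihood estimation for the Gaussian linear regression model above. Under that model, maximizing the likelihood over {$\mathbf{w}$} is equivalent to minimizing squared error, so the MLE has a closed form; the function name and data are just for illustration.

[@
# Minimal MLE sketch for p(Y|X) = N(Xw, sigma^2); maximizing the likelihood
# in w is the same as solving a least-squares problem.
import numpy as np

def mle_linear_regression(X, Y):
    """Return the maximum-likelihood weights w for the Gaussian linear model."""
    w, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least-squares solution = MLE
    return w

# Made-up data: Y is roughly 2*x1 - 1*x2 plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)
print(mle_linear_regression(X, Y))   # prints weights close to [2, -1]
# A MAP estimate with a Gaussian prior on w would instead add a ridge penalty.
@]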

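For the generative route, one simple way to realize the spam intuition above is to assume the words in an email are independent given the label (this is the "naive Bayes" model, which we have not covered in class yet, so treat this as a sketch rather than the definitive method). The code below estimates {$P(Y)$} and {$P(\textrm{word}|Y)$} from made-up data and predicts with Bayes' rule; since {$P(X)$} is the same for every candidate {$Y$}, it can be ignored when taking the max.

[@
# Rough sketch of a generative spam model; all emails and counts are made up.
from collections import Counter
import math

def train(emails, labels):
    """Estimate P(Y) and smoothed per-class word distributions P(word|Y)."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}            # P(Y)
    counts = {c: Counter() for c in classes}
    for words, y in zip(emails, labels):
        counts[y].update(words)
    vocab = {w for words in emails for w in words}
    likelihood = {c: {w: (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))
                      for w in vocab}                                      # P(word|Y), add-one smoothed
                  for c in classes}
    return prior, likelihood, vocab

def predict(words, prior, likelihood, vocab):
    """arg max_Y of log P(Y) + sum_w log P(w|Y); the P(X) denominator cancels."""
    def score(c):
        return math.log(prior[c]) + sum(math.log(likelihood[c][w])
                                        for w in words if w in vocab)
    return max(prior, key=score)

emails = [["cheap", "pills", "now"], ["meeting", "notes", "attached"],
          ["cheap", "meeting", "pills"], ["machine", "learning", "notes"]]
labels = ["spam", "ham", "spam", "ham"]
prior, likelihood, vocab = train(emails, labels)
print(predict(["cheap", "pills"], prior, likelihood, vocab))   # -> "spam"
@]
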
Finally, we can also have algorithms that don't use probabilities at all. Such methods encompass many different algorithms, some of which we will cover later in the class (for instance, Support Vector Machines). In these methods, predicting {$Y$} from {$X$} often looks something like this:

{$\hat{Y} = \arg\max_Y \mathbf{w} \cdot \mathbf{f}(X,Y)$}

This says that we output the guess {$\hat{Y}$} that has the highest score according to the dot product between a weight vector {$\mathbf{w}$} and a feature function {$\mathbf{f}(X,Y)$}. For example, a feature could be, "{$X$} has the words 'machine learning' and the {$Y$} label is 'spam'", which we would expect to have a negative or low weight in {$\mathbf{w}$}, since spam emails rarely talk about machine learning. Don't worry too much about this sort of formulation yet, beyond the idea that we don't strictly need probabilities in order to solve many prediction problems.
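
To make the formula concrete, here is a toy sketch (in Python) of scoring candidate labels with {$\mathbf{w} \cdot \mathbf{f}(X,Y)$} and taking the argmax. The features and weights are invented by hand for illustration; in practice a learning algorithm (such as an SVM) would set the weights.

[@
# Toy sketch of prediction as arg max_Y of w . f(X, Y); features and weights are made up.
def features(x_words, y):
    """f(X, Y): indicator features on (word in X, candidate label Y) pairs."""
    return {(w, y): 1.0 for w in x_words}

# Hand-set weights; note the negative weight on ('machine', 'spam').
w = {("machine", "spam"): -2.0, ("learning", "spam"): -2.0, ("cheap", "spam"): 3.0,
     ("machine", "ham"): 1.0, ("learning", "ham"): 1.0, ("cheap", "ham"): -1.0}

def predict_label(x_words, label_set=("spam", "ham")):
    """Return the Y with the highest score w . f(X, Y)."""
    def score(y):
        return sum(w.get(k, 0.0) * v for k, v in features(x_words, y).items())
    return max(label_set, key=score)

print(predict_label(["machine", "learning", "cheap"]))   # -> 'ham' with these weights
@]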