1. TA Notes - Introduction to machine learning

The idea of these TA notes is to help clarify points that the TAs found confusing when they were learning the material. Our hope is that these are the kinds of things a fellow grad student might tell you, but that there may not be time to cover in lecture. Please let us know if you find them helpful, or if you think something is confusing or has a mistake.

As we've talked about in class, machine learning is about learning to predict an "output" from some sort of "input." For example, we talked about learning to predict whether a user will like a movie based on input from other users, learning to predict control movements for a helicopter based on past flight data, and learning to predict the identity of a person in an image based on a database of labelled images. The goal of machine learning (and of this course) is not merely to succeed at these applications, but to develop a rich set of theorems and algorithms that can handle broad classes of problems. Each of the examples above falls into one of these broad classes, so improving one of the theorems or algorithms we'll talk about in class leads to advances well beyond the particular application a researcher happens to be studying.

2. Broad classes of problems

We can write down these basic classes using some simple mathematical notation. In general, we use {$X$} to represent the inputs to a problem and {$Y$} to represent the outputs. Regardless of {$X$}, we can then distinguish problems by the domain of {$Y$}: for example, classification when {$Y$} takes values in a discrete set (will this user like this movie?) and regression when {$Y$} is continuous (what control movements should the helicopter make?).
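To make this concrete, here is a rough sketch in Python of what input/output pairs might look like for two of the examples above; all of the numbers and names are invented for illustration and are not from the lectures.

[@
import numpy as np

# The same "predict an output Y from an input X" template, with different domains for Y.
# All of the values below are made up for illustration.

# Movie preferences: Y is a discrete label.
x_movie = np.array([5.0, 2.0, 4.5])     # e.g. another user's ratings of related movies
y_movie = 1                             # 1 = "will like this movie", 0 = "will not"

# Helicopter control: Y is a vector of real numbers.
x_heli = np.array([0.10, -0.32, 9.81])  # e.g. sensor readings describing the current state
y_heli = np.array([0.02, -0.15])        # e.g. the control movements to apply next
@]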
This is just a small sampling of the many machine learning problems that are out there.

3. Different methods for solving these problems

How do we go about solving one of these problems? In class so far, we've been talking about maximum likelihood estimation and Bayesian methods. To zoom out for a moment, let's look at how these fit into the larger picture of machine learning. Since we're trying to predict the output {$Y$} from an input {$X$}, there are two natural ways to use probabilistic reasoning: we could model the conditional distribution {$P(Y \mid X)$} directly and predict the {$Y$} that is most probable given the observed {$X$}; or we could model {$P(X \mid Y)$} and {$P(Y)$} and use Bayes' rule, {$P(Y \mid X) \propto P(X \mid Y)\,P(Y)$}, to make predictions.
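As a minimal sketch of these two options (a single real-valued input, a binary output, and made-up weights, Gaussians, and prior probabilities; none of these numbers come from the lectures), prediction might look like this:

[@
import numpy as np

# Toy setup: one real-valued input x, binary output y in {0, 1}.
# Every number below is invented purely for illustration.

x = 1.2  # the input we want to predict an output for

# --- Option 1: model P(Y | X) directly ---
# Here we simply posit a logistic form with some already-learned weights.
w, b = 2.0, -1.0
p_y1_direct = 1.0 / (1.0 + np.exp(-(w * x + b)))

# --- Option 2: model P(X | Y) and P(Y), then apply Bayes' rule ---
# Suppose each value of Y generates x from its own Gaussian, and Y has a prior.
def gaussian_pdf(value, mean, std):
    return np.exp(-0.5 * ((value - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

prior = {0: 0.6, 1: 0.4}                      # P(Y)
likelihood = {0: gaussian_pdf(x, 0.0, 1.0),   # P(X | Y = 0)
              1: gaussian_pdf(x, 2.0, 1.0)}   # P(X | Y = 1)

# Bayes' rule: P(Y = 1 | X = x) = P(x | Y = 1) P(Y = 1) / sum_y P(x | Y = y) P(Y = y)
evidence = sum(likelihood[y] * prior[y] for y in (0, 1))
p_y1_bayes = likelihood[1] * prior[1] / evidence

print(p_y1_direct, p_y1_bayes)  # two different estimates of P(Y = 1 | X = x)
@]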
The two approaches above are called discriminative and generative approaches, respectively. So far we have only described discriminative approaches, but we will see generative approaches soon.

Finally, we can also have algorithms that don't use probabilities at all. Such methods encompass many different algorithms, some of which we will cover later in the class (for instance, Support Vector Machines). Often, predicting {$Y$} from {$X$} looks something like this in these methods:

{$\hat{Y} = \arg\max_Y \, \mathbf{w} \cdot \mathbf{f}(X, Y)$}

This says that we output the guess {$\hat{Y}$} that has the highest score according to the dot product between a weight vector {$\mathbf{w}$} and a feature function {$\mathbf{f}(X, Y)$}. For example, a feature could be "{$X$} contains the words 'machine learning' and the label {$Y$} is 'spam'," which we would expect to have a negative or low weight in {$\mathbf{w}$}, since spam emails rarely talk about machine learning. Don't worry too much about this formulation yet, beyond the idea that we don't strictly need probabilities in order to solve many prediction problems.
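Here is a minimal sketch of this kind of scoring rule in Python; the feature function, labels, and weight vector are invented for illustration and are not from the lectures.

[@
import numpy as np

# Scoring-based prediction without probabilities:
# choose the label Y whose feature vector f(X, Y) has the highest dot product with w.

LABELS = ["spam", "not spam"]

def features(x, y):
    """f(X, Y): simple indicator features on (input, candidate label) pairs."""
    has_ml = "machine learning" in x
    return np.array([
        1.0 if (has_ml and y == "spam") else 0.0,      # contains "machine learning" AND label is spam
        1.0 if (has_ml and y == "not spam") else 0.0,  # contains "machine learning" AND label is not spam
        1.0 if y == "spam" else 0.0,                   # bias feature for the spam label
    ])

# A weight vector we pretend was already learned: the first feature gets a
# negative weight, since spam emails rarely mention machine learning.
w = np.array([-2.0, 1.5, 0.5])

def predict(x):
    """Return the label with the highest score w . f(x, y)."""
    return max(LABELS, key=lambda y: np.dot(w, features(x, y)))

print(predict("free money now"))                      # -> "spam"
print(predict("notes on machine learning homework"))  # -> "not spam"
@]

Roughly speaking, learning in these methods amounts to choosing {$\mathbf{w}$} so that the correct label tends to get the highest score, which is the setting we will return to for methods like Support Vector Machines.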