LinearAlgebraReview
Linear Algebra Notes and Resources:
- Linear Algebra lectures by Professor Gil Strang at MIT
- A Tutorial on Linear Algebra by Professor C. T. Abdallah
- Linear Algebra Review by Professor Fernando Paganini, UCLA
Machine learning relies on two basic areas of mathematics: linear algebra and probability. This lecture is a brief review of both of these subjects that you are expected to go over on your own. If you feel that you need more practice, a good book for matrices and linear algebra is Matrix Analysis and Applied Linear Algebra by Carl D. Meyer, and a good book for probability is A First Course in Probability, by Sheldon Ross. Any other books that you are comfortable with are fine, as this course simply uses the information in the books, and does not teach them.
NOTE: If the topics covered in this review are entirely new to you, we STRONGLY urge you to reconsider taking this class! Either drop the course, or take an incomplete and finish it later; the material in this section is only a review, and it is not intended to teach you the material for the first time.
Matrices and Linear Algebra
One description of Machine Learning is that it is fancy curve fitting; you have one or more linear equations that can make a prediction of some kind based on some number of inputs. For the moment, we're going to sidestep how you choose those linear equations, their coefficients, etc., and concentrate only on what you do once you have them (the bulk of the course is about finding those linear equations and their coefficients, so you'll see all that soon enough). So, let's jump in:
{$ \mathbf{y = X w } $}
Thus far, this is exceedingly uninformative. Let's break it down into its parts to see what is going on.
- {$ \mathbf{y} $} is a vector of outputs. The vector can be any size, but each entry in the vector will have some meaning. For example, if we were predicting the weather, {$ y_1 $} might be tomorrow's temperature, {$ y_2 $} might be the day after tomorrow's temperature, {$ y_3 $} might be tomorrow's humidity, etc. The meaning of each entry is fixed (tomorrow's temperature, etc.), but the value in an entry will always depend on the input matrix {$ \mathbf{X} $}.
- {$ \mathbf{X} $} is a matrix of inputs. The inputs are generally some set of measurements that we've made; for example, if we're trying to predict the weather, we might have temperature measurements from a number of points, humidity measurements, the time, etc. The only requirements on {$ \mathbf{X} $} are that every input is a number, and that the order and number of inputs remain constant. Thus, input {$ X_{11} $} might be the temperature at 1 pm outside of the Moore building, {$ X_{12} $} the temperature at 2 pm, etc. The next row might hold similar values but for one day prior. Once we've chosen an order and a set of values to use for the inputs, this can't be changed unless we also change the predictor {$ \mathbf{w} $}.
- {$ \mathbf{w} $} is our predictor (given to us by our fairy godmother). When we apply it to some input {$ \mathbf{X} $}, we'll get some vector output {$ \mathbf{y} $} that is a prediction or classification.
So another way of looking at {$ \mathbf{y = X w} $} is:
{$ \left[\begin{array}{c} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{array}\right] = \left[\begin{array}{ccccc} X_{11} & X_{12} & X_{13} & \ldots & X_{1m} \\ X_{21} & X_{22} & X_{23} & \ldots & X_{2m} \\ X_{31} & X_{32} & X_{33} & \ldots & X_{3m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & X_{n3} & \ldots & X_{nm} \end{array}\right] \times \mathbf{w} $}
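As a concrete sketch of what this product looks like in code, here it is computed with NumPy; the measurement values and weights below are invented purely for illustration and are not part of any real predictor.

```python
import numpy as np

# Hypothetical inputs: each row is one set of measurements,
# each column one kind of measurement (all numbers invented).
X = np.array([[68.0, 71.0, 0.45],   # e.g., temp at 1 pm, temp at 2 pm, humidity
              [70.0, 73.0, 0.40]])

# A predictor handed to us by our fairy godmother (also invented);
# one weight per input column.
w = np.array([0.6, 0.3, 10.0])

y = X @ w                            # one prediction per row of X
print(X.shape, w.shape, y.shape)     # (2, 3) (3,) (2,)
print(y)
```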
{$ \mathbf{w} $} is the most interesting part of the stuff above. We know we have some set of inputs, and we have some set of outputs we're interested in. We think that there is a relation between the two. We need {$ \mathbf{w} $} to be an accurate representation of this relation. Let's start with the simplest relation we can define. The simplest one is to copy some input variable to the output. So we choose one and do so:
{$ y_1 = X_{11} $}
Well, this is a good start, but what if our output {$ y_1 $} depends on several different inputs, the way today's weather depends on the temperature over the past several days, the humidity, etc.? Clearly, a better approximation is a combination of inputs:
{$ y_1 = X_{11} + X_{12} + X_{13} + \ldots $}
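As a tiny sketch (again with invented numbers), that unweighted combination is just the sum across one row of inputs:

```python
import numpy as np

# One row of hypothetical inputs feeding the output y_1 (numbers invented).
x_row = np.array([68.0, 71.0, 65.0])

# y_1 = X_11 + X_12 + X_13 + ... : every input counted equally.
y1 = x_row.sum()
print(y1)   # 204.0
```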
So far, so good. But what happens if we next try to predict the humidity using the same kind of equation?
{$ y_2 = X_{21} + X_{22} + X_{23} + \ldots $}
Well, we're giving the same weight to each of the inputs, which is a problem because humidity is measured as a percentage and is restricted to the range [0.0, 100.0], whereas the temperature in some parts of the world goes over 100.0. This is a problem not just because the output might be out of range, but also because the importance of the incoming input variables might differ depending on what we're trying to predict. It'd be best if we could scale the weight of each input individually:
{$ y_2 = X_{21} w_1 + X_{22} w_2 + X_{23} w_3 + \ldots $}
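For a single output, this weighted sum is just the dot product of one row of inputs with {$ \mathbf{w} $}. A minimal sketch, with invented values:

```python
import numpy as np

# One row of inputs and its weights (values invented for illustration).
x_row = np.array([70.0, 0.45, 13.0])   # e.g., temperature, humidity fraction, hour
w = np.array([0.8, 10.0, 0.1])

# y_2 = X_21 w_1 + X_22 w_2 + X_23 w_3 + ...
y2 = sum(x * w_j for x, w_j in zip(x_row, w))

# The same weighted sum, written as a dot product.
assert np.isclose(y2, x_row @ w)
print(y2)
```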
Writing this out for every output gives the full regression system:
{$ \begin{array}{ccccccccc} y_1 & = & X_{11} w_1 & + & X_{12} w_2 & + & \ldots & + & X_{1m} w_m \\ y_2 & = & X_{21} w_1 & + & X_{22} w_2 & + & \ldots & + & X_{2m} w_m \\ \vdots & = & \vdots & + & \vdots & + & \ddots & + & \vdots \\ y_n & = & X_{n1} w_1 & + & X_{n2} w_2 & + & \ldots & + & X_{nm} w_m \end{array} $}
Now we have a way of controlling the influence of any particular function and input on any particular output value.
But... we're lazy. We keep repeating the same {$w_1, \cdots, w_m$} on each line, and it'd be nice if there was a simple method of avoiding writing them over and over again. In short:
{$ \left[\begin{array}{c} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{array}\right] = \left[\begin{array}{ccccc} X_{11} & X_{12} & X_{13} & \ldots & X_{1m} \\ X_{21} & X_{22} & X_{23} & \ldots & X_{2m} \\ X_{31} & X_{32} & X_{33} & \ldots & X_{3m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & X_{n3} & \ldots & X_{nm} \end{array}\right] \times \left[\begin{array}{c} w_{1} \\ w_{2} \\ \vdots \\ w_{m} \end{array} \right] $}
If you notice, the matrix has {$n$} rows and {$m$} columns. By choosing to make the matrix like this, we can have a different number of inputs and outputs, which could be very handy. While we're at it, we can also see how matrices and vectors multiply against one another. Each row of the matrix is multiplied against the input column, producing the output {$ y_i = \mathbf{X}_{i*} \mathbf{w} = \sum_{j=1}^{m} X_{ij} w_j $}. This is the basic rule of matrix multiplication.
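Here is that row-times-column rule spelled out as explicit loops, checked against NumPy's own matrix-vector product; the numbers are arbitrary:

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # n = 2 rows, m = 3 columns
w = np.array([1.0, 2.0, 3.0])        # m = 3 entries

n, m = X.shape
y = np.zeros(n)
for i in range(n):                   # one output per row of X
    for j in range(m):               # y_i = sum_j X_ij * w_j
        y[i] += X[i, j] * w[j]

assert np.allclose(y, X @ w)         # matches the built-in product
print(y)                             # [14. 32.]
```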
A few things to notice about multiplying two matrices together. If the first matrix has {$k$} rows and {$l$} columns, the second matrix must have {$l$} rows; it can have any number of columns, say {$m$}, and the product then has {$k$} rows and {$m$} columns. This should be obvious from the example above: if the number of columns in {$\mathbf{X}$} were not the same as the number of rows in {$\mathbf{w}$}, there would be either too many or too few coefficients, which would make the multiplication illegal.
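A quick sketch of the dimension rule: a {$k \times l$} matrix times an {$l \times m$} matrix gives a {$k \times m$} result, and NumPy rejects the product outright when the inner dimensions disagree.

```python
import numpy as np

A = np.ones((4, 3))      # k = 4 rows, l = 3 columns
B = np.ones((3, 2))      # l = 3 rows, m = 2 columns

C = A @ B
print(C.shape)           # (4, 2): k rows, m columns

# Mismatched inner dimensions make the multiplication illegal.
try:
    A @ np.ones((2, 2))  # 3 columns against 2 rows
except ValueError as err:
    print("illegal multiplication:", err)
```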
Examples
Here are a couple of worked examples:
{$\left[\begin{array}{ccccccccc} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ \end{array} \right] \times \left[\begin{array}{c} 1 \\ 2 \\ 3 \end{array}\right] = \left[\begin{array}{c} 1*1 + 2*2 + 3*3 \\ 4*1 + 5*2 + 6*3 \\ 7*1 + 8*2 + 9*3 \end{array}\right] = \left[\begin{array}{c} 14 \\ 32 \\ 50 \end{array}\right]$}
{$\left[\begin{array}{ccccccccc} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \\ \end{array} \right] \times \left[\begin{array}{c} 1 \\ 2 \\ 3 \end{array}\right] = \left[\begin{array}{c} 1*1 + 2*2 + 3*3 \\ 4*1 + 5*2 + 6*3 \\ 7*1 + 8*2 + 9*3 \\ 10*1 + 11*2 + 12*3 \end{array}\right] = \left[\begin{array}{c} 14 \\ 32 \\ 50 \\ 68 \end{array}\right]$}
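Both examples are easy to check mechanically; here is a short NumPy verification:

```python
import numpy as np

A = np.arange(1, 10).reshape(3, 3)    # the 3x3 matrix from the first example
B = np.arange(1, 13).reshape(4, 3)    # the 4x3 matrix from the second example
v = np.array([1, 2, 3])

print(A @ v)   # [14 32 50]
print(B @ v)   # [14 32 50 68]
```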