Netflix

The data
Modeling method

The winning team used a weighted average of many methods, including clever versions of k-NN and PCA.

Nearest Neighbor Method

Estimate how well user {$u$} will like a movie {$i$} by taking a similarity-weighted average of the neighboring movies {$j$}:

{$ \hat{r}_{ui} = b_{ui} + \sum_{j \in N(i;u)} s_{ij} (r_{uj} - b_{uj}) / \sum_{j \in N(i;u)} s_{ij} $}

{$r_{ui}$} = rating by user {$u$} for movie {$i$}
{$b_{ui}$} = baseline rating, e.g. the mean rating of user {$u$} or of movie {$i$}
{$s_{ij}$} = similarity between movies {$i$} and {$j$}
{$N(i;u)$} = the set of {$k$} (typically 20–50) movies most similar to movie {$i$} for which user {$u$} has provided a rating

Note the trick of subtracting off the baseline, e.g. the average rating of each user. Standard similarity measures have a number of limitations.
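A minimal sketch of this neighborhood prediction rule in Python/NumPy. The names here (R for the rating matrix with NaN marking missing entries, S for an item-item similarity matrix, baseline for the {$b_{ui}$} values) are illustrative assumptions, not code from the winning team:

```python
import numpy as np

def predict_knn(u, i, R, S, baseline, k=30):
    """Similarity-weighted k-NN estimate of r_ui.
    R        : (n_users, n_movies) ratings, np.nan where unknown
    S        : (n_movies, n_movies) item-item similarities s_ij
    baseline : (n_users, n_movies) baseline ratings b_ui"""
    rated = np.where(~np.isnan(R[u]))[0]          # movies user u has rated
    rated = rated[rated != i]
    if rated.size == 0:
        return baseline[u, i]
    # N(i;u): the k rated movies most similar to movie i
    nbrs = rated[np.argsort(-S[i, rated])[:k]]
    sims = S[i, nbrs]
    if sims.sum() <= 0:
        return baseline[u, i]
    # baseline-corrected, similarity-weighted average
    return baseline[u, i] + np.dot(sims, R[u, nbrs] - baseline[u, nbrs]) / sims.sum()
```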
Alternate approach: replace the similarity {$s_{ij}$} between movies with a weight {$ w_{ij} $} calculated using least squares:

{$ \hat{r}_{ui} = b_{ui} + \sum_{j \in N(i;u)} w_{ij} (r_{uj} - b_{uj}) $}

Find {$ w_{ij} $} by seeing what weights on similar movies {$j$} would have done best for estimating the rating {$r_{vi}$} of the target movie {$i$} by people {$v$} other than the user {$u$}:

{$ \min_w [ \sum_{v \ne u} (r_{vi} - \hat{r}_{vi})^2 ] = \min_w [ \sum_{v \ne u} (r_{vi} - b_{vi} - \sum_{j \in N(i;v)} w_{ij} ( r_{vj} - b_{vj} ))^2 ] $}

This works best if one finds the most similar users {$v$}, but that can be expensive, since one needs to compare each user against every other user and find all the movies they have in common. This can be sped up by representing each user in a lower-dimensional space (k=10 here), i.e. by using SVD; further speed comes from storing these low-dimensional representations in a kd-tree. But the matrix we are taking the SVD of is missing over 99% of its entries. We'll see below how to do SVD in this case.

Matrix Factorization Method

Find a factorization which approximately reconstructs the rating matrix {$R$}:

{$ \hat{R} = PQ^T $} or {$ \hat{r}_{ui} = p_u q_i^T $}

where {$P$} has dimension (number of users) by (number of hidden factors {$k$}) and {$Q$} has dimension (number of movies) by (number of hidden factors {$k$}), with {$k = 60$} in this case. {$P$} looks like principal components; {$Q$} looks like loadings.

Note that 99% of {$R$} is not known, so we only try to reconstruct the entries that are known. Also note that regularization helps avoid overfitting:

{$\sum_{(u,i) \in K} [(r_{ui} - p_u q_i^T)^2 + \lambda(|p_u|^2 + |q_i|^2) ]$}

where the summation is over the set {$K$} of {$(u,i)$} pairs for which {$r_{ui}$} is known. This is solved using alternating least squares (rather like the probabilistic PCA model); a sketch is given below.
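A minimal sketch of alternating least squares for the regularized objective above, again purely illustrative (dense NumPy arrays with NaN for unknown ratings; an implementation at Netflix scale would use sparse data structures):

```python
import numpy as np

def als_factorize(R, k=60, lam=0.1, n_iters=20, seed=0):
    """Fit R ~= P Q^T on the known entries only, by alternating ridge regressions.
    R : (n_users, n_movies) ratings, np.nan where unknown."""
    rng = np.random.default_rng(seed)
    n_users, n_movies = R.shape
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_movies, k))
    known = ~np.isnan(R)
    for _ in range(n_iters):
        # Fix Q; for each user, solve a ridge regression over that user's rated movies
        for u in range(n_users):
            idx = known[u]
            if idx.any():
                A = Q[idx].T @ Q[idx] + lam * np.eye(k)
                P[u] = np.linalg.solve(A, Q[idx].T @ R[u, idx])
        # Fix P; for each movie, solve over the users who rated it
        for i in range(n_movies):
            idx = known[:, i]
            if idx.any():
                A = P[idx].T @ P[idx] + lam * np.eye(k)
                Q[i] = np.linalg.solve(A, P[idx].T @ R[idx, i])
    return P, Q
```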
They also further regularize by forcing the elements of {$P$} and {$Q$} to be non-negative, which yields a special case of the popular "Non-Negative Matrix Factorization (NNMF)" method (a simple way to impose this constraint is sketched after the Questions below).

The above model implicitly assumes that all of the "factors" (the hidden variables) come from normal distributions with the same variance. We can relax this assumption and use a more general probabilistic model in which each factor is drawn from a distribution with more general priors:

{$ p_u \sim N(\mu, \Sigma) $}
{$ q_i \sim N(\gamma, \Gamma) $}

and the ratings are assumed to come from

{$r_{ui} \sim N(p_u q_i^T, \epsilon) $}

but we won't go into detail on this.

They also extend the matrix factorization method to a locally weighted version in which, rather than summing over all ratings with equal weight, they use a similarity-weighted reconstruction error:

{$\sum_{(u,i) \in K} [s_{ij} (r_{ui} - p_u q_i^T)^2 + \lambda(|p_u|^2 + |q_i|^2) ]$}

and further restrict the summation to be only over close neighbors.

Questions

How to handle out-of-sample users?
How to handle out-of-sample movies?
To do better, find more features.
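Returning to the non-negativity constraint mentioned under matrix factorization: one simple scheme (assumed here for illustration, not necessarily what the team did) is projected ALS, which clips negative factor entries to zero after each alternating update:

```python
import numpy as np

def project_nonneg(P, Q):
    """Project the factor matrices onto the non-negative orthant by clipping
    negative entries to zero. Inserting this call at the end of each outer
    iteration of als_factorize above gives a simple projected-ALS variant of
    NNMF; multiplicative updates are another common alternative."""
    np.clip(P, 0.0, None, out=P)
    np.clip(Q, 0.0, None, out=Q)
    return P, Q
```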
Features
If one had it, use which movies people rented, or even looked at and decided not to rent.
Normalize for the date of a rating.
Privacy

Researchers could identify individual users by matching the data sets with film ratings on the Internet Movie Database. Netflix cancelled a follow-up competition after a lawsuit over privacy.