Netflix
The data
Netflix ran a competition (the Netflix Prize) to predict the ratings users would have given to movies that they had not rated. The training data set can be viewed as a matrix in which each row is a user and each column is a movie. Each entry is either "NA" (missing) or a rating (an integer from 1 to 5).
- Training data set
  - 100,000,000 ratings
  - 480,000 users
  - 18,000 movies
- Data is sparse
  - 100,000,000/(18,000*480,000) is about 0.012, so only about 1% of the matrix entries are observed
  - but it is worse than that!
- Quiz set
  - 1.4 million ratings used to calculate leaderboard
- Test set
  - 1.4 million ratings used to determine winners
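The sparsity figure can be checked with a few lines of Python. This is only a sketch using the rounded counts from the list above; the variable names are illustrative.

```python
# Rounded counts taken from the list above.
n_ratings = 100_000_000
n_users = 480_000
n_movies = 18_000

# Fraction of the user-by-movie matrix that is actually observed.
density = n_ratings / (n_users * n_movies)

# Average number of ratings per user and per movie.
avg_ratings_per_user = n_ratings / n_users
avg_ratings_per_movie = n_ratings / n_movies

print(f"density = {density:.4f}")
print(f"ratings per user = {avg_ratings_per_user:.0f}")
print(f"ratings per movie = {avg_ratings_per_movie:.0f}")
```

These are averages only; they say nothing about how unevenly the ratings are spread across users and movies.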
Modeling method
The winning team used a weighted average of many methods, including clever versions of k-NN and PCA.
Nearest Neighbor Method
Estimate how much user {$u$} will like a movie {$i$} by taking a similarity-weighted average of the neighboring movies {$j$}:
{$ \hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in N(i;u)} s_{ij} (r_{uj} - b_{uj})}{\sum_{j \in N(i;u)} s_{ij}} $}
{$r_{ui}$} = rating by user {$u$} for movie {$i$}
{$b_{ui}$} = baseline rating - e.g. mean rating of user {$u$} or movie {$i$}
{$s_{ij}$} = similarity between movies {$i$} and {$j$}
{$N(i;u)$} = the set of {$k$} (typically 20-50) movies for which user {$u$} has provided a rating that are most similar to movie {$i$}
Note the trick of subtracting off the baseline, e.g. the average rating of each user.
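The prediction rule above can be sketched in plain Python. The function name `predict_knn` and the dictionary layout are assumptions made for illustration, and the sketch assumes the similarities {$s_{ij}$} are non-negative.

```python
def predict_knn(b_ui, ratings_u, baselines_u, sim_i, k=20):
    """Similarity-weighted neighbor estimate of r_ui.

    b_ui:        baseline for the target (user u, movie i) pair
    ratings_u:   {movie j: r_uj} for the movies user u has rated
    baselines_u: {movie j: b_uj} for those same movies
    sim_i:       {movie j: s_ij}, similarity of movie j to target movie i
    """
    # N(i;u): the k movies rated by u that are most similar to i.
    rated = [j for j in ratings_u if j in sim_i]
    neighbors = sorted(rated, key=lambda j: sim_i[j], reverse=True)[:k]

    denom = sum(sim_i[j] for j in neighbors)
    if denom <= 0:
        # No usable neighbors: fall back to the baseline alone.
        return b_ui

    # Weighted average of the neighbors' deviations from their baselines.
    num = sum(sim_i[j] * (ratings_u[j] - baselines_u[j]) for j in neighbors)
    return b_ui + num / denom


# Toy example with made-up numbers.
ratings_u = {"A": 5.0, "B": 3.0, "C": 4.0}
baselines_u = {"A": 3.5, "B": 3.5, "C": 3.5}
sim_i = {"A": 0.9, "B": 0.1, "C": 0.5}
pred = predict_knn(3.5, ratings_u, baselines_u, sim_i, k=2)
```

With {$k=2$} the neighbors are movies A and C, so the estimate is the baseline 3.5 plus (0.9*1.5 + 0.5*0.5)/1.4.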
Standard similarity measures have a number of limitations:
- They don't take into account the "diversity" of the most similar points.
  - E.g., watching all three films of the Lord of the Rings trilogy should maybe count for one point, not three.
- They don't give any weighting for the degree of similarity.
  - If none of the "most similar" movies is actually similar, one should upweight the baseline.
Alternate approach: replace the similarity {$s_{ij}$} between movies with a weighting {$ w_{ij} $} calculated using least squares.
{$ \hat{r}_{ui} = b_{ui} + \sum_{j \in N(i;u)} w_{ij} (r_{uj} - b_{uj}) $}
Find {$ w_{ij} $} by seeing what weights on similar movies {$j$} would have done the best for estimating the rating {$r_{vi}$} on the target movie {$i$} by people {$v$} other than the user {$u$}.
{$ \min_w [ \sum_{v \ne u} (r_{vi} - \hat{r}_{vi})^2 ] $} {$= \min_w [ \sum_{v \ne u} (r_{vi} - b_{vi} - \sum_{j \in N(i;v)} w_{ij} ( r_{vj} - b_{vj} ))^2 ] $}
This works best if one uses the users {$v$} most similar to {$u$}, but finding them can be expensive, since one needs to compare each user against every other user and find all the movies they have in common. The search can be sped up by representing each user in a lower-dimensional space (k=10 here), i.e., by using an SVD. Further speed comes from storing these low-dimensional representations in a kd-tree. But the matrix we are taking the SVD of is missing over 99% of its entries. We'll see below how to do a generalization of SVD in this case.
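The least-squares fit for the weights {$w_{ij}$} can be sketched with NumPy. This is a simplified illustration, not the winning team's code. It assumes a fixed neighbor set and that every other user {$v$} has rated all of the neighbor movies, so the residuals form a dense matrix. The function names and the optional `ridge` term are assumptions added here.

```python
import numpy as np

def fit_interpolation_weights(X, y, ridge=0.0):
    """Least-squares weights w_ij for one target movie i.

    X[v, j] = r_vj - b_vj  for other users v and neighbor movies j
    y[v]    = r_vi - b_vi  for the same users on the target movie i
    ridge   : optional shrinkage (an assumption, not from the notes)
    """
    k = X.shape[1]
    A = X.T @ X + ridge * np.eye(k)   # normal equations: A w = b
    b = X.T @ y
    return np.linalg.lstsq(A, b, rcond=None)[0]

def predict_with_weights(b_ui, resid_u, w):
    """r_hat_ui = b_ui + sum_j w_ij (r_uj - b_uj)."""
    return b_ui + resid_u @ w

# Toy check: recover known weights from noiseless synthetic residuals.
rng = np.random.default_rng(0)
w_true = np.array([0.5, 0.3, 0.2])
X = rng.normal(size=(50, 3))
y = X @ w_true
w_hat = fit_interpolation_weights(X, y)
r_hat = predict_with_weights(3.5, np.array([1.0, 0.0, -1.0]), w_hat)
```

In practice most users have rated only some of the neighbor movies, so the sums in the normal equations would have to run over the available pairs only.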
Matrix Factorization Method
Find a factorization that approximately reconstructs the rating matrix {$R$}
{$ \hat{R} = PQ^T$}
or
{$ \hat{r}_{ui} = p_u q_i^T $}
where {$P$} has dimension (number of users) by {$k$} and {$Q$} has dimension (number of movies) by {$k$}, with {$k$} the number of hidden factors (60 in this case).
{$P$} looks like principal components
{$Q$} looks like loadings
Note that 99% of {$R$} is not known, so we fit the reconstruction only to the entries that are known. Regularization is also needed to avoid overfitting. We minimize
{$\sum_{(u,i) \in K} [(r_{ui} - p_uq_i^T)^2 + \lambda(|p_u|^2 + |q_i|^2) ]$}
where the summation is over the set {$K$} of {$ (u,i)$} pairs for which {$ r_{ui} $} are known.
This is solved using alternating least squares (rather like the probabilistic PCA model).
- first fix {$P$} and solve for {$Q$} using Ridge regression
- then fix {$Q$} and solve for {$P$} using Ridge regression
- repeat.
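The alternating steps above can be sketched with NumPy. This is a toy illustration on a small dense array with a boolean mask, not a version that would scale to the Netflix data. The function names, default values, and synthetic data are assumptions. Because the penalty in the objective sits inside the sum over {$K$}, each {$q_i$} (or {$p_u$}) is penalized once per known rating, which is why {$\lambda$} is multiplied by the count `n` below.

```python
import numpy as np

def als(R, mask, k=2, lam=0.01, n_iters=20, seed=0):
    """Alternating ridge regressions over the known entries of R.

    R:    (n_users, n_movies) array; values where mask is False are ignored
    mask: boolean array, True where r_ui is known (the set K)
    """
    rng = np.random.default_rng(seed)
    n_users, n_movies = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_movies, k))
    eye = np.eye(k)

    for _ in range(n_iters):
        # Fix P; solve one ridge regression per movie for q_i.
        for i in range(n_movies):
            rows = mask[:, i]
            n = rows.sum()
            if n == 0:
                continue  # movie with no known ratings: leave q_i as is
            A = P[rows]
            Q[i] = np.linalg.solve(A.T @ A + lam * n * eye, A.T @ R[rows, i])
        # Fix Q; solve one ridge regression per user for p_u.
        for u in range(n_users):
            cols = mask[u]
            n = cols.sum()
            if n == 0:
                continue  # user with no known ratings: leave p_u as is
            A = Q[cols]
            P[u] = np.linalg.solve(A.T @ A + lam * n * eye, A.T @ R[u, cols])
    return P, Q

def rmse(R, mask, P, Q):
    """Root-mean-square reconstruction error on the known entries."""
    err = (R - P @ Q.T)[mask]
    return float(np.sqrt(np.mean(err ** 2)))

# Toy demo: a synthetic rank-2 matrix with about 60% of entries observed.
rng = np.random.default_rng(1)
R_toy = rng.normal(size=(30, 2)) @ rng.normal(size=(20, 2)).T
mask_toy = rng.random(R_toy.shape) < 0.6
P0, Q0 = als(R_toy, mask_toy, n_iters=0)   # random initialization only
P1, Q1 = als(R_toy, mask_toy, n_iters=20)
rmse_before = rmse(R_toy, mask_toy, P0, Q0)
rmse_after = rmse(R_toy, mask_toy, P1, Q1)
```

Each inner solve is an ordinary ridge regression in {$k$} unknowns, so the cost per sweep is linear in the number of known ratings.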
The above model implicitly assumes that all of the "factors" (the hidden variables) are from normal distributions with the same variance. We can relax these assumptions and use a more general probabilistic model.
Here each of the factors is assumed to be drawn from a Gaussian prior with its own mean and covariance, {$ p_u \sim N(\mu, \Sigma) $} and {$ q_i \sim N(\gamma, \Gamma) $}, and the ratings are assumed to come from {$r_{ui} \sim N(p_u q_i^T, \epsilon) $}, with noise variance {$\epsilon$}.
But we won't go into detail on this.
They also extend the matrix factorization method to do a locally weighted method, in which, rather than summing over all ratings (with equal weightings), they use a similarity-weighted reconstruction error:
{$\sum_{(u,i) \in K} [s_{ij} (r_{ui} - p_uq_i^T)^2 + \lambda(|p_u|^2 + |q_i|^2) ]$}
and we can further restrict the summation to only be over close neighbors.
Non-Negative Matrix Factorization (NNMF)
They also further regularize by forcing the elements of {$P$} and {$Q$} to be non-negative. This yields a special case of the popular "Non-Negative Matrix Factorization (NNMF)" method.
NNMF is a similar matrix factorization, {$ \hat{R} = PQ^T$}, again keeping {$k$} components, but now requiring all of the entries of {$P$} and {$Q$} to be non-negative. (This works for ratings, since every element of {$R$} will be non-negative.) This will (again) mean that the {$Q$} are not orthogonal; SVD won't help, but we can still solve this using alternating gradient descent: find {$P$} given {$Q$}, find {$Q$} given {$P$}, repeat.
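The alternating scheme can be sketched as follows. Note the swap: instead of plain gradient descent, this sketch uses multiplicative updates in the style of Lee and Seung, which can be read as gradient steps with a per-entry step size chosen so that the factors stay non-negative. It fits only the known entries (via a mask) and omits the {$\lambda$} penalty for brevity. The function name and toy data are assumptions.

```python
import numpy as np

def masked_nmf(R, mask, k=2, n_iters=200, seed=0, eps=1e-9):
    """Non-negative factorization R ~ P Q^T fitted on known entries only."""
    rng = np.random.default_rng(seed)
    M = mask.astype(float)
    # Strictly positive start so the multiplicative updates can move.
    P = rng.random((R.shape[0], k)) + 0.1
    Q = rng.random((R.shape[1], k)) + 0.1
    MR = M * R  # known ratings, zeros elsewhere

    for _ in range(n_iters):
        # Update P given Q, then Q given P; ratios of non-negative
        # quantities keep every entry non-negative.
        P *= (MR @ Q) / ((M * (P @ Q.T)) @ Q + eps)
        Q *= (MR.T @ P) / ((M * (P @ Q.T)).T @ P + eps)
    return P, Q

def masked_sse(R, mask, P, Q):
    """Sum of squared errors over the known entries."""
    return float((((R - P @ Q.T) ** 2)[mask]).sum())

# Toy demo with made-up integer ratings from 1 to 5.
rng = np.random.default_rng(2)
R_toy = rng.integers(1, 6, size=(8, 6)).astype(float)
mask_toy = rng.random(R_toy.shape) < 0.7
P0, Q0 = masked_nmf(R_toy, mask_toy, n_iters=0)   # initialization only
P1, Q1 = masked_nmf(R_toy, mask_toy, n_iters=200)
sse_before = masked_sse(R_toy, mask_toy, P0, Q0)
sse_after = masked_sse(R_toy, mask_toy, P1, Q1)
```

A projected gradient variant (take a gradient step, then clip negative entries to zero) would follow the wording of the notes more literally, at the cost of having to choose a step size.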
Questions
How to handle out-of-sample users?
- Treat each person as a vector of movies.
  - e.g. work with {$R^TR$}, which is movie by movie (so data is summed over people)
- could use demographics, if you had them
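One way to read the {$R^TR$} suggestion is sketched below: build movie-by-movie similarities once, then apply them to the rating vector of a user who was not in the training matrix. Treating missing ratings as 0 and using cosine similarity are assumptions made for this toy example, as are the numbers.

```python
import numpy as np

# Toy rating matrix: rows are users, columns are movies, 0 means "not rated".
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

G = R.T @ R                       # movie by movie, summed over users
norms = np.sqrt(np.diag(G))
S = G / np.outer(norms, norms)    # cosine similarity between movie columns

# A user who is not a row of R and has rated only movie 0.
new_user = np.array([5, 0, 0, 0], dtype=float)
rated = new_user > 0

# Affinity of each movie to what the new user rated; mask out rated movies.
scores = S @ new_user
scores_unseen = np.where(rated, -np.inf, scores)
best_unseen = int(np.argmax(scores_unseen))
```

Because {$R^TR$} has no user dimension, nothing needs to be refit when a new user arrives.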
How to handle out-of-sample movies?
- Use features of the movie (e.g. actor, director, words in the title, ...)
To do better, find more features!
Features
- For each movie: how many people rated it
- For each person: how many movies they rated
- For each person: the (normalized) vector of which movies they rated
If the data were available, one could also use which movies people rented, or even which ones they looked at and decided not to rent.
Normalize for the date of a rating
- Ratings tend to go down as movies have been out longer.
- Some users have a "drift" in their average rating.
Privacy
Researchers could identify individual users by matching the data sets with film ratings on the Internet Movie Database. Netflix canceled a follow-up competition after a lawsuit over privacy.