Netflix

The data
Modeling method

The winning team used a weighted average of many methods, including clever versions of k-NN and PCA.

Nearest Neighbor Method

Estimate how well user {$u$} will like a movie {$i$} by taking a similarity-weighted average of the neighboring movies {$j$}:

{$ \hat{r}_{ui} = b_{ui} + \sum_{j \in N(i;u)} s_{ij} (r_{uj} - b_{uj}) / \sum_{j \in N(i;u)} s_{ij} $}

{$r_{ui}$} = rating by user {$u$} for movie {$i$}
{$b_{ui}$} = baseline rating, e.g. the mean rating of user {$u$} or of movie {$i$}
{$s_{ij}$} = similarity between movies {$i$} and {$j$}
{$N(i;u)$} = the set of {$k$} (typically 20–50) movies most similar to movie {$i$} for which user {$u$} has provided a rating

Note the trick of subtracting off the baseline, e.g. the average rating of each user. Standard similarity measures have a number of limitations.
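A minimal sketch of this neighborhood prediction rule in Python/NumPy. The names here (R for the rating matrix with NaN marking missing entries, S for an item-item similarity matrix, baseline for the {$b_{ui}$} values) are illustrative assumptions, not code from the winning team:

```python
import numpy as np

def predict_knn(u, i, R, S, baseline, k=30):
    """Similarity-weighted k-NN estimate of r_ui.
    R        : (n_users, n_movies) ratings, np.nan where unknown
    S        : (n_movies, n_movies) item-item similarities s_ij
    baseline : (n_users, n_movies) baseline ratings b_ui"""
    rated = np.where(~np.isnan(R[u]))[0]          # movies user u has rated
    rated = rated[rated != i]
    if rated.size == 0:
        return baseline[u, i]
    # N(i;u): the k rated movies most similar to movie i
    nbrs = rated[np.argsort(-S[i, rated])[:k]]
    sims = S[i, nbrs]
    if sims.sum() <= 0:
        return baseline[u, i]
    # baseline-corrected, similarity-weighted average
    return baseline[u, i] + np.dot(sims, R[u, nbrs] - baseline[u, nbrs]) / sims.sum()
```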
Alternate approach: replace the similarity {$s_{ij}$} between movies with a weight {$ w_{ij} $} calculated using least squares:

{$ \hat{r}_{ui} = b_{ui} + \sum_{j \in N(i;u)} w_{ij} (r_{uj} - b_{uj}) $}

Find {$ w_{ij} $} by seeing what weights on similar movies {$j$} would have done best for estimating the rating {$r_{vi}$} of the target movie {$i$} by people {$v$} other than the user {$u$}:

{$ \min_w [ \sum_{v \ne u} (r_{vi} - \hat{r}_{vi})^2 ] = \min_w [ \sum_{v \ne u} (r_{vi} - b_{vi} - \sum_{j \in N(i;v)} w_{ij} ( r_{vj} - b_{vj} ))^2 ] $}

This works best if one finds the most similar users {$v$}, but that can be expensive, since one needs to compare each user against every other user and find all the movies they have in common. This can be sped up by representing each user in a lower-dimensional space (k=10 here), i.e. by using SVD; further speed comes from storing these low-dimensional representations in a kd-tree. But the matrix we are taking the SVD of is missing over 99% of its entries. We'll see below how to do SVD in this case.

Matrix Factorization Method

Find a factorization which approximately reconstructs the rating matrix {$R$}:

{$ \hat{R} = PQ^T $} or {$ \hat{r}_{ui} = p_u q_i^T $}

where {$P$} has dimension (number of users) by (number of hidden factors {$k$}) and {$Q$} has dimension (number of movies) by (number of hidden factors {$k$}), with {$k = 60$} in this case. {$P$} looks like principal components; {$Q$} looks like loadings.

Note that 99% of {$R$} is not known, so we only try to reconstruct the entries that are known. Also note that regularization helps avoid overfitting:

{$\sum_{(u,i) \in K} [(r_{ui} - p_u q_i^T)^2 + \lambda(|p_u|^2 + |q_i|^2) ]$}

where the summation is over the set {$K$} of {$(u,i)$} pairs for which {$r_{ui}$} is known. This is solved using alternating least squares (rather like the probabilistic PCA model); a sketch is given below.
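A minimal sketch of alternating least squares for the regularized objective above, again purely illustrative (dense NumPy arrays with NaN for unknown ratings; an implementation at Netflix scale would use sparse data structures):

```python
import numpy as np

def als_factorize(R, k=60, lam=0.1, n_iters=20, seed=0):
    """Fit R ~= P Q^T on the known entries only, by alternating ridge regressions.
    R : (n_users, n_movies) ratings, np.nan where unknown."""
    rng = np.random.default_rng(seed)
    n_users, n_movies = R.shape
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_movies, k))
    known = ~np.isnan(R)
    for _ in range(n_iters):
        # Fix Q; for each user, solve a ridge regression over that user's rated movies
        for u in range(n_users):
            idx = known[u]
            if idx.any():
                A = Q[idx].T @ Q[idx] + lam * np.eye(k)
                P[u] = np.linalg.solve(A, Q[idx].T @ R[u, idx])
        # Fix P; for each movie, solve over the users who rated it
        for i in range(n_movies):
            idx = known[:, i]
            if idx.any():
                A = P[idx].T @ P[idx] + lam * np.eye(k)
                Q[i] = np.linalg.solve(A, P[idx].T @ R[idx, i])
    return P, Q
```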
They also further regularize by forcing the elements of {$P$} and {$Q$} to be non-negative, which yields a special case of the popular "Non-Negative Matrix Factorization (NNMF)" method (a simple way to impose this constraint is sketched after the Questions below).

The above model implicitly assumes that all of the "factors" (the hidden variables) come from normal distributions with the same variance. We can relax this assumption and use a more general probabilistic model in which each factor is drawn from a distribution with more general priors:

{$ p_u \sim N(\mu, \Sigma) $}
{$ q_i \sim N(\gamma, \Gamma) $}

and the ratings are assumed to come from

{$r_{ui} \sim N(p_u q_i^T, \epsilon) $}

but we won't go into detail on this.

They also extend the matrix factorization method to a locally weighted version in which, rather than summing over all ratings with equal weight, they use a similarity-weighted reconstruction error:

{$\sum_{(u,i) \in K} [s_{ij} (r_{ui} - p_u q_i^T)^2 + \lambda(|p_u|^2 + |q_i|^2) ]$}

and further restrict the summation to be only over close neighbors.

Questions

How to handle out-of-sample users?
How to handle out-of-sample movies?
To do better, find more features.
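Returning to the non-negativity constraint mentioned under matrix factorization: one simple scheme (assumed here for illustration, not necessarily what the team did) is projected ALS, which clips negative factor entries to zero after each alternating update:

```python
import numpy as np

def project_nonneg(P, Q):
    """Project the factor matrices onto the non-negative orthant by clipping
    negative entries to zero. Inserting this call at the end of each outer
    iteration of als_factorize above gives a simple projected-ALS variant of
    NNMF; multiplicative updates are another common alternative."""
    np.clip(P, 0.0, None, out=P)
    np.clip(Q, 0.0, None, out=Q)
    return P, Q
```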
Features
If one had it, use which movies people rented, or even looked at and decided not to rent.
Normalize for the date of a rating.
Privacy

Researchers could identify individual users by matching the data sets with film ratings on the Internet Movie Database. Netflix cancelled a follow-up competition after a lawsuit over privacy.