Project /
AdviceOn this page… (hide) Advice for the projectIMPORTANT NOTE: Remember, the point of this project is NOT to implement everything from scratch but to learn how to use tools and create features to accomplish a given goal. You can and should take advantage of all the reference code we give you, any methods or algorithms from papers that you can find and use. ‘*The general rule is, if it’s not data and you can cite it, you can use it. But you MUST cite it.*’ This page provides a large reservoir of tips and pointers for where you might start on the project. If you’re stuck and don’t have any ideas, try some of the suggestions on this page. ‘*We also have a lot more code for you to download at the bottom of this page.*’ General tips‘*Use a version control system to work on your code.*’ If you’ve never used git before, now is a good time to learn. ‘*Use cross-validation to estimate your progress.*’ ‘*Use the Matlab debugger and never assume that your code is correct.*’ If it’s not working like you think it should, it almost always means that there is a bug in your code. Step through each line of your code by hand, learn to use Cell Mode, and use the ‘*Inspect the data, in particular incorrectly classified examples.*’ Figure out which examples your classifiers aren’t working on, and take a look at them. What is different about those examples? Are they outliers that should be removed from the training set? Is there a new feature you need to add to account for these hard examples? Note: we have not sanitized these reviews in any way. If you are senstitive to profanity, vulgar language, references, or descriptions, hate speech, etc., be forewarned that there could be offensive content and we obviously do not condone anything that might be in the data. ‘*If you estimate probabilities, compute expected rating instead of argmax.*’ If you get a distribution {$P(Y=k \mid X$}, then you can compute the expected rating by computing a weighted average according to the distribution of the possible ratings 1, 2, 4, and 5 stars. This more accurately represents the guess of the classifier than taking the argmax, and might reduce RMSE. ‘*Incorporate the unlabeled data.*’ You have access to many features of the test set, so you can incorporate those into your solution. However, there’s no way to train a purely supervised method on features that never occur in the training set (like new words). One solution to this is to run PCA on the entire dataset. ‘*Try feature selection.*’ Many words or bigrams occur only a single time (or very few times) in the data. By definition, these are almost useless for learning. Many other statistics or heuristics can be devised to determine if a feature is worth including. ‘*Try breaking down the problem by category.*’ If you find that your methods are too slow or impossible to run on all the data at once, you might try breaking the problem down into sub-problems for each category, and trying to find the best category to use for each test category. ‘*Any unsupervised generative model can be used as a generative classifier.*’ Suppose you have {$K$} classes. In Gaussian Naive Bayes we assume that each class has a single mean and variance parameter. But what if that’s not enough to really describe the classes? For each class, we can fit a probability distribution for any unsupervised model we can think of. In other words, if we are after {$P(X,Y)$}, we can use {$P(X|Y=k) = P_k(X)$}, where {$P_k$} is ‘~any~’ probability model {$P(X)$}. For example, we can choose {$P_k(X)$} to be a Gaussian Mixture Model with {$D$} components and use EM to find the parameters. Thus, each class will have {$D$} means and {$D$} covariances. We can predict a new class just as we would for Naive Bayes, by computing {$P_k(X)P(Y=k)$} for each class and taking the argmax. ‘*Any multi-class problem can be encoded as multiple binary problems.*’ There are many ways of doing this, but here’s the simplest: if you have {$K$} classes, train {$K$} classifiers to predict class {$k$} vs. all other classes. Then you can choose a rule for combining the output of each of these classifiers (e.g., take the one with largest margin.) Alternatively, you can get fancier by considering all pairs of class decisions, or even fancier by considering an error correcting code (e.g. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2639&rep=rep1&type=pdf). ‘*Standardizing avoids regularization biases.*’ We talked about various kinds of regularization in class, where you penalize an objective by some norm of the weights. Suppose that you are fitting regularized linear regression with two features: {$X_1$}, and {$X_2$}. Suppose the features are independent and {$\textrm{Var}(X_1) = 1$}, {$\textrm{Var}(X_2) = 1000$}. Intuitively, the goal of regularization is to limit the variance of our prediction, {$f(X_1,X_2) = w_1\cdot X_1 + w_2 \cdot X_2$}, by restricting the norm of {$\mathbf{w}$}. But consider {$\mathbf{w} = [5,0]$} and {$\mathbf{w}' = [0,0.005]$} If {$X_1$} and {$X_2$} are independent, then {$\textrm{Var}(f(X_1,X_2)) = 5$} for both {$\mathbf{w}$} and {$\mathbf{w}'$}, but clearly {$\mathbf{w}'$} has smaller norm. Therefore we see that the regularized solution will always prefer {$\mathbf{w}'$}, when in reality, both solutions have the same variance and one should not be preferred over the other. To fix this, we can preprocess the data by “standardizing” or “Z-scoring,” which is replacing each feature {$X_i$} by {$X_i' = (X_i-\textrm{Mean}(X_i))/\textrm{Std}(X_i)$}. In Matlab, you can compute variance, mean, and standard deviation with the Matlab-specific tips‘*Some useful matlab commands.*’ K-means: ‘*Some useful matlab toolboxes.*’ See below in the last section for a list of ‘*Using the matlab debugger and profiler.*’ Run the command ‘*Check for external dependencies.*’ Use the Dependency Toolbox checker: http://alliance.seas.upenn.edu/~cis520/fall09/deptoolbox.zip. (note: I fixed a bug in this version, don’t download the one off of Matlab Central File Exchange). Put ‘*Look at the vectorization tutorial.*’ In Matlab, matrix operations are implemented natively and are much faster than using Tips for text data‘*Try simple features first.*’ It is tempting to come up with really complicated features, but in practice, the simplest features that describe a given property of the data are often the best. (Think Occam’s razor and VC dimension.) For instance, language complexity might be measured by the average word length or average sentence length, rather than looking for specific words or specific phrases. ‘*Analyze the words in the dictionary.*’ The dictionary you were given is very crude. Many words are actually related and maybe should not be treated as separate words. For instance, “fisher”, “fishing”, “fished”, etc., might all be considered the same concept “fish”. This is known as word stemming. There is a free, standard word stemmer with implementations for practically every language (including Matlab) available here. You also might try correcting things like ‘???’ being considered a separate word from ‘?’. You could also try doing an unsupervised technique and clustering the words (consider the training examples 1 example per word, where each word is a binary vector of occurrences in the training data.) ‘*Standard document analysis techniques.*’ There are many “standard” ways of analyzing text data to look for trends, that can be used as dimensionality reduction or as a probabilistic model. Here are a few that you should be able to find implementations for if you’re interested: Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), Non-negative Matrix Factorization (NNMF), or Latent Dirichlet Allocation (LDA). (LDA is probably the most widely used algorithm for looking at trends in documents.) A much simpler method is a Multinomial or Binomial Mixture Model — these are to regular Naive Bayes as the Gaussian mixture model is to Gaussian Naive Bayes, and are very easy to implement. ‘*Some kernels to try.*’ There are many kernels for text, just as there are for images, for instance, the string kernel. ‘*Visualizing is key.*’ You can tell if your model is making sense by looking at what words are most important for determining cluster or class membership, for instance. For example, if you find clusters of words, what words define the cluster? What words have high {$P(X|Y)$}, but which aren’t just one-time associations? Reference implementations & useful toolboxesBelow is a listing of all the additional code we are providing to help you get started as fast as possible. Note in particular liblinear, a very fast version of the libsvm library used in your homework that only works on sparse data and linear kernels. In addition to the baselines provided, we’re giving you additional reference implementations: Gaussian Mixture Models (training via EM) and an implementation for L1 regularized multinomial logistic regression, A.K.A. Sparse Multinomial Logistic Regression (SMLR). (For your homeworks, you implemented gradient descent for unregularized or L2 regularized binary logistic regression). We covered Gaussian mixtures in class, so this mainly shows how to fit them in a way that avoids common problems you’d otherwise run into (numerical underflow) by keeping all probabilities in log space. The SMLR routine is a potential multi-class alternative to Boosting; it will find a small set of features (depending on the regularization parameter {$\lambda$} — remember to cross-validate to choose {$\lambda$}!) for each class that are used to predict according to the multi-class logistic regression prediction rule. For the reference implementations, we have only given you the training code, so you need to be able to understand how to use the parameters of the models (e.g., multi-class logistic regression and a Gaussian mixture) in order to use them in your projects. These reference implementations can also serve as a guide of how to write efficient and reusable Matlab code. Finally, note that GMMs are not really appropriate for text, since indicator random variables are not gaussian random variables! Note both the Lightspeed toolbox and SMLR need to be compiled into MEX in order to run properly; they already have been compiled for 64-bit machines, so you just need to handle compiling them for your own machine. (Instructions are provided with each.)
|