Advice
Advice for the project
Remember, the point of this project is NOT to implement everything from scratch but to learn how to use existing tools and create features to accomplish a given goal. You can and should take advantage of all the reference code we give you, as well as any methods or algorithms from papers that you can find and use. The general rule is: use anything you want, but cite it.
This page provides a large reservoir of tips and pointers for where you might start on the project. If you're stuck and don't have any ideas, try some of the suggestions on this page. There is more code for you to download at the bottom of this page.
General tips
- Think carefully about scoping. Is the problem likely to be solvable at all, and solvable on the computers you have? How big is {$n$}? {$p$}?
- What similar problems have people solved? What methods did they use? What accuracy did they achieve?
- Start simple. First try linear or logistic regression; then get fancy. First try the feature classes that you think are most likely to be helpful; then add in additional features.
- Use a version control system. If you've never used git before, now is a good time to learn.
- Use cross-validation to estimate your progress. Hold out an entirely separate test set to do a final assessment of your model accuracy.
- Inspect the data, in particular incorrectly classified examples. Figure out which examples your classifiers aren't working on, and take a look at them. What is different about those examples? Are they outliers that should be removed from the training set? Is there a new feature you need to add to account for these hard examples?
- Think about what models should work. Kernels? Large-margin methods? Is online learning needed? See Sklearn's learning map, but don't trust it blindly (Sklearn's map doesn't include many key methods such as deep learning and RL).
- Do you have unlabeled data for semi-supervised learning? Or can you use "transductive learning" -- doing feature reduction using the {$x$}'s from both the training and test sets?
- Use feature selection. In general, you want to start with too many features and then prune them down (but not for problems like machine vision).
- Is there structure over the features? Often one needs to do "block regularization"--different regularizations for different feature types.
- Is there structure over the labels? Should you run one vs. all? A two level model?
- Any unsupervised generative model can be used as a generative classifier. Suppose you have {$K$} classes. In Gaussian Naive Bayes we assume that each class has a single mean and variance parameter. But what if that's not enough to really describe the classes? For each class, we can fit a probability distribution for any unsupervised model we can think of. In other words, if we are after {$P(X,Y)$}, we can use {$P(X|Y=k) = P_k(X)$}, where {$P_k$} is any probability model {$P(X)$}. For example, we can choose {$P_k(X)$} to be a Gaussian Mixture Model with {$D$} components and use EM to find the parameters. Thus, each class will have {$D$} means and {$D$} covariances. We can predict a new class just as we would for Naive Bayes, by computing {$P_k(X)P(Y=k)$} for each class and taking the argmax (a per-class mixture sketch appears after this list).
- Any multi-class problem can be encoded as multiple binary problems. There are many ways of doing this, but here's the simplest: if you have {$K$} classes, train {$K$} classifiers to predict class {$k$} vs. all other classes. Then you can choose a rule for combining the output of each of these classifiers (e.g., take the one with the largest margin). Alternatively, you can get fancier by considering all pairs of class decisions, or even fancier by considering an error-correcting code. A one-vs-rest sketch appears after this list.
- If you estimate probabilities, compute the expected rating instead of the argmax. If you get a distribution {$P(Y=k \mid X)$}, then you can compute the expected rating as a weighted average of the possible ratings 1, 2, 4, and 5 stars, weighted by that distribution. This often represents the classifier's guess more accurately than taking the argmax, and might reduce the error (see the expected-rating sketch after this list).
- Standardizing makes regularization more stable. Usually, it is a good idea if the features are in many different units but a bad idea if they are all in the same unit (e.g. color intensity).
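To make the generative-classifier idea concrete, here is a minimal sketch using scikit-learn's GaussianMixture with one mixture per class. The toy data, the choice of {$D=3$} components, and the variable names are illustrative assumptions, not a prescribed recipe.

```python
# Sketch: per-class Gaussian Mixture Models used as a generative classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Toy data just to make the sketch runnable; replace with your own features/labels.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, n_clusters_per_class=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

D = 3  # mixture components per class (an assumption; tune by cross-validation)
classes = np.unique(y_train)

models, log_priors = {}, {}
for k in classes:
    X_k = X_train[y_train == k]
    models[k] = GaussianMixture(n_components=D, covariance_type='full',
                                random_state=0).fit(X_k)
    log_priors[k] = np.log(len(X_k) / len(X_train))  # log P(Y = k)

# Classify by argmax_k of log P(X | Y = k) + log P(Y = k).
log_joint = np.column_stack([models[k].score_samples(X_test) + log_priors[k]
                             for k in classes])
y_pred = classes[np.argmax(log_joint, axis=1)]
print("accuracy:", (y_pred == y_test).mean())
```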
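Similarly, a minimal one-vs-rest sketch, using scikit-learn's OneVsRestClassifier so you don't have to write the {$K$} binary training loops yourself; the base classifier and toy data are arbitrary choices.

```python
# Sketch: K-class problem reduced to K binary "class k vs. the rest" classifiers.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each binary classifier scores "class k vs. all"; prediction takes the largest score.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
print("one-vs-rest accuracy:", ovr.score(X_test, y_test))
```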
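And a sketch of the expected-rating computation. In practice the probabilities would come from predict_proba of any fitted classifier; here they are hard-coded so the snippet stands alone.

```python
# Sketch: expected rating from class probabilities instead of the argmax.
import numpy as np

ratings = np.array([1, 2, 4, 5])            # possible label values, in classifier column order
proba = np.array([[0.1, 0.2, 0.3, 0.4],     # normally clf.predict_proba(X_test),
                  [0.7, 0.2, 0.05, 0.05]])  # shape (n_samples, n_classes)

expected_rating = proba @ ratings                    # sum_k P(Y=k|X) * rating_k
hard_prediction = ratings[np.argmax(proba, axis=1)]
print(expected_rating)   # [3.7  1.55]
print(hard_prediction)   # [5 1]
```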
Debugging learning algorithms
Here's some additional general advice from Andrew Ng's lecture on getting ML to work in practice.
- If train performance is much higher than test performance, this indicates variance is too high (overfitting).
- If both train and test performance are low, this indicates bias is too high (likely too few good features). A short sketch of this train-vs-test check appears after this list.
- Example analysis comparing two algorithms: if the SVM performs better than logistic regression (LR) and also achieves a higher value when its parameters are plugged into the LR objective, the problem is in the LR optimization (try a different regularization parameter or run the code with a tighter convergence requirement); if the SVM simply performs better, the problem is in the LR objective and you should switch learning methods.
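A quick way to run the first two checks is to compare training and held-out accuracy directly; the model and data below are placeholders for your own.

```python
# Sketch: compare train vs. held-out accuracy to see whether variance (overfitting)
# or bias (underfitting) looks like the bigger problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_acc, test_acc = clf.score(X_train, y_train), clf.score(X_test, y_test)
print(f"train={train_acc:.3f}  test={test_acc:.3f}")
# train >> test -> high variance: more data, more regularization, fewer features.
# both low      -> high bias: better features, richer model, less regularization.
```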
Tips for image data
- RGB is not always the most informative color space. Using red, green, and blue to represent images is fine and dandy for viewing by eye, but in this color space, your classifier must learn that lots of different shades of colors are equivalent if it is going to detect facial tones, for instance. There are many other color spaces that have been devised: hue, saturation, and value (HSV) represents color in terms of absolute hue and saturation, while "brightness" is determined by value; the Lab colorspace uses two orthogonal dimensions (a, b) to represent color, and a third dimension L to represent lightness. Either of these would be a better representation for the image than RGB (see the color-conversion sketch after this list).
- Kernels to consider. There are many specialized kernels for dealing with images that you might want to consider. For instance, the spatial pyramid kernel or the pyramid match kernel. These are far less widely used now that everything uses CNNs.
- Visualizing is key. What do the cluster means look like? Can you plot your features as an image? What aspects of the image is the classifier picking up on? Remember that you can plot any learned weightings over features in the original feature space.
- Can you make virtual training data? Suppose that all men face left in the training data, and all women face right in the training data. Then your gender classifier will really pick up on a face direction classifier, or really, which side of the image is brighter (assuming light is falling on the face). In reality, we know that flipping the pixels in an image does not change its label. If you want your classifier to be invariant to some transformation of the data, then one technique is to create "virtual examples" by applying the transformation to training examples and reinserting them into the dataset. That way you will "teach" the classifier that labels should be invariant to these transformations (a flipping sketch appears after this list).
- Plot a learning curve: Try training on subsets of the training data, and plot a learning curve with the amount of training data on the x-axis and train/test error on the y-axis (see the learning-curve sketch after this list). If you don't get relatively smooth curves even after averaging over several different subsets, start doing some error analysis to figure out whether the problem is the learning algorithm you're using, the features you're using, or noise in the data.
- Clean the data: To remove some noise from your data, you can throw out pictures or blogs that are too noisy. For the blogs, for instance, some people have noticed that there are several stretches of symbols in the dictionary that look like nonsense, which correspond to blogs written in other languages. You might want to locate these blogs and exclude them from your training set. Or, you can try to clean up examples instead of just throwing them away. For example, consider cropping images more precisely to remove more of the background.
- Visualizations: Don't wait until writing the final report to make visualizations of what your algorithms are doing. The point of visualizations is to better understand where your algorithm is doing well (on what examples is it doing better than the baseline and why?), and where it's doing poorly (compared to an oracle classifier, what is your algorithm missing and why?).
- Cross-validation: Try a wide range of values for hyperparameters when doing cross-validation. Then, once you've found a decent value, you might consider doing a more restricted search of values close to the best one to find a more precise value. Don't forget to include the endpoints (e.g., no regularization).
- Virtual examples: Since your training sets are fairly limited, it's often a good idea to create additional "virtual" examples by performing simple transformations on the given data. For the images, for instance, you can create several different types of virtual examples by flipping, shifting, rotating, or shearing the images. You can probably think of other transformations that could help your algorithms learn to better ignore attributes such as the lighting in images.
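A minimal sketch of converting RGB images to HSV or Lab with scikit-image before computing features; the random array below just stands in for one of your images.

```python
# Sketch: RGB -> HSV and RGB -> Lab conversion with scikit-image.
import numpy as np
from skimage.color import rgb2hsv, rgb2lab

rgb = np.random.rand(64, 64, 3)      # placeholder image with values in [0, 1]
hsv = rgb2hsv(rgb)                   # channels: hue, saturation, value
lab = rgb2lab(rgb)                   # channels: L (lightness), a, b (color)

# Example feature: a histogram over hue, which groups shades of the same color together.
hue_hist, _ = np.histogram(hsv[:, :, 0], bins=16, range=(0, 1), density=True)
```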
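A sketch of the learning-curve idea using scikit-learn's learning_curve helper; the classifier and synthetic data are placeholders for your own.

```python
# Sketch: train/validation error as a function of training-set size.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

plt.plot(sizes, 1 - train_scores.mean(axis=1), label="train error")
plt.plot(sizes, 1 - val_scores.mean(axis=1), label="validation error")
plt.xlabel("number of training examples")
plt.ylabel("error")
plt.legend()
plt.show()
```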
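And a sketch of creating virtual examples by horizontal flipping; the arrays are placeholders for your image data, and flipping assumes the label really is left/right invariant.

```python
# Sketch: augment the training set with horizontally flipped copies of each image.
import numpy as np

images = np.random.rand(100, 32, 32, 3)   # placeholder: (n_images, height, width, channels)
labels = np.random.randint(0, 2, size=100)

flipped = images[:, :, ::-1, :]           # reverse the width axis of every image
aug_images = np.concatenate([images, flipped], axis=0)
aug_labels = np.concatenate([labels, labels], axis=0)
```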
Tips for text data
- Try simple features first. It is tempting to come up with really complicated features, but in practice, the simplest features that describe a given property of the data are often the best. For instance, language complexity might be measured by the average word length or average sentence length, rather than looking for specific words or specific phrases.
- Stemming rarely helps.
- Tokenizing is critical.
- Standard document analysis techniques. There are many "standard" ways of analyzing text data to look for trends, which can be used for dimensionality reduction or as a probabilistic model. Here are a few that you should be able to find implementations for if you're interested: Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), Non-negative Matrix Factorization (NNMF), or Latent Dirichlet Allocation (LDA). (LDA is probably the most widely used algorithm for looking at trends in documents; a sketch appears after this list.) A much simpler method is a Multinomial or Binomial Mixture Model -- these are to regular Naive Bayes as the Gaussian mixture model is to Gaussian Naive Bayes, and are very easy to implement.
- Some kernels to try. There are many kernels for text, just as there are for images, for instance, the string kernel. Mostly, these days people use vector embeddings like word2vec, GloVe, ELMo, and BERT.
- Visualizing is key. You can tell if your model is making sense by looking at what words are most important for determining cluster or class membership, for instance. For example, if you find clusters of words, what words define the cluster? What words have high {$P(X|Y)$}, but which aren't just one-time associations?
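A sketch of fitting an LDA topic model with scikit-learn and printing the top words per topic (which doubles as the kind of visualization suggested above); the 20 Newsgroups text just stands in for your own documents.

```python
# Sketch: fit a small LDA topic model and inspect the top words per topic.
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data[:2000]

vectorizer = CountVectorizer(max_features=5000, stop_words='english')
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(counts)
doc_topics = lda.transform(counts)        # per-document topic proportions (usable as features)

words = vectorizer.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    top = words[weights.argsort()[-8:][::-1]]
    print(f"topic {t}:", " ".join(top))
```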
How to extract features?
Sklearn has classes for easy bag-of-words feature extraction (CountVectorizer and TfidfVectorizer in sklearn.feature_extraction.text). More high-powered packages include spaCy and Stanford's NLP toolkit.
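A minimal sketch of that; the three-document corpus and the n-gram range are just illustrations.

```python
# Sketch: bag-of-words and tf-idf features with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the movie was great", "the movie was terrible", "great acting, great story"]

counts = CountVectorizer(ngram_range=(1, 2), min_df=1).fit_transform(docs)   # raw counts
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=1).fit_transform(docs)    # tf-idf weights
print(counts.shape, tfidf.shape)   # sparse matrices: (n_documents, n_features)
```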
fastText has good pretrained, context-oblivious embeddings. You can embed a sentence by summing the embeddings of the words in it.
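A sketch of that, using the fasttext Python package and assuming you have downloaded a pretrained English model (cc.en.300.bin from fasttext.cc; adjust the path to wherever you saved it).

```python
# Sketch: sentence embedding by summing pretrained fastText word vectors.
import numpy as np
import fasttext

model = fasttext.load_model("cc.en.300.bin")   # path to the downloaded .bin file

def embed_sentence(sentence):
    # Sum of word vectors; dividing by the word count (averaging) also works.
    return np.sum([model.get_word_vector(w) for w in sentence.split()], axis=0)

vec = embed_sentence("this movie was surprisingly good")
print(vec.shape)   # (300,)
```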
HuggingFace is great for running pre-trained transformers. It's also moderately easy to run fine-tuning (if you understand PyTorch). If you want "features" per document, you still need to aggregate the HuggingFace output, which is an embedding per token per layer, with documents broken into 256-word segments. The most standard decent aggregation is taking the mean over all tokens in the second-to-last layer.
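A sketch of that aggregation with the transformers library; bert-base-uncased is just one common choice of pretrained encoder.

```python
# Sketch: document features = mean over tokens of the second-to-last hidden layer.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"                      # any pretrained encoder works here
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

text = "Replace this with one (segment of a) document."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # tuple: embeddings + one tensor per layer

second_to_last = hidden_states[-2]                  # shape (1, n_tokens, hidden_size)
features = second_to_last.mean(dim=1).squeeze(0)    # one vector per document segment
print(features.shape)                               # e.g. torch.Size([768])
```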