MoreAdvice

General suggestions

  • Plot a learning curve: Try training on subsets of the training data, and plot a learning curve with the amount of training data on the x-axis and train/test error on the y-axis. If you don't get relatively smooth curves even after averaging over several different subsets, start doing some error analysis to figure out whether the problem lies in the learning algorithm, the features, or noise in the data.
  • Clean the data: To remove some noise from your data, you can throw out pictures or blogs that are too noisy. For the blogs, for instance, some people have noticed that the dictionary contains several stretches of nonsense-looking symbols, which correspond to blogs written in other languages. You might want to locate these blogs and exclude them from your training set. Alternatively, you can try to clean up examples instead of just throwing them away. Since your training data is limited, this might be the better option. For example, consider cropping images more precisely to remove more of the background.
  • Visualizations: Don't wait until writing the final report to make some visualizations of what your algorithms are doing. The point of visualizations is to better understand where your algorithm is doing well (on what examples is it doing better than the baseline and why?), and where it's doing poorly (compared to an oracle classifier, what is your algorithm missing and why?).
  • Cross-validation: Try a wide range of values for hyperparameters when doing cross-validation. Then, once you've found a decent value, you might consider doing a more restricted search of values close to the best one to find a more precise value.
  • Virtual examples: Since your training sets are fairly limited, it's a good idea to create additional "virtual" examples by performing simple transformations on the given data. For the images, for instance, you can create several different types of virtual examples by transforming the given images with fliplr, or by slight rotations or shearing. You can probably think of other transformations that could help your algorithms learn to better ignore attributes such as the lighting in images.
  • Dependency checker: To help prevent your nightly checkpoints from crashing, you should check for external dependencies using Dependency Toolbox Checker http://alliance.seas.upenn.edu/~cis520/fall09/deptoolbox.zip.
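
The cross-validation advice above (a wide sweep first, then a narrower one around the best value) can be sketched as follows. This is only an illustration in Python, not project code: cv_error is a hypothetical function that you would supply to run cross-validation for one hyperparameter value, and the default range of 10^-4 to 10^4 is an assumed starting point, not a recommendation from the course.

```python
import math


def log_grid(lo_exp, hi_exp, num):
    """Return `num` values evenly spaced in log10 from 10**lo_exp to 10**hi_exp."""
    step = (hi_exp - lo_exp) / (num - 1)
    return [10 ** (lo_exp + i * step) for i in range(num)]


def coarse_to_fine(cv_error, lo_exp=-4.0, hi_exp=4.0, num=9, width=1.0):
    """Two-stage search for the hyperparameter value with the lowest CV error.

    cv_error is a hypothetical callable: hyperparameter -> cross-validation error.
    """
    # Stage 1: wide sweep (one candidate per decade with the defaults).
    coarse = log_grid(lo_exp, hi_exp, num)
    best = min(coarse, key=cv_error)

    # Stage 2: narrow sweep within `width` decades on either side of the winner.
    center = math.log10(best)
    fine = log_grid(center - width, center + width, num)

    # Keep the coarse winner as a candidate so stage 2 can never do worse.
    return min(fine + [best], key=cv_error)


def toy_cv_error(c):
    # Stand-in for a real cross-validation run; lowest near c = 10**0.3.
    return (math.log10(c) - 0.3) ** 2


best_c = coarse_to_fine(toy_cv_error)
```

Searching on a log scale is a design choice here: regularization strengths and kernel widths usually matter by order of magnitude, so a linear grid would waste most of its points.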

Debugging learning algorithms

Here's some additional general advice from Andrew Ng's lecture on getting ML to work in practice.

  • If train performance is much higher than test performance, this indicates variance is too high (overfitting).
  • If both train and test performance are low, this indicates bias is too high (likely too few good features).
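  • Example analysis comparing 2 algorithms: suppose SVM performs better than LR. Plug the SVM's learned parameters into the LR objective (the quantity LR is trying to maximize). If the SVM parameters achieve a higher value than the LR parameters do, LR is failing to optimize its own objective, so the problem is in the LR optimization (try running the code longer, with a tighter convergence requirement, or with a different optimizer). If the SVM performs better even though its parameters score no higher on the LR objective, the problem is in the LR objective itself, and you should change it (for example, try a different regularization parameter) or switch learning methods.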
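
The comparison in the last bullet can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the lecture's code: both models are linear with the bias folded into a constant feature, labels are 0/1, and the LR objective is taken to be the L2-regularized log-likelihood (to be maximized). The function names and the toy data are hypothetical.

```python
import math


def dot(w, x):
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def lr_objective(w, X, y, lam):
    """L2-regularized log-likelihood (assumed form of the LR objective)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = dot(w, x_i)
        s = z if y_i == 1 else -z
        # log(sigmoid(s)), written to avoid overflow for large |s|.
        if s >= 0:
            total += -math.log1p(math.exp(-s))
        else:
            total += s - math.log1p(math.exp(s))
    return total - lam * sum(w_j ** 2 for w_j in w)


def accuracy(w, X, y):
    hits = sum((dot(w, x_i) > 0) == (y_i == 1) for x_i, y_i in zip(X, y))
    return hits / len(y)


def diagnose(w_svm, w_lr, X_train, y_train, X_test, y_test, lam):
    """Say whether LR's optimizer or LR's objective is the likely culprit."""
    if accuracy(w_svm, X_test, y_test) <= accuracy(w_lr, X_test, y_test):
        return "not applicable"  # the SVM is not actually doing better
    j_svm = lr_objective(w_svm, X_train, y_train, lam)
    j_lr = lr_objective(w_lr, X_train, y_train, lam)
    # SVM parameters beat LR on LR's own objective: LR was not optimized well.
    return "optimization" if j_svm > j_lr else "objective"


# Toy data: second feature is the constant bias term.
X = [[1.0, 1.0], [2.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]]
y = [1, 1, 0, 0]
w_lr_unconverged = [0.0, 0.0]   # pretend LR stopped far too early
w_svm = [1.0, 0.0]
verdict = diagnose(w_svm, w_lr_unconverged, X, y, X, y, lam=0.1)
```

In this toy case the SVM weights score higher on the LR objective than the LR weights do, so the verdict points at the optimizer.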

Additional features and algorithms

  • Blogs: Consider using LSA, LDA (one MATLAB implementation can be found here), or a kernel such as the string kernel. Given a string, this kernel implicitly maps it to the space of all possible k-tuples of characters. That feature space is too large to construct explicitly, but the kernel computes inner products in it directly.
  • Images: Try experimenting with the Canny edge detector that is built into MATLAB. Also, check out histogram of oriented gradient (HOG) features. To compute HOG features, first divide the image into cells. For each cell, compute a histogram of intensity gradients or edge directions. Optionally, add contrast normalization: across larger blocks of the image, compute a measure of intensity, and normalize histograms by this measure. Here's the CVPR paper that introduced HOG features, and here's a MATLAB implementation. HOG features have most frequently been applied to the detection of people in images, but they may also prove useful for this project. The advantage of this method is that it is relatively invariant to small geometric transformations. Since the images in the project dataset aren't perfectly registered, such invariance is very important.
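
To make the string kernel suggested for blogs more concrete, here is a minimal Python sketch of one simple member of that family, the k-spectrum kernel, which counts shared contiguous k-tuples of characters. The text above does not say which string-kernel variant is intended, so treat this choice (and the function names) as an assumption. The point it illustrates is that the kernel value is computed from the k-tuples that actually occur in the two strings, without ever building a vector over all possible k-tuples.

```python
from collections import Counter


def kmer_counts(s, k):
    """Count every contiguous length-k substring of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(s, t, k):
    """Inner product of the k-tuple count vectors of s and t."""
    counts_s, counts_t = kmer_counts(s, k), kmer_counts(t, k)
    # Iterate over the smaller table; k-tuples missing from the other count as 0.
    if len(counts_s) > len(counts_t):
        counts_s, counts_t = counts_t, counts_s
    return sum(n * counts_t[kmer] for kmer, n in counts_s.items())


# "banana" has 2-tuples ba:1, an:2, na:2; "bandana" has ba:1, an:2, na:1, nd:1, da:1.
similarity = spectrum_kernel("banana", "bandana", 2)   # 1*1 + 2*2 + 2*1 = 7
unrelated = spectrum_kernel("abc", "xyz", 2)           # no shared 2-tuples -> 0
```

Because longer documents produce larger raw values, it may help to normalize by the square root of the product of the two self-similarities before passing the kernel to an SVM.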