This project involves developing a “genre classification” system that will predict the genre of a song from its lyrics. A song’s genre is a categorical description of the “type” of music it is, and examples of genre include “hip hop,” “jazz,” or “rock.” Internet radio services such as last.fm and Spotify use genre prediction to create a station that plays songs from a user-supplied genre.
The dataset includes the lyrics and genre of over 12,000 songs in a bag-of-words format. Each song belongs to one of 10 genres, and your job is to predict the genre for each song in the test set.
There are several features of this project that make it particularly interesting and fun for you:
- Transductive setting. You are given (limited) access to some of the test data ahead of time, allowing you to incorporate the statistics of the test data into your methods if you so desire. For example, you can run PCA on the word frequencies for the entire dataset, not just on the training data (see the sketch after this list).
- Additional features. In addition to the lyric features, we are releasing the audio features of each song. Using audio in conjunction with the lyric features may give you an edge over the other groups. A quick description of each audio feature is given below.
- Music is awesome.
The format of the project is a competition, with live leaderboards (see below for more details).
The project is broken down into a series of checkpoints. There are four mandatory checkpoints (Nov. 14th, Nov. 16th, Nov. 27th, and Nov. 30th). The final writeup is due Dec. 3rd. The leaderboards will operate continuously, so you can monitor your progress against other teams and towards the score-based checkpoints. All mandatory deadlines are at midnight, so the deadline “Nov. 16th” means you can submit anytime before the 16th becomes the 17th.
Your predictions will be evaluated based on their mean reciprocal rank. Your code should produce an {$ N \times 10 $} matrix of ranks. Each row {$ \hat{\mathbf{y}}^{(i)} $} is a ranking of the {$ K=10 $} genre labels in decreasing order of confidence for example {$ i $}. If the position of the true genre in your ranking vector is given by {$ r_i $}, the mean rank loss over your classifier’s predictions is:
{$ \mbox{rank loss} = \frac{1}{N}\sum_{i=1}^N \left(1-\frac{1}{r_i}\right) $}
The intuition is that for a given true genre, even if you don’t predict it exactly, it’s better for your prediction to give a high rank to the true class. An example of how to generate the ranking matrix is given in the starter kit.
We have partitioned the dataset into three subsets: training, quiz, and test. The starter kit includes the features and labels for the training set, and the features for the quiz set. However, we are not releasing the final test set! For the final submission, you will submit your code, and we will evaluate your classifier on the withheld test data to determine your final overall performance.
For the second and third checkpoints, you only need to submit to the leaderboard(s). For the final checkpoint, you must submit ALL of your code via turnin to the correct project folder. Make sure that you submit any code that you used in any way to train and evaluate your method. After the first checkpoint, we will open up an autograder that will check the validity of your code, ensuring that we’ll be able to evaluate it at the end.
Each code submission should have a file named predict_genre.m that takes the training and test sets and returns a ranking matrix. Any training should be done prior to submission, and your trained classifier must be saved in a format that can be loaded directly by predict_genre.m. If you are using an instance-based method like KNN, the quiz set(s) will be passed to the function and can also be used.
Constraints on checkpoint submissions:
- Any compiled code must be in .mexa64 format. If your code runs on biglab, then it will run on our test machine, so test there before submitting to a checkpoint.
- Be careful to include in your submission all files your code needs to run. If you download the starter kit and work entirely out of the code directory, you should be fine. For example, if your algorithms require libsvm, you must ensure that libsvm exists inside the code directory.
- You can submit your code as often as you’d like in order to check its correctness, but you will only be able to submit to the leaderboard once every 5 hours.
You can download the starter kit here: http://alliance.seas.upenn.edu/~cis520/fall12/project_starter_kit.zip
Take a look at the run_submission.m file in the starter code directory to get an idea of how to use the code we gave you, and look over how the various components of the simple baseline method work. You should be able to understand what all the code is doing. We will be discussing the project and the kit during recitation on Friday, so please make sure to come to this recitation.
Before you can get results on the leaderboard, you need to submit your team name. Everyone on your team is required to do this. Simply create a text file on eniac with your team name and submit it as follows:
$ echo "My Team Name" > group.txt
$ turnin -c cis520 -p proj_groups group.txt
This group.txt file should be raw text and contain only a single line. Do not submit PDFs, Word documents, rich text, HTML, or anything like that. Just follow the above commands. If you have a SEAS email address, then you will get an email confirmation.
To submit to the leaderboard, you should submit the file submit.txt, which has 10 numbers per line for each example in the test set. An example of how to generate this file is in the starter kit.
Once you have your submit.txt, you can submit it with the following:
turnin -c cis520 -p leaderboard submit.txt
Your team can submit once every 5 hours, so use your submissions carefully. Your submission will be checked against the reference solutions and you will get your score back via email. This score will also be posted to the leaderboard so everyone can see how awesome you are.
You can view the current leaderboard here: http://www.seas.upenn.edu/~cis520/fall12/leaderboard.html
You must submit your code for the final checkpoint. You can do so with the following:
turnin -c cis520 -p project <list of files including make_final_prediction.m>
You will receive feedback from the autograder, exactly like the homework.
In addition to the bag-of-words lyric features for each song, we will also be providing a set of simple audio features. There is code in the starter kit to get the matrix of audio features for the dataset. While you only have to beat the minimum baseline with audio for the final submission, the best classifiers will use both sets of features to make predictions: there is a lot of information in the audio that is not present in the lyrics, and vice versa.
Here is a quick description of the features we are making available:
We’ll post tips here.
You may not use late days for project checkpoints. (Aside from being incompatible with the nature of the competition, it is logistically difficult to apply a fair late day policy to groups with multiple people.)