Categorical

Options for handling categorical variables

One-hot encoding

Works well, especially if there are not too many distinct categories. Note that for text (and sometimes for products), you have the choice of a binary encoding (is each word present or absent in each document) or a count encoding (how many times each word appears in each document). There are, for example, versions of Naive Bayes that use each of those (Bernoulli and multinomial Naive Bayes, respectively).
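
As a minimal sketch of the binary-versus-count distinction, scikit-learn's CountVectorizer can produce either encoding for text; the tiny corpus below is invented purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented purely for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Count encoding: each entry is how many times the word appears in the document.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# Binary encoding: each entry is 1 if the word appears at all, else 0.
binary_vec = CountVectorizer(binary=True)
presence = binary_vec.fit_transform(docs)

print(count_vec.get_feature_names_out())
print(counts.toarray())    # e.g. "the" has a count of 2 in each document
print(presence.toarray())  # the same cell is just 1
```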

Use an embedding

For text, you can use a pretrained context-oblivious embedding, like word2vec or FastText, or a context-sensitive one like BERT. A good place to start is HuggingFace; a simple baseline is to represent each short document as the average of the embeddings of the words it contains. There are also pretrained document-level embeddings such as sentence2vec or doc2vec.
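
A minimal sketch of the averaging baseline, assuming gensim and one of its downloadable pretrained models (the particular GloVe model and its dimensionality are incidental choices, not something these notes prescribe):

```python
import numpy as np
import gensim.downloader as api

# Any pretrained word-embedding model works; this small GloVe model is just a convenient example.
wv = api.load("glove-wiki-gigaword-50")  # 50-dimensional word vectors

def embed_document(text, wv):
    """Represent a document as the average of its words' embeddings."""
    vectors = [wv[w] for w in text.lower().split() if w in wv]
    if not vectors:  # no known words: fall back to a zero vector
        return np.zeros(wv.vector_size)
    return np.mean(vectors, axis=0)

doc_vector = embed_document("The cat sat on the mat", wv)
print(doc_vector.shape)  # (50,)
```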

Alternatively, you can run LDA (Latent Dirichlet Allocation) over the documents, and then represent each document as its estimated distribution over the topics it contains.
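
A sketch with scikit-learn (the number of topics and the toy corpus are arbitrary choices for illustration):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "the dog chased the cat across the yard",
    "markets rallied after the earnings report",
]

# LDA works on word counts, so vectorize first.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# Fit LDA and get each document's distribution over topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(doc_topics)  # one row per document; each row sums to 1
```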

Target Encoding

A clever trick is to replace each categorical value with the average value of y over the training observations that have that categorical value. (Of course, the per-category averages must be computed on the training set only, to avoid leaking test labels; the same mapping is then applied to the test set.)
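
A minimal sketch with pandas, assuming a categorical column "city" and a target column "y" (both names and the data are made up for illustration); categories unseen in training fall back to the global training mean:

```python
import pandas as pd

# Toy data, invented for illustration.
train = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA", "SF"],
    "y":    [10.0,  12.0,  5.0,  7.0,  9.0],
})
test = pd.DataFrame({"city": ["LA", "SF", "Austin"]})  # "Austin" is unseen in training

# Per-category means computed on the training set only (no test labels used).
means = train.groupby("city")["y"].mean()
global_mean = train["y"].mean()

train["city_encoded"] = train["city"].map(means)
test["city_encoded"] = test["city"].map(means).fillna(global_mean)

print(test)
```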
