

Naive Bayes Model

The Naive Bayes classifier is an example of the generative approach: we will model $P(x,y)$. Consider the toy transportation data below:

The inputs/features/attributes $x$ are Distance, Raining, and Flat Tire; the class $y$ is Mode.

| Distance (miles) | Raining | Flat Tire | Mode |
|---|---|---|---|
| 1 | no | no | bike |
| 2 | yes | no | walk |
| 1 | no | yes | bus |
| 1 | yes | no | walk |
| 2 | yes | no | bus |
| 1 | no | no | car |
| 1 | yes | yes | bike |
| 10 | yes | no | bike |
| 10 | no | no | car |
| 4 | no | no | bike |

We will decompose P(x,y) into class prior and class model:

$$P(x,y) = \underbrace{P(y)}_{\text{class prior}} \; \underbrace{P(x \mid y)}_{\text{class model}}$$

and estimate them separately as $\hat{P}(y)$ and $\hat{P}(x \mid y)$. (The class prior should not be confused with a parameter prior. They are very similar concepts, but not the same thing.)

We will then use our estimates to output a classifier using Bayes rule:

$$h(x) = \arg\max_y \hat{P}(y \mid x) = \arg\max_y \frac{\hat{P}(y)\,\hat{P}(x \mid y)}{\sum_{y'} \hat{P}(y')\,\hat{P}(x \mid y')} = \arg\max_y \hat{P}(y)\,\hat{P}(x \mid y)$$

To estimate our model using MLE, we can separately estimate the two parts of the model:

$$\log P(D) = \sum_i \log P(x_i, y_i) = \sum_i \left[ \log P(y_i) + \log P(x_i \mid y_i) \right] = \log P(D_Y) + \log P(D_X \mid D_Y)$$

Estimating $P(y)$

How do we estimate $P(y)$? This is very much like the biased coin, except instead of two outcomes we have 4 (walk, bike, bus, car). We need 4 parameters to represent this multinomial distribution (3 really, since they must sum to 1): $(\theta_{\text{walk}}, \theta_{\text{bike}}, \theta_{\text{bus}}, \theta_{\text{car}})$. The MLE estimate (deriving it is a good exercise) is $\hat{\theta}_y = \frac{1}{n}\sum_i \mathbf{1}(y_i = y)$.

| $y$ | parameter $\theta_y$ | MLE $\hat{\theta}_y$ |
|---|---|---|
| walk | $\theta_{\text{walk}}$ | 0.2 |
| bike | $\theta_{\text{bike}}$ | 0.4 |
| bus | $\theta_{\text{bus}}$ | 0.2 |
| car | $\theta_{\text{car}}$ | 0.2 |
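These priors can be reproduced by simple counting. Below is a minimal Python sketch (not part of the original notes) that transcribes the toy table above and computes $\hat{\theta}_y$; the later sketches reuse the `data` and `counts` variables defined here.

```python
from collections import Counter

# Toy transportation data from the table above:
# (distance in miles, raining, flat tire, mode)
data = [
    (1, "no", "no", "bike"), (2, "yes", "no", "walk"), (1, "no", "yes", "bus"),
    (1, "yes", "no", "walk"), (2, "yes", "no", "bus"), (1, "no", "no", "car"),
    (1, "yes", "yes", "bike"), (10, "yes", "no", "bike"), (10, "no", "no", "car"),
    (4, "no", "no", "bike"),
]

# MLE of the class prior: theta_y = (1/n) * sum_i 1(y_i = y)
n = len(data)
counts = Counter(mode for *_, mode in data)
prior = {y: c / n for y, c in counts.items()}
print(prior)  # {'bike': 0.4, 'walk': 0.2, 'bus': 0.2, 'car': 0.2}
```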

Estimating $P(x \mid y)$

Estimating the class models is much more difficult, since the joint distribution of the $m$ dimensions of $x$ can be very complex. Suppose that all the features are binary, like Raining or Flat Tire above. If we have $m$ features, there are $K \cdot 2^m$ possible values of $(x,y)$, and we cannot store or estimate such a distribution explicitly, as we did for $P(y)$. The key (naive) assumption of the model is conditional independence of the features given the class. Recall that $X_k$ is conditionally independent of $X_j$ given $Y$ if:

$$P(X_j = x_j \mid X_k = x_k, Y = y) = P(X_j = x_j \mid Y = y) \quad \forall x_j, x_k, y$$

or equivalently,

$$P(X_j = x_j, X_k = x_k \mid Y = y) = P(X_j = x_j \mid Y = y)\, P(X_k = x_k \mid Y = y) \quad \forall x_j, x_k, y$$

More generally, the Naive Bayes assumption is that:

$$\hat{P}(X \mid Y) = \prod_j \hat{P}(X_j \mid Y)$$
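For binary features this assumption makes a large difference in the number of parameters: a full table for $P(X \mid Y)$ would require $K(2^m - 1)$ parameters, while the factorized model requires only $Km$ (one Bernoulli parameter per feature per class).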

Hence the Naive Bayes classifier is simply:

$$\arg\max_y \hat{P}(Y = y \mid X) = \arg\max_y \hat{P}(Y = y) \prod_j \hat{P}(X_j \mid Y = y)$$

If the feature $X_j$ is discrete, like Raining, then we need to estimate $K$ distributions for it, one for each class: $P(X_j \mid Y = k)$. We have 4 parameters, $(\theta_{R \mid \text{walk}}, \theta_{R \mid \text{bike}}, \theta_{R \mid \text{bus}}, \theta_{R \mid \text{car}})$, denoting the probability of Raining = yes given the transportation taken. The MLE estimate (deriving it is also a good exercise) is $\hat{\theta}_{R \mid y} = \frac{\sum_i \mathbf{1}(R_i = \text{yes},\, y_i = y)}{\sum_i \mathbf{1}(y_i = y)}$. For example, $P(R \mid Y)$ is

| $y$ | parameter $\theta_{R \mid y}$ | MLE $\hat{\theta}_{R \mid y}$ |
|---|---|---|
| walk | $\theta_{R \mid \text{walk}}$ | 1 |
| bike | $\theta_{R \mid \text{bike}}$ | 0.5 |
| bus | $\theta_{R \mid \text{bus}}$ | 0.5 |
| car | $\theta_{R \mid \text{car}}$ | 0 |
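Continuing the earlier sketch (reusing `data` and `counts`), these conditional estimates come from counting rainy examples within each class:

```python
# MLE of P(Raining = yes | y): fraction of rainy examples within each class
rain_counts = Counter(mode for _, raining, _, mode in data if raining == "yes")
theta_rain = {y: rain_counts.get(y, 0) / counts[y] for y in counts}
print(theta_rain)  # {'bike': 0.5, 'walk': 1.0, 'bus': 0.5, 'car': 0.0}
```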

For a continuous variable like Distance, there are many possible choices of model, with the Gaussian being the simplest. We need to estimate $K$ distributions for each feature, one for each class: $P(X_j \mid Y = k)$. For example, $P(D \mid Y)$ is

| $y$ | parameters $\mu_{D \mid y}$, $\sigma_{D \mid y}$ | MLE $\hat{\mu}_{D \mid y}$ | MLE $\hat{\sigma}_{D \mid y}$ |
|---|---|---|---|
| walk | $\mu_{D \mid \text{walk}}$, $\sigma_{D \mid \text{walk}}$ | 1.5 | 0.5 |
| bike | $\mu_{D \mid \text{bike}}$, $\sigma_{D \mid \text{bike}}$ | 4 | 3.7 |
| bus | $\mu_{D \mid \text{bus}}$, $\sigma_{D \mid \text{bus}}$ | 1.5 | 0.5 |
| car | $\mu_{D \mid \text{car}}$, $\sigma_{D \mid \text{car}}$ | 5.5 | 4.5 |
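These are just the per-class mean and (MLE, i.e. population) standard deviation of Distance. A sketch continuing from the `data`, `prior`, and `theta_rain` variables defined earlier, followed by scoring a hypothetical new trip (Distance = 1 mile, Raining = yes; the Flat Tire feature is ignored here only to keep the example short):

```python
import math

# MLE Gaussian parameters of Distance within each class
gauss = {}
for y in counts:
    d = [float(dist) for dist, *_, mode in data if mode == y]
    mu = sum(d) / len(d)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in d) / len(d))  # MLE: divide by n
    gauss[y] = (mu, sigma)
print(gauss)  # e.g. gauss['bike'] == (4.0, ~3.67), gauss['car'] == (5.5, 4.5)

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Naive Bayes score P(y) * P(Raining | y) * P(Distance | y) for a new input
x_dist, x_rain = 1.0, "yes"
scores = {}
for y in prior:
    p_rain = theta_rain[y] if x_rain == "yes" else 1 - theta_rain[y]
    scores[y] = prior[y] * p_rain * normal_pdf(x_dist, *gauss[y])
print(max(scores, key=scores.get))  # 'walk' under these estimates
```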

Note that for words, instead of using a binary variable (word present or absent), we could also use the number of occurrences of the word in the document, for example by assuming a binomial distribution for each word.

MLE vs. MAP

Note the danger of using MLE estimates. For example, consider the estimate of the conditional probability of Raining = yes given car: $\hat{P}(\text{Raining} = \text{yes} \mid y = \text{car}) = 0$. So if we know it is raining, then no matter the distance, the estimated probability of taking the car is 0, which is not a good estimate. This is a general problem caused by scarcity of data: we simply never saw an example with car and raining. Using MAP estimation with Beta priors (with $\alpha, \beta > 1$), the estimates will never be zero, since additional "counts" are added.
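As a rough sketch of the effect (continuing the earlier code, with $\alpha = \beta = 2$ chosen only for illustration), the MAP estimate adds pseudo-counts to the observed counts:

```python
# MAP estimate of P(Raining = yes | y) with a Beta(alpha, beta) prior:
# the posterior mode adds (alpha - 1) "yes" and (beta - 1) "no" pseudo-counts.
alpha, beta = 2, 2
theta_rain_map = {
    y: (rain_counts.get(y, 0) + alpha - 1) / (counts[y] + alpha + beta - 2)
    for y in counts
}
print(theta_rain_map["car"])  # 0.25 instead of 0.0
```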

Examples

2-dimensional, 2-class example

Suppose our data is from two classes (plus and circle) in two dimensions ($x_1$ and $x_2$) and looks like this:


[Figure: 2D binary classification data]

The Naive Bayes classifier will estimate a Gaussian for each class and each dimension. We can visualize the estimated distribution by drawing a contour of the density. The decision boundary, where the probability of each class given the input is equal, is shown in red.


[Figure: 2D binary classification with Naive Bayes. A density contour is drawn for the Gaussian model of each class, and the decision boundary is shown in red.]

Text classification: bag-of-words representation

In classifying text documents, like news articles, emails, or web pages, the input is a very complex, structured object. Fortunately, for simple tasks like deciding spam vs. not spam, politics vs. sports, etc., a very simple representation of the input suffices. The standard way to represent a document is to completely disregard the order of the words in it and just consider their counts. So the email below might be represented as:


[Figure: bag-of-words model for text classification: degree=1, diploma=1, customized=1, deserve=1, fast=1, promptly=1, ...]

The Naive Bayes classifier then learns $\hat{P}(\text{spam})$, and $\hat{P}(\text{word} \mid \text{spam})$ and $\hat{P}(\text{word} \mid \text{ham})$ for each word in our dictionary, using MLE/MAP as above. It then predicts 'spam' if:

$$\hat{P}(\text{spam}) \prod_{\text{word} \in \text{email}} \hat{P}(\text{word} \mid \text{spam}) > \hat{P}(\text{ham}) \prod_{\text{word} \in \text{email}} \hat{P}(\text{word} \mid \text{ham})$$
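In practice the products over all the words in an email can underflow, so the comparison is usually done with log-probabilities. A minimal sketch, assuming the estimates have already been collected into dictionaries `p_word_spam` and `p_word_ham` (hypothetical names) and smoothed so that no probability is exactly zero:

```python
import math

def predict_spam(email_words, p_spam, p_ham, p_word_spam, p_word_ham):
    """Return True if the log-space Naive Bayes score favors spam.

    p_spam, p_ham          -- estimated class priors
    p_word_spam, p_word_ham -- dicts mapping word -> P(word | class), smoothed
    """
    log_spam, log_ham = math.log(p_spam), math.log(p_ham)
    for w in email_words:
        if w in p_word_spam and w in p_word_ham:  # skip out-of-dictionary words
            log_spam += math.log(p_word_spam[w])
            log_ham += math.log(p_word_ham[w])
    return log_spam > log_ham
```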
