Naive Bayes Model
The Naive Bayes classifier is an example of the generative approach: we will model $P(x,y)$. Consider the toy transportation data below:
The first three columns are the inputs/features/attributes $x$; the last column is the class $y$.

| Distance (miles) | Raining | Flat Tire | Mode |
|---|---|---|---|
| 1 | no | no | bike |
| 2 | yes | no | walk |
| 1 | no | yes | bus |
| 1 | yes | no | walk |
| 2 | yes | no | bus |
| 1 | no | no | car |
| 1 | yes | yes | bike |
| 10 | yes | no | bike |
| 10 | no | no | car |
| 4 | no | no | bike |
We will decompose $P(x,y)$ into a class prior and a class model:

$$P(x,y) = \underbrace{P(y)}_{\text{class prior}} \; \underbrace{P(x \mid y)}_{\text{class model}}$$
and estimate them separately as $\hat{P}(y)$ and $\hat{P}(x \mid y)$. (The class prior should not be confused with a parameter prior; the two are related concepts, but not the same thing.)
We will then use our estimates to form a classifier via Bayes' rule:

$$h(x) = \arg\max_y \hat{P}(y \mid x) = \arg\max_y \frac{\hat{P}(y)\,\hat{P}(x \mid y)}{\sum_{y'} \hat{P}(y')\,\hat{P}(x \mid y')} = \arg\max_y \hat{P}(y)\,\hat{P}(x \mid y)$$
To fit the model by MLE, we can estimate the two parts separately, since the log-likelihood decomposes:

$$\log P(D) = \sum_i \log P(x_i, y_i) = \sum_i \left[\log P(y_i) + \log P(x_i \mid y_i)\right] = \log P(D_Y) + \log P(D_X \mid D_Y)$$
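To make the pieces concrete, here is a minimal sketch of the resulting decision rule, assuming we already have estimates of $\hat{P}(y)$ and $\hat{P}(x \mid y)$; the names `class_prior` and `class_model` are placeholders for the estimates derived in the rest of this section.

```python
# Minimal sketch of h(x) = argmax_y P̂(y) * P̂(x | y).
# `class_prior` (dict) and `class_model` (function) are hypothetical stand-ins
# for the estimates P̂(y) and P̂(x | y) constructed below.
def predict(x, classes, class_prior, class_model):
    scores = {y: class_prior[y] * class_model(x, y) for y in classes}
    return max(scores, key=scores.get)
```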
Estimating $P(y)$
How do we estimate $P(y)$? This is very much like the biased coin, except that instead of two outcomes we have 4 (walk, bike, bus, car). We need 4 parameters to represent this multinomial distribution (3 really, since they must sum to 1): $(\theta_{\text{walk}}, \theta_{\text{bike}}, \theta_{\text{bus}}, \theta_{\text{car}})$. The MLE estimate (deriving it is a good exercise) is

$$\hat{\theta}_y = \frac{1}{n} \sum_i \mathbb{1}(y_i = y).$$
| $y$ | parameter $\theta_y$ | MLE $\hat{\theta}_y$ |
|---|---|---|
| walk | $\theta_{\text{walk}}$ | 0.2 |
| bike | $\theta_{\text{bike}}$ | 0.4 |
| bus | $\theta_{\text{bus}}$ | 0.2 |
| car | $\theta_{\text{car}}$ | 0.2 |
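As a sanity check, a short sketch that reproduces the counts above from the Mode column of the toy table:

```python
from collections import Counter

# MLE of the class prior: count each class and divide by the number of examples.
labels = ["bike", "walk", "bus", "walk", "bus",
          "car", "bike", "bike", "car", "bike"]  # Mode column, in row order
counts = Counter(labels)
theta_hat = {y: c / len(labels) for y, c in counts.items()}
print(theta_hat)  # {'bike': 0.4, 'walk': 0.2, 'bus': 0.2, 'car': 0.2}
```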
Estimating $P(x \mid y)$
Estimating class models is much more difficult, since the joint distribution of the $m$ dimensions of $x$ can be very complex. Suppose all the features are binary, like Raining or Flat Tire above. With $m$ features there are $K \cdot 2^m$ possible values of $(x, y)$, and we cannot store or estimate such a distribution explicitly the way we did for $P(y)$. The key (naive) assumption of the model is conditional independence of the features given the class. Recall that $X_j$ is conditionally independent of $X_k$ given $Y$ if:

$$P(X_j = x_j \mid X_k = x_k, Y = y) = P(X_j = x_j \mid Y = y), \quad \forall x_j, x_k, y$$
or equivalently,
$$P(X_j = x_j, X_k = x_k \mid Y = y) = P(X_j = x_j \mid Y = y)\, P(X_k = x_k \mid Y = y), \quad \forall x_j, x_k, y$$
More generally, the Naive Bayes assumption is that:
$$\hat{P}(X \mid Y) = \prod_j \hat{P}(X_j \mid Y)$$
Hence the Naive Bayes classifier is simply:
$$\arg\max_y \hat{P}(Y = y \mid X) = \arg\max_y \hat{P}(Y = y) \prod_j \hat{P}(X_j \mid Y = y)$$
If the feature $X_j$ is discrete, like Raining, then we need to estimate $K$ distributions for it, one for each class: $P(X_j \mid Y = k)$. We have 4 parameters, $(\theta_{R \mid \text{walk}}, \theta_{R \mid \text{bike}}, \theta_{R \mid \text{bus}}, \theta_{R \mid \text{car}})$, denoting the probability of Raining = yes given the mode of transportation taken. The MLE estimate (deriving it is also a good exercise) is

$$\hat{\theta}_{R \mid y} = \frac{\sum_i \mathbb{1}(R_i = \text{yes},\, y_i = y)}{\sum_i \mathbb{1}(y_i = y)}.$$

For example, $P(R \mid Y)$ is
| $y$ | parameter $\theta_{R \mid y}$ | MLE $\hat{\theta}_{R \mid y}$ |
|---|---|---|
| walk | $\theta_{R \mid \text{walk}}$ | 1 |
| bike | $\theta_{R \mid \text{bike}}$ | 0.5 |
| bus | $\theta_{R \mid \text{bus}}$ | 0.5 |
| car | $\theta_{R \mid \text{car}}$ | 0 |
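The same counting argument in code, using the (Raining, Mode) columns of the toy table:

```python
from collections import defaultdict

# MLE of P̂(Raining = yes | y): per class, the fraction of rows with Raining = yes.
rows = [("no", "bike"), ("yes", "walk"), ("no", "bus"), ("yes", "walk"),
        ("yes", "bus"), ("no", "car"), ("yes", "bike"), ("yes", "bike"),
        ("no", "car"), ("no", "bike")]  # (Raining, Mode) columns, in row order

yes, total = defaultdict(int), defaultdict(int)
for raining, mode in rows:
    total[mode] += 1
    yes[mode] += (raining == "yes")

theta_R = {y: yes[y] / total[y] for y in total}
print(theta_R)  # {'bike': 0.5, 'walk': 1.0, 'bus': 0.5, 'car': 0.0}
```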
For a continuous variable like Distance, there are many possible choices of model, with the Gaussian being the simplest. Again we need to estimate $K$ distributions for each such feature, one for each class: $P(X_j \mid Y = k)$. For example, $P(D \mid Y)$ is
| $y$ | parameters $\mu_{D \mid y}$, $\sigma_{D \mid y}$ | MLE $\hat{\mu}_{D \mid y}$ | MLE $\hat{\sigma}_{D \mid y}$ |
|---|---|---|---|
| walk | $\mu_{D \mid \text{walk}}$, $\sigma_{D \mid \text{walk}}$ | 1.5 | 0.5 |
| bike | $\mu_{D \mid \text{bike}}$, $\sigma_{D \mid \text{bike}}$ | 4 | 3.7 |
| bus | $\mu_{D \mid \text{bus}}$, $\sigma_{D \mid \text{bus}}$ | 1.5 | 0.5 |
| car | $\mu_{D \mid \text{car}}$, $\sigma_{D \mid \text{car}}$ | 5.5 | 4.5 |
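The numbers in the table are the per-class sample mean and the (biased, divide-by-$n$) standard deviation, which are the Gaussian MLEs; a short sketch using the (Distance, Mode) columns:

```python
import math
from collections import defaultdict

# Gaussian MLE per class: sample mean and the divide-by-n standard deviation.
rows = [(1, "bike"), (2, "walk"), (1, "bus"), (1, "walk"), (2, "bus"),
        (1, "car"), (1, "bike"), (10, "bike"), (10, "car"), (4, "bike")]

by_class = defaultdict(list)
for dist, mode in rows:
    by_class[mode].append(dist)

for y, ds in by_class.items():
    mu = sum(ds) / len(ds)
    sigma = math.sqrt(sum((d - mu) ** 2 for d in ds) / len(ds))
    print(y, round(mu, 1), round(sigma, 1))
# bike 4.0 3.7 | walk 1.5 0.5 | bus 1.5 0.5 | car 5.5 4.5
```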
Note that for words, instead of using a binary variable (word present or absent), we could also use the number of occurrences of the word in the document, for example by assuming a binomial distribution for each word.
MLE vs. MAP
Note the danger of using MLE estimates. For example, consider the estimate of the conditional distribution of Raining = yes: $\hat{P}(\text{Raining} = \text{yes} \mid y = \text{car}) = 0$. So if we know it is raining then, no matter the distance, the estimated probability of taking the car is 0, which is not a good estimate. This is a general problem caused by scarcity of data: we simply never saw an example with car and raining. Using MAP estimation with Beta priors (with $\alpha, \beta > 1$), the estimates will never be zero, since additional "counts" are added.
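For a Bernoulli parameter with a Beta($\alpha$, $\beta$) prior (with $\alpha, \beta > 1$), the MAP estimate is the posterior mode $(k + \alpha - 1)/(n + \alpha + \beta - 2)$, where $k$ is the number of "yes" observations out of $n$. A small sketch; with $\alpha = \beta = 2$ this is exactly add-one (Laplace) smoothing:

```python
def map_estimate(k, n, alpha=2.0, beta=2.0):
    """Posterior mode of a Bernoulli parameter under a Beta(alpha, beta) prior."""
    return (k + alpha - 1) / (n + alpha + beta - 2)

# MLE gave P̂(Raining = yes | car) = 0/2 = 0; the MAP estimate is nonzero.
print(map_estimate(0, 2))  # 0.25
```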
Examples
2-dimensional, 2-class example
Suppose our data is from two classes (plus and circle) in two dimensions (x1 and x2) and looks like this:
The Naive Bayes classifier will estimate a Gaussian for each class and each dimension. We can visualize the estimated distributions by drawing a contour of each class's density. The decision boundary, where the two classes have equal probability given the input, is shown in red.

[Figure: 2D binary classification with Naive Bayes. A density contour is drawn for the Gaussian model of each class, and the decision boundary is shown in red.]
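A model like this can be fit with, for example, scikit-learn's `GaussianNB`, which estimates one Gaussian per class and per dimension as described above (the 2-D data below is synthetic, invented purely for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two synthetic 2-D classes (invented data, just to illustrate the model).
rng = np.random.default_rng(0)
X_plus = rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.5], size=(50, 2))
X_circle = rng.normal(loc=[2.0, 2.0], scale=[0.5, 1.0], size=(50, 2))
X = np.vstack([X_plus, X_circle])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)
print(clf.theta_)             # per-class, per-dimension means
print(clf.predict([[1.0, 1.0]]))
```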
Text classification: bag-of-words representation
In classifying text documents, like news articles, emails, or web pages, the input is a very complex, structured object. Fortunately, for simple tasks like deciding spam vs. not spam, politics vs. sports, etc., a very simple representation of the input suffices. The standard way to represent a document is to completely disregard the order of the words in it and just consider their counts. So the email below might be represented as:

[Figure: bag-of-words model for text classification: degree=1, diploma=1, customized=1, deserve=1, fast=1, promptly=1, ...]
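Building the bag-of-words representation is just counting tokens; a tiny sketch (the email text here is invented):

```python
from collections import Counter
import re

# Bag-of-words: drop word order, keep counts.
email = "You deserve a customized diploma. Get your degree fast, shipped promptly!"
bag = Counter(re.findall(r"[a-z]+", email.lower()))
print(bag["degree"], bag["diploma"], bag["promptly"])  # 1 1 1
```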
The Naive Bayes classifier then learns $\hat{P}(\text{spam})$, and $\hat{P}(\text{word} \mid \text{spam})$ and $\hat{P}(\text{word} \mid \text{ham})$ for each word in our dictionary, using MLE/MAP as above. It then predicts 'spam' if:
$$\hat{P}(\text{spam}) \prod_{\text{word} \in \text{email}} \hat{P}(\text{word} \mid \text{spam}) > \hat{P}(\text{ham}) \prod_{\text{word} \in \text{email}} \hat{P}(\text{word} \mid \text{ham})$$
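In practice the comparison is done in log space to avoid numerical underflow from multiplying many small probabilities. A sketch, assuming the estimates are stored in dictionaries and have already been smoothed so that no probability is zero:

```python
import math

def is_spam(words, p_spam, p_word_spam, p_word_ham):
    """Compare log P̂(spam) + sum of log P̂(word | spam) against the 'ham' score."""
    spam_score = math.log(p_spam) + sum(math.log(p_word_spam[w]) for w in words)
    ham_score = math.log(1 - p_spam) + sum(math.log(p_word_ham[w]) for w in words)
    return spam_score > ham_score
```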