Naive Bayes Classifier

A Naive Bayes Classifier is a supervised Machine Learning algorithm that predicts the class of new samples by applying Bayes’ Theorem.



Atop a Soapbox

If you have not already read my post on Bayes’ Theorem, I warn you that this is a sour subject for me. I have studied/taught Bayes’ Theorem for roughly 8 years now.

Every semester I would literally say the name Bayes only ONCE. All my students could work Bayes’ Formula and understood the problems that were labeled as Bayes’ Theorem, including on the final. Granted, this was a Finite Math course and the students didn’t per se need to know that it was Bayes. The point I’m trying to make is that I don’t even label it as Bayes in my head.

I have avoided Naive Bayes Classifier because my eyes roll every time I see the terminology that is used to describe it. My understanding is that this terminology comes from studying Bayes’ Theorem, but it never came up in any context for me until I encountered Naive Bayes Classifier.

I will attempt to pass on how I understand the Naive Bayes Classifier. I will tell you what these terms are, but I will not embrace them.


The Context

Imagine we have a collection of data: points, samples, what have you. In general, all of these individual samples will have the same kinds of features.

An example of a sample that has d features might be as follows.

\(\textbf{x}_{i}=(x_{i,1},x_{i,2},\cdots,x_{i,d})\)

I would call this the “i-th sample”.

We will generally have a collection of n samples and keep them nicely organized in a matrix. I will present the matrix in two ways: first in terms of the samples, and then in terms of the features. Our matrix will have n rows for the n samples and d columns for the d features.

\(\textbf{X}_{(n,d)}=
\begin{bmatrix}
-\textbf{x}_1- \\
-\textbf{x}_2- \\
\vdots \\
-\textbf{x}_n- \\
\end{bmatrix}
\)

\(
\textbf{X}_{(n,d)} =
\begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \cdots & x_{n,d}
\end{bmatrix}
\)

Now, these features can be anything. Namely, each one will be an integer, a real number, or text, and you can convey almost anything with text. The only thing we might demand of our matrix is that each row represents a single sample and each column represents the same feature across samples. That feature might be age (an integer), distance (a real number), or a feeling described in words (text). We just want everything to be neat and organized.

Now, all of these samples might correspond to certain categories, called class labels. We would represent the class labels of our above samples as follows. Note that we only consider a finite number of class labels.

\(\textbf{y}_{(n)}=(y_1,y_2,\cdots,y_n)\)

Where each class label is one of k possible class labels.

\(y_i \in \{C_1,C_2,\cdots,C_k\}\)

Note that \(\textbf{x}_{23}\) would be describing the features of the 23rd sample, and \(y_{23}\) would be describing the class label of the 23rd sample.

In general, we will assume that we always know all of the features of each sample. However, we may not always know the class labels of all our samples.
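To make the setup concrete, here is a minimal sketch (with entirely made-up data) of how such a feature matrix and label vector might look in code; the column names and values are purely hypothetical.

```python
import numpy as np
import pandas as pd

# A hypothetical feature matrix X with n = 4 samples and d = 3 features.
# The columns mix types: an integer, a real number, and text.
X = pd.DataFrame({
    "age":      [15, 42, 33, 27],                        # integer feature
    "distance": [3.2, 10.5, 7.1, 0.8],                   # real-valued feature
    "feeling":  ["happy", "tired", "happy", "anxious"],  # text feature
})

# Class labels y, one per sample, drawn from k = 2 possible labels.
y = np.array(["C1", "C2", "C1", "C1"])

print(X.shape)  # (n, d) = (4, 3)
print(y.shape)  # (n,)   = (4,)
```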


The Question

So the question is, “Given a sample’s features, can we predict its class label?” This can be described as a conditional probability statement.

\(P(C|\textbf{x})=?\)

Start with the conditional probability formula

\(P(C|\textbf{x})=\frac{P(C \cap \textbf{x})}{P(\textbf{x})}\)

Replacing the intersection using \(P(C \cap \textbf{x})=P(C)P(\textbf{x}|C)\) gives Bayes’ Theorem

\(P(C|\textbf{x})=\frac{P(C) P(\textbf{x}|C)}{P(\textbf{x})}\)
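As a quick sanity check with made-up numbers: if \(P(C)=0.3\), \(P(\textbf{x}|C)=0.5\), and \(P(\textbf{x})=0.25\), then

\[P(C|\textbf{x})=\frac{(0.3)(0.5)}{0.25}=0.6\]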


Math has meaning already … Why muddle it with new terms?

\(P(C|\textbf{x})=\frac{P(C) P(\textbf{x}|C)}{P(\textbf{x})}\)

There is already meaning to each of the four parts. I will describe them in the context of our Bayes Classifier.

\(P(C|\textbf{x})\)

The likelihood of a particular class label given a particular sample’s features

\(P(C)\)

The likelihood of a particular class label

\(P(\textbf{x}|C)\)

The likelihood that a sample has particular features given that the sample is of a particular class label

\(P(\textbf{x})\)

The likelihood that a sample has particular features

That’s what they mean. Sure we can expound on them in more detail, but there are not very many ways to describe the above in simpler terms.

For now please focus on how I have just described them, all in terms of a sample’s features, a class label, and their conditionals.

In the end, I will talk about the other ways of describing these and why I think they are unhelpful in the context of a Bayes Classifier.


A note on the Denominator

\(P(\textbf{x})\)

Believe it or not, we end up not caring what the likelihood of the particular sample’s features is. Allow me to explain.

What we want to do is select the class label for the sample that maximizes the probability.

So choose the maximum of the following

\(P(C_1|\textbf{x})=\frac{P(C_1)P(\textbf{x}|C_1)}{P(\textbf{x})}\)
\(P(C_2|\textbf{x})=\frac{P(C_2)P(\textbf{x}|C_2)}{P(\textbf{x})}\)
\(\vdots\)
\(P(C_k|\textbf{x})=\frac{P(C_k)P(\textbf{x}|C_k)}{P(\textbf{x})}\)

We will choose the class label according to which of the above is largest, given a particular sample’s features that stay the same for each. Since \(P(\textbf{x})\) remains constant for each, it has nothing to do with determining which is more likely. Only the numerator plays a part.

So what we are really investigating is the following

\(P(C|\textbf{x}) \propto P(C)P(\textbf{x}|C)\)
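A tiny sketch with hypothetical numbers makes the point: dividing every numerator by the same \(P(\textbf{x})\) rescales them all equally, so the winning class label does not change.

```python
# Hypothetical numerators P(C_j) * P(x | C_j) for k = 3 class labels.
numerators = {"C1": 0.02, "C2": 0.05, "C3": 0.01}

# P(x) is the same value no matter which class label we are testing.
p_x = sum(numerators.values())
posteriors = {c: v / p_x for c, v in numerators.items()}

# Both rankings pick the same class label.
print(max(numerators, key=numerators.get))   # C2
print(max(posteriors, key=posteriors.get))   # C2
```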


The numerator is where it’s at.

\(P(C)P(\textbf{x}|C)\)

This is the meat and potatoes here. Another way of looking at it is as follows.

\(P(C)P(x_1\cap x_2 \cap \cdots \cap x_d|C)\)

Which is really

\(P(C \cap x_1\cap x_2 \cap \cdots \cap x_d)\)

We are talking about particular features here. So the sample’s age is 15, the length of their right index finger is 9.4 cm, they are male, and they have 3 siblings.

This is very difficult to consider. We will look for an easier way to think about this through an example.

\(P(C \cap d \cap e \cap f)\)

This can be described with the following tree.

Assume our sample comes from a particular class, the first feature ends up being ‘d’, the second ‘e’, and the last ‘f’. Each branch after the particular class ends up being a conditional.
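Written out, that tree is just the chain rule; each branch conditions on everything that came before it.

\[P(C \cap d \cap e \cap f)=P(C)\,P(d|C)\,P(e|C \cap d)\,P(f|C \cap d \cap e)\]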

This is where the Naive in Naive Bayes Classifier comes in. We are about to assume that all the sample features are independent of each other, regardless of whether they are or not. Yes, we are going to be naive to the fact that features/data are often NOT independent.

Don’t worry we will be fine.

Now the tree diagram looks as follows. Everything is still occurring in the context of a particular class label, but it is greatly simplified.

So

\(P(C \cap d \cap e \cap f)=P(C)P(d|C)P(e|C)P(f|C)\)

So the numerator will be as follows

\(P(C)P(x_1|C)P(x_2|C) \cdots P(x_d|C)\)


Wrap it all up

So for samples that we don’t know the class label of, we will do the following.

Given a particular sample’s features, evaluate how likely it is to belong to each of the class labels. Whichever produces the greatest likelihood, assign that class label to the sample.

\(y={arg\,max}_{C \in\{C_1,C_2,\cdots,C_k\}}P(C|\textbf{x})\)

Which we know is the same as

\(y={arg\,max}_{C \in\{C_1,C_2,\cdots,C_k\}}\frac{P(C)P(\textbf{x}|C)}{P(\textbf{x})}\)

Since the sample’s features will remain constant while testing each class label, this evaluation will be the same as

\(y={arg\,max}_{C \in\{C_1,C_2,\cdots,C_k\}}P(C)P(\textbf{x}|C)\)

And assuming all sample features are independent of each other, we have

\(y={arg\,max}_{C \in\{C_1,C_2,\cdots,C_k\}}P(C)P(x_1|C)P(x_2|C) \cdots P(x_d|C)\)
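As a rough sketch of that decision rule (not tied to any particular library; the function and data structures here are just illustrative), assuming we already have the individual probabilities in hand:

```python
def predict(x, priors, likelihood):
    """Pick the class label C that maximizes P(C) * P(x_1|C) * ... * P(x_d|C).

    priors:     dict mapping each class label to P(C)
    likelihood: function (feature_index, feature_value, class_label) -> P(x_l | C)
    """
    best_label, best_score = None, -1.0
    for c, p_c in priors.items():
        score = p_c
        for l, x_l in enumerate(x):
            score *= likelihood(l, x_l, c)  # naive independence assumption
        if score > best_score:
            best_label, best_score = c, score
    return best_label
```

One practical note: products of many small probabilities can underflow to zero, so real implementations usually sum logarithms instead; the decision rule is otherwise the same.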

All we have to do is determine how to calculate the individual \(P(C)\) and the individual \(P(x|C)\).

So

\(P(C_j)\)
\(P(x_l|C_j)\)

Where

\(C_j \in \{C_1,C_2,\cdots,C_k\}\)
\(x_l \in \{x_1,x_2,\cdots,x_d\}\)

\(P(C_j)\): The likelihood of a particular class label. This is purely simple probability.

From the samples whose class labels we know and want to use to create our model, we can find the individual likelihoods as follows.

\(P(C_j)=\frac{n(C_j)}{n(C_1)+n(C_2)+\cdots+n(C_k)}\)
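A minimal sketch of that count, using a hypothetical label vector:

```python
from collections import Counter

# Hypothetical class labels for the samples we know.
y = ["C1", "C2", "C1", "C3", "C1", "C2"]

counts = Counter(y)                        # n(C_j) for each class label
n = sum(counts.values())                   # n(C_1) + n(C_2) + ... + n(C_k)
priors = {c: counts[c] / n for c in counts}

print(priors)  # {'C1': 0.5, 'C2': 0.333..., 'C3': 0.166...}
```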

\(P(x_l|C_j)\): The likelihood of a particular feature outcome given that the sample is from a particular class label. This one is a bit of an animal all on its own. There are three major ways that this can be done.

  1. Normal Distributions (aka Gaussian)
  2. Binomials
  3. Multinomials

I will “briefly” describe them here. I will make individual posts on each shortly.

Normal Distributions

This is used when a feature is a real number. We will simply capture the mean and standard deviation of the feature separately for each of the class labels and then use the probability density function to get probabilities of features.
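A minimal sketch of that idea, with hypothetical per-class statistics for a single real-valued feature:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation at x."""
    coeff = 1.0 / (std * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2 * std ** 2))

# Hypothetical mean and standard deviation of the feature within class C1.
mean_c1, std_c1 = 9.1, 1.3

# Used in place of P(x_l | C1) for a sample whose feature value is 9.4.
print(gaussian_pdf(9.4, mean_c1, std_c1))
```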

Binomials

This is used when a feature is boolean: male/female, yes/no, present/absent, whatever. Note that this should not be confused with a classic Binomial Distribution. This is often used to classify text documents, where particular words are either present or absent in the document.
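A minimal sketch of how such a present/absent probability might be estimated; the word, the class, and the documents are all made up:

```python
# Hypothetical training data: for each document labeled "spam",
# record whether the word "offer" is present (True) or absent (False).
word_present_in_spam = [True, True, False, True, False]

# Estimate P(word present | spam) as a simple relative frequency.
p_present_given_spam = sum(word_present_in_spam) / len(word_present_in_spam)
p_absent_given_spam = 1 - p_present_given_spam

print(p_present_given_spam)  # 0.6
print(p_absent_given_spam)   # 0.4
```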

Multinomials

Again, like with Binomials, this is not the classic Multinomial Distribution. Here a feature would be a count of the number of times an event occurs. This could be the number of children, car accidents, servings of alcohol, again whatever. Just like with Binomials, this can be used when the sample is a text document; rather than just determining whether a word is present or not, now we are interested in the number of times it appears.
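And a minimal sketch for the count-based case, again with made-up word counts:

```python
from collections import Counter

# Hypothetical total word counts across all training documents labeled "spam".
word_counts_spam = Counter({"offer": 30, "meeting": 5, "free": 65})
total = sum(word_counts_spam.values())

# Estimate P(word | spam) as that word's share of all words seen in the class.
p_word_given_spam = {w: c / total for w, c in word_counts_spam.items()}

print(p_word_given_spam["free"])  # 0.65
```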


The terms that are used

Again I will state that I am not a statistician. I might have a different insight into the terms I’m about to show you if I were, but I am skeptical.

These terms are the convention that is used, so you and I had better learn them. When I am going to a talk or reading something that has to do with a Bayes Classifier, I take a note card with me with the following.

\[P(C|\textbf{x})=\frac{P(C) P(\textbf{x}|C)}{P(\textbf{x})}\]

\[
\begin{array}{ll}
\text{Posterior} & P(C|\textbf{x}) \\
\text{Prior} & P(C) \\
\text{Likelihood} & P(\textbf{x}|C) \\
\text{Evidence} & P(\textbf{x}) \\
\end{array}
\]

If you have read any of my other posts that involve probabilities, then you probably know that I use the word likelihood a lot. If you have taken a Statistics class, read a book on Probability Theory, or have any familiarity with the English language, you have encountered the word “likelihood.” Each of the four pieces above can be described as a “likelihood.”

Without going into the rest of the above descriptors, there is one big problem with these labels.

THEY ARE COMPLETELY ARBITRARY. Read on to find out why.


Articles of Arbitration

This is the main reason I have problems with Posterior, Prior, Likelihood, and Evidence.

Class labels are really just sample features. Yes, class labels are generally considered hard to get at. You might not always know whether a cancer is malignant, an e-mail is spam, or someone will default on a loan. However, there is nothing mathematically special about them.

Consider samples that just have two features, a class label and another feature. We can get even more specific. Let’s say our class label is Yearly Household income (\(\leq 100k,\leq 200k, >200k\)). Let’s say our sample feature is the number of children present (0,1,2,3+).

So

\(P(C|x)=\frac{P(C)P(x|C)}{P(x)}\)

\(C \in \{C_{\leq 100k},C_{\leq 200k},C_{>200k}\}\)
\(x \in \{0,1,2,3+\}\)

Let’s do some algebraic manipulation here; yes, this is a very legal thing to do.

Exchange right and left hand sides.

\(\frac{P(C)P(x|C)}{P(x)}=P(C|x)\)

Multiply both sides by \(P(x)\)

\(P(C)P(x|C)=P(x)P(C|x)\)

Note that both sides are equal to \(P(x \cap C)\).

Divide both sides by \(P(C)\)

\(P(x|C)=\frac{P(x)P(C|x)}{P(C)}\)

This would be as if our sample feature is household income and we are using that knowledge to predict the likelihood of how many children are in the household, the class label.

Viewed like this, we should swap what we are calling Posterior, Prior, Likelihood, and Evidence. First, swap Posterior and Likelihood. After that, swap Prior and Evidence.

I will continue to carry the note card with me.