Naive Bayes Classifier: Text Binomial TechTC-100 (Updated)

This is my first attempt at text classification. I wanted to make it “Binomial”, and it turned out better than I was expecting.



Updated

This is an update to my Binomial Naive Bayes Classifier post, Link. A better way to handle Likelihoods of zero occurred to me. I decided to keep the old post as a record of my mistake.


TechTC-100 Text Categorization Datasets

I don’t even remember how I ended up finding these data sets.

Website: http://techtc.cs.technion.ac.il/techtc100/techtc100.html

Citation:
Evgeniy Gabrilovich and Shaul Markovitch
“Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5”
The 21st International Conference on Machine Learning (ICML), pp. 321-328, Banff, Alberta, Canada, July 2004

For simplicity, I used their preprocessed feature vectors. The data set I used is as follows (the second one).

id1: 6920 id2: 8366.

A likely future post will cover the other 99 data sets.


Binomial Woes

If you have read my general post on the Naive Bayes Classifier, then please recall that I mentioned that this is not going to be the classic Binomial Distribution. We are just using Binomial to mean True/False, On/Off, 0/1, things like that. In the case of text, that means whether a particular feature (word) is present or absent.

So let’s talk about the Likelihood because that is where all the business is happening.

The probability of a particular feature given a particular class label.

\(P(x_i|C_j)\)

Since we are talking “Binomial” here, we could consider the Likelihood as follows

\(x_i:\text{ word is present}\)
\(\overline{x_i}:\text{ word is absent}\)

So

\(P(x_i|C_j) \text{ vs. }P(\overline{x_i}|C_j)\)

where each is the number of samples in that class with (or without) the word, over the total number of samples in that class:

\(P(x_i|C_j)=\frac{n(x_i \cap C_j)}{n(C_j)}\)
\(P(\overline{x_i}|C_j)=\frac{n(\overline{x_i} \cap C_j)}{n(C_j)}\)

Note from the complement formula that we have the following.

\(P(\overline{x_i}|C_j)=1-P(x_i|C_j)\)

Yes, I know I just presented \(\overline{x_i}\) as if it were something I was going to use for the rest of the post. Yep, I am tossing it out the window. It will be fine.

We are going to take advantage of the fact that our features are boolean (0/1) as follows.

\(x_iP(x_i|C_j)+(1-x_i)[1-P(x_i|C_j)]\)

If the \(i^{th}\) feature is present then \(x_i=1\) and \((1-x_i)=0\), then the expression will be …

\(P(x_i|C_j)\)

If the \(i^{th}\) feature is absent then \(x_i=0\) and \((1-x_i)=1\), then the expression will be …

\([1-P(x_i|C_j)]\)

Handy that
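
To see the selector trick in action, here is a tiny NumPy sketch (the values of x and p are just made up for illustration):

```python
import numpy as np

# The selector trick: where x is 1 we keep p, where x is 0 we keep 1 - p.
# (x and p are made-up values for illustration.)
x = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.7])

print(x * p + (1 - x) * (1 - p))   # [0.9 0.8 0.6 0.3]
```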

But you might be thinking, “But Cramer, why do this convoluted thing?”

MathS is why, well vectorization is why, oh and Log.

We get to take advantage of matrix multiplication.

Recall

\(
\textbf{X}_{(n,d)} =
\begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \cdots & x_{n,d}
\end{bmatrix}
\)

\(n: \text{ number of samples}\)
\(d: \text{ number of features}\)

And let’s treat everything else as a Matrix as well

\(\textbf{P(x|C)}_{(Cn,d)} \text{ Likelihood}\)

\(\textbf{P(C|x)}_{(n,Cn)} \text{ Posterior}\)

\(\textbf{P(C)}_{(n,Cn)} \text{ * Prior}\)

\(Cn: \text{ number of class labels}\)

\(\text{* }\) Ok, the Prior will not really have n rows and Cn columns. The Prior will really be a vector of length Cn, but Python will broadcast it so that its Log can be added to each row of the Posterior.

So

\(\textbf{P(C|x)}_{(n,Cn)}=Log(\textbf{P(C)}_{(n,Cn)})+\textbf{X}_{(n,d)}[Log(\textbf{P(x|C)}_{(Cn,d)})]^t+(1-\textbf{X}_{(n,d)})[Log(1-\textbf{P(x|C)}_{(Cn,d)})]^t\)

Not too bad … but that Likelihood … sigh
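
Here is a minimal NumPy sketch of that Posterior equation. The array names and toy values are my own, and the Likelihood matrix here is just made up; how I actually build it comes in the next section.

```python
import numpy as np

# A sketch of the vectorized Log-Posterior above.
# X:          (n, d)  boolean feature matrix
# likelihood: (Cn, d) Likelihood matrix P(x|C)
# prior:      (Cn,)   Prior vector P(C); NumPy broadcasts it across the n rows
def log_posterior(X, likelihood, prior):
    return (np.log(prior)
            + X @ np.log(likelihood).T
            + (1 - X) @ np.log(1 - likelihood).T)

# Toy example: n=2 samples, d=3 features, Cn=2 classes (values made up)
X = np.array([[1, 0, 1],
              [0, 1, 0]])
likelihood = np.array([[0.8, 0.1, 0.6],
                       [0.2, 0.7, 0.3]])
prior = np.array([0.5, 0.5])

post = log_posterior(X, likelihood, prior)  # shape (n, Cn)
pred = post.argmax(axis=1)                  # row-wise argmax picks the class
print(post)
print(pred)
```

The row-wise argmax is all we need for classification, which is why working in Log space is fine.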


Handling Likelihoods of Zero

There is a problem in Naive Bayes and text, dealing with Zero.

Example: “Umami”

Not every text document is likely to have the word “Umami” in it. Further, a data set might not have the word “Umami” appear in every class.

Consider a data set that has three class labels. It could happen that, when you were building the model, two of the class labels had “Umami” but the third did not. Now suppose a new text document of this third class happens to have the word “Umami”. This is problematic.

\(P(x_{Umami}|C_3)=0\)

Since calculating the Posterior involves multiplication, it would crash that calculation to 0.

\(P(C)P(x_1|C) \cdots P(x_{Umami}|C) \cdots P(x_d|C)\)

Even under Log, this is problematic.

\(Log(0) \text{ is undefined}\)

and

\(\lim_{x \to 0+} Log(x) = -\infty\)

So the way we said we were going to calculate Likelihoods is going to be problematic.

\(P(x_i|C_j)=\frac{n(x_i \cap C_j)}{n(C_j)}\)

So let’s change it. Nothing really says that we have to calculate Likelihoods in one particular way; it just needs to make sense.

Let’s just add 1

But how to illustrate …

A sum vector

\(n(\textbf{x} \cap \textbf{C}_j)_{(1,d)}=[n(x_1 \cap C_j) \;\; n(x_2 \cap C_j) \;\; \cdots \;\; n(x_d \cap C_j)]\)

Now let’s add 1 to each element of this sum vector. We will add a wiggle on top to differentiate it

\(\widetilde{n(\textbf{x} \cap \textbf{C}_j)}_{(1,d)}=[n(x_1 \cap C_j)+1 \;\; n(x_2 \cap C_j)+1 \;\; \cdots \;\; n(x_d \cap C_j)+1]\)

Since this has the feel of adding one to the numerator, it does not feel right to use \(n(C_j)\) for the denominator. Let’s just use the sum of this new wiggle vector as the denominator.

Then we could express our Likelihood matrix as follows. Note that this particular data set has two class labels.

\(
\textbf{P(x|C)}_{(Cn,d)}=
\begin{bmatrix}
\widetilde{n(\textbf{x} \cap \textbf{C}_1)}_{(1,d)}\frac{1}{\sum_{i=1}^{d}[n(x_i \cap C_1)+1]} \\
\widetilde{n(\textbf{x} \cap \textbf{C}_2)}_{(1,d)}\frac{1}{\sum_{i=1}^{d}[n(x_i \cap C_2)+1]}
\end{bmatrix}
\)

Voila, no more probabilities of zero.
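
Here is a minimal NumPy sketch of this “add 1” Likelihood matrix (the function name and toy arrays are my own):

```python
import numpy as np

# A sketch of the "add 1" Likelihood matrix above, assuming X is the (n, d)
# boolean feature matrix and y holds each sample's class label.
def binomial_likelihood(X, y, classes):
    rows = []
    for c in classes:
        wiggle = X[y == c].sum(axis=0) + 1   # wiggle vector: n(x_i ∩ C_j) + 1
        rows.append(wiggle / wiggle.sum())   # denominator: sum of the wiggle vector
    return np.vstack(rows)                   # shape (Cn, d), no zeros anywhere

# Toy example: 4 samples, 3 features, 2 classes (values made up)
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])
y = np.array([0, 0, 1, 1])
print(binomial_likelihood(X, y, classes=[0, 1]))
```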


Feature Selection

This section is a remnant of the old post mentioned at the top. The algorithm as I have laid it out thus far, as you will see, performs quite well with no feature selection.

Four Feature Selections (a code sketch follows the list):

  1. “Raw” Features: No Feature Selection
  2. At Least Two: Take a feature if it appears in at least two samples
  3. At Least Five: … appears in at least five samples
  4. At Least Ten: … appears in at least ten samples
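
Here is a rough sketch of how such a selection could be done, assuming X is the (n, d) boolean feature matrix (names and toy data are my own):

```python
import numpy as np

# Keep only the features that appear in at least min_samples samples.
def select_features(X, min_samples):
    keep = X.sum(axis=0) >= min_samples   # in how many samples each feature appears
    return X[:, keep], keep

# Toy example: the third feature appears in only one sample
X = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 0]])
X_two, kept = select_features(X, min_samples=2)   # "At Least Two"
print(kept)   # [ True  True False]
```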


Results and Python Files

I was really happy with the results. I ran my code 1500 times for each feature selection, randomizing what I used for the training and test sets each time.

Here are the means of the percent accuracy and the standard deviations. I will note that since this was such a small data set, only 140 samples, the variance (standard deviation) is wide.

  1. “Raw” Features: \(\mu=92.3\) , \(\sigma=3.9\)
  2. At Least Two: \(\mu=93.4\) , \(\sigma=3.6\)
  3. At Least Five: \(\mu=94.1\) , \(\sigma=3.5\)
  4. At Least Ten: \(\mu=94.3\) , \(\sigma=3.3\)

Ignoring the standard deviation for a second: except for “Raw” Features, all of these results were better than all the results found on the website where I found the data. Granted, they did not use Naive Bayes, and I also think they were just showing the results they got as a reference for how “hard” each of the data sets is to classify.

Either way, I am very happy.

I split the work into two files. One Python file creates text files of the feature matrices (class label, boolean matrix, and integer matrix). The other file does the “Binomial” Naive Bayes.
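
Roughly, one of those 1500 runs might look like the sketch below. The 70/30 split ratio and the helper names are my own assumptions, not pulled from the actual files, and it reuses the sketches from earlier in the post.

```python
import numpy as np

# A sketch of a single run: randomly split the samples into training and test
# sets, fit the "Binomial" Naive Bayes, and score it.
# log_posterior and binomial_likelihood are the sketch functions from above.
def one_run(X, y, classes, train_frac=0.7, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.permutation(X.shape[0])
    n_train = int(train_frac * len(idx))
    train, test = idx[:n_train], idx[n_train:]

    prior = np.array([(y[train] == c).mean() for c in classes])
    likelihood = binomial_likelihood(X[train], y[train], classes)
    pred = log_posterior(X[test], likelihood, prior).argmax(axis=1)
    return (np.array(classes)[pred] == y[test]).mean()   # accuracy as a fraction

# accuracies = [one_run(X, y, classes=[0, 1]) for _ in range(1500)]
# print(np.mean(accuracies), np.std(accuracies))
```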

Note that the updated version goes along with this post.

Code

Enjoy