Naive Bayes Classifier: Text Binomial(ish) TechTC-100

This is my first attempt at text classification. I wanted it to be “Binomial”, but I had to really force it. The results ended up being good, so my forcing can’t be completely bad.



TechTC-100 Text Categorization Datasets

I don’t even remember how I ended up finding these data sets.

Website: http://techtc.cs.technion.ac.il/techtc100/techtc100.html

Citation:
Evgeniy Gabrilovich and Shaul Markovitch
“Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5”
The 21st International Conference on Machine Learning (ICML), pp. 321-328, Banff, Alberta, Canada, July 2004

For simplicity, I used their preprocessed feature vectors. The data set I used is the second one listed:

id1: 6920 id2: 8366.


Binomial Woes

If you have read my general post on the Naive Bayes Classifier, then recall that I mentioned this is not going to be the classic Binomial Distribution. We are just using Binomial to mean True/False, On/Off, 0/1, things like that. In the case of text, we use whether a particular feature (word) is present or absent.

So let’s talk about the Likelihood because that is where all the business is happening.

The probability of a particular feature given a particular class label.

\(P(x_i|C_j)\)

Since we are talking “Binomial” here, we could consider the Likelihood as follows

\(x_i:\text{ word is present}\)
\(\overline{x_i}:\text{ word is absent}\)

So

\(P(x_i|C_j) \text{ vs. }P(\overline{x_i}|C_j)\)

Where each would be the number of samples in that class containing (or not containing) the word, over the total number of samples in that class

\(P(x_i|C_j)=\frac{n(x_i \cap C_j)}{n(C_j)}\)
\(P(\overline{x_i}|C_j)=\frac{n(\overline{x_i} \cap C_j)}{n(C_j)}\)
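
Here is a minimal sketch of how those per-class counts could be computed all at once in numpy. The names (X_bool, y, and the function itself) are mine for illustration, not from my actual files.

    import numpy as np

    def bernoulli_likelihoods(X_bool, y):
        # P(x_i | C_j) for every feature i and class j
        # X_bool : (n, d) array of 0/1 feature indicators
        # y      : (n,)   array of class labels
        classes = np.unique(y)
        likelihoods = np.empty((classes.size, X_bool.shape[1]))
        for j, c in enumerate(classes):
            in_class = (y == c)                      # samples belonging to C_j
            # n(x_i and C_j) / n(C_j), for every feature at once
            likelihoods[j] = X_bool[in_class].sum(axis=0) / in_class.sum()
        return likelihoods, classes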

Note, however, from the complement rule that we have the following.

\(P(\overline{x_i}|C_j)=1-P(x_i|C_j)\)

Yes I know I just presented \(\overline{x_i}\) like it was something that I was going to use for the rest of the post. Yep, I am tossing it out the window. It will be fine.

We are going to take advantage of the fact that our features are boolean (0/1), as follows.

\(x_iP(x_i|C_j)+(1-x_i)[1-P(x_i|C_j)]\)

If the \(i^{th}\) feature is present, then \(x_i=1\) and \((1-x_i)=0\), and the expression reduces to …

\(P(x_i|C_j)\)

If the \(i^{th}\) feature is absent, then \(x_i=0\) and \((1-x_i)=1\), and the expression reduces to …

\([1-P(x_i|C_j)]\)

Handy that
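
A quick sanity check of that selection trick, with a made-up likelihood value:

    p = 0.3                          # pretend P(x_i|C_j) learned from training data
    for x_i in (1, 0):               # feature present, then absent
        print(x_i, x_i * p + (1 - x_i) * (1 - p))    # prints 1 0.3, then 0 0.7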

But you might be thinking, “But Cramer, why do this convoluted thing?”

MathS is why, well vectorization is why, oh and Log.

We get to take advantage of matrix multiplication.

Recall

\(
\textbf{X}_{(n,d)} =
\begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \cdots & x_{n,d}
\end{bmatrix}
\)

\(n: \text{ number of samples}\)
\(d: \text{ number of features}\)

And let’s treat everything else as a Matrix as well

\(\textbf{P(x|C)}_{(Cn,d)} \text{ Likelihood}\)

\(\textbf{P(C|x)}_{(n,Cn)} \text{ Posterior}\)

\(\textbf{P(C)}_{(n,Cn)} \text{ * Prior}\)

\(Cn: \text{ number of class labels}\)

\(\text{* }\) Ok, the Prior will not really have n rows and Cn columns. The Prior will really be a vector of length Cn, but Python will broadcast it so that its Log gets added to every row of the Posterior.

So

\(\textbf{P(C|x)}_{(n,Cn)}=Log(\textbf{P(C)}_{(n,Cn)})+\textbf{X}_{(n,d)}[Log(\textbf{P(x|C)}_{(Cn,d)})]^t+(1-\textbf{X}_{(n,d)})[Log(1-\textbf{P(x|C)}_{(Cn,d)})]^t\)

Not too bad … but that Likelihood … sigh
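
Here is roughly what that one line looks like in numpy, assuming the Likelihood matrix never holds an exact 0 or 1 (the next section is about making sure of that). Again, the names are mine for illustration.

    import numpy as np

    def log_posterior(X_bool, log_prior, likelihoods):
        # Row k, column j ends up holding Log P(C_j|x_k), up to the shared evidence term
        # X_bool      : (n, d)  boolean feature matrix
        # log_prior   : (Cn,)   Log P(C), broadcast across the n rows
        # likelihoods : (Cn, d) P(x|C)
        log_p = np.log(likelihoods)
        log_not_p = np.log(1.0 - likelihoods)
        return log_prior + X_bool @ log_p.T + (1 - X_bool) @ log_not_p.T

    # predicted class for each sample = column with the largest value in its row
    # y_hat = classes[np.argmax(log_posterior(X_bool, log_prior, likelihoods), axis=1)]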


How I actually handled Likelihoods

There is a problem in Naive Bayes and text, dealing with Zero.

Example: “Umami”

Not every text document is likely to have the word “Umami” in it. Further, a data set might not have the word “Umami” appear in every class.

Consider a data set that has three class labels. It could happen that, when you were building the model, two of the class labels had “Umami” but the third did not. Now suppose a new text document of this third class happens to have the word “Umami”. This is problematic.

\(P(x_{Umami}|C_3)=0\)

Since calculating the Posterior involves multiplication, it would crash that calculation to 0.

\(P(C)P(x_1|C) \cdots P(x_{Umami}|C) \cdots P(x_d|C)\)

Even under Log, this is problematic.

\(Log(0) \text{ is undefined}\)

and

\(\lim_{x \to 0+} Log(x) = -\infty\)
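
numpy makes the second point very concrete:

    import numpy as np

    np.log(0.0)    # -inf, along with a divide-by-zero RuntimeWarning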

So the way we said we were going to calculate Likelihoods is going to be problematic.

\(P(x_i|C_j)=\frac{n(x_i \cap C_j)}{n(C_j)}\)

So let’s change it. Nothing really says that we have to calculate Likelihoods in one particular way; it just needs to make sense.

First, let’s introduce another feature matrix. \(\textbf{X}_{(n,d)}\) was a matrix of ones and zeroes: one if that sample had that feature, zero otherwise. So why not use an integer matrix of word counts instead?

\(
\textbf{X}_{\text{int }(n,d)} =
\begin{bmatrix}
x_{\text{int }1,1} & x_{\text{int }1,2} & \cdots & x_{\text{int }1,d} \\
x_{\text{int }2,1} & x_{\text{int }2,2} & \cdots & x_{\text{int }2,d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{\text{int }n,1} & x_{\text{int }n,2} & \cdots & x_{\text{int }n,d}
\end{bmatrix}
\)

So I would like to change the Likelihood calculation to the following.

\(P(x_i|C_j)=\frac{\sum_{\{k \text{ | } y_k=C_j\}} x_{\text{int }k,i}}{\sum_{\{k \text{ | } y_k=C_j\}}\sum_{l=1}^{d}x_{\text{int }k,l}}\)

“But Cramer, this does not change anything. We could still get Likelihoods of 0 here.”

Hold your digits. I’m getting to it.

Before someone actually composes a document, the person has a general idea of what they are going to say. At that point, every “word” has at least some nonzero chance of making it into the document. So let’s “force” that to be true.

Just add one to every entry to our integer feature matrix.

\(\textbf{X}_{\text{int }(n,d)} \to 1 + \textbf{X}_{\text{int }(n,d)}\)

Some call it smoothing. I just think of it as a way to “simulate” that every document has a chance of containing every word.
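
Here is a sketch of that smoothed Likelihood calculation, using the same made-up names as before (X_int is the integer count matrix).

    import numpy as np

    def smoothed_likelihoods(X_int, y):
        # P(x_i | C_j) from word counts, with one added to every count first
        # X_int : (n, d) integer matrix of word counts
        # y     : (n,)   class labels
        classes = np.unique(y)
        X_plus_one = 1 + X_int                       # every word now "occurs" at least once
        likelihoods = np.empty((classes.size, X_int.shape[1]))
        for j, c in enumerate(classes):
            rows = X_plus_one[y == c]
            # numerator: padded count of word i in class j
            # denominator: padded count of all words in class j
            likelihoods[j] = rows.sum(axis=0) / rows.sum()
        return likelihoods, classes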

Anyways, this is more similar to how Likelihoods are calculated for Multinomial implementations. I will probably do another post on this where I do the more Multinomial approach. Again this is different from what a Multinomial Distribution is.


Feature Selection

Oh Dear …

I’m not really at the point in my machine learning experience where I am trying to bugger with feature selection/manipulation. That being said, I think that dealing with text requires some sort of selection unless the data set plays very nicely.

Because I’m not ready to really bugger, I decided to do a simple version of feature selection (there is a small code sketch after the list below).

Four Feature Selections:

  1. “Raw” Features: No Feature Selection
  2. At Least Two: Keep a feature if it appears in at least two samples
  3. At Least Five: … appears in at least five samples
  4. At Least Ten: … appears in at least ten samples
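
Concretely, each of these is just a column filter on the feature matrices. A sketch, again with my own illustrative names:

    import numpy as np

    def at_least(X_bool, k):
        # keep only the features (columns) that appear in at least k samples
        doc_freq = X_bool.sum(axis=0)    # number of samples containing each word
        return doc_freq >= k             # boolean mask over the d features

    # keep = at_least(X_bool, 10)        # the "At Least Ten" selection
    # X_bool = X_bool[:, keep]
    # X_int  = X_int[:, keep]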


Results and Python Files

I was really happy with the results. I ran my code 500 times for each feature selection, randomizing what I used for the training and test sets each time.
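
The loop itself is roughly the following, reusing the sketch functions from earlier in the post. The 70/30 split ratio is just an example, not necessarily what I used, and the Prior here is simply the class frequency in the training split.

    import numpy as np

    def run_trials(X_bool, X_int, y, trials=500, train_frac=0.7):
        rng = np.random.default_rng()
        accuracies = []
        for _ in range(trials):
            order = rng.permutation(len(y))          # new random train/test split each trial
            cut = int(train_frac * len(y))
            tr, te = order[:cut], order[cut:]
            likelihoods, classes = smoothed_likelihoods(X_int[tr], y[tr])
            prior = np.array([np.mean(y[tr] == c) for c in classes])
            scores = log_posterior(X_bool[te], np.log(prior), likelihoods)
            y_hat = classes[np.argmax(scores, axis=1)]
            accuracies.append(np.mean(y_hat == y[te]))
        return np.mean(accuracies), np.std(accuracies)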

Here are the means and standard deviations of the percent accuracy. I will note that since this was such a small data set, only 140 samples, there is a wide variance (standard deviation). Even running each setup 500 times, there is still a bit of variation in these results.

  1. “Raw” Features: \(\mu=69\) , \(\sigma=11\)
  2. At Least Two: \(\mu=81\) , \(\sigma=9\)
  3. At Least Five: \(\mu=91\) , \(\sigma=5\)
  4. At Least Ten: \(\mu=93\) , \(\sigma=4\)

Ignoring the standard deviation for a second: the “At Least Ten” result was better than any of the results posted on the website where I found the data. Granted, they did not use Naive Bayes, and I also think they were just showing results as a reference for how “hard” each of the data sets is to classify.

Either way, I am very happy that I at least got in the 90s.

I split the work into two files. One Python file creates text files of the feature matrices (class labels, boolean matrix, and integer matrix). The other file does the Naive Bayes “Binomial” classification.

Code

Enjoy