Naive Bayes Classifier: Simple Probability and the Mushroom Data Set

I was browsing the UC Irvine Machine Learning Repository for a data set to try either Binomial or Multinomial Naive Bayes on. What I found instead was a data set I could use simple probability on. I wish I had found this one before I did Gaussian.



Mushrooms!

Before I talk about this data set, allow me to give credit where credit is due. I don’t want y’all thinking I went out and collected these features myself.

Citation for the UC Irvine Machine Learning Repository:

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

And a link to the Mushroom Data Set: http://archive.ics.uci.edu/ml/datasets/Mushroom

Anyways …

What initially caught my eye about this data set was that every feature is categorical, i.e. discrete and finite. At first I was thinking it was going to be a Binomial/Multinomial implementation, but then I realized: wait, this is just Simple Probability.

Ok, apparently there is no “easy” way to tell whether a mushroom is poisonous. The data set contains 8124 samples with 22 features each (that’s right, twenty-two), and each sample is classified as either edible or poisonous.

I ended up removing two features. One of them never varied from a single category: removed. Another had missing values, and I didn’t want to bugger with missing values yet: removed.
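
For what it’s worth, that preprocessing is only a few lines in pandas. Here’s a minimal sketch, not my exact code; the column names come from the UCI documentation (the raw file has no header row), and if I’m reading the data dictionary right, the constant feature is veil-type and the one with missing values is stalk-root.

```python
import pandas as pd

# The raw UCI file has no header row, so supply the column names
# from the data set documentation.
cols = ["class", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
        "gill-attachment", "gill-spacing", "gill-size", "gill-color",
        "stalk-shape", "stalk-root", "stalk-surface-above-ring",
        "stalk-surface-below-ring", "stalk-color-above-ring",
        "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
        "ring-type", "spore-print-color", "population", "habitat"]
df = pd.read_csv("agaricus-lepiota.data", header=None, names=cols)

# Drop any feature that never varies (veil-type) and any feature with
# missing values (stalk-root marks them with '?').
df = df.loc[:, df.nunique() > 1]
df = df.loc[:, ~(df == "?").any()]
```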

I’m sure there are more interesting things to discover about this data set, but that will have to wait for another day/post.


Likelihoods (Feature Given Class Labels) From Simple Probabilities

This is really quite simple to describe. Let’s look at a particular example.

The likelihood that the cap color is green given that the sample is edible

\(P(x_{\text{cap-color}} = \text{green} \mid C = \text{edible})\)

Let’s abbreviate this to

\(P(x_{cc,g}|C_{e})\)

Which is

\(P(x_{cc,g}|C_{e})=\frac{P(x_{cc,g} \cap C_{e})}{P(C_{e})}\)

Or, the number of samples that are edible and have green caps over the number of edible samples.

\(P(x_{cc,g}|C_{e})=\frac{n(x_{cc,g} \cap C_{e})}{n(C_{e})}\)
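
And in code, that count-and-divide really is just a couple of lines. A minimal sketch, assuming the pandas DataFrame from the preprocessing sketch above and the UCI single-letter encoding (where, if I recall, ‘r’ codes a green cap and ‘e’ codes edible):

```python
def likelihood(df, feature, value, label):
    """P(feature = value | class = label), estimated by simple counting."""
    in_class = df[df["class"] == label]          # samples in the class
    return (in_class[feature] == value).mean()   # n(x and C) / n(C)

# P(cap color = green | edible); 'r' codes green in the UCI encoding
p_green_given_edible = likelihood(df, "cap-color", "r", "e")
```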

That’s it … Simple


Maybe not really Simple

I will be very eager to share this implementation with other Machine Learning/Data Science people.

Sure, the implementation was not difficult, but it did not feel elegant. I was hoping to use some sort of vectorization here. I am still almost certain there is a better way to code this than what I did, but it was more important to get it done than to make it elegant.
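
For instance, one vectorized route I could have taken is to let pandas build every conditional probability table in one shot with crosstab; a sketch under the same assumptions as above:

```python
import pandas as pd

# One table per feature: normalize="columns" makes each class column sum
# to 1, so each cell is P(feature value | class).
features = [c for c in df.columns if c != "class"]
likelihoods = {c: pd.crosstab(df[c], df["class"], normalize="columns")
               for c in features}

# e.g. likelihoods["cap-color"].loc["r", "e"] is P(green cap | edible)
```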


Code and Results

Ok, so even though coding this felt very clunky to me, the results were way better than I was expecting. I will note that I have no idea what the results should be; remember, this was just a simple probability implementation of a Naive Bayes Classifier.
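
For concreteness, the classification step scores each class as its prior times the product of the per-feature likelihoods (logs in practice, to dodge underflow) and picks the larger score. A rough sketch, reusing the hypothetical likelihoods tables from the previous section, and evaluating in-sample purely for illustration:

```python
import numpy as np

priors = df["class"].value_counts(normalize=True)  # P(edible), P(poisonous)

def predict(row):
    # argmax over classes of log P(C) + sum_i log P(x_i | C); with raw
    # counts and no smoothing, a zero likelihood gives log(0) = -inf,
    # which simply knocks that class out of the running.
    scores = {}
    for label in priors.index:
        log_p = np.log(priors[label])
        for feature, table in likelihoods.items():
            log_p += np.log(table.loc[row[feature], label])
        scores[label] = log_p
    return max(scores, key=scores.get)

accuracy = (df.apply(predict, axis=1) == df["class"]).mean()
```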

The probability of correct classification was \(\geq 0.99\).

My eyes were popping out of my head.

And here is the code