I was looking on the UC Irvine Machine Learning Repository for a data set to perform either Binomial or Multinomial Bayes on. What I found was a data set that I could use simple probability on instead. I wish I had found this one before I did Gaussian.
Mushrooms!
Before I talk about this data set, allow me to give credit where credit is due. I don’t want y’all thinking I went out and collected these features myself.
Citation for the UC Irvine Machine Learning Repository:
Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
And a link to the Mushroom Data Set
Anyways …
What initially caught my eye about this data set was that each of the features is categorical, aka discrete and finite. At first, I was thinking it was going to be a Binomial/Multinomial implementation, but then I realized: wait, this is just simple probability.
Ok, apparently there is no “easy” way to tell whether a mushroom is going to be poisonous or not. The data set has a total of 8124 samples with 22 features each, that’s right, twenty-two, and each sample is classified as either edible or poisonous.
I ended up removing two features. One feature never varied from a single category, removed. Another feature had missing values, and I didn’t want to bugger with missing values yet, removed.
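For reference, here is a minimal preprocessing sketch in Python/pandas of how that clean-up might look. The file URL and the column names are taken from my reading of the UCI documentation rather than from my original code, and the two dropped columns are found programmatically instead of being hard-coded:

```python
import pandas as pd

# Assumed location of the raw data file on the UCI repository
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "mushroom/agaricus-lepiota.data")

# Column names as listed in the dataset's documentation (agaricus-lepiota.names)
COLUMNS = [
    "class", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]

df = pd.read_csv(URL, header=None, names=COLUMNS)

# Feature that never varies from a single category: drop it
constant = [c for c in df.columns if df[c].nunique() == 1]

# Feature with missing values (encoded as '?' in this data set): drop it too
missing = [c for c in df.columns if (df[c] == "?").any()]

df = df.drop(columns=constant + missing)
print(df.shape)  # expect (8124, 21): the class label plus the remaining 20 features
```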
I’m sure there are more interesting things to discover about this data set, but that will have to wait for another day/post.
Likelihoods (Feature Given Class Labels) From Simple Probabilities
This is really described quite simply. Let’s look at a particular example.
The likelihood that the cap color is green given that the sample is edible
\(P(x_{\text{cap-color}}=\text{green}|C=\text{edible})\)
Let’s abbreviate this to
\(P(x_{cc,g}|C_{e})\)
Which is
\(P(x_{cc,g}|C_{e})=\frac{P(x_{cc,g} \cap C_{e})}{P(C_{e})}\)
Or, the number of samples that are edible and have green caps over the number of edible samples.
\(P(x_{cc,g}|C_{e})=\frac{n(x_{cc,g} \cap C_{e})}{n(C_{e})}\)
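As a concrete sketch of that count, reusing the `df` from the preprocessing snippet above and assuming the dataset’s single-letter codes (`'e'` for the edible class, `'r'` for a green cap color):

```python
# Count-based likelihood for the example above.
edible = df[df["class"] == "e"]                      # samples with C = edible
n_edible = len(edible)                               # n(C_e)
n_green_edible = (edible["cap-color"] == "r").sum()  # n(x_cc,g and C_e)

p_green_given_edible = n_green_edible / n_edible     # P(x_cc,g | C_e)
```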
That’s it … Simple
Maybe not really Simple
I will be very eager to share this implementation with other Machine Learning/Data Science people.
Sure, the implementation was not difficult, but it did not feel elegant. I was hoping to use some sort of vectorization here. I am still almost certain there is a better way to code this than the way I did, but it was more important to get it done than to make it elegant.
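Something along these lines is roughly what I was picturing, a sketch rather than the code I actually ran: `pandas.crosstab` builds every likelihood table at once from the same counts as the formula above, and classification just multiplies the prior by the matching likelihood for each feature.

```python
import pandas as pd

# Sketch of a vectorized version, assuming `df` from the preprocessing step above.
X = df.drop(columns="class")
y = df["class"]

# Class priors P(C), e.g. P(edible) and P(poisonous)
priors = y.value_counts(normalize=True)

# One likelihood table per feature: rows = feature values, columns = classes,
# entries = P(x = value | C = class), built from counts exactly as in the formula.
likelihoods = {
    col: pd.crosstab(X[col], y, normalize="columns") for col in X.columns
}

def predict(sample: pd.Series) -> str:
    """Pick the class with the largest prior * product of feature likelihoods."""
    scores = priors.copy()
    for col, value in sample.items():
        # reindex guards against a value never seen for a class (count of 0)
        scores = scores * likelihoods[col].reindex([value]).fillna(0.0).iloc[0]
    return scores.idxmax()

# In-sample check: classify the same table the counts came from
predictions = X.apply(predict, axis=1)
accuracy = (predictions == y).mean()
```

Note that a feature value with a zero count for a class simply zeroes out that class’s score, which matches the plain count-based probabilities with no smoothing.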
Code and Results
Ok, so even though coding this felt very clunky to me, the results were way better than I was expecting. I will note that I have no idea what the results should be; remember, this was just a simple-probability implementation of a Naive Bayes classifier.
The probability of correct classification was \(\geq 0.99\).
My eyes were popping out of my head.
And here is the code