Random Forest Classification of Mushrooms

There is a plethora of classification algorithms available to people who have a bit of coding experience and a set of data. A common machine learning method is the random forest, which is a good place to start.

This post walks through a use case in R: the randomForest package applied to a data set from the UCI Machine Learning Repository.

Are These Mushrooms Edible?

If someone gave you thousands of rows of data with dozens of columns about mushrooms, could you identify which characteristics make a mushroom edible or poisonous? How much would you trust your model? Would it be enough for you to decide whether or not to eat a mushroom you find? (That’s a bad decision roughly 100% of the time.)

The randomForest package does all of the heavy lifting behind the scenes. While this “magic” is incredibly nice for the end user, it’s important to understand what you’re actually doing. Keep this in mind for absolutely any package you use in R or any other language.

“To know how to run these programs is impressive, but to truly understand how and why they work is what makes you an expert!” -Haley Stoltzman (my wife is a genius)

Here is an article which explains things in layman’s terms – A Gentle Introduction to Random Forests, Ensembles, and Performance Metrics in a Commercial System.

I created a function to grab and clean up the data. This happened to be a very manual process so I borrowed a lot of the code from others. Later on, I found that the data set had already been cleaned up by someone else and presented as a .csv file, but I decided to use my function anyway.

I brought the data in as a data frame. The first column is “Edible”, which could just as well be labeled “Class”, since it is what the classification is trying to predict. It holds only two values, “Edible” and “Poisonous” (keep in mind that random forest easily handles more than two classes).

I printed the first few rows and the output shows us there are 23 columns (including “Edible”). I am not a mushroom expert but most of this data makes sense to try and utilize.

It’s important to know that, by default, R’s randomForest package refuses to fit on rows with missing data. Using the summary() function can help to identify such issues. This data doesn’t have missing information.
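As a rough sketch of that check, the snippet below uses base R only. The data frame is a tiny made-up stand-in (not the real mushroom data), and one NA is planted on purpose so the checks have something to find.

```r
# Toy stand-in for the mushroom data frame (hypothetical values).
mushrooms <- data.frame(
  Edible   = factor(c("Edible", "Poisonous", "Edible", "Poisonous")),
  Odor     = factor(c("Almond", "Foul", NA, "Fishy")),
  CapShape = factor(c("Bell", "Convex", "Flat", "Convex"))
)

# summary() prints an NA count for every column that has missing values.
print(summary(mushrooms))

# Count missing values per column, and check the whole frame at once.
na_per_column <- colSums(is.na(mushrooms))
has_missing   <- anyNA(mushrooms)

# One way to proceed: keep only the rows with no missing values.
complete_rows <- mushrooms[complete.cases(mushrooms), ]
```

If has_missing comes back FALSE, as it would for the data in this post, no rows need to be dropped.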

I want to explore the data before fitting a model to get an idea of what to expect. I am plotting two variables against each other, one on each axis, and using color to show whether each mushroom is edible or poisonous.

In these plots, edible is shown as green and poisonous is shown as red. I’m looking for spots where there exists an overwhelming majority of one color.

A comparison of “CapSurface” to “CapShape” shows us:

  • CapShape Bell is more likely to be edible
  • CapShape Convex or Flat have a mix of edible and poisonous and make up the majority of the data
  • CapSurface alone does not tell us a lot of information
  • CapSurface Fibrous + CapShape Bell, Knobbed, or Sunken are likely to be edible
  • These variables will likely increase information gain but may not be incredibly strong

[Plot: CapSurface vs. CapShape, colored by edible/poisonous]

A comparison of “StalkColorBelowRing” to “StalkColorAboveRing” shows us:

  • StalkColorAboveRing Gray is almost always going to be edible
  • StalkColorBelowRing Gray is almost always going to be edible
  • StalkColorBelowRing Buff is almost always going to be poisonous
  • This list could go on…
  • These variables are likely to increase information gain by a fair amount

[Plot: StalkColorBelowRing vs. StalkColorAboveRing, colored by edible/poisonous]

A comparison of “Odor” to “SporePrintColor” shows us:

  • Odor Foul, Fishy, Pungent, Creosote, and Spicy are highly likely to be poisonous
  • Odor Almond and Anise are highly likely to be edible.
  • Odor None appears to be primarily edible
    • However, if it has SporePrintColor Green it is highly likely to be poisonous!
  • These variables are likely going to lead to a lot of information gain

[Plot: Odor vs. SporePrintColor, colored by edible/poisonous]

Due to how strong those variables looked, I decided to plot them strictly as edible or poisonous and found:

  • Odor is an excellent indicator of edible or poisonous
  • Odor None is the only tricky one – there is data where it would be classified as edible or poisonous
  • SporePrintColor is not as strong as odor when it stands alone – there is a lot of overlap between the columns
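The same kind of comparison can be made numerically with a cross-tabulation. This is only a sketch in base R: the six rows are invented for illustration, and the column names simply mirror the ones used in this post.

```r
# Toy data (hypothetical rows, not the real mushroom data).
mushrooms <- data.frame(
  Edible = factor(c("Edible", "Edible", "Poisonous",
                    "Poisonous", "Edible", "Poisonous")),
  Odor   = factor(c("Almond", "None", "Foul",
                    "None", "None", "Fishy"))
)

# Count how often each Odor value goes with each class.
counts <- table(Odor = mushrooms$Odor, Edible = mushrooms$Edible)

# Convert counts to shares within each Odor value (each row sums to 1).
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 2))
```

A row that sits near 1 for one class is the numeric version of "an overwhelming majority of one color" in the plots. A row like None, which is split between the classes, is the tricky case.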

[Plots: Odor and SporePrintColor, each shown strictly as edible or poisonous]

Before fitting a model it’s important to split the data into two parts: training data and test data. There’s no perfect way to know exactly how much data you should use to train your model. In this example I used 5% for training and 95% for testing. That is not typical, however; most splits I see are around 60%/40% or 70%/30% train/test.
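A minimal sketch of a 5%/95% random split in base R follows. The data frame is randomly generated filler, and the names train_fraction, train_idx, train, and test are my own choices for illustration rather than anything taken from the original code.

```r
set.seed(42)  # make the random split reproducible

# Filler data standing in for the mushroom data frame (hypothetical).
n <- 1000
mushrooms <- data.frame(
  Edible = factor(sample(c("Edible", "Poisonous"), n, replace = TRUE)),
  Odor   = factor(sample(c("Almond", "Foul", "None"), n, replace = TRUE))
)

# Draw 5% of the row indices at random for training.
train_fraction <- 0.05
train_idx <- sample(seq_len(nrow(mushrooms)),
                    size = floor(train_fraction * nrow(mushrooms)))

train <- mushrooms[train_idx, ]   # rows the model is fit on
test  <- mushrooms[-train_idx, ]  # everything else is held out
```

Changing train_fraction to 0.6 or 0.7 gives the more conventional splits.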

If you put too much of the data into the training set, you leave too little held-out data to notice when your model has overfit. Overfitting is a classic mistake people make when first entering the field of machine learning. I won’t go into the details, but there are entire classes dedicated to this subject (see the Wikipedia article on overfitting).

Initially, I ran this at higher levels of training data and it had perfect prediction with zero false positives or negatives. That’s not as fun to look at as an example so I scaled down the training data which created more bad predictions.

I wanted to know the split of edible to poisonous mushrooms in the data set and compare it to the training and test data. The random sample appears to have created roughly the same ratio of edible to poisonous upon creating train and test data.

Edible % / Poisonous % :

  • Data: 52 / 48
  • Train: 50 / 50
  • Test: 52 / 48
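One way to produce a comparison like this is sketched below in base R. The label vector is a toy built to have a 52/48 split, class_share is a hypothetical helper name, and the train and test rows will differ from the figures above because the sample is random.

```r
# Percentage of each class in a vector of labels, rounded to whole numbers.
class_share <- function(labels) {
  share <- prop.table(table(labels))
  setNames(round(100 * as.numeric(share)), names(share))
}

# Toy labels with a 52/48 split (hypothetical, not the real data).
labels <- factor(c(rep("Edible", 52), rep("Poisonous", 48)))

set.seed(3)
train_idx <- sample(seq_along(labels), size = 20)

# One row per data set, one column per class.
shares <- rbind(
  Data  = class_share(labels),
  Train = class_share(labels[train_idx]),
  Test  = class_share(labels[-train_idx])
)
print(shares)
```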

[Plot: share of edible vs. poisonous mushrooms in the full data, the training data, and the test data]

I finally fit the random forest model to the training data. Plotting the model shows us that after about 20 trees, not much changes in terms of error. It fluctuates a bit but not to a large degree.

Printing the model shows that 4 variables were tried at each split and that the OOB (out-of-bag) estimate of the error rate is 0.25%. The model fit the training data almost perfectly: only one mushroom was classified incorrectly. The model predicted it to be poisonous when it was actually edible. If we consider edible to be “positive”, this means we would have had 1 false negative.
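The fitting step might look roughly like the sketch below. It assumes the randomForest package is installed and uses a small generated data set with an invented rule, so the numbers it prints will not match the ones above. For classification, randomForest defaults to trying floor(sqrt(p)) variables at each split, which for the 22 predictors here works out to the 4 reported above.

```r
library(randomForest)

set.seed(1)
n <- 400

# Toy predictors (hypothetical values).
odor <- factor(sample(c("Almond", "Foul", "None"), n, replace = TRUE))
cap  <- factor(sample(c("Bell", "Convex", "Flat"), n, replace = TRUE))

# Invented rule: Foul is poisonous, Almond is edible, None is a coin flip.
coin   <- sample(c("Edible", "Poisonous"), n, replace = TRUE)
edible <- factor(ifelse(odor == "Foul", "Poisonous",
                 ifelse(odor == "Almond", "Edible", coin)))

toy <- data.frame(Edible = edible, Odor = odor, CapShape = cap)

# Fit the forest; ntree = 100 is an arbitrary choice for this sketch.
model <- randomForest(Edible ~ ., data = toy, ntree = 100)

print(model)  # mtry, OOB error estimate, and the OOB confusion matrix
plot(model)   # error rate as a function of the number of trees

# Default mtry for classification with 22 predictors.
default_mtry <- floor(sqrt(22))

# OOB error after the last tree, taken from the per-tree error matrix.
oob_error <- model$err.rate[model$ntree, "OOB"]
```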

It’s always important to look at what is shown in terms of variable importance. This plot indicates what variables had the greatest impact in the classification model.

I limited it to 10 for the plot.

[Plot: variable importance, top 10 variables]

Odor is by far the most important variable in terms of “MeanDecreaseGini”, which plays a role similar to information gain in this example. The rest of the results are listed below. It’s interesting to notice that “VeilType” created no information gain, so I looked into it in the initial data. The reason is clear: there is only one VeilType, so it doesn’t offer any differentiation and couldn’t possibly impact the results.
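The effect of a single-valued column can be reproduced on toy data. The sketch below assumes the randomForest package is installed; the data and the rule linking Odor to the class are invented, and VeilType is given one constant value to mimic the situation described above.

```r
library(randomForest)

set.seed(7)
n <- 300
odor <- factor(sample(c("Almond", "Foul", "None"), n, replace = TRUE))

toy <- data.frame(
  # Invented rule: the class depends on Odor alone.
  Edible   = factor(ifelse(odor == "Foul", "Poisonous", "Edible")),
  Odor     = odor,
  # Pure noise, unrelated to the class.
  CapShape = factor(sample(c("Bell", "Convex", "Flat"), n, replace = TRUE)),
  # Constant column: every row has the same (toy) value.
  VeilType = factor(rep("Partial", n))
)

model <- randomForest(Edible ~ ., data = toy, ntree = 100)

# Mean decrease in Gini impurity for each predictor.
gini <- importance(model)[, "MeanDecreaseGini"]
print(sort(gini, decreasing = TRUE))

# n.var limits how many variables the importance plot shows.
varImpPlot(model, n.var = 3)
```

A column with one value can never be used to split a node, so its importance stays at zero no matter how many trees are grown.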

I then used the model to predict whether each mushroom in the training data set is edible or poisonous. It predicted the response variable perfectly, with zero false positives or false negatives.

Now it was time to see how the model did with data it had not seen before – making predictions on the test data.

It did a decent job, with 99% accuracy and a very narrow confidence interval. It did have 48 false negatives and 8 false positives. With edible as the “positive” class, those 8 false positives are poisonous mushrooms predicted to be edible, which could be deadly if you were actually choosing to eat mushrooms based off of this model.
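Counts like these can be read off a confusion table. The sketch below is base R with two short made-up label vectors; in a real run, the predicted vector would come from something like predict(model, newdata = test). The exact binomial interval from binom.test is one reasonable way to put a confidence interval on accuracy, and I am assuming it here rather than claiming it is how the figure above was produced.

```r
classes <- c("Edible", "Poisonous")

# Made-up labels for eight mushrooms (hypothetical).
actual <- factor(c("Edible", "Edible", "Edible", "Poisonous",
                   "Poisonous", "Poisonous", "Edible", "Poisonous"),
                 levels = classes)
predicted <- factor(c("Edible", "Poisonous", "Edible", "Poisonous",
                      "Edible", "Poisonous", "Edible", "Poisonous"),
                    levels = classes)

# Rows are predictions, columns are the true classes.
confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)

# Treat "Edible" as the positive class.
false_negatives <- confusion["Poisonous", "Edible"]  # edible called poisonous
false_positives <- confusion["Edible", "Poisonous"]  # poisonous called edible

# Accuracy and an exact binomial confidence interval around it.
correct  <- sum(diag(confusion))
total    <- sum(confusion)
accuracy <- correct / total
ci       <- binom.test(correct, total)$conf.int
```

With only eight rows the interval is very wide. It narrows as the test set grows, which is why a test set of thousands of rows gives such a tight one.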

Unfortunately, I have no idea how reliable this data is or how it was captured. There is likely some background information and I would never choose whether or not to eat an unknown mushroom based off of this model (and neither should you).

Code used in this post is on my GitHub