bayes_motel – Bayesian classification for Ruby

2010-04-28

Bayesian classification is a technique for categorizing documents probabilistically. I recently started playing with Twitter data and realized there was no Ruby gem that would let me build a spam detector for tweets. The classifier gem works only on plain text, learning which words appear in each category, but a tweet is much more complicated than that. A tweet looks like this:

As you can see, a tweet is just a hash of variables. So which variables are the better indicators of spam? I don’t know and chances are you don’t either. But if we create a corpus of ham tweets and a corpus of spam tweets, we can train a Bayesian classifier on the two datasets and it will work out which variable values show up more often in spam and which in ham.
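
To make the idea concrete, here is a minimal sketch of naive Bayes over a flat hash of variables. It is not bayes_motel's actual code or API: the HashBayes class, its method names, and the four-tweet corpus with its lang, has_url, and verified fields are all made up for illustration, and add-one smoothing is just one reasonable choice among several.

```ruby
# Toy naive Bayes over flat hashes of variables.
# HashBayes and its methods are hypothetical names for illustration only;
# this is NOT the bayes_motel API.
class HashBayes
  def initialize
    # category => number of training documents seen
    @doc_counts = Hash.new(0)
    # category => variable => value => number of times seen
    @value_counts = Hash.new do |cats, cat|
      cats[cat] = Hash.new { |vars, var| vars[var] = Hash.new(0) }
    end
    # variable => set of distinct values seen in any category
    @vocab = Hash.new { |vars, var| vars[var] = {} }
  end

  # Record every variable/value pair of +doc+ under +category+.
  def train(doc, category)
    @doc_counts[category] += 1
    doc.each do |var, value|
      @value_counts[category][var][value] += 1
      @vocab[var][value] = true
    end
  end

  # Log-probability score of +doc+ for each trained category.
  def scores(doc)
    total = @doc_counts.values.sum.to_f
    @doc_counts.each_with_object({}) do |(category, n), out|
      log_p = Math.log(n / total) # prior: share of documents in this category
      doc.each do |var, value|
        seen = @value_counts[category][var][value]
        distinct = @vocab[var].size + 1 # +1 leaves room for an unseen value
        log_p += Math.log((seen + 1.0) / (n + distinct)) # add-one smoothing
      end
      out[category] = log_p
    end
  end

  # The category with the highest score.
  def classify(doc)
    scores(doc).max_by { |_category, score| score }.first
  end
end

# Invented training data; real tweets carry many more variables.
classifier = HashBayes.new
classifier.train({ lang: "en", has_url: false, verified: true },  :ham)
classifier.train({ lang: "en", has_url: false, verified: false }, :ham)
classifier.train({ lang: "en", has_url: true,  verified: false }, :spam)
classifier.train({ lang: "ru", has_url: true,  verified: false }, :spam)

p classifier.classify({ lang: "en", has_url: true, verified: false }) # => :spam
```

Working in log space avoids underflow when a document has many variables, and the smoothing keeps a value that was never seen in a category from zeroing out that category's score.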

Some variables don’t work, statistically speaking:

There are additional things we could do to improve our spam detector:

I wrote bayes_motel based on my research this past weekend. Give it a try and send a pull request if you make changes you’d like to see merged. The test suite gives more detail about the API and has a few thousand tweets to use as sample data. Happy coding!