bayes_motel – Bayesian classification for Ruby
2010-04-28
Bayesian classification is an algorithm which allows us to categorize documents probabilistically. I recently started playing with Twitter data and realized there was no Ruby gem which would allow me to build a spam detector for tweets. The classifier gem only works on a body of text, figuring out which words appear in each category, but a tweet is much more complicated than that. A tweet looks like this:
As you can see, a tweet is just a hash of variables. So which variables are a better indicator of spam? I don’t know and chances are you don’t either. But if we create a corpus of ham tweets and a corpus of spam tweets, we can train a Bayesian classifier with the two datasets and it will figure out which variable values are seen often in spam and which in ham.
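To make the idea concrete, here is a minimal sketch of a naive Bayes classifier that trains on flat hashes. It is not the bayes_motel implementation or its API: the HashBayes class, its method names, and the sample variables (:source, :has_link) are all made up for illustration, and the smoothing is deliberately crude.

```ruby
# Illustrative naive Bayes over flat hashes.
# NOT the bayes_motel API; every name here is hypothetical.
class HashBayes
  def initialize
    # @counts[category][variable][value] => number of documents seen
    @counts = Hash.new { |h, c| h[c] = Hash.new { |g, k| g[k] = Hash.new(0) } }
    # @totals[category] => number of documents trained
    @totals = Hash.new(0)
  end

  # Record every variable/value pair of the document under the category.
  def train(doc, category)
    @totals[category] += 1
    doc.each { |key, value| @counts[category][key][value] += 1 }
  end

  # Returns category => log-probability score (higher means more likely).
  def score(doc)
    total_docs = @totals.values.sum
    @totals.each_with_object({}) do |(category, n), scores|
      log_p = Math.log(n.to_f / total_docs) # prior for the category
      doc.each do |key, value|
        seen = @counts[category][key][value]
        # Crude add-one smoothing so an unseen value does not zero the score.
        log_p += Math.log((seen + 1.0) / (n + 2.0))
      end
      scores[category] = log_p
    end
  end

  # Pick the category with the highest score.
  def classify(doc)
    score(doc).max_by { |_, s| s }.first
  end
end

classifier = HashBayes.new
2.times { classifier.train({ source: "web", has_link: false }, :ham) }
2.times { classifier.train({ source: "api", has_link: true }, :spam) }
classifier.classify({ source: "api", has_link: true }) # => :spam
```

Working in log space keeps the product of many small probabilities from underflowing once a document has more than a handful of variables.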
Some variables don’t work, statistically speaking:
- :id, :created_at – these variables are unique for each tweet which means they are useless for classification. BayesMotel will trim any variable values that don’t appear in more than 3% of the corpus.
- :followers_count – this is probably a pretty good spam/ham indicator in general, but not as a simple number. There are millions of possible values (@aplusk has 4.5 million followers) but we are only training on hundreds or thousands of tweets. What would be better is to take the binary logarithm of followers_count, rounded down, to create discrete buckets: 32-63 followers = 5, 1024-2047 = 10 and so on. I’d bet any tweet with a value of 12 or more (i.e. 4096+ followers) is very likely to be ham.
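Both ideas in the list above are easy to sketch. The helpers below are hypothetical (followers_bucket and trim_rare_values are names I made up, not functions from the gem); they just show one way to bucket a count by its binary logarithm and to drop values seen in 3% of the corpus or less.

```ruby
# Hypothetical helper: map a follower count to its binary-logarithm bucket.
# For positive integers, bit_length - 1 equals floor(log2(n)) and avoids
# floating-point rounding.
def followers_bucket(count)
  return 0 if count < 1 # log2 is undefined for 0; use the lowest bucket
  count.bit_length - 1
end

# Hypothetical helper: keep only the values of one variable that appear in
# more than `threshold` (3% by default) of the documents in the corpus.
# value_counts maps value => number of documents containing it.
def trim_rare_values(value_counts, corpus_size, threshold = 0.03)
  value_counts.select { |_, count| count.to_f / corpus_size > threshold }
end

followers_bucket(40)        # => 5
followers_bucket(1500)      # => 10
followers_bucket(4_500_000) # => 22
```

A unique variable like :id produces one value per tweet, so every value falls under the threshold and the whole variable disappears after trimming.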
There are additional things we could do to improve our spam detector:
- We aren’t deep inspecting the value of the tweet text. It might be useful to have variables like “text_link_count” or “text_hashtag_count” to provide basic metrics for the tweet text content.
- We aren’t performing any timeline checks or storing previous tweet state – spammers tend to tweet the same text over and over and their tweets all contain links. This is beyond the scope of a generic Bayesian system.
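The first suggestion could look something like the sketch below. The text_features helper and its regular expressions are my own rough assumptions about what counts as a link or a hashtag, not anything bayes_motel provides.

```ruby
# Hypothetical feature extraction: derive simple counts from the tweet text
# so the classifier sees them as ordinary variables.
def text_features(text)
  {
    # Anything that starts with http:// or https:// up to the next whitespace.
    text_link_count: text.scan(%r{https?://\S+}).size,
    # A '#' that does not follow a word character, then word characters.
    text_hashtag_count: text.scan(/(?<!\w)#\w+/).size
  }
end

tweet = { text: "Win a free iPad http://a.example/x http://b.example/y #free #ipad" }
tweet.merge(text_features(tweet[:text]))
# adds text_link_count: 2 and text_hashtag_count: 2 to the hash
```

The counts are small integers, so they already behave like discrete buckets and need no further transformation before training.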
I wrote bayes_motel based on my research this last weekend. Give it a try and send a pull request if you make changes you’d like to see. The test suite gives more detail about the API and has a few thousand tweets to use as sample data. Happy coding!