Reverse-engineering Gmail’s spam filter
Kevin Day, December 21st, 2007
I gave a speech at Toastmasters two weeks ago about how a spam filter works. The audience was non-technical, so the speech was mostly a summary of Paul Graham’s essay, A Plan for Spam.
To personalize the speech, I wanted to add some examples from my email. But in order to calculate spam probabilities for emails, I needed to find how many times each word in an email appears in my normal email and how many times it appears in my spam.
Hacking Gmail
Since I use Gmail, I couldn’t easily write a program to search my emails, so I had to rely on Gmail’s search capability. I wasn’t able to count the total occurrences of each word, but I was able to count the number of emails in which a given word occurs.
I used this url to go to the last search page for the word “google” in my good email:
https://mail.google.com/mail/#search/google/p999
And this url goes to the last search page for the word “google” in my spam:
https://mail.google.com/mail/#search/in%3Aspam+google/p999
The total number of emails in the search is in the upper right-hand corner. This hack isn’t needed for rare words; there is usually a link for “Oldest” on the search results page. Commonly occurring words, however, don’t have that link.
Only the email body is searched. Headers are ignored.
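The two URL patterns above can be generated for any word with a few lines of Python. This is only a sketch: the function name gmail_search_urls and the page parameter are made up for illustration, and it assumes Gmail still accepts the search URL scheme shown above.

```python
from urllib.parse import quote

# URL prefix taken from the two example search URLs above.
BASE = "https://mail.google.com/mail/#search/"

def gmail_search_urls(word, page=999):
    """Return (good_url, spam_url) for the last search page of `word`.

    Hypothetical helper: it only rebuilds the URL patterns shown in the
    post and does not contact Gmail.
    """
    term = quote(word, safe="")
    spam_filter = quote("in:spam", safe="")  # becomes "in%3Aspam"
    good_url = f"{BASE}{term}/p{page}"
    spam_url = f"{BASE}{spam_filter}+{term}/p{page}"
    return good_url, spam_url

good_url, spam_url = gmail_search_urls("google")
print(good_url)
print(spam_url)
```

For the word “google” this reproduces the two URLs listed above; the email counts still have to be read off the page by hand.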
Calculating spam scores by hand
By modifying the urls above for each word in an email, I was able to count the total number of emails each word appeared in for spam and non-spam. Then I used the simple Bayesian statistics outlined by Graham to calculate the spam scores for three emails:
- A spam, correctly filtered as spam
- A normal email, correctly not filtered as spam
- A spam that slipped through the filter and appeared in my inbox
The correctly identified spam had a spam probability of 100%. The normal email had a spam probability of 0%. And the spam that got through had a spam probability of 44%.
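For reference, the per-word calculation in A Plan for Spam can be sketched roughly as follows. This is a sketch under assumptions, not necessarily the exact arithmetic used here: it plugs in the number of emails containing a word where Graham uses occurrence counts, and the function name, parameter names, and example counts are made up for illustration. The doubling of good counts, the 0.01 and 0.99 clamps, and the minimum of five sightings come from Graham’s essay; whether they were applied to the numbers in this post is not stated.

```python
def word_spam_probability(good, bad, ngood, nbad, good_weight=2, min_seen=5):
    """Estimate the probability that an email containing a word is spam.

    good, bad   -- number of good / spam emails containing the word
    ngood, nbad -- total number of good / spam emails (both assumed > 0)
    """
    g = good_weight * good  # Graham doubles good counts to bias against false positives
    b = bad
    if g + b < min_seen:
        return None  # too rare to be informative; Graham ignores such words
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    p = bad_ratio / (good_ratio + bad_ratio)
    return max(0.01, min(0.99, p))  # never fully certain either way

# Hypothetical counts: a word found in 10 of 1000 good emails
# and in 30 of 1000 spam emails.
print(word_spam_probability(good=10, bad=30, ngood=1000, nbad=1000))  # about 0.6
```

Setting good_weight=1 gives the unweighted ratio, which may be closer to a quick by-hand calculation.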
Hmm… it’s interesting that this simple hack almost correctly identified the spam that Google didn’t catch.
Here is the spam:
Your new Diploma!
No examinations!
NO classes!
NO textbooks!
AND
100% Discrete!
Satisfaction guaranteed!
+1 206 30 90 336
Regards!
and each word sorted by its spam probability:
satisfaction 73%, diploma 68%, discrete 61%, examinations 53%, and 46%, your 45%, regards 44%, no 40%, guaranteed 39%, new 38%, textbooks 36%
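As a check, the per-word probabilities listed above can be combined with the naive Bayes formula from Graham’s essay. The sketch below assumes each distinct word is counted once and that all eleven words are used; the names word_probs and combined_spam_probability are made up for illustration.

```python
from math import prod

# Per-word spam probabilities as listed above.
word_probs = {
    "satisfaction": 0.73, "diploma": 0.68, "discrete": 0.61,
    "examinations": 0.53, "and": 0.46, "your": 0.45, "regards": 0.44,
    "no": 0.40, "guaranteed": 0.39, "new": 0.38, "textbooks": 0.36,
}

def combined_spam_probability(probs):
    """Combine per-word probabilities, treating the words as independent."""
    probs = list(probs)
    spam = prod(probs)                  # product of p_i
    ham = prod(1.0 - p for p in probs)  # product of (1 - p_i)
    return spam / (spam + ham)

score = combined_spam_probability(word_probs.values())
print(f"{score:.1%}")
```

The result comes out at roughly 45%, in line with the 44% reported for this email; the small gap is presumably due to rounding in the per-word percentages.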
I think the key to this getting through was that I was recently a student and the word “textbooks” was in some of my normal emails. Otherwise it might have been tagged as spam.
Getting more data
Based on these three data points, I can conclude that Gmail’s spam filter isn’t any more sophisticated than the one Paul Graham outlined in 2002. Well, maybe more than three emails should be analyzed before making an assertion like that.
Since it’s tedious doing it all by hand, I’d like to see a Greasemonkey script do the work for me. I’m curious how closely the results of this simple filter would compare to Gmail’s. If my curiosity stays high, I might try writing the script myself. I haven’t written a Greasemonkey script before, though, so I don’t know how much work it will be.

You know, Gmail does support POP3 and (as of recently) IMAP connections. You can just install an e-mail client like Thunderbird on your computer and use it to download all of your mail from Gmail. You’ll then have *all* of the messages locally stored in a convenient text format that’s easy to search however you want.
- Russell Stewart, December 21st, 2007, 1:20 pm
Interesting approach. As you can access your Gmail account using IMAP or POP, working on a local file system is probably the easiest way for you or someone else to continue your investigation.
- Lloyd Budd, December 21st, 2007, 1:40 pm
@Russel:
Not sure about IMAP, but Gmail’s POP doesn’t allow you to download spam mail.
- Dmitry Chestnykh, December 21st, 2007, 7:48 pm
Yeah, the IMAP approach might work better if it downloads spam as well as email. At the time I was just trying to get the most out of the web interface. I’m not sure how far that will take me though. Thanks for the suggestion.
- Kevin Day, December 22nd, 2007, 12:19 am