Archive for December, 2007

Obligatory “got a new toy” post

Kevin Day, December 29th, 2007

For Christmas I received a Nokia N800 internet tablet, which was at the top of my wish list. So far it’s been great to use. Couldn’t be happier.

My initial ideas for applications to write for it are (in increasing order of complexity):

  1. Digital picture frame that pulls images from Flickr based on tags supplied by the user.
  2. Fantasy football live stats calculator.
  3. Grocery store helper that suggests what food to buy based on the food I have in my apartment, cost, and nutritional value.

I’m starting with the Flickr-frame application on my desktop, and once it’s finished I’ll try porting it to the N800. At the moment I’m using PyGTK for the GUI.

Reverse-engineering Gmail’s spam filter

Kevin Day, December 21st, 2007

I gave a speech at Toastmasters two weeks ago about how a spam filter works. The audience was non-technical, so the speech was mostly a summary of Paul Graham’s essay, A Plan for Spam.

To personalize the speech, I wanted to add some examples from my own email. But in order to calculate spam probabilities for emails, I needed to find out how many times each word in an email appears in my normal email and how many times it appears in my spam.

Hacking Gmail

Since I use Gmail, I couldn’t easily write a program to search my emails, so I had to rely on Gmail’s search capability. I wasn’t able to count the total occurrences of each word, but I was able to count the number of emails in which a given word occurs.

I used this URL to go to the last search page for the word “google” in my good email:

https://mail.google.com/mail/#search/google/p999

And this URL goes to the last search page for the word “google” in my spam:

https://mail.google.com/mail/#search/in%3Aspam+google/p999

The total number of emails in the search is in the upper right-hand corner. This hack isn’t needed for rare words; there is usually a link for “Oldest” on the search results page. Commonly occurring words, however, don’t have that link.

Only the email body is searched. Headers are ignored.
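
For anyone who wants to repeat this for many words, here is a minimal Python sketch that builds the two kinds of search URL shown above. The function name and parameters are my own invention for illustration, and the URL pattern is simply copied from the examples in this post; it is not a documented Gmail API and may stop working at any time.

```python
from urllib.parse import quote_plus

# URL pattern copied from the two examples above. It reflects Gmail's
# interface at the time of writing and is not a documented API.
GMAIL_SEARCH_BASE = "https://mail.google.com/mail/#search/"


def gmail_search_url(word, in_spam=False, page=999):
    """Build the URL of a very late results page for `word` (hypothetical helper)."""
    query = f"in:spam {word}" if in_spam else word
    # quote_plus encodes ':' as %3A and the space as '+'.
    return f"{GMAIL_SEARCH_BASE}{quote_plus(query)}/p{page}"


print(gmail_search_url("google"))
print(gmail_search_url("google", in_spam=True))
```

The page number 999 is just a value large enough to land on the last page of results, as in the URLs above.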

Calculating spam scores by hand

By modifying the URLs above for each word in an email, I was able to count the number of spam and non-spam emails each word appeared in. Then I used the simple Bayesian statistics outlined by Graham to calculate the spam scores for three emails:

  • A spam, correctly filtered as spam
  • A normal email, correctly not filtered as spam
  • A spam that slipped through the filter and appeared in my inbox

The correctly identified spam had a spam probability of 100%. The normal email had a spam probability of 0%. And the spam that got through had a spam probability of 44%.

Hmm… it’s interesting that this simple hack almost correctly identified the spam that Google didn’t catch.
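
The per-word step can be sketched roughly as follows. This follows the formula in Graham’s essay, but with the number of emails containing the word in place of the word’s total occurrences, since that is all Gmail’s search provides. The function name, the parameter names, and the counts in the example call are made up for illustration. Graham doubles the good counts to bias against false positives and assigns 0.4 to words never seen before; both are exposed as parameters here, and his rule of skipping words seen fewer than five times is left out.

```python
def word_spam_probability(good_emails, spam_emails, total_good, total_spam,
                          good_weight=2.0, unseen=0.4):
    """Estimate the probability that an email containing a word is spam.

    good_emails / spam_emails: number of normal / spam emails with the word.
    total_good / total_spam: total number of normal / spam emails.
    good_weight: multiplier on the good count (Graham uses 2).
    unseen: value returned for a word that appears in neither corpus.
    """
    if good_emails == 0 and spam_emails == 0:
        return unseen
    good_ratio = min(1.0, good_weight * good_emails / total_good)
    spam_ratio = min(1.0, spam_emails / total_spam)
    p = spam_ratio / (good_ratio + spam_ratio)
    # Clamp so that no single word is treated as absolute proof either way.
    return max(0.01, min(0.99, p))


# Hypothetical counts, not taken from my mailbox.
print(word_spam_probability(good_emails=10, spam_emails=30,
                            total_good=1000, total_spam=1000))
```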

Here is the spam:

Your new Diploma!

No examinations!
NO classes!
NO textbooks!
AND
100% Discrete!
Satisfaction guaranteed!
+1 206 30 90 336

Regards!

and each word sorted by its spam probability:

satisfaction 73%
diploma 68%
discrete 61%
examinations 53%
and 46%
your 45%
regards 44%
no 40%
guaranteed 39%
new 38%
textbooks 36%
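
As a rough check, the per-word values above can be combined with the formula from Graham’s essay: the product of the probabilities divided by that product plus the product of their complements. The sketch below assumes each distinct word counts once, and it uses all eleven words because the email has fewer than the fifteen “most interesting” tokens Graham would pick. The names are my own.

```python
from math import prod

# Per-word spam probabilities, copied from the list above.
word_probabilities = {
    "satisfaction": 0.73, "diploma": 0.68, "discrete": 0.61,
    "examinations": 0.53, "and": 0.46, "your": 0.45, "regards": 0.44,
    "no": 0.40, "guaranteed": 0.39, "new": 0.38, "textbooks": 0.36,
}


def combined_spam_probability(probabilities):
    """Combine per-word probabilities as in Graham's naive Bayes formula."""
    probabilities = list(probabilities)
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)


score = combined_spam_probability(word_probabilities.values())
print(f"{score:.0%}")
```

With the rounded percentages listed here this comes out at roughly 45%, close to the 44% I calculated by hand; the small gap is presumably down to rounding in the per-word values.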

I think the key to this getting through was that I was recently a student and the word “textbooks” was in some of my normal emails. Otherwise it might have been tagged as spam.

Getting more data

Based on these three data points, I can conclude that Gmail’s spam filter isn’t any more sophisticated than the one Paul Graham outlined in 2002. Well, maybe more than three emails should be analyzed before making an assertion like that.

Since it’s tedious doing it all by hand, I’d like to see a Greasemonkey script do the work for me. I’m curious how closely the results of this simple filter would compare to Gmail’s. If my curiosity stays high, I might try writing the script myself. I haven’t written a Greasemonkey script before, though, so I don’t know how much work it will be.

Idea for a LinkedIn application

Kevin Day, December 11th, 2007

Since LinkedIn will be allowing 3rd party applications as part of its involvement with OpenSocial, here’s one application I’d like to see:

A status bar showing how I compare to other people with a job like mine. Take my experience and education and compare those with everyone else in a similar job. I am a design engineer with a master’s degree and 1.5 years of experience. My estimate is that my lack of experience would put me somewhere around the 25th percentile of all design engineers.

In addition to my current status, let me know how far I would move up with an extra year of experience, a certification, or another degree. Also, how would I stack up against Senior Design Engineers?

I think now that I’m out of college, I’m grasping for another metric by which to rate myself against others. Regardless, it would help people set better goals in their professional lives.

I’m not sure if the OpenSocial API would allow access to all those other profiles, or if this would require a lot of messy screen scraping. Alternatively, if someone from LinkedIn gets wind of this idea, maybe they could add it themselves (and pay me for the idea too).

Boot dammit!

Kevin Day, December 9th, 2007

The power went out earlier today, and when it came back on I was surprised to find that my server didn’t come back online with the power. There’s no monitor attached to it, so I checked the router and everything else first before I hauled the 20th-century cathode ray tube monitor out of the closet.

It turns out that I had left the install disc in the CD drive after installing a package some time ago, and Ubuntu was booting from the CD instead of the hard drive. D’oh! Time to change the boot order in the BIOS settings so that it doesn’t happen again.