Archive for posts tagged ‘google’
Create a robots.txt if you haven’t yet
Kevin Day, February 5th, 2010
I have anecdotal evidence that Google is more likely to crawl and index your site if you have a robots.txt (set to allow googlebot) than if you don’t have any robots.txt file.
For Crunch Course, I’ve taken my time setting up a robots.txt file, partly because I didn’t think it would help much and partly because it takes a couple of steps to do it in Django.
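For anyone who hasn’t written one, here is a minimal sketch of what an allow-everything robots.txt looks like, checked with Python’s standard library parser. The file contents and the example.com URL are assumptions for illustration, not the actual Crunch Course setup; in Django the same text would be served at /robots.txt by a view or a template.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body: an empty Disallow line lets every crawler,
# including Googlebot, fetch everything.
ALLOW_ALL = "User-agent: *\nDisallow:\n"

# For comparison: a file that blocks every crawler from the whole site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com stands in for the real site.
    print(is_allowed(ALLOW_ALL, "Googlebot", "http://example.com/courses/"))
    print(is_allowed(BLOCK_ALL, "Googlebot", "http://example.com/courses/"))
```

The check is only a sanity test of the file’s syntax; it says nothing about how often Google will actually crawl.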
As a result, Google’s cache of Crunch Course is about a month old.
I finally got around to adding a robots.txt yesterday, and I’m now getting a bunch of 404 errors from the googlebot for old links whose URL structure I’ve since changed. That suggests to me that it checks for a robots.txt frequently, but it is much more likely to actually crawl the site if it has the green light to do so. Of course it could also just be a coincidence, but that doesn’t make for a good blog post.
So there you have it, indisputable evidence from one data point that Google is more likely to crawl your site if you have a robots.txt than if you don’t.
Google apps and Microsoft’s file format
Kevin Day, February 21st, 2008
Yet another post about a Google app and Joel on Software… anyways…
I don’t know if it’s my imagination, but Google Spreadsheets always seems to be a little bit faster and snappier each time I use it. Right now I’m working from a coffee shop with a spotty internet connection, but it’s working as well as ever.
That made me think about how Google must know absolutely every tiny trick about optimizing JavaScript in ways that no other company can even fathom. It seems like a good parallel to how Joel says Microsoft was light-years ahead of most other software companies for knowing how to write the fastest code possible on the hardware of the time.
Maybe in 10 years we’ll see Google release their specifications for how they store and update a spreadsheet and complain about how complex it is. However, it’ll be that way because it includes hacks to work around IE6’s bugs.
Wanted: Smarter Gmail filtering
Kevin Day, February 20th, 2008
A while ago when I first saw Gmail’s option for “Filter messages like these,” I thought that it would use Bayesian statistics to automatically determine which category an email falls into. I was disappointed that it just starts the filter menu with the “To” field pre-populated.
Joel Spolsky mentioned that his company’s software filters email that way and it works effectively.
It just seems like a much more Google-like way of organizing things than filtering based on To/From addresses.
Reverse-engineering Gmail’s spam filter
Kevin Day, December 21st, 2007
I gave a speech at Toastmasters two weeks ago about how a spam filter works. The audience was non-technical, so the speech was mostly a summary of Paul Graham’s essay, A Plan for Spam.
To personalize the speech, I wanted to add some examples from my email. But in order to calculate spam probabilities for emails, I needed to find how many times each word in an email appears in my normal email and how many times it appears in my spam.
Hacking Gmail
Since I use Gmail, I couldn’t easily write a program to search my emails, so I had to rely on Gmail’s search capability. I wasn’t able to count the total occurrences of each word, but I was able to count the number of emails in which a given word occurs.
I used this url to go to the last search page for the word “google” in my good email:
https://mail.google.com/mail/#search/google/p999
And this url goes to the last search page for the word “google” in my spam:
https://mail.google.com/mail/#search/in%3Aspam+google/p999
The total number of emails in the search is in the upper right-hand corner. This hack isn’t needed for rare words; there is usually a link for “Oldest” on the search results page. Commonly occurring words, however, don’t have that link.
Only the email body is searched. Headers are ignored.
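Building those two URLs for every word in an email gets repetitive, so here is a small sketch of a helper that generates them. The function name and the spam flag are hypothetical; the URL pattern is simply the one shown above, and Gmail may change it at any time.

```python
from urllib.parse import quote_plus

# Base of the Gmail search URLs shown above.
GMAIL_SEARCH = "https://mail.google.com/mail/#search/"


def last_page_url(word: str, spam: bool = False) -> str:
    """Build the URL for the last search results page for a word.

    With spam=True the query is prefixed with "in:spam" so that only
    the spam folder is searched.
    """
    query = f"in:spam {word}" if spam else word
    # quote_plus turns ":" into %3A and the space into "+".
    return f"{GMAIL_SEARCH}{quote_plus(query)}/p999"


if __name__ == "__main__":
    print(last_page_url("google"))
    print(last_page_url("google", spam=True))
```

The email counts still have to be read off the page by hand; this only saves the typing.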
Calculating spam scores by hand
By modifying the urls above for each word in an email, I was able to count the total number of emails each word appeared in for spam and non-spam. Then I used the simple Bayesian statistics outlined by Graham to calculate the spam scores for three emails:
- A spam, correctly filtered as spam
- A normal email, correctly not filtered as spam
- A spam that slipped through the filter and appeared in my inbox
The correctly identified spam had a spam probability of 100%. The normal email had a spam probability of 0%. And the spam that got through had a spam probability of 44%.
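The per-word step of that calculation can be sketched roughly as follows. This follows the formula in Graham’s essay, with one substitution: the counts are numbers of emails containing the word rather than total occurrences, since that is all Gmail’s search provides. The function name, parameter names, and the counts in the usage lines are hypothetical, and it is an assumption that the by-hand numbers used Graham’s doubling of the good count and his minimum-count cutoff.

```python
def word_spam_probability(good_count, spam_count, total_good, total_spam):
    """Estimate the probability that an email containing a word is spam.

    good_count / spam_count: emails containing the word in each pile.
    total_good / total_spam: total emails in each pile (must be > 0).
    Returns None when the word is too rare to say anything about.
    """
    g = 2 * good_count  # Graham doubles good counts to bias against false positives
    b = spam_count
    if g + b < 5:
        return None  # too little data, as in Graham's essay
    good_ratio = min(1.0, g / total_good)
    spam_ratio = min(1.0, b / total_spam)
    p = spam_ratio / (good_ratio + spam_ratio)
    # Clamp so no single word is ever treated as absolute proof.
    return max(0.01, min(0.99, p))


if __name__ == "__main__":
    # Made-up counts: a word seen in 25 of 1000 good emails and 50 of 1000 spams.
    print(word_spam_probability(25, 50, 1000, 1000))
```

With the doubling, a word has to show up in spam more than twice as often as in good mail before it scores above 50%.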
Hmm… it’s interesting that this simple hack almost correctly identified the spam that Google didn’t catch.
Here is the spam:
Your new Diploma!
No examinations!
NO classes!
NO textbooks!
AND
100% Discrete!
Satisfaction guaranteed!
+1 206 30 90 336
Regards!
and each word sorted by its spam probability:
- satisfaction 73%
- diploma 68%
- discrete 61%
- examinations 53%
- and 46%
- your 45%
- regards 44%
- no 40%
- guaranteed 39%
- new 38%
- textbooks 36%
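To show how those per-word numbers turn into one score for the whole email, here is a sketch of the combining step from Graham’s essay, fed with the rounded percentages above. Graham only uses the 15 most interesting words; this email has 11, so the sketch assumes all of them are used. The function and variable names are hypothetical.

```python
from math import prod

# Per-word spam probabilities from the list above, as rounded percentages.
WORD_PROBS = {
    "satisfaction": 0.73, "diploma": 0.68, "discrete": 0.61,
    "examinations": 0.53, "and": 0.46, "your": 0.45, "regards": 0.44,
    "no": 0.40, "guaranteed": 0.39, "new": 0.38, "textbooks": 0.36,
}


def combined_spam_probability(probs):
    """Combine per-word probabilities into one spam probability.

    Uses the naive Bayes combination from A Plan for Spam:
    prod(p) / (prod(p) + prod(1 - p)).
    Assumes every p is strictly between 0 and 1.
    """
    probs = list(probs)
    spam = prod(probs)
    ham = prod(1 - p for p in probs)
    return spam / (spam + ham)


if __name__ == "__main__":
    score = combined_spam_probability(WORD_PROBS.values())
    print(f"{score:.0%}")
```

Because the inputs are rounded, the result should land in the mid-40s rather than match the 44% figure exactly.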
I think the key to this getting through was that I was recently a student and the word “textbooks” was in some of my normal emails. Otherwise it might have been tagged as spam.
Getting more data
Based on these three data points, I can conclude that Gmail’s spam filter isn’t any more sophisticated than the one Paul Graham outlined in 2002. Well, maybe more than three emails should be analyzed before making an assertion like that.
Since it’s tedious doing it all by hand, I’d like to see a Greasemonkey script do the work for me. I’m curious how closely the results of this simple filter would compare to Gmail’s. If my curiosity stays high I might try writing the script myself. I haven’t written a Greasemonkey script before, though, so I don’t know how much work it will be.
