Prevent Google From Caching Your Site (Meta tags: googlebot and robots)

A few weeks ago, I found myself sitting waist deep in blog regret.  You know, that feeling you get when you post something you probably shouldn't have, and then you realize a month later that it was a bad idea.  Long story short, as an HP employee, I posted some (neutral?) comments on HP's recent pay cuts.  Some folks didn't think they were very neutral, and said some pretty nasty things about me on another blog.  I found out about this (the hard way), and decided to remove the post from kolich.com altogether.

The removal process went fine (thank you Movable Type), but after analyzing my server logs a bit, I noticed that some clever folks were using Google to look up a cached copy of the post in question.  Unfortunately, there was nothing I could do about cached copies of the post floating out there on the web.  The "damage" (if you would call it that) was already done.

However, to prevent future debacles, I decided that I will 1) never again post anything about my job, career, or employer to my blog, and 2) never again allow Google to keep cached copies of my content.  Which brings me to the heart of this post.  If you want to prevent Google from caching your content, you have a few options:

  • You can try adding a "Noarchive: /" record to your robots.txt file.  Be aware that Noarchive: is not part of the robots.txt standard, so support for it is unofficial at best and you shouldn't count on any particular bot honoring it.  Where it is honored, bots that care about robots.txt (most well-established bots do) will see this and won't archive/cache your content.  Noarchive: takes a path, so you can be as specific or generic as you want.  If you only want to block a specific page or section of your site, you can say "Noarchive: /block/this/relative/from/root/".

  • If you don't like the Noarchive: robots.txt option, you can add a "noarchive" <meta> tag to each page you'd like to exclude from caching.  This is the approach I took for this blog (kolich.com):

    <meta name="googlebot" content="noarchive" />
    <meta name="robots" content="noarchive" />

In Movable Type, I added these <meta> tags to my HTML Head template.  Any new post that I publish (like this one) will automatically contain these tags.  This works nicely.  With these meta tags in place, Google should stop keeping cached copies of my content once it recrawls each page.
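
If you want to double-check that a published page really carries the tags, you can parse its HTML and look for them.  Here's a minimal Python sketch that uses only the standard library.  The names RobotsMetaParser and has_noarchive are my own, made up for illustration; they aren't part of Movable Type or anything Google provides, and the sample HTML is just a stand-in for your own page source.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    WATCHED_NAMES = {"robots", "googlebot"}

    def __init__(self):
        super().__init__()
        # Maps the meta name (lowercased) to the set of directives found for it.
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr_map = {key: (value or "") for key, value in attrs}
        name = attr_map.get("name", "").strip().lower()
        if name not in self.WATCHED_NAMES:
            return
        # The content attribute may hold several comma-separated directives.
        parts = attr_map.get("content", "").split(",")
        found = {part.strip().lower() for part in parts if part.strip()}
        self.directives.setdefault(name, set()).update(found)


def has_noarchive(html):
    """Return True if any watched meta tag in the HTML includes 'noarchive'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return any("noarchive" in found for found in parser.directives.values())


if __name__ == "__main__":
    # Hypothetical page source; replace with the HTML of your own post.
    sample = (
        "<html><head>"
        '<meta name="googlebot" content="noarchive" />'
        '<meta name="robots" content="noarchive" />'
        "</head><body>Hello</body></html>"
    )
    print(has_noarchive(sample))
```

This only tells you the tags are in the page you serve.  It says nothing about whether a given search engine has picked them up yet.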

On another note, I question the usefulness of search engines, like Google, keeping cached copies of websites and other content found on the web.  In general, I suspect most bloggers and webmasters don't want cached copies of their content floating around the net.  With cached copies, you lose control over who views your content and in what context.  In my opinion, a better approach is to let bloggers and webmasters "opt-in" to site caching and archiving.  Instead of assuming everyone wants their content archived on Google, it would be nice if Google assumed the opposite: that we don't want our stuff cached.  Anyone who does could then opt-in individually by adding a <meta content="archive"> tag, or a similar directive to their robots.txt file.

Oh well, you live, you learn.


About Mark

A Silicon Valley native, Mark Kolich is a full-time Software Engineer, a casual entrepreneur, and a consultant for hire. A web technologies expert, his current focus is on building powerful and robust cloud-driven web-applications using Java, PHP, Perl, AJAX, DHTML, CSS, and JavaScript. His favorite programming languages are PHP, Java and JavaScript. He uses Linux, enjoys biking to work, loves building great software, and always writes elegant, readable, and maintainable code.


About this Entry

This page contains a single entry by Mark Kolich published on April 11, 2009 2:48 PM.
