November 2009 Archives

delicious-vs-kolich-domain-hack-google.png

You're reading this blog, so you may have noticed that I'm operating under mark.koli.ch, a domain hack of my name using the Swiss .ch ccTLD.  The previously retired kolich.com sits idle, serving up HTTP 410 Gone responses when appropriate.  Interestingly enough, however, since making the switch to koli.ch I've noticed a significant drop-off in traffic to my blog, particularly in traffic arriving directly from Google.  I suspect that the switch to a non-traditional ccTLD, like .ch, is to blame.  It seems that Google's ranking algorithms consider dot-com domains "more valuable", or more relevant, than sites hosted under various ccTLDs.

Equally annoying is that Google, and even Yahoo, fail to properly understand my koli.ch domain hack at all.  In other words, when someone issues a search for "kolich blog", Google and Yahoo fail to interpret "koli.ch" as "kolich".  Therefore, search results with "kolich" as a whole word in the URL appear higher in the search results than others with "koli.ch".  Google must understand some domain hacks though, given that they seem to recognize "del.icio.us" as "delicious" in their search results.  I'm still wondering how to tell Google that "koli.ch" should be interpreted as "kolich", a whole word.

It's funny though because I've often criticized and poked fun at folks touting various SEO (search engine optimization) techniques.  My opinion has been that as long as you build a great site, with decent information on it, "they" (visitors) will come.  I still live by that mantra but it's clear to me now that Google, and other search engines, really do pay careful attention to your domain.  Even so, I have no plans to switch back to kolich.com, but if you are considering a switch to a domain hack like mark.koli.ch you should expect or at least be aware of the changes this move might bring to your page rank in various search results.

curl-reject-trace-method.png

It's long been rumored that exposing the HTTP TRACE and TRACK methods on your web-server can open the door to a number of miscellaneous vulnerabilities, including cookie theft and other cross-site tracing attacks.  Many resources out there claim you should configure your web-server to flat-out reject TRACE and TRACK requests, and I agree with them.  Generally speaking, there's really no good reason (that I've found) to require or make use of TRACE or TRACK.  With that said, if you're running Apache, it's fairly easy to reject TRACE and TRACK using mod_rewrite:

RewriteCond %{REQUEST_METHOD} ^TRACE [NC,OR]
RewriteCond %{REQUEST_METHOD} ^TRACK [NC]
RewriteRule ^/(.*)$ - [F,L]

You can prove to yourself that this works by using a tool like curl to issue an HTTP TRACE or TRACK request to your newly secured web-server.  Use the -X option with curl to specify the HTTP request method:

#/> curl -v -A "Curl" -X TRACE mark.koli.ch
* About to connect() to mark.koli.ch port 80 (#0)
* Trying 24.130.215.240... connected
* Connected to mark.koli.ch (24.130.215.240) port 80 (#0)
> TRACE / HTTP/1.1
> User-Agent: Curl
> Host: mark.koli.ch
> Accept: */*
>
< HTTP/1.1 403 Forbidden
< Date: Sat, 14 Nov 2009 18:53:06 GMT
< Server: Apache
< Content-Length: 202
< Connection: close
< Content-Type: text/html; charset=iso-8859-1
<
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /
on this server.</p>
</body></html>
* Closing connection #0

Yep, works nicely.  One thing that slightly annoys me though is that the HTTP OPTIONS method still reports that my server supports TRACE, even though I clearly don't anymore.  A quick Google search reports that many other folks have had the same concern, with no clear resolution.
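
If you want to keep an eye on what OPTIONS advertises, the Allow header is easy to inspect.  Here's a minimal sketch in Python, assuming you've already fetched the header value (for example with curl -X OPTIONS).  The function names and the sample header value are hypothetical and are not output captured from this server:

```python
def advertised_methods(allow_header):
    """Split an HTTP Allow header value into a set of upper-case method names."""
    return {m.strip().upper() for m in allow_header.split(",") if m.strip()}


def still_advertises_trace(allow_header):
    """Return True if TRACE or TRACK appears in the Allow header value."""
    return bool(advertised_methods(allow_header) & {"TRACE", "TRACK"})


# Hypothetical header value, for illustration only.
sample = "GET,HEAD,POST,OPTIONS,TRACE"
print(still_advertises_trace(sample))
```

This only tells you what the server claims to support; the curl test above is still the way to confirm that TRACE is actually rejected.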

This afternoon, I was using the HttpFox Firefox extension to analyze some web-traffic for a work-related project.  With HttpFox still running in the background (I forgot I left it running), I opened another tab and navigated over to my Twitter page to check out a few things.  I clicked a few links, replied to a few folks, etc.  Switching back to my work project, I closed Twitter and re-opened HttpFox.  Well well well, what do we have here.  I discovered that Twitter silently rolled out some JavaScript that actively tracks every link I click on.  Any link, in any Tweet, that you click on is silently reported back to Twitter behind the scenes.  Looking at the output of HttpFox pretty much proves it:

twitter-abacus-tracking-clicked-links.png

Looks like twitter.com/abacus is some type of web-service used by Twitter to log what links we're all clicking on.  I'm curious why Twitter cares what links we're clicking on.

BTW, Twitter, if you're reading this, the HTTP Content-Type in your responses from /abacus is incorrect.  You're phoning home by creating a new Image() in your core JavaScript like so:

(new Image()).src="/abacus?"+$.param(A);

But your Content-Type from this request is text/html, which could cause problems in a few browsers.  If you're going to use an Image(), the returned Content-Type from your /abacus web-service should be that of an image: image/jpeg, image/png, image/gif, etc.
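
For illustration, here's a rough sketch of what a tracking endpoint that answers with a proper image might look like.  It's written as a Python WSGI app purely as an assumption for the example; nothing here reflects how Twitter's /abacus service is actually built, and the name beacon_app and the print-based "logging" are hypothetical:

```python
import base64

# A commonly used 1x1 GIF, base64-encoded, so the response body is a real image.
PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")


def beacon_app(environ, start_response):
    """Hypothetical WSGI endpoint: record the query string, answer with a GIF."""
    clicked = environ.get("QUERY_STRING", "")
    print("logged click:", clicked)  # stand-in for real logging
    start_response("200 OK", [
        ("Content-Type", "image/gif"),           # matches what new Image() expects
        ("Content-Length", str(len(PIXEL))),
        ("Cache-Control", "no-store"),           # keep browsers from caching the beacon
    ])
    return [PIXEL]
```

The point is simply that the body and the Content-Type agree: the browser asked for an image and gets one.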

twitter-abacus-tracking-clicked-links-wrong-content-type.png

Cheers.

Rediscovering HTTP 410 Gone

http-410-gone.png

This evening, I accomplished another important milestone in the kolich.com migration process.  Since acquiring koli.ch, I've been slowly migrating my blog and all of its resources to my new online home under mark.koli.ch.  This migration began back in September 2009, as I aptly described here and here.  Only problem is, I still see a ton of bots and other users requesting resources under kolich.com, even though as far as I'm concerned, it's shut down and there's nothing to see there.

For a month or so, I've been gracefully redirecting traffic with an HTTP 301 Moved Permanently.  Even so, it appears that Google and other crawlers are still hitting kolich.com looking for stuff that simply doesn't exist there anymore, even though I've been telling them for a solid month to "go look somewhere else."  Time to pull out the big guns.  A quick flip through my handy copy of RFC 2616 (that's the HTTP/1.1 spec) led me to rediscover HTTP 410 Gone.  If you haven't met HTTP 410, it's the forgotten stepchild of HTTP 404 Not Found.  As described here, "Error 410 means 'Resource gone', as in, a resource used to exist at this location, but now it's gone. Not only is it gone, but I don't know (or I don't want to tell you) where it went. If I knew where it went, and I wanted to tell you, I would use error 301 ('Permanent redirect') and any smart client would simply redirect to the new address. But 410 means 'Resource gone, no forwarding address'. Train gone sorry."

Looks great, that's exactly what I need.  Time to serve up some 410's.  I configured my local mark.kolich.com Apache 2.2.3 virtual host with mod_rewrite to return an HTTP 410 Gone for most resources:

RewriteCond %{REQUEST_URI} !\.(html?)$ [NC]
RewriteCond %{REQUEST_URI} !^/$ [NC]
RewriteRule ^/(.*)$ - [G,L]

Note the [G,L] on the RewriteRule directive: G means Gone, and L marks this as the last rule in the chain to apply to the request.  In this case, I immediately respond with an HTTP 410 Gone to any request for a resource that doesn't end in .html (or .htm) and isn't aimed at the server root.  Here's a nice example.  I'm handling HTML pages a little differently.  Requests for an actual blog entry (a resource that ends in .html) are caught and handled a little more gracefully, as shown here.  I haven't yet decided when to phase out this graceful catch.
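
To make the matching logic concrete, here's a small sketch in Python that mirrors the two RewriteCond lines.  It's only an illustration of the decision the rule makes, not something Apache runs; the function name status_for and the sample paths are made up:

```python
import re

# Same pattern as the first RewriteCond: ends in .htm or .html, case-insensitive.
HTML_PAGE = re.compile(r"\.html?$", re.IGNORECASE)


def status_for(path):
    """Return 410 for paths the rule would match, or None if the rule is skipped."""
    if path == "/" or HTML_PAGE.search(path):
        return None  # server root and HTML pages fall through to other handling
    return 410


# Hypothetical request paths, for illustration only.
for p in ("/", "/blog/post.html", "/old.HTM", "/feed.xml", "/images/logo.png"):
    print(p, status_for(p))
```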

In any event, let's see if an HTTP 410 gets the attention of those pesky crawlers and RSS feed readers.  To be continued ...

Many sites are moving towards dynamic XML sitemaps.  These sitemaps let you tell Google, Yahoo, Bing, and Ask.com which pages on your site they should index, how often, and when they were last modified.  You can even assign a priority to each page in the sitemap, which serves as an indication of how important a specific page is in relation to others.

The sitemap protocol is well defined here at sitemaps.org.

Yesterday, I configured Movable Type, my blog publishing platform, to automatically generate my own sitemap.xml when I publish a new page or blog entry.  I added a custom Movable Type Index Template that would automatically generate a complete sitemap.xml for me, and place it under the root of my blog at http://mark.koli.ch/sitemap.xml.


1- My Sitemap XML Index Template

My custom sitemap XML Index Template is relatively straightforward.  In my Movable Type control panel, I clicked "Create index template" on the Blog Templates screen.  I named my template "XML Sitemap" and used the following configuration:

<?xml version="1.0" encoding="<$mt:PublishCharset$>"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

<!-- blog root -->
<url>
<loc><$mt:BlogURL encode_xml="1"$></loc>
<lastmod>
<mt:Entries lastn="1">
<$mt:EntryModifiedDate utc="1" format="%Y-%m-%dT%H:%M:%S+00:00"$>
</mt:Entries>
</lastmod>
<changefreq>weekly</changefreq>
<priority>1.0</priority>
</url>

<!-- pages -->
<mt:Pages lastn="0">
<url>
<loc><$mt:PagePermalink encode_xml="1"$></loc>
<lastmod><$mt:PageModifiedDate utc="1" format="%Y-%m-%dT%H:%M:%S+00:00"$></lastmod>
<priority>0.8</priority>
</url>
</mt:Pages>

<!-- entries -->
<mt:Entries lastn="0">
<url>
<loc><$mt:EntryPermalink encode_xml="1"$></loc>
<lastmod><$mt:EntryDate utc="1" format="%Y-%m-%dT%H:%M:%S+00:00"$></lastmod>
<changefreq>never</changefreq>
<priority>0.6</priority>
</url>
</mt:Entries>

<!-- archives -->
<mt:ArchiveList archive_type="Monthly">
<url>
<loc><$mt:ArchiveLink encode_xml="1"$></loc>
<changefreq>never</changefreq>
<priority>0.4</priority>
</url>
</mt:ArchiveList>

</urlset>

Using the sitemap XML protocol defined at sitemaps.org, I configured this template to include my blog root, all pages, all entries, and all archives in the sitemap.  I assigned a higher priority to my blog root and individual pages than to the entries and archives.  Note that I also omitted the <changefreq> tag under each "page", because I have no idea how often those pages will actually change.  Also, I intentionally omitted the <lastmod> tag under each archive page, since again, there's no point in defining the last modified date on an archive.
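
If you're not on Movable Type, the same structure is easy to produce elsewhere.  Here's a minimal sketch in Python using only the standard library; the function name build_sitemap, the dict-based input, and the example.com URLs are assumptions for illustration, not part of my template:

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Build a sitemap <urlset> string from dicts with 'loc' and optional fields."""
    urlset = ET.Element("urlset", xmlns=NS)
    for entry in urls:
        url = ET.SubElement(urlset, "url")
        # Emit children in the order the sitemap protocol lists them,
        # skipping any optional tag the entry does not provide.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical entries, loosely following the priorities used in the template.
print(build_sitemap([
    {"loc": "http://example.com/", "changefreq": "weekly", "priority": "1.0"},
    {"loc": "http://example.com/2009/11/", "changefreq": "never", "priority": "0.4"},
]))
```

ElementTree escapes characters like & in the URLs for you, which is the job encode_xml="1" does in the Movable Type tags.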

edit-template-xml-sitemap.png

Of course, you're free to change this template as you see fit as long as it adheres to the sitemap standard.


2- Submit Your Sitemap XML (Submission URL's)

Once you publish your sitemap with Movable Type, you'll probably want to alert Google, Bing, Yahoo and Ask.com that you've got a new sitemap.xml available for your blog.  As described here in the sitemap protocol, you can "ping" these web crawlers to alert them of the change.  To do so, copy and paste these URL's into a web-browser, and replace <sitemap URL> with the full URL to your new sitemap:

http://www.google.com/ping?sitemap=<sitemap URL>
http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=<sitemap URL>
http://submissions.ask.com/ping?sitemap=<sitemap URL>
http://www.bing.com/webmaster/ping.aspx?siteMap=<sitemap URL>

Example:

http://www.google.com/ping?sitemap=http://mark.koli.ch/sitemap.xml
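
The sitemap protocol also asks that everything after ping?sitemap= be URL-encoded.  A browser will usually cope with the raw form above, but if you're scripting the ping, a small helper like this Python sketch does the encoding.  The function name ping_url is made up, and the endpoint is simply the Google one listed above (it may no longer be live):

```python
from urllib.parse import quote


def ping_url(endpoint, sitemap_url):
    """Append a fully URL-encoded sitemap URL to a ping endpoint."""
    return endpoint + quote(sitemap_url, safe="")


print(ping_url("http://www.google.com/ping?sitemap=", "http://mark.koli.ch/sitemap.xml"))
# http://www.google.com/ping?sitemap=http%3A%2F%2Fmark.koli.ch%2Fsitemap.xml
```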

On each submission, you should see an HTTP 200 OK response indicating that your sitemap was accepted.  Here's what Google's looked like:

google-added-sitemap.png

Enjoy!

Onyx (beta)

onyx-logo.jpg

If you follow me on Twitter, you may have noticed I silently launched Onyx a few weeks ago.  In a nutshell, as described on the Onyx homepage ...

"Onyx is a social file management tool I built to help me keep track of, organize, and share my digital archive. While browsing the web, I tend to accumulate a lot of junk; if I like something, I save it. If I see a cool application of some sort, I'll take a screen shot. If I find a cool song, I'll snag it for later. Or, if I have an important document I need to archive, I'll store it. All of this digital content was sitting around in a relatively unorganized and unsearchable set of files and directories on a local file system. Onyx is my solution to this digital content clutter problem. Files and bookmarks uploaded into Onyx can be protected, searched, organized and shared much easier than a set of files and directories on my local disk."

Yep.

Onyx was a chance for me to "cross off" an important task on my digital TODO list that's been hanging over my head for a while: organize and archive all of my digital crap.  It also gave me a chance to play with some new technologies I've been wanting to integrate into a real project for quite a while, like jQuery UI's draggable and droppable.  I also learned how to base-36 encode numbers for a tiny URL, and solved a very annoying problem using HTTPS with Internet Explorer.
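
For anyone curious about the base-36 trick, the idea is to repeatedly divide a numeric ID by 36 and map each remainder to one of the characters 0-9 and a-z.  Here's a minimal sketch in Python; it is an illustration of the general technique, not the code Onyx uses (Onyx is PHP), and the name base36_encode is made up:

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # "0"-"9" then "a"-"z"


def base36_encode(n):
    """Encode a non-negative integer as a lower-case base-36 string."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)       # peel off the least significant digit
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


print(base36_encode(36))  # 10
```

Python's built-in int("10", 36) reverses the encoding, which makes it easy to turn a short URL token back into a numeric ID.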

In the last 24 hours, I finished uploading all of my personal, and public, digital content into Onyx, which you can browse here from my Onyx home directory.  Of course, like any good file management solution, my personal/private files are protected.  What you'll see in my home directory are files and folders I've allowed the public to view.

For the curious software engineer, Onyx is written entirely in PHP running on Apache 2.2.3.  I'm also using a clever little mod_rewrite hack in Apache to drop the .php on each Onyx URL.  Dropping the .php makes my URL's look a little cleaner; hipster Django and RoR can suck on that one.  You may also ask why I named this project "Onyx".  As described here on Wikipedia, Onyx is a type of colorful layered quartz which contains bands of almost every color.  This colorful layering reminded me of the layered structure of a file system: files, folders, bookmarks, etc. all mashed together.  Hence, Onyx.

If you'd like to read a little more about my Onyx project, you might find this post interesting.  Thoughts and feedback are always welcome.

Rock on.

google-cookie-block.png

Today, the spoiled elitists at Google announced the release of Google Dashboard.  It's a way for you to see, in reasonable detail, what Google knows about you based on the Google services you use. Their blog claims, "in an effort to provide you with greater transparency and control over their own data, we've built the Google Dashboard." Clever.

What's bothersome to me is that I need a Google Account to see what Google supposedly knows about me.  Well, what about those cute little .google.com cookies they shove into my browser when I use their search engine?  IMHO, Google Dashboard is missing one key feature: the ability to clearly show me what Google knows about me and my web-search history, anonymously, based on the already unique ID tracking numbers in those cookies.  Google, why do I need an account to see what you've learned about me based on my "anonymous" web-history?

There are probably only a few realistic explanations for why Google wouldn't let you see this information:

  1. Their cookies aren't actually used for tracking of web-searches and user habits.  I suppose this is a possibility.
  2. Or, more likely, analyzing your web-search traffic is where the real bacon is.  And, not surprisingly, Google doesn't want to show us the real underlying data their advertising engine uses to show us ads, which is their primary revenue stream.  I guess I don't blame them.  After all, they are just another public corporation with shareholder responsibilities.

I'm awfully tired of the world bending over and blindly accepting everything Google throws at us as the greatest thing since sliced bread.  If you really understand how Google makes their money, you should also try to understand what Google is not showing us, or not telling us, and why.


Blocking Google Cookies in Firefox

For the most part, I've given up on Google.  Their web-search is fine, but I don't particularly enjoy the fact that my web-search and browsing history is "anonymously" tracked behind my back.  If you'd like to permanently, or temporarily, block Google from inserting their nosy tracking cookies into your browser you can easily do so by setting a "cookie exception" in Firefox (assuming you use Firefox):

  1. Click the Tools menu, and select "Options...".
  2. Click the Privacy tab.
  3. Click the "Exceptions..." button.
  4. In the "Address of web site:" box, enter ".google.com" (without the quotes) and click Block to add the google.com domain to your blocked list.

google-block-cookie-howto.png

A few blog readers astutely pointed out that if you block cookies from .google.com, you won't be able to log in to any Google services.  Yes, I know that.  And for the record, I don't use Gmail or any other Google Account that would require me to log in on a regular basis.  When I need to log in to my Google Code account, I temporarily unblock .google.com and log in.

Twitter (@markkolich)

About this Archive

This page is an archive of entries from November 2009 listed from newest to oldest.
