December 2010 Archives

Let me start by saying that most cloud storage solutions are relatively cheap to begin with, so compressing entities or streams stored in a database table or an elastic cloud store may not save you all that much.  Purely in terms of cloud storage costs, you might save pennies or at most dollars if you compress optimally.  Here, "optimally" means you know your payloads will GZIP compress nicely and it makes sense to do so.  In fact, if you forcefully compress entities unnecessarily, you may actually increase their size!

Let's say you have a "bucket" in which you plan to store hundreds or even thousands of cached HTML documents.  Common sense might tell you that HTML compresses well.  Your tiger-like Computer Science instincts were right: HTML, generally speaking, does compress well and is a great candidate for compression.  Obviously, fewer bytes stored in the cloud usually means slightly reduced storage costs.

In Java, most applications represent HTML as a String, which is really just a sequence of characters.  Or, to look at it another way, HTML can be thought of as an array of bytes with a known character encoding.  Thinking in terms of bytes, it's quite easy to GZIP compress and uncompress a byte[] on the fly in Java.  Meet GZIPInputStream and GZIPOutputStream.

GZIP compress an InputStream and return the result as a new byte[] array:

public static final byte[] compress(final InputStream is)
    throws IOException {
    GZIPOutputStream gzos = null;
    try {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        gzos = new GZIPOutputStream(baos);
        copy(is, gzos);
        gzos.finish(); // Important!
        return baos.toByteArray();
    } finally {
        closeQuietly(gzos);
    }
}

GZIP uncompress an InputStream and return the result as a new byte[] array:

public static final byte[] uncompress(final InputStream is)
    throws IOException {
    GZIPInputStream gzis = null;
    try {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        gzis = new GZIPInputStream(is);
        copy(gzis, baos);
        return baos.toByteArray();
    } finally {
        closeQuietly(gzis);
    }
}

First, note that I'm using a ByteArrayOutputStream to hold the resulting compressed or uncompressed byte[] in memory.  Naturally, this means the solution may not be ideal for every application.  Compressing gigabytes of data in memory is a bad idea unless you really know what you're doing.  Proper usage depends on your application and your intentions.

Second, the copy() and closeQuietly() helper methods used above are implemented for you here in my GzipCompressor utility class.
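If you just want a rough idea of what those two helpers do, here is a minimal sketch.  It is not the code from my GzipCompressor class: the StreamHelpers class name, the 4096-byte buffer size, and the compressTo() method are all assumptions made up for illustration.  The compressTo() method also shows one way around the in-memory concern mentioned above, by compressing straight from one stream into another.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
final class StreamHelpers {

    private StreamHelpers() {}

    // Copy every byte from `in` to `out` through a small fixed buffer.
    // Returns the number of bytes copied. Neither stream is closed here.
    static long copy(final InputStream in, final OutputStream out)
        throws IOException {
        final byte[] buffer = new byte[4096]; // Assumed buffer size.
        long total = 0L;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    // Close a stream, ignoring a null argument and any IOException
    // thrown by close().
    static void closeQuietly(final Closeable closeable) {
        if (closeable == null) {
            return;
        }
        try {
            closeable.close();
        } catch (IOException e) {
            // Intentionally ignored.
        }
    }

    // Streaming variant: GZIP everything read from `in` directly into
    // `out`, so the payload is never held in memory as a whole.
    // Closing the GZIPOutputStream also closes `out`.
    static void compressTo(final InputStream in, final OutputStream out)
        throws IOException {
        GZIPOutputStream gzos = null;
        try {
            gzos = new GZIPOutputStream(out);
            copy(in, gzos);
            gzos.finish(); // Write the GZIP trailer.
        } finally {
            closeQuietly(gzos);
        }
    }
}
```

For large payloads, you would typically hand compressTo() a FileInputStream and a FileOutputStream (or whatever stream your storage client exposes) instead of byte arrays.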

Looping back to our cached HTML example, let's compress a tiny HTML document represented here as a String:

// Note that this HTML is tiny, and probably won't compress well at all.
// In fact, the "compressed" result may actually be larger in size than
// the uncompressed original String. This is just an example however,
// to show you how to compress a String literal.
final String html = "<html><body><h1>Horrible HTML</h1></body></html>";

// Get the UTF-8 encoded bytes from the input String. I'm assuming
// that my HTML document is UTF-8 encoded.
final byte[] uncompressed = html.getBytes(UTF_8);
System.out.println("Uncompressed: " + uncompressed.length + "-bytes.");

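// Compress and report the result. The compress() method shown above
// takes an InputStream, so wrap the byte[] in a ByteArrayInputStream.
final byte[] compressed = compress(new ByteArrayInputStream(uncompressed));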
System.out.println("Compressed: " + compressed.length + "-bytes.");

Now that you have a compressed byte[] array, it should be trivial to store it in the cloud using your favorite cloud storage engine or database.  Ideally, you'll want to tweak your entities so that PUTs and GETs automatically compress and uncompress these entities on the fly for you.
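As a rough sketch of that idea, the wrapper below compresses on every put and uncompresses on every get.  Everything here is hypothetical: the CompressingStore class and its method names are invented for illustration, and the HashMap is only a stand-in for whatever cloud store or database you actually use.  It also assumes your documents are UTF-8 encoded.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical wrapper; the Map stands in for a real storage backend.
final class CompressingStore {

    private final Map<String, byte[]> backend = new HashMap<>();

    // PUT: encode the document as UTF-8, GZIP it, store the result.
    void put(final String key, final String document) throws IOException {
        final ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzos = new GZIPOutputStream(baos)) {
            gzos.write(document.getBytes(StandardCharsets.UTF_8));
        } // close() finishes the GZIP stream and writes the trailer.
        backend.put(key, baos.toByteArray());
    }

    // GET: fetch the stored bytes, GUNZIP them, decode as UTF-8.
    // Returns null if nothing is stored under the key.
    String get(final String key) throws IOException {
        final byte[] stored = backend.get(key);
        if (stored == null) {
            return null;
        }
        final ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPInputStream gzis =
                 new GZIPInputStream(new ByteArrayInputStream(stored))) {
            final byte[] buffer = new byte[4096];
            int read;
            while ((read = gzis.read(buffer)) != -1) {
                baos.write(buffer, 0, read);
            }
        }
        return new String(baos.toByteArray(), StandardCharsets.UTF_8);
    }

    // Number of bytes actually held by the backend for this key,
    // or -1 if the key is missing.
    int storedSize(final String key) {
        final byte[] stored = backend.get(key);
        return (stored == null) ? -1 : stored.length;
    }
}
```

Comparing storedSize() against the original byte count is an easy way to see whether compression is paying off for your payloads; for very small documents, the stored size may well be the larger of the two.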

My full GzipCompressor utility class can be found here.

Enjoy.
In many web-service infrastructures, it's often desirable to disable the caching of redirects.  Specifically, you might want to set the Expires or Cache-Control headers so that your 301 or 302 redirects from Apache's mod_rewrite are never cached upstream.  Off the top of my head, I can think of a number of reasons why you might want to prevent the caching of a redirect:

  • Your redirect may change from one request to the next.  Disable caching so the client (the browser) isn't redirected to the same destination every time.

  • Your web-application is behind a reverse caching proxy, and you don't want the caching proxy to cache the redirect.

  • In development, you're sitting behind a corporate web-proxy that is notorious for caching content when it really shouldn't.  Disable caching on the redirects so you can verify that your web-application is working as expected during testing (assuming the web-proxy obeys your Cache-Control and Expires headers).

  • Your web-application counts how many times someone is redirected.  Disable caching so your click-through statistics are a bit more accurate.

Surprisingly, this seemingly common need isn't well documented in the official Apache docs.  So, here's how to do it.

In this example, I'm redirecting based on the Host.  If the incoming request does not match the Host I require, mod_rewrite triggers a 301 redirect to the correct Host.  Of course, your RewriteConds might be different.

RewriteCond %{HTTP_HOST} !^mark\.koli\.ch [NC]
RewriteRule ^/(.*)$ http://mark.koli.ch/$1 [R=301,L,E=nocache:1]

## Set the response header if the "nocache" environment variable is set
## in the RewriteRule above.
Header always set Cache-Control "no-store, no-cache, must-revalidate" env=nocache

## Set Expires too ...
Header always set Expires "Thu, 01 Jan 1970 00:00:00 GMT" env=nocache

In this example, when the RewriteRule is fired the "nocache" environment variable is set.  Note the E=nocache:1 rewrite flag in the RewriteRule.  Subsequently, mod_headers will set the Cache-Control and Expires headers only if this "nocache" environment variable is set.  In other words, "nocache" is only set on a 301 redirect from the RewriteRule.

This works nicely.

GET /wombat HTTP/1.1
Host: koli.ch

HTTP/1.1 301 Moved Permanently
Date: Sat, 11 Dec 2010 19:36:09 GMT
Location: http://mark.koli.ch/wombat
Server: Apache
Cache-Control: no-store, no-cache, must-revalidate
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Length: 230
Content-Type: text/html; charset=iso-8859-1
Connection: close
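If you'd rather check this from code than by reading raw headers, a small Java client can do it.  This is only a sketch: the RedirectCacheCheck class and the isUncachedRedirect() method are names I made up, and treating "no-store" or "no-cache" in Cache-Control as "not cached" is a simplifying assumption.  It uses HttpURLConnection with redirect-following turned off, so the 301 or 302 response itself is inspected rather than its destination.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical checker, for illustration only.
final class RedirectCacheCheck {

    private RedirectCacheCheck() {}

    // Returns true if `url` answers with a 301 or 302 whose
    // Cache-Control header contains "no-store" or "no-cache".
    static boolean isUncachedRedirect(final URL url) throws IOException {
        final HttpURLConnection conn =
            (HttpURLConnection) url.openConnection();
        try {
            // Look at the redirect response itself, not its target.
            conn.setInstanceFollowRedirects(false);
            conn.setRequestMethod("GET");
            final int status = conn.getResponseCode();
            final String cacheControl = conn.getHeaderField("Cache-Control");
            final boolean redirect =
                status == HttpURLConnection.HTTP_MOVED_PERM     // 301
                || status == HttpURLConnection.HTTP_MOVED_TEMP; // 302
            return redirect
                && cacheControl != null
                && (cacheControl.contains("no-store")
                    || cacheControl.contains("no-cache"));
        } finally {
            conn.disconnect();
        }
    }
}
```

Point it at a URL that should trigger your RewriteRule and it returns true only when the redirect comes back with the no-cache headers attached.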

Yay for HTTP.
For months, I've had a spare 20" HP LCD-2065 display sitting under my desk at the office.  With a few extra cycles on my hands, I decided to take half a day and set up a truly bad ass developer's workstation: three 20-inch monitors, Xinerama'ed to produce a single 4800x1200 pixel desktop (each display driving 1600x1200 @ 60 Hz).  And, best of all, the HP Z600 Workstation powering this monster is running 64-bit Ubuntu Linux 10.04.

[Photo: ubuntu-hp-z600-nvidia-fx-1800.jpg]

Not bad, eh?

Here's how I did it ...

About this Archive

This page is an archive of entries from December 2010 listed from newest to oldest.
