Content tagged: aws

A Quick Bash Script to Backup Files on AWS S3

bf9ffc5cee15672bfcc95f2db9828a9c15ec36d6

Sun Jan 11 13:08:26 2015 -0800

Last week, I lost a disk in a 4TB software RAID5 array mounted in my home Linux server. This host has been online for almost 4 years without any major interruptions, so the clock was ticking — it was really only a matter of time until I would be forced to replace a disk. Fortunately, replacing the disk in the array was a complete breeze. No data was lost, and rebuilding the array with a new disk took only 160 minutes while the host remained online. Hats off to the folks maintaining Linux RAID — the entire disk replacement process, end-to-end, was flawless.

Before I could replace the failing disk, the array was limping along in a degraded state. This got me thinking: I already regularly back up the data I cannot live without to an external USB pocket drive and store it “offsite” — what if I could sync the most important stuff to the “cloud” too?

So, I sat down and wrote a quick bash script that recursively crawls a root directory of my choosing and uses curl to upload each discovered file to AWS S3. Note that the structure of the backup on S3 will exactly match the file/directory structure on disk:

#!/bin/bash

S3_KEY="[YOUR AWS KEY HERE]"
S3_SECRET="[YOUR AWS SECRET HERE]"

BUCKET="[YOUR BUCKET NAME HERE]"

CONTENT_TYPE="application/octet-stream"

# Recursively walk every path passed as an argument; -print0 with a
# null-delimited read safely handles file names containing spaces.
find "$@" -type f -print0 | while IFS= read -r -d '' i; do
  # URL-encode the file path so it's safe to use in the request URI.
  FILE="$(perl -MURI::Escape -e 'print uri_escape($ARGV[0],"^A-Za-z0-9\-\._~\/");' "$i")"
  RESOURCE="/${BUCKET}/${FILE}"
  DATE_VALUE=$(date -R)
  # First, issue a signed HEAD request to see if the object already exists.
  STRING_TO_SIGN="HEAD\n\n\n${DATE_VALUE}\n${RESOURCE}"
  SIGNATURE=$(echo -en "${STRING_TO_SIGN}" | openssl sha1 -hmac "${S3_SECRET}" -binary | base64)
  EXISTS=$(curl -s -I -w "%{http_code}" \
    -o /dev/null \
    -H "Host: ${BUCKET}.s3.amazonaws.com" \
    -H "Date: ${DATE_VALUE}" \
    -H "Authorization: AWS ${S3_KEY}:${SIGNATURE}" \
    "https://${BUCKET}.s3.amazonaws.com/${FILE}")
  if [ "$EXISTS" -eq 200 ]; then
    echo "File \"$i\" exists."
  else
    echo "$i"
    # Otherwise, compute the MD5 of the file, sign the PUT, and upload it.
    MD5=$(openssl dgst -md5 -binary "$i" | base64)
    STRING_TO_SIGN="PUT\n${MD5}\n${CONTENT_TYPE}\n${DATE_VALUE}\n${RESOURCE}"
    SIGNATURE=$(echo -en "${STRING_TO_SIGN}" | openssl sha1 -hmac "${S3_SECRET}" -binary | base64)
    curl -# -X PUT -T "${i}" \
      --limit-rate 300k \
      --connect-timeout 120 \
      -H "Host: ${BUCKET}.s3.amazonaws.com" \
      -H "Date: ${DATE_VALUE}" \
      -H "Content-Type: ${CONTENT_TYPE}" \
      -H "Content-MD5: ${MD5}" \
      -H "Authorization: AWS ${S3_KEY}:${SIGNATURE}" \
      "https://${BUCKET}.s3.amazonaws.com/${FILE}" > /dev/null
  fi
done

If you’d rather not copy+paste, download the script here.

A few notes:

  • You should replace S3_KEY, S3_SECRET, and BUCKET in the script with your AWS key, AWS secret, and backup bucket name respectively.
  • I’m using the --limit-rate 300k argument to limit the upload speed to 300 KB/sec. Otherwise, I’d completely saturate my upload bandwidth at home. You should, of course, adjust this limit to suit your needs depending on where you’re uploading from.
  • I’m using the --connect-timeout 120 argument to work around spurious connection failures that might occur during a handshake with S3 while starting an upload.
  • Documentation on the request signing mechanism used in the script (AWS Signature Version 2) can be found at http://docs.aws.amazon.com/AmazonS3/latest/dev/RESTAuthentication.html.

Usage

Assuming you have a directory named foobar which contains a nested structure of the content you want to upload:

chmod +x s3-backup.sh

./s3-backup.sh foobar

Or maybe you only want to upload foobar/baz/*:

./s3-backup.sh foobar/baz

Happy uploading.

Recursively Deleting Large Amazon S3 Buckets

d7b268e6f71843a5735e21bb9765548b58f9d430

Fri Sep 17 09:54:00 2010 -0700

My first experience using Amazon Web Services for a production-quality project was quite fun and deeply interesting. I’ve played with AWS a bit on my own time, but I recently had a chance to really sink my teeth into it and implement production-level code that uses AWS as a real platform for an upcoming web and mobile application.

Perhaps the most interesting, and frustrating, part of this project involved storing hundreds of thousands of objects in an AWS S3 bucket. If you’re not familiar with S3, it’s AWS’s online storage web service. The concept is simple: you create an S3 “bucket”, then shove “objects” into the bucket, creating folders where necessary. Of course, you can also update and delete objects. If it helps, think of S3 as a pseudo online file system that’s theoretically capable of storing an unlimited amount of data. Yes, I’m talking exabytes of data … theoretically … if you’re willing to pay Amazon for that much storage.

In any event, I created a new S3 bucket and eventually placed hundreds of thousands of objects into it. S3 handled this with ease. The problem, however, came when it was time to delete that bucket and all of the objects inside of it. It turns out there is no native S3 API call that recursively deletes an S3 bucket, or renames one for that matter. I guess Amazon leaves it up to the developer to implement such functionality?

That said, if you need to recursively delete a very large S3 bucket, you really have two options: use a tool like s3funnel, or write your own tool that efficiently deletes multiple objects concurrently. Note that I say concurrently; otherwise, you’ll waste a lot of time sitting around waiting for a single-threaded delete to remove objects one at a time, which is horribly inefficient. Well, this sounds like a perfect problem for a thread pool and, wouldn’t you guess it, even a CountDownLatch!

The idea here is that you’ll want to spawn multiple threads from a controlled thread pool, where each thread is responsible for deleting a single object. This way, you can delete 20, 30, or even 100 objects at a time. Yay for threads!

Here’s the pseudo code. Note that I say pseudo code because it’s not a complete implementation. This example assumes you have an AWS S3 implementation (a library) that’s able to list objects in a bucket, delete buckets, and delete objects.

package com.kolich.aws.s3.util;

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.amazonaws.services.s3.model.S3ObjectSummary;

public class RecursiveS3BucketDelete {

  private static final String AWS_ACCESS_KEY_PROPERTY = "aws.key";
  private static final String AWS_SECRET_PROPERTY = "aws.secret";

  /**
   * The -Daws.key and -Daws.secret system properties should
   * be set like this:
   * -Daws.key=AK7895IH1234X2GW12IQ
   * -Daws.secret=1234567890123456789012345678901234456789
   */

  // Set up a new thread pool to delete 20 objects at a time.
  private static final ExecutorService pool__ =
        Executors.newFixedThreadPool(20);

  public static void main(String[] args) {

    final String accessKey = System.getProperty(AWS_ACCESS_KEY_PROPERTY);
    final String secret = System.getProperty(AWS_SECRET_PROPERTY);
    if(accessKey == null || secret == null) {
      throw new IllegalArgumentException("You're missing the " +
          "-Daws.key and -Daws.secret required VM properties.");
    }

    final String bucketName;
    if(args.length < 1) {
      throw new IllegalArgumentException("Missing required " +
          "program argument: bucket name.");
    }
    bucketName = args[0];

    // ... setup your S3 client here.

    List<S3ObjectSummary> objects = null;
    do {
      objects = s3.listObjects(bucketName).getObjectSummaries();
      // Create a new CountDownLatch with a size of how many objects
      // we fetched.  Each worker thread will decrement the latch on
      // completion; the parent waits until all workers are finished
      // before starting a new batch of delete worker threads.
      final CountDownLatch latch = new CountDownLatch(objects.size());
      for(final S3ObjectSummary object : objects) {
        pool__.execute(new Runnable() {
          @Override
          public void run() {
            try {
              s3.deleteObject(bucketName,
                URLEncoder.encode(object.getKey(), "UTF-8"));
            } catch (Exception e) {
              System.err.println(">>>> FAILED to delete object: (" +
                bucketName + ", " + object.getKey()+ ")");
            } finally {
              latch.countDown();
            }
          }
        });
      }
      // Wait here until the current set of threads
      // are done processing.  This prevents us from shoving too
      // many threads into the thread pool; it's a little more
      // controlled this way.
      try {
        System.out.println("Waiting for threads to finish ...");
        // This blocks the parent until all spawned children
        // have finished.
        latch.await();
      } catch (InterruptedException e) {
        // Preserve the thread's interrupt status if the wait is interrupted.
        Thread.currentThread().interrupt();
      }
    } while(objects != null && !objects.isEmpty());

    pool__.shutdown();

    // Finally, delete the bucket itself.
    try {
      s3.deleteBucket(bucketName);
    } catch (Exception e) {
      System.err.println("Failed to ultimately delete bucket: " +
          bucketName);
    }

  }

}
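
For completeness, here is one way the elided “setup your S3 client here” step could look. This is only a sketch under the assumption that the library in play is the AWS SDK for Java v1 (the S3ObjectSummary import above comes from it); the exact classes and constructor shown here are my assumption, not something spelled out in the original pseudo code:

// These imports would join the others at the top of the file (AWS SDK for Java v1).
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

// Sketch of the "... setup your S3 client here" step: build a client from
// the -Daws.key/-Daws.secret values read above. Declared final so the
// anonymous Runnable can capture it.
final AmazonS3 s3 = new AmazonS3Client(
    new BasicAWSCredentials(accessKey, secret));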

Additional notes and warnings:

  • If you’re not familiar with using a CountDownLatch, you can find my [detailed blog post on it here](understanding-javas-countdownlatch.html).
  • If you’re going to delete multiple objects at a time, you should confirm that the S3 library you’re using is thread-safe. Many S3 libraries I’ve seen rely on the popular Apache Commons HttpClient to handle the underlying HTTP communication with S3. However, note that HttpClient isn’t thread-safe by default unless you’ve explicitly set it up to use a ThreadSafeClientConnManager, as sketched below.
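
As a rough sketch of that last point, and assuming Apache HttpClient 4.1+ (the class names, setter methods, and connection limits below are my assumptions, not something from the original post), a shared, thread-safe client can be built on top of a ThreadSafeClientConnManager like so:

import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager;

// Back the client with a pooling, thread-safe connection manager so the
// delete worker threads can safely share a single HttpClient instance.
final ThreadSafeClientConnManager manager = new ThreadSafeClientConnManager();
manager.setMaxTotal(20);           // cap the total number of pooled connections
manager.setDefaultMaxPerRoute(20); // every request targets the same S3 endpoint
final HttpClient client = new DefaultHttpClient(manager);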