Personal Backups with S3

Data availability is one of those things that keeps me up at night. Or at least, it would if I didn't like sleep so much. So when I caught the backup meme floating around the ætherweb, I wanted to follow up with a description of my own backup solution. Only, my backup solution is rather ad-hoc, and not really complete. I'm still searching for a full solution, which in my mind must involve not just being able to put data in multiple places (that part's easy) but being able to locate backups and identify their relationship with the originals. I'm working on it. A new component just appeared on my radar, though, which may be a key piece in my solution.

At the moment, my favorite solution for simple backup is live network backup with unison^[1]. Network backups mean that your backup data lives on a computing platform where you can interact with and process that data directly. If I unison my important, working data over to my laptop, then I can use the data on either my laptop or my server, and if either fails, I have the other on which to fall back. With physical backups, I have to do something to interact with my data. I have to locate and plug in that hard drive with the backup (and I haven't had a lot of luck with external enclosures) or I have to find the right DVD and mount it somewhere, etc. The data isn't as liquid as I would like.

My broader approach to data availability is more complex. Like Norm and Sean, I work with version control systems whenever possible. I plant a new Subversion^[2] repository on my local system whenever I start up a new project. But then, what about data in working copies? In any case, little of my backup strategy could really be considered "off site", or "stable".

Enter Amazon's S3 web service. (Which I discovered via a New York Times article. How weird is it to be getting tech news from the Gray Lady?) I don't know a lot about the major remote backup services, but from what I understand they all require some kind of specialized software. No, thanks; proprietary makes me queasy. The Simple Storage Service is just what it says on the tin: very, very simple. It basically provides space to just dump data, for whatever purpose you like. And it uses HTTP requests to do it. It doesn't get much more transparent than that. Naturally, the next question is cost; at $0.15 per Gig per month for storage and $0.20 per Gig of data transferred, it seems to me like a steal. With just one GB I can pretty much store all the personal data I can imagine, which brings me to something like a couple bucks a year with plenty of room (that is, seemingly unlimited room) for growth. This is clearly not for everything; for instance, I'm not going to be backing up my music this way.
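
To make that arithmetic concrete, here is a minimal sketch in Python. The function name yearly_cost and the constant names are purely illustrative, and it assumes that the $0.20 is charged once per gigabyte transferred and that the amount stored stays constant all year.

```python
from decimal import Decimal

# Prices quoted above, in US dollars (assumed fixed for the whole year).
STORAGE_PER_GB_MONTH = Decimal("0.15")
TRANSFER_PER_GB = Decimal("0.20")


def yearly_cost(stored_gb, transferred_gb):
    """Estimate one year of S3 charges for a constant amount of stored data.

    stored_gb: gigabytes kept in S3 for all twelve months.
    transferred_gb: total gigabytes moved over the wire during the year.
    """
    storage = Decimal(str(stored_gb)) * STORAGE_PER_GB_MONTH * 12
    transfer = Decimal(str(transferred_gb)) * TRANSFER_PER_GB
    return storage + transfer
```

By this estimate, yearly_cost(1, 1) (one gigabyte stored all year and uploaded once) comes to two dollars, which squares with the couple of bucks a year guessed at above.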

The final question is security. Based upon the S3 docs and my poking at the source code, it seems like they do a reasonable job of authenticating all of your messages, and they claim to support an access control policy on your data that defaults to for-your-eyes-only. Cool, but that doesn't protect the data on the wire, and it doesn't really protect the data in their wher3houses. I want to back up encrypted blobs, so I have to take some matters into my own hands. Naturally, the other challenge with backing up encrypted blobs is making sure you have access to the decrypting key; having a backup doesn't make much difference if you can't decrypt that backup after that catastrophic hardware failure because your decryption key was lost along with everything else. I recommend doing a physical backup of your decryption key to a physically protected location, like a safe deposit box.

So here's the formula (at least on a Unix-like machine). Go get an Amazon web services developer account. Sign up for S3. Download s3-curl or one of the toolkits if you want to build your own interface. Encrypt all your data to yourself (for example with GPG^[3]), and then push it over to Amazon. I'm going to describe the s3-curl and GPG solution here. In this example, I'm going to create a bucket named infinitesque (yes, that's now my bucket) and then push an encrypted archive of my personal email directory into that bucket (yes, this is actually one of the first actions I performed).

First, the encrypted archive:

$ tar -cjpf - Mail | gpg -ser jlc6@po.cwru.edu > Mail.tar.bz2.gpg

Yeah, that took a little while. Next, extract s3-curl if you haven't done so already. To get s3-curl up and running, you may need to install a Perl module or two. With administrative privileges, this looks like:

# perl -MCPAN -e "install Digest::HMAC_SHA1"

Which will get you support for the HMAC_SHA1 hash algorithm from Perl, which is useful for all sorts of network authentication algorithms including the one used by S3. Once you have all the prerequisites for s3-curl installed, locate the Access Key ID and Secret Access Key that Amazon gave you with your developer account. I'll call these "ID" and "Key", respectively. A command like this will create a new bucket:

$ ./s3curl.pl --id=myID --key=myKEY --put=/dev/null -- \
  http://s3.amazonaws.com/infinitesque

Naturally, you'll have to pick a different bucket name. A command like the following will upload stuff to your new bucket; whatever you place at the end of the URL will be the name of that object.

$ ./s3curl.pl --id=myID --key=myKEY --put="$HOME/Mail.tar.bz2.gpg" -- \
  http://s3.amazonaws.com/infinitesque/home/john/Mail.tar.bz2.gpg
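
For the curious, s3curl.pl is doing the request signing for you in both of those commands. As I understand the REST authentication scheme S3 documented at launch, the Authorization header carries a Base64-encoded HMAC-SHA1 of a short description of the request, keyed with your Secret Access Key. The sketch below is illustrative Python, not the s3-curl source; the function names are hypothetical, and it assumes a request that carries no x-amz- headers.

```python
import base64
import hashlib
import hmac


def sign_request(secret_key, verb, resource, date, content_md5="", content_type=""):
    """Return the Base64 HMAC-SHA1 signature for a simple S3 REST request.

    resource is the path part of the URL, e.g. "/infinitesque".
    date must be the exact value sent in the request's Date header.
    Assumption: no x-amz-* headers, so that part of the string is empty.
    """
    string_to_sign = "\n".join([verb, content_md5, content_type, date]) + "\n" + resource
    digest = hmac.new(
        secret_key.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha1,
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def authorization_header(access_id, secret_key, verb, resource, date):
    """Build the Authorization header value: 'AWS <ID>:<signature>'."""
    return "AWS {}:{}".format(access_id, sign_request(secret_key, verb, resource, date))
```

Something like authorization_header("myID", "myKEY", "GET", "/infinitesque", date_string) would then produce the header for a bucket listing, provided date_string matches the Date header sent with the request. The Secret Access Key itself never goes over the wire, only the signature does.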

A couple hours later (for me) my object is uploaded. No, I don't know if Amazon charges for bandwidth if something happens before the upload finishes. But anyway, isn't it great how all these objects have natural URLs? (But GAH, most other resources on Amazon have horrendous URLs.) I am looking forward to being able to incorporate them into RDF statements for tracking purposes. Also along the lines of Web Architecture, we can get a nice XML list of what we currently have in a given bucket by simply "getting" that bucket by name:

$ ./s3curl.pl --id=myID --key=myKEY -- http://s3.amazonaws.com/infinitesque

Which, for me, looks like the following after running it through xmllint --format.

<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>infinitesque</Name>
  <Prefix/>
  <Marker/>
  <MaxKeys>1000</MaxKeys>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>home/john/Mail.tar.bz2.gpg</Key>
    <LastModified>2006-04-07T20:33:34.000Z</LastModified>
    <ETag>"7cc3094cb3d8b2ce8e2df62250dbcea4"</ETag>
    <Size>282846884</Size>
    <Owner>
      <ID>e317c40e5d99887800fe253c1bccd07ec86d3d50aa6209805ffa54834a9093a3</ID>
      <DisplayName>jlc613</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
  <Contents>
    <Key>home/john/backup.tar.bz2.gpg</Key>
    <LastModified>2006-04-07T20:15:12.000Z</LastModified>
    <ETag>"4cc8686514f92c91e5730fdc0bf3c3a9"</ETag>
    <Size>906780</Size>
    <Owner>
      <ID>e317c40e5d99887800fe253c1bccd07ec86d3d50aa6209805ffa54834a9093a3</ID>
      <DisplayName>jlc613</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>
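
If you'd rather not read the listing by eye, it is easy to pick apart with a stock XML parser. This is a minimal Python sketch that assumes a listing shaped like the one above; parse_listing is a hypothetical helper name, not part of any S3 toolkit.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the ListBucketResult element in the listing.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def parse_listing(listing_xml):
    """Return ([(key, size_in_bytes, etag), ...], is_truncated) for a bucket listing."""
    root = ET.fromstring(listing_xml)
    objects = []
    for contents in root.findall("s3:Contents", S3_NS):
        key = contents.findtext("s3:Key", namespaces=S3_NS)
        size = int(contents.findtext("s3:Size", namespaces=S3_NS))
        # The ETag value arrives wrapped in literal double quotes; drop them.
        etag = contents.findtext("s3:ETag", namespaces=S3_NS).strip('"')
        objects.append((key, size, etag))
    truncated = root.findtext("s3:IsTruncated", default="false", namespaces=S3_NS) == "true"
    return objects, truncated
```

Feeding it the output of the listing command gives one tuple per stored object, which is a handy starting point for keeping track of what has been backed up and when.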

The ETag turns out to be the MD5 checksum of the object you uploaded, so you can do a quick check to make sure the bytes they got are the bytes you sent. As far as I can tell, MaxKeys caps how many keys a single listing request returns (with IsTruncated flagging when there are more) rather than how many items a bucket can hold, so I'm not worried about it yet. To retrieve your data, you simply get the appropriate URL using the same syntax we used to get the bucket listing above.
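
That ETag check is easy to script. Here is a minimal Python sketch, assuming the object went up in a single PUT as described above, so that the ETag really is the plain MD5 of the bytes; file_md5 and matches_etag are hypothetical helper names.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to spare memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_etag(path, etag):
    """Compare a local file against an ETag as it appears in the bucket listing."""
    # The listing wraps the checksum in double quotes, so strip them first.
    return file_md5(path) == etag.strip('"').lower()
```

For the listing above, matches_etag("Mail.tar.bz2.gpg", '"7cc3094cb3d8b2ce8e2df62250dbcea4"') should come back True if the local archive is the same one that was uploaded.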

Simplicity. I like that about this service, but don't expect to be doing any processing on that data remotely; anything you want to do with that data will require you to copy it back to some machine that you control. This will inflate your operating costs somewhat, but the simplicity of the storage service is what allows all this to be possible in the first place.

For a while, I had been thinking that it might be fun to run a cheap, simple, and transparent data backup service oriented towards end-users. I think that Amazon just beat me to it, although it would still be nice to have a large backup-oriented shell account somewhere that I could use for unison. This might count as abuse of the spirit of Amazon's service. I don't really know, but I can use it as part of my backup solution until someone calls me on it.

[1]	Shout out #1 goes to unison, which is truly an amazing, useful tool. And it's written in OCaml. Hooray for elegant apps written in functional programming languages. On my TODO list: learn OCaml simply so that I can bask in the unison source.
[2]	Shout out #2 goes to Subversion. Where creating and accessing repositories is as easy as breathing. (Now, do I link to Subversion's Wikipedia page, like Norm, or Subversion's home page? Conundrums! We need XLink support! For great justice! And multilinks!)
[3]	Shout out #3 goes to GPG. Do I even need to say anything else?
