Duplicity + S3: easy, cheap, encrypted, automated full-disk backups for your servers

By Hongli Lai on November 11th, 2013


Backups are one of those things that are important, but that a lot of people don’t do. The thought of setting up backups always raised a mental barrier for me for a number of reasons:

  • I have to think about where to back up to.
  • I have to remember to run the backup on a periodic basis.
  • I worry about the bandwidth and/or storage costs.

I still remember the days when a 2.5 GB harddisk was considered large, and when I had to spend a few hours splitting MP3 files and putting them on 20 floppy disks to transfer them between computers. Backing up my entire harddisk would have cost me hundreds of dollars and hours of time. Because of this, I tend to worry about the efficiency of my backups. I only want to back up things that need backing up.

I tended to tweak my backup software and rules to be as efficient as possible. However, this made setting up backups a total pain, and made it very easy to put off backups… until it was too late.

I learned to embrace Moore’s Law

Times have changed. Storage is cheap, very cheap. Time Machine — Apple’s backup software — taught me to stop worrying about efficiency. Backing up everything not only makes backing up a mindless and trivial task, it also makes me feel safe. I don’t have to worry about losing my data anymore. I don’t have to worry that my backup rules missed an important file.

Backing up desktops and laptops is easy and cheap enough. A 2 TB harddisk costs only $100.

What about servers?

  • Most people can’t go to the data center and attach a hard disk. Buying or renting another harddisk from the hosting provider can be expensive. Furthermore, if your backup device resides in the same location as the data center, then destruction of the data center (e.g. a fire) will destroy your backup as well.
  • Backup services provided by the hosting provider can be expensive.
  • Until a few years ago, bandwidth was relatively expensive, making backing up the entire harddisk to a remote storage service an unviable option for those with a tight budget.
  • And finally, do you trust that the storage provider will not read or tamper with your data?

Enter Duplicity and S3

Duplicity is a tool for creating incremental, encrypted backups. “Incremental” means that each backup only stores data that has changed since the last backup run. This is achieved by using the rsync algorithm.

What is rsync? It is a tool for synchronizing files between machines. The cool thing about rsync is that it only transfers changes. If you have a directory with 10 GB of files, and your remote machine has an older version of that directory, then rsync only transfers new files or changed files. Of the changed files, rsync is smart enough to only transfer the parts of the files that have changed!
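As a rough illustration (the host name and paths below are made up), an rsync invocation like the following copies everything on the first run, but on later runs only transfers the parts of files that actually changed:

# 'backuphost' and both paths are placeholders.
rsync -av /data/ admin@backuphost:/backups/data/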

At some point, Ben Escoto authored the tool rdiff-backup, an incremental backup tool which uses an rsync-like algorithm to create filesystem backups. Rdiff-backup also saves metadata such as permissions, owner and group IDs, ACLs, etc. Rdiff-backup stores past versions as well and allows easy rollback to a point in time. It even compresses backups. However, rdiff-backup has one drawback: you have to install it on the remote server as well. This makes it impossible to use rdiff-backup to backup to storage services that don’t allow running arbitrary software.

Ben later created Duplicity, which is like rdiff-backup but encrypts everything. Duplicity works without needing special software on the remote machine and supports many storage methods, for example FTP, SSH, and even S3.

On the storage side, Amazon has consistently lowered the prices of S3 over the past few years. The current price for the US-west-2 region is only $0.09 per GB per month.

Bandwidth costs have also lowered tremendously. Many hosting providers these days allow more than 1 TB of traffic per month per server.

This makes Duplicity and S3 the perfect combination for backing up my servers. Using encryption means that I don’t have to trust my service provider. Storing 200 GB only costs $18 per month.

Setting up Duplicity and S3 using Duply

Duplicity by itself is still relatively painful to use. It has many options, too many if you’re just starting out. Luckily there is a tool which simplifies Duplicity even further: Duply. It keeps your settings in a profile, and supports pre- and post-execution scripts.

Let’s install Duplicity and Duply. If you’re on Ubuntu, you should add the Duplicity PPA so that you get the latest version. Otherwise, you can just install an older version of Duplicity from your distribution’s repositories.

# Replace 'precise' with your Ubuntu version's codename.
echo deb http://ppa.launchpad.net/duplicity-team/ppa/ubuntu precise main | \
sudo tee /etc/apt/sources.list.d/duplicity.list
sudo apt-get update

Then:

# python-boto adds S3 support
sudo apt-get install duplicity duply python-boto

Create a profile. Let’s name this profile “test”.

duply test create

This will create a configuration file in $HOME/.duply/test/conf. Open it in your editor. You will be presented with a lot of configuration options, but only a few are really important. Two of them are GPG_KEY and GPG_PW. Duplicity supports asymmetric public-key encryption, or symmetric password-only encryption. For the purposes of this tutorial we’re going to use symmetric password-only encryption because it’s the easiest.

Let’s generate a random, secure password:

openssl rand -base64 20

Comment out GPG_KEY and set a password in GPG_PW:

#GPG_KEY='_KEY_ID_'
GPG_PW='<the password you just got from openssl>'

Scroll down and set the TARGET options:

TARGET='s3://s3-<region endpoint name>.amazonaws.com/<bucket name>/<folder name>'
TARGET_USER='<your AWS access key ID>'
TARGET_PASS='<your AWS secret key>'

Substitute “region endpoint name” with the host name of the region in which you want to store your S3 bucket. You can find a list of host names at the AWS website. For example, for US-west-2 (Oregon):

TARGET='s3://s3-us-west-2.amazonaws.com/myserver.com-backup/main'

Set the base directory of the backup. We want to back up the entire filesystem:

SOURCE='/'

It is also possible to set a maximum time for keeping old backups. In this tutorial, let’s set it to 6 months:

MAX_AGE=6M
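Note that, as far as I know, MAX_AGE is only applied when you run Duply’s purge command; the regular backup command does not delete anything by itself. To list backup sets older than MAX_AGE and actually remove them, you can run:

# Without --force this only lists the outdated backup sets.
sudo duply test purge --force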

Save and close the configuration file.

There are also some things that we never want to back up, such as /tmp, /dev and log files. So we create an exclusion file $HOME/.duply/test/exclude with the following contents:

- /dev
- /home/*/.cache
- /home/*/.ccache
- /lost+found
- /media
- /mnt
- /proc
- /root/.cache
- /root/.ccache
- /run
- /selinux
- /sys
- /tmp
- /u/apps/*/current/log/*
- /u/apps/*/releases/*/log/*
- /var/cache/*/*
- /var/log
- /var/run
- /var/tmp

This file follows the Duplicity file list syntax. The - sign here means “exclude this directory”. For more information, please refer to the Duplicity man page.

Notice that this file excludes Capistrano-deployed Ruby web apps’ log files. If you’re running Node.js apps on your server, then it’s easy to exclude their log files in a similar manner.
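For example, if your Node.js apps write their logs to a path like /u/apps/<app name>/shared/logs (a made-up layout; adjust it to wherever your apps actually keep their logs), you could add a line such as:

- /u/apps/*/shared/logs/*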

Finally, go to the Amazon S3 control panel, and create a bucket in the chosen region:

(Screenshots: create a bucket on S3, then enter the bucket name.)
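Alternatively, if you have the AWS command line tools installed, you can create the bucket from the shell. A sketch using the AWS CLI (the bucket name and region are just examples, matching the TARGET above):

aws s3 mb s3://myserver.com-backup --region us-west-2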

Initiating the backup

We’re now ready to initiate the backup. This can take a while, so let’s open a screen session so that we can terminate the SSH session and check back later.

sudo apt-get install screen
screen

Initiate the backup:

sudo duply test backup

Press Ctrl-A, then D, to detach the screen session.

Check back a few hours later. Log in to your server and reattach your screen session:

screen -x

You should see something like this, which means that the backup succeeded. Congratulations!

--------------[ Backup Statistics ]--------------
...
Errors 0
-------------------------------------------------

--- Finished state OK at 16:48:16.192 - Runtime 01:17:08.540 ---

--- Start running command POST at 16:48:16.213 ---
Skipping n/a script '/home/admin/.duply/test/post'.
--- Finished state OK at 16:48:16.244 - Runtime 00:00:00.031 ---
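If you want some extra assurance, Duply also provides a verify command, which compares the backup against the live filesystem and reports the differences without modifying anything:

sudo duply test verify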

Setting up periodic incremental backups with cron

We can use cron, the system’s periodic task scheduler, to set up periodic incremental backups. Edit root’s crontab:

sudo crontab -e

Insert the following:

0 2 * * 7 env HOME=/home/admin duply test backup

This line runs the duply test backup command every Sunday at 2:00 AM. Note that we set the HOME environment variable to /home/admin: the cron job belongs to root, so Duply runs as root, but the Duply profile is stored in /home/admin/.duply, which is why HOME needs to point there.

If you want to set up daily backups, replace “0 2 * * 7” with “0 2 * * *”.

Making cron jobs less noisy

Cron has a nice feature: it emails you the output of every job it has run. If you find that this gets annoying after a while, you can make it email you only if something went wrong. For this, we’ll need the silence-unless-failed tool, part of phusion-server-tools. This tool runs the given command and swallows its output, unless the command fails.

Install phusion-server-tools and edit root’s crontab again:

sudo git clone https://github.com/phusion/phusion-server-tools.git /tools
sudo crontab -e

Replace:

env HOME=/home/admin duply test backup

with:

/tools/silence-unless-failed env HOME=/home/admin duply test backup
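The complete crontab line then becomes:

0 2 * * 7 /tools/silence-unless-failed env HOME=/home/admin duply test backup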

Restoring a backup

Simple restores

You can restore the latest backup with the Duply restore command. It is important to use sudo because this allows Duplicity to restore the original filesystem metadata.

The following will restore the latest backup to a specific directory. The target directory does not need to exist; Duplicity will create it automatically. After restoration, you can move its contents to the root filesystem using mv.

sudo duply test restore /restored_files

You can’t just do sudo duply test restore / here because your system files (e.g. bash, libc, etc.) are in use.

Moving the files from /restored_files to / using mv might still not work for you. In that case, consider booting your server from a rescue system and restoring from there.

Restoring a specific file or directory

Use the fetch command to restore a specific file. The following restores the /etc/passwd file from the backup and saves it to /home/admin/passwd. Notice the lack of a leading slash in the etc/passwd argument.

sudo duply test fetch etc/passwd /home/admin/passwd

The fetch command also works on directories:

sudo duply test fetch etc /home/admin/etc

Restoring from a specific date

Every restore command accepts a date, allowing you to restore the backup as it existed on that specific date.

First, use the status command to get an overview of backup dates:

$ duply test status
...
Number of contained backup sets: 2
Total number of contained volumes: 2
 Type of backup set:                            Time:      Num volumes:
                Full         Fri Nov  8 07:38:30 2013                 1
         Incremental         Sat Nov  9 07:43:17 2013                 1
...

In this example, we restore the November 8 backup. Unfortunately, we can’t just copy and paste the time string; instead, we have to write the time in the w3 datetime format. See also the Time Formats section in the Duplicity man page.

sudo duply test restore /restored_files '2013-11-08T07:38:30'

Safely store your keys or passwords!

Whether you used asymmetric public-key encryption or symmetric password-only encryption, you must store your key or password safely! If you ever lose it, you will lose your data. There is no way to recover encrypted data for which the key or password is lost.

My preferred way of storing secrets is to store them inside 1Password and to replicate the data to my phone and tablet so that I have redundant encrypted copies. Alternatives to 1Password include LastPass and KeePass, although I have no experience with them.

Conclusion

With Duplicity, Duply and S3, you can set up cheap and secure automated backups in a matter of minutes. For many servers this combo is a silver bullet.

One thing that this tutorial hasn’t dealt with is database backups. Although we are backing up the database’s raw files, relying on those copies isn’t a good idea: if the database files were being written to at the time the backup was made, then the backup will contain potentially irrecoverably corrupted database files. Even the database’s journaling file or write-ahead log won’t help, because those mechanisms are designed to protect against power failures, not against concurrent file-level backup processes. Luckily, Duply supports the concept of pre-scripts. In the next part of this article, we’ll cover pre-scripts and database backups.
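To give a rough idea of what that looks like: Duply runs a script named pre in the profile directory before each backup (the post script mentioned in the log output above is its counterpart). A minimal sketch, assuming a PostgreSQL database called mydatabase and a dump location of /var/backups, both of which are made up for this example:

#!/bin/sh
# /home/admin/.duply/test/pre -- hypothetical sketch; adjust to your own database setup.
# Dump the database to a file inside the backup set, so that the backup contains
# a consistent snapshot rather than in-use raw database files.
pg_dump -U postgres mydatabase > /var/backups/mydatabase.sql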

I hope you’ve enjoyed this article. If you have any comments, please don’t hesitate to post them below. We regularly publish news and interesting articles. If you’re interested, please follow us on Twitter, or subscribe to our newsletter.

Discuss on Hacker News.