AWS EBS Volumes and ZFS Snapshots

Recently I wanted to throw up a tiny little IRC server for a small group of friends. I used this opportunity to finally create a sub-account in my personal AWS organization and host the whole thing in isolation. One of the things I wanted was a bog-simple backup scheme for this system, but because it's completely isolated from all of my other infrastructure, I can't use my standard restic backup setup.

I decided to take advantage of a few things. One, all of the stateful data on this system is on a distinct EBS volume — everything on the root volume is managed by Packer[1] and can be re-created at any time.
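
In practice that just means the zpool lives on the attached data volume rather than the root disk. A minimal sketch of the one-time setup, where the device name matches the aws_volume_attachment below and the pool and dataset names are placeholders:

# create the pool on the attached data volume, not the root disk
# ("tank" and the dataset names are placeholders)
zpool create tank /dev/xvdf
zfs create tank/irc
zfs create tank/home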

Two, this local volume has a ZFS zpool and several ZFS datasets on it. With ZFS I get a filesystem that can do cheap snapshots and whose on-disk state is always consistent. I use zfsnap to take hourly snapshots with a defined retention policy, so the local disk and the filesystems on it have some notion of local backups.
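
The zfsnap half is just a pair of cron entries. A sketch, assuming zfsnap v2's snapshot and destroy subcommands (the pool name, the two-week TTL, and the timings are placeholders; check zfsnap(8) for the exact flags your version takes):

# hourly recursive snapshots of the pool, each tagged with a two-week TTL
0 * * * * zfsnap snapshot -a 2w -r tank
# once a day, prune snapshots whose TTL has expired
15 0 * * * zfsnap destroy -r tank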

Three, there's an AWS Data Lifecycle Manager (DLM) policy which takes a daily snapshot of the volume and shoves it into S3, with an appropriate retention policy. This protects against things like me accidentally deleting the volume, or a catastrophic failure of the availability zone the EC2 instance is in. The appropriate Terraform[1] code for that is:

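# The stateful data volume. DLM finds it via the Snapshot/SnapshotPolicy tags,
# which have to match the lifecycle policy's target_tags below.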
resource "aws_ebs_volume" "irc-local-volume" {
  availability_zone = data.aws_subnet.irc-subnet.availability_zone
  size              = 24
  tags = {
    Snapshot = "true"
    SnapshotPolicy = "Daily-2Weeks"
  }
}

resource "aws_volume_attachment" "irc-local-volume-attachment" {
  device_name = "/dev/xvdf"
  volume_id   = aws_ebs_volume.irc-local-volume.id
  instance_id = aws_instance.irc.id
}

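# Role that Data Lifecycle Manager assumes to manage snapshots on our behalf.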
resource "aws_iam_role" "dlm_lifecycle_role" {
  name = "dlm-lifecycle-role"

  assume_role_policy = <<-EOF
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Action": "sts:AssumeRole",
        "Principal": {
          "Service": "dlm.amazonaws.com"
        },
        "Effect": "Allow",
        "Sid": ""
      }
    ]
  } 
  EOF
}

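# The permissions DLM needs: create and delete snapshots, describe volumes and
# snapshots, and tag the snapshots it creates.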
resource "aws_iam_role_policy" "dlm_lifecycle" {
  name = "dlm-lifecycle-policy"
  role = aws_iam_role.dlm_lifecycle_role.id

  policy = <<-EOF
  {
     "Version": "2012-10-17",
     "Statement": [
        {
           "Effect": "Allow",
           "Action": [
              "ec2:CreateSnapshot",
              "ec2:DeleteSnapshot",
              "ec2:DescribeVolumes",
              "ec2:DescribeSnapshots"
           ],
           "Resource": "*"
        },
        {
           "Effect": "Allow",
           "Action": [
              "ec2:CreateTags"
           ],
           "Resource": "arn:aws:ec2:*::snapshot/*"
        }
     ]
  }
  EOF
}


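# The lifecycle policy itself: snapshot every tagged volume daily at 23:45 UTC
# and keep the most recent 14.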
resource "aws_dlm_lifecycle_policy" "two-week-policy-daily" {
  description = "DLM policy to take daily snapshots and keep for 2 weeks"
  execution_role_arn = aws_iam_role.dlm_lifecycle_role.arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    schedule {
      name = "2 weeks of daily snapshots"

      create_rule {
        interval = 24
        interval_unit = "HOURS"
        times = ["23:45"]
      }

      retain_rule {
        count = 14
      }

      tags_to_add = {
        SnapshotCreator = "DLM"
      }

      copy_tags = true
    }

    target_tags = {
      Snapshot = "true"
      SnapshotPolicy = "Daily-2Weeks"
    }
  }
}

Tying this all together: every hour, zfsnap makes a snapshot of the ZFS datasets on the local disk, so I have a consistent on-disk backup of this data. As a disaster recovery mechanism, every day AWS takes a snapshot of the EBS volume those datasets live on. To test this, and to recover in the event something happens, I launched a new instance, made a volume from the latest EBS snapshot, attached it to the new instance, imported the zpool, and then rolled the ZFS datasets back to the latest hourly snapshot.
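
For reference, the recovery run looked roughly like this. The IDs, zone, pool, and dataset names are placeholders; the snapshot ID comes from whichever DLM-created snapshot you want to restore:

# make a volume from the chosen DLM snapshot and attach it to the replacement
# instance (all IDs and the availability zone here are placeholders)
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone us-east-1a
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/xvdf

# on the new instance: import the pool (-f because it was never exported from
# the old instance), find the newest hourly snapshot, and roll back to it
# ("tank" and the dataset name are placeholders)
zpool import -f tank
zfs list -t snapshot -r tank
zfs rollback tank/irc@<latest-hourly-snapshot>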

Overkill for something this simple, but a useful technique to have in your toolbox.

[1] Standard Disclaimer