Sat, 28 Mar 2020

AWS EBS Volumes and ZFS Snapshots

Recently I wanted to throw up a tiny little IRC server for a small group of friends. I used this opportunity to finally create a sub-account in my personal AWS organization and host it in complete isolation. One of the things I wanted was a bog-simple backup scheme for this system, but because it's completely isolated from the rest of my infrastructure, I can't use my standard restic backup setup.

I decided to take advantage of a few things. One, all of the stateful data on this system is on a distinct EBS volume — everything on the root volume is managed by Packer[1] and can be re-created at any time.

Two, this local volume has a ZFS zpool and several ZFS datasets on it. With ZFS, I have a filesystem which can do easy snapshots and whose on-disk state is always consistent. I use zfsnap to take hourly snapshots with a defined retention policy, so the local disk and the filesystems on it have some notion of local backups.
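
The zfsnap side of that is nothing more than a pair of cron entries, roughly like the ones below; the pool name tank and the two-day TTL are just placeholders for whatever you actually use:

# Every hour, take a recursive snapshot of the pool, tagged with a two-day TTL.
0 * * * * zfsnap snapshot -a 2d -r tank

# A little later, destroy any zfsnap snapshots whose TTL has expired.
15 * * * * zfsnap destroy -r tank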

Three, there's an AWS Data Lifecycle Manager (DLM) policy which takes a daily snapshot of the EBS volume and shoves it into S3, with an appropriate retention policy. This protects against things like me accidentally deleting the volume, or a catastrophic failure of the availability zone the EC2 instance is in. The appropriate Terraform[1] code for that is:

resource "aws_ebs_volume" "irc-local-volume" {
  availability_zone = data.aws_subnet.irc-subnet.availability_zone
  size              = 24
  tags = {
    Snapshot = "true"
    SnapshotPolicy = "Daily-2Weeks"
  }
}

resource "aws_volume_attachment" "irc-local-volume-attachment" {
  device_name = "/dev/xvdf"
  volume_id   = aws_ebs_volume.irc-local-volume.id
  instance_id = aws_instance.irc.id
}

resource "aws_iam_role" "dlm_lifecycle_role" {
  name = "dlm-lifecycle-role"

  assume_role_policy = <<-EOF
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Action": "sts:AssumeRole",
        "Principal": {
          "Service": "dlm.amazonaws.com"
        },
        "Effect": "Allow",
        "Sid": ""
      }
    ]
  } 
  EOF
}

resource "aws_iam_role_policy" "dlm_lifecycle" {
  name = "dlm-lifecycle-policy"
  role = aws_iam_role.dlm_lifecycle_role.id

  policy = <<-EOF
  {
     "Version": "2012-10-17",
     "Statement": [
        {
           "Effect": "Allow",
           "Action": [
              "ec2:CreateSnapshot",
              "ec2:DeleteSnapshot",
              "ec2:DescribeVolumes",
              "ec2:DescribeSnapshots"
           ],
           "Resource": "*"
        },
        {
           "Effect": "Allow",
           "Action": [
              "ec2:CreateTags"
           ],
           "Resource": "arn:aws:ec2:*::snapshot/*"
        }
     ]
  }
  EOF
}


resource "aws_dlm_lifecycle_policy" "two-week-policy-daily" {
  description = "DLM policy to take daily snapshots and keep for 2 weeks"
  execution_role_arn = aws_iam_role.dlm_lifecycle_role.arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    schedule {
      name = "2 weeks of daily snapshots"

      create_rule {
        interval = 24
        interval_unit = "HOURS"
        times = ["23:45"]
      }

      retain_rule {
        count = 14
      }

      tags_to_add = {
        SnapshotCreator = "DLM"
      }

      copy_tags = true
    }

    target_tags = {
      Snapshot = "true"
      SnapshotPolicy = "Daily-2Weeks"
    }
  }
}

Tying this all together: every hour, zfsnap makes a snapshot of the ZFS datasets on the local disk, so I have a consistent, on-disk backup of this data. As a disaster recovery mechanism, every day AWS takes a snapshot of the EBS volume it all lives on. To test this, and to recover in the event something happens, I launched a new instance, created a volume from the snapshot, attached it to the new instance, imported the zpool, and then rolled the ZFS datasets back to the latest hourly snapshot.
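
For reference, that recovery boils down to something like the following; all of the IDs, the availability zone, and the tank/irc pool and dataset names are placeholders:

# From a workstation: turn the latest DLM snapshot into a fresh volume and
# attach it to the replacement instance (built from the same Packer image).
aws ec2 create-volume --availability-zone us-east-2a --snapshot-id snap-0123456789abcdef0
aws ec2 attach-volume --device /dev/xvdf \
    --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0

# On the instance: find and import the pool, then roll back to the newest
# hourly zfsnap snapshot.
zpool import                   # lists importable pools on the new disk
zpool import tank
zfs list -r -t snapshot tank   # find the most recent snapshot name
zfs rollback tank/irc@2020-03-28_10.00.00--2d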

Overkill for something this simple, but a useful technique to have in your toolbox.

[1] Standard Disclaimer

Posted at: 10:14 | category: /computers/aws | Link

Mon, 16 Jun 2014

S3, boto and IAM

As part of my process to replace the power-hungry, eight-year-old server I have at home with a tiny Intel NUC, I'm slowly moving any real services off of it and onto my colocated machine. The last real service I'm running at home is my backup machine, which handles both my AFS cell backups and my rsync-based machine backup scripts.

Moving the backup server to colocation isn't difficult, but I need to find a place to stash the second, disaster-recovery copy of all of my backups. The obvious and most cost-effective solution is shoving all the data into Amazon's S3 service, particularly if I have it go into Glacier storage. For a project at the day job I've been using Duplicity for backups, which will happily handle S3 as a backend.

In a sane setup, there's a bucket dedicated to backups, say, tproa-backups, with each machine having a prefix that its backups are sent to. Each machine would get an IAM identity with the appropriate rights to create objects in S3 under that prefix, so machines can't trip over each other's backups.
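
With that layout, the duplicity run for gozer looks something like this; the /srv source directory and the credential values are placeholders, and the s3+http:// target is the boto-style URL that duplicity's S3 backend understands:

# Credentials for the backup-gozer IAM user; boto reads these from the
# environment.
export AWS_ACCESS_KEY_ID=AKIAEXAMPLEEXAMPLE
export AWS_SECRET_ACCESS_KEY=...

# Back up everything under /srv into this machine's prefix in the shared bucket.
duplicity /srv s3+http://tproa-backups/gozer.tproa.net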

The documentation for S3 and Duplicity is rather sparse, and none of it talks about using IAM identities for access control. After getting "Connection reset by peer" errors from Duplicity, I tried upgrading to the latest versions of both Duplicity and Boto, the Python library for AWS. That didn't help, so next I tried using Boto's s3put script to shove something into S3 by hand, which also failed.

After digging around, I found the correct incantation to set in your bucket policy to allow an IAM identity to do Duplicity backups. In the example below, you'll see these identifiers:

arn:aws:iam::854026359331:user/backup-gozer
The machine identity for gozer.tproa.net
arn:aws:s3:::tproa-backups/gozer.tproa.net
The S3 bucket/prefix for backups

{
    "Version": "2008-10-17",
    "Id": "Policy1402707051767",
    "Statement": [
        {
            "Sid": "Stmt1402707005319",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::854026359331:user/backup-gozer"
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::tproa-backups/gozer.tproa.net*",
                "arn:aws:s3:::tproa-backups/gozer.tproa.net"
            ]
        },
        {
            "Sid": "Stmt1402707048357",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::854026359331:user/backup-gozer"
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::tproa-backups"
        }
    ]
}

In the first statement, I'm giving the machine's IAM identity the rights to do anything under tproa-backups/gozer.tproa.net. Note, in particular, that you do not put a '/' at the end of the prefix. In the second statement, I'm giving that same identity the rights to list the bucket tproa-backups and to find its location. Again, note that the bucket name doesn't end with a slash. The s3:GetBucketLocation right is crucial: without it, the Boto library can't figure out which region the bucket lives in, so it can't connect to the proper S3 endpoint, and it bombs out without any useful error message.

Posted at: 21:52 | category: /computers/aws | Link