Sat, 28 Mar 2020

AWS EBS Volumes and ZFS Snapshots

Recently I wanted to throw up a tiny little IRC server for a small group of friends. I used the opportunity to finally create a sub-account in my personal AWS organization and host it in complete isolation. One of the things I wanted was a bog-simple backup scheme for this system, but because it's completely isolated from all of my other infrastructure, I can't use my standard restic backup setup.

I decided to take advantage of a few things. One, all of the stateful data on this system is on a distinct EBS volume — everything on the root volume is managed by Packer[1] and can be re-created at any time.

Two, this local volume holds a ZFS zpool and several ZFS datasets. With ZFS, I have a filesystem that can take cheap snapshots and whose on-disk state is always consistent. I use zfsnap to take hourly snapshots with a defined retention policy, so the local disk and the filesystems on it have some notion of local backups.
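For reference, the zfsnap side of this is just a couple of cron entries. This is a sketch from memory of zfsnap 2.x syntax; the file path, pool/dataset name, and TTL are examples, not necessarily what I run:

```
# /etc/cron.d/zfsnap (hypothetical) -- take an hourly snapshot with a
# one-week TTL, and periodically destroy snapshots whose TTL has expired.
0 * * * *   root  /usr/sbin/zfsnap snapshot -a 1w -r tank/irc
30 0 * * *  root  /usr/sbin/zfsnap destroy -r tank/irc
```

zfsnap encodes the TTL into the snapshot name, so the destroy pass needs no extra state; it just deletes whatever has aged out.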

Three, there's an AWS EBS Data Lifecycle Manager policy which takes a daily snapshot of the volume and shoves it into S3, with appropriate retention policies. This protects against things like me accidentally deleting the volume, or catastrophic failure of the availability zone the EC2 instance is in. The appropriate Terraform[1] code for that is:

resource "aws_ebs_volume" "irc-local-volume" {
  availability_zone = data.aws_subnet.irc-subnet.availability_zone
  size              = 24
  tags = {
    Snapshot = "true"
    SnapshotPolicy = "Daily-2Weeks"
  }
}

resource "aws_volume_attachment" "irc-local-volume-attachment" {
  device_name = "/dev/xvdf"
  volume_id   = aws_ebs_volume.irc-local-volume.id
  instance_id = aws_instance.irc.id
}

resource "aws_iam_role" "dlm_lifecycle_role" {
  name = "dlm-lifecycle-role"

  assume_role_policy = <<-EOF
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Action": "sts:AssumeRole",
        "Principal": {
          "Service": "dlm.amazonaws.com"
        },
        "Effect": "Allow",
        "Sid": ""
      }
    ]
  } 
  EOF
}

resource "aws_iam_role_policy" "dlm_lifecycle" {
  name = "dlm-lifecycle-policy"
  role = aws_iam_role.dlm_lifecycle_role.id

  policy = <<-EOF
  {
     "Version": "2012-10-17",
     "Statement": [
        {
           "Effect": "Allow",
           "Action": [
              "ec2:CreateSnapshot",
              "ec2:DeleteSnapshot",
              "ec2:DescribeVolumes",
              "ec2:DescribeSnapshots"
           ],
           "Resource": "*"
        },
        {
           "Effect": "Allow",
           "Action": [
              "ec2:CreateTags"
           ],
           "Resource": "arn:aws:ec2:*::snapshot/*"
        }
     ]
  }
  EOF
}


resource "aws_dlm_lifecycle_policy" "two-week-policy-daily" {
  description = "DLM policy to take daily snapshots and keep for 2 weeks"
  execution_role_arn = aws_iam_role.dlm_lifecycle_role.arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    schedule {
      name = "2 weeks of daily snapshots"

      create_rule {
        interval = 24
        interval_unit = "HOURS"
        times = ["23:45"]
      }

      retain_rule {
        count = 14
      }

      tags_to_add = {
        SnapshotCreator = "DLM"
      }

      copy_tags = true
    }

    target_tags = {
      Snapshot = "true"
      SnapshotPolicy = "Daily-2Weeks"
    }
  }
}

Tying this all together: every hour, zfsnap snapshots the ZFS datasets on the local disk, so I have a consistent on-disk backup of this data. As a disaster-recovery mechanism, every day AWS takes a snapshot of the EBS volume it all lives on. To test this, and to recover in the event something happens, I launched a new instance, made a volume from the snapshot, attached it to the new instance, imported the zpool, and then rolled the ZFS datasets back to the latest hourly snapshot.
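That recovery run looks roughly like the following. This is a sketch: every ID, device, pool, and snapshot name below is a placeholder, not my real values.

```shell
# Create a volume from the latest DLM snapshot, in the new instance's AZ.
aws ec2 create-volume \
    --snapshot-id snap-0123456789abcdef0 \
    --availability-zone us-east-1a

# Attach it to the replacement instance.
aws ec2 attach-volume \
    --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 \
    --device /dev/xvdf

# On the instance: scan for and import the pool, then roll a dataset back
# to its latest hourly zfsnap snapshot. Repeat per dataset; note that
# rollback -r destroys any snapshots newer than the target.
sudo zpool import -f tank
sudo zfs rollback -r tank/irc@2020-03-28_10.00.00--1w
```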

Overkill for something this simple, but a useful technique to have in your toolbox.

[1] Standard Disclaimer

Posted at: 10:14 | category: /computers/aws | Link

Standard Disclaimer: HashiCorp

From time to time I may post things which discuss technologies or products of HashiCorp, Inc. If you're reading one of those posts and it links here, that means:

At the time of authorship of that post, I am an employee of HashiCorp. Said post is explicitly a personal project and is not an official HashiCorp product. The views expressed in that posting are entirely personal and are not statements made on the behalf of HashiCorp, Inc.

Posted at: 09:41 | category: /standard-disclaimer | Link

Sun, 08 Mar 2020

Randomness on a PCEngines APU2

This weekend I've been noodling around with my perennial project of building an ersatz HSM (what are you using to protect your home CA root?). A fresh install of Debian 10 on a PCEngines APU2 later, I started some basic setup. One of the first things I played with was a source of randomness for the system. In "production" there won't physically be any network connections, and on an isolated box where presumably you boot it up, do one or two operations, and shut it back down, there's not much chance to collect entropy. Those operations also tend to be things like "generate cryptographic keys", which consume a lot of entropy. "Low entropy generation" plus "high entropy consumption" isn't a happy recipe.

Ages ago, I picked up a ChaosKey, which is a USB-based hardware random number generator with support in the Linux kernel. I've only got one, currently, and it is another thing I have to plug in and have stick out of the APU2, but then I remembered something else.

When I put this system together, I also bought the PCEngines TPM module, based on Infineon's SLB 9665 line, which plugs into the LPC port on the board and provides a TPM 2.0 module. At the time, the idea was to play with the APU2's vboot and measured-boot process. In the meantime, however, the TPM module also provides a hardware random number generator, which is also apparently supported by the Linux kernel.

The first question that came to mind was "okay, how do I see that this is working?" A little looking led me to rng-tools, which includes a daemon, rngd, which reads from the hardware random number generator (HWRNG), does some entropy-quality checks on the data, and then folds that entropy into the kernel pool. So I started it up, but I still didn't know if it was working. Checking the status of the rng-tools service with systemctl, I saw this:

Mar 08 16:43:09 hsm-test0 rngd[666]: block failed FIPS test: 0x1f
Mar 08 16:43:09 hsm-test0 rngd[666]: block failed FIPS test: 0x1f
Mar 08 16:43:09 hsm-test0 rngd[666]: Too many consecutive bad blocks of data, check entropy source!
Mar 08 16:43:09 hsm-test0 rngd[666]: Throttling down entropy source read speed...

Okay, I thought, did I miss something in setting up the TPM? Do you have to go through the process of claiming the TPM and setting up ownership keys before the HWRNG starts working properly? It certainly isn't returning anything useful:

$ sudo hexdump -n 512 -C /dev/hwrng
00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00000200

Digging, I discovered that the APU2 is based on the AMD GX-412TC SoC, which includes AMD's "Cryptographic Coprocessor" and "Platform Security Processor", which appears to be an embedded ARM processor designed to work as a secure enclave of sorts. Anyways, I can see that it's there, and that it's currently what the system is using as its HWRNG:

$ cat /sys/class/misc/hw_random/rng_available 
ccp-1-rng tpm-rng-0
$ cat /sys/class/misc/hw_random/rng_current 
ccp-1-rng

You can switch the current HWRNG by writing to /sys/class/misc/hw_random/rng_current:

$ /bin/echo -n "tpm-rng-0" | sudo tee /sys/class/misc/hw_random/rng_current

Which certainly seems to work better now:

$ sudo hexdump -n 512 -C /dev/hwrng
00000000  14 e4 bd d5 aa 82 c1 93  bb 06 05 2c 7c fc bd 26  |...........,|..&|
00000010  c6 41 b9 45 d9 c9 49 ef  bc 34 54 16 1a 3c 81 68  |.A.E..I..4T..<.h|
00000020  bf f3 b1 b3 a7 eb 0c 89  c9 4e f9 77 6b be e3 41  |.........N.wk..A|
00000030  bf e0 16 9b 4f 04 a1 0c  e1 fc 78 7c f8 d4 b9 c6  |....O.....x|....|
[...]
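One caveat: that sysfs write doesn't survive a reboot. A sketch of one way to make it persistent under systemd; the unit name, and the rng-tools service name it orders itself before, are my guesses, not something from the Debian packaging:

```
# /etc/systemd/system/tpm-rng.service (hypothetical)
[Unit]
Description=Select the TPM as the kernel's current HWRNG
Before=rng-tools.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo -n tpm-rng-0 > /sys/class/misc/hw_random/rng_current'

[Install]
WantedBy=multi-user.target
```

Then enable it with systemctl enable tpm-rng.service.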

You can also use rngtest to run the same tests rngd uses to check the quality of the entropy from the HWRNG:

$ sudo dd if=/dev/hwrng status=none | rngtest -c 100
rngtest 2-unofficial-mt.14
Copyright (c) 2004 by Henrique de Moraes Holschuh
This is free software; see the source for copying conditions.  
There is NO warranty; not even for MERCHANTABILITY or FITNESS 
FOR A PARTICULAR PURPOSE.

rngtest: starting FIPS tests...
rngtest: bits received from input: 2000032
rngtest: FIPS 140-2 successes: 100
rngtest: FIPS 140-2 failures: 0
rngtest: FIPS 140-2(2001-10-10) Monobit: 0
rngtest: FIPS 140-2(2001-10-10) Poker: 0
rngtest: FIPS 140-2(2001-10-10) Runs: 0
rngtest: FIPS 140-2(2001-10-10) Long run: 0
rngtest: FIPS 140-2(2001-10-10) Continuous run: 0
rngtest: input channel speed: (min=53.131; avg=55.470; max=68.755)Kibits/s
rngtest: FIPS tests speed: (min=10.474; avg=11.068; max=11.401)Mibits/s
rngtest: Program run time: 35478549 microseconds
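For context on what those "FIPS test" failures earlier actually mean: rngd and rngtest run a small battery of statistical checks from FIPS 140-2 on each 20,000-bit block. Here's a sketch of the simplest one, the monobit test, reconstructed from the spec rather than from rngtest's source: the count of 1-bits in the block must fall strictly between 9725 and 10275.

```shell
#!/bin/sh
# Sketch of the FIPS 140-2 (2001-10-10) monobit test: read a 20,000-bit
# block (2,500 bytes) and require the number of 1-bits to fall strictly
# between 9725 and 10275.
monobit() {
  ones=$(head -c 2500 "$1" | od -An -v -tu1 | awk '
    function popcount(x,  c) { while (x > 0) { c += x % 2; x = int(x / 2) }; return c }
    { for (i = 1; i <= NF; i++) n += popcount($i) }
    END { print n }')
  [ "$ones" -gt 9725 ] && [ "$ones" -lt 10275 ]
}

monobit /dev/urandom && echo "urandom: monobit pass"
monobit /dev/zero    || echo "all-zeros: monobit fail, as expected"
```

A stream of constant 0xff bytes, like the CCP was producing, fails this the same way all-zeros does, which is why rngd immediately complained.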

I'm not sure why the AMD CCP HWRNG is returning bogus data; I suspect the BIOS on the APU2 isn't properly initializing it, but I don't know for sure. There are two options: manually select which HWRNG to use, as shown above, or simply not load the CCP kernel module. Given that the HWRNG on the CCP is obviously not working, I have no idea whether anything else on it works either, so it's probably best not to trust it. You can disable it by creating /etc/modprobe.d/blacklist.conf with the following contents:

blacklist kvm_amd
blacklist ccp

(kvm_amd causes ccp to be loaded even if ccp alone is blacklisted; I'm not using KVM on this box anyway, so I blacklist it as well.) Run sudo update-initramfs -u and reboot, and the CCP no longer shows up and rngd is happy.

For comparison, here's the output of rngtest after switching to use the ChaosKey:

$ sudo dd if=/dev/hwrng status=none | rngtest -c 100
rngtest 2-unofficial-mt.14
Copyright (c) 2004 by Henrique de Moraes Holschuh
This is free software; see the source for copying conditions.  
There is NO warranty; not even for MERCHANTABILITY or 
FITNESS FOR A PARTICULAR PURPOSE.

rngtest: starting FIPS tests...
rngtest: bits received from input: 2000032
rngtest: FIPS 140-2 successes: 100
rngtest: FIPS 140-2 failures: 0
rngtest: FIPS 140-2(2001-10-10) Monobit: 0
rngtest: FIPS 140-2(2001-10-10) Poker: 0
rngtest: FIPS 140-2(2001-10-10) Runs: 0
rngtest: FIPS 140-2(2001-10-10) Long run: 0
rngtest: FIPS 140-2(2001-10-10) Continuous run: 0
rngtest: input channel speed: (min=2.778; avg=5.023; max=7.798)Mibits/s
rngtest: FIPS tests speed: (min=10.162; avg=12.822; max=14.004)Mibits/s
rngtest: Program run time: 664867 microseconds

It generates entropy much, much faster than the TPM, but I think the TPM will be fine keeping up. Plus, it's already in the box.

Posted at: 13:15 | category: /computers/random | Link