Sat, 28 Mar 2020
AWS EBS Volumes and ZFS Snapshots
Recently I wanted to throw up a tiny little irc server for a small group of friends. I used this opportunity to finally create a sub-account in my personal AWS organization and host it entirely isolated. One of the things I wanted was a bog-simple backup scheme for this system, but because it's completely isolated from all of my other normal infrastructure, I can't use my standard restic backup setup.
I decided to take advantage of a few things. One, all of the stateful data on this system is on a distinct EBS volume — everything on the root volume is managed by Packer[1] and can be re-created at any time.
Two, this local volume has a ZFS zpool and several ZFS datasets on it. With ZFS, I have a filesystem which can do easy snapshots and is always consistent on disk. I use zfsnap to take hourly snapshots with a defined retention policy. So the local disk and the filesystems on it have some notion of local backups.
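The zfsnap schedule itself is just cron. A hypothetical sketch of what that can look like (the pool name, TTL, and path here are placeholders, not my actual config, and the exact flags differ between zfSnap v1 and zfsnap v2):

```shell
# Hypothetical crontab; "tank" and the 2-week TTL are placeholders.
# Take a recursive snapshot of the pool every hour, tagged with a TTL:
0 * * * *  /usr/sbin/zfsnap snapshot -a 2w -r tank
# Once a day, destroy snapshots whose TTL has expired:
30 0 * * * /usr/sbin/zfsnap destroy -r tank
```

zfsnap encodes the TTL in the snapshot name, so the destroy pass can figure out what's expired without any external state.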
Three, there's an AWS EBS Data Lifecycle Manager policy which takes a daily snapshot of the volume and shoves it into S3, with appropriate retention policies. This protects against things like me accidentally deleting the volume, or catastrophic failure of the availability zone the EC2 instance is in. The appropriate Terraform[1] code for that is:
resource "aws_ebs_volume" "irc-local-volume" {
  availability_zone = data.aws_subnet.irc-subnet.availability_zone
  size              = 24

  tags = {
    Snapshot       = "true"
    SnapshotPolicy = "Daily-2Weeks"
  }
}

resource "aws_volume_attachment" "irc-local-volume-attachment" {
  device_name = "/dev/xvdf"
  volume_id   = aws_ebs_volume.irc-local-volume.id
  instance_id = aws_instance.irc.id
}

resource "aws_iam_role" "dlm_lifecycle_role" {
  name = "dlm-lifecycle-role"

  assume_role_policy = <<-EOF
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Action": "sts:AssumeRole",
        "Principal": {
          "Service": "dlm.amazonaws.com"
        },
        "Effect": "Allow",
        "Sid": ""
      }
    ]
  }
  EOF
}

resource "aws_iam_role_policy" "dlm_lifecycle" {
  name = "dlm-lifecycle-policy"
  role = aws_iam_role.dlm_lifecycle_role.id

  policy = <<-EOF
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "ec2:CreateSnapshot",
          "ec2:DeleteSnapshot",
          "ec2:DescribeVolumes",
          "ec2:DescribeSnapshots"
        ],
        "Resource": "*"
      },
      {
        "Effect": "Allow",
        "Action": [
          "ec2:CreateTags"
        ],
        "Resource": "arn:aws:ec2:*::snapshot/*"
      }
    ]
  }
  EOF
}

resource "aws_dlm_lifecycle_policy" "two-week-policy-daily" {
  description        = "DLM policy to take daily snapshots and keep for 2 weeks"
  execution_role_arn = aws_iam_role.dlm_lifecycle_role.arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    schedule {
      name = "2 weeks of daily snapshots"

      create_rule {
        interval      = 24
        interval_unit = "HOURS"
        times         = ["23:45"]
      }

      retain_rule {
        count = 14
      }

      tags_to_add = {
        SnapshotCreator = "DLM"
      }

      copy_tags = true
    }

    target_tags = {
      Snapshot       = "true"
      SnapshotPolicy = "Daily-2Weeks"
    }
  }
}
Tying this all together: every hour, zfsnap makes a snapshot of the ZFS datasets on the local disk, so I have an on-disk consistent backup of this data. As a disaster recovery mechanism, every day AWS takes a snapshot of the EBS volume it's on. To test, and to recover in the event something happens, I launched a new instance, made a volume from the snapshot, attached it to the new instance, scanned for and imported the zpool, and then rolled the ZFS datasets on it back to the latest hourly snapshot.
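In command form, that recovery path looks roughly like the following. This is a sketch: every ID, the availability zone, the device name, and the pool/dataset/snapshot names are placeholders, not values from this setup.

```shell
# Create a volume from the chosen DLM snapshot and attach it to the
# replacement instance (all IDs below are fake placeholders):
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 \
    --availability-zone us-east-1a
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/xvdf

# On the instance: import the pool, find the newest hourly snapshot,
# and roll the dataset back to it:
zpool import -f tank
zfs list -t snapshot -r tank
zfs rollback tank/irc@2020-03-28_10.00.00--2w   # placeholder snapshot name
```

The EBS snapshot is crash-consistent at the block level; rolling back to a zfsnap snapshot is what gets you a known-good filesystem state on top of it.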
Overkill for something this simple, but a useful technique to have in your toolbox.
Posted at: 10:14 | category: /computers/aws | Link
Standard Disclaimer: HashiCorp
From time to time I may post things which discuss technologies or products of HashiCorp, Inc. If you're reading such a post and it links here, that means:
At the time of authorship of that post, I am an employee of HashiCorp. Said post is explicitly a personal project and is not an official HashiCorp product. The views expressed in that posting are entirely personal and are not statements made on the behalf of HashiCorp, Inc.
Posted at: 09:41 | category: /standard-disclaimer | Link
Sun, 08 Mar 2020
Randomness on a PCEngines APU2
This weekend I've been noodling around with my perennial project of building an ersatz HSM (what are you using to protect your home CA root?) A fresh install of Debian 10 on a PCEngines APU2 later, I started some basic setup. One of the first things I started playing with was a source of randomness for the system. In "production" there won't physically be any network connections, and as an isolated box where presumably you'd boot it up, do one or two operations, and shut it back down, there's not a lot of chance to collect entropy. Those operations will also tend to be things like "generate cryptographic keys", which consume a lot of entropy. "Low entropy generation" and "high entropy consumption" isn't a happy recipe.
Ages ago, I picked up a ChaosKey, which is a USB-based hardware random number generator with support in the Linux kernel. I've only got one, currently, and it is another thing I have to plug in and have stick out of the APU2, but then I remembered something else.
When I put this system together, I also bought the PCEngines TPM module, based on the Infineon TPM SLB 9665 line, which plugs into the LPC port on the board and provides a TPM 2.0 module. At the time, the idea was to play with the APU2 vboot and measured boot process. In the meantime, however, I remembered that the TPM module also provides a hardware random number generator, which is also apparently supported by the Linux kernel.
The first question that came to mind was "okay, how do I see that this is working?" A little looking led me to rng-tools, which includes a daemon, rngd, which will read from the hardware random number generator (HWRNG), do some entropy quality checks on it, and then fold that entropy into the kernel pool. So I started that up, but I still didn't know if it was working. Checking the status of the rng-tools service in systemctl, I saw this:
Mar 08 16:43:09 hsm-test0 rngd[666]: block failed FIPS test: 0x1f
Mar 08 16:43:09 hsm-test0 rngd[666]: block failed FIPS test: 0x1f
Mar 08 16:43:09 hsm-test0 rngd[666]: Too many consecutive bad blocks of data, check entropy source!
Mar 08 16:43:09 hsm-test0 rngd[666]: Throttling down entropy source read speed...
Okay, I thought, did I miss something in setting up the TPM? Do you have to go through the process of claiming the TPM and setting up ownership keys before the HWRNG starts working properly? It certainly isn't returning anything useful:
$ sudo hexdump -n 512 -C /dev/hwrng
00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00000200
Digging, I discovered that the APU2 is based on the AMD GX-412TC SoC, which includes AMD's "Cryptographic Coprocessor" and "Platform Security Processor", which appears to be an embedded ARM processor designed to work as a secure enclave of sorts. Anyways, I can see that it's there, and that it's currently what the system is using as its HWRNG:
$ cat /sys/class/misc/hw_random/rng_available
ccp-1-rng tpm-rng-0
$ cat /sys/class/misc/hw_random/rng_current
ccp-1-rng
You can switch the current HWRNG by writing to /sys/class/misc/hw_random/rng_current:
$ /bin/echo -n "tpm-rng-0" | sudo tee /sys/class/misc/hw_random/rng_current
Which certainly seems to work better now:
$ sudo hexdump -n 512 -C /dev/hwrng
00000000  14 e4 bd d5 aa 82 c1 93  bb 06 05 2c 7c fc bd 26  |...........,|..&|
00000010  c6 41 b9 45 d9 c9 49 ef  bc 34 54 16 1a 3c 81 68  |.A.E..I..4T..<.h|
00000020  bf f3 b1 b3 a7 eb 0c 89  c9 4e f9 77 6b be e3 41  |.........N.wk..A|
00000030  bf e0 16 9b 4f 04 a1 0c  e1 fc 78 7c f8 d4 b9 c6  |....O.....x|....|
[...]
You can also use rngtest to run the same tests rngd uses to check the quality of the entropy from the HWRNG:
$ sudo dd if=/dev/hwrng status=none | rngtest -c 100
rngtest 2-unofficial-mt.14
Copyright (c) 2004 by Henrique de Moraes Holschuh
This is free software; see the source for copying conditions.  There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
rngtest: starting FIPS tests...
rngtest: bits received from input: 2000032
rngtest: FIPS 140-2 successes: 100
rngtest: FIPS 140-2 failures: 0
rngtest: FIPS 140-2(2001-10-10) Monobit: 0
rngtest: FIPS 140-2(2001-10-10) Poker: 0
rngtest: FIPS 140-2(2001-10-10) Runs: 0
rngtest: FIPS 140-2(2001-10-10) Long run: 0
rngtest: FIPS 140-2(2001-10-10) Continuous run: 0
rngtest: input channel speed: (min=53.131; avg=55.470; max=68.755)Kibits/s
rngtest: FIPS tests speed: (min=10.474; avg=11.068; max=11.401)Mibits/s
rngtest: Program run time: 35478549 microseconds
I'm not sure why the AMD CCP HWRNG is returning bogus data; I suspect that the BIOS in the APU2 isn't properly initializing it, but I don't know for sure. There are two options: manually select which HWRNG I want to use as shown above, or simply not load the CCP kernel module. Given that the HWRNG on the CCP is obviously not working, I have no idea whether anything else on it is working either, so it's probably best not to trust it. You can disable it by creating /etc/modprobe.d/blacklist.conf with the following contents:
blacklist kvm_amd
blacklist ccp
(kvm_amd causes ccp to be loaded even if I blacklist it; I'm not using KVM on this box anyways, so I blacklist it as well.) Run sudo update-initramfs -u and reboot, and it no longer shows up and rngd is happy.
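After the reboot, a couple of quick sanity checks confirm the result (these are standard commands; the expected outputs in the comments are what I'd assume given the setup above, so verify on your own hardware):

```shell
# Confirm neither module is loaded, and that the TPM is now the HWRNG.
lsmod | grep -E '^(ccp|kvm_amd)'            # expect no output
cat /sys/class/misc/hw_random/rng_current   # expect: tpm-rng-0
```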
For comparison, here's the output of rngtest after switching to use the ChaosKey:
$ sudo dd if=/dev/hwrng status=none | rngtest -c 100
rngtest 2-unofficial-mt.14
Copyright (c) 2004 by Henrique de Moraes Holschuh
This is free software; see the source for copying conditions.  There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
rngtest: starting FIPS tests...
rngtest: bits received from input: 2000032
rngtest: FIPS 140-2 successes: 100
rngtest: FIPS 140-2 failures: 0
rngtest: FIPS 140-2(2001-10-10) Monobit: 0
rngtest: FIPS 140-2(2001-10-10) Poker: 0
rngtest: FIPS 140-2(2001-10-10) Runs: 0
rngtest: FIPS 140-2(2001-10-10) Long run: 0
rngtest: FIPS 140-2(2001-10-10) Continuous run: 0
rngtest: input channel speed: (min=2.778; avg=5.023; max=7.798)Mibits/s
rngtest: FIPS tests speed: (min=10.162; avg=12.822; max=14.004)Mibits/s
rngtest: Program run time: 664867 microseconds
It generates entropy much, much faster than the TPM, but I think the TPM will have no trouble keeping up with my needs. Plus, it's already in the box.
Posted at: 13:15 | category: /computers/random | Link