This month there have been a couple of interesting discussions about on-call rotations in the tech industry. The first was started by Charity Majors, who sparked a thread on Twitter:
All this heated talk about on call is certainly revealing the particular pathologies of where those engineers work. Listen:
1) engineering is about building *and maintaining* services
2) on call should not be life-impacting
3) services are *better* when feedback loops are short
— Charity Majors (@mipsytipsy) February 10, 2018
A couple of days later John Barton followed up with an article that I really enjoyed and pretty much wholeheartedly endorse. I had a few thoughts on both of these, and wanted to talk about them here.
"But that's just an incentive for engineers to weasel extra pay by building broken systems": I think this falls apart in several ways. First, that extra pay doesn't just appear with no additional consequences: the engineer on call still has to actually fix the problem, wake up at odd hours, and be bothered when they'd much rather be bowling or watching a movie or reading a book or sleeping. Second, if this actually works at your company, your management is broken. Period. The whole point of on-call pay is to put an explicit material cost on this additional duty. If your management tolerates abuse of this pay, they either explicitly consider it part of the cost of doing business or they're not paying close enough attention, and both of those cases are entirely on them.
To everyone who argues that an engineer's pay covers this, I'd counter by asking "Okay, how much of that pay represents the on-call expectation?" I'm guessing many places wouldn't be able to answer that. And unlike many things an employer pays for that rest on fuzzy, hard-to-define criteria, this one is easy: all it takes is a stopwatch and a calculator to count up how many minutes are spent responding to incidents. Is what you're being paid for it worth it? As John points out, many other industries with highly trained professionals pay on-call differentials, and tech shouldn't be any different.
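To show how little machinery the stopwatch-and-calculator approach needs, here's a rough sketch in Python. Everything in it is made up for illustration: the incident timestamps, the `on_call_pay` figure, and the function names are my own assumptions, not anything from John's article or from a real pay policy.

```python
from datetime import datetime

# Hypothetical incident log for one on-call shift:
# (time the page arrived, time the incident was resolved).
incidents = [
    (datetime(2018, 2, 3, 3, 10), datetime(2018, 2, 3, 3, 55)),
    (datetime(2018, 2, 7, 22, 30), datetime(2018, 2, 7, 23, 45)),
    (datetime(2018, 2, 12, 14, 0), datetime(2018, 2, 12, 14, 20)),
]


def minutes_on_incidents(log):
    """Total minutes spent responding, summed over every incident."""
    total_seconds = sum((end - start).total_seconds() for start, end in log)
    return total_seconds / 60


def implied_hourly_rate(on_call_pay, log):
    """What the on-call pay works out to per hour of incident response.

    Returns None when no time was spent on incidents, since the rate
    is undefined in that case.
    """
    minutes = minutes_on_incidents(log)
    if minutes == 0:
        return None
    return on_call_pay * 60 / minutes


# Assumed figure: the slice of pay attributed to this on-call shift.
on_call_pay = 350.0

print(minutes_on_incidents(incidents))
print(implied_hourly_rate(on_call_pay, incidents))
```

This only counts time actively spent on incidents. It says nothing about the cost of being reachable, or of the sleep you lose, so treat the number as a floor for the conversation rather than the whole answer.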
I'd also add a guideline to John's list: if someone gets a page, someone else covers for them for the next 24 hours. While this isn't official policy where I work, it's my own unofficial policy to offer to cover for my co-workers when they have a particularly bad on-call day. Someone who is woken at 3am, even if they can go back to sleep ten minutes later, doesn't get as good a rest and isn't as effective the next day. Following that with another interrupted sleep the next night both makes the problem worse and means that the most critical person on your team, the one responding to an emergency, is in less than peak condition. Don't let people shrug this off with an "I'm fine"; there's a large body of sleep research that disagrees with them.
Like many things in tech that I think are bad, this is only going to change if expectations start changing, and expectations aren't going to change unless we start prodding them in the right direction. I think these kinds of questions, asking for the kinds of policies John advocates, need to become standard industry-wide. If my situation warrants, I plan on making this one of the questions I ask any potential employer, and if your situation warrants, I'd ask you to do the same.