Wed, 29 Feb 2012

Graphite - Scalable Realtime Graphing

Recently at work we've been trying to get into the measuring and monitoring business. While looking into the measuring side, I came across graphite, which describes itself as "scalable realtime graphing".

I won't go into many details; you can read them on the graphite site. Basically, it's a fairly simple and robust-looking system for gathering tuples. What kind of tuples, you say?

some.label value timestamp

Which doesn't sound like much, until you think in terms of:

system.edu_columbia_cul_verdi.cpu.load_average.1min 0.79 1330568950

Which might represent the 1 minute load average on a particular system at a point in time. Graphite itself doesn't have any tools that measure anything — it's just a data collection and visualization setup — but there are plenty of scripts and systems that allow you to collect all kinds of common data. After a couple hours getting graphite itself up and running, and some slight modification to an existing script, I was feeding various system metrics from a few machines into graphite. A bit of noodling with the frontend later, I had some nifty looking graphs.
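To give a flavor of what I mean, a minimal collection script boils down to something like the sketch below. The hostname is invented for illustration (point it at wherever your carbon daemon lives), though 2003 is carbon's usual plaintext port.

#!/usr/bin/env python
# Minimal sketch of a collection script: read the 1-minute load average
# and sling it at carbon's plaintext port. Host and label are illustrative.
import socket
import time

CARBON_HOST = 'graphite.example.edu'   # assumption: wherever carbon-cache listens
CARBON_PORT = 2003                     # carbon's default plaintext port

def send_load_average():
    # Grab the 1-minute load average straight from /proc.
    with open('/proc/loadavg') as f:
        load_1min = f.read().split()[0]
    # Build a dotted label for this host (graphite treats dots as path
    # separators, hence the replace).
    label = 'system.%s.cpu.load_average.1min' % socket.gethostname().replace('.', '_')
    line = '%s %s %d\n' % (label, load_1min, int(time.time()))
    # Open a TCP socket to carbon and send the tuple, newline-terminated.
    sock = socket.create_connection((CARBON_HOST, CARBON_PORT))
    sock.sendall(line.encode('ascii'))
    sock.close()

if __name__ == '__main__':
    send_load_average()

Run that out of cron (or a loop) on each machine and you're already in the graphing business.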

The ease of all this, and, I suspect, the reason there's such a good number of scripts and setups that feed data into graphite, comes down to how simple it is to get data in. Just open up a TCP socket and start sending tuples separated by linebreaks. If you want to be a bit more efficient, throw a Python pickle of a list of tuples at another socket. If you want easy, throw a UDP packet at a third socket. That's it. There's a handful of libraries that applications can use to make this even simpler, something that appeals to my "If it moves, graph it; if it doesn't move, graph it in case it starts" philosophy.
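The pickle variant, as I understand it, is a length-prefixed pickle of (label, (timestamp, value)) tuples thrown at carbon's pickle port (2004 by default). A rough sketch, with an invented host and made-up values:

import pickle
import socket
import struct
import time

now = int(time.time())
# The metric values here are just made up for illustration.
metrics = [
    ('system.edu_columbia_cul_verdi.cpu.load_average.1min', (now, 0.79)),
    ('system.edu_columbia_cul_verdi.cpu.load_average.5min', (now, 0.64)),
]

payload = pickle.dumps(metrics, protocol=2)   # carbon is happier with an older pickle protocol
header = struct.pack('!L', len(payload))      # 4-byte big-endian length prefix

sock = socket.create_connection(('graphite.example.edu', 2004))  # hypothetical host, default pickle port
sock.sendall(header + payload)
sock.close()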

Even better, graphite doesn't care about the label at all, which means as soon as you decide to measure something you just need to pick a label, decide what it means, and start slinging it at graphite. Graphite does use rules based on regular expressions to determine how granular the data is and how long to store it, e.g. we might configure graphite to store system.* labels at, say, 1 minute increments for a day, 5 minute increments for a month, and 15 minute increments for 3 years, and website.* metrics with a different policy. A default catches everything for which there is no rule, so if you want to measure something, just start sending it and fix the details later.
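Those retention rules live in carbon's storage-schemas.conf. A hypothetical version of the policy above might look something like this (the section names and the website/default retentions are invented, rules are matched top to bottom so the catch-all goes last, and older graphite versions want the retentions spelled out as seconds:datapoints rather than this shorthand):

[system_metrics]
pattern = ^system\.
retentions = 1m:1d,5m:30d,15m:3y

[website_metrics]
pattern = ^website\.
retentions = 10s:6h,1m:7d

[default]
pattern = .*
retentions = 5m:30d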

The other half of what makes graphite such a nifty tool is the ease with which you can combine different metrics: with the web front end you just drag graphs on top of each other to start building up correlations. "Does the load on the system correlate to something an application is doing?"

Every so often I come across a tool that fits a particular niche very well, and graphite is one of them. I'm actually excited about moving from playing with this to deploying it everywhere at work, which I think is about 30% of its utility right there.

Future plans include tying graphite into a monitoring tool (probably Xymon). Having used Xymon's predecessor Hobbit in the past, I know it's pretty easy to cause events in it. Since graphite is already getting a lot of data, you can use that data to trigger Xymon, which can do its normal notification. I'm thinking of a ZeroMQ publisher that just publishes all of the graphite updates as they come in, with various subscribers reading the data and sending Xymon events based on it.
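The subscriber half of that idea might look something like the sketch below. Nothing here exists yet: the ZeroMQ endpoint, subscription prefix, and threshold are all invented for illustration, and the actual Xymon notification is left as a stub.

import zmq

ZMQ_ENDPOINT = 'tcp://graphite.example.edu:5556'  # hypothetical publisher of raw metric lines
LOAD_THRESHOLD = 5.0                              # made-up alert threshold

def raise_xymon_event(label, value):
    # Stub: this is where you'd push a status to Xymon, by whatever
    # mechanism you prefer.
    print('would alert xymon: %s is at %s' % (label, value))

def main():
    ctx = zmq.Context()
    sub = ctx.socket(zmq.SUB)
    sub.connect(ZMQ_ENDPOINT)
    sub.setsockopt(zmq.SUBSCRIBE, b'system.')   # only care about system metrics
    while True:
        # Each message is assumed to be a "label value timestamp" line,
        # exactly as carbon received it.
        label, value, timestamp = sub.recv().decode().split()
        if label.endswith('load_average.1min') and float(value) > LOAD_THRESHOLD:
            raise_xymon_event(label, value)

if __name__ == '__main__':
    main()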

Posted at: 22:12 | category: /computers/sysadmin | Link

Mon, 04 Oct 2010

Systems Thinking

A while back, some folks I know were talking about the difference between someone who is a computer administrator and someone who is a systems administrator. Basically, the conversation concentrated on the differences in thinking necessary when you move from dealing only with a single, isolated machine to dealing with multiple systems, whose parts can interact in subtle and non-trivial ways.

I think I've found a concrete example that demonstrates well this difference, and I found it this morning in the restroom just down the hall from my office. A while back, water-saving faucets were installed in this restroom, which is generally a good thing. The hot water source for this restroom, however, appears to be somewhere in the vicinity of Toledo, and there does not appear to be any pipe insulation. So, you either sit there for a long period of time, waiting for something resembling warm water to appear (which, of course, totally negates the water-saving part of the faucet), or you wash your hands with clean, cool water. I usually opt for the latter, and to add insult to injury, every time I do this I read the sign on the mirror telling me that in order to help prevent the spread of The Oink, I should wash my hands with plenty of soap and warm water.

A concrete example, I think, of how individual actions, which alone make sense, fail in a systems context.

Posted at: 10:09 | category: /computers/sysadmin | Link

Mon, 23 Mar 2009

Essential Necromancy for System Administrators

This is probably also the result of reading a large amount of Pratchett in the fever of my recent illness, but at work today I was talking to co-worker C and the phrase "Two-chicken problem" came up, in the context of "It will take a sacrifice of two chickens to solve this problem" (IBM tape library and lin_tape control paths, if you are curious). Then the thought occurred to me: there needs to be a practical guide to the occult arts for system administrators.

I mean, everyone knows it takes goat's blood to solve SCSI problems. But what about the details? Does it have to be fresh goat's blood? Do you have to sacrifice a live goat, then and there, or can you take it from a more aged specimen? Or can you just order goat's blood from a laboratory supply house? Does it even have to be liquid? Can you use powder, directly, or perhaps reconstitute it? It's the old witchcraft problem: it's all well and good to use eye of newt, but which eye? Which newt?

I'm thinking there's an O'Reilly book here.

Posted at: 20:43 | category: /computers/sysadmin | Link