Recently at work we've been trying to get into the measuring and monitoring business. While looking around the measuring side I came across Graphite, which describes itself as "scalable realtime graphing".
I won't go into many details; you can read them on the Graphite site. Basically, it's a fairly simple and robust-looking system for gathering tuples. What kind of tuples, you say?
some.label value timestamp
Which doesn't sound like much, until you think in terms of:
system.edu_columbia_cul_verdi.cpu.load_average.1min 0.79 1330568950
Which might represent the 1-minute load average on a particular system at a point in time. Graphite itself doesn't have any tools that measure anything; it's just a data collection and visualization setup. But there are plenty of scripts and systems that let you collect all kinds of common data. After a couple of hours getting Graphite itself up and running, and some slight modification to an existing script, I was feeding various system metrics from a few machines into Graphite. A bit of noodling with the frontend later, I had some nifty-looking graphs.
The ease of all this, and, I suspect, the reason there's a good number of scripts and setups that feed data into Graphite, stems from how simple it is to get data in. Just open up a TCP socket and start sending tuples, separated by linebreaks. If you want to be a bit more efficient, throw a Python pickle of a list of tuples at another socket. If you want to keep it easy, throw a UDP packet at a third socket. That's it. There's a handful of libraries that make this even simpler for applications, which suits my "If it moves, graph it; if it doesn't move, graph it in case it starts" philosophy nicely.
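For a sense of how little is involved, here's a minimal sketch in Python; the hostname is made up, but ports 2003 (plaintext) and 2004 (pickle) are carbon's defaults:

    import pickle
    import socket
    import struct
    import time

    CARBON_HOST = 'graphite.example.com'  # made-up hostname

    # Plaintext protocol: one "label value timestamp" tuple per line.
    line = 'system.myhost.cpu.load_average.1min 0.79 %d\n' % time.time()
    sock = socket.create_connection((CARBON_HOST, 2003))
    sock.sendall(line.encode('ascii'))
    sock.close()

    # Pickle protocol: a pickled list of (label, (timestamp, value))
    # tuples, prefixed with a four-byte length header.
    tuples = [('system.myhost.cpu.load_average.1min', (time.time(), 0.79))]
    payload = pickle.dumps(tuples)
    sock = socket.create_connection((CARBON_HOST, 2004))
    sock.sendall(struct.pack('!L', len(payload)) + payload)
    sock.close()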
Even better, Graphite doesn't care about the label at all, which means that as soon as you decide to measure something, you just pick a label, decide what it means, and start slinging it at Graphite. Graphite does use rules based on regular expressions to determine how granular the data is and how long to store it. For example, we might configure Graphite to store systems.* labels at, say, 1-minute increments for a day, 5-minute increments for a month, and 15-minute increments for 3 years, and website.* metrics with a different policy. A default catches everything for which there is no rule, so if you want to measure something, just start saying it and fix the details later.
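As an illustration, that policy would look roughly like this in carbon's storage-schemas.conf; the systems.* retentions are the ones from the paragraph above, while the website.* and default values are made up:

    [systems]
    pattern = ^systems\.
    retentions = 1m:1d,5m:30d,15m:3y

    [website]
    pattern = ^website\.
    retentions = 1m:7d,15m:1y

    [default]
    pattern = .*
    retentions = 5m:30d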
The other half of what makes Graphite such a nifty tool is the ease with which you can combine different metrics: in the web frontend you just drag graphs on top of each other to start building up correlations. "Does the load on the system correlate with something an application is doing?"
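The same trick works outside the UI as well; the render endpoint takes multiple target parameters, so a URL along these lines (the hostname and the second metric are made up) overlays two metrics on one graph:

    http://graphite.example.com/render?target=system.edu_columbia_cul_verdi.cpu.load_average.1min&target=website.requests_per_minute&from=-24hours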
Every so often I come across a tool that fits a particular niche very well, and Graphite is one of them. I'm actually excited about moving from playing with this to deploying it everywhere at work, which I think is about 30% of its utility right there.
Future plans include tying Graphite into a monitoring tool (probably Xymon). Having used Xymon's predecessor Hobbit in the past, I know it's pretty easy to cause events in it. Since Graphite is already getting a lot of data, you can use that data to trigger Xymon, which can do its normal notifications. I'm thinking of a ZeroMQ publisher that publishes all of the Graphite updates as they come in, with various subscribers reading the data and sending Xymon events based on it.
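A rough sketch of that idea in Python with pyzmq, with made-up endpoints and the actual Xymon notification stubbed out:

    import zmq

    def publish_updates(lines):
        # Sits next to carbon and re-publishes each incoming
        # "label value timestamp" line to anyone who's listening.
        ctx = zmq.Context()
        pub = ctx.socket(zmq.PUB)
        pub.bind('tcp://*:5556')  # made-up port
        for line in lines:
            pub.send_string(line)

    def send_xymon_event(label, value):
        # Stub: in reality this would shell out to the xymon client
        # or speak the Xymon protocol to the server.
        print('alert: %s = %s' % (label, value))

    def watch_load(threshold=4.0):
        # Since each message starts with the label, ZeroMQ's prefix
        # subscription doubles as a metric filter.
        ctx = zmq.Context()
        sub = ctx.socket(zmq.SUB)
        sub.connect('tcp://graphite-host:5556')  # made-up host
        sub.setsockopt_string(zmq.SUBSCRIBE, 'system.')
        while True:
            label, value, timestamp = sub.recv_string().split()
            if float(value) > threshold:
                send_xymon_event(label, value)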