Your startup is logging hundreds of metrics. If it isn’t, you should fix that right now.
Your users are awesome – most of the metrics update a few times a second, the slowest only a few times a minute.
So how often do you parse those logs and look for what they’re trying to say? Draw a graph or twenty?
Once a week? Every month? Every two or three months? You do know you’re supposed to sample a signal at least twice as often as it changes, right?
If you’re anything like me, the answer is practically never. Logs are boring. They’re a bitch to parse and learning anything from logs is a pain in the arse. Really the only thing they’re good at is pinpointing a bug once you know what to look for.
Okay … you and I just admitted to only checking up on vital metrics every ~1,200,000 data points.
“Well, there’s your problem!” as Adam Savage would say.
Stats, not logs
About a year ago Zemanta was operating perfectly. Everything was running smoothly, users were happy, features kept rolling out. Everything was just perfect.
Except for a tiny detail – it took them six months to discover that under the right conditions something stopped working. Just flat out didn’t work right. I don’t know what it was, they didn’t tell me, I just know it was there.
That’s when they started building this:
A dashboard of everything!
Stepping into Zemanta’s Ljubljana office, the first thing you notice is the four screens showing everything imaginable about the state of the system. Everything from a live map of API call locations to the length of numerous queues and how many connections are currently open.
Anyone who knows what’s going on can instantly see if something is wrong. Is a part of the system down? Someone trying to DoS us? Users getting errors? People complaining on Twitter?
When I was there to take a look, everything suddenly went red. DANGER! DANGER!
No idea what was wrong, but at that moment I knew this was exactly what everything I have ever built was missing – a cool dashboard to tell me when I’m being an idiot.
The perfect setup
Making a dashboard of all your metrics is surprisingly simple.
- Log everything you can think of
- Set up a statsd server
- Tell your logging module to also push to statsd
- Set up a Graphite server
- Open the Graphite frontend in a browser
- Click a few metrics
You can probably do a lot more if you want to, but that’s basically it.
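To make the "tell your logging module to also push to statsd" step concrete, here is a minimal sketch in Python. It is not how Zemanta did it (they didn't show me their code). The class name `StatsdCountingHandler`, the `myapp.log` prefix, and the address `127.0.0.1:8125` are all assumptions made up for illustration; 8125 is just the port statsd listens on by default.

```python
import logging
import socket


class StatsdCountingHandler(logging.Handler):
    """Hypothetical handler: bumps a statsd counter for every log record."""

    def __init__(self, host="127.0.0.1", port=8125, prefix="myapp.log"):
        super().__init__()
        self.address = (host, port)  # assumption: statsd on its default UDP port
        self.prefix = prefix
        # UDP is fire-and-forget, so a missing statsd server never blocks the app.
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def metric_for(self, record):
        # One counter per logger name and level, e.g. "myapp.log.payments.error:1|c"
        return f"{self.prefix}.{record.name}.{record.levelname.lower()}:1|c"

    def emit(self, record):
        try:
            self.sock.sendto(self.metric_for(record).encode("utf-8"), self.address)
        except OSError:
            pass  # metrics must never take the application down

    def close(self):
        self.sock.close()
        super().close()
```

Attach it next to whatever handlers you already have, for example `logging.getLogger("payments").addHandler(StatsdCountingHandler())`, and every `log.error(...)` also becomes a data point. Counting records per level is only one possible choice; you could just as well parse numbers out of the message.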
Statsd is “a network daemon that runs on the Node.js platform and listens for statistics, like counters and timers, sent over UDP and sends aggregates to one or more pluggable backend services (e.g., Graphite).”
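What "sent over UDP" means in practice: each metric is one short text line of the form `<name>:<value>|<type>`, where the type is `c` for a counter, `ms` for a timer and `g` for a gauge. A rough sketch of a client, assuming statsd runs locally on its default port 8125; the helper names (`format_metric`, `send`, `incr`, `timed`) and the metric names are mine, not part of any library.

```python
import socket
import time
from contextlib import contextmanager

STATSD_ADDR = ("127.0.0.1", 8125)  # assumption: local statsd, default UDP port
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def format_metric(name, value, kind):
    """Build one statsd line: <name>:<value>|<type> (c=counter, ms=timer, g=gauge)."""
    return f"{name}:{value}|{kind}"


def send(line, addr=STATSD_ADDR):
    try:
        _sock.sendto(line.encode("utf-8"), addr)
    except OSError:
        pass  # losing a data point beats crashing the app


def incr(name, count=1):
    """Increment a counter; statsd aggregates it per flush interval."""
    send(format_metric(name, count, "c"))


@contextmanager
def timed(name):
    """Report how long the wrapped block took, in whole milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = int((time.perf_counter() - start) * 1000)
        send(format_metric(name, elapsed_ms, "ms"))
```

Usage is then a one-liner wherever something interesting happens: `incr("api.calls")`, or `with timed("api.latency"): handle_request()` (where `handle_request` stands in for your own code). In real life you would probably grab an existing statsd client library instead of rolling your own.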
And Graphite is something like Mixpanel for developers – it can take all sorts of time series data and draw graphs for you. Anything you can think of, Graphite can make you a graph. Even better, it can make you a whole dashboard with just a few clicks.
More importantly, neither service needs any per-metric configuration. Nobody cares if you randomly think of a new metric – just start collecting and moments later you can add a new graph to your Graphite dashboard.
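You don't even need the clicking. Graphite exposes its graphs through a `/render` URL that takes `target`, `from` and `format` parameters, so a new metric can be pulled up by typing an address. A small sketch of building such a URL; the host `graphite.example.com` and the metric name are placeholders, and the exact path your statsd metrics end up under depends on how statsd is configured.

```python
from urllib.parse import urlencode


def render_url(base, targets, since="-1h", fmt="png"):
    """Build a Graphite render URL for one or more metric paths."""
    # doseq=True turns a list of targets into repeated target=... parameters.
    query = urlencode({"target": targets, "from": since, "format": fmt}, doseq=True)
    return f"{base.rstrip('/')}/render?{query}"


# Hypothetical host and metric name, JSON instead of a PNG image:
url = render_url("http://graphite.example.com", ["stats.myapp.api.calls"], fmt="json")
print(url)
```

Swap `fmt="json"` for the default `png` and the same URL drops straight into an `<img>` tag on a wall-mounted screen.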
Your dashboard might not look as fancy as Tony Stark’s, but it works.