Category Archives: Monitoring

Heroku log drains into Logstash

The first and obvious option for shipping logs from a heroku app to Logstash is the heroku input plugin. However, this requires installing the Heroku gem and deploying the login + password of a Heroku user to your Logstash server(s). At this time it seems that any user given permissions to an app on Heroku has full control. Not good when you just want to fetch logs. Heroku has added more granular permissions via OAuth but the Heroku gem does not support OAuth tokens yet.

Fortunately there’s another option using Heroku’s log drains. Setting up a log drain from Heroku to Logstash is almost as simple as the Heroku input plugin, but it has the major advantage of not requiring any new users or passwords to be deployed on the Logstash server.

Hooking up your Heroku-deployed apps to your Logstash/Kibana/Elasticsearch infrastructure is straightforward using the syslog drains provided by Heroku. Here is a recipe for putting the pieces together:

Install Heroku toolbelt

In order to configure a log drain on Heroku you need to install the Heroku toolbelt (or use the API directly). At this time I don't think there's a way to configure log drains from the web UI.

The Heroku toolbelt can also be used to tail an app's logs on the command line, which is great for debugging since the logs output by this command are identical to the logs that will be sent to Logstash if everything is configured correctly:

List apps:

$ heroku apps
myapp1  email@dom.tld
myapp2  email@dom.tld

Tail app's logs:

$ heroku logs --app myapp1 -t
2014-01-31T14:26:11.801629+00:00 app[web.1]: 2014-01-31T14:26:11.801Z - debug: FETCHING TICKET: 15846
2014-01-31T14:26:26.753977+00:00 app[web.1]: 2014-01-31T14:26:26.752Z - debug: FETCHING TICKET: 15851
2014-01-31T14:26:27.457415+00:00 heroku[router]: at=info method=GET path=/ping? host=myapp1 request_id=6cb9b9eb-388c-4364-9278-a81179067f21 fwd="" dyno=web.2 connect=7ms service=6ms status=200 bytes=2

Configure log drain

Use the Heroku toolbelt to configure a log drain:

$ heroku drains:add --app myapp1 syslog://logstash.dom.tld:1514

Heroku's Logplex system will now send all logs generated from myapp1 to logstash.dom.tld:1514 via TCP in syslog RFC-5424 format.
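Before wiring this into Logstash, it can be handy to confirm that Logplex is actually delivering data. A throwaway TCP listener is enough for that – here is a minimal Ruby sketch (the port is whatever you passed to drains:add) that just prints each line it receives:

require 'socket'

# Quick-and-dirty listener to verify the Heroku drain is delivering data.
# Listen on the same port you gave to `heroku drains:add` (1514 here).
server = TCPServer.new('0.0.0.0', 1514)
loop do
  client = server.accept
  while (line = client.gets)
    puts line        # each line is a syslog RFC 5424 frame from Logplex
  end
  client.close
end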

Configure Logstash

Next, configure a Logstash input to receive these logs:

input {
    tcp {
        port => "1514"
        tags => ["input_heroku_syslog"]
    }
}
Heroku uses the syslog format as defined in RFC 5424. Logstash ships with grok rules that parse out most syslog formats, including RFC 5424, but I found that they were not quite perfect for Heroku logs.

For more details on Heroku log drains, see Heroku's Dev Center documentation.

Here are examples of raw log lines sent by Heroku. These are exactly what Logstash will receive and suitable for testing with the grok debugger.

The first is an example of a log message sent from a heroku component, in this case the Heroku router:

231 <158>1 2014-01-08T01:05:27.967180+00:00 d.9bc44987-ff40-40ac-a248-ff4ec4d71d7c heroku router - - at=info method=POST path=/ fwd="" dyno=web.1 connect=2ms service=81ms status=200 bytes=2

Next, is an example of a log line generated from the application's stdout:

158 <13>1 2014-01-08T17:49:14.822585+00:00 d.9bc44987-ff40-40ac-a248-ff4ec4d71d7c app web.1 - - 2014-01-08T17:49:14.822Z - info: FETCHING TICKET: 15103

Here is the grok pattern we use to parse Heroku's syslog RFC 5424-(ish) log messages:

filter {
  if "input_heroku_syslog" in [tags] {
    grok {
      match => ["message", "%{SYSLOG5424PRI}%{NONNEGINT:syslog5424_ver} +(?:%{TIMESTAMP_ISO8601:timestamp}|-) +(?:%{HOSTNAME:heroku_drain_id}|-) +(?:%{WORD:heroku_source}|-) +(?:%{DATA:heroku_dyno}|-) +(?:%{WORD:syslog5424_msgid}|-) +(?:%{SYSLOG5424SD:syslog5424_sd}|-|) +%{GREEDYDATA:heroku_message}"]
    }
    mutate { rename => ["heroku_message", "message"] }
    kv { source => "message" }
    syslog_pri { syslog_pri_field_name => "syslog5424_pri" }
  }
}

A few notes about this filter config:

  • The log message will be stored in the message field.
  • Key/value pairs matching key=value in the message will be parsed and added to the Logstash event. Many of the internal Heroku components' logs include useful key/vals.
  • Special fields heroku_drain_id, heroku_source, and heroku_dyno will be extracted from each event.

Sensu and Graphite, Part 2

In a previous post I described two methods for routing metrics generated by Sensu clients to Graphite:

  • use a pipe handler to send metrics via TCP to graphite
  • use Graphite's AMQP (rabbitmq) support.

Method #1 was simply described for completeness. It is not scalable and shouldn't be used except for very small workloads. Pipe handlers involve a fork() by sensu-server for every metric received.

At the time I recommended method #2 which was more efficient – Sensu would simply copy the metric from its own results queue to another queue that Graphite would be listening on, since both Sensu and Graphite can talk to RabbitMQ.

However, Graphite's AMQP support is fairly lacking, in my opinion. It does not seem to be getting much attention on the regular Graphite support forums and the code around AMQP has not changed much. The docs section describing its configuration remains an empty TODO.

The main reason I don't like the AMQP approach anymore is that it does not work well with Graphite clusters. I prefer to build a Graphite cluster where each node is identically configured. Each node would connect to an AMQP queue, pop a metric off of the queue in a load-balanced fashion, then let carbon-relay's routing rules figure out where to send the metric. It does not work this way. Instead each Graphite node pulls every metric posted to the queue, duplicating effort on each node in the cluster. This is wasteful and limits the capacity of the cluster needlessly.

Newer, better ways

My new preferred method for sending metrics to Graphite is to use TCP with a load-balancer in front of Graphite's carbon-relay instances in the case of a multi-node cluster.

This was not really possible when the initial blog post was written, but since that time Sensu has added support for extension handlers in addition to the original pipe handlers. Extensions are Ruby code that is loaded and run inside the sensu-server process. They are much more efficient than fork()'ing to handle each event.

There are two extension handlers available for sending metrics to Graphite:

  • Sensu-server TCP handler: Ships with sensu-server. Very simple; takes the event['output'] string and sends it untouched over a TCP socket to a destination.
  • @grepory's WizardVan: More features, supports OpenTSDB and Graphite, buffering support, re-connect, backoff, etc.

Here is a quick example of configuring each of these extension handlers.

Sensu-server TCP Handler

Configuring the TCP handler that ships with Sensu is easy and is documented in the handlers section of the Sensu docs.

The TCP handler is very basic and will simply copy the output of the check directly over the socket. This works out fine for most Sensu metric checks, since the de facto standard is to output Graphite's line-oriented format.
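For context, a metric check in that format just prints "<metric path> <value> <unix timestamp>" lines to stdout. Here is a minimal, hypothetical example plugin (the name and scheme are made up, not one of the plugins referenced below) that emits Linux load averages this way:

#!/usr/bin/env ruby
# hypothetical-load-metrics.rb - emit load averages in Graphite line format

scheme = ARGV[0] || "stats.#{`hostname -s`.strip}"
now = Time.now.to_i

# /proc/loadavg is Linux-specific; first three fields are the load averages
load1, load5, load15 = File.read('/proc/loadavg').split[0, 3]

puts "#{scheme}.load.load1 #{load1} #{now}"
puts "#{scheme}.load.load5 #{load5} #{now}"
puts "#{scheme}.load.load15 #{load15} #{now}"

exit 0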

Example tcp handler:

  "handlers": {
    "graphite_line_tcp": {
      "type": "tcp",
      "socket": {
        "host": "metrics.dom.tld",
        "port": 2003

Add the graphite_line_tcp handler to your metric checks:

  "checks": {
    "vmstat_metrics": {
      "type": "metric",
      "handlers": ["graphite_line_tcp"],
      "command": "/etc/sensu/plugins/vmstat-metrics.rb --scheme stats.:::name:::",
      "interval": 60,
      "subscribers": [ "webservers" ]

WizardVan (aka, sensu-metrics-relay) Extension

A more advanced TCP extension handler is available from @grepory and goes by the code-name WizardVan or sensu-metrics-relay (same thing, but I was confused for a moment).

WizardVan does not come shipped with Sensu but installation instructions are available on its Github page. In the future it may be easier to install by shipping as a rubygem.

WizardVan also takes advantage of another newer Sensu feature known as mutators which provide the ability for WizardVan to send metrics to either Graphite or OpenTSDB or both.

By default, WizardVan assumes that metrics are in Graphite format and so configuring it for use with Graphite is straight-forward:

Here is a general example for configuring WizardVan. See the docs for more options.

  "handlers": {
    "relay": {
        "graphite": {
            "host": "graphite.dom.tld",
            "port": 2003
        "opentsdb": {
            "host": "tsdb.dom.tld",
            "port": 4424

For further information on configuring WizardVan see the README.

NOTE: Unless you have a very high (hundreds/sec) rate of metrics you may need to lower WizardVan's MAX_QUEUE_SIZE to something less than 16KB (try 128). Hopefully soon this will be configurable instead of hardcoded.

Sensu Presentation from CentOS Dojo Phoenix

I was invited to speak at CentOS Dojo in Phoenix, AZ recently (May 2013) about the Sensu monitoring framework. I wanted to do something a little bit different than past presentations and try to show some use cases that fit what Sensu can do rather than just do a basic introduction to Sensu.

Check out the presentation below. The first half is an overview of Sensu (most of the audience at CentOS Dojo had not heard of Sensu yet) and the second half introduces (whets the appetite?!) some cool uses of Sensu, such as automated cleanup and decommissioning of EC2 nodes, routing checks to different teams in PagerDuty, and embedding “playbook” documentation in checks to help speed up MTTR.

Many thanks to @jeremy_carroll for his feedback and assistance.

Solving monitoring state storage problems using Redis

Redis is an in-memory key-value data store that provides a small number of primitives well suited to building monitoring systems. As a lot of us are hacking in this space I thought I’d write a blog post summarizing where I’ve been using it in a little Sensu-like monitoring system I have been working on, on and off.

There are some monitoring-related events coming up, like MonitoringLove in Antwerp and Monitorama in Boston – I will be attending both and I hope a few members of the community will create similar background posts on various interesting areas before these events.

I’ve only recently started looking at Redis but really like it. It’s a very lightweight daemon written in C with fantastic documentation detailing things like each command’s performance characteristics, and most documentation pages are live in that they have a REPL right on the page, like the SET page – note you can type into the code sample and see your changes in real time. It is sponsored by VMware and released under the 3-clause BSD license.

Redis Data Types

Redis provides a few common data structures:

  • Normal key-value storage where every key has just one string value
  • Hashes where every key contains a hash of key-value strings
  • Lists of strings – basically just plain old arrays kept in insertion order, allowing duplicate values
  • Sets, which are a bit like Lists except that a given value can only appear in the set once
  • Sorted Sets, which are Sets where each value also has a weight associated with it; the set is indexed by weight

All the keys support things like time-based expiry and TTL calculation. Additionally Redis also supports PubSub.

At first it can be hard to imagine how you’d use a data store with only these few data types and capable of only storing strings for monitoring but with a bit of creativity it can be really very useful.

The full reference about all the types can be found in the Redis Docs: Data Types
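To make these primitives concrete, here is a tiny sketch using the redis rubygem (the host and port are assumptions) that touches the types used in the rest of this post – a plain key with an expiry, a hash and a sorted set:

require 'redis'

redis = Redis.new(:host => "localhost", :port => 6379)

# plain key with a 120 second expiry
redis.set("monitor:heartbeat", Time.now.utc.to_i)
redis.expire("monitor:heartbeat", 120)
puts redis.ttl("monitor:heartbeat")        # seconds left before the key vanishes

# hash - one field/value map per key
redis.hset("status:example.net:load", "exitcode", 0)
puts redis.hgetall("status:example.net:load").inspect

# sorted set - members indexed by a numeric weight
redis.zadd("host:last_seen", Time.now.utc.to_i, "example.net")
puts redis.zrangebyscore("host:last_seen", 0, Time.now.utc.to_i).inspect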

Monitoring Needs

Monitoring systems generally need a number of different types of storage: configuration, event archiving, and status and alert tracking. There are more, but these are the big-ticket items; of the three I am only going to focus on the last one – status and alert tracking.

Status tracking is essentially transient data. If you lose your status view it’s not really a big deal – it will be recreated quite quickly as new check results come in. Worst case you’ll get some alerts again that you recently got. This fits well with Redis, which doesn’t always commit data as soon as it receives it – it flushes from memory to disk roughly every second.

Redis does not provide much by way of SSL or strong authentication, so I tend to consider it a single-node IPC system rather than, say, a generic PubSub system. I feed data into a node using a system like ActiveMQ and then for comms and state tracking on a single node I’ll use Redis.

I’ll show how it can be used to solve the following monitoring related storage/messaging problems:

  • Check Status – a check like load on every node
  • Staleness Tracking – you need to know when a node is not receiving check results so you can do alive checks
  • Event Notification – your core monitoring system will likely feed into alerters like Opsgenie and metric storage like Graphite
  • Alert Tracking – you need to know when you last sent an alert and when you can alert again based on an interval like every 2 hours

Check Status

The check is generally the main item in a monitoring system. Something configures a check like load, every node then produces check results for this item, and the monitoring system has to track the status of the checks on a per-node basis.

In my example a check result looks more or less like this:

{"lastcheck"        => "1357490521", 
 "count"            => "1143", 
 "exitcode"         => "0", 
 "output"           => "OK - load average: 0.23, 0.10, 0.02", 
 "last_state_change"=> "1357412507",
 "perfdata"         => '{"load15":0.02,"load5":0.1,"load1":0.23}',
 "check"            => "load",
 "host"             => ""}

This is standard stuff and the most boring part – you might guess this goes into a Hash and you’d be right. Note the count item there – Redis has special handling for counters, and I’ll show that in a minute.

By convention Redis keys are namespaced by a : so I’d store the check status for a specific node + check combination in a key like status:<host>:<check>.

Updating or creating a new hash is really easy – just write to it:

def save_check(check)
  key = "status:%s:%s" % [check.host, check.check]

  # fetch the previously stored state so the check object can work out
  # whether this result represents a state change
  check.last_state_change = @redis.hget(key, "last_state_change")
  check.previous_exitcode = @redis.hget(key, "exitcode")

  @redis.multi do
    @redis.hset(key, "host", check.host)
    @redis.hset(key, "check", check.check)
    @redis.hset(key, "exitcode", check.exitcode)
    @redis.hset(key, "lastcheck", check.last_check)
    @redis.hset(key, "last_state_change", check.last_state_change)
    @redis.hset(key, "output", check.output)
    @redis.hset(key, "perfdata", check.perfdata)

    unless check.changed_state?
      @redis.hincrby(key, "count", 1)
    else
      @redis.hset(key, "count", 1)
    end
  end

  check.count = @redis.hget(key, "count")
end

Here I assume we have an object that represents a check result called check, and we’re more or less just fetching/updating data in it. I first retrieve the previously saved exitcode and last state change time and save those into the object. The object will do some internal state management to determine if the current check result represents a changed state – OK to WARNING etc – based on this information.

The @redis.multi starts a transaction; everything inside the block will be written in an atomic way by the Redis server, thus ensuring we do not have any half-baked state while other parts of the system might be reading the status of this check.

As I said, the check object determines whether the current result is a state change once I set the previous exitcode on it. That means the unless block at the end of the transaction will either reset the count to 1 on a state change or just increment it otherwise. We use Redis's built-in counter handling (HINCRBY) to avoid having to first fetch the count, update it and save it back, which saves a round trip to the database.

You can now just retrieve the whole hash with the HGETALL command, even on the command line:

% redis-cli hgetall status:<host>:load
 1) "check"
 2) "load"
 3) "host"
 4) ""
 5) "output"
 6) "OK - load average: 0.00, 0.00, 0.00"
 7) "lastcheck"
 8) "1357494721"
 9) "exitcode"
10) "0"
11) "perfdata"
12) "{\"load15\":0.0,\"load5\":0.0,\"load1\":0.0}"
13) "last_state_change"
14) "1357412507"
15) "count"
16) "1178"

References: Redis Hashes, MULTI, HSET, HINCRBY, HGET, HGETALL

Staleness Tracking

Staleness tracking here means we want to know when we last saw any data about a node; if the node is not providing information we need to go and see what happened to it. Maybe it’s up but the data sender died, or maybe it’s crashed.

This is where we really start using some of the Redis features to save us time. We need to track when we last saw a specific node, and then we have to be able to quickly find all nodes not seen within a certain amount of time, like 120 seconds.

We could retrieve all the check results and inspect their last updated times to figure it out, but that’s not optimal.

This is what Sorted Sets are for. Remember, Sorted Sets have a weight and order the set by that weight; if we use the timestamp at which we last received data for a host as the weight, it means we can very quickly fetch a list of stale hosts.

def update_host_last_seen(host, time)
  @redis.zadd("host:last_seen", time, host)
end

When we call this code with a hostname and the current UTC time, the host will either be added to or updated in the Sorted Set. We do this every time we save a new result set with the code in the previous section.

To get a list of hosts that we have not seen in the last 120 seconds is really easy now:

def get_stale_hosts(age)
  @redis.zrangebyscore("host:last_seen", 0, (Time.now.utc.to_i - age))
end

If we call this with an age like 120 we’ll get an array of nodes that have not had any data within the last 120 seconds.

You can do the same check on the CLI; this shows all the machines not seen in the last 60 seconds:

% redis-cli zrangebyscore host:last_seen 0 $(expr $(date +%s) - 60)
 1) ""

Reference: Sorted Sets, ZADD, ZRANGEBYSCORE

Event Notification

When a check result enters the system that is either a state change, a problem, or has metrics associated with it, we want to send those on to other pieces of code.

We don’t know or care who those interested parties are, we only care that there might be some – it might be something writing to Graphite or OpenTSDB or both at the same time, or something alerting to Opsgenie or PagerDuty. This is a classic use case for PubSub, and Redis has a good PubSub subsystem that we’ll use for this.

I am only going to show the metrics publishing – problem and state changes are very similar:

def publish_metrics(check)
  if check.has_perfdata?
    msg = {"metrics" => check.perfdata, "type" => "metrics", "time" => check.last_check, "host" => check.host, "check" => check.check}.to_json
    publish(["metrics", check.host, check.check], msg)
  end
end

def publish(type, message)
  target = ["overwatch", Array(type).join(":")].join(":")
  @redis.publish(target, message)
end

This is pretty simple stuff – we’re just publishing some JSON to a named destination like overwatch:metrics:<host>:<check>. We can now write small standalone single-function tools that consume this stream of metrics and send it wherever we like – like Graphite or OpenTSDB.
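As a rough illustration of such a consumer – the channel and payload layout follow the publish code above, while the Graphite host, port and metric naming are my own assumptions – something like this would relay the perfdata values to Carbon's line protocol:

require 'redis'
require 'json'
require 'socket'

redis    = Redis.new
graphite = TCPSocket.new("graphite.dom.tld", 2003)   # assumed carbon line receiver

# psubscribe yields the pattern, the concrete channel and the message payload
redis.psubscribe("overwatch:metrics:*") do |on|
  on.pmessage do |_pattern, _channel, message|
    event = JSON.parse(message)

    # perfdata may arrive as a JSON string, as it was stored in the status hash
    metrics = event["metrics"]
    metrics = JSON.parse(metrics) if metrics.is_a?(String)

    metrics.each do |name, value|
      graphite.puts("%s.%s.%s %s %d" % [event["host"], event["check"], name, value, event["time"].to_i])
    end
  end
end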

We publish similar events for any incoming check result that is not OK and also for any state transition like CRITICAL to OK, these would be consumed by alerter handlers that might feed pagers or SMS.

We’re publishing these alerts to destinations that include the host and specific check – this way we can very easily create individual host views of activity by doing pattern-based subscribes.

Reference: PubSub, PUBLISH

Alert Tracking

Alert tracking means keeping track of which alerts we’ve already sent and when we’ll need to send them again – for example only after 2 hours of the same problem, and not on every check result, which might come in every minute.

Leading on from the previous section, we’d just consume the problem and state change PubSub channels and react to messages from those.

A possible consumer of this might look like this:

@redis.psubscribe("overwatch:state_change:*", "overwatch:issues:*") do |on|
  on.pmessage do |pattern, channel, message|
    event = JSON.parse(message)

    case event["type"]
      when "issue"
        sender.notify_issue(event["issue"]["exitcode"], event["host"], event["check"], event["issue"]["output"])
      when "state_change"
        if event["state_change"]["exitcode"] == 0
          sender.notify_recovery(event["host"], event["check"], event["state_change"]["output"])
        end
    end
  end
end

This subscribes to the two channels and passes the incoming events to a notifier. Note we’re using patterns here to catch all alerts and changes for all hosts.

The problem here is that without any special handling this is going to fire off alerts every minute, assuming we check the load every minute. This is where Redis key expiry comes in.

We’ll need to track which messages we have sent and when, and on any state change clear the tracking, thus restarting the counters.

So we’ll just add keys like alert:<host>:load:3 to indicate an UNKNOWN state alert for load on a given host:

def record_alert(host, check, status, expire=7200)
  key = "alert:%s:%s:%d" % [host, check, status]
  @redis.set(key, 1)
  @redis.expire(key, expire)
end

This takes an expire time, which defaults to 2 hours, and tells Redis to just remove the key when its time is up.
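As a small variation on the same idea, the SET and EXPIRE pair can also be collapsed into a single SETEX call, which writes the value and its TTL atomically – a minimal sketch:

def record_alert(host, check, status, expire=7200)
  key = "alert:%s:%s:%d" % [host, check, status]
  # SETEX sets the value and its expiry in one atomic command
  @redis.setex(key, expire, 1)
end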

With this we need a way to figure out if we can send again:

def alert_ttl(host, check, status)
  key = "alert:%s:%s:%d" % [host, check, status]
  @redis.ttl(key)
end
This will return the number of seconds till the next alert, and -1 if we are ready to send again.

And finally, on every state change we need to purge all the tracking for a given node + check combination. The reason is that if we notified on CRITICAL a minute ago, then the service recovers to OK but soon goes to CRITICAL again, that most recent CRITICAL alert would otherwise be suppressed as part of the previous cycle of alerts.

def clear_alert_ttls(host, check)
  # remove every alert tracking key for this host + check combination
  keys = @redis.keys("alert:%s:%s:*" % [host, check])
  @redis.del(*keys) unless keys.empty?
end

So now I can show the two methods that will actually publish the alerts:

The first notifies of issues, but only every @alert_interval seconds; it uses the alert_ttl helper above to determine whether it should send:

def notify_issue(exitcode, host, check, output)
  if (ttl = @storage.alert_ttl(host, check, exitcode)) == -1
    subject = "%s %s#%s" % [status_for_code(exitcode), host, check]
    message = "%s: %s" % [subject, output]

    send_alert(message, subject, @recipients)
    @storage.record_alert(host, check, exitcode, @alert_interval)
  else
    # the original logging call was truncated; @log here is just a stand-in logger
    @log.info("Not alerting %s#%s due to interval restrictions, next alert in %d seconds" % [host, check, ttl])
  end
end

The second publishes recovery notices – we’d always want those and they will not repeat. Here we clear all the previous alert tracking to avoid incorrect alert suppressions:

def notify_recovery(host, check, output)
  subject = "RECOVERY %s#%s" % [host, check]
  message = "%s: %s" % [subject, output]
  send_alert(message, subject, @recipients)
  @storage.clear_alert_ttls(host, check)
end



This covered a few Redis basics but it’s a very rich system that can be used in many areas so if you are interested spend some quality time with its docs.

Using its facilities saved me a ton of effort while working on a small monitoring system. It is fast and lightweight and enables cross-language collaboration that I’d have found hard to replicate in a performant manner without it.

Collecting Metrics from Ruby Processes with Zabbix Trappers

A trapper and his brother, Noah, preparing a beaver trap

Justin and I have recently started using Zabbix for monitoring, in place of Nagios. We’ve also taken the opportunity to start collecting even more metrics than before.

One nice thing about Zabbix is that it can use pre-existing Nagios monitoring plugins out of the box. But what if you also want to collect metrics from say, a Ruby process? You’re in luck! Zabbix can collect various forms of information (from numerical metrics to arbitrary strings, to log data) via the Zabbix sender protocol. Let’s set this up.

Server Side Setup

There are two things we’ll have to do from the Zabbix web interface: create a host, and create an item.

Create a Host

First we’ll create a host. This could either be a real host for which we’ll track and monitor other attributes, or simply a dummy host acting as a container for the metrics we’d like to collect from one or more sources.

Under Configuration -> Hosts, click on Create to start creating a new host. We will need to enter a Host name (e.g. “myappserver”), add our host to at least one group (e.g. “mygroup” or “Linux servers”), and enter its IP address (or you can leave the default if this is just a dummy host to hold items).

Zabbix Manual Quickstart – New Host

Create an Item

Next we’ll create an “item” on that host, which will receive and store our data. In the list of hosts, you should now see the host we created (“myappserver”). In that row, click the link for its items (“Items (0)”), then click “Create Item”.

The new item page gives us a lot of configurable parameters, but we’ll start with the essentials. First, we’ll want to specify that the “Type” of this item should be “Zabbix trapper”. Then, let’s fill in a “Name” (used in lists and a few other places around the web UI) and a “Key” (the unique name and identifier for the data we’ll be collecting). Finally, we’ll also need to select the “Type of information” we want to collect (in our case, let’s use “Numeric (unsigned)” with the default “Decimal” data type).

Click “Save”, and we’ll be ready to start receiving data.

Zabbix Manual Quickstart – New Item

Client Side (Implementing the Protocol)

Looking at the protocol definition page on the Zabbix wiki, we see that it is “…Zabbix header…JSON”. JSON is easy, but what’s a “Zabbix header”? Looking at the snippet of Java source provided, it appears that the “Zabbix header” is the string of characters ZBXD, followed by 0x01 and a bit-packed/padded integer representing the length of the JSON blob which follows. Not so bad – we can achieve this in a few lines of Ruby:

require 'json'

msg = {
  "request" => "sender data",
  "data" => [
    {
      "host"  => "myappserver",
      "key"   => "mydata",
      "value" => "1"
    }
  ]
}

body = JSON.generate msg
data_length = body.bytesize
# header: "ZBXD" + 0x01, then the body length as a little-endian integer,
# padded out to the 8 bytes the protocol expects
data_header = "ZBXD\1".encode("ascii") + \
              [data_length].pack("i") + \
              "\x00\x00\x00\x00"
data_to_send = data_header + body

Then we can simply send this off to the Zabbix server (listening on port 10051) like so:

require 'socket'

s = TCPSocket.new("zabbixserver", 10051)
s.write data_to_send

# the response comes back with the same "ZBXD\1" header
response_header = s.recv(5)
if not response_header == "ZBXD\1"
  puts "response: #{response_header}"
  raise 'Got invalid response'
end

# the next 8 bytes hold the length of the JSON body that follows
response_data_header = s.recv(8)
response_length = response_data_header[0,4].unpack("i")[0]
response_raw = s.recv(response_length)
response = JSON.load(response_raw)
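Pulled together, the whole exchange fits in a small helper method. This is just a sketch of how one might package the code above – the method name and defaults are made up, it is not an official client:

require 'json'
require 'socket'

# Minimal Zabbix sender: pushes one value to a trapper item and returns
# the server's parsed JSON response.
def send_to_zabbix(value, host: "myappserver", key: "mydata",
                   server: "zabbixserver", port: 10051)
  body = JSON.generate("request" => "sender data",
                       "data" => [{ "host" => host, "key" => key, "value" => value.to_s }])
  header = "ZBXD\1" + [body.bytesize].pack("i") + "\x00\x00\x00\x00"

  s = TCPSocket.new(server, port)
  s.write(header + body)

  raise "Got invalid response" unless s.recv(5) == "ZBXD\1"
  length = s.recv(8)[0, 4].unpack("i")[0]
  JSON.load(s.recv(length))
ensure
  s.close if s
end

puts send_to_zabbix(42).inspect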

In a future post, we’ll look at what we can do with the data we’ve collected.


Monitoring health of Dell/LSI RAID arrays with Ganglia

I have a couple hundred Dell systems with LSI RAID arrays, but we lacked hardware monitoring, which would occasionally result in situations where multiple disk failures would go uncaught. Some time ago I read this post about using MegaCli to monitor Dell's RAID controllers.

One thing I did not like about this approach is that it may generate too many e-mails, as it will send e-mails every hour until the disk has been fixed. Instead I used Ganglia with Nagios to provide similar functionality.

Create a file called analysis.awk with the following content:

    /Device Id/ { counter += 1; device[counter] = $3 }
    /Firmware state/ { state_drive[counter] = $3 }
    /Inquiry/ { name_drive[counter] = $3 " " $4 " " $5 " " $6 }
    END {
    for (i=1; i<=counter; i+=1) printf ( "Device %02d (%s) status is: %s\n", device[i], name_drive[i], state_drive[i]);
    }

Get the MegaCli utilities from e.g. RPMFind. Copy the following Bash script into a file and run it from cron at whatever frequency you need, e.g. every 30 minutes or every hour.


#!/bin/bash

# Example paths - adjust to wherever MegaCli and analysis.awk live on your systems
MEGACLI_DIR="/opt/MegaRAID/MegaCli"
MEGACLI_PATH="${MEGACLI_DIR}/MegaCli64"

GMETRIC_BIN="/usr/bin/gmetric -d 7200 "

BAD_DISKS=`$MEGACLI_PATH -PDList -aALL | awk -f ${MEGACLI_DIR}/analysis.awk | grep -Ev "*: Online" | wc -l`

if [ $BAD_DISKS -eq 0 ]; then
    STATUS="All RAID Arrays healthy"
else
    STATUS=`$MEGACLI_PATH -PDList -aALL | awk -f ${MEGACLI_DIR}/analysis.awk | grep -Ev "*: Online"`
fi

$GMETRIC_BIN -t uint16 -n failed_unconfigured_disks -v $BAD_DISKS -u disks
$GMETRIC_BIN -t string -n raid_array_status -T "Raid Array Status" -v "$STATUS"


This will create two different Ganglia metrics. One is the number of failed or unconfigured disks, and the other is a string value that gives you details on the failure, e.g. Disk 4 failed. Besides being able to alert on this metric, it also gives me a valuable data point I can use to correlate node behavior, i.e. when a disk failed, load on the machine went up.

If you are using Ganglia with Nagios integration you have two different options on how you want to alert.

1. Create a separate check for every host you want monitored e.g.

define command{
        command_name    check_ganglia_metric
        command_line    /bin/sh /var/www/html/ganglia/ host=$HOSTADDRESS$ metric_name=$ARG1$ operator=$ARG2$ critical_value=$ARG3$
        }

define service{
        use                   generic-service
        host_name             server1
        service_description   Failed/unconfigured RAID disk(s)
        check_command         check_ganglia_metric!failed_unconfigured_disks!more!0.1
        }


2. Create a single check that gives you a single alert if at least one machine has a bad disk (this is how I do it :-)). For this purpose I'm utilizing check_host_regex, which allows me to specify a regular expression of matching hosts. In my case I check every single host; if a host doesn't report the failed_disks metric I assume it has no RAID array and "ignore" those unknowns. My config is similar to this:

define command{
        command_name    check_host_regex_ignore_unknowns
        command_line    /bin/sh /etc/icinga/objects/ hreg=$ARG1$ checks=$ARG2$ ignore_unknowns=1
        }

define service{
        use                             generic-service
        host_name                       server2
        service_description             Failed disk - RAID array
        check_command                   check_host_regex_ignore_unknowns!'.*'!failed_disks,more,0.5
        }

This will give you something like this:

# Services OK = 236, CRIT/UNK = 2 : CRITICAL failed_disks = 1 disks, CRITICAL failed_disks = 1 disks


My monitoring setup

On Twitter Grig Gheorghiu posed a number of questions about monitoring tools and what he wants in a monitoring tool. This is my attempt at describing what my setup looks like or has looked like in the past.

1. Metrics acquisition / performance trending

I use Ganglia to collect all my metrics, including string metrics. A base installation of Ganglia's gmond will give you over 100 metrics. There are a number of Python modules that are disabled by default, like MySQL and Redis, that you can easily enable to get more. If you need even more you can check out these two GitHub repositories.

Don't worry about sending too many metrics. I have hosts that send in excess of 1100 metrics per host. Ganglia can handle it, so don't be shy :-). Also, when I say all metrics go into Ganglia I mean EVERYTHING. If I want to alert on it, it will be in Ganglia, so I have things like these:

  • NTP time offset
  • What version a particular key piece of software is on, e.g. deploy ID 123af58
  • Memory utilization/CPU utilization for key daemon processes
  • Number of failed disks in a RAID array
  • Application uptime
  • Etc.

2. Alerting

I use Nagios or Icinga for alerting. I don't really use any Nagios plugins, as all the checks are driven by data coming out of Ganglia. I have written a post in the past about why you should use your trending data for alerting, which you can read for some background. About a year ago Ganglia/Nagios integration was added to Ganglia Web, which makes a number of things much easier; for example I have

  • A single check that checks all hosts in the cluster for failed disk in a RAID array
  • A single check that checks whether time is within a certain offset on all hosts, i.e. to make sure NTP is actually running
  • A single check that makes sure version of deployed code is the same everywhere
  • A single check for all file systems on a local system with each file system having their own thresholds
  • A single check for elevated rates of TCP errors - useful to get a quick idea if things are globally slow, affecting a certain set of hosts in a geographic area or an individual host

The beauty of having all of the metric data in Ganglia is that you can also get creative by writing custom checks with your own user-specified logic, e.g.

  • Alert me only if 20% of my data nodes are down, e.g. in architectures where you can withstand a few nodes failing (a rough sketch of this follows below)
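Here is a rough Ruby sketch of that kind of custom logic – not one of my actual checks, just the shape of a Nagios-style check that takes per-node up/down states (however you pull them out of Ganglia) and only goes critical once a percentage of nodes is down:

#!/usr/bin/env ruby
# Hypothetical custom check: only alert when more than a given percentage
# of data nodes are down. node_states maps hostname => true (up) / false (down)
# and would be populated from your Ganglia data.

def check_cluster(node_states, critical_pct = 20)
  down = node_states.reject { |_host, up| up }.keys
  pct_down = (down.size.to_f / node_states.size) * 100

  if pct_down > critical_pct
    puts "CRITICAL: %.0f%% of data nodes down (%s)" % [pct_down, down.join(", ")]
    2   # Nagios critical exit code
  else
    puts "OK: %d of %d data nodes down" % [down.size, node_states.size]
    0
  end
end

exit check_cluster("data1" => true, "data2" => false, "data3" => true)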

In addition I recommend adding as much context to alerts as possible so that you get as much information in the alert as possible.

I also heavily utilize something I wrote called the alerting controller, which allows you to easily enable/disable/schedule downtime for services in Nagios, e.g. disable the process-alive check for the configuration servlet on all hosts while we are doing an upgrade. In addition I have a tab with Naglite open most of the time to check on any outstanding alerts.

3. Notifications

Beyond just alerting, I am always interested to see what is happening with the infrastructure so we can act proactively. For that purpose I use IRC and a modified version of NagiosBot, which echoes the following things to the channel:

  • Nagios alerts (same script that adds context to alerts above) - helpful for quick team coordination
  • Dynect DNS zone changes - those may be "invisible" so good idea to track
  • Zendesk tickets - anyone can handle support requests
  • Twitter mentions of a particular keyword
  • Application configuration changes e.g. new version of the code deployed or in progress
  • Severe application errors - a short summary of the error and which node it occurred on


This is by no means an exhaustive list and this may not be the best way to do things but it does work for me.

Our #monitoringsucks rpm repository is available

It's not only our Rubygems builds that have changed; my internal #monitoringsucks repository has too.

You might have noticed a variety of vagrant- projects on my GitHub account, the #monitoringsucks ones being part of them. All of those Vagrant projects are basically my test setups to play with those new tools.

They contain a bunch of Puppet modules that install and configure these tools. (Note that they mostly consist of git submodules pointing to other Puppet module repositories.)

Given the fact that I also like to have my software cleanly installed from a package, that meant that some of these tools had to be packaged, or I had to pull upstream packages that were hiding on the internet into a personal/internal repository.

I've forked this repository off the internal Inuits repository so you all can also benefit from these efforts.
(You gotta love pulp :))

That means you can now install all of the above-mentioned #monitoringsucks tools from our public repo:

yumrepo { 'monitoringsucks':
  baseurl  => '',
  descr    => 'MonitoringSuck at Inuits',
  gpgcheck => '0',
}

Patches to both the Vagrant projects and the puppet modules are welcome ...

Devops in Munich

Devopsdays Mountain View sold out in a short 3 hours, but there are other events that will breathe devops this summer.
DrupalCon in Munich will be one of them.

Some of you might have noticed that I'm co-chairing the devops track for DrupalCon Munich.
The CFP is open till the 11th of this month and we are still actively looking for speakers.

We're trying to bridge the gap between Drupal developers and the people that put their code into production, at scale,
but also to enhance the knowledge of the infrastructure components Drupal developers depend on.

We're looking for talks on culture (both success stories and failures) and automation,
specifically from people talking about Drupal deployments, e.g. using tools like Capistrano, Chef or Puppet.
We want to hear where continuous integration fits in your deployment, and whether you do continuous delivery of a Drupal environment.
And how do you test ... yes, we'd like to hear a lot about testing: performance tests, security tests, application tests and so on.
... Or have you solved the content vs code vs config deployment problem yet?

How are you measuring and monitoring these deployments and adding metrics to them so you can get good visibility on both
system and user actions of your platform? Have you built fancy dashboards showing your whole organisation the current state of your deployment?

We're also looking for people talking about introducing different data backends, NoSQL, scaling different search backends, and building your own CDN using smart filesystem setups.
Or making smart use of existing backends, such as tuning and scaling MySQL, memcached and others.

So let's make it clear to the community that Drupal people do care about their code after they've committed it to source control!

Please submit your talks here

Adding context to your alerts

I am a big believer in adding context to alerts. This allows the recipient of an alert to make a better decision on how to deal with it. It's often hard to classify alerts, so providing as much context as possible is extremely helpful. For instance, if I am alerting on the value of a metric I like to attach an image of that metric for the past hour. This way, if I am on my mobile phone and out and about, I have the alerting metric's graph right there without needing to open up another window or start up my laptop.

In more recent versions of Ganglia there is an option to add overlay events to hosts, which show up as vertical lines on the graphs. I figured that would be great context to add to alerts. Since I'm using Nagios, I decided to extend a mail handler I used before to query the Ganglia events database and include any events connected to the matching host in the last 24 hours. This helps in a number of scenarios to keep the team on the same page and well informed, e.g.

  • There was a code push/config change however host/service was not scheduled for maintenance
  • A recent code push is causing issues, i.e. web servers are crashing

This is an example e-mail you get

As an added bonus the mail handler sends all alerts to a Nagios Bot :-). Now all you need to make sure is to record events for any major changes. You could do a lot of these things automatically, e.g. by

  • Adding hooks to your startup scripts so that when you purposely restart services it is logged
  • Watching logs and then inserting proper events in the timeline, e.g. app stopped
  • Querying external services e.g. Dynect provides an API to query zone changes

You can download the mail handler from here