Monitoring SSL Certificate Expiry in GCP and Kubernetes

SSL cert monitoring diagram

Problem

At my current job, we use Google Cloud Platform. Each team has a set of GCP Projects; each project can have multiple clusters. The majority of services that our teams write expose some kind of HTTP API or web interface - so what does this mean? All HTTP endpoints we expose are encrypted with SSL[1], so we have a lot of SSL certificates in a lot of different places.

Each of our GCP projects is built using our CI/CD tooling. All GCP resources and all of our Kubernetes application manifests are defined in git. We have a standard set of stacks that we deploy to each cluster using our templating. One of the stacks is Prometheus, Influxdb, and Grafana. In this article, I’ll explain how we leverage (part of) this stack to automatically monitor SSL certificates in use by our load balancers across all of our GCP projects.

Certificate Renewal

To enable teams to expose services with minimal effort, we rely on deploying a Kubernetes LetsEncrypt controller to each of our clusters. The LetsEncrypt controller automatically provisions certificates for Kubernetes resources that require them, as indicated by annotations on the resources, e.g.:

apiVersion: v1
kind: Service
metadata:
  name: app0
  labels:
    app: app0
  annotations:
    acme/certificate: app0.prod.gcp0.example.com
    acme/secretName: app0-certificate
spec:
  type: ClusterIP
  ports:
    - port: 3000
      targetPort: 3000
  selector:
    app: app0

This certificate can now be consumed by an NGiNX ingress controller, like so:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: app0
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  tls:
    - secretName: app0-certificate
      hosts:
        - app0.prod.gcp0.example.com

  rules:
    - host: app0.prod.gcp0.example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: app0
              servicePort: 3000

Switching the ingress.class annotation to the value gce means the Google Compute Engine (GCE) ingress controller will handle this configuration instead. A copy of the secret (the SSL certificate) will be made in GCP as a Compute SSL Certificate resource, which the GCP load balancer can then use to serve HTTPS.
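For illustration, the same ingress handed to the GCE controller would differ only in the annotation (a sketch):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: app0
  annotations:
    kubernetes.io/ingress.class: "gce"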

Of course, this isn’t the only method for deploying SSL certificates for services in GCP and/or Kubernetes. In our case, we also have many legacy certificates that are manually renewed by humans, stored encrypted in our repositories, and deployed as secrets to Kubernetes or SSL Certificate resources to Google Compute Engine.

The GCE ingress controller makes a copy of the secret as a Compute SSL Certificate. This means that certificates used in the default Kubernetes load balancers are stored in two separate locations: the Kubernetes cluster, as a secret, and in GCE, as a Certificate resource.

Regardless of how the certificates end up in either GCE or Kubernetes, we can monitor them with Prometheus.

Whether manually renewed or managed by LetsEncrypt, our certificates end up in up to two places:

  • The Kubernetes Secret store
  • As a GCP compute SSL Certificate

Note that the NGiNX ingress controller works by mounting the Kubernetes Secret into the controller as a file.

The following commands will show certificates for each respective location:

  • Kubernetes Secrets (kubectl get secret)
  • GCP compute ssl-certificates (gcloud compute ssl-certificates list)
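As a quick sanity check, the expiry date of an individual certificate can be decoded with openssl from either location. A minimal sketch, assuming the secret uses the standard tls.crt key (resource names reuse the examples above):

# decode the certificate stored in a Kubernetes secret and print its expiry
kubectl get secret app0-certificate -o 'jsonpath={.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -enddate

# list GCP Compute SSL Certificate resources along with their expiry times
gcloud compute ssl-certificates list --format='table(name,expireTime)'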

Exposing Certificate Expiry

In order to ensure that our certificates are being renewed properly, we want to check the certificates that are being served up by the load balancers. To check the certificates we need to do the following:

  1. Fetch a list of FQDNs to check from the appropriate API (GCP or GKE/Kubernetes)
  2. Connect to each FQDN and retrieve the certificate
  3. Check the Valid To field for the certificate to ensure it isn’t in the past

To do the first two parts of this process we’ll use a couple of programs I’ve written that scrape the GCP and K8S APIs and expose the expiry times for every certificate in each:

Kubernetes manifest for prometheus-gke-letsencrypt-certs:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-gke-letsencrypt-certs
  namespace: system-monitoring
  labels:
    k8s-app: prometheus-gke-letsencrypt-certs
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: prometheus-gke-letsencrypt-certs
  template:
    metadata:
      labels:
        k8s-app: prometheus-gke-letsencrypt-certs
      annotations:
        prometheus_io_port: '9292'
        prometheus_io_scrape_metricz: 'true'
    spec:
      containers:
      - name: prometheus-gke-letsencrypt-certs
        image: roobert/prometheus-gke-letsencrypt-certs:v0.0.4
        ports:
          - containerPort: 9292

Kubernetes manifest for prometheus-gcp-ssl-certs:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-gcp-ssl-certs
  namespace: system-monitoring
  labels:
    k8s-app: prometheus-gcp-ssl-certs
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: prometheus-gcp-ssl-certs
  template:
    metadata:
      labels:
        k8s-app: prometheus-gcp-ssl-certs
      annotations:
        prometheus_io_port: '9292'
        prometheus_io_scrape_metricz: 'true'
    spec:
      containers:
      - name: prometheus-gcp-ssl-certs
        image: roobert/prometheus-gcp-ssl-certs:v0.0.4
        ports:
          - containerPort: 9292

These exporters each connect to a different API and expose every certificate’s CN along with its Valid To value as a Unix timestamp in seconds. Using these values we can calculate how long is left until each certificate expires ($valid_to - time()).

Once these exporters have been deployed, and if, like ours, Prometheus has been configured to look for the prometheus_io_* annotations, Prometheus should start scraping them and the metrics should be visible in the Prometheus UI. Search for gke_letsencrypt_cert_expiration or gcp_ssl_cert_expiration; here’s one example:
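For example, assuming the metric names used by the alert rules in the Alerting section below, an expression like the following plots the number of days remaining for each certificate:

(gke_letsencrypt_cert_expiry - time()) / 86400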

Prometheus Query - SSL

Visibility

Now that certificate metrics are being updated, the first useful thing we can do is make them visible.

Each of our projects has a Grafana instance automatically deployed to it and preloaded with some useful dashboards, one of which queries Prometheus for data about the SSL certs. When a certificate has less than seven days left until it expires, it turns orange; once it has expired, it turns red.

Grafana SSL cert expiry dashboard

The JSON for the above dashboard can be found in this gist: gist:roobert/e114b4420f2be3988d61876f47cc35ae

Alerting

Next, let’s set up some Alert Manager alerts so we can surface issues rather than having to check for them ourselves:

ALERT GKELetsEncryptCertExpiry
  IF gke_letsencrypt_cert_expiry - time() < 86400 AND gke_letsencrypt_cert_expiry - time() > 0
  LABELS {
    severity="warning"
  }
  ANNOTATIONS {
    SUMMARY = ": SSL cert expiry",
    DESCRIPTION = ": GKE LetsEncrypt cert expires in less than 1 day"
  }

ALERT GKELetsEncryptCertExpired
  IF gke_letsencrypt_cert_expiry - time() <= 0
  LABELS {
    severity="critical"
  }
  ANNOTATIONS {
    SUMMARY = ": SSL cert expired",
    DESCRIPTION = ": GKE LetsEncrypt cert has expired"
  }

ALERT GCPSSLCertExpiry
  IF gcp_ssl_cert_expiry - time() < 86400 AND gcp_ssl_cert_expiry - time() > 0
  LABELS {
    severity="warning"
  }
  ANNOTATIONS {
    SUMMARY = ": SSL cert expiry",
    DESCRIPTION = ": GCP SSL cert expires in less than 1 day"
  }

ALERT GCPSSLCertExpired
  IF gcp_ssl_cert_expiry - time() <= 0
  LABELS {
    severity="critical"
  }
  ANNOTATIONS {
    SUMMARY = ": SSL cert expired",
    DESCRIPTION = ": GCP SSL cert has expired"
  }

Caution: since LetsEncrypt certificate renewals only happen on the last day that a certificate is valid, the window of opportunity for receiving a warning alert is extremely slim.

Conclusion

In this article, I’ve outlined our basic SSL monitoring strategy and included the code for two Prometheus exporters which can expose the metrics necessary to configure your own graphs and alerts. I hope this has been helpful.




[1] Technically TLS but commonly referred to as SSL

Kubernetes Manifest Templating with ERB and Hiera

Problem

At my current job each team has a dev(n)-stage(n)-production(n) type deployment workflow. Application deployments are kept in git repositories and deployed by our continuous delivery tooling.

It is unusual for there to be major differences between applications deployed to each of these different contexts. Usually it is just a matter of tuning resource limits or, when testing, deploying a different version of the deployment.

The project matrix looks like this:

project matrix



GCP projects must have globally unique names, so ours are prefixed with “bw-”.

The directory structure is composed of Names, Deployments, and Components:

  • Name is the GCP Project name
  • A Deployment is a logical collection of software
  • A Component is a logical collection of Kubernetes manifests

For example, a monitoring deployment composed of influxdb, grafana, and prometheus might look like:

monitoring/prometheus/<manifests>
monitoring/influxdb/<manifests>
monitoring/grafana/<manifests>

The monitoring stack can be deployed to each context by simply copying the monitoring deployment to the relevant location in our directory tree:

bw-dev-teamA0/monitoring/
bw-stage-teamA0/monitoring/
bw-prod-teamA0/monitoring/
bw-dev-teamB0/monitoring/
bw-stage-teamB0/monitoring/
bw-prod-teamB0/monitoring/

In order to apply resource limits for the stage and prod environments where teamB processes more events than teamA:

bw-dev-teamA0/monitoring/prometheus/    #
bw-dev-teamA0/monitoring/influxdb/      # unchanged
bw-dev-teamA0/monitoring/grafana/       #

bw-stage-teamA0/monitoring/prometheus/  # cpu: 1, mem: 256Mi
bw-stage-teamA0/monitoring/influxdb/    # cpu: 1, mem: 256Mi
bw-stage-teamA0/monitoring/grafana/     # cpu: 1, mem: 256Mi
bw-prod-teamA0/monitoring/prometheus/   # cpu: 1, mem: 256Mi
bw-prod-teamA0/monitoring/influxdb/     # cpu: 1, mem: 256Mi
bw-prod-teamA0/monitoring/grafana/      # cpu: 1, mem: 256Mi

bw-dev-teamB0/monitoring/prometheus/    #
bw-dev-teamB0/monitoring/influxdb/      # unchanged
bw-dev-teamB0/monitoring/grafana/       #

bw-stage-teamB0/monitoring/prometheus/  # cpu: 1, mem: 256Mi
bw-stage-teamB0/monitoring/influxdb/    # cpu: 1, mem: 256Mi
bw-stage-teamB0/monitoring/grafana/     # cpu: 1, mem: 256Mi

bw-prod-teamB0/monitoring/prometheus/   # cpu: 2, mem: 512Mi
bw-prod-teamB0/monitoring/influxdb/     # cpu: 2, mem: 512Mi
bw-prod-teamB0/monitoring/grafana/      # cpu: 2, mem: 512Mi

To also test a newer version of influxdb in teamA’s dev environment:

bw-dev-teamA0/monitoring/prometheus/    #
bw-dev-teamA0/monitoring/influxdb/      # version: 1.4
bw-dev-teamA0/monitoring/grafana/       #

bw-stage-teamA0/monitoring/prometheus/  # cpu: 1, mem: 256Mi
bw-stage-teamA0/monitoring/influxdb/    # cpu: 1, mem: 256Mi
bw-stage-teamA0/monitoring/grafana/     # cpu: 1, mem: 256Mi
bw-prod-teamA0/monitoring/prometheus/   # cpu: 1, mem: 256Mi
bw-prod-teamA0/monitoring/influxdb/     # cpu: 1, mem: 256Mi
bw-prod-teamA0/monitoring/grafana/      # cpu: 1, mem: 256Mi

bw-dev-teamB0/monitoring/prometheus/    #
bw-dev-teamB0/monitoring/influxdb/      # unchanged
bw-dev-teamB0/monitoring/grafana/       #

bw-stage-teamB0/monitoring/prometheus/  # cpu: 1, mem: 256Mi
bw-stage-teamB0/monitoring/influxdb/    # cpu: 1, mem: 256Mi
bw-stage-teamB0/monitoring/grafana/     # cpu: 1, mem: 256Mi

bw-prod-teamB0/monitoring/prometheus/   # cpu: 2, mem: 512Mi
bw-prod-teamB0/monitoring/influxdb/     # cpu: 2, mem: 512Mi
bw-prod-teamB0/monitoring/grafana/      # cpu: 2, mem: 512Mi

The point of this example is to show how quickly maintenance can become a problem when dealing with many deployments across multiple teams/environments.

For instance, this example shows that five unique sets of manifests would need to be maintained for this single deployment.

Solution

Requirements

  • Deploy different versions of a deployment to different contexts (versioning)
  • Tune deployments using logic and variables based on deployment context (templating)

Versioning

Let’s say we want to have the following:

bw-dev-teamA0/monitoring/prometheus/    #
bw-dev-teamA0/monitoring/influxdb/      # version: 1.4
bw-dev-teamA0/monitoring/grafana/       #

bw-stage-teamA0/monitoring/prometheus/  #
bw-stage-teamA0/monitoring/influxdb/    # version: 1.3
bw-stage-teamA0/monitoring/grafana/     #

bw-prod-teamA0/monitoring/prometheus/   #
bw-prod-teamA0/monitoring/influxdb/     # version: 1.3
bw-prod-teamA0/monitoring/grafana/      #

bw-dev-teamB0/monitoring/prometheus/    #
bw-dev-teamB0/monitoring/influxdb/      # version: 1.3
bw-dev-teamB0/monitoring/grafana/       #

bw-stage-teamB0/monitoring/prometheus/  #
bw-stage-teamB0/monitoring/influxdb/    # version: 1.3
bw-stage-teamB0/monitoring/grafana/     #

bw-prod-teamB0/monitoring/prometheus/   #
bw-prod-teamB0/monitoring/influxdb/     # version: 1.3
bw-prod-teamB0/monitoring/grafana/      #

This can be achieved by creating directories for each version of the deployment:

/manifests/monitoring/0.1.0/           # contains influxdb version 1.3
/manifests/monitoring/0.2.0/           # contains influxdb version 1.4
/manifests/monitoring/latest -> 0.2.0  # symlink to latest version (used by dev environments)

And then by quite simply symlinking the deployment to the version to deploy:

bw-dev-teamA0/monitoring/   -> /manifests/monitoring/latest  # deployment version 0.2.0
bw-stage-teamA0/monitoring/ -> /manifests/monitoring/0.1.0
bw-prod-teamA0/monitoring/  -> /manifests/monitoring/0.1.0

bw-dev-teamB0/monitoring/   -> /manifests/monitoring/0.1.0
bw-stage-teamB0/monitoring/ -> /manifests/monitoring/0.1.0
bw-prod-teamB0/monitoring/  -> /manifests/monitoring/0.1.0

Although this solves the versioning problem, this doesn’t help with customizing the deployments, which is where templating comes in.

ERB and Hiera

erb-hiera

Understanding ERB and Hiera is beyond the scope of this article but this diagram should give some clue as to how they work.

Templating

erb-hiera is a generic templating tool. Here’s an example of a config that deploys various versions of a deployment to different contexts:

- scope:
    environment: dev
    project: bw-dev-teamA0
  dir:
    input: /manifests/monitoring/latest/manifest
    output: /output/bw-dev-teamA0/cluster0/monitoring/

- scope:
    environment: stage
    project: bw-stage-teamA0
  dir:
    input: /manifests/monitoring/0.1.0/manifest
    output: /output/bw-stage-teamA0/cluster0/monitoring/

- scope:
    environment: prod
    project: bw-prod-teamA0
  dir:
    input: /manifests/monitoring/0.1.0/manifest
    output: /output/bw-prod-teamA0/cluster0/monitoring/

- scope:
    environment: dev
    project: bw-dev-teamB0
  dir:
    input: /manifests/monitoring/0.1.0/manifest
    output: /output/bw-dev-teamB0/cluster0/monitoring/

- scope:
    environment: stage
    project: bw-stage-teamB0
  dir:
    input: /manifests/monitoring/0.1.0/manifest
    output: /output/bw-stage-teamB0/cluster0/monitoring/

- scope:
    environment: prod
    project: bw-prod-teamB0
  dir:
    input: /manifests/monitoring/0.1.0/manifest
    output: /output/bw-prod-teamB0/cluster0/monitoring/

Note that instead of having a complex and difficult-to-manage directory structure of symlinks, the input directory is defined in each block - in this example the input directories are versioned, as discussed in the Versioning section.

Example hiera config:

:backends:
  - yaml
:yaml:
  :datadir: "hiera"
:hierarchy:
  - "project/%{project}/deployment/%{deployment}"
  - "deployment/%{deployment}/environment/%{environment}"
  - "common"

Now it is possible to configure some default resource limits for each environment. It is assumed stage and prod require roughly the same amount of resources by default:

deployment/monitoring/environment/stage.yaml:

limits::cpu: 1
limits::mem: 256Mi

deployment/monitoring/environment/prod.yaml:

limits::cpu: 1
limits::mem: 256Mi

Then override team B’s production environment to increase the resource limits, since it needs more resources than the other environments: project/bw-prod-teamB0/deployment/monitoring.yaml:

limits::cpu: 2
limits::mem: 512Mi
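To make the lookup order concrete, here is a sketch of how a limits::cpu lookup resolves for the two production projects, given the hierarchy above:

# scope: project=bw-prod-teamB0, environment=prod, deployment=monitoring
project/bw-prod-teamB0/deployment/monitoring.yaml   # hit: limits::cpu = 2

# scope: project=bw-prod-teamA0, environment=prod, deployment=monitoring
project/bw-prod-teamA0/deployment/monitoring.yaml   # no such file, fall through
deployment/monitoring/environment/prod.yaml         # hit: limits::cpu = 1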

One more change is required in order for this configuration to work. It is necessary to wrap the limits config in a condition so that no limits are applied to the dev environment:

<%- if hiera("environment") =~ /stage|prod/ -%>
apiVersion: v1
kind: LimitRange
metadata:
  name: limits
spec:
  limits:
  - default:
      cpu: <%= hiera("limits::cpu") %>
      memory: <%= hiera("limits::mem") %>
...
<% else %>
# no limits set for this environment
<% end %>
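Rendered against team B’s prod scope, the template above would produce something like the following (a sketch based on the hiera data defined earlier):

apiVersion: v1
kind: LimitRange
metadata:
  name: limits
spec:
  limits:
  - default:
      cpu: 2
      memory: 512Mi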

The result is that with a simple erb-hiera config, hiera config, hiera lookup tree, and versioned manifests, the desired configuration is reached. There is less code duplication, and more flexibility in manifest creation.

Why Not Helm?

Helm can be used in various ways; it can do as much or as little as required. It can act in a similar way to erb-hiera by simply being used to generate manifests from templates, or act as a fully fledged release manager, deploying a pod into a Kubernetes cluster which tracks release state for the deployed Helm charts.

So why erb-hiera? Because it is simple, and our teams are used to the combination of ERB templating language and Hiera due to their familiarity with Puppet. We can use the same tool across multiple code bases which manage our infrastructure and applications.

If you like Hiera but prefer Go templates, perhaps developing a Hiera plugin for Helm would be a good option?

erb-hiera can be used to manage all Kubernetes manifests, but it is also entirely possible to use Helm in parallel. At the moment we have a combination of native Kubernetes manifests, Helm charts, and documents generated from erb-hiera templates.

Conclusion

erb-hiera is a simple tool which does just one thing: document generation from templates. This article has shown one possible use case where using a templating tool can be combined with versioning to provide powerful and flexible Kubernetes manifest management.

References

LetsEncrypt NGiNX Quick Start

NGiNX support for the LetsEncrypt letsencrypt-auto tool is not yet stable; here are some instructions on how to get up and running with LetsEncrypt when using NGiNX.

NGiNX Static Content Server

Start a web server with a config like:

server {
    listen      80;
    server_name www.dust.cx dust.cx;
    location / { root /var/www/dust.cx; autoindex on; }
}

Certificate Request

Request certificate:

git clone https://github.com/letsencrypt/letsencrypt ~/git/letsencrypt 
~/git/letsencrypt/letsencrypt-auto certonly --webroot -w /var/www/dust.cx -d dust.cx -d www.dust.cx

NGiNX Config

Update NGiNX config to redirect all HTTP traffic to HTTPS, and specify cert file paths:

server {
    listen      80;
    server_name www.dust.cx dust.cx;
    rewrite     ^ https://$server_name$request_uri? permanent;
}

server {
    listen 443;
    server_name www.dust.cx dust.cx;

    ssl on;
    ssl_certificate /etc/letsencrypt/live/dust.cx/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/dust.cx/privkey.pem;

    ssl_stapling on;
    ssl_stapling_verify on;
    add_header Strict-Transport-Security "max-age=31536000; includeSubdomains";

    location / { root /var/www/dust.cx; autoindex on; }
}

Reload NGiNX:

service nginx reload

Test

$ echo -n | openssl s_client -connect dust.cx:443
CONNECTED(00000003)
depth=2 O = Digital Signature Trust Co., CN = DST Root CA X3
verify return:1
depth=1 C = US, O = Let's Encrypt, CN = Let's Encrypt Authority X1
verify return:1
depth=0 CN = dust.cx
verify return:1
---
Certificate chain
 0 s:/CN=dust.cx
   i:/C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X1
 1 s:/C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X1
   i:/O=Digital Signature Trust Co./CN=DST Root CA X3
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIE/zCCA+egAwIBAgISAdtFUuyTk5UZoVzFVVnVT25zMA0GCSqGSIb3DQEBCwUA
MEoxCzAJBgNVBAYTAlVTMRYwFAYDVQQKEw1MZXQncyBFbmNyeXB0MSMwIQYDVQQD
ExpMZXQncyBFbmNyeXB0IEF1dGhvcml0eSBYMTAeFw0xNTEyMDYxMTQ0MDBaFw0x
NjAzMDUxMTQ0MDBaMBIxEDAOBgNVBAMTB2R1c3QuY3gwggEiMA0GCSqGSIb3DQEB
AQUAA4IBDwAwggEKAoIBAQDG+5xpMLdKinooEM4+ocZgtYAa+GaKc/RhbhuZLAh6
xYHy1/vLutqBlifuv6qXAtAYrM/xk3+zW7KrCXv3iz7ZYKh5mMKPV5hn+M8fIfqo
NHg9t75BlgeP6M/EG4td+hXWS9jYFJ7o82SIDX8zhDlEs/g3bQIE/+DuYWSC5WYu
PbJ1kUkOfGs7HQPwTPt7d2QafiEoy0sszgfPPPsYiEuOddgtsrKE+F9LuDdbT+Ze
V3TVK6nzdw7Km+i68xBTFk7m9+3guYBAf1yB4yROxNNOReBahqh3aFMjo4zZ3cYj
/U+MpOExTbT7ECO/mXkhCBzjK2I/k2bGqOhWcBOATvNlAgMBAAGjggIVMIICETAO
BgNVHQ8BAf8EBAMCBaAwHQYDVR0lBBYwFAYIKwYBBQUHAwEGCCsGAQUFBwMCMAwG
A1UdEwEB/wQCMAAwHQYDVR0OBBYEFG7T/aH/BX76cSQou6icQD/fs5ZxMB8GA1Ud
IwQYMBaAFKhKamMEfd265tE5t6ZFZe/zqOyhMHAGCCsGAQUFBwEBBGQwYjAvBggr
BgEFBQcwAYYjaHR0cDovL29jc3AuaW50LXgxLmxldHNlbmNyeXB0Lm9yZy8wLwYI
KwYBBQUHMAKGI2h0dHA6Ly9jZXJ0LmludC14MS5sZXRzZW5jcnlwdC5vcmcvMB8G
A1UdEQQYMBaCB2R1c3QuY3iCC3d3dy5kdXN0LmN4MIH+BgNVHSAEgfYwgfMwCAYG
Z4EMAQIBMIHmBgsrBgEEAYLfEwEBATCB1jAmBggrBgEFBQcCARYaaHR0cDovL2Nw
cy5sZXRzZW5jcnlwdC5vcmcwgasGCCsGAQUFBwICMIGeDIGbVGhpcyBDZXJ0aWZp
Y2F0ZSBtYXkgb25seSBiZSByZWxpZWQgdXBvbiBieSBSZWx5aW5nIFBhcnRpZXMg
YW5kIG9ubHkgaW4gYWNjb3JkYW5jZSB3aXRoIHRoZSBDZXJ0aWZpY2F0ZSBQb2xp
Y3kgZm91bmQgYXQgaHR0cHM6Ly9sZXRzZW5jcnlwdC5vcmcvcmVwb3NpdG9yeS8w
DQYJKoZIhvcNAQELBQADggEBADRzDUqJGXwVCAZTch9C3pLVbahmJ3vu3Iz1niXo
eMWceM3hEMUXtDAWIJbnmbDG9X37MI58+L9mHmD593cE7b7y1u0PtRta0X3QMYzd
CemUZD5RkII3KZuz1CYbccbdE/oL8xkAXwNxlNS6qHkdoS0xPRm3COX5DDJgIR0t
OjOthLu/XXPkdm7sA3mtxdhGGvAbNKvBNZiHOBdYR2IkxaIl6ONl5vpa/0pPAJ0p
u0I86Fpu3EwVH5dsK+jk3EXn/Zhv15EDc6mwJ0GSRGYtn83+SM3kAILmkcLxhflx
XZYHrONeYLkPhDUJGnCxObPHbSVauVvdUgW1HnfAdph1+dE=
-----END CERTIFICATE-----
subject=/CN=dust.cx
issuer=/C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X1
---
No client certificate CA names sent
Peer signing digest: SHA512
Server Temp Key: ECDH, P-256, 256 bits
---
SSL handshake has read 3157 bytes and written 441 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-GCM-SHA384
    Session-ID: CAB0B56296FF95BA74ADC40876E78EBAA4B3949FDFC145B0DFCDAB3A5C69B588
    Session-ID-ctx: 
    Master-Key: D04421C7E3BDE901845C4F418601B8118A7F7CAACA1C18B1CC8E0F02687DDFB5AF39A7ED213294C833BBC9BFE850C1A8
    Key-Arg   : None
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    TLS session ticket lifetime hint: 300 (seconds)
    TLS session ticket:
    0000 - 9e d2 78 c0 fd e2 03 e9-c6 ec 39 ad 55 3a 14 df   ..x.......9.U:..
    0010 - 2c 93 0a c4 13 30 af 73-9c 64 04 9d 18 e8 c1 21   ,....0.s.d.....!
    0020 - de 48 31 c9 02 53 17 38-2a a5 b4 04 4f 68 38 e9   .H1..S.8*...Oh8.
    0030 - 08 45 ec b4 ec 45 38 a5-7b 5d d9 d8 e8 40 02 f2   .E...E8.{]...@..
    0040 - 1b 39 92 b5 08 bc e0 f0-2a 81 a6 85 66 76 20 86   .9......*...fv .
    0050 - 80 52 5c 58 90 21 da 3f-e9 9c d0 81 d1 f6 ba dc   .RX.!.?........
    0060 - 8e 4f 11 b3 d2 51 ed 0f-ff 6d f6 06 00 d6 ec 6e   .O...Q...m.....n
    0070 - 00 b5 9d ec b9 7d b0 5f-1c 3c b2 fa 6c 1d 89 c5   .....}._.<..l...
    0080 - 84 3d 69 98 28 de df c1-24 23 cf c3 fd c4 81 90   .=i.(...$#......
    0090 - c7 16 b2 ed 8d f7 49 32-37 32 04 9b 42 e1 08 3f   ......I272..B..?
    00a0 - e5 43 f8 4d 55 23 e2 19-b4 ad f2 80 c4 9d 12 b9   .C.MU#..........

    Start Time: 1449413126
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)
---
DONE

Sensu – What I’ve Learnt

What!

I first tried Sensu at my last job roughly 2 years ago and loved it. After years of Nagios (yes, every distribution of) and the occasional flirtation with Zabbix (I want something different but oh god no), Sensu came along and offered a fresh perspective.

After a couple of weeks running Sensu in parallel with Nagios I was convinced. It wasn't until starting my current job that I really got my hands dirty with Sensu; this blog post is about what I've learnt over the last year.

Why Sensu?

I've been inspired by some great talks by people who are far better at explaining things than me; here are some of them:

Andy Sykes' talk called Stop using Nagios (so it can die peacefully) is a wonderful, opinionated talk on what the future of monitoring could look like. Although this talk was only given 18 months ago, Sensu has evolved a lot since then, and several of the problems brought up in the talk have since been solved.

Kyle Anderson is a great contributor to the Sensu community, and his talk entitled Sensu @ Yelp (part 1, part 2) was the first Sensu talk I saw. It discusses how Sensu has been deployed at Yelp. The talk is a good starting point: it explains Sensu and its dependencies from the ground up, all the way through to customizing it to fit your company's specific needs. It really opened my eyes to the flexibility of Sensu and what is possible.

Recently I was asked to give a talk to the team leaders at my job about the work that I've been doing on our monitoring platform. The talk doesn't focus solely on Sensu but is more generally about how we've improved the monitoring platform, and how, with minimal effort from teams, they could not only help Ops but also improve Dev visibility of problems in production for themselves.

The talk slides (press 's' for speakernotes) are available here.

How?

This section contains some resources to help gain a better understanding of Sensu.

The Sensu docs are constantly being updated; they are strong in places but weaker in others:

  • https://sensuapp.org/docs/latest

Kyle Anderson has done a cool free introductory course to Sensu in the form of video lectures:

  • https://www.udemy.com/sensu-introduction/learn/

The renowned Sensu diagram from older versions of the docs and the current Sensu infrastructure gif are both relatively confusing in my opinion. I made the following diagram which I find helps to describe the Sensu event pipeline:

sensu event pipeline

Note: this describes Sensu configured with Standalone checks only. If Subscription checks are used then the Sensu client reads from topics on RabbitMQ, too.

Sensu in Anger

Standalone Vs. Subscription Checks

One of the main points of confusion for a lot of people seems to be whether to choose subscription or standalone checks.

The difference between the two is that subscription checks are defined on the sensu-servers and clients simply have a 'subscriptions' parameter listing the subscriptions to subscribe to, whereas standalone checks are defined directly on each client.

For simplicity, and to follow what seems to be best practice, I tend to primarily use standalone checks. The Puppet and Chef modules by default assume that checks are defined as standalone checks, with subscription checks being the exception. I like that the configuration of standalone checks is on the client which makes things like debugging individual machines simpler.
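As a sketch, here is the same (hypothetical) check defined both ways. The standalone variant lives on the client:

{
  "checks": {
    "check_disk": {
      "command": "check-disk-usage.rb -w 80 -c 90",
      "standalone": true,
      "interval": 60
    }
  }
}

The subscription variant is defined on the sensu-servers and runs on every client that subscribes to the base subscription:

{
  "checks": {
    "check_disk": {
      "command": "check-disk-usage.rb -w 80 -c 90",
      "subscribers": ["base"],
      "interval": 60
    }
  }
}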

Note: there is a safe_mode parameter which can be set on clients when using subscription mode. The safe mode parameter is a security measure to prevent the Sensu client from executing a check scheduled by a Sensu server if a corresponding check definition doesn't exist locally.

As far as I know standalone checks have two limitations:

Standalone checks can't be used to create aggregate checks - i.e. when you want to check whether a certain percentage of machines are in a certain state. The reason for this is that aggregate check data is bucketed by 'issued' timestamp. Subscription-based checks all share the same issued timestamp, since it is set when the check request is read off the transport and forms part of the event data in the response after the check has been run by the client. For standalone checks the issued timestamp is generated when the check runs, and since each client schedules its own checks, the issued timestamps between clients won't match up.

Something to note when using subscription checks to create aggregate checks: since the bucket name for the aggregate check data is composed of the issued timestamp and the check name, it isn't possible to create aggregate checks from checks that have different check names. I did actually write a patch to solve this problem, but because it involved a scheduler rewrite it was decided that the added complexity wasn't worth supporting what was considered an edge case. As Sean Porter points out in the PR, the aggregate functionality may become more flexible in the future.

The second limitation is that round-robin checks can only be configured when using subscription checks; again, this is because the Sensu server is used to schedule the checks rather than the clients, which have no common knowledge between them.

JIT Clients

I originally wrote about JIT clients in a previous post entitled Sensu - Host Masquerading, they have now been implemented and are a great way to monitor things like switches or any devices which can't run a sensu-client natively.

RabbitMQ Issues

A lot of people initially have trouble configuring RabbitMQ.

  1. check the Erlang version to make sure it is at least R16B01, otherwise SSL won't work
  2. get a Sensu client connecting to the RabbitMQ transport without SSL (see the sketch below)
  3. configure SSL
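A minimal non-SSL transport config for step 2 might look like the following (the host and credentials are examples); once this works, switch the port to 5671 and add the ssl section with your certificate paths:

{
  "rabbitmq": {
    "host": "rabbitmq.example.com",
    "port": 5672,
    "vhost": "/sensu",
    "user": "sensu",
    "password": "secret"
  }
}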

Enable RabbitMQ web UI:

rabbitmq-plugins enable rabbitmq_management

Browse to http://server:15672

Also from the CLI:

# list clients connected to rabbitmq
rabbitmqctl list_connections -p /sensu

TTL and Timeouts

Configure your checks with timeouts and TTLs; otherwise, when a check config is removed and a client restarted, the check silently stops running and you won't receive an alert.

Timeouts should be configured to kill long running check scripts and help avoid problems with check scripts running multiple times due to long execution times.
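For example (the values are illustrative; the ttl should comfortably exceed the check interval):

{
  "checks": {
    "check_disk": {
      "command": "check-disk-usage.rb -w 80 -c 90",
      "standalone": true,
      "interval": 60,
      "timeout": 30,
      "ttl": 180
    }
  }
}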

Debugging

Common pattern for debugging sensu-{client,server,api}:

# disable puppet runs; enable debugging; pipe logs through jq
puppet agent --disable 'sensu debugging'
sed -i 's/warn/debug/' /etc/default/sensu
/etc/init.d/sensu-client restart
tail -f /var/log/sensu/sensu-client | jq .

# do some debugging..

# re-enable puppet; run puppet to reset client config state
puppet agent --enable
puppet agent -t

Deploying with Ansible (symlinks)

At my current job we manage everything up to application level with Puppet, and then use Ansible to deploy the applications. This is mainly because Ansible is much friendlier for developers to use, and means we can delegate writing application deployments out to teams. Our applications are deployed under a single unprivileged user account with write access to a subdirectory under /etc/sensu: /etc/sensu/conf.d/checks/app. I added a patch to Sensu to allow it to read configuration files from symlinked directories; in this way, application checks can be deployed as follows:

$ ls -l /etc/sensu/conf.d/checks/app
total 8
drwxrwxr-x 2 sensu   sensu     4096 Sep 14 17:08 .
dr-xr-xr-x 3 sensu   sensu     4096 Sep 14 12:54 ..
lrwxrwxrwx 1 company company   32 Sep 14 17:04 app_a-1 -> /home/company/opt/app_a-1/checks
lrwxrwxrwx 1 company company   41 Sep 14 17:05 app_b-1 -> /home/company/opt/app_b-1/checks
lrwxrwxrwx 1 company company   45 Sep  5 18:37 app_b-2 -> /home/company/opt/app_b-2/checks

Now when applications get removed from servers, all that is left is a dangling symlink which Puppet can then clean up later.

Running Checks on a System

A common task is logging in to a server to debug a check, or manually running a check to see if a problem has been fixed. To that end I wrote a prototype/primitive shell script that uses jq to extract the command from a check definition and run it. At some point I'll work more on this and add bash/zsh completion.
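The core of such a script is small; a minimal sketch (the check name and config path are examples, and real check definitions may be spread across multiple files):

#!/usr/bin/env bash
#
# run the command defined for a given Sensu check
#
# usage: run-check <check_name>

CHECK=$1

# find the check definition and extract its command with jq
CMD=$(jq -r ".checks[\"${CHECK}\"].command | select(. != null)" \
  /etc/sensu/conf.d/checks/*.json 2>/dev/null | head -n1)

[ -z "${CMD}" ] && { echo "check not found: ${CHECK}"; exit 1; }

echo "running: ${CMD}"
eval "${CMD}"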

One of the nice reasons to have such a tool would be to allow developers to deploy Sensu check configurations into their development environments and still get an overview of the output from each check command without needing a running Sensu agent or Sensu cluster.

Multiple Slack Channels

I modified the original Slack handler and added the ability to send alerts to multiple Slack channels; the handler and further information can be found here

Embedding Interesting Data

Graphs / Graphite / Grafana

I wrote a blog post on embedding Graphite graphs into Sensu using Ansible. Since then I've switched to much prettier (and interactive) Grafana graphs deployed by Puppet, but the same technique can be used as described in the original post.

Event History / Logging

Sensu maintains the last 21 states of each currently active check to use for things like flap detection. Sensu doesn't have a full event history, but in keeping with the Unix philosophy, there is a logstash handler which allows you to write event history to Logstash. Kibana can then be used to view event history.

I wrote a patch that is now part of Uchiwa which allows embedding iFrames into Sensu metadata.

Here is another prototype app I wrote that acts as a proxy between Uchiwa and Elasticsearch containing Logstash data. The proxy returns an iframe that can be embedded in Uchiwa.

I would like to prettify the log output at some point.

Sensu kibana iframe test

Moving from Nagios

HA

Inevitably, after deciding to use Sensu in production, you'll want to look at running Sensu in a HA configuration; here's a diagram describing my configuration:

Sensu HA Platform

How Many Checks?

Some bash to calculate how many checks are running on your infrastructure..

mbp0 /home/rw > cat tmp/sensu_overview.sh
#!/usr/bin/env bash
#
# Script to output some statistics about a Sensu deployment
#
# Notes:
#
# * requires jq (https://stedolan.github.io/jq/)
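# * requires sensu-cli (for the client list)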
# * slow
#

SERVER=$1

function number_of_checks () {
  checks=0

  for client in $(sensu-cli client list -f json | jq -r '.[].name'); do
    client_checks=$(curl -s ${SERVER}:4567/clients/${client}/history | jq '. | length')
    checks=$((${checks}+${client_checks}))
  done

  echo $checks
}

function number_of_clients () {
  curl -s ${SERVER}:4567/clients | jq '. | length'
}

echo "number of clients: $(number_of_clients)"
echo "number of checks:  $(number_of_checks)"
mbp0 /home/rw > ./tmp/sensu_overview.sh sensu.xxx.net
number of clients: 429
number of checks:  11481

Other Contributions..

I wrote the initial implementation of the result data storage, which essentially allows green-light-esque dashboards, i.e. the ability to see metadata and output values from checks with status 0. This was the groundwork which allowed the TTL feature to be implemented.

Conclusion

Sensu is great. It's a really flexible, easily customizable platform that can be integrated into just about anything. I've had fun contributing back to the community and look forward to seeing the new and interesting ways people come up with using Sensu.

Columned Graphite Data in InfluxDB

For a long time now, Graphite has been the de facto standard time-series database. Recently I decided to try InfluxDB; this blog post is about what I’ve found.

Installation and configuration of InfluxDB is as about as simple as it can get:

mbp0 /home/rw 2> dpkg -c tmp/influxdb_0.9.4.2_amd64.deb
drwx------ 0/0               0 2015-09-29 18:52 ./
drwxrwxr-x 0/0               0 2015-09-29 18:52 ./usr/
drwxrwxr-x 0/0               0 2015-09-29 18:52 ./usr/share/
drwxrwxr-x 0/0               0 2015-09-29 18:52 ./usr/share/doc/
drwxrwxr-x 0/0               0 2015-09-29 18:52 ./usr/share/doc/influxdb/
-rw-r--r-- 0/0             142 2015-09-29 18:52 ./usr/share/doc/influxdb/changelog.Debian.gz
drwxrwxr-x 0/0               0 2015-09-29 18:52 ./opt/
drwxrwxr-x 0/0               0 2015-09-29 18:52 ./opt/influxdb/
drwxrwxr-x 0/0               0 2015-09-29 18:52 ./opt/influxdb/versions/
drwxrwxr-x 0/0               0 2015-09-29 18:52 ./opt/influxdb/versions/0.9.4.2/
drwxrwxr-x 0/0               0 2015-09-29 18:52 ./opt/influxdb/versions/0.9.4.2/scripts/
-rw-rw-r-- 0/0             483 2015-09-29 18:51 ./opt/influxdb/versions/0.9.4.2/scripts/influxdb.service
-rwxrwxr-x 0/0            5759 2015-09-29 18:51 ./opt/influxdb/versions/0.9.4.2/scripts/init.sh
-rwxr-xr-x 0/0        11796648 2015-09-29 18:51 ./opt/influxdb/versions/0.9.4.2/influx
-rwxr-xr-x 0/0        17886048 2015-09-29 18:51 ./opt/influxdb/versions/0.9.4.2/influxd
drwxrwxr-x 0/0               0 2015-09-29 18:52 ./etc/
drwxrwxr-x 0/0               0 2015-09-29 18:52 ./etc/opt/
drwxrwxr-x 0/0               0 2015-09-29 18:52 ./etc/opt/influxdb/
-rw-rw-r-- 0/0            8414 2015-09-29 18:51 ./etc/opt/influxdb/influxdb.conf
drwxrwxr-x 0/0               0 2015-09-29 18:52 ./etc/logrotate.d/
-rw-rw-r-- 0/0             113 2015-09-29 18:51 ./etc/logrotate.d/influxd

Two binaries (client and daemon), init configuration, a configuration file, and a changelog. Great!

The out-of-the-box configuration is good enough to get going with; however, InfluxDB also has various listeners that enable the use of more primitive metric protocols such as graphite and collectd. I enabled the graphite listener:

[[graphite]]
  enabled = true
  bind-address = ":2003"
  protocol = "tcp"

Then configured carbon-relay-ng to relay metrics to InfluxDB:

[routes.influxdb]
patt = ""
addr = "influxdb:2003"
spool = true
pickle = false

With metrics now being relayed into InfluxDB it’s time to create some queries:

mbp0 /opt/influxdb > /opt/influxdb/bin/influx -database graphite
InfluxDB shell 0.9.4.2
> show measurements
name: measurements
------------------
name
metrics.net.server0.eth0.rx_bytes
metrics.net.server0.eth0.rx_dropped
metrics.net.server0.eth0.rx_errors
metrics.net.server0.eth0.rx_packets
metrics.net.server0.eth0.tx_bytes
metrics.net.server0.eth0.tx_dropped
metrics.net.server0.eth0.tx_errors
metrics.net.server0.eth0.tx_packets
...

> select * from "metrics.net.server0.eth0.rx_bytes"
name: metrics.net.server0.eth0.rx_bytes
-----------------------------------------
time                   value
2015-10-10T16:22:00Z   2.6120917495e+10
2015-10-10T16:24:20Z   2.6121235774e+10
2015-10-10T16:24:46Z   2.6121281251e+10
2015-10-10T16:24:50Z   2.6121288143e+10
2015-10-10T16:26:04Z   2.612146782e+10

But wait a minute, isn’t this supposed to be a columnar database?

Reading more of the docs shows I need to add a ‘template’ to the graphite listener so that the graphite data can be converted into tagged data; my graphite config now looks like this:

[[graphite]]
  enabled = true
  bind-address = ":2003"
  protocol = "tcp"
  templates = [ "metrics.net.* .measurement.host.interface.measurement" ]

This time I manually run the check from my local machine to generate some data:

mbp0 /home/rw/git/sensu-plugins master ✓ > ./metrics-net.rb --scheme metrics.$(hostname) | grep \.eth0
metrics.net.mbp0.eth0.tx_packets 12412227 1444494023
metrics.net.mbp0.eth0.rx_packets 20782213 1444494023
metrics.net.mbp0.eth0.tx_bytes 1928577400 1444494023
metrics.net.mbp0.eth0.rx_bytes 26120684821 1444494023
metrics.net.mbp0.eth0.tx_errors 0 1444494023
metrics.net.mbp0.eth0.rx_errors 60 1444494023
metrics.net.mbp0.eth0.tx_dropped 0 1444494023
metrics.net.mbp0.eth0.rx_dropped 0 1444494023
mbp0 /home/rw/git/sensu-plugins master ✓ > ./metrics-net.rb --scheme test_metrics.net.mbp0 | grep --color=never \.eth0 | nc influxdb 2003
mbp0 /home/rw/git/sensu-plugins master ✓ >

Check the data in InfluxDB:

mbp0 /opt/influxdb > ./influx -database graphite
Connected to http://localhost:8086 version 
InfluxDB shell 0.9.4.2
> show measurements
name: measurements
------------------
name
net.rx_bytes
net.rx_dropped
net.rx_errors
net.rx_packets
net.tx_bytes
net.tx_dropped
net.tx_errors
net.tx_packets
> select * from "net.rx_bytes" limit 1
name: net.rx_bytes
------------------
time                   host   interface   value
2015-10-10T16:32:14Z   mbp0   eth0        2.612265085e+10

This looks better, but the query shows that each metric is being written to the database as its own measurement with a single column called value. The host and interface columns here are in fact tags, rather than fields.
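This can be confirmed with a quick query; a sketch (the output format may vary between versions):

> show tag keys from "net.rx_bytes"
name: net.rx_bytes
------------------
tagKey
host
interface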

Let’s enable the UDP listener and write some data to the database using InfluxDB’s native line protocol.

influxdb.conf:

[[udp]]
  enabled = true
  bind-address = ":8087"
  database = "udp"
mbp0 /opt/influxdb > ./influx -execute 'create database udp'
mbp0 /opt/influxdb > echo 'test_measurement,host=localhost field1=1,field2=2,field3=3' | nc -u localhost 8087
^C
mbp0 /opt/influxdb > ./influx -database udp
Connected to http://localhost:8086 version 
InfluxDB shell 0.9.4.2
> show measurements
name: measurements
------------------
name
test_measurement

> select * from test_measurement
name: test_measurement
----------------------
time                             field1   field2   field3   host
2015-10-10T16:10:12.102611995Z   1        2        3        localhost

This is what we want, data stored in named fields.

It turns out that with the original storage engine, BZ1, it’s not only inefficient to do lookups on multi-field data, it’s also not possible to add fields to a measurement once it has been written.

Fortunately, I had some luck: the InfluxDB team were about to release their new storage engine, TSM1, which allows fields to be added to existing measurements in a database.

My patch to enable a special field keyword has been merged into master and will be part of the 0.9.5 release. For now it’s possible to use a nightly build.

From a clean system, to get columnar graphite data in InfluxDB, do the following:

Install nightly (or >= 0.9.5) InfluxDB build:

mbp0 /home/rw > wget https://s3.amazonaws.com/influxdb/influxdb_nightly_amd64.deb
--2015-10-10 17:56:54--  https://s3.amazonaws.com/influxdb/influxdb_nightly_amd64.deb
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.80.203
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.80.203|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14714390 (14M) [application/x-debian-package]
Saving to: ‘influxdb_nightly_amd64.deb’

influxdb_nightly_amd64.deb           100%[=====================================================================>]  14.03M  4.68MB/s   in 3.0s   

2015-10-10 17:56:57 (4.68 MB/s) - ‘influxdb_nightly_amd64.deb’ saved [14714390/14714390]

mbp0 /home/rw > sudo dpkg -i influxdb_nightly_amd64.deb
Selecting previously unselected package influxdb.
(Reading database ... 270013 files and directories currently installed.)
Preparing to unpack influxdb_nightly_amd64.deb ...
Unpacking influxdb (0.9.5-nightly-f1e0c59) ...
Setting up influxdb (0.9.5-nightly-f1e0c59) ...
mbp0 /home/rw >

Enable the graphite listener with templates for each of your metrics, using the special field keyword:

[[graphite]]
  enabled = true
  database = "graphite"
  bind-address = ":2003"
  protocol = "tcp"
  templates = [
    "metrics.net.* .measurement.host.interface.field"
  ]

Change storage engine to TSM1:

[data]
  engine = "tsm1"

Write some test data to InfluxDB:

mbp0 /home/rw/git/sensu-plugins master ✓ > ./metrics-net.rb --scheme metrics.net.mbp0 | grep --color=never \.eth0 | nc localhost 2003
mbp0 /home/rw/git/sensu-plugins master ✓ >

Validate configuration:

mbp0 /opt/influxdb > ./influx -database graphite
Connected to http://localhost:8086 version 0.9.5-nightly-f1e0c59
InfluxDB shell 0.9.5-nightly-f1e0c59
> show measurements
name: measurements
------------------
name
net

> select * from net
name: net
---------
time                host  interface rx_bytes          rx_dropped  rx_errors  rx_packets     tx_bytes         tx_dropped  tx_errors  tx_packets
1444497867000000000 mbp0  eth0      2.6229759902e+10  0           60         2.0903486e+07  1.947292214e+09  0           0          1.2501794e+07

Cool!

Sensu – Host Masquerading

A key part of monitoring infrastructure involves having the ability to monitor things that we can’t necessarily install a monitoring client on: switches and other network devices, external services and websites, etc..

In Nagios it’s pretty common to group active checks under virtual hosts that don’t really exist to create logical sets of checks. Sensu doesn’t yet have this ability.

There has been some discussion about the possibility of adding a masquerade feature and changing event data to drop the client info requirement in order to be able to craft event data with a custom source address. In the latter issue Kyle Anderson proposes a solution which was at one point implemented but then later reverted.

I applied Kyle’s patch to my Sensu server.rb and configured a set of checks with the :source attribute. My check data then contained a modified source, and my handlers sent messages with the modified event data. Great! Unfortunately, though, the new event data wasn’t accessible through the API. I emailed Kyle for advice and he kindly created this issue.
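For reference, a check configured with the :source attribute looks something like this (the check and device names are examples); its results are attributed to a JIT client named after the source rather than the client that ran the check:

{
  "checks": {
    "check_switch_ping": {
      "command": "check-ping.rb -h switch0.example.com",
      "standalone": true,
      "interval": 60,
      "source": "switch0"
    }
  }
}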

In order for clients to be visible in the Uchiwa frontend we need to fix sensu-api. After looking at the API code and trying a few things I eventually decided to try simply duplicating the original client data in Redis.

Duplicating the client data works well since it will get updated each time an event is processed. Each event includes a timestamp property that sensu-server uses to calculate the keepalive for each client. What this means is that our masqueraded host behaves exactly like a real host, and all functionality in the sensu-api (and as a result, the Uchiwa frontend) behaves as expected.

Sensu – Events and Graphite Graphs

Graphite Events

I learnt from jdixon’s obfuscurity blog that Graphite has a little-known feature called Events that can, unsurprisingly, be used to store events in Graphite.

Since Sensu/Uchiwa don’t have any way to see event history, I thought it would be nice to be able to see check events on related graphs, for example: CPU {WARN,CRIT,OK} on the CPU usage graph.

In order to pipe Sensu events into Graphite, I wrote a simple handler plugin that POSTs all Sensu events to the Graphite Events URI.

The following is a short write-up of how to get going with Sensu events and Graphite.

Writing Events

First, test to see if it’s possible to write to the Graphite Events URI. Unlike writing data to carbon, the Events URI expects JSON:

curl --insecure \
  -X POST \
  https://graphite.brandwatch.com:443/events/ \
  -d '{"what": "test", "tags": "test"}'

Reading Events

The event should appear in the Graphite event list:

graphite event test

Next, test to see if the event is retrievable:

curl "https://graphite.brandwatch.com/render 
  ?from=-12hours 
  &until=now 
  &width=500 
  &height=200 
  &target=drawAsInfinite(events('test'))"

Note: Since the event has no Y value, drawAsInfinite() is used to extend the X value (time) vertically so that the event is displayed as a vertical bar on the graph:

graphite event test

Sensu

Now to get Sensu check events into Graphite.

Handler

Install the handler (update: now available as part of the sensu-community-plugins: handler and config) on your Sensu server, adjusting the graphite_event.json config if necessary:

git clone \
  https://github.com/roobert/sensu_handler_graphite_event.git

cp sensu_handler_graphite_event/graphite_event.json \
  /etc/sensu/conf.d/

cp sensu_handler_graphite_event/graphite_event.rb \
  /etc/sensu/handlers/

sudo service sensu-server restart

Events

In my last post, I talked about how to embed Graphite graphs in the Uchiwa UI and used a CPU Graphite query as an example. This is the same query except that I’ve added the events targets:

curl "https://graphite.brandwatch.com/render 
  ?from=-12hours 
  &until=now 
  &width=500 
  &height=200 
  &target=collectd.<hostname>.aggregation-cpu-average.cpu-system.value 
  &target=drawAsInfinite(events('sugar', 'check-cpu', 'ok')) 
  &target=drawAsInfinite(events('sugar', 'check-cpu', 'warning')) 
  &target=drawAsInfinite(events('sugar', 'check-cpu', 'critical'))"

Here’s the result of the above query, displaying two events at about 6pm. Note that the graph time period is such that the CRITICAL and OK events are practically overlapping:

sensu events

Here are the same two events displayed on a graph with a much shorter query window (1 hour):

sensu events

Uchiwa

Finally, update the Sensu client.json with the new query:

{
   "client": {
      "name": "{{ sensu_client_hostname }}",
      "address": "{{ sensu_client_address }}",
      "subscriptions": subscriptions,
      "graphite_cpu": "https://graphite.brandwatch.com/render?from=-12hours&until=now&width=500&height=200&target=collectd.{{ ansible_hostname }}.aggregation-cpu-average.cpu-system.value&target=drawAsInfinite(events(%27{{ ansible_hostname }}%27,%27check-cpu%27,%27ok%27))&target=drawAsInfinite(events(%27{{ ansible_hostname }}%27,%27check-cpu%27,%27warning%27))&target=drawAsInfinite(events(%27{{ ansible_hostname }}%27,%27check-cpu%27,%27critical%27))&uchiwa_force_image=.jpg"
   }
}

Result:

graphite with events in uchiwa

Going Further..

It’s useful having CPU/Mem graphs visible in the client view of Uchiwa, but it’s equally possible to include graphs in a check definition so they are visible in the check view.

Creating an events() target with ‘keepalive’ as one of the tags will allow you to see changes in the overall client availability.
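For example, reusing the host from the queries above:

&target=drawAsInfinite(events('sugar', 'keepalive'))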

Next..

Next up: embedding Logstash/Kibana data in Uchiwa.

Sensu – Embedded Graphite Graphs

Earlier this year I saw a great talk entitled Please stop using Nagios (so it can die peacefully). After I’d finished laughing and picked myself up off the floor, I deployed Sensu and immediately loved it.

Months later, and I’m now experimenting with replacing the existing Nagios monitoring system we use at my new job with Sensu.

Uchiwa

One of the things I thought would be useful would be to have graphs embedded in the wonderful Uchiwa dashboard. It turns out I’m not alone because the author of Uchiwa (Simon Palourde) has plans to add support for embedding graphite graphs into Uchiwa natively. Until then, it’s still possible to get some lovely graph action going on by taking advantage of the fact Uchiwa will:

  1. display any extra properties you add to the client config JSON or check config JSON in the UI
  2. render images

Uchiwa decides what to display as an image depending on file extension type. Adding a fake argument to our graphite query tricks Uchiwa into displaying the image returned by the query inline, instead of as a link to the graph:

&uchiwa_force_display_as_image=.jpg

Graphite

I want to be able to see CPU and Memory usage for each machine when I click on the machine view. My graphite queries look like:

https://graphite.brandwatch.com/render 
  ?from=-12hours 
  &until=now 
  &width=500 
  &height=200 
  &target=collectd.<hostname>.aggregation-cpu-average.cpu-system.value

https://graphite.brandwatch.com/render 
  ?from=-12hours 
  &until=now 
  &width=500 
  &height=200 
  &target=collectd.<hostname>.memory.memory-used.value 
  &target=collectd.<hostname>.memory.memory-cached.value 
  &target=collectd.<hostname>.memory.memory-free.value 
  &target=collectd.<hostname>.memory.memory-buffered.value

Putting it Together..

Add the queries to the client config. It’s necessary to encode the single quotes (%27) and since I’m using Ansible to distribute the Sensu configuration, I’ve used {{ ansible_hostname }} in place of the hostname in each metric key.

{
   "client": {
      "name": "{{ sensu_client_hostname }}",
      "address": "{{ sensu_client_address }}",
      "subscriptions": subscriptions,
      "graphite_cpu": "https://graphite.brandwatch.com/render?from=-12hours&until=now&width=500&height=200&target=collectd.{{ ansible_hostname }}.aggregation-cpu-average.cpu-system.value&uchiwa_force_image=.jpg",
      "graphite_mem": "https://graphite.brandwatch.com/render?from=-12hours&until=now&width=500&height=200&target=collectd.{{ ansible_hostname }}.memory.memory-used.value&target=collectd.{{ ansible_hostname }}.memory.memory-cached.value&target=collectd.{{ ansible_hostname }}.memory.memory-free.value&target=collectd.{{ ansible_hostname }}.memory.memory-buffered.value&uchiwa_force_image=.jpg"
   }
}

The Result

sensu_embedded_graph0

Going Further..

Checks can also have arbitrary properties so it’s also possible to add queries to the check definitions and have them appear in the check view of Uchiwa.

Next up: adding events to graphite graphs with Sensu.