pre-commit hooks and Terraform - a safety net for your repositories

I’m the only infrastructure person on a number of my projects and it’s sometimes difficult to find someone to review pull requests. So, in self-defence, I’ve adopted git pre-commit hooks as a way to ensure I don’t make certain tedious mistakes before burning through people’s time and goodwill. In this post we’ll look at how pre-commit and Terraform can be combined.

pre-commit is “A framework for managing and maintaining multi-language pre-commit hooks” that has a comprehensive selection of community-written extensions. The extension at the core of this post will be pre-commit-terraform, which provides all the basic functionality you’ll need.

Before we start you’ll need to install pre-commit itself. You can do this via your package manager of choice. I like to run all my python code inside a virtualenv to help keep the versions isolated.
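
If you’d like to do the same, creating and activating one only takes a couple of commands (the environment name is my own):

$ virtualenv pre-commit-env
$ source pre-commit-env/bin/activate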

$ pip install pre-commit --upgrade
Successfully installed pre-commit-1.10.4

To keep the examples realistic I’m going to add the pre-commit hook to my Terraform SNS topic module, mostly because I need it on a new project and I want to resolve the issue raised against it.

# repo cloning preamble
git clone git@github.com:deanwilson/tf_sns_email.git
cd tf_sns_email/
git checkout -b add_precommit

With all the preamble done we’ll start with the simplest thing possible and build from there. First we add a basic .pre-commit-config.yaml file to the root of our repository and enable the terraform fmt hook. This hook ensures all our terraform code matches what would be produced by running terraform fmt over the codebase.

cat <<EOF > .pre-commit-config.yaml
- repo: git://github.com/antonbabenko/pre-commit-terraform
  rev: v1.7.3
  hooks:
    - id: terraform_fmt
EOF

We then install the pre-commit hook within this repo so it can start to provide our safety net.

$ pre-commit install
pre-commit installed at /tmp/tf_sns_email/.git/hooks/pre-commit

Let the pain commence! We can now run pre-commit over the repository and see what’s wrong.

$ pre-commit run --all-files
[INFO] Initializing environment for git://github.com/antonbabenko/pre-commit-terraform.
Terraform fmt............................................................Failed
hookid: terraform_fmt

Files were modified by this hook. Additional output:

main.tf
outputs.tf
variables.tf

So, what’s wrong? Only everything. A quick git diff shows that it’s not actually terrible; my indentation doesn’t match that expected by terraform fmt, so we accept the changes and commit them in. It’s also worth adding .pre-commit-config.yaml too to ensure anyone else working on this branch gets the same pre-commit checks. Once the config file is committed you should never again be able to commit incorrectly formatted code as the pre-commit hook will prevent it from getting that far.

A second run of the hook and we’re back in a good state.

$ pre-commit run --all-files
Terraform fmt..............Passed

The first base is covered, so let’s get a little more daring and ensure our terraform is valid as well as nicely formatted. This functionality is only a single line of code away as the pre-commit extension does all of the work for us:

cat <<EOF >> .pre-commit-config.yaml
    - id: terraform_validate_with_variables
EOF

This line of config enables another of the hooks. This one ensures all terraform files are valid and that all variables are set. If you have more of a module than a project and are not supplying all the possible variables you can change terraform_validate_with_variables to terraform_validate_no_variables and it will be much more lenient.
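
For a module like that the hooks section would instead end up as:

- repo: git://github.com/antonbabenko/pre-commit-terraform
  rev: v1.7.3
  hooks:
    - id: terraform_fmt
    - id: terraform_validate_no_variables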

With our new config in place we rerun the hooks and prepare to be disappointed.

$ pre-commit run --all-files
Terraform fmt..................................Passed
Terraform validate with variables..............Failed
hookid: terraform_validate_with_variables


Error: 2 error(s) occurred:

* provider.template: no suitable version installed
  version requirements: "(any version)"
  versions installed: none
* provider.aws: no suitable version installed
  version requirements: "(any version)"
  versions installed: none

And that shows how long it’s been since I’ve used this module; it predates the provider extraction work. Fixing these issues requires adding the providers, a new variable (aws_region) to allow specification of the AWS region, and some defaults. Even with those changes in place the pre-commit hook will still fail, this time because the provider plugins are absent locally, but that’s an easy one to resolve.
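
The fix took roughly this shape (a sketch: the version pin matches the output below but the variable default is my own choice; the real change is in the pull request linked at the end):

provider "aws" {
  region = "${var.aws_region}"
}

provider "template" {
  version = "1.0.0"
}

variable "aws_region" {
  description = "The AWS region to create the resources in"
  default     = "eu-west-1" # an assumed default
}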

...
* provider.template: no suitable version installed
  version requirements: "1.0.0"
  versions installed: none
...

$ terraform init

Initializing provider plugins...
- Checking for available provider plugins on https://releases.hashicorp.com...
- Downloading plugin for provider "template" (1.0.0)...
- Downloading plugin for provider "aws" (1.30.0)...

One more pre-commit run and we’re in a solid starting state.

Terraform fmt.............................Passed
Terraform validate with variables.........Passed

With all the basics covered we can go a little further and mix in the magic of terraform-docs too. By adding another line to the pre-commit config -

cat <<EOF >> .pre-commit-config.yaml
    - id: terraform_docs
EOF

And adding a placeholder anywhere in the README.md -

+### Module inputs and outputs
+
+<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
+<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
+
+

terraform-docs will be invoked and generated documentation for all of the variables and outputs will be added to the README. If they ever change you’ll need to review and commit the differences but the hooks will stop you from ever going out of sync. Now this happens automatically I can remove the manually added, and error prone, documentation for variables and outputs. And be shamed into adding some useful descriptions.
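
For reference, after all three additions the complete .pre-commit-config.yaml looks like this:

- repo: git://github.com/antonbabenko/pre-commit-terraform
  rev: v1.7.3
  hooks:
    - id: terraform_fmt
    - id: terraform_validate_with_variables
    - id: terraform_docs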

pre-commit hooks will never replace a competent pull request reviewer but they help ensure basic mistakes are never made and allow your peers to focus on the important parts of the code, like structure and intent, rather than formatting and documentation consistency. All of the code changes made in this post can be seen in the Add precommit pull request.

Managing AWS Default VPC Security Groups with Terraform

When it comes to Amazon Web Services support Terraform has coverage that’s second to none. It includes most of Amazon’s current services, rapidly adds newly released ones, and even helps granularise existing resources by adding Terraform-specific extensions for things like individual rules with aws_security_group_rule. This awesome coverage makes it even more jarring when you encounter one of the rare edge cases, such as VPC default security groups.

It’s worth taking a step back and thinking about how Terraform normally works. When you write code to manage a resource Terraform expects to fully own its life cycle. It will create it, ensure that changes made are correctly reflected (and remove those made manually), and destroy it when the resource’s code is removed from the .tf files. While this is fine for 99% of the supported Amazon resources, the VPC default security group is a little different.

Each Amazon Virtual Private Cloud (VPC) created will have a default security group provided. This is created by Amazon itself and is often undeletable. Rather than leaving it unmanaged, which happens all too often, we can instead bring it under Terraform’s control with the special aws_default_security_group resource. This resource works a little differently than most others: Terraform doesn’t attempt to create the group, instead it’s adopted under its management umbrella. This allows you to control what rules are placed in this default group and stops the ‘security group already exists’ errors that will happen if you try to manage it as a normal group.

The terraform code to add the default VPC security group looks surprisingly normal:

resource "aws_vpc" "myvpc" {
  cidr_block = "10.2.0.0/16"
}

resource "aws_default_security_group" "default" {
  vpc_id = "${aws_vpc.myvpc.id}"

  # ... snip ...
  # security group rules can go here
}

One nice little tweak I’ve found useful is to customise the default security group to only allow inbound access on port 22 from my current (very static) IP address.

# use the swiss army knife http data source to get your IP
data "http" "my_local_ip" {
  url = "https://ipv4.icanhazip.com"
}

resource "aws_security_group_rule" "ssh_from_me" {
  type        = "ingress"
  from_port   = 22
  to_port     = 22
  protocol    = "tcp"
  cidr_blocks = ["${chomp(data.http.my_local_ip.body)}/32"]

  security_group_id = "${aws_default_security_group.default.id}"
}

Automatic Terraform documentation with terraform-docs

Terraform code reuse leads to modules. Modules lead to variables and outputs. Variables and outputs lead to massive amounts of boilerplate documentation. terraform-docs lets you shortcut some of these steps and jump straight to consistent, easy to use, automatically generated documentation instead.

Terraform-docs, a self-contained Go binary released by Segment, provides an efficient way to add documentation to your terraform code without requiring large changes to your workflow or massive amounts of additional boilerplate. In its simplest invocation it reads the descriptions provided in your variables and outputs and displays them on the command line:

/**
 *
 * A sample terraform file with a variable and output
 *
 */

variable "greeting" {
  type        = "string"
  description = "The string used as a greeting"
  default     = "hello"
}

output "introduction" {
  description = "The full, polite, introduction"
  value       = "${var.greeting} from terraform"
}

Running terraform-docs against this code produces:

A sample terraform file with a variable and output

  var.greeting (hello)
  The string used as a greeting

  output.introduction
  The full, polite, introduction

This basic usage makes it simpler to use existing code by presenting the official interface without over-burdening you with implementation details. Once you’ve added descriptions to your variables and outputs, something you should really already be doing, you can start to expose the documentation in other ways. By adding the markdown option -

terraform-docs markdown .

you can generate the docs in a GitHub friendly way that provides an easy, web based, introduction to what your code accepts and returns. We used this quite heavily in the GOV.UK AWS repo and it’s been invaluable. The ability to browse an overview of the terraform code makes it simpler to determine if a specific module does what you actually need without requiring you to read all of the implementation.

A collection of terraform variables and their defaults
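
For the greeting and introduction example from earlier the generated markdown is roughly a pair of tables (a sketch; the exact columns vary between terraform-docs versions):

## Inputs

| Name     | Description                   | Type   | Default |
|----------|-------------------------------|--------|---------|
| greeting | The string used as a greeting | string | hello   |

## Outputs

| Name         | Description                    |
|--------------|--------------------------------|
| introduction | The full, polite, introduction |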

When we first adopted terraform-docs we hit issues with the code being updated without the documentation changing to match it. We soon settled on using git pre-commit hooks, such as this terraform-docs githook script by Laura Martin or the heavy handed GOV.UK update-docs script. Once we had these in place the little discrepancies stopped slipping through and the reference documentation became a lot more trusted.

As an aside, if you plan on using terraform-docs as part of your automated continuous integration pipeline you’ll probably want to create a terraform-docs package. I personally use FPM Cookery for this and it’s been an easy win so far.

I’ve become a big fan of terraform-docs and it’s great to see such self-contained tools making such a positive impact on the terraform ecosystem. If you’re writing tf code for consumption by more than just yourself (and even then) it’s well worth a second look.

Automatic datasource configuration with Grafana 5

When I first started my Prometheus experiments with docker-compose one of the most awkward parts of the process, especially to document, were the manual steps required to click around the Grafana dashboard in order to add the Prometheus datasource. Thanks to the wonderful people behind Grafana there has been a push in the newest major version, 5 at time of writing, to make Grafana easier to automate. And it really does pay off.

Instead of forcing you to load the UI and play clicky clicky games with vague instructions to go here, and then the tab on the left, no, the other left, down a bit… you can now configure the data source with a YAML file that’s loaded on startup.

# from datasource.yaml
apiVersion: 1

datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  isDefault: true
  url: http://prometheus:9090
  # don't set this to true in production
  editable: true

Because I’m using this code base in a tinkering lab I set editable to true. This allows me to make ad hoc changes. In production you’d want to make this false so people can’t accidentally break your backing store.

It only takes a little code to link everything together: add the config file and expose it to the container. You can see all the changes required in the Upgrade grafana and configure datasource via a YAML file pull request. Getting the exact YAML syntax right, and confusing myself over access proxy vs direct, was the hardest part. It’s only a single step along the way to a more automation friendly Grafana but it is an important one and a positive example that they are heading in the right direction.
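
The linking is just a volume mount into Grafana’s provisioning directory; a minimal sketch of the grafana service entry, assuming the datasource file sits alongside the compose file (the Grafana 5 provisioning path is real, the names are mine):

# from docker-compose.yaml
grafana:
  image: grafana/grafana:5.0.0
  ports:
    - "3000:3000"
  volumes:
    # Grafana 5 loads any datasource YAML placed in this directory on startup
    - ./datasource.yaml:/etc/grafana/provisioning/datasources/datasource.yaml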

Aqua Security microscanner – a first look

I’m a big fan of baking testing into build and delivery pipelines so when a new tool pops up in that space I like to take a look at what features it brings to the table and how much effort it’s going to take to roll out. The Aqua Security microscanner, from a company you’ve probably seen at least one excellent tech talk from in the last year, is quite a new release that surfaces vulnerable operating system packages in your container builds.

To experiment with microscanner I’m going to add it to my simple Gemstash Dockerfile.

FROM ubuntu:16.04
MAINTAINER dean.wilson@gmail.com

RUN apt-get update && \
    apt-get -y upgrade && \
    apt-get install -y \
      build-essential \
      ruby \
      ruby-dev \
      libsqlite3-dev \
      curl \
    && gem install --no-ri --no-rdoc gemstash

EXPOSE 9292

HEALTHCHECK --interval=15s --timeout=3s \
  CMD curl -f http://localhost:9292/ || exit 1

CMD ["gemstash", "start", "--no-daemonize"]

This is a conceptually simple Dockerfile. We update the Ubuntu package list, upgrade packages where needed, add dependencies required to build our rubygems and then install gemstash. From this very boilerplate base we only need to make a few changes for microscanner to run.

> git diff Dockerfile
diff --git a/gemstash/Dockerfile b/gemstash/Dockerfile
index 741838f..bab819a 100644
--- a/gemstash/Dockerfile
+++ b/gemstash/Dockerfile
@@ -2,7 +2,6 @@ FROM ubuntu:16.04
 MAINTAINER dean.wilson@gmail.com
 
 RUN apt-get update && \
-    apt-get -y upgrade && \
     apt-get install -y \
       build-essential \
       ruby \
@@ -11,6 +10,14 @@ RUN apt-get update && \
       curl \
     && gem install --no-ri --no-rdoc gemstash
 
+ARG token
+RUN apt-get update && apt-get -y install ca-certificates wget && \
+    wget -O /microscanner https://get.aquasec.com/microscanner && \
+    chmod +x /microscanner && \
+    /microscanner ${token} && \
+    rm -rf /microscanner
+
Firstly we remove the package upgrade step, as we want to ensure some vulnerabilities are present in the container for the scanner to find. We then use the newer ARG directive to tell Docker we will be passing a value named token in at build time. Lastly we add microscanner and its dependencies in a single image layer. As we’re using the wget and ca-certificates packages it does have a small impact on container size, but microscanner itself is added, used and removed without a trace.

You’ll notice we specify a token when running the scanner. This grants access to the Aqua scanning servers, and is rate limited. How do you get a token? You request it by calling out to the Aqua Security container with your email address:

docker run --rm -it aquasec/microscanner --register foo@mailinator.com
# ... snip ...
Aqua Security MicroScanner, version 2.6.4
Community Edition

Accept and proceed? Y/N:
y
Please check your email for the token.

Once you have the token (mine came through in seconds) you can build the container:

docker build --build-arg=token=A1A1Aaa1AaAaAAA1 --no-cache .

For this experiment I’ve taken the big hammer of --no-cache to ensure all the packages are tested on each build. This has a build time performance cost and should be considered along with the other best practices. If your container has vulnerable package versions you’ll get a massive dump of JSON in your build output. Individual packages will show their vulnerabilities:

{
  "resource": {
    "format": "deb",
    "name": "systemd",
    "version": "229-4ubuntu21.1",
    "arch": "amd64",
    "cpe": "pkg:/ubuntu:16.04:systemd:229-4ubuntu21.1",
    "name_hash": "2245b39c177e93fc015ba051be4e8574"
  },
  "scanned": true,
  "vulnerabilities": [
    {
      "name": "CVE-2018-6954",
      "description": "systemd-tmpfiles in systemd through 237 mishandles symlinks present in non-terminal path components, which allows local users to obtain ownership of arbitrary files via vectors involving creation of a directory and a file under that directory, and later replacing that directory with a symlink. This occurs even if the fs.protected_symlinks sysctl is turned on.",
      "nvd_score": 7.2,
      "nvd_score_version": "CVSS v2",
      "nvd_vectors": "AV:L/AC:L/Au:N/C:C/I:C/A:C",
      "nvd_severity": "high",
      "nvd_url": "https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-6954",
      "vendor_score": 5,
      "vendor_score_version": "Aqua",
      "vendor_severity": "medium",
      "vendor_url": "https://people.canonical.com/~ubuntu-security/cve/2018/CVE-2018-6954.html",
      "publish_date": "2018-02-13",
      "modification_date": "2018-03-16",
      "fix_version": "any in ubuntu 17.04",
      "solution": "Upgrade operating system to ubuntu version 17.04 (includes fixed versions of systemd)"
    }
  ]
}

You’ll also see some summary information: the total number of issues, run time, and container operating system values, for example.

  "vulnerability_summary": {
    "total": 147,
    "medium": 77,
    "low": 70,
    "negligible": 6,
    "score_average": 4.047619,
    "max_score": 5,
    "max_fixable_score": 5,
    "max_fixable_severity": "medium"
  },

If any of the vulnerabilities are considered to be high severity the build will fail, preventing you from going live with known issues.
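
Because the scanner runs as a RUN step a failed scan fails the image build itself, which makes wiring it into CI trivial. A sketch, with the token coming from a CI secret whose name is my own invention:

# fail the pipeline if microscanner finds high severity issues
docker build --no-cache --build-arg=token="${MICROSCANNER_TOKEN}" . \
  || { echo "image failed the microscanner check"; exit 1; }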

It’s very early days for microscanner and there’s a certain amount of inflexibility that will shake out with use, such as not being able to fail builds on medium or even low severity issues, or to show only the packages with vulnerabilities, but it’s a very easy way to add this kind of safety net to your containers and it’s worth keeping an eye on.

Validate AWS CIS security benchmarks with prowler

Despite the number of Amazon Web Services that have the word simple in their titles, keeping on top of a large cloud deployment isn’t an easy ask. There are a lot of important, complex, aspects to consider so it’s advisable to pay attention to the best practices, reference architectures, and benchmarks published by AWS and their partners. In this post we’ll take a look at the CIS security benchmark and a tool that will save you a lot of manual verifying.

CIS, the “Center For Internet Security”, publishes best practice security configuration guides that present a number of recommendations you should be aware of if you’re running production workloads in AWS. You don’t have to change your environment to suit every recommendation, or even agree with them, but you should read through the guide once and note where you’re consciously different to its advice. The guide itself, which you can find on the CIS AWS Benchmark page, or as an AWS static whitepaper link that doesn’t require an email address to read, is quite low level but well worth a read. Being aware of all the potential issues will help shape your cloud environments for the better. But, as good, lazy, admins we won’t go and check each of the recommendations by hand. Instead we’ll use a tool called Prowler.

The recommendations are terse but mostly clear. As the screenshot shows they aid in verification and remediation by presenting instructions for how to reach the given values in the web console or via the CLI.

AWS CIS example policy

Prowler, however, provides us with a third way. It has checks for most of the recommendations, and even some bonus extras, and will iterate through them and assign us a pass or fail for each. Let’s install it and run some experiments.

Installing prowler

Prowler itself is a shell script but its dependencies are Python programs, so we’ll install those into a virtualenv to keep the versions isolated.

# create a new virtual env
virtualenv prowler-sweep
cd prowler-sweep
source bin/activate

# Get prowler from github
git clone https://github.com/toniblyx/prowler
cd prowler

# install the dependencies
pip install ansi2html awscli

You now have all the code required for prowler to run a sweep of your security settings.

Running prowler

I use different profiles, configured in .aws/credentials, for most of my experiments, so for now I’ll run prowler as me but with read-only access. If you want to run this as a dedicated user or under EC2 the installation guide lists the required IAM permissions.
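
The profile itself is just a normal entry in the shared credentials file, pointing at an IAM user with those read-only permissions (sketched placeholder values):

# ~/.aws/credentials
[full-readonly]
aws_access_key_id     = AKIA...
aws_secret_access_key = ...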

./prowler -p full-readonly

  _ __  _ __ _____      _| | ___ _ __
 | '_ \| '__/ _ \ \ /\ / / |/ _ \ '__|
 | |_) | | | (_) \ V  V /| |  __/ |
 | .__/|_|  \___/ \_/\_/ |_|\___|_|v2.0-beta2
 |_| the handy cloud security tool

 Date: Wed 30 May 18:47:25 BST 2018

In its most basic mode prowler will run from the command line and show its results in glorious, colourful, ANSI.

Prowler output in glorious ANSI colour

In addition to text with control characters it can also produce basic HTML reports, or JSON and CSV for further processing and integration into your existing tools. Once you’ve finished a full sweep in your format of choice you can start to prioritise the findings, and often add remediation to your Terraform or CloudFormation code bases.
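
The output format is selected from the command line; the exact flags have moved around between prowler releases so check ./prowler -h on your version, but on mine it looked something like this:

# JSON findings for further processing (the -M mode flag is version dependent)
./prowler -p full-readonly -M json > prowler-findings.json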

Above and beyond

In addition to the CIS recommendations Prowler adds some of its own checks, for example for services that didn’t exist when the last benchmark was published, and for common operational practices that are worth following. You can even extend it yourself if you have local rules or compliance requirements. There’s a list of the additional prowler checks and their descriptions on the GitHub repository.

AWS security is a big, sprawling, topic with many moving parts, and while no third party resource will ever cover all your use cases, documents like the CIS benchmark and tools like prowler can quickly provide a baseline and safety net to ensure that if you do get breached it won’t be because of a simple oversight.

The simple vims – code comments

After finding a bug in my custom written, bulk code comment / uncomment, vim function I decided to invest a little time to find a mature replacement that would remove my maintenance burden. In addition to removing my custom code I wanted a packaged solution, to make it easier to include across all of my vim installs.

After a little googling I found the ideal solution, the vim-commentary plugin. It ticks all my check boxes:

  • mature enough that all the obvious bugs should have been found
  • receives attention when it needs it
  • has a narrow, well defined, focus
  • as a user it works the way I’d have approached it
  • And while it’s not a selection criteria, Tim Pope writing it is a big plus

I use the Vundle package manager for vim so installing commentary was quick and painless. I already have the vundle boilerplate in my .vimrc config file:

" set the runtime path to include Vundle and initialise
set rtp+=~/.vim/bundle/Vundle.vim
call vundle#begin()

" let Vundle manage Vundle, required
Plugin 'VundleVim/Vundle.vim'
" ... snip ... Lots of other plugins

call vundle#end()            " required

So all I had to do was add the new Plugin directive

" ... snip ...
Plugin 'VundleVim/Vundle.vim'
Plugin 'tpope/vim-commentary'
" ... snip ...

and then re-source the configuration and install the new plugin

:source %
:PluginInstall

Once it’s installed, using it is as easy as selecting the text you want to comment out and typing gc. You can also use gcc (which takes a count) to comment out the current line. To uncomment code repeat the operation. It’s predictable enough that your muscle memory will learn it quickly. If you want to change the comment style, for example puppet code defaults to the horrible /* file { '/tmp/foo': */ format, you can override the default by adding an autocmd line to your .vimrc:

    autocmd FileType puppet setlocal commentstring=#\ %s
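
Day to day the whole plugin boils down to a handful of keystrokes:

gcc    " toggle comments on the current line
5gcc   " toggle comments on the next five lines
gcap   " gc also works as an operator, here with the 'around paragraph' motion
gc     " in visual mode, toggle comments on the current selection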

I replaced my own custom code with commentary a few weeks ago and it’s quickly become a great, intuitive, replacement. If you use vim for writing code and want a simple way to comment and uncomment blocks it’s an excellent choice.

Viewing AlertManager Email Alerts via MailHog

After adding AlertManager to my Prometheus test stack in a previous post I spent some time triggering different failure cases and generating test messages. While it’s slightly satisfying seeing rows change from green to red I soon wanted to actually send real alerts, with all their values, somewhere I could easily view them. My criteria were:

  • must be easy to integrate with AlertManager
  • must not require external network access
  • must be easy to use from docker-compose
  • should have as few moving parts as possible

A few short web searches later I stumbled back onto a small server I’ve used for this in the past - MailHog. MailHog is an awesome little server that listens for SMTP traffic and then displays it using an internal HTTP server. It has sensible defaults so no configuration was required, comes as a single binary and even has a working dockerhub image. My solution was found!

The amount of work to include it was even less than I’d hoped: a new docker-compose.yaml file for MailHog itself, a very basic AlertManager configuration file, and a few lines of docker config to put the right configs in each of the containers, and we have a working email alert view:

MailHog screen shot of Alertmanager emails
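
If you want to recreate it the two interesting fragments look roughly like this (a sketch: MailHog’s default ports are real, the service name and addresses are mine, and the actual files live in the repository):

# mailhog/docker-compose.yaml - the MailHog service entry
mailhog:
  image: mailhog/mailhog
  ports:
    - "8025:8025"   # the web UI; SMTP listens on 1025 inside the network

# alertmanager.yml - route every alert to MailHog over SMTP
route:
  receiver: email-me

receivers:
  - name: email-me
    email_configs:
      - to: 'alerts@example.org'
        from: 'alertmanager@example.org'
        smarthost: 'mailhog:1025'
        require_tls: false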

Adding AlertManager to docker-compose Prometheus

What’s the use of monitoring if you can’t raise alerts? It’s half a solution at best and now I have basic monitoring working, as discussed in Prometheus experiments with docker-compose, it felt like it was time to add AlertManager, Prometheus’ often-used partner in crime, so I can investigate raising, handling and resolving alerts. Unfortunately this turned out to be a lot harder than ‘just’ adding a basic exporter.

Before we delve into the issues and how I worked around them in my implementation let’s see the result of all the work, adding a redis alert and forcing it to trigger. Ignoring all the implementation details for now we need to do four things to add AlertManager to our experiments:

  • add the AlertManager container
  • tell Prometheus how to contact AlertManager
  • tell Prometheus where the alert rules files are located
  • add an alerting rule to confirm everything is connected

Assuming we’re in the root of docker-compose-prometheus we’ll run our docker-compose command to create all the instances we need for testing:

docker-compose \
  -f prometheus-server/docker-compose.yaml \
  -f alertmanager-server/docker-compose.yaml \
  -f redis-server/docker-compose.yaml \
  up -d

You can confirm all the containers are available by running:

docker-compose \
  -f prometheus-server/docker-compose.yaml \
  -f alertmanager-server/docker-compose.yaml \
  -f redis-server/docker-compose.yaml \
  ps

Screen shot of Prometheus alerting rule

In this screenshot you can see the Prometheus alerting page, with our RedisDown alert against a green background as everything is working correctly. We also show the RedisDown alerting rule configuration. This rule checks the redis_up value returned by the redis exporter. If redis is down it will be 0, and if it doesn’t recover within a minute the alert will trigger. It’s worth noting here that you can confirm your rules files are valid using this, less scary than it looks, promtool command:

# the left hand argument to `-v` is the local file from this repo.
docker run \
  -v `pwd`/redis-server/redis.rules:/fileof.rules \
  -it --entrypoint=promtool prom/prometheus:v2.1.0 check rules /fileof.rules

Checking /fileof.rules
  SUCCESS: 1 rules found
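
The rule itself, in the Prometheus 2.x rule file format, looks something like this (a sketch reconstructed from the behaviour described above; the real version lives in the repository):

# redis-server/redis.rules
groups:
  - name: redis
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        annotations:
          summary: Redis Availability alert.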

Everything seems to be configured correctly, so let’s break it and confirm alerting is working. First we kill the redis container, which will cause the exporter to change the value of redis_up.

# kill the container
docker kill prometheusserver_redis-server_1

# check it has exited
docker ps -a | grep prometheusserver_redis-server_1

# simplified output
library/redis:4.0.8    Exited (137) 2 minutes ago    prometheusserver_redis-server_1

The alert will then change to “State PENDING” on the prometheus alerts page. Once the minute is up it will change to “State FIRING” and, if everything is working, appear in AlertManager too.

Screen shot of a triggered Prometheus alerting rule

In addition to using the web UI you can query alertmanager directly from the command line using the docker container:

docker exec -ti prometheusserver_alert-manager_1 amtool \
  --alertmanager.url http://127.0.0.1:9093 alert

Alertname  Starts At                Summary
RedisDown  2018-03-09 18:33:58 UTC  Redis Availability alert.

At this point we have a basic but working AlertManager running alongside our local prometheus. It’s far from a complete or comprehensive configuration, and the alerts don’t yet go anywhere, but it’s a solid base to start your own experiments from. You can see all the code to make this work in the add_alert_manager branch.

Now we’ve covered how AlertManager fits into our tests and how to confirm it’s working we will delve into how it’s configured, something that was much more work than I expected. Prometheus, by design, runs with a single configuration file. While this is fine for a number of use cases, my design goal of combining any combination of docker-compose files to create a test environment doesn’t play well with it. This became clear to me when I needed to add the alertmanager configuration to the main config file, but only when alertmanager is included. The config to enable AlertManager and its alerting rules is concise:

rule_files:
  - "/etc/prometheus/*.rules"

alerting:
  alertmanagers:
    - static_configs:
      - targets: ['alert-manager:9093']

The first part, rule_files:, accepts wild card selection of alert rule files. Each of these files contains one or more alert rules, such as our RedisDown example above. This globbing makes it easy to add rules to prometheus from each included component. The second part tells prometheus where it can find the alertmanager instance it should raise alerts with.

In order to use these configs I had to add another step to running prometheus; collecting all the configuration snippets and combining them into a single file before starting the process. My first thought was to create my own Prometheus container and preprocess the configuration before starting the daemon. I quickly decided against this as I don’t want to be responsible for maintaining my own fork of the Dockerfile. I was also worried about timing issues and start up race conditions from all the other containers adding their configs. Instead I decided to add another container.

This tiny busybox based container, which I named promconf-concat, runs a short shell script in a loop. This code concatenates all the configuration fragments, starting with the base config, together. If the complete config file has changed it replaces the existing, volume mounted, file which prometheus then detects as changed and reloads.
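
The script is only a few lines; a sketch of the approach, with the paths and sleep interval being my assumptions:

#!/bin/sh
# promconf-concat: rebuild the prometheus config from its fragments, forever
while true; do
  # base config first, then any fragments the other containers have mounted
  cat /base/prometheus.yaml /fragments/*.yaml > /tmp/prometheus.yaml

  # only replace the live, volume mounted, file when something has changed;
  # prometheus then notices the new config and reloads
  if ! cmp -s /tmp/prometheus.yaml /prometheus/prometheus.yaml; then
    cp /tmp/prometheus.yaml /prometheus/prometheus.yaml
  fi

  sleep 15
done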

I have a strong suspicion I’ll be revisiting this part of the project again and splitting the fragments further. Adding ordering will probably be required as some of the exporters (such as MySQL) can’t be configured as targets via the file_sd_configs mechanism. However for now it’s allowed me to test the basic alerting functionality and continue to delve more deeply into Prometheus.

Green system percentage vs user visible issues

How much of your system does your internal monitoring need to consider down before something is user visible? While there will always be the perfect chain of three or four things that can cripple a chunk of your customer visible infrastructure, there are often a lot of low importance checks that will flare up and consume time and attention. But what’s the ratio?

As a small thought experiment on one project I’ve recently started to leave a new, very simple, four panel Grafana dashboard open on a Raspberry Pi driven monitor. It shows the percentage of the internal monitoring checks that are currently in a successful state, next to the number of user visible issues and incidents. I’ve found watching the percentage of the system that’s working rise and fall without anyone outside the company, and often the team, noticing to be strangely hypnotic. I’ve also added a couple of panels to show the number of events of each of those types over the last hour.

Fugly Dashboard showing 4 panels described in the page

I was hoping the numbers would provide some inspiration towards questions like “Are we monitoring at the right level?” and “Do we need to be running all of these at this frequency?” but so far I’ve mostly found it reassuring that the system can withstand small internal failures, while also worrying about the amount of state churn it seems to detect. While it’s not been as helpful as alert summary roll ups it has been a great source of visual white noise while thinking about other alerting issues.