Aqua Security microscanner – a first look

I’m a big fan of baking testing into build and delivery pipelines, so when a new tool pops up in that space I like to take a look at what features it brings to the table and how much effort it’s going to take to roll out. The Aqua Security microscanner, from a company you’ve probably seen at least one excellent tech talk from in the last year, is quite a new release that surfaces vulnerable operating system packages in your container builds.

To experiment with microscanner I’m going to add it to my simple Gemstash Dockerfile.

FROM ubuntu:16.04
MAINTAINER dean.wilson@gmail.com

RUN apt-get update && \
    apt-get -y upgrade && \
    apt-get install -y \
      build-essential \
      ruby \
      ruby-dev \
      libsqlite3-dev \
      curl \
    && gem install --no-ri --no-rdoc gemstash

EXPOSE 9292

HEALTHCHECK --interval=15s --timeout=3s \
  CMD curl -f http://localhost:9292/ || exit 1

CMD ["gemstash", "start", "--no-daemonize"]

This is a conceptually simple Dockerfile. We update the Ubuntu package list, upgrade packages where needed, add dependencies required to build our rubygems and then install gemstash. From this very boilerplate base we only need to make a few changes for microscanner to run.

> git diff Dockerfile
diff --git a/gemstash/Dockerfile b/gemstash/Dockerfile
index 741838f..bab819a 100644
--- a/gemstash/Dockerfile
+++ b/gemstash/Dockerfile
@@ -2,7 +2,6 @@ FROM ubuntu:16.04
 MAINTAINER dean.wilson@gmail.com

 RUN apt-get update && \
-    apt-get -y upgrade && \
     apt-get install -y \
       build-essential \
       ruby \
@@ -11,6 +10,14 @@ RUN apt-get update && \
       curl \
     && gem install --no-ri --no-rdoc gemstash

+ARG token
+RUN apt-get update && apt-get -y install ca-certificates wget && \
+    wget -O /microscanner https://get.aquasec.com/microscanner && \
+    chmod +x /microscanner && \
+    /microscanner ${token} && \
+    rm -rf /microscanner
+

Firstly we remove the package upgrade step, as we want to ensure vulnerabilities are present in our container. We then use the newer ARG directive to tell Docker we will be passing a value named token in at build time. Lastly we add microscanner and its dependencies, run it and remove it, all in a single image layer. As we’re installing the wget and ca-certificates packages it does have a small impact on container size but microscanner itself is added, used and removed without a trace.

You’ll notice we specify a token when running the scanner. This grants access to the Aqua scanning servers, and is rate limited. How do you get a token? You request it by calling out to the Aqua Security container with your email address:

docker run --rm -it aquasec/microscanner --register foo@mailinator.com
# ... snip ...
Aqua Security MicroScanner, version 2.6.4
Community Edition

Accept and proceed? Y/N:
y
Please check your email for the token.

Once you have the token (mine came through in seconds) you can build the container:

docker build --build-arg=token=A1A1Aaa1AaAaAAA1 --no-cache .

For this experiment I’ve taken the big hammer of --no-cache to ensure all the packages are tested on each build. This slows every build down, so weigh it against your other Docker best practices. If your container has vulnerable package versions you’ll get a massive dump of JSON in your build output. Individual packages will show their vulnerabilities:

{
      "resource": {
        "format": "deb",
        "name": "systemd",
        "version": "229-4ubuntu21.1",
        "arch": "amd64",
        "cpe": "pkg:/ubuntu:16.04:systemd:229-4ubuntu21.1",
        "name_hash": "2245b39c177e93fc015ba051be4e8574"
      },
      "scanned": true,
      "vulnerabilities": [
        {
          "name": "CVE-2018-6954",
          "description": "systemd-tmpfiles in systemd through 237 mishandles symlinks present in non-terminal path components, which allows local users to obtain ownership of arbitrary files via vectors involving creation of a directory and a file under that directory, and later replacing that directory with a symlink. This occurs even if the fs.protected_symlinks sysctl is turned on.",
          "nvd_score": 7.2,
          "nvd_score_version": "CVSS v2",
          "nvd_vectors": "AV:L/AC:L/Au:N/C:C/I:C/A:C",
          "nvd_severity": "high",
          "nvd_url": "https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-6954",
          "vendor_score": 5,
          "vendor_score_version": "Aqua",
          "vendor_severity": "medium",
          "vendor_url": "https://people.canonical.com/~ubuntu-security/cve/2018/CVE-2018-6954.html",
          "publish_date": "2018-02-13",
          "modification_date": "2018-03-16",
          "fix_version": "any in ubuntu 17.04",
          "solution": "Upgrade operating system to ubuntu version 17.04 (includes fixed versions of systemd)"
        }
      ]
}

You’ll also see some summary information, total number of issues, run time and container operating system values for example.

  "vulnerability_summary": {
    "total": 147,
    "medium": 77,
    "low": 70,
    "negligible": 6,
    "score_average": 4.047619,
    "max_score": 5,
    "max_fixable_score": 5,
    "max_fixable_severity": "medium"
  },

If any of the vulnerabilities are considered to be high in severity then the build should fail, preventing you from going live with known issues.

It’s very early days for microscanner and there’s a certain amount of inflexibility that will shake out over use, such as being able to fail builds on medium or even low severity issues, or only show packages with vulnerabilities, but it’s a very easy way to add this kind of safety net to your containers and worth keeping an eye on.
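
Until the scanner grows those options you can post-process its JSON yourself. The following is a minimal sketch, not part of microscanner: it assumes the report has a top-level "resources" list whose entries look like the systemd example above (that key name is my assumption, as is the placeholder curl entry), and it lists only the packages with vulnerabilities at or above a chosen severity.

```python
import json

SEVERITY_ORDER = ["negligible", "low", "medium", "high"]


def vulnerable_packages(report, threshold="medium"):
    """Return (package, version, cve, severity) tuples at or above threshold.

    Assumes a top-level "resources" list; that key name is an assumption.
    The vendor severity is preferred, falling back to the NVD severity.
    """
    minimum = SEVERITY_ORDER.index(threshold)
    findings = []
    for entry in report.get("resources", []):
        resource = entry.get("resource", {})
        for vuln in entry.get("vulnerabilities") or []:
            severity = vuln.get("vendor_severity") or vuln.get("nvd_severity", "")
            if severity in SEVERITY_ORDER and SEVERITY_ORDER.index(severity) >= minimum:
                findings.append(
                    (resource.get("name"), resource.get("version"), vuln.get("name"), severity)
                )
    return findings


# A cut-down report using the fields from the systemd example.
# The curl entry and its version are placeholders for illustration only.
sample_report = json.loads("""
{
  "resources": [
    {
      "resource": {"name": "systemd", "version": "229-4ubuntu21.1"},
      "vulnerabilities": [
        {"name": "CVE-2018-6954", "nvd_severity": "high", "vendor_severity": "medium"}
      ]
    },
    {
      "resource": {"name": "curl", "version": "0.0-example"},
      "vulnerabilities": []
    }
  ]
}
""")

findings = vulnerable_packages(sample_report, threshold="medium")
for name, version, cve, severity in findings:
    print(f"{name} {version}: {cve} ({severity})")

# A CI wrapper could exit with this value to fail the build.
exit_code = 1 if findings else 0
```

Lowering the threshold to "low" or raising it to "high" changes which findings would fail the build, which is the flexibility I’d like to see in the tool itself.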

Validate AWS CIS security benchmarks with prowler

Despite the number of Amazon Web Services that have the word simple in their titles, keeping on top of a large cloud deployment isn’t an easy ask. There are a lot of important, complex, aspects to consider so it’s advisable to pay attention to the best practices, reference architectures, and benchmarks published by AWS and their partners. In this post we’ll take a look at the CIS security benchmark and a tool that will save you a lot of manual verifying.

CIS, the “Center For Internet Security”, publish best practice security configuration guides that present a number of recommendations you should be aware of if you’re running production workloads in AWS. You don’t have to change your environment to suit every recommendation, or even agree with them, but you should read through the guide once and note where you consciously differ from its advice. The guide itself, which you can find on the CIS AWS Benchmark page, or as an AWS static whitepaper link that doesn’t require an email address to read, is quite low level but well worth a read. Being aware of all the potential issues will help shape your cloud environments for the better. But, as good, lazy, admins we won’t go and check each of the recommendations by hand. Instead we’ll use a python application called Prowler.

The recommendations are terse but mostly clear. As the screenshot shows they aid in verification and remediation by presenting instructions for how to reach the given values in the web console or via the CLI.

AWS CIS example policy

Prowler however provides us a third way. It has checks for most of the recommendations, and even some bonus extras, and will iterate through them and assign us a pass or fail for each. Let’s install it and run some experiments.

Installing prowler

Prowler is a python program so we’ll install it, and the required dependencies, into a virtualenv to keep the versions isolated.

# create a new virtual env
virtualenv prowler-sweep
cd prowler-sweep
source bin/activate

# Get prowler from github
git clone https://github.com/toniblyx/prowler
cd prowler

# install the dependencies
pip install ansi2html awscli

You now have all the code required for prowler to run a sweep of your security settings.

Running prowler

I use different profiles, configured in .aws/credentials, for most of my experiments so for now I’ll run prowler as me, but with read only access. If you want to run this as a dedicated user or under EC2 the installation guide lists the required IAM permissions.

./prowler -p full-readonly

  _ __  _ __ _____      _| | ___ _ __
 | '_ \| '__/ _ \ \ /\ / / |/ _ \ '__|
 | |_) | | | (_) \ V  V /| |  __/ |
 | .__/|_|  \___/ \_/\_/ |_|\___|_|v2.0-beta2
 |_| the handy cloud security tool

 Date: Wed 30 May 18:47:25 BST 2018

In its most basic mode prowler will run from the command line and show its results in glorious, colourful, ANSI.

Prowler output in glorious ANSI colour

In addition to text with control characters it can also provide basic HTML reports or even JSON and CSV for further processing and integration into your existing tools. Once you’ve finished a full sweep in your format of choice you can start to prioritise the findings and often add remediation to your Terraform or CloudFormation code bases.

Above and beyond

In addition to the CIS recommendations Prowler adds some of its own checks, for example for services that didn’t exist when the last benchmark was published, and for common operational practices that are worth following. You can even extend it yourself if you have local rules or compliance requirements. There’s a list of additional prowler checks and descriptions on the GitHub repository.

AWS security is a big, sprawling, topic with many moving parts, and while no third party resource will ever cover all your use cases documents like the CIS benchmark and tools like prowler can help quickly provide a baseline and safety net to ensure if you do get breached it won’t be because of a simple oversight.

The simple vims – code comments

After finding a bug in my custom written, bulk code comment / uncomment, vim function I decided to invest a little time to find a mature replacement that would remove my maintenance burden. In addition to removing my custom code I wanted a packaged solution, to make it easier to include across all of my vim installs.

After a little googling I found the ideal solution, the vim-commentary plugin. It ticks all my check boxes:

  • mature enough all the obvious bugs should have been found
  • receives attention when it needs it
  • has a narrow, well defined, focus
  • as a user it works the way I’d have approached it
  • And while it’s not a selection criterion, Tim Pope writing it is a big plus

I use the Vundle package manager for vim so installing commentary was quick and painless. I already have the vundle boilerplate in my .vimrc config file:

" set the runtime path to include Vundle and initialise
set rtp+=~/.vim/bundle/Vundle.vim
call vundle#begin()

" let Vundle manage Vundle, required
Plugin 'VundleVim/Vundle.vim'
" ... snip ... Lots of other plugins

call vundle#end()            " required

So all I had to do was add the new Plugin directive

" ... snip ...
Plugin 'VundleVim/Vundle.vim'
Plugin 'tpope/vim-commentary'
" ... snip ...

and then re-source the configuration and install the new plugin

:source %
:PluginInstall

Once it’s installed using it is as easy as selecting the text you want to comment out and typing gc. You can also use gcc (which can take a count) to comment out the current line. To uncomment code repeat the operation. Predictable enough that your muscle memory will learn it quickly. If you want to change the comment style, for example puppet code defaults to the horrible /* file { '/tmp/foo': */ format, you can override the default by adding an autocmd line to your .vimrc

    autocmd FileType puppet setlocal commentstring=#\ %s

I replaced my own custom code with commentary a few weeks ago and it’s quickly become a great, intuitive, replacement. If you use vim for writing code and want a simple way to comment and uncomment blocks it’s an excellent choice.

Viewing AlertManager Email Alerts via MailHog

After adding AlertManager to my Prometheus test stack in a previous post I spent some time triggering different failure cases and generating test messages. While it’s slightly satisfying seeing rows change from green to red I soon wanted to actually send real alerts, with all their values, somewhere I could easily view them. My criteria were:

  • must be easy to integrate with AlertManager
  • must not require external network access
  • must be easy to use from docker-compose
  • should have as few moving parts as possible

A few short web searches later I stumbled back onto a small server I’ve used for this in the past - MailHog. MailHog is an awesome little server that listens for SMTP traffic and then displays it using an internal HTTP server. It has sensible defaults so no configuration was required, comes as a single binary and even has a working dockerhub image. My solution was found!

The amount of work to include it was even less than I’d hoped. A new docker-compose.yaml file for mailhog itself, a very basic AlertManager configuration file and a few lines of docker config to put the right configs in each of the containers later and we have a working email alert view:

MailHog screen shot of Alertmanager emails

Adding AlertManager to docker-compose Prometheus

What’s the use of monitoring if you can’t raise alerts? It’s half a solution at best and now I have basic monitoring working, as discussed in Prometheus experiments with docker-compose, it felt like it was time to add AlertManager, Prometheus’ often used partner in crime, so I can investigate raising, handling and resolving alerts. Unfortunately this turned out to be a lot harder than ‘just’ adding a basic exporter.

Before we delve into the issues and how I worked around them in my implementation let’s see the result of all the work, adding a redis alert and forcing it to trigger. Ignoring all the implementation details for now we need to do four things to add AlertManager to our experiments:

  • add the AlertManager container
  • tell Prometheus how to contact AlertManager
  • tell Prometheus where the alert rules files are located
  • add an alerting rule to confirm everything is connected

Assuming we’re in the root of docker-compose-prometheus we’ll run our docker-compose command to create all the instances we need for testing:

docker-compose \
  -f prometheus-server/docker-compose.yaml \
  -f alertmanager-server/docker-compose.yaml \
  -f redis-server/docker-compose.yaml \
  up -d

You can confirm all the containers are available by running:

docker-compose \
  -f prometheus-server/docker-compose.yaml \
  -f alertmanager-server/docker-compose.yaml \
  -f redis-server/docker-compose.yaml \
  ps

Screen shot of Prometheus alerting rule

In this screenshot you can see the Prometheus alerting page, with our RedisDown alert against a green background as everything is working correctly. We also show the RedisDown AlertManager rule configuration. This rule checks the redis_up value returned by the redis exporter. If redis is down it will be 0, and if it doesn’t recover in the next minute it will trigger an alert. It’s worth noting here that you can confirm your rules files are valid using this, less scary than it looks, promtool command:

# the left hand argument to `-v` is the local file from this repo.
docker run \
  -v `pwd`/redis-server/redis.rules:/fileof.rules \
  -it --entrypoint=promtool prom/prometheus:v2.1.0 check rules /fileof.rules

Checking /fileof.rules
  SUCCESS: 1 rules found

Everything seems to be configured correctly, so let’s break it and confirm alerting is working. First we will kill the redis container. This will cause the exporter to change the value of redis_up.

# kill the container
docker kill prometheusserver_redis-server_1

# check it has exited
docker ps -a | grep prometheusserver_redis-server_1

# simplified output
library/redis:4.0.8    Exited (137) 2 minutes ago    prometheusserver_redis-server_1

The alert will then change to “State PENDING” on the prometheus alerts page. Once the minute is up it will change to “State FIRING” and, if everything is working, appear in AlertManager too.

Screen shot of a triggered Prometheus alerting rule
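
The move from pending to firing is easy to model. This is a simplified sketch of the behaviour described above, not Prometheus’ own evaluation code: the function name and the sample timestamps are mine, and it assumes a time-ordered series of redis_up samples where 0 means redis is down.

```python
from datetime import datetime, timedelta

FOR_DURATION = timedelta(minutes=1)  # the one minute grace period from the rule


def alert_state(samples, now, for_duration=FOR_DURATION):
    """Return "inactive", "pending" or "firing" for a redis_up style series.

    samples is a time-ordered list of (timestamp, value) pairs where a value
    of 0 means redis is down. Any non-zero sample resets the clock.
    """
    down_since = None
    for timestamp, value in samples:
        if value == 0:
            if down_since is None:
                down_since = timestamp
        else:
            down_since = None
    if down_since is None:
        return "inactive"
    if now - down_since >= for_duration:
        return "firing"
    return "pending"


start = datetime(2018, 3, 9, 18, 32, 0)
samples = [
    (start, 1),
    (start + timedelta(seconds=15), 0),
    (start + timedelta(seconds=30), 0),
]

print(alert_state(samples, now=start + timedelta(seconds=45)))  # pending
print(alert_state(samples, now=start + timedelta(seconds=90)))  # firing
```

If redis recovers before the minute is up the clock resets and the alert drops back to inactive without ever reaching AlertManager.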

In addition to using the web UI you can directly query alertmanager via the command line using the docker container

docker exec -ti prometheusserver_alert-manager_1 amtool \
  --alertmanager.url http://127.0.0.1:9093 alert

Alertname  Starts At                Summary
RedisDown  2018-03-09 18:33:58 UTC  Redis Availability alert.

At this point we have a basic but working AlertManager running alongside our local prometheus. It’s far from a complete or comprehensive configuration, and the alerts don’t yet go anywhere, but it’s a solid base to start your own experiments from. You can see all the code to make this work in the add_alert_manager branch.

Now we’ve covered how AlertManager fits into our tests and how to confirm it’s working we will delve into how it’s configured, something that was much more work than I expected. Prometheus, by design, runs with a single configuration file. While this is fine for a number of use cases, my design goal of combining any combination of docker-compose files to create a test environment doesn’t play well with it. This became clear to me when I needed to add the alertmanager configuration to the main config file, but only when alertmanager is included. The config to enable AlertManager and its alerting rules is concise:

rule_files:
  - "/etc/prometheus/*.rules"

alerting:
  alertmanagers:
    - static_configs:
      - targets: ['alert-manager:9093']

The first part, rule_files:, accepts wild card selection of alert rule files. Each of these files contains one or more alert rules, such as our RedisDown example above. This globbing makes it easy to add rules to prometheus from each included component. The second part tells prometheus where it can find the alertmanager instance it should raise alerts with.

In order to use these configs I had to add another step to running prometheus; collecting all the configuration snippets and combining them into a single file before starting the process. My first thought was to create my own Prometheus container and preprocess the configuration before starting the daemon. I quickly decided against this as I don’t want to be responsible for maintaining my own fork of the Dockerfile. I was also worried about timing issues and start up race conditions from all the other containers adding their configs. Instead I decided to add another container.

This tiny busybox based container, which I named promconf-concat, runs a short shell script in a loop. This code concatenates all the configuration fragments, starting with the base config, together. If the complete config file has changed it replaces the existing, volume mounted, file which prometheus then detects as changed and reloads.
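
The real container uses a shell script, but the logic is small enough to sketch in a few lines of Python. Treat this as an illustration of the approach rather than the code in the repository: the function names, the *.yml fragment naming and the file layout in the demo are all assumptions of mine.

```python
import tempfile
import time
from pathlib import Path


def build_config(base_file, fragment_dir):
    """Concatenate the base config and every fragment, base first."""
    parts = [Path(base_file).read_text()]
    # Sorting gives the fragments a stable, predictable order.
    for fragment in sorted(Path(fragment_dir).glob("*.yml")):
        parts.append(fragment.read_text())
    return "\n".join(part.rstrip("\n") for part in parts) + "\n"


def refresh(base_file, fragment_dir, target_file):
    """Rewrite target_file only when the combined config has changed.

    Returns True if the file was written, False if it was already current.
    """
    combined = build_config(base_file, fragment_dir)
    target = Path(target_file)
    if target.exists() and target.read_text() == combined:
        return False
    target.write_text(combined)
    return True


def watch(base_file, fragment_dir, target_file, interval=5):
    """Loop forever, refreshing the combined config every interval seconds."""
    while True:
        refresh(base_file, fragment_dir, target_file)
        time.sleep(interval)


# A small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    work = Path(workdir)
    fragments = work / "fragments"
    fragments.mkdir()
    (work / "base.yml").write_text("global:\n  scrape_interval: 15s\n")
    (fragments / "alerting.yml").write_text('rule_files:\n  - "/etc/prometheus/*.rules"\n')
    target = work / "prometheus.yml"
    first = refresh(work / "base.yml", fragments, target)   # writes the file
    second = refresh(work / "base.yml", fragments, target)  # nothing changed
    combined = target.read_text()
```

Only writing the file when the content differs matters here, as otherwise every pass of the loop would look like a configuration change.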

I have a strong suspicion I’ll be revisiting this part of the project again and splitting the fragments more. Adding ordering will probably be required as some of the exporters (such as MySQL) can’t be configured as targets via the file_sd_configs mechanism. However for now it’s allowed me to test the basic alerting functionality and continue to delve more deeply into Prometheus.

Green system percentage vs user visible issues

How much of your system does your internal monitoring need to consider down before something is user visible? While there will always be the perfect chain of three or four things that can cripple a chunk of your customer visible infrastructure there are often a lot of low importance checks that will flare up and consume time and attention. But what’s the ratio?

As a small thought experiment on one project I’ve recently started to leave a new, very simple four panel, Grafana dashboard open on a Raspberry PI driven monitor that shows the percentage of the internal monitoring checks that are currently in a successful state next to the number of user visible issues and incidents. I’ve found watching the percentage of the system that’s working rise and fall without anyone outside the company, and often the team, noticing to be strangely hypnotic. I’ve also added a couple of panels to show the number of events of each of those types over the last hour.

Fugly Dashboard showing 4 panels described in the page

I was hoping the numbers would provide some inspiration towards questions like, “Are we monitoring at the right level?”, “Do we need to be running all of these at this frequency?” and similar questions but so far I’ve mostly found it to be reassuring that it can withstand small internal failures while also worrying about the amount of state churn it seems to detect. While it’s not been as helpful as alert summary roll ups it has been a great source of visual white noise while thinking about other alerting issues.

Prometheus experiments with docker-compose

As 2018 rolls along the time has come to rebuild parts of my homelab again. This time I’m looking at my monitoring and metrics setup, which is based on sensu and graphite, and planning some experiments and evaluations using Prometheus. In this post I’ll show how I’m setting up my tests and provide the Prometheus experiments with docker-compose source code in case it makes your own experiments a little easier to run.

My starting requirements were fairly standard. I want to use containers where possible. I want to test lots of different backends and I want to be able to pick and choose which combinations of technologies I run for any particular tests. As an example I have a few little applications that make use of redis and some that use memcached, but I don’t want to be committed to running all of the backing services for each smaller experiment. In terms of technology I settled on docker-compose to help keep the container sprawl in check while also enabling me to specify all the relationships. While looking into compose I found Understanding multiple Compose files and my basic structure began to emerge.

Starting with prometheus and grafana themselves I created the prometheus-server directory and added a basic prometheus config file to configure the service. I then added configuration for each of the things it was to collect from; prometheus and grafana in this case. Once these were in place I added the prometheus and grafana docker-compose.yaml file and created the stack.

docker-compose -f prometheus-server/docker-compose.yaml up -d

docker-compose -f prometheus-server/docker-compose.yaml ps

> docker-compose -f prometheus-server/docker-compose.yaml ps
        Name                   Command       State   Ports
-----------------------------------------------------------------------
prometheusserver_grafana_1     /run.sh       Up  0.0.0.0:3000->3000/tcp
prometheusserver_prometheus_1  /bin/prom ... Up  0.0.0.0:9090->9090/tcp

After manually configuring the prometheus data source in Grafana, all of which is covered in the README, you have a working prometheus scraping itself and grafana, and a grafana that allows you to experiment with presenting the data.

While this is a good first step I need visibility into more than the monitoring system itself, so it’s time to add another service. Keeping our goal of being modular in mind I decided to break everything out into separate directories and isolate the configuration. Adding a new service is as simple as adding a redis-server directory and writing a docker-compose file to run redis and the prometheus exporter we use to get metrics from it. This part is simple as most of the work is done for us. We use third party docker containers and everything is up and running. But how do we add the redis exporter to the prometheus targets? That’s where docker-compose’s merging behaviour shines.

In our base docker-compose.yaml file we define the prometheus service and the volumes assigned to it:

services:
  prometheus:
    image: prom/prometheus:v2.1.0
    ports:
      - 9090:9090
    networks:
      - public
    volumes:
      - prometheus_data:/prometheus
      - ${PWD}/prometheus-server/config/prometheus.yml:/etc/prometheus/prometheus.yml
      - ${PWD}/prometheus-server/config/targets/prometheus.json:/etc/prometheus/targets/prometheus.json
      - ${PWD}/prometheus-server/config/targets/grafana.json:/etc/prometheus/targets/grafana.json
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

You can see we’re mounting individual target files in to prometheus for it to probe. Now in our docker-compose-prometheus/redis-server/docker-compose.yaml file we’ll reference back to the existing prometheus service and add to the volumes array.

  prometheus:
    volumes:
      - ${PWD}/redis-server/redis.json:/etc/prometheus/targets/redis.json

Rather than overriding the array this incomplete service configuration adds another element to it, allowing us to build up our config over multiple docker-compose files. In order for this to work we have to run the compose commands with each config specified every time, resulting in the slightly hideous -

docker-compose \
  -f prometheus-server/docker-compose.yaml \
  -f redis-server/docker-compose.yaml \
  up -d

Once you’re running a stack with 3 or 4 components you’ll probably reach for aliases and add a base docker-compose replacement

alias dc='docker-compose -f prometheus-server/docker-compose.yaml -f redis-server/docker-compose.yaml'

and then call that with actual commands like dc up -d and dc logs. Adding your own application to the testing stack is as easy as adding a backing resource. Create a directory and the two config files and everything should be hooked in.
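
The volume merging that makes this layout work can be modelled in a few lines. This is a simplified sketch, not Compose’s implementation: it assumes short syntax volume strings like the ones above and follows the documented rule that volume entries are merged on the mount path inside the container, with later files winning. The function name and the replacement prometheus.yml path in the second example are made up for illustration.

```python
def merge_volumes(base, override):
    """Merge two short-syntax volume lists, keyed on the container path.

    Entries from override are added to the base list; an override entry that
    targets a container path already present replaces the base entry.
    """
    merged = {}
    for entry in base + override:
        # Short syntax is source:target[:mode]; the target is the second field.
        container_path = entry.split(":")[1]
        merged[container_path] = entry
    return list(merged.values())


base_volumes = [
    "prometheus_data:/prometheus",
    "${PWD}/prometheus-server/config/prometheus.yml:/etc/prometheus/prometheus.yml",
]
redis_volumes = [
    "${PWD}/redis-server/redis.json:/etc/prometheus/targets/redis.json",
]

# The redis compose file adds a third mount to the prometheus service.
for volume in merge_volumes(base_volumes, redis_volumes):
    print(volume)

# A later file mounting over an existing container path replaces the entry.
replaced = merge_volumes(
    base_volumes,
    ["${PWD}/other/prometheus.yml:/etc/prometheus/prometheus.yml"],
)
```

Single value options such as image and command behave differently, with the later file simply replacing the earlier value, so this model only applies to the volumes array.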

It’s early in the process and I’m sure to find issues with this naive approach but it’s enabled me to create arbitrarily complicated prometheus test environments and start evaluating its ecosystem of plugins and exporters. I’ll add more to it and refine where possible, the manual steps should hopefully be reduced by Grafana 5 for example, but hopefully it’ll remain a viable way for myself and others to run quick, adhoc tests.

A short 2017 review

It’s time for a little 2017 navel gazing. Prepare for a little self-congratulation and a touch of gushing. You’ve been warned. In general my 2017 was a decent one in terms of tech. I was fortunate to be presented a number of opportunities to get involved in projects and chat to people that I’m immensely thankful for and I’m going to mention some of them here to remind myself how lucky you can be.

Let’s start with conferences. I was fortunate enough to attend a handful of them in 2017. Scale Summit was, as always, a great place to chat about our industry. In addition to the usual band of rascals I met Sarah Wells in person for the first time and was blown away by the breadth and depth of her knowledge. She gave a number of excellent talks over 2017 and they’re well worth watching. The inaugural Jeffcon filled in for a lack of Serverless London (fingers crossed for 2018) and was inspiring throughout, from the astounding keynote by Simon Wardley all the way to the after conference chats.

I attended two DevopsDays, London, more about which later, and Stockholm. It was the first in Sweden and the organisers did the community proud. In a moment of annual leave burning I also attended Google Cloud and AWS Summits at the Excel centre. It’s nice to see tech events so close to where I’m from. I finished the year off with the GDS tech away day, DockerCon Europe and Velocity EU.

DevopsDays holds a special place in my heart as the conference and community that introduced me to so many of my peers that I heartily respect. The biggest, lasting contribution of Patrick’s, for me, is building those bridges. When the last “definition of Devops” post is made I’ll still cherish the people I met from that group of very talented folk. That’s one of the reasons I was happy to be involved in the organisation of my second London DevOps. You’d be amazed at the time, energy and passion the organisers, speakers and audience invest in a DevopsDays event. But it really does show on the day(s).

I was also honoured to be included in the Velocity Europe Program Committee. Velocity has always been one of the important events of our industry and to go from budgeting most of a year in advance to attend to being asked to help select from the submitted papers, and even more than that, be a session chair, was something I’m immensely proud of and thankful to James Turnbull for even thinking of me. The speakers, some of whom were old hands at large events and some giving their first conference talk (in their second language no less!), were a pleasure to work with and made a nerve wracking day so much better than I could have hoped. It was also a stark reminder of how much I hate speaking in front of a room full of people.

Moving away from gushing over conferences, I published a book. It was a small experiment and it’s been very educational. It’s sold a few copies, made enough to pay for the domain for a few years and led to some interesting conversations with readers. I also wrote a few Alexa skills. While they’re not the most complicated or interesting bits of code from last year they have a bit of a special significance to me. I’m from a very non-technical background so it’s nice for my family to actually see, or in this case hear, something I’ve built.

Other things that helped keep me sane were tech reviewing a couple of books, hopefully soon to be published, and reviewing talk submissions. Some for conferences I was heavily involved in and some for events I wasn’t able to attend. It’s a significant investment of time but nearly every one of them taught me something. Even about technology I consider myself competent in.

I still maintain a small quarterly Pragmatic Investment Plan (PiP), which I started a few years ago, and while it’s more motion than progress these days it does keep me honest and ensure I do at least a little bit of non-work technology each month. Apart from Q1 2017 I surprisingly managed to read a tech book each month, post a handful of articles on my blog, and attend a few user groups here and there. I’ve kept the basics of the PiP for 2018 and I’m hoping it keeps me moving.

My general reading for the year was the worst it’s been for five years. I managed to read, from start to finish, 51 books. Totalling under 15,000 pages. I did have quite a few false starts and unfinished books at the end which didn’t help.

Oddly, my most popular blog post of the year was Non-intuitive downtime and possibly not lost sales. It was mentioned in a lot of weekly newsletters and resulted in quite a bit of traffic. SRE weekly also included it, which was a lovely change of pace from my employer being mentioned in the “Outages” section.

All in all 2017 was a good year for me personally and contained at least one career highlight. In closing I’d like to thank you for reading UnixDaemon, especially if you made it this far down, and let’s hope we both have an awesome 2018.

Terraform testing thoughts

As your terraform code grows in both size and complexity you should invest in tests and other ways to ensure everything is doing exactly what you intended. Although there are existing ways to exercise parts of your code I think Terraform is currently missing an important part of testing functionality, and I hope by the end of this post you’ll agree.

I want puppet catalog compile testing in terraform

Our current terraform testing process looks a lot like this:

  • precommit hooks to ensure the code is formatted and valid before it’s checked in
  • run terraform plan and apply to ensure the code actually works
  • execute a sparse collection of AWSSpec / InSpec tests against the created resources
  • Visually check the AWS Console to ensure everything “looks correct”

We ensure the code is all syntactically valid (and pretty) before it’s checked in. We then run a plan, which often finds issues with module paths, names and such, and then the slow, all encompassing, and cost increasing apply happens. And then you spot an unexpanded variable. Or that something didn’t get included correctly with a count.

I think there is a missed opportunity to add a separate phase, between plan and apply above, to expose the compiled plan in an easy-to-integrate format such as JSON or YAML. This would allow existing testing tools, and things like custom rspec matchers and cucumber test cases, to verify your code before progressing to the often slow, and cash-consuming, apply phase. There are a number of things you could usefully test in a serialised plan output. Are your “fake if” counts doing what you expect? Are those nested data structures translating to all the tags you expect? How about the stringified splats and local composite variables? And what are the actual values hidden behind those computed properties? All of this would be visible at this stage. Having these tests would allow you to catch many of the more subtle logic issues before you invoke the big hammer of actually creating resources.
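
To make the idea a little more concrete, here is a rough Python sketch of the kind of checks a serialised plan would allow. The JSON layout, the resource names and both helper functions are invented purely for illustration; Terraform does not emit this structure, so treat it as a thought experiment rather than something you can run against a real plan.

```python
import json

# Hypothetical serialised plan. This layout is an assumption made up for
# the example and is not a format Terraform produces.
PLAN_JSON = """
{
  "resources": [
    {"type": "aws_instance", "name": "web.0",
     "attributes": {"instance_type": "t2.micro",
                    "tags": {"Name": "web-0", "Environment": "staging"}}},
    {"type": "aws_instance", "name": "web.1",
     "attributes": {"instance_type": "t2.micro",
                    "tags": {"Name": "web-1", "Environment": "staging"}}}
  ]
}
"""


def resources_of_type(plan, resource_type):
    """Return every planned resource of the given type."""
    return [r for r in plan["resources"] if r["type"] == resource_type]


def unexpanded_values(value, path=""):
    """Yield (path, value) for each string that still contains '${'."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from unexpanded_values(child, path + "/" + key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from unexpanded_values(child, "{}/{}".format(path, index))
    elif isinstance(value, str) and "${" in value:
        yield path, value


plan = json.loads(PLAN_JSON)

# Is the "fake if" count producing the number of instances we expect?
assert len(resources_of_type(plan, "aws_instance")) == 2

# Did the nested data structures translate into the tags we expect?
for resource in resources_of_type(plan, "aws_instance"):
    assert resource["attributes"]["tags"]["Environment"] == "staging"

# Are there any unexpanded variables left anywhere in the plan?
assert list(unexpanded_values(plan)) == []
```

Checks like these would run in seconds and cost nothing, which is exactly what makes them attractive as a gate before apply.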

I’m far from the first person to request this and upstream have been fair and considerate, but it’s not something that’s on the short-term road map. Workarounds do exist but they all have expensive limitations. The current plan file is in a binary format that isn’t guaranteed to be backwards compatible for external clients. Writing a plan output parser is possible but “a tool like this is very likely to be broken by future Terraform releases, since we don’t consider the human-oriented plan output to be a compatibility constraint”, and hooking the plan generation code, an approach taken by palantir/tfjson, will be a constant re-investment as Terraform’s core rapidly changes.

Adding a way to publish the plan in an easy to process way would allow many other testing tools and approaches to bloom and I hope I’ve managed to convince you that it’d be a great addition to terraform.

Show server side response timings in chrome developer tools

While trying to add additional performance annotations to one of my side projects I recently stumbled over the exceptionally promising Server-Timing HTTP header and specification. It’s a simple way to add semi-structured values describing aspects of the response generation and how long they each took. These can then be processed and displayed in your normal web development tools.

In this post I’ll show a simplified example, using Flask, to add timings to a single page response and display them using Google Chrome developer tools. The sample Python Flask application below returns a web page consisting of a single string and some fake information detailing all the actions that assembling the page could have required.

# cat hello.py

from flask import Flask, make_response
app = Flask(__name__)


@app.route("/")
def hello():

    # Collect all the timings you want to expose
    # each string is how long it took in microseconds
    # and the human readable name to display
    sub_requests = [
        'redis=0.1; "Redis"',
        'mysql=2.1; "MySQL"',
        'elasticsearch=1.2; "ElasticSearch"'
    ]

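    # Convert timings to a single string
    timings = ', '.join(sub_requests)

    # Build the single-string response (placeholder text) and
    # attach the timings header to it
    resp = make_response("Hello, World!")
    resp.headers.set('Server-Timing', timings)

    return resp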
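
The values in the sample are hard coded, but in a real application you would want to measure them. Below is a minimal sketch of one way to do that using only the standard library. The ServerTimings class and its measure method are names I’ve made up for this example, and the default scale of 1000 (reporting milliseconds) is an assumption; change it to whatever unit your tooling expects.

```python
import time
from contextlib import contextmanager


class ServerTimings:
    """Hypothetical helper: records named durations for one response."""

    def __init__(self, scale=1000):
        # scale converts seconds into the reported unit. 1000 gives
        # milliseconds, which is an assumption; adjust it as needed.
        self.scale = scale
        self.entries = []

    @contextmanager
    def measure(self, name, description):
        """Time the enclosed block and record it under name."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = (time.perf_counter() - start) * self.scale
            self.entries.append((name, elapsed, description))

    def header_value(self):
        """Render the entries in the 'name=value; "Description"' syntax."""
        return ', '.join(
            '{}={:.1f}; "{}"'.format(name, elapsed, description)
            for name, elapsed, description in self.entries
        )


timings = ServerTimings()
with timings.measure('redis', 'Redis'):
    time.sleep(0.01)  # stand-in for a real Redis call

print(timings.header_value())
```

Inside the Flask view you would wrap each backend call in a measure block and pass timings.header_value() as the Server-Timing header value instead of the joined list of fixed strings.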

Once you’ve started the application, with FLASK_APP=hello.py flask run, you can request this page via curl to confirm the header and values are present.

    $ curl -sI http://127.0.0.1:5000/ | grep Timing
    ...
    Server-Timing: redis=0.1; "Redis", mysql=2.1; "MySQL", elasticsearch=1.2; "ElasticSearch"
    ...
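
If you’d rather check the header in an automated test than by eye, it’s easy enough to pull apart. This is a small sketch that only understands the draft syntax used in this post (name=value; "Description"); the parse_server_timing function is my own invention rather than part of any library, and it will need changing if the header format moves on.

```python
import re

# One entry in the syntax used in this post: name=value; "Description"
ENTRY = re.compile(r'^\s*([\w-]+)\s*=\s*([0-9.]+)\s*;\s*"([^"]*)"\s*$')


def parse_server_timing(value):
    """Return {name: (duration, description)} for a Server-Timing value.

    Splits on commas, so a description containing a comma is not handled.
    """
    timings = {}
    for entry in value.split(','):
        match = ENTRY.match(entry)
        if match is None:
            raise ValueError('unrecognised entry: {!r}'.format(entry))
        name, duration, description = match.groups()
        timings[name] = (float(duration), description)
    return timings


header = 'redis=0.1; "Redis", mysql=2.1; "MySQL", elasticsearch=1.2; "ElasticSearch"'
parsed = parse_server_timing(header)

assert parsed['mysql'] == (2.1, 'MySQL')
assert sorted(parsed) == ['elasticsearch', 'mysql', 'redis']
```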

Now we’ve added the header, and some sample data, to our tiny Flask application let’s view it in Chrome devtools. Open the developer tools with Ctrl-Shift-I and then click on the network tab. If you hover the mouse pointer over the coloured section in “Waterfall” you should see an overlay like this:

Chrome devtools response performance graph

The values provided by our header are at the bottom under “Server Timing”.

Support for displaying the values provided with this header isn’t yet widespread. The example and screenshot presented here are from Chrome 62.0.3202.75 (Official Build) (64-bit) and may require changes as the spec progresses from its current draft status. The full potential of the Server-Timing header won’t be obvious for a while, but even with only a few supporting tools it’s still a great way to add some extra visibility to your projects.