Category Archives: devops

Puppet 4 data lookup strategies

I recently wrote about the new Data in Modules support in Puppet 4. There’s another new feature that goes hand in hand with it and finally rids us of functions like hiera_hash() and friends.

Up to now we’ve had to do something ugly like this to handle merged class parameters:

class users($local = hiera_hash("users::local", {})) {
 ...
}

This is functional but quite ugly, and it ties your module to Hiera. These days that is a reasonably safe assumption, but with the ability to specify different environment data sources it will not always be the case. For example there’s a new kid on the block called Jerakia that lives in this world, so having Hiera specific calls in modules is going to be a limiting strategy.

A much safer abstraction is to be able to rely on the automatic parameter lookup feature – but it had no way to know about the fact that this item should be a hash merge and so the functions were used as above.

Worse, things like merge strategies were set globally. A module could not say that one key should be deep merged and others just shallow merged, so a module that required a specific behaviour had no control over it.

A solution for this problem landed in recent Puppet 4 releases via a special merged hash called lookup_options. This is documented quite lightly in the official docs so I thought I’d put up an example here.

lookup() function


To understand how this works you first have to understand the lookup() function, which is described in the official Puppet documentation. It is basically the replacement for the various hiera() functions and has a matching puppet lookup CLI tool.

If you wanted to do a hiera_hash() lookup that is doing the old deeper hash merge you’d do something like:

$local = lookup("users::local", Hash, {"strategy" => "deep", "merge_hash_arrays" => true})

This would merge just this key rather than, say, setting the merge strategy to deeper globally in Hiera, and it’s something the module author can control. The Hash above describes the data type the result should match. It supports all the various complex composite type definitions, so you can describe the desired result data in real detail, almost like a schema.
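To make the idea of a deep merge a little less abstract, here is a rough Python model of what merging one key across two hierarchy levels looks like. It is only a sketch of the concept and not Puppet's implementation: the deep_merge name and the sample data are made up, the ordering and de-duplication of arrays is a choice made for this example, and options such as merge_hash_arrays are not modelled.

```python
def deep_merge(higher, lower):
    """Recursively merge two values; `higher` wins when scalars conflict."""
    if isinstance(higher, dict) and isinstance(lower, dict):
        merged = dict(lower)
        for key, value in higher.items():
            # Recurse only when both levels define the key.
            merged[key] = deep_merge(value, lower[key]) if key in lower else value
        return merged
    if isinstance(higher, list) and isinstance(lower, list):
        # Sketch choice: higher-priority items first, duplicates dropped.
        return higher + [item for item in lower if item not in higher]
    return higher


# Hypothetical data for users::local at two hierarchy levels.
node_data = {"alice": {"groups": ["admin"], "shell": "/bin/zsh"}}
common_data = {
    "alice": {"groups": ["users"], "shell": "/bin/bash"},
    "bob": {"groups": ["users"]},
}

merged = deep_merge(node_data, common_data)
```

A plain first-found lookup would have returned only node_data here, which is exactly why the old hiera_hash() style calls existed.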

lookup_options hiera key


We saw above how to instruct the lookup() function to do a hiera_hash() but wouldn’t it be great if we could somehow tell Puppet that a specific key should always be merged in this way? That way a simple lookup("users::local") would do the merge and, crucially, so would the automatic parameter lookups, even across backends and data providers.

We just want:

class users(Hash $local = {}) {
 ...
}

For this to make sense the users module must be able to indicate this in the data layer. And since we now have data in modules there’s an obvious place to put this.

If you set up the users module here to use the hiera data service for data in modules as per my previous blog post you can now specify the merge strategy in your data:

# users/data/common.yaml
lookup_options:
  users::local:
    merge:
      strategy: deep
      merge_hash_arrays: true

Note how this matches exactly the following lookup():

$local = lookup("users::local", Hash, {"strategy" => "deep", "merge_hash_arrays" => true})

The data type validation is done on the class parameters, where it will also validate explicitly specified data, and the strategies for processing the data live at the module data level.

The way this works is that Puppet will do a lookup_options lookup from the data source that is merged together, so you could set this at site level as well, but there is a check to ensure a module can only set keys for itself so it cannot change the behaviour of other modules.
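The "a module can only set keys for itself" rule can be pictured as a simple namespace filter. The Python below is a sketch of that idea only: the function name and the sample values are hypothetical, and whether Puppet silently ignores foreign keys or raises an error is not something this example tries to model.

```python
def allowed_lookup_options(module_name, lookup_options):
    """Keep only the options whose key lives in the module's own namespace."""
    prefix = module_name + "::"
    return {
        key: options
        for key, options in lookup_options.items()
        if key.startswith(prefix)
    }


# Hypothetical lookup_options shipped in the users module's data.
# The values are placeholders; only the keys matter for this check.
from_users_module = {
    "users::local": "deep merge settings",
    "ntp::servers": "settings for another module's key",
}

accepted = allowed_lookup_options("users", from_users_module)
```

Only the users::local entry survives, so the users module cannot change how ntp::servers is merged.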

This is a huge win and really shows what can be done with the Data in Modules features and something that’s been impossible before. This really brings the automatic parameter lookup feature a huge way forward and combines for me to be one of the most compelling features of Puppet 4.

I am not sure who proposed this behaviour, the history is a bit muddled but if someone can tweet me links to mailing list threads or something I’ll link them here for those who want to discover the background and reasoning that went into it.

Wishlist

The lookup function and the options are a great move forward, however I find the UX of the various lookup options and merge strategies quite bad. It’s really hard for me to go from reading the documentation to knowing what a certain option will do with my data. In fact I still have no idea what some of these do, and the only way to discover it seems to be spending time playing with them, which I haven’t had. It would be great for new users to get some more clarity there.

Some doc updates that provide a translation from old Hiera terms to new strategies would be great and maybe some examples of what these actually do.

Native Puppet 4 Data in Modules

Back in August 2012 I requested an enhancement to the general data landscape of Puppet and a natural progression on the design of Hiera to enable it to be used in modules that are shared outside of your own environments. I called this Data in Modules. There was lots of community interest in this but not much movement, eventually I made a working POC that I released in December 2013.

The basic idea around the feature is that we want to be able to use Hiera to model internal data found in modules as well as site specific data, and that these 2 sets of data coexist and complement each other. Full details of this can be found in my post titled Better Puppet Modules Using Hiera Data and some more background can be found in The problem with params.pp. These posts are a bit old now and some things have moved on but they’re good background reading.

It’s taken a while but as part of the Puppet 4 rework effort the data ingesting mechanisms have also been rewritten, and finally in Puppet 4.3.0 native data in modules has arrived. The original Jira for this is 4474. It’s really pretty close to what I had in mind in my proposals and my POC and I am really happy with this. Along the way a new function called lookup() has been introduced to replace the old collection of hiera(), hiera_array() and hiera_hash().

The official docs for this feature can be found at the Puppet Labs Docs site. Here I’ll more or less just take my previous NTP example and show how you could use the new Data in Modules to simplify it as per the above mentioned posts.

This is the very basic Puppet class we’ll be working with here:

class ntp (
  String $config,
  String $keys_file
) {
 ...
}

In the past these variables would have needed to interact with the params.pp file like $config = $ntp::params::config, but now it’s just a simple class. At this point it’ll not yet use any data in the module, to do that you have to activate it in the metadata.json:

# ntp/metadata.json
{
  ...
  "data_provider": "hiera"
}

At this point Puppet knows you want to use the hiera data in the module. But key to the feature, and really the whole reason it exists, is that a module needs to be able to specify its own hierarchy. Imagine you want to set $keys_file here: you have to be sure the hierarchy in question includes the OS family and you must have control over that data. In the past, with the hierarchy being controlled completely by the site hiera.yaml, this was not possible at all, and the outcome was that if you wanted to share a module outside of your environment you had to go the params.pp route as that was the only portable solution.

So now your modules can have their own hiera.yaml. It’s slightly different from the past but should be familiar to past hiera users, it goes in your module so this would be ntp/hiera.yaml:

---
version: 4
datadir: data
hierarchy:
  - name: "OS family"
    backend: yaml
    path: "os/%{facts.os.family}"
 
  - name: "common"
    backend: yaml

This is the new format for the hiera configuration. It’s more flexible, and a future version of hiera will have some changed semantics that are quite a bit nicer than the original design I came up with, so you have to use the new format here.

Here you can see the module has its own OS Family tier as well as a common tier. Let’s see the ntp/data/common.yaml:

---
ntp::config: "/etc/ntp.conf"
ntp::keys_file: "/etc/ntp.keys"

These are sane defaults to use for any non specifically supported operating systems.

Below are examples for AIX and Debian:

# data/os/AIX.yaml
---
ntp::config: "/etc/ntpd.conf"

# data/os/Debian.yaml
---
ntp::keys_file: "/etc/ntp/keys"

At this point the need for params.pp is gone – at least in this simplistic example – and this data along with the environment specific or site specific data cohabit really nicely. If you specified any of these data items in your site Hiera data your site data will override the module. The advantages of this might not be immediately obvious. I have a very long list of advantages over params.pp in my Better Puppet Modules Using Hiera Data post, be sure to read that for background.
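The way the tiers and the site override fit together can be modelled in a few lines of Python. This is a simplified sketch rather than how Puppet resolves data internally: the function names are made up, the module data is just the three YAML files from this post written as dictionaries, and the site value used in the usage note is hypothetical.

```python
def hierarchy_paths(facts):
    """Expand the module hierarchy from this post for one node's facts."""
    return ["os/{}".format(facts["os"]["family"]), "common"]


def lookup(key, facts, module_data, site_data):
    """Return the first value found: site data first, then the module tiers."""
    if key in site_data:
        return site_data[key]
    for path in hierarchy_paths(facts):
        tier = module_data.get(path, {})
        if key in tier:
            return tier[key]
    raise KeyError(key)


# The module's data files from the post, keyed by their path under data/.
module_data = {
    "os/AIX": {"ntp::config": "/etc/ntpd.conf"},
    "os/Debian": {"ntp::keys_file": "/etc/ntp/keys"},
    "common": {"ntp::config": "/etc/ntp.conf", "ntp::keys_file": "/etc/ntp.keys"},
}

debian = {"os": {"family": "Debian"}}
keys_file = lookup("ntp::keys_file", debian, module_data, {})
config = lookup("ntp::config", debian, module_data, {})
```

On a Debian node the keys file comes from the OS family tier while the config path falls through to common. Passing a site_data such as {"ntp::config": "/srv/ntp.conf"} would win over both module tiers.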

There’s an alternative approach where you write a Puppet function that returns a hash of data and the data system will fetch the keys from there. This is really powerful and might end up being an interesting solution to something along the lines of a module specific custom hiera backend, but a lighter weight version of that. I might write that up later, this post is already a bit long.

The remaining problem is to do with data that needs to be merged, as traditionally Hiera and Puppet have no idea you want this to happen when you do a basic lookup, hence these annoying hiera_hash() functions. There’s a solution for this and I’ll write a blog post about it next week, once the next Puppet 4 release is out and a bug I found that makes it unusable is fixed in that version.

This feature is a great addition to Puppet and I am really glad to finally see this land. My hacky data in modules code was used quite extensively, with 72 000 downloads from the forge, but I was never really happy with it and was desperate to see this land natively. This is a big step forward and I hope it sees wide adoption in the community.

Iterating in Puppet

Iteration in Puppet has been a long standing pain point. Puppet 4 addresses this by adding blocks, loops etc. Here I capture the various approaches to working with some complex data in Puppet before and after Puppet 4.

To demonstrate this I’ll take some data from a previous blog post and see how to deal with it. Here’s the data that will be in $domains in the examples below:

{
    "x.net": {
      "nexthop": "70.x.x.x",
      "spamdestination": "rip@devco.net",
      "spamthreshold": 1500,
      "enable_antispam": 1
    },
    "x.co.uk": {
      "nexthop": "70.x.x.x",
      "spamdestination": "rip@devco.net",
      "spamthreshold": 1500,
      "enable_antispam": 1
    }
}

First we’re going to need some defined type that can create an individual domain, we’ll call that mail::domain but I won’t show the code here, as that’s not really important.

Puppet 3 + stdlib

The first approach I’ll show is your basic Puppet 3 approach. The basic idea here is to get a list of domains and use the array iteration Puppet has always had on resource names.

The trick here is to get the domain names using the keys() function and then pass all the data into every instance of the define. Each instance fetches its data from the data passed into the define.

$domain_names = keys($domains)
 
mail::domains{$domain_names:
  domains => $domains
}
 
define mail::domains($domains) {
  $domain = $domains[$name]
 
  mail::domain{$name:
    nexthop => $domain["nexthop"]
    .
    .
  }
}

Puppet 3 + create_resources

A hacky riff on eval() was added to Puppet during 3 to make it a bit easier to deal with data from Hiera or similar. It takes some data in a standard format and creates instances of a defined type:

create_resources("mail::domain", $domains, {"spamthreshold" => 1500, "enable_antispam" => 1})

This replaces all the code above plus adds some default handling in the case that the data is not uniform. Some people love it, some hate it, I think it’s a bit too magical so prefer to avoid it.

Puppet 4 – each loop

This is the approach you’d probably want to use in Puppet 4. It uses a simple each loop over the data:

$domains.each |$name, $domain| { 
  mail::domain{$name:
    nexthop => $domain["nexthop"]
    .
    .
  }
}

It’s quite readable and obvious what’s happening here. It’s more typing than the create_resources example but I think this is the preferred way due to its clarity.

Below this we get into the more academic solutions to the problem, mainly showing off some Puppet 4 features.

Puppet 4 – wildcard shortcut

If listing every key is tedious like above and if you know your hashes map 1:1 to the defined type parameters you can short circuit things a bit, this is quite close to the create_resources convenience:

each($domains) |$name, $domain| { 
  mail::domain{$name:
    * => $domain
  }
}

The splat operator takes all the data in the hash and maps it right onto properties of the defined type, which is quite handy.

Puppet 4 – wildcard and defaults

Your data might not all be complete so you’d want to get some defaults merged in. This is something create_resources also supports, and this is how you’d do it without create_resources:

$defaults = {
  "spamthreshold" => 1500,
  "enable_antispam" => 1
}
 
$domains.each |$name, $domain| { 
  mail::domain{$name:
    * => $defaults + $domain  # + now merges hashes
  }
}
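The direction of the merge matters here: with $defaults + $domain the values in $domain win whenever both hashes have the same key. The same behaviour can be shown with Python dictionaries. This is an analogy only, and the spamthreshold of 900 for x.net is a made-up value so that there is something to override.

```python
defaults = {"spamthreshold": 1500, "enable_antispam": 1}

# Hypothetical, deliberately incomplete data: x.net overrides one default
# and x.co.uk relies on both of them.
domains = {
    "x.net": {"nexthop": "70.x.x.x", "spamthreshold": 900},
    "x.co.uk": {"nexthop": "70.x.x.x"},
}

# {**a, **b} keeps every key from a and lets b override on conflict,
# which is the same direction as `$defaults + $domain`.
resources = {name: {**defaults, **domain} for name, domain in domains.items()}
```

Swapping the operands would make the defaults clobber the per-domain settings, which is almost never what you want.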

Puppet 4 – wildcard and resource defaults

An alternative to the above that’s a bit more verbose but might be more readable can be seen below:

$defaults = {
  "spamthreshold" => 1500,
  "enable_antispam" => 1
}
 
$domains.each |$name, $domain| { 
  mail::domain{
    default:
      * => $defaults;
 
    $name:
      * => $domain
  }
}

Conclusion

That’s about it, there are many more iteration tricks in Puppet 4 but this shows you how to achieve what you did with create_resources in the past and a couple of possible approaches to solving that problem.

Not sure which I’d recommend, but I suspect the choice comes down to personal style and situation.

Translating Webhooks with AWS API Gateway and Lambda

Webhooks are great, so many services now support them but I found actually doing anything with them a pain as there are no standards for what goes in them and any 3rd party service you wish to integrate with has to support the particular hooks you are producing.

For instance I want to use SignalFX for my metrics and events but they have very few integrations. A translator could take an incoming hook and turn it into a SignalFX event and pass it onward.

For a long time I’ve wanted to build a translator but never got around to doing it because I did not feel like self hosting it and writing a whole bunch of supporting infrastructure. With the release of AWS API Gateway this has become quite easy and really convenient as there is no infrastructure and there are no instances to manage.

I’ll show a bit of a walk through on how I built a translator that sends events to SignalFX. Note I do not do any kind of queueing or retrying on the gateway at present so it’s lossy and best effort.

AWS Lambda runs stateless functions on demand. At launch it only supported ingesting their own Events but the recently launched API Gateway lets you front it using a REST API of your own design and this made it a lot easier.

For the rest of this post I assume you’re over the basic hurdles of signing up for AWS and are already familiar with the basics, so some stuff will be skipped but it’s not really that complex to get going.

The Code


To get going you need some JS code to handle the translation, here’s a naive method to convert a GitHub push notification into a SignalFX event:

This will be the meat of the processing, and it includes a bit of code to create a request using the https module, which includes the SignalFX authentication header.

Note this creates dimensions to the event that is being sent, I guess you can think of them like some kind of key=val tags for the event. In the Signal FX UI I can select events like this:

Any other added dimension can be used too. Events show up as little diamonds on graphs, so if I am graphing a service using these dimensions I can pick out events that relate to the branches and repositories that influence the data.
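For readers who only want the shape of the translation, here is a small Python sketch of turning a push notification into an event with dimensions. It is not the code from my repository, which is JavaScript, and it sends nothing over the network. The ref, repository.full_name, pusher.name and commits fields come from the GitHub push payload, while the eventType, dimensions and properties layout of the result is an assumption about what an event body looks like and should be checked against the SignalFX documentation.

```python
def push_to_event(push):
    """Translate a GitHub push payload into an event dict with dimensions."""
    # "refs/heads/master" -> "master"; branch names containing "/" stay intact.
    branch = push["ref"].split("/", 2)[-1]
    return {
        "eventType": "github_push",  # assumed event name for this sketch
        "dimensions": {
            "repository": push["repository"]["full_name"],
            "branch": branch,
        },
        "properties": {
            "pusher": push["pusher"]["name"],
            "commits": len(push.get("commits", [])),
        },
    }


# A heavily trimmed, hypothetical push payload.
sample_push = {
    "ref": "refs/heads/master",
    "repository": {"full_name": "example/repo"},
    "pusher": {"name": "someone"},
    "commits": [{"id": "abc"}, {"id": "def"}],
}

event = push_to_event(sample_push)
```

Keeping the translation a pure function like this makes it easy to test with a saved sample payload before wiring it to a handler.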

This is called as below:

There’s some stuff not shown here for brevity, it’s all in GitHub.

Setting up the Lambda functions

We have to create a Lambda function. For now I’ll use the console, but you can use Terraform for this and it helps quite a lot.

As this repo is made up of a few files your only option is to zip it up. You’ll have to clone it and make your own config.js based on the sample prior to creating the zip file.

Once you have it just create a Lambda function which I’ll call gitHubToSFX and choose your zip file as source. While setting it up you have to supply a handler. This is how Lambda finds your function to call.

In my case I specify index.handleGitHubPushNotifications, which uses the handleGitHubPushNotifications function found in index.js.

It ends up looking like this:

Once created you can test it right there if you have a sample GitHub commit message.

The REST End Point

Now we need to create somewhere for GitHub to send the POST request to. Gateway works with resources and methods. A resource is something like /github-hook and a method is POST.

I’ve created the resource and method, and told it to call the Lambda function here:

You have to deploy your API: just hit the big Deploy API button and follow the steps. You can create stages like development, staging and production and deploy APIs through such a life cycle. I just went straight to prod.

Once deployed it gives you a URL like https://12344xnb.execute-api.eu-west-1.amazonaws.com/prod and your GitHub hook would be configured to hit https://12344xnb.execute-api.eu-west-1.amazonaws.com/prod/github-hook .

Conclusion


That’s about it, once you’ve configured GitHub you’ll start seeing events flow through.

Both Lambda and API Gateway can write logs to CloudWatch, and from the JS side you can do something like console.log("hello") and this will show up in the CloudWatch logs to help with debugging.

I hope to start gathering a lot of translations like these and am still learning Node, so not really sure yet how to make packages or classes but so far this seems really easy to use.

Cost wise it’s really cheap. You’d pay $3.50 per million API calls received on the Gateway and $0.09/GB for the transfer costs, but given the nature of these events this will be negligible. Lambda is free for the first 1 million requests and you’ll pay some tiny amount for the time used. They are both eligible for the free tier too in case you’re new to AWS.
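To put rough numbers on that, here is a tiny Python calculation using only the two API Gateway prices quoted above. The traffic figures are invented for the example, and the Lambda compute time and any free tier allowance are deliberately left out, so treat the result as an upper-end sketch rather than a bill.

```python
def monthly_cost(api_calls, transfer_gb):
    """Estimate API Gateway charges in USD from the prices quoted in this post."""
    call_cost = api_calls / 1_000_000 * 3.50  # $3.50 per million calls
    transfer_cost = transfer_gb * 0.09        # $0.09 per GB transferred
    return round(call_cost + transfer_cost, 2)


# Hypothetical month: 200 000 hooks received and 1 GB of transfer.
estimate = monthly_cost(api_calls=200_000, transfer_gb=1)
```

That works out to 0.70 for the calls plus 0.09 for the transfer, so well under a dollar for a fairly busy set of repositories.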

The cost and lack of infrastructure make this a very attractive option for building this kind of service.

The power of packaging software, package all the things

Software delivery is hard, plenty of people all over this planet are struggling with delivering software in their own controlled environment. They have invented great patterns that will build an artifact, then do some magic and the application is up and running.

When talking about continuous delivery, people invariably discuss their delivery pipeline and the different components that need to be in that pipeline. Often, the focus on getting the application deployed or upgraded from that pipeline is so strong that teams forget how to deploy their environment from scratch.

After running a number of tests on the code, compiling it where needed, people want to move forward quickly and deploy their release artifact on an actual platform. This deployment is typically via a file upload or a checkout from a source-control tool on the dedicated computer on which the application resides. Sometimes, dedicated tools are integrated to simulate what a developer would do manually on a computer to get the application running. Copy three files left, one right, and make sure you restart the service. Although this is obviously already a large improvement over people manually pasting commands from a 42 page run book, it doesn’t solve all problems.

Take, for example, the guy who quickly makes a change on the production server, never to commit the change (say goodbye to git pull for your upgrade process). If you package your software there are a couple of things you get for free from your packaging system. Questions like "has this file been modified since I deployed it?", "where did this file come from?", "when was it deployed?" and "what version of software X do I have running on all my servers?" are easily answered by the same tools we already use for every other package on the system. Not only can you use existing tools, you are also using tools that are well known by your ops team and that they already use for every other piece of software on your system.
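The "has this file been modified since I deployed it" question is answered by a package manager by comparing files on disk against what was recorded at install time. The Python below is a toy model of that idea and nothing more: it is not how rpm or dpkg are implemented, the manifest is just a dictionary, and the file name and contents are invented for the demonstration.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def modified_files(manifest):
    """Return the paths whose current checksum differs from the recorded one."""
    return [path for path, digest in manifest.items() if sha256_of(path) != digest]


# "Install" a hypothetical config file and record its checksum.
workdir = tempfile.mkdtemp()
config = os.path.join(workdir, "app.conf")
with open(config, "w") as handle:
    handle.write("workers=4\n")
manifest = {config: sha256_of(config)}
clean = modified_files(manifest)

# Someone makes a quick, uncommitted change on the production box.
with open(config, "a") as handle:
    handle.write("workers=16\n")
changed = modified_files(manifest)
```

A real packaging system gives you this check, plus ownership and version queries, without anyone having to write or maintain such a script.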

If your build process creates a package and uploads it to a package repository which is available for the hosts in the environment you want to deploy to, there is no need anymore for a script that copies the artifact from a 3rd party location, and even less for that 42 page text document which never gets updated and still tells you to download yaja.3.1.9.war from a location where you can only find 3.2 and 3.1.8, while the developer that knows if you can use 3.2 or why 3.1.9 got removed just left for the long weekend.

Another, and maybe even more important, thing is the current, sadly growing practice of having yet another tool in place that translates that 42 page text document into a bunch of shell scripts created from a drag and drop interface. Typically that "deploy tool" is even triggered from within the pipeline. Apart from the fact that it usually stimulates a pattern of non reusable code, distributing even more ssh keys, or adding yet another agent on all systems, it doesn’t take into account that you want to think of your servers as cattle and be able to deploy new instances of your application fast.

Do you really want to deploy your five new nodes on AWS with a full Apache stack ready for production, then reconfigure your load balancers, only to figure out that someone needs to go click in your continuous integration tool or deployment tool to deploy the application to the new hosts? That one manual action someone forgets?

IMVHO deployment tools are a phase in the maturity process of a product team. Yes, it's a step up from manually deploying software, but it creates more and other problems, and once your team grows in maturity refactoring out that tool is trivial.

The obvious and trivial approach to this problem, and one that comes with even more benefits, is called packaging. When you package your artifacts as operating system (e.g., .deb or .rpm) packages, you can include that package in the list of packages to be deployed at installation time (via Kickstart or debootstrap). Similarly, when your configuration management tool (e.g., Puppet or Chef) provisions the computer, you can specify which version of the application you want to have deployed by default.

So, when you’re designing how you want to deploy your application, think about deploying new instances or deploying to existing setups (or rather, upgrading your application). Doing so will make life so much easier when you want to deploy a new batch of servers.

What done REALLY looks like in devops

Steve Ropa blogged about What done looks like in devops. I must say I respectfully, but fully, disagree with Steve here.

For those of you that remember, I gave an Ignite talk about my views on the use of the Definition of Done back at #devopsdays 2013 in Amsterdam.

In the early days we talked about the #devops movement partly being a reaction against the late Friday night deployments where the ops people got a tarball with some minimalistic notes and were supposed to put stuff in production. The work of the development team was Done, but the operations team's work had just started.

Things have improved. Like Steve mentions, for a lot of teams done now means that their software is deployable, that we have metrics from them, that we can monitor the application.

But let's face it: even if all of that is in place there is still going to be maintenance, security fixes, major stack upgrades and minor application changes, and we all still need to keep the delivery pipelines running.

A security patch on an application stack means that both the ops and the developers need to figure out the required changes together.

Building and delivering value to your end users is something that never ends, we are never actually done.

So let me repeat:

"Done is when your last end user is in his grave"
In other words, when the application is decommissioned.

And that is the shared responsibility mindset devops really brings: everybody, both developers and operations people, cares about the value they are bringing to their customers and thinks about keeping the application running, rather than assuming that because a list of requirements has been validated at the end of a sprint we are done. Because we never are...

BTW, here are my original slides for that #devopsdays Amsterdam talk.


Some thoughts on operating containers

I recently blogged about my workflow improvements realised by using docker for some services. As for everyone else, the full story about running containers in production is a bit of an unknown to me. I am running 7 or 8 things in containers at the moment but I have a lot of outstanding questions.

On one hand I can go the route of a private PaaS where you push an image or Dockerfile into it and forget about it, and hope you never have to debug anything or dive deep into finding out why something is not performant, as those tend to be very much closed systems. Some, like deis, are just Docker underneath, but others, like the recently hyped lattice.cf, unpack the Docker container and transform it into something else entirely that is much harder to interact with from a debug perspective. As a bit of an old school sysadmin this fire-and-hope-for-the-best approach leaves me a bit uninterested. I don’t want to lose the ability to carefully observe my running containers using traditional tools if I have to. It’s great to strive for never having to do that, for never having to touch a running app using anything but your monitoring SaaS, or for always being able to just scale out horizontally, but personally I feel I need a bit more close-to-the-bits interaction at times. Aim for that goal and you get a much better overall system, but while you’ve not yet reached this nirvana-like state you’re going to want to get at your running apps using strace if you have to.

So having ruled out just running one of the existing crop of private PaaS offerings locally I started thinking about what a container really is. I consider them to be analogous to a package, so we need to first explore what packages are. In its simplest form a package is just a bunch of files packaged up. So what makes it better than a tarball?

  • Metadata like name, version, build time, build host, dependencies, descriptions, licence, signature and urls
  • Built in logic like pre/post install scripts but also companion scripts like init system scripts, monitoring logic etc
  • An API to interact with this – the rpm or apt/deb commands – but like in the case of Yum also libraries for interacting with these

All of the above combines to bring the biggest and ultimate benefit from a package: Strong set of companion tools to build, host, deploy, validate, update and inspect those packages. You cannot have the main benefit from packages without the mature implementations of the preceding points.

To really put it in perspective, the Puppet or Chef package resources only work because of the combination of the above 3 points. Without them they will fail, which is why the daily attempts by people on #puppet, for example, to reinvent packaging with an exec running wget and make end up failing and yield the predictable answer of packaging up your software instead.

When I look at the current state of a docker container and the published approaches for building them I am left a bit wanting when I compare them to a mature package manager with respect to the 3 points above. That means I am going to end up with an unsatisfactory set of tools and interactions with my running apps.

So to address this I started looking at standardising my builds and creating a framework for building containers the way I like to, and at what kind of information I would be able to make available to create the tooling I think is needed. I do this using a base image that has a script called container in it that can introspect metadata about the image. Any image downstream from this base image can just add more metadata and hook into the life cycle my container management script provides. It’s not OS dependent, so I wouldn’t be forcing any group into an OS choice and can still gain a lot of the advantages Docker brings with respect to making heterogeneous environments less painful. My build system embeds the metadata into any container it builds as JSON files.

Metadata


There is a lot going on in this space. Kubernetes has labels and Docker is getting metadata, but these are more tools to enable metadata; it’s still up to people to decide what to do with it.

The reason you want to be able to really interact with and introspect packages comes down to things like auditing them: where do you have outdated SSL versions, and the like. Likewise I want to know things about my containers and images:

  • Where and when was it built and why
  • What were its ancestor images
  • How do I start, validate, monitor and update it
  • What git repo is being built, what hash of that git repo was built
  • What are all the tags this specific container is known as at time of build
  • What’s the project name this belongs to
  • Have the ability to have arbitrary user supplied rich metadata

All that should be visible to the inside and outside of the container and kept for every ancestor of the container. Given this I can create rich generic management tools: I can create tools that do not require configuration to start, update and validate the functionality as well as monitor and extract metrics of any container without any hard coded logic.

Here’s an example:

% docker exec -ti rbldnsd container --metadata|json_reformat
{
  "validate_method": "/srv/support/bin/validate.sh",
  "start_method": "/srv/support/bin/start.sh",
  "update_method": "/srv/support/bin/update.sh",
  "validate": true,
  "build_cause": "TIMERTRIGGER",
  "build_tag": "jenkins-docker rbldnsd-55",
  "ci": true,
  "image_tag_names": [
    "hub.my.net/ripienaar/rbldnsd"
  ],
  "project": "rbldnsd",
  "build_time": "2015-03-30 06:02:10",
  "build_time_stamp": 1427691730,
  "image_name": "ripienaar/rbldnsd",
  "gitref": "e1b0a445744fec5e584919711cafd8f4cebdee0e"
}

Missing from this are the monitoring and metrics related bits, as those are still a work in progress. But you can see here metadata for a lot of the stuff I mentioned. My builds embed this into the image, which means when I FROM one of my images I get a history that I can examine:

% docker exec -ti rbldnsd container --examine
Container first started at 2015-03-30 05:02:37 +0000 (1427691757)
 
Container management methods:
 
   Container supports START method using command /srv/support/bin/start.sh
   Container supports UPDATE method using command /srv/support/bin/update.sh
   Container supports VALIDATE method using command /srv/support/bin/validate.sh
 
Metadata for image centos_base
 
  Names:
            Project Name: centos_base
              Image Name: ripienaar/centos_base
         Image Tag Names: hub.my.net/ripienaar/centos_base
 
  Build Info:
                  CI Run: true
                Git Hash: fcb5f3c664b293c7a196c9809a33714427804d40
             Build Cause: TIMERTRIGGER
              Build Time: 2015-03-24 03:25:01 (1427167501)
               Build Tag: jenkins-docker centos_base-20
 
  Actions:
                   START: not set
                  UPDATE: not set
                VALIDATE: not set
 
Metadata for image rbldnsd
 
  Names:
            Project Name: rbldnsd
              Image Name: ripienaar/rbldnsd
         Image Tag Names: hub.my.net/ripienaar/rbldnsd
 
  Build Info:
                  CI Run: true
                Git Hash: e1b0a445744fec5e584919711cafd8f4cebdee0e
             Build Cause: TIMERTRIGGER
              Build Time: 2015-03-30 06:02:10 (1427691730)
               Build Tag: jenkins-docker rbldnsd-55
 
  Actions:
                   START: /srv/support/bin/start.sh
                  UPDATE: /srv/support/bin/update.sh
                VALIDATE: /srv/support/bin/validate.sh

This is the same information as above but also showing the ancestor of this rbldnsd image – the centos_base image. I can see when they were built, why, from which hashes of the repositories, and I can see how I can interact with these containers. From here I can audit or manage their life cycle quite easily.

I’d like to add to this a bunch of run-time information like when it was deployed, why, to what node etc and will leverage the docker metadata when that becomes available or hack something up with ENV variables.

Solving this problem has been key to getting to grips with the operational concerns I had with Docker and feeling I can get back to the level of maturity I had with packages.

Management


You can see from above that the metadata supports specifying START, UPDATE and VALIDATE actions. Future ones might be MONITOR and METRICS.

UPDATE requires some explaining. Of course the trend is toward immutable infrastructure where every change is a rebuild and this is a pretty good approach. I host things like a DNS based RBL and these tend to update all the time, I’d like to do so quicker and with less resource usage than a full rebuild and redeploy – but without ending up in a place where a rebuild loses my changes. This is a good middle ground somewhere between immutability and rapid change. I rebuild and redeploy all my containers every night so this covers the few hours in between.

So the typical pattern I do this with is to make the data directories for these images be git checkouts using deploy keys on my git server. So the build process will always take latest git and the update process will fetch latest git and reload the running config. Here’s my DNS server:

% sudo docker exec bind container --update
>> Fetching latest git checkout
From https://git.devco.net/ripienaar/docker_bind
 * branch            master     -> FETCH_HEAD
Already up-to-date.
 
>> Validating configuration
>> Checking named.conf syntax in master mode
>> Checking named.conf syntax in slave mode
>> Checking zones..
 
>> Reloading name server
server reload successful

There were no updates but you can see it would fetch the latest, validate it passes inspection and then reload the server if everything is ok. And here is the main part of the script implementing this action:

echo ">> Fetching latest git checkout"
git pull origin master
 
echo ">> Validating configuration"
container --validate
 
echo ">> Reloading name server"
rndc reload

This way I just need to orchestrate these standard container --update execs – webhooks do this in my case.
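A slightly more defensive sketch of the same flow stops at the first failing step, so a config that fails validation never gets reloaded. This is not the real update.sh: the function names (run_steps, fetch_config, validate_config, reload_server) are made up for illustration, and only the three commands inside them come from the script above.

```shell
# Sketch only: run each named step in order and stop at the first failure.
run_steps() {
  for step in "$@"; do
    echo ">> Running ${step}"
    "${step}" || { echo "!! ${step} failed, aborting" >&2; return 1; }
  done
}

# The three commands from the script above, wrapped as functions.
fetch_config()    { git pull origin master; }
validate_config() { container --validate; }
reload_server()   { rndc reload; }

# Inside the container this would be called as:
#   run_steps fetch_config validate_config reload_server
```

Because run_steps returns non-zero as soon as one step fails, whatever orchestrates the update can tell a rejected config apart from a successful reload by the exit status alone.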

VALIDATE is interesting too, in this case validate uses the usual named-checkconf and named-checkzone commands to check the incoming config files but my more recent containers use serverspec and infrataster to validate the full end to end functionality of a running container.

% sudo docker exec -ti rbldnsd container --validate
.............................
 
Finished in 6.86 seconds (files took 0.39762 seconds to load)
29 examples, 0 failures

My dev process revolves around this like TDD would: my build process runs these steps at the end of every build in a running instance of the container, and my deploy process runs them after anything it deploys. And personally if anything is not working right my first port of call is just this command, it often gets me right down to the part that went wrong – if I have good tests that is :) I mentioned I rebuild and redeploy the entire infrastructure daily – it’s exactly the investment in these tests that means I can do so while getting a good night’s sleep.

Monitoring will likewise be extended around standardised introspectable commands so that a single method can be made to extract status and metric information out of any container built on this method.

Outcome


I’m pretty happy with where this got me. I found it much easier to build some tooling around containers given rich metadata and standardised interaction methods. I kind of hoped this was what I would get from Docker itself but it’s either too early or what it provides is too low level – understandable as from its perspective it would want to avoid being too prescriptive or have limited sets of data it supports on limited operating systems. I think though as a team who want to build and deploy a solid infrastructure on Docker you need to invest in something along these lines.

My containers are really containers now not just for their files and dependencies but more and more their operational life cycle is contained in the container too. Containers can be asked for their health, they can update themselves and eventually emit detailed reusable metrics and statuses. The API to do all of this is standardised and I can run this anywhere with confidence gained from having these introspective abilities and metadata anywhere. Like the huge benefit I got from an improved workflow, I find the benefit of this embedded operational life cycle is equally large and something that I found hard to achieve in my old traditional CM based approach.

I think PaaS systems need to get a bit more of this kind of thing in their pipelines, I’d like to be able to ask my PaaS to just run my validate steps regularly or on demand. Or have standardised monitoring status and metrics output so that the likes of Datadog etc can deliver agents that provide in depth application monitoring without configuration by just sitting in a container next to a set of these containers. Today the state of the art for PaaS health checks seems to be to just hit the exposed port, but real life management of services is much more intricate than that. If they had that I could adopt one of those and spare myself a lot of pain.

For now though this is what my systems will do and hopefully some of the ideas become generally accepted.

Collecting links to free services for developers

A while ago Brandon Burton tweeted the following:


And I said someone should make a list. I looked around and could not find one so I decided I want to make such a list.

I am gathering links in a flat Markdown file at the moment but once I get an idea for the categories and kinds of links I’ll look at setting up a site with better UX than this big readme. If you have design chops for a site like this and want to help, get in touch.

Already I gathered quite a few and had some good links sent as PRs, if you have any links or if you work for a company who think they might fit the bill please send me links.

I am looking for links to services that provide free services especially to Open Source developers. Past that if they provide a developer account with some free resources – like a monitoring service that allows 5 free devices etc. I’d probably favour links good for infrastructure coders rather than say mobile app developers as there’s a huge list of those around.

I am not after services in Private Beta or Free During Beta or things like this, the only ones of those I’d accept are ones who specifically state that beta accounts will become free dev accounts or similar in future.

Moving a service from Puppet to Docker

I’ve moved a number of my more complex infrastructure components from being Puppet managed to being Docker managed. There are many reasons for this, the main one being that my Puppet code is ancient and, faced with a rewrite to be Puppet 4 like or just rethinking things, I’m leaning towards rethinking. I don’t think CM is solving the right problem for me for certain aspects of my infrastructure and new approaches can bring more value for my use case.

There’s a lot of posts around talking about Docker and concentrating on the image building side of it or just the running of a container side – which I find quite uninteresting and in fact pretty terrible. The real benefit for me comes in workflow, the API, the events out of the daemon and the container stats. People look at the image and container aspects in isolation and go on about how this is not new technology, but that’s missing the point.

Mainly a workflow problem

I’ll look at an example moving rbldnsd from Puppet to Docker managed and what I gain from that. Along the way I’ll also throw in some examples of a more complex migration I did for my bind servers. In case you don’t know rbldnsd is a daemon that maintains DNS based RBLs using config files that look something like this:

$DATASET dnset senderhost
.digitalmarketeer.com   :127.0.0.2:Connection rejected after user complaints.

You can then query it using the usual ways your MTA supports and decide policy based on that.

The life cycle of this service is typical of the ones I am targeting:

  • A custom RPM had to be built and maintained and served from yet another piece of infrastructure.
  • The basic package, config, service triplet. So vanilla it’s almost not worth looking at the code, it looks like all other package, config, service code.
  • Requires ongoing data management – I add/remove hosts from the blacklists constantly. But this overlaps with the config part above.
  • Requires the ability to test DNS queries work in development before even committing the change
  • Requires rapid updating of configuration data

The last 3 points here deserve some explanation. When I am editing these configuration files I want to be able to test them right there in my shell without even committing them to git. This means starting up a rbldnsd instance and querying it with dig. This is pretty annoying to do with the Puppet workflow, which I won’t go into here as it’s a huge subject on its own. Suffice to say it doesn’t work for me and ends up not being production like at all.

When I am updating these config files onto the running service there’s a daemon that will load them into its running memory. I need to be pretty sure that the daemon I am testing on is identical to what’s in production now. Ideally bit for bit identical. Again this is pretty hard as many/most dev environments tend to be a few steps ahead of production. I need a way to say give me the bits running production and throw this config at them and then do an end to end test with no hassles and in 5 seconds.

I need a way to orchestrate that config data update to happen when I need it to happen – and not when Puppet runs again – and ideally it has to be quick, not at the pace that Puppet manages my 600 resources. Services should let me introspect them to figure out how to update their data and a generic updater should be able to update all my services that match this flow.

I’ve never really solved the last 3 points with my Puppet workflows for anything I work on, it’s a fiendishly complex problem to solve correctly. Everyone does it with Vagrant instances or ever more complex environments. Or they do their change, commit it, make sure there is test coverage and only get feedback later when something like Beaker has run. This is way too slow for me in this scenario. I just want to block 1 annoying host. Vagrant especially does not work for me as I refuse to run things on my desktop or laptop, I develop on VMs that are remote, so Vagrant isn’t an option. Additionally Vagrant environments become so complex, basically a whole new environment, yet built in annoyingly different ways so that keeping them matched with Production can be a challenge – or just prohibitively slow if you’re building them out with Puppet. So you end up again not testing in an environment that’s remotely production like.

These are pretty major things that I’ve never been able to solve to my liking with Puppet. I’ve first moved a bunch of my web sites then bind and now rbldnsd to Docker and think I’ve managed to come up with a workflow and toolchain that solves this for me.

Desired outcome

So maybe to demonstrate what I am after I should show what I want the outcome to look like. Here’s a rbldnsd dev session. I want to block *.mailingliststart.com, specifically I saw sh8.mailingliststart.com in my logs. I want to test the hosts are going to be blocked correctly before pushing to prod or even committing to git – it’s so embarrassing to make fix commits for obvious dumb things :P

So I add to the zones/bl file:

.mailingliststart.com :127.0.0.2:Excessive spam from this host

The dev session then looks like this:

$ vi zones/bl
$ rake test:host
Host name to test: sh8.mailingliststart.com
Testing sh8.mailingliststart.com
 
Starting the rbldnsd container...
>>> Testing black list
docker exec rbldnsd dig -p 5301 +noall +answer any sh8.mailingliststart.com.senderhost.bl.rbl @localhost
sh8.mailingliststart.com.senderhost.bl.rbl. 2100 IN A 127.0.0.2
sh8.mailingliststart.com.senderhost.bl.rbl. 2100 IN TXT "Excessive spam from this host"
 
>>> Testing white list
.
.
.
 
Removing the rbldnsd container...
$ git commit zones -m 'block mailingliststart.com'
$ git push origin master

Here I added the bits to the config file and want to be sure the hostname I saw in my logs/headers will actually be blocked. The test run does the following:

  • It prepares the latest container by default and mounts my working directory into the container with -v ${PWD}:/service.
  • Container starts up just like it would in production using the same bits that are running in production – but reads the new uncommitted config
  • It uses dig to query the running rbldnsd and run any in-built validation steps the service has (this container has none yet)
  • Cleans up everything

The whole thing takes about 4 seconds on a virtual machine running on VirtualBox on a circa 2009 Mac. I saw the host was blacklisted and not somehow also whitelisted, looks good, commit and push.
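The steps listed above can be sketched in plain shell. This is a guess at the shape of such a task, not the actual Rakefile: the function names, the detached docker run and the fixed container name are assumptions, while the image name, the -v mount, the port and the dig arguments are the ones shown in this post.

```shell
# Sketch only: start a throwaway rbldnsd container against the working
# directory, query it once with dig, then remove it again.

IMAGE="ripienaar/rbldnsd"   # image name as used in this post
ZONE="senderhost.bl.rbl"    # zone suffix seen in the dig output above

# Build the DNS name to look up: <host>.<zone>
rbl_query_name() {
  printf '%s.%s\n' "$1" "$2"
}

# Hypothetical equivalent of one test:host run for a single host name.
test_host() {
  host="$1"
  # Mount the uncommitted working copy over /service.
  docker run -d --name rbldnsd -v "${PWD}:/service" "${IMAGE}" >/dev/null || return 1
  docker exec rbldnsd dig -p 5301 +noall +answer any \
    "$(rbl_query_name "${host}" "${ZONE}")" @localhost
  status=$?
  # Always clean up, even when the query failed.
  docker rm -f rbldnsd >/dev/null
  return "${status}"
}
```

Calling test_host sh8.mailingliststart.com would issue the same query as in the session above. Note that dig exits 0 even when no record comes back, so a real test would also have to inspect the answer, and it may need to wait briefly for the daemon to start listening.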

Once pushed a webhook triggers my update orchestration and the running containers get the new config files only. The whole edit, test and deploy process takes less than a minute. The data though is in git which means tonight when my containers get rebuilt from fresh they will get this change baked in and rolled out as new instances.

For comparison here’s what my bind container does – where I want to be sure my configs are correct against the version of Bind in production and my zones are free of error:

$ time rake test
docker run -ti --rm -v /home/rip/work/docker_bind:/srv/named -e TEST=1 ripienaar/bind
>> Checking named.conf syntax in master mode
>> Checking named.conf syntax in slave mode
>> Checking zones..
rake test  0.18s user 0.33s system 7% cpu 3.858 total

Same basic story – I know I have no config errors and zones will load correctly against the exact same version running in production without even committing my code. And it’s very quick feedback, always less than 5 seconds.
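How the image reacts to TEST=1 is not shown here. One plausible way to get this behaviour, purely as an assumption, is an entrypoint that runs the validation script and exits when TEST is set, and starts the service otherwise. The VALIDATE_CMD and START_CMD variables and their default paths are placeholders for illustration, not the real layout of the bind image.

```shell
# Sketch only: entrypoint logic that switches between "test" and "serve".
# VALIDATE_CMD and START_CMD are hypothetical overrides with placeholder
# default paths.
entrypoint() {
  if [ -n "${TEST:-}" ]; then
    # Test mode: run the checks and hand their exit status back to docker run.
    "${VALIDATE_CMD:-/srv/support/bin/validate.sh}"
  else
    # Normal mode: replace the shell with the long-running service.
    exec "${START_CMD:-/srv/support/bin/start.sh}"
  fi
}
```

With something like this, docker run -e TEST=1 exits with the status of the checks, which is what lets a rake task or a CI job fail on a broken zone.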

Implementation Details

I won’t go into all the Dockerfile details, it’s just normal stuff. The image building and running of containers is not exciting. The layout of the service is something like this:

/service/bin/start.sh
/service/bin/update.sh
/service/bin/validate.sh
/service/zones/{bl,gl,wl}
/opt/rbldnsd-0.997a/rbldnsd

What is exciting is that I can introspect a running container. The Dockerfile has lines like this:

ENV UPDATE_METHOD /service/bin/update.sh
ENV VALIDATE_METHOD /service/bin/validate.sh

And an external tool can find out how this container likes to be updated or validated – and later monitored:

$ docker inspect rbldnsd
.
.
        "Env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "UPDATE_METHOD=/service/bin/update.sh",
            "VALIDATE_METHOD=/service/bin/validate.sh",
            "GIT_REF=fa9dd19d93e6d6cb7d5b2ebdc57f99cd2906df6f"
        ],

My update webhook basically just does this:

mco rpc docker runtime_update container=rbldnsd -S 'container("rbldnsd").present=1' --batch 1

So I use mcollective to target an update operation on all machines that run the rbldnsd container – 1 at a time. The mcollective agent uses docker inspect to introspect the container. Once it knows how the container wants to be updated it calls that command using docker exec.
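The inspect-then-exec step can be sketched in shell. This is not the mcollective agent's code: env_value and run_method are hypothetical helpers, and the only things taken from this post are the ENV variable names and the use of docker inspect and docker exec.

```shell
# Sketch only: find a container's declared method and run it.

# Print the value of KEY from KEY=VALUE lines on stdin; fail if absent.
env_value() {
  key="$1"
  while IFS= read -r line; do
    case "${line}" in
      "${key}="*) printf '%s\n' "${line#*=}"; return 0 ;;
    esac
  done
  return 1
}

# Hypothetical helper, e.g.: run_method rbldnsd UPDATE_METHOD
run_method() {
  container="$1"
  method="$2"
  # Print the container's environment one entry per line and pick out
  # the requested variable.
  cmd="$(docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' \
    "${container}" | env_value "${method}")" || {
    echo "${container} does not declare ${method}" >&2
    return 1
  }
  docker exec "${container}" "${cmd}"
}
```

The point of the design is that the caller only needs the container name and the name of the action; the path of the script stays inside the image.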

Outcome Summary

For me this turned out to be a huge win. I had to do a lot of work on the image building side of things, the orchestration, deployment etc – things I had to do with Puppet too anyway. But this basically ticks all the boxes for me that I had in the beginning of this post and quite a few more:

  • A reasonable facsimile of the package, config, service triplet that yields idempotent builds
  • A comfortable way to develop and test my changes locally with instant feedback like I would with unit tests for normal code but for integration tests of infrastructure components using the same bits as in production.
  • An approach where my services are standalone and they all have to think about their run, update and validation cadences. With those being introspectable and callable from the outside.
  • My services are standalone artefacts and versioned as a whole. Not spread around the place on machines, in package repos, in data and in CM code that attempts to tie it all together. It’s one thing, from one git repo, stored in one place with a version.
  • With validation built into the container and the container being a runnable artefact I get to do this during CI before rolling anything out just like I do on my CLI. And always the actual bits in use or proposed to be used in Production are used.
  • Overall I have a lot more confidence in my production changes now than I had with the Puppet workflow.
  • Changes can be rolled out to running containers very rapidly – less than 10 seconds and not at the slow Puppet run pace.
  • My dev environment is hugely simplified yet much more flexible as I can run current, past and future versions of anything. With less complexity.
  • Have a very nice middle ground between immutable server and the need for updating content. Containers are still rebuilt and redeployed every night on schedule and they are still disposable but not at the cost of day to day updates.

I’ve built this process into a number of containers now some like this that are services and even some web ones like my wiki where I edit markdown files and they get rolled out to the running containers immediately on push.

I still have some way to go with monitoring and these services are standalone and not complex multi-component ones but I don’t foresee huge issues with those.

I couldn’t solve this with all these outcomes without a rapid way to stand up and destroy production environments that are isolated from my machine I am developing on. Especially if the final service is some loosely coupled combination of parts from many different sources. I’d love to talk to people who think they have something approaching this without using Docker or similar and be proven wrong but for now, this is a huge step forward for me.

So Puppet and Chef are irrelevant now?

Getting back to the Puppet part of this post. I could come up with some way to mix Puppet in here too. There are though other interesting aspects about the Docker life cycle that I might blog about later which I think make it a bit of a square peg in a round hole to combine these two tools. Especially I think people today who think they should use Puppet to build containers or configure containers are a bit misguided and missing out. I hope they keep working on that though and get somewhere interesting, because omfg Dockerfiles, but I don’t think the current attempts are interesting.

It kind of gets back to the old thing where it turns out Puppet is not a good choice to manage deployments of Applications but it’s ok for Infrastructure. I am reconsidering what is infrastructure and what are applications.

So I chose to rethink things from the ground up – how would a nameserver service look if I considered it an Application and not Infrastructure, and how should an Application development life cycle around that service look?

This is not a new realisation for me, I’ve often wished and expressed the desire that Puppet Labs should focus a lot more on the workflow and the development cycle and work on providing tools and hooks for that and think about how to make that better, I don’t think that’s really happened. So the conclusion for me was that for this Application or Service development and deployment life cycle Puppet was the wrong tool. I also realise I don’t even remotely resemble their paying target audience.

I am also not saying Puppet or by extension Chef or other CM tools are irrelevant due to Docker, that’s just madness. I think there’s a place where the 2 worlds meet and for me I am starting to notice that a lot of what I thought was Infrastructure is actually Applications, and these have different development and deployment needs which CM and Puppet especially do not address.

Soon there will not be a single mention of DNS related infrastructure in my Puppet code. The container and related files are about equal in complexity and lines of code to what was in Puppet, the final outcome is about the same and it’s as configurable to my environments. The workflow though is massively improved because now I have the advantages that Application developers had for this piece of Infrastructure. Finally a much larger part of the Infrastructure As Code puzzle is falling together and it actually feels like I am working on code with the same feedback cycles and single verifiable artefact outcomes. And that’s pretty huge. Infrastructure is still being CM managed – I just hope to have a radically reduced Infrastructure footprint.

The big take away here isn’t that Docker is some technological magic bullet killing off vast parts of the existing landscape or destroying a subset of tools like CM completely. It brings workflow and UX improvements that are pretty unique and well worth exploring. And this is especially an area the CM folk have basically just not focussed on. The single biggest win is probably the single artefact aspect as this enables everything I mentioned here.

It also brings a lot of other things from the daemon side – the API, the events, the stats etc that I didn’t talk about here and those are very big deals too wrt what future work they enable. But that’s for future posts.

Technically I think I have a lot of bad things to say about almost every aspect of Docker but those are outweighed by this rapid feedback and increased overall confidence in making change at the pace I would like to.

Jenkins, Puppet, Graphite, Logstash and YOU

This is a repost of an article I wrote for the Acquia Blog some time ago.

As mentioned before, devops can be summarized by talking about culture, automation, monitoring metrics and sharing. Although devops is not about tooling, there are a number of open source tools out there that will be able to help you achieve your goals. Some of those tools will also enable better communication between your development and operations teams.

When we talk about Continuous Integration and Continuous Deployment we need a number of tools to help us there. We need to be able to build reproducible artifacts which we can test. And we need a reproducible infrastructure which we can manage in a fast and sane way. To do that we need a Continuous Integration framework like Jenkins.

Formerly known as Hudson, Jenkins has been around for a while. The open source project was initially very popular in the Java community but has now gained popularity in different environments. Jenkins allows you to create reproducible Build and Test scenarios and perform reporting on those. It will provide you with a uniform and managed way to Build, Test, Release and Trigger the deployment of new Artifacts, both traditional software and infrastructure as code-based projects. Jenkins has a vibrant community that builds new plugins for the tool in different kinds of languages. People use it to build their deployment pipelines, automatically check out new versions of the source code, syntax test it and style test it. If needed, users can compile the software, trigger unit tests and upload a tested artifact into a repository so it is ready to be deployed on a new platform level.

Jenkins then can trigger an automated way to deploy the tested software on its new target platform. Whether that be development, testing, user acceptance or production is just a parameter. Deployment should not be something we try first in production, it should be done the same on all platforms. The deltas between these platforms should be managed using a configuration management tool such as Puppet, Chef or friends.

In a way this means that Infrastructure as code is a testing dependency, as you also want to be able to deploy a platform to exactly the same state as it was before you ran your tests, so that you can compare the test results of your test runs and make sure they are correct. This means you need to be able to control the starting point of your test and tools like Puppet and Chef can help you here. Which tool you use is the least important part of the discussion, as the important part is that you adopt one of the tools and start treating your infrastructure the same way as you treat your code base: as a tested, stable, reproducible piece of software that you can deploy over and over in a predictable fashion.

Configuration management tools such as Puppet, Chef, CFengine are just a part of the ecosystem and integration with Orchestration and monitoring tools is needed as you want feedback on how your platform is behaving after the changes have been introduced. Lots of people measure the impact of a new deploy, and then we obviously move to the M part of CAMS.

There, Graphite is one of the most popular tools to store metrics. Plenty of other tools in the same area tried to go where Graphite is going, but when it comes to flexibility, scalability and ease of use, not many tools allow developers and operations people to build dashboards for any metric they can think of in a matter of seconds.

Just sending a keyword, a timestamp and a value to the Graphite platform provides you with a large choice of actions that can be done with that metric. You can graph it, transform it, or even set an alert on it. Graphite takes out the complexity of similar tools together with an easy to use API for developers so they can integrate their own self service metrics into dashboards to be used by everyone.

One last tool that deserves our attention is Logstash. It was initially just a tool to aggregate, index and search the log files of our platform, which are sometimes a huge missed source of relevant information about how our applications behave. Logstash and its Kibana+ElasticSearch ecosystem are now quickly evolving into a real time analytics platform, implementing the Collect, Ship+Transform, Store and Display pattern we see emerge a lot in the #monitoringlove community. Logstash now allows us to turn boring old logfiles that people only started searching upon failure into valuable information that is being used by product owners and business managers to learn about the behavior of their users.

Together with the Graphite-based dashboards we mentioned above, these tools help people start sharing their information and communicate better. When thinking about these tools, think about what you are doing, what goals you are trying to reach and where you need to improve. Because after all, devops is not solving a technical problem, it's trying to solve a business problem and bringing better value to the end user at a more sustainable pace. And in that way the biggest tool we need to use is YOU, as the person who enables communication.