Containers will not fix your broken culture

From DevOps, Docker, and Empathy by Jérôme Petazzoni (title shamelessly stolen from a talk by Bridget Kromhout):

The DevOps movement is more about a culture shift than embracing a new set of tools. One of the tenets of DevOps is to get people to talk together.

Implementing containers won’t give us DevOps.

You can’t buy DevOps by the pound, and it doesn’t come in a box, or even in intermodal containers.

It’s not just about merging “Dev” and “Ops,” but also getting these two to sit at the same table and talk to each other.

Docker doesn’t enforce these things (I pity the fool who preaches or believes it) but it gives us a table to sit at, and a common language to facilitate the conversation. It’s a tool, just a tool indeed, but it helps people share context and thus understanding.

The Choria Emulator

In my previous posts I discussed what goes into load testing a Choria network, what connections are made, subscriptions are made etc.

From this it’s obvious the things we should be able to emulate are:

  • Connections to NATS
  • Subscriptions – which implies number of agents and sub collectives
  • Message payload sizes

To make it realistically affordable to emulate many more machines that I have I made an emulator that can start numbers of Choria daemons on a single node.

I’ve been slowly rewriting MCollective daemon side in Go which means I already had all the networking and connectors available there, so a daemon was written:

usage: choria-emulator --instances=INSTANCES [<flags>]
 
Emulator for Choria Networks
 
Flags:
      --help                 Show context-sensitive help (also try --help-long and --help-man).
      --version              Show application version.
      --name=""              Instance name prefix
  -i, --instances=INSTANCES  Number of instances to start
  -a, --agents=1             Number of emulated agents to start
      --collectives=1        Number of emulated subcollectives to create
  -c, --config=CONFIG        Choria configuration file
      --tls                  Enable TLS on the NATS connections
      --verify               Enable TLS certificate verifications on the NATS connections
      --server=SERVER ...    NATS Server pool, specify multiple times (eg one:4222)
  -p, --http-port=8080       Port to listen for /debug/vars

You can see here it takes a number of instances, agents and collectives. The instances will all respond with ${name}-${instance} on any mco ping or RPC commands. It can be discovered using the normal mc discovery – though only supports agent and identity filters.

Every instance will be a Choria daemon with the exact same network connection and NATS subscriptions as real ones. Thus 50 000 emulated Choria will put the exact same load of work on your NATS brokers as would normal ones, performance wise even with high concurrency the emulator performs quite well – it’s many orders of magnitude faster than the ruby Choria client anyway so it’s real enough.

The agents they start are all copies of this one:

emulated0
=========
 
Choria Agent emulated by choria-emulator
 
      Author: R.I.Pienaar <rip@devco.net>
     Version: 0.0.1
     License: Apache-2.0
     Timeout: 120
   Home Page: http://choria.io
 
   Requires MCollective 2.9.0 or newer
 
ACTIONS:
========
   generate
 
   generate action:
   ----------------
       Generates random data of a given size
 
       INPUT:
           size:
              Description: Amount of text to generate
                   Prompt: Size
                     Type: integer
                 Optional: true
            Default Value: 20
 
 
       OUTPUT:
           message:
              Description: Generated Message
               Display As: Message

You can this has a basic data generator action – you give it a desired size and it makes you a message that size. It will run as many of these as you wish all called like emulated0 etc.

It has an mcollective agent that go with it, the idea is you create a pool of machines all with your normal mcollective on it and this agent. Using that agent then you build up a different new mcollective network comprising the emulators, federation and NATS.

Here’s some example of commands – you’ll see these later again when we talk about scenarios:

We download the dependencies onto all our nodes:

$ mco playbook run setup-prereqs.yaml --emulator_url=https://example.net/rip/choria-emulator-0.0.1 --gnatsd_url=https://example.net/rip/gnatsd --choria_url=https://example.net/rip/choria

We start NATS on our first node:

$ mco playbook run start-nats.yaml --monitor 8300 --port 4300 -I test1.example.net

We start the emulator with 1500 instances per node all pointing to our above NATS:

$ mco playbook run start-emulator.yaml --agents 10 --collectives 10 --instances 750 --monitor 8080 --servers 192.168.1.1:4300

You’ll then setup a client config for the built network and can interact with it using normal mco stuff and the test suite I’ll show later. Simularly there are playbooks to stop all the various parts etc. The playbooks just interact with the mcollective agent so you could use mco rpc directly too.

I found I can easily run 700 to 1000 instances on basic VMs – needs like 1.5GB RAM – so it’s fairly light. Using 400 nodes I managed to build a 300 000 node Choria network and could easily interact with it etc.

Finally I made a ec2 environment where you can stand up a Puppet Master, Choria, the emulator and everything you need and do load tests on your own dime. I was able to do many runs with 50 000 emulated nodes on EC2 and the whole lot cost me less than $20.

The code for this emulator is very much a work in progress as is the Go code for the Choria protocol and networking but the emulator is here if you want to take a peek.

Choria Update

Recently at Config Management Camp I’ve had many discussions about Orchestration, Playbooks and Choria, I thought it’s time for another update on it’s status.

I am nearing version 1.0.0, there are a few things to deal with but it’s getting close. Foremost I wanted to get the project it’s own space on all the various locations like GitHub, Forge, etc.

Inevitably this means getting a logo, it’s been a bit of a slog but after working through loads of feedback on Twitter and offers for assistance from various companies I decided to go to a private designer called Isaac Durazo and the outcome can be seen below:


 

The process of getting the logo was quite interesting and I am really pleased with the outcome, I’ll blog about that separately.

Other than the logo the project now has it’s own GitHub organisation at https://github.com/choria-io and I have moved all the forge modules to it’s own space as well https://forge.puppet.com/choria.

There are various other places the logo show up like in the Slack notifications and so forth.

On the project front there’s a few improvements:

  • There is now a registration plugin that records a bunch of internal stats on disk, the aim is for them to be read by Collectd and Sensu
  • A new Auditing plugin that emits JSON structured data
  • Several new Data Stores for Playbooks – files, environment.
  • Bug fixes on Windows
  • All the modules, plugins etc have moved to the Choria Forge and GitHub
  • Quite extensive documentation site updates including branding with the logo and logo colors.

There is now very few things left to do to get 1.0.0 out but I guess another release or two will be done before then.

So from now to update to coming versions you need to use the choria/mcollective_choria module which will pull in all it’s dependencies from the Choria project rather than my own Forge.

Still no progress on moving the actual MCollective project forward but I’ve discussed a way to deal with forking the various projects in a way that seems to work for what I want to achieve. In reality I’ll only have time to do that in a couple of months so hopefully something positive will happen in the mean time.

Head over to Choria.io to take a look.

Choria Playbooks

Today I am very pleased to release something I’ve been thinking about for years and actively working on since August.

After many POCs and thrown away attempts at this over the years I am finally releasing a Playbook system that lets you run work flows on your MCollective network – it can integrate with a near endless set of remote services in addition to your MCollective to create a multi service playbook system.

This is a early release with only a few integrations but I think it’s already useful and I’m looking for feedback and integrations to build this into something really powerful for the Puppet eco system.

The full docs can be found on the Choria Website, but below you can get some details.

Overview


Today playbooks are basic YAML files. They do not have a pseudo programming language in them though I am not against the idea. Eventually I envision a Service to execute playbooks on your behalf, but today you just run them in your shell. I do not anticipate YAML to be the end format of playbooks but it’s good enough for today.

Playbooks have a basic flow that is more or less like this:

  1. Discover named Node Sets
  2. Validate the named Node Sets meet expectations such as reachability and versions of software available on them
  3. Run a pre_book task list that lets you do prep work
  4. Run the main tasks task list where you do your work, around every task certain hook lists can be run
  5. Run either the on_success or on_fail task list for notification of Slacks etc
  6. Run the post_book task list for cleanups etc

Today a task can be a MCollective request, a shell script or a Slack notification. I imagine this list will grow huge, I am thinking you will want to ping webhooks, or interact with Razor to provision machines and wait for them to finish building, run Terraform or make EC2 API requests. This list of potential integrations is endless and you can use any task in any of the above task lists.

A Node Set is simply a named set of nodes, in MCollective that would be certnames of nodes but the playbook system itself is not limited to that. Today Node Sets can be resolved from MCollective Discovery, PQL Queries (PuppetDB), YAML files with groups of nodes in them or a shell command. Again the list of integrations that make sense here is huge. I imagine querying PE or Foreman for node groups, querying etcd or Consul for service members. Talking to random REST services that return node lists or DB queries. Imagine using Terraform outputs as Node Set sources or EC2 API queries.

In cases where you wish to manage nodes via MCollective but you are using a cached discovery source you can ask node sets to be tested for reachability over MCollective. And node sets that need certain MCollective agents can express this desire as SemVer version ranges and the valid network state will be asserted before any playbook is run.

Example


I’ll show an example here of what I think you will be able to achieve using these Playbooks.

Here we have a web stack and we want to do Blue/Green deploys against it, sub clusters have a fact cluster. The deploy process for a cluster is:

  • Gather input from the user such as cluster to deploy and revision of the app to deploy
  • Discover the Haproxy node using Node Set discovery from PQL queries
  • Discover the Web Servers in a particular cluster using Node Set discovery from PQL queries
  • Verify the Haproxy nodes and Web Servers are reachable and running the versions of agents we need
  • Upgrade the specific web tier using:
    1. Tell the ops room on slack we are about to upgrade the cluster
    2. Disable puppet on the webservers
    3. Wait for any running puppet runs to stop
    4. Disable the nodes on a particular haproxy backend
    5. Upgrade the apps on the servers using appmgr#upgrade to the input revision
    6. Do up to 10 NRPE checks post upgrade with 30 seconds between checks to ensure the load average is GREEN, you’d use a better check here something app specific
    7. Enable the nodes in haproxy once NRPE checks pass
    8. Fetch and display the status of the deployed app – like what version is there now
    9. Enable Puppet

Should the task list all FAIL we run these tasks:

  1. Call a webhook on AWS Lambda
  2. Tell the ops room on slack
  3. Run a whole other playbook called deploy_failure_handler with the same parameters

Should the task list PASS we run these tasks:

  1. Call a webhook on AWS Lambda
  2. Tell the ops room on slack

This example and sample playbooks etc can be found on the Choria Site.

Status


Above is the eventual goal. Today the major missing piece here that I think MCollective needs to be extended with the ability for Agent plugins to deliver a Macro plugin. A macro might be something like Puppet.wait_till_idle(:timeout => 600), this would be something you call after disabling the nodes and you want to be sure Puppet is making no more changes, you can see the workflow above needs this.

There is no such Macros today, I will add a stop gap solution as a task that waits for a certain condition but adding Macros to MCollective is high on my todo list.

Other than that it works, there is no web service yet so you run them from the CLI and the integrations listed above is all that exist, they are quite easy to write so hoping some early adopters will either give me ideas or send PRs!

This is available today if you upgrade to version 0.0.12 of the ripienaar-mcollective_choria module.

Again see the Choria Website for much more details on this feature.

Puppet Query Language

For a few releases now PuppetDB had a new query language called Puppet Query Language or PQL for short. It’s quite interesting, I thought a quick post might make a few more people aware of it.

Overview


To use it you need a recent PuppetDB and as this is quite a new feature you really want the latest PuppetDB. There is nothing to enable when you install it the feature is already active. The feature is marked as experimental so some things will change as it moves to production.

PQL Queries look more or less like this:

nodes { certname ~ 'devco' }

This is your basic query it will return a bunch of nodes, something like:

[
  {
    "deactivated": null,
    "latest_report_hash": null,
    "facts_environment": "production",
    "cached_catalog_status": null,
    "report_environment": null,
    "latest_report_corrective_change": null,
    "catalog_environment": "production",
    "facts_timestamp": "2016-11-01T06:42:15.135Z",
    "latest_report_noop": null,
    "expired": null,
    "latest_report_noop_pending": null,
    "report_timestamp": null,
    "certname": "devco.net",
    "catalog_timestamp": "2016-11-01T06:42:16.971Z",
    "latest_report_status": null
  }
]

There are a bunch of in-built relationships between say a node and it’s facts and inventory, so queries can get quite complex:

inventory[certname] { 
  facts.osfamily = "RedHat" and
  facts.dc = "linodeldn" and
  resources { 
    type = "Package" and
    title = "java" and
    parameters.ensure = "1.7.0" 
  } 
}

This finds all the RedHat machines in a particular DC with Java 1.7.0 on them. Be aware this will also find machines that are deactivated.

I won’t go into huge details of the queries, the docs are pretty good – examples, overview.

So this is quite interesting, this finally gives us a reasonably usable DB to do queries that mcollective discovery used to be used for – but of course its not a live view nor does it have any clue what the machines are up to but as a cached data source for discovery this is interesting.

Using


CLI


You can of course query this stuff on the CLI and I suggest you familiarise yourself with JQ.

First you’ll have to set up your account:

{
  "puppetdb": {
    "server_urls": "https://puppet:8081",
    "cacert": "/home/rip/.puppetlabs/etc/puppet/ssl/certs/ca.pem",
    "cert": "/home/rip/.puppetlabs/etc/puppet/ssl/certs/rip.mcollective.pem",
    "key": "/home/rip/.puppetlabs/etc/puppet/ssl/private_keys/rip.mcollective.pem"
  }
}

This is in ~/.puppetlabs/client-tools/puppetdb.conf which is a bit senseless to me since there clearly is a standard place for config files, but alas.

Once you have this and you installed the puppet-client-tools package you can do queries like:

$ puppet query "nodes { certname ~ 'devco.net' }"

Puppet Code


Your master will have the puppetdb-termini package on it and this brings with it Puppet functions to query PuppetDB so you do not need to use a 3rd party module anymore:

$nodes = puppetdb_query("nodes { certname ~ 'devco' }")

Puppet Job


At the recent PuppetConf Puppet announced that their enterprise tool puppet job supports using this as discovery, if I remember right it’s something like:

$ puppet job run -q 'nodes { certname ~ 'devco' }'

MCollective


At PuppetConf I integrated this into MCollective and my Choria tool, both these are still due a release (MCO-776, choria #61):

Run Puppet on all the nodes matched by the query:

$ puppet query "nodes { certname ~ 'devco.net' }"|mco rpc puppet runonce

The above is a bit limited in that the apps in question have to specifically support this kind of STDIN discovery – the rpc app does.

I then also added support to the Choria CLI:

$ mco puppet runonce -I "pql:nodes[certname] { certname ~ 'devco.net' }"

These queries are a bit special in that they must return just the certname as here, I’ll document this up. The reason for this is that they are actually part of a much larger query done in the Choria discovery system (that uses PQL internally and is a good intro on how to query this API from code).

Here’s an example of a complex query – as used by Choria internally – that limits nodes to ones in a particular collective, our PQL query who have mcollective installed and running. You can see you can nest and combine queries into quite complex ones:

nodes[certname, deactivated] { 
  # finds nodes in the chosen inventory via a fact
  (certname in inventory[certname] { 
    facts.mcollective.server.collectives.match("d+") = "mcollective" 
  }) and 
 
  # does the supplied PQL query
  (certname in nodes[certname] {
    certname ~ 'devco.net'
  }) and
 
  # limited to machines with mcollective installed
  (resources {
    type = "Class" and title = "Mcollective"
  }) and 
 
  # who also have the service started
  (resources {
    type = "Class" and title = "Mcollective::Service"
  }) 
}

Conclusion


This is really handy and I hope more people will become familiar with it. I don’t think this quite rolls off the fingers easily – but neither does SQL or any other similar system so par for the course. What is amazing is that we can get nearer to having a common language across CLI, Code, Web UIs and 3rd party tools for describing queries of our estate so this is a major win.

Puppet 4 Sensitive Data Types

You often need to handle sensitive data in manifests when using Puppet. Private keys, passwords, etc. There has not been a native way to deal with these and so a cottage industry of community tools have spring up.

To deal with data at rest various Hiera backends like the popular hiera-eyaml exist, to deal with data on nodes a rather interesting solution called binford2k-node_encrypt exist. There are many more but less is more, these are good and widely used.

The problem is data leaks all over the show in Puppet – diffs, logs, reports, catalogs, PuppetDB – it’s not uncommon for this trusted data to show up all over the place. And dealing with this problem is a huge scope issue that will require adjustments to every component – Puppet, Hiera / Lookup, PuppetDB, etc.

But you have to start somewhere and Puppet is the right place, lets look at the first step.

Sensitive[T]


Puppet 4.6.0 introduce – and 4.6.1 fixed – a new data type that decorates other data telling the system it’s sensitive. And this data cannot by accident become logged or leaked since the type will only return a string indicating it’s redacted.

It’s important to note this is step one of a many step process towords having a unified blessed way of dealing with Sensitive data all over. But lets take a quick look at them. The official specification for this feature lives here.

In the most basic case we can see how to make sensitive data, how it looks when logged or leaked by accident:

$secret = Sensitive("My Socrates Note")
notice($secret)

This prints out the following:

Notice: Scope(Class[main]): Sensitive [value redacted]

To unwrap this and gain access to the real original data:

$secret = Sensitive(hiera("secret"))
 
$unwrapped = $secret.unwrap |$sensitive| { $sensitive }
notice("Unwrapped: ${unwrapped}")
 
$secret.unwrap |$sensitive| { notice("Lambda: ${sensitive}") }

Here you can see how to assign it unwrapped to a new variable or just use it in a block. Important to note you should never print these values like this and ideally you’d only ever use them inside a lambda if you have to use them in .pp code. Puppet has no concept of private variables so this $unwrapped variable could be accessed from outside of your classes. A lambda scope is temporary and private.

The output of above is:

Notice: Scope(Class[main]): Unwrapped: Too Many Secrets
Notice: Scope(Class[main]): Lambda: Too Many Secrets

So these are the basic operations, you can now of course pass the data around classes.

class mysql (
  Sensitive[String] $root_pass
) {
  # somehow set the password
}
 
class{"mysql":
  root_pass => Sensitive(hiera("mysql_root"))
}

Note here you can see the class specifically wants a String that is sensitive and not lets say a Number using the Sensitive[String] markup. And if you attempted to pass Sensitive(1) into it you’d get a type missmatch error.

Conclusion


So this appears to be quite handy, you can see down the line that lookup() might have a eyaml like system and emit Sensitive data directly and perhaps some providers and types will support this. But as I said it’s early days so I doubt this is actually useful yet.

I mentioned how other systems like PuppetDB and so forth also need updates before this is useful and indeed today PuppetDB is oblivious to these types and stores the real values:

$ puppet query 'resources[parameters] { type = "Class" and title = "Test" }'
...
  {
    "parameters": {
      "string": "My Socrates Note"
    }
  },
...

So this really does not yet serve any purpose but as a step one it’s an interesting look at what will come.

Will containers take over ?

and if so why haven't they done so yet ?

Unlike many people think, containers are not new, they have been around for more than a decade, they however just became popular for a larger part of our ecosystem. Some people think containers will eventually take over.

Imvho It is all about application workloads, when 8 years ago I wrote about a decade of open source virtualization, we looked at containers as the solution for running a large number of isolated instances of something on a machine. And with large we meant hundreds or more instances of apache, this was one of the example use cases for an ISP that wanted to give a secure but isolated platform to his users. One container per user.

The majority of enterprise usecases however were full VM's Partly because we were still consolidating existing services to VM's and weren't planning on changing the deployment patterns yet. But mainly because most organisations didn't have the need to run 100 similar or identical instances of an application or a service, they were going from 4 bare metal servers to 40 something VM's but they had not yet come to the need to run 100's of them. The software architecture had just moved from FatClient applications that talked directly to bloated relational databases containing business logic, to web enabled multi-tier
applications. In those days when you suggested to run 1 Tomcat instance per VM because VM's were cheap and it would make management easier, (Oh oops I shut down the wrong tomcat instance) , people gave you very weird looks

Slowly software architectures are changing , today the new breed of applications is small, single function, dedicated, and it interacts frequently with it's peers, together combined they provide similar functionality as a big fat application 10 years ago, But when you look at the market that new breed is a minority. So a modern application might consist of 30-50 really small ones, all with different deployment speeds. And unlike 10 years ago where we needed to fight hard to be able to build both dev, acceptance and production platforms, people now consider that practice normal. So today we do get environments that quickly go to 100+ instances , but requiring similar CPU power as before, so the use case for containers like we proposed it in the early days is now slowly becoming a more common use case.

So yes containers might take over ... but before that happens .. a lot of software architectures will need to change, a lot of elephants will need to be sliced, and that is usually what blocks cloud, container, agile and devops adoption.

Will containers take over ?

and if so why haven't they done so yet ?

Unlike many people think, containers are not new, they have been around for more than a decade, they however just became popular for a larger part of our ecosystem. Some people think containers will eventually take over.

Imvho It is all about application workloads, when 8 years ago I wrote about a decade of open source virtualization, we looked at containers as the solution for running a large number of isolated instances of something on a machine. And with large we meant hundreds or more instances of apache, this was one of the example use cases for an ISP that wanted to give a secure but isolated platform to his users. One container per user.

The majority of enterprise usecases however were full VM's Partly because we were still consolidating existing services to VM's and weren't planning on changing the deployment patterns yet. But mainly because most organisations didn't have the need to run 100 similar or identical instances of an application or a service, they were going from 4 bare metal servers to 40 something VM's but they had not yet come to the need to run 100's of them. The software architecture had just moved from FatClient applications that talked directly to bloated relational databases containing business logic, to web enabled multi-tier
applications. In those days when you suggested to run 1 Tomcat instance per VM because VM's were cheap and it would make management easier, (Oh oops I shut down the wrong tomcat instance) , people gave you very weird looks

Slowly software architectures are changing , today the new breed of applications is small, single function, dedicated, and it interacts frequently with it's peers, together combined they provide similar functionality as a big fat application 10 years ago, But when you look at the market that new breed is a minority. So a modern application might consist of 30-50 really small ones, all with different deployment speeds. And unlike 10 years ago where we needed to fight hard to be able to build both dev, acceptance and production platforms, people now consider that practice normal. So today we do get environments that quickly go to 100+ instances , but requiring similar CPU power as before, so the use case for containers like we proposed it in the early days is now slowly becoming a more common use case.

So yes containers might take over ... but before that happens .. a lot of software architectures will need to change, a lot of elephants will need to be sliced, and that is usually what
blocks cloud, container, agile and devops adoption.