50 000 Node Choria Network

I’ve been saying for a while now my aim with Choria is that someone can get a 50 000 node Choria network that just works without tuning, like, by default that should be the scale it supports at minimum.

I started working on a set of emulators to let you confirm that yourself – and for me to use it during development to ensure I do not break this promise – though that got a bit side tracked as I wanted to do less emulation and more just running 50 000 instances of actual Choria, more on that in a future post.

Today I want to talk a bit about a actual 50 000 real nodes deployment and how I got there – the good news is that it’s terribly boring since as promised it just works.



The network is pretty much just your typical DC network. Bunch of TOR switches, Distribution switches and Core switches, nothing special. Many dom0’s and many more domUs and some specialised machines. It’s flat there are firewalls between all things but it’s all in one building.


I have 4 machines, 3 set aside for the Choria Network Broker Cluster and 1 for a client, while waiting for my firewall ports I just used the 1 machine for all the nodes as well as the client. It’s a 8GB RAM VM with 4 vCPU, not overly fancy at all. Runs Enterprise Linux 6.

In the past I think we’d have considered this machine on the small side for a ActiveMQ network with 1000 nodes 😛

I’ll show some details of the single Choria Network Broker here and later follow up about the clustered setup.


I run a custom build of Choria 0.0.11, I bump the max connections up to 100k and turned off SSL since we simply can’t provision certificates, so a custom build let me get around all that.

The real reason for the custom build though is that we compile in our agent into the binary so the whole deployment that goes out to all nodes and broker is basically what you see below, no further dependencies at all, this makes for quite a nice deployment story since we’re a bit challenged in that regard.

$ rpm -ql choria

Other than this custom agent and no SSL we’re about on par what you’d get if you just install Choria from the repos.

Network Broker Setup

The Choria Network Broker is deployed basically exactly as the docs. Including setting the sysctl values to what was specified in the docs.

identity = choria1.example.net
logfile = /var/log/choria.log
plugin.choria.stats_address = ::
plugin.choria.stats_port = 8222
plugin.choria.network.listen_address = ::
plugin.choria.network.client_port = 4222
plugin.choria.network.peer_port = 4223

Most of this isn’t even needed basically if you use defaults like you should.

Server Setup

The server setup was even more boring:

logger_type = file
logfile = /var/log/choria.log
plugin.choria.middleware_hosts = choria1.example.net
plugin.choria.use_srv = false


So we were being quite conservative and deployed it in batches of 50 a time, you can see the graph below of this process as seen from the Choria Network Broker (click for larger):

This is all pretty boring actually, quite predictable growth in memory, go routines, cpu etc. The messages you see being sent is me doing lots of pings and rpc’s and stuff just to check it’s all going well.

$ ps -auxw|grep choria
root     22365 12.9 14.4 2879712 1175928 ?     Sl   Mar06 241:34 /usr/choria broker --config=....
# a bit later than the image above
$ sudo netstat -anp|grep 22365|grep ESTAB|wc -l


So how does work in practise? In the past we’d have had a lot of issues with getting consistency out of a network of even 10% this size, I was quite confident it was not the Ruby side, but you never know?

Well, lets look at this one, I set discovery_timeout = 20 in my client configuration:

$ mco rpc rpcutil ping --display failed
Finished processing 51152 / 51152 hosts in 20675.80 ms
Finished processing 51152 / 51152 hosts in 20746.82 ms
Finished processing 51152 / 51152 hosts in 20778.17 ms
Finished processing 51152 / 51152 hosts in 22627.80 ms
Finished processing 51152 / 51152 hosts in 20238.92 ms

That’s a huge huge improvement, and this is without fancy discovery methods or databases or anything – it’s the, generally fairly unreliable, broadcast based method of discovery. These same nodes on a big RabbitMQ cluster never gets a consistent result (and it’s 40 seconds slower), so this is a huge win for me.

I am still using the Ruby code here of course and it’s single threaded and stuck on 1 CPU, so in practise it’s going to have a hard ceiling of churning through about 2500 to 3000 replies/second, hence the long timeouts there.

I have a go based ping, it round trips this network in less than 3.5 seconds quite reliably – wow.

The broker peaked at 25Mbps at times when doing many concurrent RPC requests and pings etc, but it’s all just been pretty good with no surprises.

So, that’s about it, I really can’t complain about this.

Choria Progress Update

It’s been a while since I posted about Choria and where things are. There are major changes in the pipeline so it’s well overdue a update.

The features mentioned here will become current in the next release cycle – about 2 weeks from now.

New choria module

The current gen Choria modules grew a bit organically and there’s a bit of a confusion between the various modules. I now have a new choria module, it will consume features from the current modules and deprecate them.

On the next release it can manage:

  1. Choria YUM and APT repos
  2. Choria Package
  3. Choria Network Broker
  4. Choria Federation Broker
  5. Choria Data Adatpaters

Network Brokers

We have had amazing success with the NATS broker, lightweight, fast, stable. It’s perfect for Choria. While I had a pretty good module to configure it I wanted to create a more singular experience. Towards that there is a new Choria Broker incoming that manages an embedded NATS instance.

To show what I am on about, imagine this is all that is required to configure a cluster of 3 production ready brokers capable of hosting 50k or more Choria managed nodes on modestly specced machines:

plugin.choria.broker_network = true
plugin.choria.network.peers = nats://choria1.example.net:4223, nats://choria2.example.net:4223, nats://choria3.example.net:4223
plugin.choria.stats_address = ::

Of course there is Puppet code to do this for you in choria::broker.

That’s it, start the choria-broker daemon and you’re done – and ready to monitor it using Prometheus. Like before it’s all TLS and all that kinds of good stuff.

Federation Brokers

We had good success with the Ruby Federation Brokers but they also had issues particularly around deployment as we had to deploy many instances of them and they tended to be quite big Ruby processes.

The same choria-broker that hosts the Network Broker will now also host a new Golang based Federation Broker network. Configuration is about the same as before you don’t need to learn new things, you just have to move to the configuration in choria::broker and retire the old ones.

Unlike the past where you had to run 2 or 3 of the Federation Brokers per node you now do not run any additional processes, you just enable the feature in the singular choria-broker, you only get 1 process. Internally each run 10 instances of the Federation Broker, its much more performant and scalable.

Monitoring is done via Prometheus.

Data Adapters

Previously we had all kinds of fairly bad schemes to manage registration in MCollective. The MCollective daemon would make requests to a registration agent, you’d designate one or more nodes as running this agent and so build either a file store, mongodb store etc.

This was fine at small size but soon enough the concurrency in large networks would overwhelm what could realistically be expected from the Agent mechanism to manage.

I’ve often wanted to revisit that but did not know what approach to take. In the years since then the Stream Processing world has exploded with tools like Kafka, NATS Streaming and offerings from GPC, AWS and Azure etc.

Data Adapters are hosted in the Choria Broker and provide stateless, horizontally and vertically scalable Adapters that can take data from Choria and translate and publish them into other systems.

Today I support NATS Streaming and the code is at first-iteration quality, problems I hope to solve with this:

  • Very large global scale node metadata ingest
  • IoT data ingest – the upcoming Choria Server is embeddable into any Go project and it can exfil data into Stream Processors using this framework
  • Asynchronous RPC – replies to requests streaming into Kafka for later processing, more suitable for web apps etc
  • Adhoc asynchronous data rewrites – we have had feature requests where person one can make a request but not see replies, they go into Elastic Search


After 18 months of trying to get Puppet Inc to let me continue development on the old code base I have finally given up. The plugins are now hosted in their own GitHub Organisation.

I’ve released a number of plugins that were never released under Choria.

I’ve updated all their docs to be Choria specific rather than out dated install docs.

I’ve added Action Policy rules allowing read only actions by default – eg. puppet status will work for anyone, puppet runonce will give access denied.

I’ve started adding Playbooks the first ones are mcollective_agent_puppet::enable, mcollective_agent_puppet::disable and mcollective_agent_puppet::disable_and_wait.

Embeddable Choria

The new Choria Server is embeddable into any Go project. This is not a new area of research for me – this was actually the problem I tried to solve when I first wrote the current gen MCollective, but i never got so far really.

The idea is that if you have some application – like my Prometheus Streams system – where you will run many of a specific daemon each with different properties and areas of responsibility you can make that daemon connect to a Choria network as if it’s a normal Choria Server. The purpose of that is to embed into the daemon it’s life cycle management and provide an external API into this.

The above mentioned Prometheus Streams server for example have a circuit breaker that can start/stop the polling and replication of data:

$ mco rpc prometheus_streams switch -T prometheus
Discovering hosts using the mc method for 2 second(s) .... 1
 * [ ============================================================> ] 1 / 1
     Mode: poller
   Paused: true
Summary of Mode:
   poller = 1
Summary of Paused:
   false = 1
Finished processing 1 / 1 hosts in 399.81 ms

Here I am communicating with the internals of the Go process, they sit in their of Sub Collective, expose facts and RPC endpoints. I can use discovery to find all only nodes in certain modes, with certain jobs etc and perform functions you’d typically do via a REST management interface over a more suitable interface.

Likewise I’ve embedded a Choria Server into IoT systems where it uses the above mentioned Data Adapters to publish temperature and humidity while giving me the ability to extract from those devices data in demand using RPC and do things like in-place upgrades of the running binary on my IoT network.

You can use this today in your own projects and it’s compatible with the Ruby Choria you already run. A full walk through of doing this can be found in the ripienaar/embedded-choria-sample repository.

Choria Playbooks DSL

I previously wrote about Choria Playbooks – a reminder they are playbooks written in YAML format and can orchestrate many different kinds of tasks, data, inputs and discovery systems – not exclusively ones from MCollective. It integrates with tools like terraform, consul, etcd, Slack, Graphite, Webhooks, Shell scripts, Puppet PQL and of course MCollective.

I mentioned in that blog post that I did not think a YAML based playbook is the way to go.

I am very pleased to announce that with the release of Choria 0.6.0 playbooks can now be written with the Puppet DSL. I am so pleased with this that effectively immediately the YAML DSL is deprecated and set for a rather short life time.

A basic example can be seen here, it will:

  • Reuse a company specific playbook and notify Slack of the action about to be taken
  • Discover nodes using PQL in a specified cluster and verify they are using a compatible Puppet Agent
  • Obtain a lock in Consul ensuring only 1 member in the team perform critical tasks related to the life cycle of the Puppet Agent at a time
  • Disable Puppet on the discovered nodes
  • Wait for up to 200 seconds for the nodes to become idle
  • Release the lock
  • Notify Slack that the task completed
# Disables Puppet and Wait for all in-progress catalog compiles to end
plan acme::disable_puppet_and_wait (
  Enum[alpha, bravo] $cluster
) {
  choria::run_playbook(acme::slack_notify, message => "Disabling Puppet in cluster ${cluster}")
  $puppet_agents = choria::discover("mcollective",
    discovery_method => "choria",
    agents => ["puppet"],
    facts => ["cluster=${cluster}"],
    uses => { puppet => ">= 1.13.1" }
  $ds = {
    "type" => "consul",
    "timeout" => 120,
    "ttl" => 60
  choria::lock("locks/puppet.critical", $ds) || {
      "action" => "puppet.disable",
      "nodes" => $puppet_agents,
      "fail_ok" => true,
      "silent" => true,
      "properties" => {"message" => "restarting puppet server"}
      "action"    => "puppet.status",
      "nodes"     => $puppet_agents,
      "assert"    => "idling=true",
      "tries"     => 10,
      "silent"    => true,
      "try_sleep" => 20,
    message => sprintf("Puppet disabled on %d nodes in cluster %s", $puppet_agents.count, $cluster)

As you can see we can re-use playbooks and build up a nice cache of utilities that the entire team can use, the support for locks and data sharing ensures safe and coordinated use of this style of system.

You can get this today if you use Puppet 5.4.0 and Choria 0.6.0. Refer to the Playbook Docs for more details, especially the Tips and Patterns section.

Why Puppet based DSL?

The Plan DSL as you’ll see in the Background and History part later in this post is something I have wanted a long time. I think the current generation Puppet DSL is fantastic and really suited to this problem. Of course having this in the Plan DSL I can now also create Ruby versions of this and I might well do that.

The Plan DSL though have many advantages:

  • Many of us already know the DSL
  • There are vast amounts of documentation and examples of Puppet code, you can get trained to use it.
  • The other tools in the Puppet stable support plans – you can use puppet strings to document your Playbooks
  • The community around the Puppet DSL is very strong, I imagine soon rspec-puppet might support testing Plans and so by extension Playbooks. This appears to be already possible but not quite as easy as it could be.
  • We have a capable and widely used way of sharing these between us in the Puppet Forge

I could not compete with this in any language I might want to support.

Future of Choria Playbooks

As I mentioned the YAML playbooks are not long for this world. I think they were an awesome experiment and I learned a ton from them, but these Plan based Playbooks are such a massive step forward that I just can’t see the YAML ones serving any purpose what so ever.

This release supports both YAML and Plan based Playbooks, the next release will ditch the YAML ones.

At that time a LOT of code will be removed from the repositories and I will be able to very significantly simplify the supporting code. My goal is to make it possible to add new task types, data sources, discovery sources etc really easily, perhaps even via Puppet modules so the eco system around these will grow.

I will be doing a bunch of work on the Choria Plugins (agent, server, puppet etc) and these might start shipping small Playbooks that you can use in your own Playbooks. The one that started this blog post would be a great candidate to supply as part of the Choria suite and I’d like to do that for this and many other plugins.

Background and History

For many years I have wanted Puppet to move in a direction that might one day support scripts – perhaps even become a good candidate for shell scripts, not at the expense of the CM DSL but as a way to reward people for knowing the Puppet Language. I wanted this for many reasons but a major one was because I wanted to use it as a DSL to write orchestration scripts for MCollective.

I did some proof of concepts of this late in 2012, you can see the fruits of this POC here, it allowed one to orchestrate MCollective tasks using Puppet DSL and a Ruby DSL. This was interesting but the DSL as it was then was no good for this.

I also made a pure YAML Puppet DSL that deeply incorporated Hiera and remained compatible with the Puppet DSL. This too was interesting and in hindsight given the popularity of YAML I think I should have given this a lot more attention than I did.

Neither of these really worked for what I needed. Around the time Henrik Lindberg started talking about massive changes to the Puppet DSL and I think our first ever conversation covered this very topic – this must have been back in 2012 as well.

More recently I worked on YAML based playbooks for Choria, a sample can be seen in the old Choria docs, this is about the closest I got to something workable, we have users in the wild using it and having success with these. As a exploration they were super handy and taught me loads.

Fast forward to Puppet Conf 2017 and Puppet Inc announced something called Puppet Plans, these are basically script like, uncompiled (kind of), top-down executed and aimed at use within your CLI much like you would a script. This was fantastic news, unfortunately the reality ended up with these locked up inside their new SSH based orchestrator called Bolt. Due to some very unfortunate technical direction and decision making Plans are entirely unusable by Puppet users without Bolt. Bolt vendors it’s own Puppet and Facter and so it’s unaware of the AIO Puppet.

Ideally I would want to use Plans as maintained by Puppet Inc for my Playbooks but the current status of things are that the team just is not interested in moving in that direction. Thus in the latest version of Choria I have implemented my own runner, result types, error types and everything needed to write Choria Playbooks using the Puppet DSL.


I am really pleased with how these playbooks turned out and am excited for what I can provide to the community in the future. There are no doubt some rough edges today in the implementation and documentation, your continued feedback and engagement in the Choria community around these would ensure that in time we will have THE Playbook system in the Puppet Eco system.

Replicating NATS Streams between clusters

I’ve mentioned NATS before – the fast and light weight message broker from nats.io – but I haven’t yet covered the sister product NATS Streaming before so first some intro.

NATS Streaming is in the same space as Kafka, it’s a stream processing system and like NATS it’s super light weight delivered as a single binary and you do not need anything like Zookeeper. It uses normal NATS for communication and ontop of that builds streaming semantics. Like NATS – and because it uses NATS – it is not well suited to running over long cluster links so you end up with LAN local clusters only.

This presents a challenge since very often you wish to move data out of your LAN. I wrote a Replicator tool for NATS Streaming which I’ll introduce here.


First I guess it’s worth covering what Streaming is, I should preface also that I am quite new in using Stream Processing tools so I am not about to give you some kind of official answer but just what it means to me.

In a traditional queue like ActiveMQ or RabbitMQ, which I covered in my Common Messaging Patterns posts, you do have message storage, persistence etc but those who consume a specific queue are effectively a single group of consumers and messages either go to all or load shared all at the same pace. You can’t really go back and forth over the message store independently as a client. A message gets ack’d once and once it’s been ack’d it’s done being processed.

In a Stream your clients each have their own view over the Stream, they all have their unique progress and point in the Stream they are consuming and they can move backward and forward – and indeed join a cluster of readers if they so wish and then have load balancing with the other group members. A single message can be ack’d many times but once ack’d a specific consumer will not get it again.

This is to me the main difference between a Stream processing system and just a middleware. It’s a huge deal. Without it you will find it hard to build very different business tools centred around the same stream of data since in effect every message can be processed and ack’d many many times vs just once.

Additionally Streams tend to have well defined ordering behaviours and message delivery guarantees and they support clustering etc. much like normal middleware has. There’s a lot of similarity between streams and middleware so it’s a bit hard sometimes to see why you won’t just use your existing queueing infrastructure.

Replicating a NATS Stream

I am busy building a system that will move Choria registration data from regional data centres to a global store. The new Go based Choria daemon has a concept of a Protocol Adapter which can receive messages on the traditional NATS side of Choria and transform them into Stream messages and publish them.

This gets me my data from the high frequency, high concurrency updates from the Choria daemons into a Stream – but the Stream is local to the DC. Indeed in the DC I do want to process these messages to build a metadata store there but I also want to processes these messages for replication upward to my central location(s).

Hence the importance of the properties of Streams that I highlighted above – multiple consumers with multiple views of the Stream.

There are basically 2 options available:

  1. Pick a message from a topic, replicate it, pick the next one, one after the other in a single worker
  2. Have a pool of workers form a queue group and let them share the replication load

At the basic level the first option will retain ordering of the messages – order in the source queue will be the order in the target queue. NATS Streaming will try to redeliver a message that timed out delivery and it won’t move on till that message is handled, thus ordering is safe.

The 2nd option since you have multiple workers you have no way to retain ordering of the messages since workers will go at different rates and retries can happen in any order – it will be much faster though.

I can envision a 3rd option where I have multiple workers replicating data into a temporary store where on the other side I inject them into the queue in order but this seems super prone to failure, so I only support these 2 methods for now.

Limiting the rate of replication

There is one last concern in this scenario, I might have 10s of data centres all with 10s of thousands of nodes. At the DC level I can handle the rate of messages but at the central location where I might have 10s of DCs x 10s of thousands of machines if I had to replicate ALL the data at near real time speed I would overwhelm the central repository pretty quickly.

Now in the case of machine metadata you probably want the first piece of metadata immediately but from then on it’ll be a lot of duplicated data with only small deltas over time. You could be clever and only publish deltas but you have the problem then that should a delta publish go missing you end up with a inconsistent state – this is something that will happen in distributed systems.

So instead I let the replicator inspect your JSON, if your JSON has something like fqdn in it, it can look at that and track it and only publish data for any single matching sender every 1 hour – or whatever you configure.

This has the effect that this kind of highly duplicated data is handled continuously in the edge but that it only gets a snapshot replication upwards once a hour for any given node. This solves the problem neatly for me without there being any risks to deltas being lost, it’s also a lot simpler to implement.

Choria Stream Replicator

So finally I present the Choria Stream Replicator. It does all that was described above with a YAML configuration file, something like this:

debug: false                     # default
verbose: false                   # default
logfile: "/path/to/logfile"      # STDOUT default
state_dir: "/path/to/statedir"   # optional
        topic: acme.cmdb
        source_url: nats://source1:4222,nats://source2:4222
        source_cluster_id: dc1
        target_url: nats://target1:4222,nats://target2:4222
        target_cluster_id: dc2
        workers: 10              # optional
        queued: true             # optional
        queue_group: cmdb        # optional
        inspect: host            # optional
        age: 1h                  # optional
        monitor: 10000           # optional
        name: cmdb_replicator    # optional

Please review the README document for full configuration details.

I’ve been running this in a test DC with 1k nodes for a week or so and I am really happy with the results, but be aware this is new software so due care should be given. It’s available as RPMs, has a Puppet module, and I’ll upload some binaries on the next release.

The Choria Emulator

In my previous posts I discussed what goes into load testing a Choria network, what connections are made, subscriptions are made etc.

From this it’s obvious the things we should be able to emulate are:

  • Connections to NATS
  • Subscriptions – which implies number of agents and sub collectives
  • Message payload sizes

To make it realistically affordable to emulate many more machines that I have I made an emulator that can start numbers of Choria daemons on a single node.

I’ve been slowly rewriting MCollective daemon side in Go which means I already had all the networking and connectors available there, so a daemon was written:

usage: choria-emulator --instances=INSTANCES [<flags>]
Emulator for Choria Networks
      --help                 Show context-sensitive help (also try --help-long and --help-man).
      --version              Show application version.
      --name=""              Instance name prefix
  -i, --instances=INSTANCES  Number of instances to start
  -a, --agents=1             Number of emulated agents to start
      --collectives=1        Number of emulated subcollectives to create
  -c, --config=CONFIG        Choria configuration file
      --tls                  Enable TLS on the NATS connections
      --verify               Enable TLS certificate verifications on the NATS connections
      --server=SERVER ...    NATS Server pool, specify multiple times (eg one:4222)
  -p, --http-port=8080       Port to listen for /debug/vars

You can see here it takes a number of instances, agents and collectives. The instances will all respond with ${name}-${instance} on any mco ping or RPC commands. It can be discovered using the normal mc discovery – though only supports agent and identity filters.

Every instance will be a Choria daemon with the exact same network connection and NATS subscriptions as real ones. Thus 50 000 emulated Choria will put the exact same load of work on your NATS brokers as would normal ones, performance wise even with high concurrency the emulator performs quite well – it’s many orders of magnitude faster than the ruby Choria client anyway so it’s real enough.

The agents they start are all copies of this one:

Choria Agent emulated by choria-emulator
      Author: R.I.Pienaar <rip@devco.net>
     Version: 0.0.1
     License: Apache-2.0
     Timeout: 120
   Home Page: http://choria.io
   Requires MCollective 2.9.0 or newer
   generate action:
       Generates random data of a given size
              Description: Amount of text to generate
                   Prompt: Size
                     Type: integer
                 Optional: true
            Default Value: 20
              Description: Generated Message
               Display As: Message

You can this has a basic data generator action – you give it a desired size and it makes you a message that size. It will run as many of these as you wish all called like emulated0 etc.

It has an mcollective agent that go with it, the idea is you create a pool of machines all with your normal mcollective on it and this agent. Using that agent then you build up a different new mcollective network comprising the emulators, federation and NATS.

Here’s some example of commands – you’ll see these later again when we talk about scenarios:

We download the dependencies onto all our nodes:

$ mco playbook run setup-prereqs.yaml --emulator_url=https://example.net/rip/choria-emulator-0.0.1 --gnatsd_url=https://example.net/rip/gnatsd --choria_url=https://example.net/rip/choria

We start NATS on our first node:

$ mco playbook run start-nats.yaml --monitor 8300 --port 4300 -I test1.example.net

We start the emulator with 1500 instances per node all pointing to our above NATS:

$ mco playbook run start-emulator.yaml --agents 10 --collectives 10 --instances 750 --monitor 8080 --servers

You’ll then setup a client config for the built network and can interact with it using normal mco stuff and the test suite I’ll show later. Simularly there are playbooks to stop all the various parts etc. The playbooks just interact with the mcollective agent so you could use mco rpc directly too.

I found I can easily run 700 to 1000 instances on basic VMs – needs like 1.5GB RAM – so it’s fairly light. Using 400 nodes I managed to build a 300 000 node Choria network and could easily interact with it etc.

Finally I made a ec2 environment where you can stand up a Puppet Master, Choria, the emulator and everything you need and do load tests on your own dime. I was able to do many runs with 50 000 emulated nodes on EC2 and the whole lot cost me less than $20.

The code for this emulator is very much a work in progress as is the Go code for the Choria protocol and networking but the emulator is here if you want to take a peek.

What to consider when speccing a Choria network

In my previous post I talked about the need to load test Choria given that I now aim for much larger workloads. This post goes into a few of the things you need to consider when sizing the optimal network size.

Given that we now have the flexibility to build 50 000 node networks quite easily with Choria the question is should we, and if yes then what is the right size. As we can now federate multiple Collectives together into one where each member Collective is a standalone network we have the opportunity to optimise for the operability of the network rather than be forced to just build it as big as we can.

What do I mean when I say the operability of the network? Quite a lot of things:

  • What is your target response time on a unbatched mco rpc rpcutil ping command
  • What is your target discovery time? You should use a discovery data source but broadcast is useful, so how long do you want?
  • If you are using a discovery source, how long do you want to wait for publishes to happen?
  • How many agents will you run? Each agent makes multiple subscriptions on the middleware and consume resources there
  • How many sub collectives do you want? Each sub collective multiply the amount of subscriptions
  • How many federated networks will you run?
  • When you restart the entire NATS, how long do you want to wait for the whole network to reconnect?
  • How many NATS do you need? 1 can run 50 000 nodes, but you might want a cluster for HA. Clustering introduces overhead in the middleware
  • If you are federating a global distributed network, what impact does the latency cross the federation have and what is acceptable

So you can see that to a large extend the answer here is related to your needs and not only to the needs of benchmarking Choria. I am working on a set of tools to allow anyone to run tests locally or on a EC2 network. The main work hose is a Choria emulator that runs a 1 000 or more Choria instances on a single node so you can use a 50 node EC2 network to simulate a 50 000 node one.

Middleware Scaling Concerns

Generally for middleware brokers there are a few things that impact their scalability:

  • Number of TCP Connections – generally a thread/process is made for each
  • TLS or Plain text – huge overhead in TLS typically and it can put a lot of strain on single systems
  • Number of message targets – queues, topics, etc. Different types of target have different overheads. Often a thread/process for each.
  • Number of subscribers to each target
  • Cluster overhead
  • Persistence overheads like storage and ACKs etc

You can see it’s quite a large number of variables that goes into this, anywhere that requires a thread or process to manage 1 of it means you should get worried or at least be in a position to measure it.

NATS uses 1 go routine for each connection and no additional ones per subscription etc, its quite light weight but there are no hard and fast rules. Best to observe how it grows by needs, something I’ll include in my test suite.

How Choria uses NATS

It helps then to understand how Choria will use NATS and what connections and targets it makes.

A single Choria node will:

  • Maintain a single TCP+TLS connection to NATS
  • Subscribe to 1 queue unique to the node for every Subcollective it belongs to
  • For every agent – puppet, package, service, etc – subscribe to a broadcast topic for that agent. Once in every Subcollective. Choria comes default with 7 agents.

So if you have a node with 10 agents in 5 Subcollectives:

  • 50 broadcast subjects for agents
  • 5 queue subjects
  • 1 TCP+TLS connection

So 100 nodes will have 5 500 subscriptions, 550 NATS subjects and 100 TCP+TLS connections.

Ruby based Federation brokers will maintain 1 subscription to a queue subject on the Federation and same on the Collective. The upcoming Go based Federation Brokers will maintain 10 (configurable) connections to NATS on each side, each with these subscriptions.


This will give us a good input into designing a suite of tools to measure various things during the run time of a big test, check back later for details about such a tool.

Load testing Choria


Many of you probably know I am working on a project called Choria that modernize MCollective which will eventually supersede MCollective (more on this later).

Given that Choria is heading down a path of being a rewrite in Go I am also taking the opportunity to look into much larger scale problems to meet some client needs.

In this and the following posts I’ll write about work I am doing to load test and validate Choria to 100s of thousands of nodes and what tooling I created to do that.


Choria builds around the NATS middleware which is a Go based middleware server that forgoes a lot of the persistence and other expensive features – instead it focusses on being a fire and forget middleware network. It has an additional project should you need those features so you can mix and match quite easily.

Turns out that’s exactly what typical MCollective needs as it never really used the persistence features and those just made the associated middleware quite heavy.

To give you an idea, in the old days the community would suggest every ~ 1000 nodes managed by MCollective required a single ActiveMQ instance. Want 5 500 MCollective nodes? That’ll be 6 machines – physical recommended – and 24 to 30 GB RAM in a cluster just to run the middleware. We’ve had reports of much larger RabbitMQ networks on 4 or 5 servers – 50 000 managed nodes or more, but those would be big machines and they had quite a lot of performance issues.

There was a time where 5 500 nodes was A LOT but now it’s becoming a bit every day, so I need to focus upward.

With NATS+Choria I am happily running 5 500 nodes on a single 2 CPU VM with 4GB RAM. In fact on a slightly bigger VM I am happily running 50 000 nodes on a single VM and NATS uses around 1GB to 1.5GB of RAM at peak.

Doing 100s of RPC requests in a row against 50 000 nodes the response time is pretty solid around 16 seconds for a RPC call to every node, it’s stable, never drops a message and the performance stays level in the absence of Java GC issues. This is fast but also quite slow – the Ruby client manages about 300 replies every 0.10 seconds due to the amount of protocol decoding etc that is needed.

This brings with it a whole new level of problem. Just how far can we take the client code and how do you determine when it’s too big and how do I know the client, broker and federation I am working on significantly improve things.

I’ve also significantly reworked the network protocol to support Federation but the shipped code optimize for code and config simplicity over lets say support for 20 000 Federation Collectives. When we are talking about truly gigantic Choria networks I need to be able to test scenarios involving 10s of thousands of Federated Network all with 10s of thousands of nodes in them. So I need tooling that lets me do this.

Getting to running 50 000 nodes

Not everyone just happen to have a 50 000 node network lying about they can play with so I had to improvise a bit.

As part of the rewrite I am doing I am building a Go framework with the Choria protocol, config parsing and network handling all built in Go. Unlike the Ruby code I can instantiate multiple of these in memory and run them in Go routines.

This means I could write a emulator that can start a number of faked Choria daemons all in one process. They each have their own middleware connection, run a varying amount of agents with a varying amount of sub collectives and generally behave like a normal MCollective machine. On my MacBook I can run 1 500 Choria instances quite easily.

So with fewer than 60 machines I can emulate 50 000 MCollective nodes on a 3 node NATS cluster and have plenty of spare capacity. This is well within budget to run on AWS and not uncommon these days to have that many dev machines around.

In the following posts I’ll cover bits about the emulator, what I look for when determining optimal network sizes and how to use the emulator to test and validate performance of different network topologies.

Choria Update

Recently at Config Management Camp I’ve had many discussions about Orchestration, Playbooks and Choria, I thought it’s time for another update on it’s status.

I am nearing version 1.0.0, there are a few things to deal with but it’s getting close. Foremost I wanted to get the project it’s own space on all the various locations like GitHub, Forge, etc.

Inevitably this means getting a logo, it’s been a bit of a slog but after working through loads of feedback on Twitter and offers for assistance from various companies I decided to go to a private designer called Isaac Durazo and the outcome can be seen below:


The process of getting the logo was quite interesting and I am really pleased with the outcome, I’ll blog about that separately.

Other than the logo the project now has it’s own GitHub organisation at https://github.com/choria-io and I have moved all the forge modules to it’s own space as well https://forge.puppet.com/choria.

There are various other places the logo show up like in the Slack notifications and so forth.

On the project front there’s a few improvements:

  • There is now a registration plugin that records a bunch of internal stats on disk, the aim is for them to be read by Collectd and Sensu
  • A new Auditing plugin that emits JSON structured data
  • Several new Data Stores for Playbooks – files, environment.
  • Bug fixes on Windows
  • All the modules, plugins etc have moved to the Choria Forge and GitHub
  • Quite extensive documentation site updates including branding with the logo and logo colors.

There is now very few things left to do to get 1.0.0 out but I guess another release or two will be done before then.

So from now to update to coming versions you need to use the choria/mcollective_choria module which will pull in all it’s dependencies from the Choria project rather than my own Forge.

Still no progress on moving the actual MCollective project forward but I’ve discussed a way to deal with forking the various projects in a way that seems to work for what I want to achieve. In reality I’ll only have time to do that in a couple of months so hopefully something positive will happen in the mean time.

Head over to Choria.io to take a look.

Choria Playbooks – Data Sources

About a month ago I blogged about Choria Playbooks – a way to write series of actions like MCollective, Shell, Slack, Web Hooks and others – contained within a YAML script with inputs, node sets and more.

Since then I added quite a few tweaks, features and docs, it’s well worth a visit to choria.io to check it out.

Today I want to blog about a major new integration I did into them and a major step towards version 1 for Choria.


In the context of a playbook or even a script calling out to other system there’s many reasons to have a Data Source. In the context of a playbook designed to manage distributed systems the Data Source needed has some special needs. Needs that tools like Consul and etcd fulfil specifically.

So today I released version 0.0.20 of Choria that includes a Memory and a Consul Data Source, below I will show how these integrate into the Playbooks.

I think using a distributed data store is important in this context rather than expecting to pass variables from the Playbook around like on the CLI since the business of dealing with the consistency, locking and so forth are handled and I can’t know all the systems you wish to interact with, but if those can speak to Consul you can prepare an execution environment for them.

For those who don’t agree there is a memory Data Store that exists within the memory of the Playbook. Your playbook should remain the same apart from declaring the Data Source.

Using Consul

Defining a Data Source

Like with Node Sets you can have multiple Data Sources and they are identified by name:

    type: consul
    timeout: 360
    ttl: 20

This creates a Consul Data Source called pb_data, you need to have a local Consul Agent already set up. I’ll cover the timeout and ttl a bit later.

Playbook Locks

You can create locks in Consul and by their nature they are distributed across the Consul network. This means you can ensure a playbook can only be executed once per Consul DC or by giving a custom lock name any group of related playbooks or even other systems that can make Consul locks.

  - pb_data
  - pb_data/custom_lock

This will create 2 locks in the pb_data Data Store – one called custom_lock and another called choria/locks/playbook/pb_name where pb_name is the name from the metadata.

It will try to acquire a lock for up to timeout seconds – 360 here, if it can’t the playbook run fails. The associated session has a TTL of 20 seconds and Choria will renew the sessions around 5 seconds before the TTL expires.

The TTL will ensure that should the playbook die, crash, machine die or whatever, the lock will release after 20 seconds.

Binding Variables

Playbooks already have a way to bind CLI arguments to variables called Inputs. Data Sources extend inputs with extra capabilities.

We now have two types of Input. A static input is one where you give the data on the CLI and the data stays static for the life of the playbook. A dynamic input is one bound against a Data Source and the value of it is fetched every time you reference the variable.

    description: "Cluster to deploy"
    type: "String"
    required: true
    data: "pb_data/choria/kv/cluster"
    default: "alpha"

Here we have a input called cluster bound to the choria/kv/cluster key in Consul. This starts life as a static input and if you give this value on the CLI it will never use the Data Source.

If however you do not specify a CLI value it becomes dynamic and will consult Consul. If there’s no such key in Consul the default is used, but the input remains dynamic and will continue to consult Consul on every access.

You can force an input to be dynamic which will mean it will not show up on the CLI and will only speak to a data source using the dynamic: true property on the Input.

Writing and Deleting Data

Of course if you can read data you should be able to write and delete it, I’ve added tasks to let you do this:

  - pb_data
    description: "Cluster to deploy"
    type: "String"
    required: true
    data: "pb_data/choria/kv/cluster"
    default: "alpha"
    validation: ":shellsafe"
    - data:
        action: "delete"
        key: "pb_data/choria/kv/cluster"
  - shell:
      description: Deploy to cluster {{{ inputs.cluster }}}
      command: /path/to/script --cluster {{{ inputs.cluster }}}
  - data:
      action: "write"
      value: "bravo"
      key: "pb_data/choria/kv/cluster"
  - shell:
      description: Deploy to cluster {{{ inputs.cluster }}}
      command: /path/to/script --cluster {{{ inputs.cluster }}}

Here I have a pre_book task list that ensures there is no stale data, the lock ensures no other Playbook will mess around with the data while we run.

I then run a shell command that uses the cluster input, with nothing there it uses the default and so deploys cluster alpha, it then writes a new value and deploys cluster brova.

This is a bit verbose I hope to add the ability to have arbitrarily named tasks lists that you can branch to, then you can have 1 deploy task list and use the main task list to set up variables for it and call it repeatedly.


That’s quite a mouthful, the possibilities of this is quite amazing. On one hand we have a really versatile data store in the Playbooks but more significantly we have expanded the integration possibilities by quite a bit, you can now have other systems manage the environment your playbooks run in.

I will soon add task level locks and of course Node Set integration.

For now only Consul and Memory is supported, I can add others if there is demand.

An update on my Choria project

Some time ago I mentioned that I am working on improving the MCollective Deployment story.

I started a project called Choria that aimed to massively improve the deployment UX and yield a secure and stable MCollective setup for those using Puppet 4.

The aim is to make installation quick and secure, towards that it seems a common end to end install from scratch by someone new to project using a clustered NATS setup can take less than a hour, this is a huge improvement.

Further I’ve had really good user feedback, especially around NATS. One user reports 2000 nodes on a single NATS server consuming 300MB RAM and it being very performant, much more so than the previous setup.

It’s been a few months, this is whats changed:

  • The module now supports every OS AIO Puppet supports, including Windows.
  • Documentation is available on choria.io, installation should take about a hour max.
  • The PQL language can now be used to do completely custom infrastructure discovery against PuppetDB.
  • Many bugs have been fixed, many things have been streamlined and made more easy to get going with better defaults.
  • Event Machine is not needed anymore.
  • A number of POC projects have been done to flesh out next steps, things like a very capable playbook system and a revisit to the generic RPC client, these are on GitHub issues.

Meanwhile I am still trying to get to a point where I can take over maintenance of MCollective again, at first Puppet Inc was very open to the idea but I am afraid it’s been 7 months and it’s getting nowhere, calls for cooperation are just being ignored. Unfortunately I think we’re getting pretty close to a fork being the only productive next step.

For now though, I’d say the Choria plugin set is production ready and stable any one using Puppet 4 AIO should consider using these – it’s about the only working way to get MCollective on FOSS Puppet now due to the state of the other installation options.