Update: This isn't actually that much better than letting them
access the private key, since nothing is stopping the user from
running their own SSH agent, which can be run under strace. A better
solution is in the works. Thanks to Timo Juhani Lindfors and Bob Proulx,
who both pointed this out.
At work, we have a shared SSH key between the different people
manning the support queue. So far, this has just been a file in a
directory where everybody could read it and people would sudo to the
support user and then run SSH.
This has bugged me a fair bit, since there was nothing stopping a
person from making a copy of the key onto their laptop, except policy.
Thanks to a tip, I got around to implementing this and figured writing
up how to do it would be useful.
First, you need a directory readable only by root; I use
/var/local/support-ssh here. The other bits you need are a small
sudo snippet and a profile.d script.
Everybody in group support can run ssh-add as root.
The profile.d goes in /etc/profile.d/support.sh and looks like:
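# Only members of the support group get the shared agent setup.
if [ -n "$(groups | grep -E "(^| )support( |$)")" ]; then
    export SSH_AUTH_ENV="$HOME/.ssh/agent-env"
    # Reuse an agent started by an earlier session, if there is one.
    if [ -f "$SSH_AUTH_ENV" ]; then
        . "$SSH_AUTH_ENV"
    fi
    ssh-add -l >/dev/null 2>&1
    # Exit status 2 means ssh-add could not contact an agent, so start one.
    if [ $? = 2 ]; then
        mkdir -p "$HOME/.ssh"
        rm -f "$SSH_AUTH_ENV"
        ssh-agent > "$SSH_AUTH_ENV"
        . "$SSH_AUTH_ENV"
    fi
    # Load the shared key as root, since the user cannot read it directly.
    sudo ssh-add /var/local/support-ssh/id_rsa
fi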
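The group test on the first line of the script relies on an anchored pattern, so that only a group named exactly support matches. Here is a small Python sketch of the same pattern, purely as an illustration; the user and group names in the sample strings are made up.

```python
import re

# Same pattern as the grep -E call in the profile.d script: "support" must
# be a whole word in the space-separated output of `groups`.
SUPPORT_GROUP = re.compile(r"(^| )support( |$)")


def in_support_group(groups_output: str) -> bool:
    """Return True if the `groups` output lists a group named exactly 'support'."""
    return SUPPORT_GROUP.search(groups_output.strip()) is not None


# Hypothetical `groups` outputs, for illustration only.
print(in_support_group("alice adm support"))       # True
print(in_support_group("bob support-staff"))       # False
print(in_support_group("carol techsupport sudo"))  # False
```

Without the anchors, a plain substring match would also let members of groups such as techsupport through.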
The key is unavailable to the user in question because ssh-agent is
sgid, so it runs with group ssh and the process is only debuggable by
root. Two things are still missing: there's no way to have the agent
prompt before using a key, and I would like it to die, or at least
unload its keys, when the last session for a user is closed, but that
doesn't seem trivial to do.
Over the last few years I have been experimenting with different approaches for scaling systems that monitor large numbers of heterogeneous hosts, specifically in hosting environments.
This post outlines a pipeline approach for modelling and manipulating monitoring data.
Monitoring can be represented as a pipeline which data flows through, and is eventually turned into a notification for a human.
This approach has several benefits:
Failures are compartmentalised
Compartments can be scaled independently from one another
Clear interfaces are required between compartments, enabling composability
Each stage of the pipeline is handled by a different compartment of monitoring infrastructure that analyses and manipulates the data before deciding whether to pass it on to the next compartment.
These components are the bare minimum required for a monitoring pipeline:
Data collection infrastructure is generally a collection of agents on target systems, or standalone tools that extract metrics from opaque systems (preferably via an API).
Data storage infrastructure provides a place to push collected metrics. These metrics are almost always numerical. They are then queried and fetched for graphing, monitoring checks, and reporting, thus enabling "We alert on what we draw".
Check execution infrastructure runs the monitoring checks that are configured for each host, which query the data storage infrastructure. Checks that query textual data often poll the target system directly, which can affect latency.
Notification infrastructure processes check results from the check execution infrastructure to send notifications to engineers or stakeholders. Ideally the notification infrastructure can also feed back actions from engineers to acknowledge, escalate, or resolve alerts.
At a high level, this is how data flows between the compartments:
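As a rough illustration in code, here is a minimal Python sketch of the four compartments chained together. Everything in it is an assumption made for the example (the host names, the load metric, the threshold, and the function names); it shows the shape of the pipeline, not any particular tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Metric:
    host: str
    name: str
    value: float


@dataclass
class CheckResult:
    host: str
    check: str
    ok: bool


def collect() -> List[Metric]:
    """Data collection: a stand-in for agents. The values are made up."""
    return [Metric("web1", "load", 0.4), Metric("web2", "load", 7.9)]


class Storage:
    """Data storage: keeps the latest value per (host, metric name)."""

    def __init__(self) -> None:
        self._latest: Dict[Tuple[str, str], float] = {}

    def push(self, metrics: List[Metric]) -> None:
        for m in metrics:
            self._latest[(m.host, m.name)] = m.value

    def query(self, host: str, name: str) -> Optional[float]:
        return self._latest.get((host, name))


def run_checks(storage: Storage, hosts: List[str], threshold: float = 5.0) -> List[CheckResult]:
    """Check execution: queries storage instead of polling the hosts."""
    results = []
    for host in hosts:
        value = storage.query(host, "load")
        results.append(CheckResult(host, "load", value is not None and value < threshold))
    return results


def notify(results: List[CheckResult]) -> List[str]:
    """Notification: turns failed checks into messages for humans."""
    return [f"ALERT {r.host}/{r.check}" for r in results if not r.ok]


storage = Storage()
storage.push(collect())
alerts = notify(run_checks(storage, ["web1", "web2"]))
print(alerts)  # ['ALERT web2/load']
```

Each function only talks to its neighbour through plain data, which is what lets a compartment be swapped or scaled without touching the others.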
When using Nagios, the check + notification infrastructure are generally collapsed into one compartment (with the exception of NRPE).
Many monitoring pipelines start out with the data collection + storage infrastructure decoupled from the check infrastructure. Monitoring checks query the same targets that are being graphed, but:
Because the check intervals don't necessarily match up to the data collection intervals, it can be hard to correlate monitoring alerts to features on the graphs.
The more systems poll the target system, the more the observer effect is amplified.
There are two other compartments that are becoming increasingly common:
Event processing infrastructure. Sitting between the check execution and notification infrastructure, this compartment processes events generated from the check infrastructure, identifies trends and emergent behaviours, and forwards the alerts to the notification infrastructure. It may also make decisions on who to send alerts to.
Management infrastructure, provides command + control facilities across all the compartments, as well as being the natural place for graphing and dashboards of metrics in the data storage infrastructure to live. If the target audience is non-technical or strongly segmented (e.g. many customers on a shared monitoring infrastructure), it can also provide an abstracted pretty public face to all the compartments.
This is how event processing + management fit into the pipeline:
The management infrastructure can likely be broken up into different compartments as well, but for now it serves as a placeholder.
Let's explore the benefits of this pipeline design.
Failures are compartmentalised
Ideally, failures and scalability bottlenecks are compartmentalised.
Where there are cascading failures that can't be contained, safeguards can be implemented in the surrounding compartments to dampen the effects [1].
For example, if the data storage infrastructure stops returning data, this causes the check infrastructure to return false negatives. Or false positives. Or false UNKNOWNs. Bad times.
We can contain the effects in the event processing infrastructure by detecting a mass failure and only sending out a small number of targeted notifications, rather than sending out alerts for each individual failing check.
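To make that idea concrete, here is a minimal Python sketch of one possible safeguard: if a large share of checks fail at once, send a single summary instead of one alert per check. The 50% threshold and the check names are assumptions picked for the example, and real event processing would need to be far more careful than this.

```python
from typing import List, Tuple


def summarise_alerts(results: List[Tuple[str, bool]], mass_failure_ratio: float = 0.5) -> List[str]:
    """Collapse a mass failure into one notification.

    `results` is a list of (check name, ok) pairs. The ratio is an
    illustrative assumption, not a recommended value.
    """
    failing = [name for name, ok in results if not ok]
    if not failing:
        return []
    if len(failing) / len(results) >= mass_failure_ratio:
        return [f"MASS FAILURE: {len(failing)} of {len(results)} checks failing"]
    return [f"ALERT: {name}" for name in failing]


# One failing check out of four: alert on it individually.
print(summarise_alerts([("web1/load", True), ("web2/load", False),
                        ("db1/disk", True), ("db2/disk", True)]))
# ['ALERT: web2/load']

# Three failing checks out of four: send one targeted summary.
print(summarise_alerts([("web1/load", False), ("web2/load", False),
                        ("db1/disk", False), ("db2/disk", True)]))
# ['MASS FAILURE: 3 of 4 checks failing']
```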
This problem is tricky, interesting, and fodder for further blog posts. :-)
Compartments can be scaled independently
Monolithic monitoring architectures are a pain to scale. Viewing a monolithic architecture through the prism of the pipeline model, all of the compartments are squeezed onto a single machine. Quite often there isn't a data collection or storage layer either.
Monolithic architectures often use the same moving parts under the hood, but they tend to be very closely entwined. Each tool has very distinct performance characteristics, but because they all run on a single machine and are poorly separated, the only way to improve performance is by throwing expensive hardware at the problem.
If you've ever worked with a monolithic monitoring system, you will likely be experiencing painful flashbacks right about now.
To generalise the workload of the different compartments:
Check execution, notifications, and event processing tend to be very CPU intensive + network latency sensitive
Data storage is IO intensive + disk space expensive
Making sure each compartment is humming along nicely is super important when providing a consistent and reliable monitoring service.
Splitting the compartments onto separate infrastructure enables us to:
Optimise the performance of each component individually, either through using hardware that's more appropriate for the workloads (SSDs, multi-CPU physical machines), or tuning the software stack at the kernel and user space level.
Expose data through well defined APIs, which leads into the next point:
Clear interfaces are required between compartments
I like to think of this as "the Duplo approach" - compartments with well defined interfaces you can plug together to compose your pipeline.
Clear interfaces abstract the tools used in each compartment of the pipeline, which is essential for chaining tools in a composable way.
Clear interfaces help us:
Replace underperforming tools that have reached their scalability limits
Test new tools in parallel with the old tools by verifying their inputs + outputs
Better identify input that could be considered erroneous, and react appropriately
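Running a new tool in parallel with the old one can be sketched in a few lines of Python. This is only an illustration: the two check functions and the sample load values are hypothetical, and a real comparison would run against recorded production inputs.

```python
from typing import Any, Callable, Iterable, List, Tuple


def compare_tools(inputs: Iterable[Any],
                  old_tool: Callable[[Any], Any],
                  new_tool: Callable[[Any], Any]) -> List[Tuple[Any, Any, Any]]:
    """Feed the same inputs to both tools and return every disagreement
    as an (input, old output, new output) tuple."""
    mismatches = []
    for item in inputs:
        old_out, new_out = old_tool(item), new_tool(item)
        if old_out != new_out:
            mismatches.append((item, old_out, new_out))
    return mismatches


# Hypothetical checks: the new tool treats a load of exactly 5.0 as OK.
def old_check(load: float) -> bool:
    return load < 5.0


def new_check(load: float) -> bool:
    return load <= 5.0


print(compare_tools([0.4, 5.0, 7.9], old_check, new_check))
# [(5.0, False, True)]
```

Because both tools sit behind the same interface, the comparison needs no knowledge of how either one works internally.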
It's not all rainbows and unicorns. There are some downsides to the pipeline approach.
Greater Cost
There will almost certainly be a bigger initial investment in building a monitoring system with the pipeline approach.
You'll be using more components, thus more servers, thus the cost is greater. While the cost of scaling out may be greater up-front, you limit the need to scale up later on.
You can counteract some of these effects by starting small and dividing up compartments over time as part of a piecemeal strategy, but this takes time + persistence.
I can tell you from personal project management experience rolling out this pipeline design that it's hard work keeping a model of the complexity both in your head and well documented.
More Complexity
The pipeline makes it easier to eliminate scalability bottlenecks at the expense of more moving parts. The more moving parts, the greater the likelihood of failure.
Operationally it will be more difficult to troubleshoot when failures occur, and this becomes worse as you increase the safeguards and fault tolerance within your compartments.
This is the cost of scalability, and there is no easy fix.
Conclusion
The pipeline model maps nicely to existing monitoring infrastructures, but also to larger distributed monitoring systems.
It provides scalability, fault tolerance, and composability at the cost of a larger upfront investment.
1: This is a vast simplification of a very complex topic. Thinking of failure as an energy to be contained by barriers was a popular perspective in accident prevention circles from the 1960s to the 1980s, but the concept doesn't necessarily apply to complex systems.
DISCLAIMER: Test Kitchen 1.0 is still in alpha at the time of
this post.
Update: Removed Gemfile and Vagrantfile.
Let’s take a look at the anatomy of a cookbook set up with
test-kitchen 1.0-alpha.
Note: It is outside the scope of this post to discuss how to write
minitest-chef tests or “test cookbook” recipes. Use the cookbook
described below as an example to get ideas for writing your own.
This is the full directory tree of Opscode’s “bluepill” cookbook:
I’ll assume the reader is familiar with basic components of cookbooks
like “recipes,” “templates,” and the top-level documentation files, so
let’s trim this down to just the areas of concern for Test Kitchen.
Note that this cookbook has a “test” cookbook. I’ll get to that in a
minute.
First of all, we have the .kitchen.yml. This is the project
definition that describes what is required to run Test Kitchen itself.
This particular file tells Test Kitchen to bring up nodes of the
platforms we’re testing with Vagrant, and defines the boxes with their
box names and URLs to download. You can view the full
.kitchen.yml in the Git repo.
For now, I’m going to focus on the suite stanza in the
.kitchen.yml. This defines how Chef will run when Test Kitchen
brings up the Vagrant machine.
Each platform has a recipe it will run with, in this case apt and
yum. Then the suite’s run list is appended, so for example, the final run list of
the Ubuntu 12.04 node will be:
We have apt so the apt cache on the node is updated before Chef does
anything else. This is pretty typical so we put it in the default run
list of each Ubuntu box.
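The way the two run lists combine can be sketched in Python. This is not how Test Kitchen implements it, just the list arithmetic described above; the recipe names are the ones that appear in Chef's log output for the default-ubuntu-1204 instance.

```python
from typing import List


def final_run_list(platform_run_list: List[str], suite_run_list: List[str]) -> List[str]:
    """Platform recipes run first, then the suite's recipes are appended."""
    return list(platform_run_list) + list(suite_run_list)


# Run list attached to the Ubuntu platform.
ubuntu_run_list = ["recipe[apt]"]
# Run list of the default suite.
default_suite_run_list = ["recipe[minitest-handler]", "recipe[bluepill_test]"]

print(final_run_list(ubuntu_run_list, default_suite_run_list))
# ['recipe[apt]', 'recipe[minitest-handler]', 'recipe[bluepill_test]']
```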
The minitest-handler recipe existing in the run list means that the
Minitest Chef Handler will be run at the end of the Chef run. In this
case, it will use the tests from the test cookbook, bluepill_test.
The bluepill cookbook itself does not depend on any of these
cookbooks. So how does Test Kitchen know where to get them? Enter the
next file in the list above, Berksfile. This informs
Berkshelf which cookbooks to download. The
relevant excerpt from the Berksfile is:
Based on the Berksfile, Berkshelf will download apt, yum, and
minitest-handler from the Chef Community site. It will also use the
bluepill_test cookbook included in the bluepill cookbook. This is
transparent to the user, as I’ll cover in a moment.
Test Kitchen’s Vagrant driver plugin handles all the configuration of
Vagrant itself based on the entries in the .kitchen.yml. To get the
Berkshelf integration in the Vagrant boxes, we need to install the
vagrant-berkshelf plugin in Vagrant. Then, we automatically get
Berkshelf’s Vagrant integration, meaning all the cookbooks defined in
the Berksfile are going to be available on the box we bring up.
Remember the test cookbook mentioned above? It’s the next component.
The default suite in .kitchen.yml puts bluepill_test in the run
list. This particular recipe will include the bluepill default
recipe, then it sets up a test service using the bluepill_service
LWRP. This means that when the nodes brought up by Test Kitchen via
Vagrant converge, they’ll have bluepill installed and set up, plus a
running service whose final behavior we can test. Since Chef will
exit with a non-zero return code if it encounters an exception, we
know that a successful run means everything is configured as defined
in the recipes, and we can run tests against the node.
The tests we’ll run are written with the Minitest Chef Handler.
These are defined in the test cookbook, in the
files/default/tests/minitest directory. The minitest-handler cookbook
(also in the default suite run list) will execute the default_test
tests.
In the next post, we’ll look at how to run Test Kitchen, and what all
the output means.
Test Kitchen combines the suite (default) with the platform names
(e.g., ubuntu-12.04). To run all the suites on all platforms, simply do:
% kitchen test
This will take a while, especially if you don’t already have the
Vagrant boxes on your system, as it will download each one. To make
this faster, we’ll just run Ubuntu 12.04:
% kitchen test default.*1204
Test Kitchen 1.0 can take a regular expression for the instances to
test. This will match the instance default-ubuntu-1204. I could also
just say 12 as that will match the single entry in my kitchen list
(above).
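Here is a small Python sketch of that kind of instance selection. It assumes the pattern may match anywhere in the instance name, which is what the "12" shorthand suggests, and it uses Python's re module rather than whatever Test Kitchen does internally. Apart from default-ubuntu-1204, the instance names are hypothetical.

```python
import re
from typing import List


def matching_instances(pattern: str, instances: List[str]) -> List[str]:
    """Return the instance names in which the pattern matches anywhere."""
    regex = re.compile(pattern)
    return [name for name in instances if regex.search(name)]


# Only default-ubuntu-1204 comes from this post; the rest are made up.
instances = ["default-ubuntu-1204", "default-ubuntu-1004", "default-centos-63"]

print(matching_instances(r"default.*1204", instances))  # ['default-ubuntu-1204']
print(matching_instances("12", instances))              # ['default-ubuntu-1204']
print(matching_instances("ubuntu", instances))
# ['default-ubuntu-1204', 'default-ubuntu-1004']
```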
It will take a few minutes to run Test Kitchen. Those familiar with
Chef know that if it encounters an unhandled exception, it exits with
a non-zero return code. This is important, because we know at the end
of a successful run, Chef did the right thing, assuming our recipe is
the right thing :-).
To recap the previous post, we have a run list like this:
["recipe[apt]", "recipe[minitest-handler]", "recipe[bluepill_test]"]
Let’s break down the output of our successful run. I’ll show the
output first, and explain it after:
Starting Kitchen
Cleaning up any prior instances of <default-ubuntu-1204>
Destroying <default-ubuntu-1204>
Finished destroying <default-ubuntu-1204> (0m0.00s).
Testing <default-ubuntu-1204>
Creating <default-ubuntu-1204>
This is basic setup to ensure that “The Kitchen” is clean beforehand
and we don’t have existing state interfering with the run.
[vagrant command] BEGIN (vagrant up default-ubuntu-1204 --no-provision)
[default-ubuntu-1204] Importing base box 'canonical-ubuntu-12.04'...
[default-ubuntu-1204] Matching MAC address for NAT networking...
[default-ubuntu-1204] Clearing any previously set forwarded ports...
[default-ubuntu-1204] Forwarding ports...
[default-ubuntu-1204] -- 22 => 2222 (adapter 1)
This will look familiar to Vagrant users: we’re just getting some
basic setup from Vagrant initializing the box defined in the
.kitchen.yml (passed to the Vagrantfile by the kitchen-vagrant
plugin). This step does a vagrant up --no-provision.
[Berkshelf] installing cookbooks...
[Berkshelf] Using bluepill (2.2.2) at path: '/Users/jtimberman/Development/opscode/cookbooks/bluepill'
[Berkshelf] Using apt (1.8.4)
[Berkshelf] Using yum (2.0.0)
[Berkshelf] Using minitest-handler (0.1.2)
[Berkshelf] Using bluepill_test (0.0.1) at path: './test/cookbooks/bluepill_test'
[Berkshelf] Using rsyslog (1.5.0)
[Berkshelf] Using chef_handler (1.1.0)
Remember from the previous post that we’re using Berkshelf? This is
the integration with Vagrant that ensures that the cookbooks are
available. The first entry is the bluepill cookbook itself. The next
four, apt, yum, minitest-handler and bluepill_test, are defined in the
Berksfile. After those, rsyslog is a dependency of the bluepill
cookbook (for rsyslog integration), and the last, chef_handler, is a
dependency of minitest-handler. Berkshelf extracts the dependencies
from the cookbook metadata of each cookbook defined in the Berksfile.
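The dependency walk can be sketched in Python as a breadth-first traversal. This is a simplified illustration and not Berkshelf's actual resolver, which also has to satisfy version constraints. The dependency map contains only the two relationships mentioned above.

```python
from collections import deque
from typing import Dict, List


def resolve(top_level: List[str], depends: Dict[str, List[str]]) -> List[str]:
    """Return the top-level cookbooks followed by their transitive
    dependencies, each listed once, in breadth-first order."""
    resolved: List[str] = []
    queue = deque(top_level)
    while queue:
        name = queue.popleft()
        if name in resolved:
            continue
        resolved.append(name)
        queue.extend(depends.get(name, []))
    return resolved


# Cookbooks in the order the Berkshelf output lists them.
top_level = ["bluepill", "apt", "yum", "minitest-handler", "bluepill_test"]
# Only the dependencies named in this post; real metadata may declare more.
depends = {
    "bluepill": ["rsyslog"],
    "minitest-handler": ["chef_handler"],
}

print(resolve(top_level, depends))
# ['bluepill', 'apt', 'yum', 'minitest-handler', 'bluepill_test', 'rsyslog', 'chef_handler']
```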
[default-ubuntu-1204] Creating shared folders metadata...
[default-ubuntu-1204] Clearing any previously set network interfaces...
[default-ubuntu-1204] Running any VM customizations...
[default-ubuntu-1204] Booting VM...
[default-ubuntu-1204] Waiting for VM to boot. This can take a few minutes.
[default-ubuntu-1204] VM booted and ready for use!
[default-ubuntu-1204] Setting host name...
[default-ubuntu-1204] Mounting shared folders...
[default-ubuntu-1204] -- v-root: /vagrant
[default-ubuntu-1204] -- v-csc-1: /tmp/vagrant-chef-1/chef-solo-1/cookbooks
[vagrant command] END (0m48.76s)
Vagrant instance <default-ubuntu-1204> created.
Finished creating <default-ubuntu-1204> (0m53.12s).
Again, this is familiar output to Vagrant users, where Vagrant is
making the cookbooks available to the instance.
Converging <default-ubuntu-1204>
[vagrant command] BEGIN (vagrant ssh default-ubuntu-1204 --command 'should_update_chef() {\n...')
Installing Chef Omnibus (11.4.0)
Downloading Chef 11.4.0 for ubuntu...
Installing Chef 11.4.0
Selecting previously unselected package chef.
(Reading database ... 60513 files and directories currently installed.)
Unpacking chef (from .../chef_11.4.0_amd64.deb) ...
Setting up chef (11.4.0-1.ubuntu.11.04) ...
Thank you for installing Chef!
[vagrant command] END (0m34.85s)
[vagrant command] BEGIN (vagrant provision default-ubuntu-1204)
[Berkshelf] installing cookbooks...
[Berkshelf] Using bluepill (2.2.2) at path: '/Users/jtimberman/Development/opscode/cookbooks/bluepill'
[Berkshelf] Using apt (1.8.4)
[Berkshelf] Using yum (2.0.0)
[Berkshelf] Using minitest-handler (0.1.2)
[Berkshelf] Using bluepill_test (0.0.1) at path: './test/cookbooks/bluepill_test'
[Berkshelf] Using rsyslog (1.5.0)
[Berkshelf] Using chef_handler (1.1.0)
This part is interesting, in that we’re going to install the Full
Stack Chef (Omnibus) package. This means it doesn’t matter what the
underlying base box has installed, we get the right version of Chef.
This is defined in the .kitchen.yml. This is done through vagrant
ssh (second line). Then, Test Kitchen does vagrant provision. The
provisioning step is where Berkshelf happens, so we do see this happen
again (perhaps a bug?).
[default-ubuntu-1204] Running provisioner: Vagrant::Provisioners::ChefSolo...
[default-ubuntu-1204] Generating chef JSON and uploading...
[default-ubuntu-1204] Running chef-solo...
INFO: *** Chef 11.4.0 ***
INFO: Setting the run_list to ["recipe[apt]", "recipe[minitest-handler]", "recipe[bluepill_test]"] from JSON
INFO: Run List is [recipe[apt], recipe[minitest-handler], recipe[bluepill_test]]
INFO: Run List expands to [apt, minitest-handler, bluepill_test]
INFO: Starting Chef Run for default-ubuntu-1204.vagrantup.com
This is the start of the actual Chef run, using Chef Solo by Vagrant’s
provisioner. Note that we have our suite’s run list. I’m going to skip
a lot of the Chef output because it isn’t required. Note that a few
resources in the minitest-handler will report as failed, but they can
be ignored because it means that those tests were simply not implemented.
INFO: Processing directory[/var/chef/minitest/bluepill_test] action create (minitest-handler::default line 50)
INFO: directory[/var/chef/minitest/bluepill_test] created directory /var/chef/minitest/bluepill_test
INFO: Processing cookbook_file[tests-bluepill_test-default] action create (minitest-handler::default line 53)
INFO: cookbook_file[tests-bluepill_test-default] created file /var/chef/minitest/bluepill_test/default_test.rb
INFO: Processing remote_directory[tests-support-bluepill_test-default] action create (minitest-handler::default line 60)
INFO: remote_directory[tests-support-bluepill_test-default] created directory /var/chef/minitest/bluepill_test/support
INFO: Processing cookbook_file[/var/chef/minitest/bluepill_test/support/helpers.rb] action create (dynamically defined)
INFO: cookbook_file[/var/chef/minitest/bluepill_test/support/helpers.rb] mode changed to 644
INFO: cookbook_file[/var/chef/minitest/bluepill_test/support/helpers.rb] created file /var/chef/minitest/bluepill_test/support/helpers.rb
These are the relevant parts of the minitest-handler recipe, where it
has copied the tests from the bluepill_test cookbook into place.
INFO: Processing gem_package[i18n] action install (bluepill::default line 20)
INFO: Processing gem_package[bluepill] action install (bluepill::default line 24)
INFO: Processing directory[/etc/bluepill] action create (bluepill::default line 34)
INFO: directory[/etc/bluepill] created directory /etc/bluepill
INFO: directory[/etc/bluepill] owner changed to 0
INFO: directory[/etc/bluepill] group changed to 0
INFO: Processing directory[/var/run/bluepill] action create (bluepill::default line 34)
INFO: directory[/var/run/bluepill] created directory /var/run/bluepill
INFO: directory[/var/run/bluepill] owner changed to 0
INFO: directory[/var/run/bluepill] group changed to 0
INFO: Processing directory[/var/lib/bluepill] action create (bluepill::default line 34)
INFO: directory[/var/lib/bluepill] created directory /var/lib/bluepill
INFO: directory[/var/lib/bluepill] owner changed to 0
INFO: directory[/var/lib/bluepill] group changed to 0
INFO: Processing file[/var/log/bluepill.log] action create_if_missing (bluepill::default line 41)
INFO: entered create
INFO: file[/var/log/bluepill.log] owner changed to 0
INFO: file[/var/log/bluepill.log] group changed to 0
INFO: file[/var/log/bluepill.log] mode changed to 755
INFO: file[/var/log/bluepill.log] created file /var/log/bluepill.log
Recall from the previous post that the bluepill_test recipe includes
the bluepill recipe. This is the basic setup of bluepill.
INFO: Processing package[nc] action install (bluepill_test::default line 4)
INFO: Processing template[/etc/bluepill/test_app.pill] action create (bluepill_test::default line 16)
INFO: template[/etc/bluepill/test_app.pill] updated content
INFO: Processing bluepill_service[test_app] action enable (bluepill_test::default line 18)
INFO: Processing bluepill_service[test_app] action load (bluepill_test::default line 18)
INFO: Processing bluepill_service[test_app] action start (bluepill_test::default line 18)
INFO: Processing link[/etc/init.d/test_app] action create (/tmp/vagrant-chef-1/chef-solo-1/cookbooks/bluepill/providers/service.rb line 30)
INFO: link[/etc/init.d/test_app] created
INFO: Chef Run complete in 81.099185824 seconds
And this is the rest of the bluepill_test recipe. It sets up a test
service that will basically be a netcat process listening on a port.
Let’s take a moment here and discuss what we have.
First, we have successfully converged the default recipe in the
bluepill cookbook via its inclusion in bluepill_test. This is
awesome, because we know the recipe works exactly as we defined it,
since Chef resources are declarative, and Chef exits if there’s a
problem.
Second, we have successfully set up a service managed by bluepill
itself using the LWRP included in the bluepill cookbook,
bluepill_service. This means we know that the underlying provider
configured all the resources correctly.
At this point, we could say “Ship it!” and release the cookbook,
knowing it will do what we require. However, this may be disingenuous
because we don’t know if the behavior of the system after all this
runs is actually correct. Therefore we look to the next segment of
output from Chef, from minitest:
INFO: Running report handlers
Run options: -v --seed 38794
# Running tests:
recipe::bluepill_test::default#test_0001_the_default_log_file_must_exist_cook_1295_ = 0.00 s = .
recipe::bluepill_test::default::create a bluepill configuration file#test_0001_anonymous = 0.00 s = .
recipe::bluepill_test::default::create a bluepill configuration file#test_0002_must_be_valid_ruby = 0.06 s = .
recipe::bluepill_test::default::runs the application as a service#test_0001_anonymous = 0.72 s = .
recipe::bluepill_test::default::runs the application as a service#test_0002_anonymous = 0.71 s = .
recipe::bluepill_test::default::spawn a netcat tcp client repeatedly#test_0001_should_receive_a_tcp_connection_from_netcat = 2.24 s = .
Finished tests in 3.746002s, 1.6017 tests/s, 1.8687 assertions/s.
6 tests, 7 assertions, 0 failures, 0 errors, 0 skips
This is performed by the minitest-handler, which runs the tests copied
from the bluepill_test cookbook before. It’s outside the scope of
this post to describe how to write minitest-chef tests, but we can
talk about the output.
We have 6 separate tests that perform 7 assertions, and they all
passed. The tests are asserting:
The log file is created, and by the full name of the test, this is
to check for a regression from
COOK-1295.
The .pill config file for the service must exist and be valid
Ruby.
The bluepill service must actually be enabled and running, thereby
testing that those actions in the LWRP work.
The running service, which listens on a TCP port, must be up and
available, thereby testing that bluepill started the service
correctly.
[vagrant command] END (1m29.24s)
Finished converging <default-ubuntu-1204> (2m15.45s).
Setting up <default-ubuntu-1204>
Finished setting up <default-ubuntu-1204> (0m0.00s).
Verifying <default-ubuntu-1204>
Finished verifying <default-ubuntu-1204> (0m0.00s).
Destroying <default-ubuntu-1204>
[vagrant command] BEGIN (vagrant destroy default-ubuntu-1204 -f)
[default-ubuntu-1204] Forcing shutdown of VM...
[Berkshelf] cleaning Vagrant's shelf
[default-ubuntu-1204] Destroying VM and associated drives...
[vagrant command] END (0m3.68s)
Vagrant instance <default-ubuntu-1204> destroyed.
Finished destroying <default-ubuntu-1204> (0m4.04s).
Finished testing <default-ubuntu-1204> (3m12.62s).
Kitchen is finished. (3m12.62s)
This output shows Test Kitchen cleaning up after itself. We destroy
the Vagrant instance on a successful convergence and test run in Chef,
because further investigation is not required. If the test failed for
some reason, Test Kitchen leaves it running so you can log into the
machine and poke around to find out what went wrong. Then simply
correct the required part of the cookbook (recipes, tests, etc) and
rerun Test Kitchen. For example:
% bundle exec kitchen login 1204
vagrant@ubuntu-1204$ … run some commands
vagrant@ubuntu-1204$ ^D
% bundle exec kitchen converge 1204
My goal with these posts is to get some information out for folks to
consider when examining Test Kitchen 1.0 alpha for their own projects.
There’s a lot more to Test Kitchen, such as managing non-cookbook
projects, or even using other kinds of tests. We’ll have more
documentation and guides as we get the 1.0 release out.
Even a thoroughly tested application can wreak havoc if it hasn’t been tested in the context of a production-like system under production-like conditions.
Tools like Puppet and Chef make it easy to produce a production-like environment for testing, but what about the production-like conditions?
One aspect of these conditions can be approximated with load testing tools like JMeter or The Grinder. I recently used The Grinder to troubleshoot a performance problem with a small web application. Here’s a walkthrough of my process.
Getting Started with the Grinder
Like JMeter, The Grinder is a Java-based load testing framework. It can coordinate the execution of a test plan by distributed worker processes for anything with a Java API. I used it to send requests to a web application with several distinct APIs and components.
The three main components of The Grinder are the Console, Agents, and Workers.
You’ll need to create a grinder.properties file. This file is used to configure several properties including the number of worker processes and threads. Here’s a simple one:
# Please refer to
# http://net.grinder.sourceforge.net/g3/properties.html for further
# documentation.
# The file name of the script to run.
#
# Relative paths are evaluated from the directory containing the
# properties file. The default is "grinder.py".
grinder.script = grinder.clj
# The number of worker processes each agent should start. The default
# is 1.
grinder.processes = 1
# The number of worker threads each worker process should start. The
# default is 1.
grinder.threads = 5
# The number of runs each worker process will perform. When using the
# console this is usually set to 0, meaning "run until the console
# sends a stop or reset signal". The default is 1.
grinder.runs = 1
### Logging ###
# The directory in which worker process logs should be created. If not
# specified, the agent's working directory is used.
grinder.logDirectory = log
# The number of archived logs from previous runs that should be kept.
# The default is 1.
grinder.numberOfOldLogs = 2
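The two settings that matter most for load are grinder.processes and grinder.threads: each agent starts that many worker processes, and each process starts that many threads. As a quick sanity check, here is a Python sketch that reads those values and multiplies them. The parser is a deliberately simplified assumption (it only handles "key = value" lines and "#" comments, not the full Java properties format), and SAMPLE repeats just the uncommented settings from the file above.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple 'key = value' lines, skipping blanks and '#' comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


SAMPLE = """
# Settings copied from the grinder.properties file above.
grinder.script = grinder.clj
grinder.processes = 1
grinder.threads = 5
grinder.runs = 1
"""

props = parse_properties(SAMPLE)
# Both settings default to 1 when they are absent.
threads_per_agent = int(props.get("grinder.processes", "1")) * int(props.get("grinder.threads", "1"))
print(threads_per_agent)  # 5
```

With this configuration each agent runs five worker threads at a time, so the total simulated load scales with the number of agents you start.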
I also created a couple of bash scripts to set CLASSPATH and launch different grinder processes.
Run the startProxy.sh shell script from above. You should see something like this:
Using the Proxy
Once you’ve started the proxy, configure your browser to send all requests through the proxy. I chose to use Firefox for this because it allowed me to set the proxy at the browser level rather than send all of my HTTP traffic through the proxy.
Make sure to disable any extra plugins that might make extra requests not related to the subject under test. Visit pages which will model a typical user’s usage. When you’re done, stop the proxy.
A Clojure Test Script
Here is an example of some of the Clojure code generated by The Grinder.
;; The Grinder 3.11
;; HTTP script recorded by TCPProxy at Mar 14, 2013 1:59:26 AM

(ns user
  (:import
    (net.grinder.script Test Grinder)
    (net.grinder.plugin.http HTTPPluginControl HTTPRequest)
    (HTTPClient NVPair Codecs)))

(def grinder (Grinder/grinder))
(def connectionDefaults (HTTPPluginControl/getConnectionDefaults))
(def httpUtilities (HTTPPluginControl/getHTTPUtilities))

; To use a proxy server, uncomment the next line and set the host and port.
; (.setProxyServer connectionDefaults "localhost" 8001)

; Worker thread state is stored in a map using a dynamic var.
(def ^:dynamic *tokens*)
(defn set-token [k v] (set! *tokens* (assoc *tokens* k v)))
(defn token [k] (*tokens* k))

(defn nvpairs [c]
  (into-array NVPair
    (map (fn [[k v]] (NVPair. k v)) (partition 2 c))))

(defn httprequest [url & [headers]]
  (doto (HTTPRequest.)
    (.setUrl url)
    (.setHeaders (nvpairs headers))))

(defn basic-authorization [u p]
  (str "Basic " (Codecs/base64Encode (str u ":" p))))

(defn to-bytes [s]
  (letfn [(to-byte [x] (byte (if (> x 0x7f) (- x 0x100) x)))]
    (byte-array (map to-byte s))))

(defmacro defrequest [name test & args]
  `(do
     (def ~name (httprequest ~@args))
     (.record ~test ~name (HTTPRequest/getHttpMethodFilter))))

(defmacro defpage [name description test & rest]
  `(do
     (defn ~name ~description ~@rest)
     (.record ~test ~name)))

; Offline debug
; (use '[clojure.string :only (join)])
; (defmacro .GET [& k] `(.. grinder (getLogger) (debug (str "GET " (join ", " `(~~@k))))))
; (defmacro .POST [& k] `(.. grinder (getLogger) (debug (str "POST " (join ", " `(~~@k))))))

(.setDefaultHeaders connectionDefaults
  (nvpairs [
    "Accept-Encoding", "gzip, deflate"
    "Accept-Language", "en-US,en;q=0.5"
    "User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:19.0) Gecko/20100101 Firefox/19.0"]))

(def headers0 [
  "Accept", "image/png,image/*;q=0.8,*/*;q=0.5"
  "Referer", "http://www.example.com/"])

(def headers1 [
  "Accept", "*/*"
  "Referer", "http://www.example.com/"])

(def url0 "http://www.example.com:80")
(def url2 "http://ssl.static.example.com:80")

(defrequest request101 (Test. 101 "GET /") url0)
(defrequest request201 (Test. 201 "POST /") url1)
(defrequest request301 (Test. 301 "GET chrome-48.png") url0 headers0)
(defrequest request302 (Test. 302 "GET logo4w.png") url0 headers0)
(defrequest request401 (Test. 401 "GET rs=AItRSTPdVT73a8ca8dITXjGUdziGAyC2IQ") url0 headers1)
(defrequest request501 (Test. 501 "GET rs=AItRSTPdVT73a8ca8dITXjGUdziGAyC2IQ") url0 headers1)
(defrequest request502 (Test. 502 "GET tia.png") url0 headers0)
(defrequest request503 (Test. 503 "GET b84c02c3b64bf7ed.js") url0 headers1)
(defrequest request601 (Test. 601 "GET csi") url0 headers0)
(defrequest request602 (Test. 602 "GET nav_logo117.png") url0 headers0)
(defrequest request701 (Test. 701 "GET sem_87e2600bd08d93bebd4d641cad5ffb62.js") url2 headers1)

; A function for each recorded page.
(defpage page1 "GET / (request 101)."
  (Test. 100 "Page 1") []
  (.GET request101 "/" nil
    (nvpairs ["Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"])))

(defpage page3 "GET chrome-48.png (requests 301-302)."
  (Test. 300 "Page 3") []
  (.GET request301 "/images/icons/product/chrome-48.png")
  (.GET request302 "/images/srpr/logo4w.png"))

(defpage page4 "GET rs=AItRSTPdVT73a8ca8dITXjGUdziGAyC2IQ (request 401)."
  (Test. 400 "Page 4") []
  (set-token :token_rt "j")
  (set-token :token_ver "Za8TToM0_vY.en_US.")
  (set-token :token_am "BA")
  (set-token :token_d "1")
  (set-token :token_sv "1")
  (set-token :token_rs "AItRSTPdVT73a8ca8dITXjGUdziGAyC2IQ")
  (.GET request401
    (str "/xjs/_/js/s/c,sb,cr,cdos,vm,tbui,mb,hov,wobnm,cfm,abd,klc,kat,aut,bihu,kp,lu,m,rtis,tnv,amcl,erh,hv,lc,ob,rsn,sf,sfa,shb,tbpr,hsm,j,p,pcc,csi/rt=" (token :token_rt)
         "/ver=" (token :token_ver)
         "/am=" (token :token_am)
         "/d=" (token :token_d)
         "/sv=" (token :token_sv)
         "/rs=" (token :token_rs))))

(defpage page5 "GET rs=AItRSTPdVT73a8ca8dITXjGUdziGAyC2IQ (requests 501-503)."
  (Test. 500 "Page 5") []
  (set-token :token_d "0")
  (.GET request501
    (str "/xjs/_/js/s/sy9,gf,ifl/rt=" (token :token_rt)
         "/ver=" (token :token_ver)
         "/am=" (token :token_am)
         "/d=" (token :token_d)
         "/sv=" (token :token_sv)
         "/rs=" (token :token_rs)))
  (.GET request502 "/textinputassistant/tia.png")
  (set-token :token_bav "on.2,or.r_qf.")
  (.GET request503
    (str "/extern_chrome/b84c02c3b64bf7ed.js"
         "?bav=" (token :token_bav))))

(defpage page6 "GET csi (requests 601-602)."
  (Test. 600 "Page 6") []
  (set-token :token_v "3")
  (set-token :token_s "webhp")
  (set-token :token_action "")
  (set-token :token_e "17259,18168,39523,4000116,4001569,4001947,4001959,4001975,4002206,4002562,4002734,4002855,4003053,4003178,4003386,4003575,4003638,4003917,4004181,4004213,4004235,4004257,4004334,4004356,4004363,4004364,4004388,4004479,4004488,4004490,4004653,4004754,4004758,4004904")
  (set-token :token_ei "EtE-UcqbHKKEygGIg4CoBw")
  (set-token :token_imc "2")
  (set-token :token_imn "2")
  (set-token :token_imp "2")
  (set-token :token_atyp "csi")
  (set-token :token_adh "")
  (set-token :token_rt "xjsls.504,prt.538,xjses.3445,xjsee.3656,xjs.3659,ol.3969,iml.1089,wsrt.1342,cst.0,dnst.0,rqst.1425,rspt.175")
  (.GET request601
(str"/csi""?v="(token :token_v)"&s="(token :token_s)"&action="(token :token_action)"&e="(token :token_e)"&ei="(token :token_ei)"&imc="(token :token_imc)"&imn="(token :token_imn)"&imp="(token :token_imp)"&atyp="(token :token_atyp)"&adh="(token :token_adh)"&rt="(token :token_rt)))(.GET request602 "/images/nav_logo117.png"))(defpage page7 "GET sem_87e2600bd08d93bebd4d641cad5ffb62.js (request 701)."(Test.700"Page 7")[](.GET request701 "/gb/js/sem_87e2600bd08d93bebd4d641cad5ffb62.js"))(defn run
"Called for every run performed by the worker thread."[](page1); GET / (request 101)(.sleep grinder 246)(page3); GET chrome-48.png (requests 301-302)(.sleep grinder 32)(page4); GET rs=AItRSTPdVT73a8ca8dITXjGUdziGAyC2IQ (request 401)(.sleep grinder 2899)(page5); GET rs=AItRSTPdVT73a8ca8dITXjGUdziGAyC2IQ (requests 501-503)(.sleep grinder 249)(page6); GET csi (requests 601-602)(.sleep grinder 358)(page7); GET sem_87e2600bd08d93bebd4d641cad5ffb62.js (request 701))(defn runner-factory
"Create a run function. Called for each worker thread."[](binding[*tokens*{}](bound-fn* run)))
After recording your session, you can modify this script to eliminate any requests you want to exclude from your test.
Running the Test
First, start the Console:
./startConsole.sh
Then, start the Agent:
./startAgent.sh
From the Console, you can star the grinder.properties file to mark it for use.
And edit your grinder.properties to point to your test script.
Depending on what you’ve done, you may need to reset the Agent(s) at this point (I usually want to reset the Console, too).
Then you can distribute files to the Agent(s); this includes the test script specified in your grinder.properties file.
Make it Fail
Turn up the number of threads and/or worker processes until the load replicates the failure case. As Red Green says, “If it ain’t broke, you’re not trying!”
Make sure to consider the whole system at this point, because it’s easy to fool yourself into thinking you’ve crushed the server under heavy load when really you’ve only sapped the resources of your agents or local network.
In my case, I was trying to model the load of approximately 30 roughly concurrent requests for the same set of resources.
Due to interactions between several system components and a broken caching mechanism, this was causing the app to become unresponsive for several minutes.
My test script was able to model this failure quite well.
Go Green
Using The Grinder, I was able to model this failure well enough to test several configuration changes as well as a replacement caching mechanism. When the system was able to withstand the load of the test (the test passed), I was confident that the changes were likely to work in production.
Summary
By first creating a failing test for the scenario of a complete system under load, I gained confidence that the configuration changes I deployed to production would solve the problem. This was a relatively rudimentary example. What tools and techniques does your team use to test system integration at this level?
This is the first time I've actually blogged about Flapjack.
The past
In 2008 I started talking with Matt Moor about building a "next generation monitoring system" that would be simple to set up and operate, and provide obvious paths to scale.
In 2009 I started hacking on Flapjack while backpacking, and by mid 2009 I had a working prototype running basic monitoring checks.
The fundamental idea was simple: decouple the check execution from the alerting and notification, and use message queues to distribute the check execution across lots of machines.
It seems simple and obvious now, but at the time nobody was really talking about doing this, so Flapjack gathered a reasonable amount of attention relatively quickly after I started talking about it at conferences.
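The decoupling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Flapjack's actual code: an in-process queue.Queue stands in for a real message queue, and run_check, executor, notifier and run_pipeline are hypothetical names invented for the example.

```python
import queue
import threading


def run_check(name):
    """Stand-in for a real monitoring check; returns (name, state)."""
    # Hypothetical rule for the sketch: names ending in "-bad" fail.
    return name, ("critical" if name.endswith("-bad") else "ok")


def executor(checks, events):
    """Check execution side: run checks and publish results to the queue."""
    for name in checks:
        events.put(run_check(name))


def notifier(events, alerts):
    """Notification side: consume results without knowing how checks ran."""
    while True:
        event = events.get()
        if event is None:  # sentinel: no more events will arrive
            break
        name, state = event
        if state != "ok":
            alerts.append(f"ALERT {name} is {state}")


def run_pipeline(check_batches):
    """Run one executor thread per batch, all feeding a single notifier."""
    events = queue.Queue()
    alerts = []
    consumer = threading.Thread(target=notifier, args=(events, alerts))
    consumer.start()
    workers = [
        threading.Thread(target=executor, args=(batch, events))
        for batch in check_batches
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    events.put(None)  # every executor is done, so tell the notifier to stop
    consumer.join()
    return sorted(alerts)
```

Each batch here plays the role of a separate machine running checks. Adding check capacity means adding executors, and the notifier does not change.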
When 2010 rolled around, some fairly significant life changes left me unable to maintain a good development pace or hold the attention I had gained by talking at conferences. Pretty much all of my open source projects suffered, and in the space of 12 months:
There were plenty of other interesting projects like Sensu that were achieving similar goals excellently, so while winding up Flapjack was a source of bitter personal disappointment, it was offset by seeing other people doing awesome work in the monitoring space.
The present
Mid last year, an interesting problem arose at work:
In a modern "monitoring system", how do you:
Notify a dynamic group of people on a variety of media based on monitoring events? Bulletproof has thousands of people that may need to be notified by our monitoring system, depending on what monitoring checks are failing. While the thresholds on each monitoring check are universal, each of these people can have different notification settings based on time of day or week, the type of service affected, or the severity of the failure.
Dampen or roll up common events so on-call isn't bombarded during outages? When one system deep in the stack fails, it has significant flow-on effects to everything else that depends on it. This generally manifests as thousands (or tens of thousands, in extremely bad cases) of alerts being sent to on-call in a very short period of time (<60 seconds). Obviously this is bad, and we simply want to detect cases like these, and wake up people involved in the incident response process.
Do the above in an API driven way? We need to solve both problems in a way that works in a multitenant environment with strong segregation between customers, and integrates with an existing monitoring & customer self-service stack.
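To make the roll-up problem concrete, here is a minimal Python sketch of one possible dampening rule. It is an assumption for illustration, not how Flapjack implements roll-up: once more than threshold alerts arrive within a sliding window, send a single summary and suppress the rest until the storm subsides. RollupDetector, threshold and window are hypothetical names.

```python
from collections import deque


class RollupDetector:
    """Collapse an alert storm into a single summary notification."""

    def __init__(self, threshold=5, window=60.0):
        self.threshold = threshold  # alerts allowed per window before rolling up
        self.window = window        # sliding window length in seconds
        self.recent = deque()       # timestamps of alerts inside the window
        self.in_rollup = False

    def handle(self, timestamp, check):
        """Return the notification to send for this alert, or None to suppress it."""
        self.recent.append(timestamp)
        # Forget alerts that have fallen out of the sliding window.
        while self.recent and timestamp - self.recent[0] > self.window:
            self.recent.popleft()
        if len(self.recent) <= self.threshold:
            self.in_rollup = False
            return f"ALERT {check}"
        if not self.in_rollup:
            self.in_rollup = True
            return f"ROLLUP {len(self.recent)} alerts in the last {self.window:.0f}s"
        return None  # already rolled up: do not page on-call again
```

With a rule like this, on-call gets the first few alerts, then one summary, and nothing more until the alert rate drops back under the threshold.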
We've been actively working on the reboot since July last year, and have been sending alerts from Flapjack to customers since January.
We're developing Flapjack as a fully open source, composable platform that you can adapt and build on to suit your organisation's needs by hooking it into your existing check execution infrastructure (we ship a Nagios event processor) and your self-service and provisioning automation tools.
Flapjack is built on Redis, and funnily enough R.I. Pienaar did a post earlier this year that investigates using Redis to solve the same problem in an extremely similar way. R.I.'s post provides a good primer on some of the thinking behind Flapjack, so I recommend giving it a read.
The future
Fundamentally, Flapjack is trying to plug a notification hole in the monitoring ecosystem that I don't believe is being adequately addressed by other tools, but the key to doing this is to play nicely with other tools and build a composable pipeline.
The above is merely a glimpse of Flapjack that leaves quite a few questions unanswered (e.g. "Why aren't you using $x feature of $y check execution engine to do roll-up?", "Do Flapjack and Riemann play nicely with one another?"), so stay tuned for more:
Paris is increasingly becoming the DevOps place to be. We (apparently) successfully rebooted the Paris DevOps Meetups, with two events so far and two more already in the pipeline (stay tuned for the announcements).
The Paris edition of DevOpsDays is being held 18 - 19 April 2013, and we want you to be a part of it! This conference brings together speakers and attendees from around the world, with a focus on DevOps culture, techniques, and best practices.
The format is simple: talks in the morning, and open/hack spaces in the afternoon. We’ve done 17 very successful events on 5 continents, and we’re looking forward to another great edition here!
Perhaps you’re curious, “what exactly is DevOps?”
Well, if you already follow this blog or my Twitter account, you might already know what lies behind this term. The term is, of course, a portmanteau of Development and Operations, and is perhaps best thought of as a cultural movement within the IT world. It stresses communication, collaboration and integration between software developers and IT professionals. DevOps is a response to, and evolution of, the interdependence of software development and IT operations.
The conference itself will be held at the MAS. Tickets can be purchased for one or both days, and include full access to the talks and spaces, as well as a catered lunch.
What’s more, we’re currently offering 25% off of the ticket price - just use the code WELOVEDEVOPS when you register. This is a limited-time offer (until the end of this week), so don’t delay!
And, of course, the Call for Proposals is still open until 20 March 2013.
Finally, we invite you to peruse the list of proposals, and to comment and vote for your favorite ones!
So if you had to choose one devopsdays this year, choose ours and come
exchange with lots of talented French people
taste French food and wine
learn how we do devops in 3 hours of work per day (just kidding of course)
smell the fragrance of Paris in spring (there won’t be any more snow, I promise)
My next 2 months are going to be jam-packed with conferences and travel!
Devopsdays NZ, March 8 2013. I will be giving a talk that analyses AA261 through a DevOps lens, looking at the collaborative maintenance and operation of the MD-83 in the crash.
Monitorama, March 28-29 2013. I'm looking forward to slowing down and listening at Monitorama, which has a tremendous line up of speakers. I'll be keen to hear what others think of the work we've been doing on Flapjack the last 6 months.
Mountain West Ruby Conf 2013, April 3-5 2013. MWRC has added an extra day of DevOps content to the conference this year, and I'll be joining an esteemed speaker lineup to talk about what both dev and ops can learn from AF447 when responding to rapidly evolving failure scenarios.
I'll be staying in the Netherlands for a little under a week between conferences, visiting family and friends. Hopefully I can visit a meetup or two.
Open Source Data Center Conference 2013, April 17-18 2013. This will be my first time in Nürnberg, and I'm really looking forward to saying I have attended both OSDCs. I'll be talking about Ript, a DSL for describing firewall rules, and a tool for incrementally applying them.