
Syndication of blogs and tweets by users of the Freenode ##infra-talk IRC channel

16 July 2010 ~ Comments Off

Analyzing your backend web page response times

I have blogged in the past about some of the ways you can monitor your web site performance, e.g. how to monitor your site using 90th percentile response times, the beauty of aggregate line graphs and tracking web clients in real time.

Most recently we wanted to get better insight into how our site, and more specifically our backend, is performing. We wanted a tool that could provide us with per URL/page metrics such as

  • total number of requests
  • aggregate compute time
  • average request time
  • 90th percentile time (you can find more explanation of what it means at monitor your site using 90th percentile response times) - this leaves out the slowest 10% of requests, the outliers that can badly skew your averages

The initial plan was to build a basic set of reports to tell us which pages have excessive response times or large total (aggregate) compute times. The next portion, yet to be implemented, is to analyze data in real time so that we'd have another data point to use in troubleshooting in case there is a site slowdown.

Basic requirements for the tool were these

  • Capable of crunching 100+ million daily entries
  • Real-time analysis
  • Produce multiple metrics with potential to add more down the line
  • Low footprint

An obvious way to do this is to store all data in a heavy duty data store like a relational/SQL database or something MapReduce capable. Trouble is we may be logging in excess of 3,000 hits per second (all dynamic content, as static assets are served from the CDN). Doing that many inserts per second on a SQL-type database will be tricky unless you have powerful hardware. The next obvious problem is that scanning through hundreds of millions or billions of rows will be slow, even with MapReduce, unless of course you throw tons of hardware at it. We wanted a low footprint, remember.

Instead we decided to go with a key/value store. Major pluses were that the footprint is relatively low and it performs very fast. The downside was that I would not be able to run any sophisticated queries. Since we already have an app that uses memcached to give us a real-time view of the number of accesses per IP, we ended up using it for this purpose as well.

Implementation

I have been working for a while now with ganglia-logtailer, which is a Python framework to crunch log data and submit it to Ganglia. There are a number of good pieces from it we could reuse, and we did. What we ended up with is a two-part tool: a Python based log parsing piece and a PHP based web GUI and computation part. Division of "labor" was roughly this

  • Python part parses the logs and creates entries/keys where the value in each key represents all the response times observed on a particular server and URL in a particular time period, i.e. one hour
  • PHP part takes the list once the time period has ended, calculates total time, average time and 90th percentile times and stores computed values in memcache so that retrieval later can be quicker.
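As a rough illustration of the computation step, here is a minimal Python sketch (the tool itself does this part in PHP). It assumes the value stored under one server/URL/hour key is simply a list of response times in seconds, and it uses the nearest-rank definition of the 90th percentile, which may differ from what the tool actually implements. The function name `summarize` and the sample data are made up for the example.

```python
def summarize(response_times):
    """Return request count, total, average and 90th percentile for one key."""
    if not response_times:
        raise ValueError("no response times recorded")
    ordered = sorted(response_times)
    count = len(ordered)
    total = sum(ordered)
    # Nearest-rank percentile: ceil(0.9 * count), done in integer arithmetic.
    rank = (9 * count + 9) // 10
    return {
        "requests": count,
        "total_time": total,
        "average_time": total / count,
        "p90_time": ordered[rank - 1],
    }


# Hypothetical hour of data: nine fast requests and one very slow one.
sample = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 5.0]
stats = summarize(sample)
print(stats)
```

In this sample the single slow request pulls the average well above the typical response time, while the 90th percentile stays at the typical value.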

Graphing is achieved using simple CSS graphs, while time based series are done using OpenFlashChart. I did look at Dygraphs for Javascript/DHTML based graphing, however I couldn't figure out how to plot hourly values. I could only do daily values.

Tool is operational and so far it has led us to the realization that our mobile web pages are overall much slower than their corresponding web pages. This is due to the way we handle mobile ads since most feature phones don't support Javascript so we have to download the ad which introduces a slight delay. We did figure out that we could use Javascript on Webkit browsers similar to what we do for regular browsers so that should help a bit. We are also chasing some of the other "leads" regarding inconsistent performance for particular pages on some of the servers.

Next steps are to adapt parsing code to work with ganglia-logtailer which would give us real-time reporting. I don't expect too many problems with that. Also graphing could use some more love. Perhaps I'll even do standard deviation calculations :-) .

Anyways you can download source code from here

http://github.com/vvuksan/pagetime-analyzer

You know what to do :-) .

Obligatory screenshots

Hourly overview sorted by aggregate time in seconds (you can sort by any column)

This is the average response time (over an hour) for a particular URL on separate server instances

Daily view of performance for a particular URL

15 July 2010 ~ Comments Off

CouchDB views creation problems

I have had a frustrating time creating views in CouchDB using curl. Executing the following command I would get

$ curl -s -X PUT -H "text/plain;charset=utf-8" -d cronview.json http://localhost:5984/cronologger/_design/cronview
{"error":"bad_request","reason":"invalid UTF-8 JSON"}

I checked and rechecked JSON, used the same JSON using CouchDB's Futon to no avail. Finally I found the answer here

http://stackoverflow.com/questions/2461798/error-about-invalid-json-with-couchdb-view-but-the-jsons-fine

The -d option of curl expects the actual data as the argument!

If you want to provide the data in a file, you need to prefix it with @:

curl -X PUT -d @keys.json  $CDB/_design/id
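The same pitfall can be sidestepped when scripting the upload. Below is a minimal Python sketch, using only the standard library, that reads the file contents and sends those as the request body rather than the file name. The helper name `build_design_doc_request` is made up for this example, and the URL in the usage comment is the one from the failing command above.

```python
import json
import urllib.request


def build_design_doc_request(url, path):
    """Build a PUT request whose body is the contents of the JSON file at path."""
    with open(path, "rb") as handle:
        body = handle.read()
    # Fail early, with a local error, if the file is not valid JSON.
    json.loads(body.decode("utf-8"))
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Usage (not executed here, it needs a running CouchDB):
# request = build_design_doc_request(
#     "http://localhost:5984/cronologger/_design/cronview", "cronview.json")
# print(urllib.request.urlopen(request).read())
```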

13 July 2010 ~ Comments Off

Bootstrapping Puppet on EC2 with MCollective

The problem of getting EC2 images to do what you want is quite significant, mostly I find the whole thing a bit flakey and with too many moving parts.

  • When and what AMI to start
  • Once started how do you configure it from base to functional. Especially in a way that doesn’t become a vendor lock-in.
  • How do you manage the massive sprawl of instances, inventory them and track your assets
  • Monitoring and general life cycle management
  • When and how do you shut them down, and what cleanup is needed. Being billed by the hour means this has to be a consideration

These are significant problems and just the tip of the iceberg. All of the traditional aspects of infrastructure management – like Asset Management, Monitoring, Procurement – are totally useless in the face of the cloud.

A lot of work is being done in this space by tools like Pool Party, Fog, Opscode and many other players like the countless companies launching control panels, clouds overlaying other clouds and so forth. As a keen believer in Open Source many of these options are not appealing.

I want to focus on the 2nd step above here today and show how I pulled together a number of my Open Source projects to automate that. I built a generic provisioner that hopefully is expandable and usable in your own environments. The provisioner deals with all the interactions between Puppet on nodes, the Puppet Master, the Puppet CA and the administrators.

<rant> Sadly the activity in the Puppet space is a bit lacking in the area of making it really easy to get going on a cloud. There are suggestions on the level of monitoring syslog files from a cronjob and signing certificates based on that. Really. It’s a pretty sad state of affairs when that’s the state of the art.

Compare the ease of using Chef’s Knife with a lot of the suggestions currently out there for using Puppet in EC2 like these: 1, 2, 3 and 4.

Not trying to have a general Puppet Bashing session here but I think it’s quite defining of the 2 user bases that Cloud readiness is such an after thought so far in Puppet and its community. </rant>

My basic needs are that instances all start in the same state, I just want 1 base AMI that I massage into the desired final state. Most of this work has to be done by Puppet so it’s repeatable. Driving this process will be done by MCollective.

I bootstrap the EC2 instances using my EC2 Bootstrap Helper and I use that to install MCollective with just a provision agent. It configures it and hooks it into my collective.

From there I have the following steps that need to be done:

  • Pick a nearby Puppet Master, perhaps using EC2 Region or country as guides
  • Set up the host – perhaps using /etc/hosts – to talk to the right master
  • Revoke and clean any old certs for this hostname on all masters
  • Instruct the node to create a new CSR and send it to its master
  • Sign the certificate
  • Run my initial bootstrap Puppet environment, this sets up some hard to do things like facts my full build needs
  • Run the final Puppet run in my normal production environment.
  • Notify me using XMPP, Twitter, Google Calendar, Email, Boxcar and whatever else I want of the new node
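To make the first step concrete, here is a small Python sketch of one way a master could be picked from facts. It is only an illustration under stated assumptions and not the provisioner's actual code: the fact names `ec2_region` and `country`, the shape of the `masters` list and the function `pick_master` are all hypothetical.

```python
def pick_master(node_facts, masters, preferred_facts=("ec2_region", "country")):
    """Return the name of the first master sharing a preferred fact with the node.

    Facts are tried in order of preference, so a region match wins over a
    country match. Falls back to the first known master, or None if there are none.
    """
    for fact in preferred_facts:
        wanted = node_facts.get(fact)
        if wanted is None:
            continue
        for master in masters:
            if master["facts"].get(fact) == wanted:
                return master["name"]
    return masters[0]["name"] if masters else None


# Hypothetical masters as they might be found through discovery.
masters = [
    {"name": "puppet-us", "facts": {"ec2_region": "us-east-1", "country": "us"}},
    {"name": "puppet-eu", "facts": {"ec2_region": "eu-west-1", "country": "ie"}},
]
print(pick_master({"ec2_region": "eu-west-1", "country": "ie"}, masters))
```

Because the candidate list is passed in rather than configured, a newly started master would be considered as soon as discovery returns it.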

This is a lot of work to be done on every node. And more importantly it’s a task that involves many other nodes like puppet masters, notifiers and so forth. It has to adapt dynamically to your environment and not need reconfiguring when you get new Puppet Masters. It has to deal with new data centers, regions and countries without needing any configuration or even a restart. It has to happen automatically without any user interaction so that your auto scaling infrastructure can take care of booting new instances even while you sleep.

The provisioning system I wrote does just this. It follows the above logic for any new node and is configurable for which facts to use to pick a master and how to notify you of new systems. It adapts automatically to your ever changing environments thanks to discovery of resources. The actions to perform on the node are easily pluggable by just creating an agent that complies with the published DDL like the sample agent.

You can see it in action in the video below. I am using Amazon’s console to start the instance, you’d absolutely want to automate that for your needs. You can also see it direct on blip.tv here. For best effect – and to be able to read the text – please fullscreen.

In case the text is unreadable in the video a log file similar to the one in the video can be seen here and an example config here

Past this point my Puppet runs are managed by my MCollective Puppet Scheduler.

While this is all done using EC2 nothing prevents you from applying these same techniques to your own data center or non cloud environment.

Hopefully this shows that you can wrap all the logic needed to do very complex interactions with systems that are perhaps not known for their good reusable APIs in simple to understand wrappers with MCollective, exposing those systems to the network at large with APIs that can be used to reach your goals.

The various bits of open source I used here are:

12 July 2010 ~ Comments Off

EC2 Bootstrap Helper

I’ve been working a bit on streamlining the builds I do on EC2 and wanted a better way to provision my machines. I use CentOS and things are pretty rough to non existent for nicely built EC2 images. I’ve used the Rightscale ones till now and while they’re nice they are also full of lots of code copyrighted by Rightscale.

What I really wanted was something as full featured as Ubuntu’s CloudInit, but I didn’t feel much like touching any Python. I hacked up something that more or less does what I need. You can get it on GitHub. It’s written and tested on CentOS 5.5.

The idea is that you’ll have a single multi purpose AMI that you can easily bootstrap onto your puppet/mcollective infrastructure using this system. See below for some details.

I prepare my base CentOS AMI with the following mods:

  • Install Facter and Puppet – but not enabled
  • Install the EC2 utilities
  • Setup the usual getsshkeys script
  • Install the ec2-boot-init RPM
  • Add a custom fact that reads /etc/facts.txt – see later why. Get one here.

With this in place you need to create some ruby scripts that you will use to bootstrap your machines. Examples of this would be to install mcollective, configure it to find your current activemq. Or to set up puppet and do your initial run etc.

We host these scripts on any webserver – ideally S3 – so that when a machine boots it can grab the logic you want to execute on it. This way you can bug fix your bootstrapping without having to make new AMIs as well as add new bootstrap methods in future to existing AMIs.

Here’s a simple example that just runs a shell command:

newaction("shell") do |cmd, ud, md, config|
    if cmd.include?(:command)
        system(cmd[:command])
    end
end

You want to host this on any webserver in a file called shell.rb. Now create a file list.txt in the same location that just has this:

shell.rb

You can list as many scripts as you want. Now when you boot your instance pass it data like this:

--- 
:facts: 
  role: webserver
:actions: 
- :url: http://your.net/path/to/actions/list.txt
  :type: :getactions
- :type: :shell
  :command: date > /tmp/test

The above will fetch the list of actions – our shell.rb – from http://your.net/path/to/actions/list.txt and then run using the shell action the command date > /tmp/test. The actions are run in order so you probably always want getactions to happen first.
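The dispatch idea can be sketched in a few lines of Python. This is not how the Ruby helper is implemented, just an illustration of a registry of named actions run in the order the user data lists them. The names `ACTIONS`, `newaction` and `run_actions` are hypothetical, and the user data is assumed to be already parsed into a list of dictionaries.

```python
import subprocess

ACTIONS = {}


def newaction(name):
    """Register a handler under an action type, loosely mirroring the Ruby newaction."""
    def register(handler):
        ACTIONS[name] = handler
        return handler
    return register


@newaction("shell")
def shell_action(cmd):
    # Same idea as shell.rb: run the command only when one was supplied.
    if "command" in cmd:
        return subprocess.call(cmd["command"], shell=True)


def run_actions(actions):
    """Run each action in the order listed; return the types that were handled."""
    handled = []
    for action in actions:
        handler = ACTIONS.get(action.get("type"))
        if handler is None:
            continue  # nothing registered for this type (yet)
        handler(action)
        handled.append(action["type"])
    return handled
```

In this sketch an action whose handler has not been registered is silently skipped, which is why an action that loads more handlers has to come first in the list.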

Other actions that this script will take:

  • Cache all the user and meta data in /var/spool/ec2boot
  • Create /etc/facts.txt with all your facts that you passed in as well as a flat version of the entire instance meta data.
  • Create a MOTD that shows some key data like AMI ID, Zone, Public and Private hostnames
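Flattening nested meta data into simple facts could look something like the Python sketch below. The `key=value` line format, the `ec2` prefix and the underscore separator are assumptions made for the example; the real helper may name and format things differently.

```python
def flatten(data, prefix=""):
    """Flatten nested dictionaries into a single level, joining keys with '_'."""
    flat = {}
    for key, value in data.items():
        name = f"{prefix}_{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def facts_lines(user_facts, meta_data):
    """Combine user supplied facts with flattened meta data as sorted key=value lines."""
    merged = flatten(meta_data, "ec2")
    merged.update(user_facts)
    return [f"{key}={value}" for key, value in sorted(merged.items())]


# Hypothetical meta data, shaped like a small part of what EC2 returns.
meta = {"ami-id": "ami-123", "placement": {"availability-zone": "eu-west-1a"}}
for line in facts_lines({"role": "webserver"}, meta):
    print(line)
```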

The boot library provides a few helpers that help you write scripts for this environment specifically around fetching files and logging:

    ["rubygems-1.3.1-1.el5.noarch.rpm",
     "rubygem-stomp-1.1.6-1.el5.noarch.rpm",
     "mcollective-common-#{version}.el5.noarch.rpm",
     "mcollective-#{version}.el5.noarch.rpm",
     "server.cfg.templ"].each do |pkg|
        EC2Boot::Util.log("Fetching pkg #{pkg}")
        EC2Boot::Util.get_url("http://foo.s3.amazonaws.com/#{pkg}", "/mnt/#{pkg}")
     end

This code fetches a bunch of files from an S3 bucket and saves them into /mnt. Each one gets logged to console and syslog. Using this GET helper has the advantage that it has sane retrying etc built in for you already.

It’s fairly early days for this code but it works and I am using it, I’ll probably be adding a few more features soon, let me know in comments if you need anything specific or even if you find it useful.

09 July 2010 ~ Comments Off

dynect4r: A Ruby Library and Command Line Client for the Dynect REST API (Version 2)

Well, I should have listened to everyone who warned me about UltraDNS’s obscene prices. But I figured it’s only DNS, so how much more could they be compared to their competition? $50 per month? Maybe $100? Boy was I surprised to find out that UltraDNS’s prices are literally 10-25 times more than everyone else’s! Hilarious…

I’ve actually been a DynDNS customer since the late nineties or so (I have free custom DNS service for life for making a donation to them back when they were a much smaller company), so I had looked at Dyn.com’s products before. I just must have gotten confused with all their different websites and DNS products, because I somehow got the impression that the DynDNS API wasn’t powerful enough to do what I wanted to do. I was absolutely wrong. After having written command line clients for both APIs (see ultradns4r, and now dynect4r), I think I speak from authority when I say the Dynect API is every bit as powerful as UltraDNS’s. And at 1/10th – 1/25th the cost of UltraDNS, going with Dynect is a no-brainer. But I’ve digressed long enough.

I wrote dynect4r for the same reason I wrote ultradns4r; I wanted to be able to manage all my DNS records via the command line. And now that I’ve learned how to package Ruby projects as gems, you can simply…

gem install dynect4r

and then do things like…

dynect4r-client -n test.example.org 1.1.1.1

Since the key feature of this project is the command line client, the actual library behind it is a pretty simple wrapper around rest-client. If you’re looking for something a bit more powerful to use in your own Ruby projects, you may be interested in dynect_rest by Adam Jacob from Opscode. We actually discovered each other’s projects last night in #chef, and realized that it would probably be a good idea to pool our efforts eventually.

07 July 2010 ~ Comments Off

Puppet resources on demand with MCollective

Some time ago I wrote how to reuse Puppet providers in your Ruby script, I’ll take that a bit further here and show you how to create any kind of resource.

Puppet works based on resources and catalogs. A catalog is a collection of resources and Puppet will apply the catalog to a machine. So in order to do something you can do as before and call the type’s methods directly but if you wanted to build up a resource and say ‘just do it’ then you need to go via a catalog.

Here’s some code, I don’t know if this is the best way to do it, I dug around the code for ralsh to figure this out:

params = { :name => "rip",
           :comment => "R.I.Pienaar",
           :password => '......' }
 
pup = Puppet::Type.type(:user).new(params)
 
catalog = Puppet::Resource::Catalog.new
catalog.add_resource pup
catalog.apply

That’s really simple and doesn’t require you to know much about the inner workings of a type, you’re just mapping the normal Puppet manifest to code and applying it. Nifty.

The natural progression – to me anyway – is to put this stuff into a MCollective agent and build a distributed ralsh.

Here’s a sample use case, I wanted to change my user’s password everywhere:

$ mc-rpc puppetral do type=user name=rip password='$1$xxx'

And that will go out, find all my machines and use the Puppet RAL to change my password for me. You can do anything puppet can, manage /etc/hosts, add users, remove users, packages, services and anything even your own custom types can be used. Distributed and in parallel over any number of hosts.

Some other examples:

Add a user:

$ mc-rpc puppetral do type=user name=foo comment="Foo User" managehome=true

Run a command using exec, with the magical creates option:

$ mc-rpc puppetral do type=exec name="/bin/date > /tmp/date" user=root timeout=5 creates="/tmp/date"

Add an aliases entry:

$ mc-rpc puppetral do type=mailalias name=foo recipient="rip@devco.net" target="/etc/aliases"

Install a package:

$ mc-rpc puppetral do type=package name=unix2dos ensure=present

06 July 2010 ~ Comments Off

Store your cron output for analysis and correlation with cronologger

For the longest time I have wanted to get rid of the dozen or so cron messages I receive every morning about things like DB backups, DB cleanups/vacuums, reporting etc. There are a number of solutions out there to help you manage the cron spam such as cronic, shush and cronwrap. They help by e-mailing you only if there is a problem, however they don't store the cron output itself. To get around that issue I have developed cronologger which can be downloaded from

http://github.com/vvuksan/cronologger

Cronologger is a BASH script that stores all the cron output into a database. I am using CouchDB since it is a great document oriented database that allows me to add attachments (blobs) to a document. I assume it would not be hard to use MongoDB, Riak and others.

Some of the benefits of this utility are

  • Reduce cron spam
  • Provide the ability to correlate adverse effects by overlaying cron events on e.g. Ganglia graphs
  • Provide a better report of all the batch jobs that ran, diff them with past jobs if they should look the same, etc.
  • Provide the ability to easily view what is currently running on the whole infrastructure, i.e. job_duration < 0
  • Review historical output

I am still working on web GUI for most of these things. I will gladly accept patches and new contributions.

Tip: To view a list of documents in a CouchDB database you can use the _utils view e.g. http://localhost:5984/_utils/

05 July 2010 ~ Comments Off

rubyrep : master-master replication for PostgreSQL

rubyrep: database replication that doesn’t hurt. Unlike Oracle and MySQL, PostgreSQL doesn’t have a built-in replication solution, but there are many other replication solutions available for PostgreSQL, like those listed here [...]

03 July 2010 ~ Comments Off

Aggregating Nagios Checks With MCollective

A very typical scenario I come across on many sites is the requirement to monitor something like Puppet across 100s or 1000s of machines.

The typical approaches are to add perhaps a central check on your puppet master or to check using NRPE or NSCA on every node. For this example the option exists to easily check on the master and get one check but that isn’t always easily achievable.

Think for example about monitoring mail queues on all your machines to make sure things like root mail isn’t getting stuck. In those cases you are forced to do per node checks which inevitably result in huge notification storms in the event that your mail server was down and not receiving the mail from the many nodes.

MCollective has had a plugin that can run NRPE commands for a long time, I’ve now added a nagios plugin using this agent to combine results from many hosts.

Sticking with the Puppet example, here are my needs:

  • I want to know if anywhere some puppet machine isn’t successfully doing runs.
  • I want to be able to do puppetd --disable and not get alerts for those machines.
  • I do not want to change any configs when I am adding new machines, it should just work.
  • I want the ability to do monitoring on subsets of machines on different probes

This is a pretty painful set of requirements for nagios on its own to achieve. Easy with the help of MCollective.

Ultimately, I just want this:

OK: 42 WARNING: 0 CRITICAL: 0 UNKNOWN: 0

Meaning 42 machines – only ones currently enabled – are all running happily.
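The aggregation itself is simple to picture. Here is a minimal Python sketch that turns per-host NRPE exit codes into that summary line, using the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). How the overall status is derived from the counts is my assumption for the example and not necessarily what the real plugin does. The host names are made up.

```python
from collections import Counter

# Standard Nagios plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def aggregate(exit_codes):
    """Return (overall_status, summary_line) for a mapping of host -> exit code."""
    counts = Counter(STATES.get(code, "UNKNOWN") for code in exit_codes.values())
    summary = " ".join(f"{name}: {counts[name]}" for name in STATES.values())
    # Assumption: the worst state seen on any host decides the overall status.
    if counts["CRITICAL"]:
        status = 2
    elif counts["WARNING"]:
        status = 1
    elif counts["UNKNOWN"]:
        status = 3
    else:
        status = 0
    return status, summary


# Hypothetical results from 42 enabled machines, all healthy.
results = {f"node{i}.example.com": 0 for i in range(42)}
status, summary = aggregate(results)
print(summary)
```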

The NRPE Check

We put the NRPE logic on every node. A simple check command in /etc/nagios/nrpe.d/check_puppet_run.cfg:

command[check_puppet_run]=/usr/lib/nagios/plugins/check_file_age -f /var/lib/puppet/state/state.yaml -w 5400 -c 7200

In my case I just want to know there are successful runs happening, if I wanted to know the code is actually compiling correctly I’d monitor the local cache age and size.

Determining if Puppet is enabled or not

Currently this is a bit hacky, I’ve filed tickets with Puppet Labs to improve this. The way to determine if puppet is disabled is to check if the lock file exists and if it’s 0 bytes. If it’s not zero bytes it means a puppetd is currently doing a run – there will be a pid in it. Or the puppetd crashed and there’s a stale pid preventing other runs.

To automate this and integrate into MCollective I’ve made a fact puppet_enabled. We’ll use this in MCollective discovery to only monitor machines that are enabled. Get this onto all your nodes perhaps using Plugins in Modules.
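The lock file logic described above boils down to the following, sketched here in Python rather than as the actual Facter fact. The lock file location depends on your Puppet configuration, so it is passed in as a parameter; the function names and the state labels are made up for the example.

```python
import os


def puppet_state(lockfile):
    """Classify the puppetd state from its lock file.

    No lock file:         enabled and idle.
    Zero byte lock file:  disabled with puppetd --disable.
    Lock file with a pid: a run is in progress, or a crashed puppetd left a stale pid.
    """
    if not os.path.exists(lockfile):
        return "enabled"
    if os.path.getsize(lockfile) == 0:
        return "disabled"
    return "running_or_stale"


def puppet_enabled(lockfile):
    """Return 1 unless puppetd has been explicitly disabled, else 0."""
    return 0 if puppet_state(lockfile) == "disabled" else 1
```

A node with a run in progress still counts as enabled here, so it stays in the monitored set.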

The MCollective Agent

You want to deploy the MCollective NRPE Agent to all your nodes, once you’ve got it right you can test it easily using something like this:

% mc-nrpe -W puppet_enabled=1 check_puppet_run
 
 * [ ============================================================> ] 47 / 47
 
Finished processing 47 / 47 hosts in 395.51 ms
              OK: 47
         WARNING: 0
        CRITICAL: 0
         UNKNOWN: 0

Note we’re restricting the run to only enabled hosts.

Integrating into Nagios

The last step is to add this to nagios. I create SSL certs and a specific client configuration for Nagios and put these in its home directory.

The check-mc-nrpe plugin works best with Nagios 3 as it will return subsequent lines of output indicating which machines are in what state so you get the details hidden behind the aggregation in alerts. It also outputs performance data for total nodes, each status and also how long it took to do the check.

The nagios command would be something like this:

define command{
        command_name                    check_mc_nrpe
        command_line                    /usr/sbin/check-mc-nrpe  --config /var/log/nagios/.mcollective/client.cfg  -W $ARG1$ $ARG2$
}

And finally we need to make a service:

define service{
        host_name                       monitor1
        service_description             mc_puppet-run
        use                             generic-service
        check_command                   check_mc_nrpe!puppet_enabled=1!check_puppet_run
        notification_period             awakehours
        contact_groups                  sysadmin
}

Here are a few other command examples I use:

All machines with my Puppet class “pki”, check the age of certs:

check_command   check_mc_nrpe!pki!check_pki

All machines with my Puppet class “bacula::node”, make sure the FD is running:

check_command   check_mc_nrpe!bacula::node!check_fd

…and that they were backed up:

check_command   check_mc_nrpe!bacula::node!check_bacula_main

Using this I removed 100s of checks from my monitoring platform, saving on resources and making sure I can do my critical monitor tasks better.

Depending on the quality of your monitoring system you might even get a graph showing the details hidden behind the aggregation:

The above is a graph showing a series of servers where the backup ran later than usual, I had 2 alerts only, would have had more than 30 before aggregation.

Restrictions for Probes

The last remaining requirement I had was to be able to do checks on different probes and restrict them. My Collective is one big one spread all over the world which means sometimes things are a bit slow discovery wise.

So I have many nagios servers doing local checks. Using MCollective discovery I can now easily restrict checks, for example if I only wanted to check machines in the USA and I had a fact country I only have to change my command line in the service declaration:

check_command   check_mc_nrpe!puppet_enabled=1 country=us!check_puppet_run

This will then via MCollective discovery just monitor machines in the US.

What to monitor this way

As this style of monitoring is done using Discovery you would need to think carefully about what you monitor this way. It’s totally conceivable that if a node is under high CPU load it won’t respond to discovery commands in time, and so won’t get monitored!

You would then for example not want to monitor things like load averages or really critical services this way, but we all have a lot of peripheral things like zombie process counts and a lot of other places where aggregation makes a lot of sense, in those cases by all means consider this approach.

28 June 2010 ~ Comments Off

Overlay deploy timeline on Ganglia graphs

Don't you sometimes wish you could have a visual indicator of when code has been deployed in production? Something like this :-)

Shows deploy time line on a load graph

This is how you can add deploy timeline to your Ganglia graphs or for that matter to any tool that uses RRDs such as Cacti, Munin, Collectd etc.

Background

RRDtool supports so called VRULEs which are

VRULE:time#color[:legend][:dashes[=on_s[,off_s[,on_s,off_s]...]][:dash-offset=offset]]

Draw a vertical line at time. Its color is composed from three hexadecimal numbers specifying the rgb color components (00 is off, FF is maximum) red, green and blue followed by an optional alpha. Optionally, a legend box and string is printed in the legend section. time may be a number or a variable from a VDEF. It is an error to use vnames from DEF or CDEF here. Dashed lines can be drawn using the dashes modifier. See LINE for more details.

What we want to do is add a VRULE for each deployment. For example those three lines above have been generated using these VRULEs

VRULE:1277731886#FF00FF:"Deploys" VRULE:1277721886#FF00FF VRULE:1277711886#FF00FF

Implementation

The easiest way to add these to Ganglia is to modify graph.php in Ganglia Web. You need to look for the following two lines at the end of the file

$command .=  array_key_exists('extras', $rrdtool_graph) ? ' '.$rrdtool_graph['extras'].' ' : '';
$command .=  " $rrdtool_graph[series]";

Then append your own VRULEs, e.g.

$command .= " VRULE:" . $time . "#FF00FF:\"Deploys\"";

Obviously you have to pull in the $time info from wherever you keep track of your deploy times. You can also get creative by using different colors for different deploys, changing legend labels, or adding VRULEs to only certain graphs, e.g. load, CPU etc. This is a quick and dirty way to do it

$deploy_times = array(1278082860,1279393200);
foreach ( $deploy_times as $key => $time ) {
  # Put deploys label only once.
  if ( $key == 0 )
     $command .= " VRULE:" . $time . "#FF00FF:\"Deploys\"";
  else
     $command .= " VRULE:" . $time . "#FF00FF";
}

Now you just have to make sure you append deploy times in the array.
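If your graphs are generated from a script rather than from Ganglia's graph.php, the same idea can be sketched in Python. The helper below only builds the VRULE arguments; `deploy_vrules` is a made-up name, and passing the result to rrdtool as a list of arguments (so that no shell quoting of the legend is needed) is my assumption about how you would use it.

```python
def deploy_vrules(deploy_times, color="FF00FF", legend="Deploys"):
    """Build rrdtool VRULE arguments, putting the legend on the first rule only."""
    rules = []
    for index, timestamp in enumerate(deploy_times):
        rule = f"VRULE:{int(timestamp)}#{color}"
        if index == 0:
            # Label only once so the legend does not repeat for every deploy.
            rule += f":{legend}"
        rules.append(rule)
    return rules


# The same two deploy times as in the PHP example above.
for rule in deploy_vrules([1278082860, 1279393200]):
    print(rule)
```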

Alternate implementations

An alternate implementation is to create an RRD file whenever you do deploys then overlay that graph on top of an existing graph. Trouble is you have to worry about scaling the graph. I never could get it quite right.

Credit

Thanks go to the Circonus guys :-) since they made me think of vertical lines instead of trying the RRD overlay. Also thanks to @toredash for pointing me in the right RRDtool direction by suggesting HRULE.