Designing a Scalable Deployment Pipeline

Anyone who’s led a product engineering team knows that a growing team requires investments in process, communication approaches, and documentation. These investments help new people get up to speed, become productive quickly, stay informed about what the rest of the team is doing, and codify tribal knowledge so it doesn’t leave with people.

One thing that receives less investment when a team scales is its deployment pipeline: the tools and infrastructure for deploying, testing, and running the app in production. Why are these investments lacking even when the team can identify the pain points? My theory is that it nearly always feels too expensive in terms of both money and lost progress on building features.

Following that theory, I now consider designing an effective and scalable deployment pipeline to be the first priority of a product engineering team—even higher than choosing a language or tech stack. The same staging/production design that was my standard just a few years ago now seems unacceptable.

What is a Deployment Pipeline?

Before we dive into what our deployment pipelines used to look like, let’s start by defining a few terms.

A deployment pipeline includes the automation, deploy environments, and process that support getting code from a developer’s laptop into the hands of an end user.

A deploy environment is a named version of the application. It can be uniquely addressed or installed by a non-developer team member, and a developer can deploy an arbitrary version of the underlying codebase to it independently of other environments. Often, distinct deploy environments will also have unique sets of backing data.

A deployment process is the set of rules the team agrees upon regarding hand-off, build promotion between environments, source control management, and new functionality verification.

Automation is the approach to making mundane parts of the deployment process executable by computers as a result of a detectable event (e.g. a source control commit) or a manual push-button trigger.

Our Old Approach: A Hot Mess

In the recent past, our go-to template for a web app deployment pipeline utilized two deployment environments: staging and production.

Process

The process for utilizing these environments looked something like this:

  1. Developer works on a feature locally until it’s ready to be integrated and accepted.
  2. Developer integrates it with the version of the app on staging and deploys it to the staging environment.
  3. Delivery lead verifies that the feature is acceptable by functionally testing it in the staging environment.
  4. Delivery lead gives developer feedback for improvement or approves it as done.
  5. At some point, the developer deploys the features on staging to production.

Automation

We’d also, minimally, automate deployment of an arbitrary version of the app from a developer’s laptop to either environment.

Result

This deployment pipeline is straightforward and easy to implement, but it’s not easy to scale if, for example, you need to grow your dev team, or if you support a heavily used production deployment while simultaneously developing new product functionality.

The most common sign that a prod/staging pipeline is breaking down under scaling demands is the integration pain felt by the delivery lead in Step 3 of the process above. Multiple developers pile their feature updates and bug fixes onto the staging environment. Staging starts to feel like a traffic accident on top of a log jam: a mix of verified and unverified bug fixes and accepted and brand-new feature enhancements. This results in regressions whose root cause cannot be easily found. Since it’s all on staging, the delivery lead doesn’t know which change is the likely culprit, and they’re probably not sure which developer should investigate it.

It’s a hot mess.

In this scenario, the staging environment rarely provides a sense of confidence for the upcoming production deployment. Rather, it foretells the disaster your team is likely to encounter once you go live.

We Can Do Better

If we look at this problem through the lens of the theory of constraints, it’s obvious that the staging deploy environment is the pipeline’s constraint/bottleneck.

We don’t want to drop staging because it provides a valuable opportunity to validate app changes just outside of the live environment. Instead, we want to optimize staging to provide the most value possible, namely:

Provide a deploy environment identical to production except for one or two changes which can be verified one last time right before deploying them to production.

This definition of value implies that the staging environment spends a lot of time looking just like production, which is good. A clean staging environment is an open highway for the next feature or bug fix to be quickly deployed to production with confidence.

Deployed Dev Environments

To minimize the time a new feature spends on staging, we introduced new deploy environments, which we call dev environments. These aren’t the same as local dev environments. A deploy environment needs to be uniquely addressable by the delivery lead; it can’t just be running on your laptop. The number of dev environments is fluid, scaling with the number of developers and the number of in-progress features and updates.

Process

If you think of staging as a clone of production, then think of a dev environment as a clone of staging. The new process looks like this:

  1. Developer works on a feature locally until it’s ready to be integrated and accepted.
  2. Developer spins up a dev environment (cloned from staging) and deploys a change to it.
  3. Delivery lead verifies the feature is acceptable by functionally testing it in the dev environment.
  4. Delivery lead gives developer feedback for improvement or approves it as done.
  5. Developer deploys change to staging and shuts down dev environment.
  6. Delivery lead spot checks change in staging and deploys it to production.

The main difference in our process is moving the iteration on feature acceptance feedback upstream, from the staging environment to the dev environments. This allows staging to be a clean clone of production most of the time and lets us validate multiple updates in parallel, isolated environments. Because features are validated in isolation, we can more easily identify the root cause of a defect or regression resulting from a recent change.

The idea of on-demand deploy environments may be uncommon, but it’s not new. Atlassian called them rush boxes. GitHub called them staff servers and let developers spin them up with Hubot commands.

Automation

In addition to automating deployment, we’ll need to automate the creation of a new dev environment to support this pipeline. Ideally, it should be a clone of staging and uniquely addressable (e.g. dev1.app.com, dev2.app.com, etc.).

Say you’re managing your deploy environments in a cloud service like AWS. Automating this process is doable with, at most, a few weeks of investment. As a stopgap, your team could also spin up a set of dev servers (one per developer) and suspend their respective computing resources (e.g. their EC2 instances) when they’re not in use.
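
As a rough illustration of that stopgap, here’s a small Ruby sketch (not from the original pipeline) that suspends idle dev boxes. It assumes the aws-sdk-ec2 gem, a us-east-1 region, and an environment=dev tag on every dev instance:

  require "aws-sdk-ec2"

  # look up every instance tagged as a dev environment
  ec2 = Aws::EC2::Client.new(region: "us-east-1")
  reservations = ec2.describe_instances(
    filters: [{ name: "tag:environment", values: ["dev"] }]
  ).reservations
  instance_ids = reservations.flat_map { |r| r.instances.map(&:instance_id) }

  # suspend them until they're needed again
  ec2.stop_instances(instance_ids: instance_ids) unless instance_ids.empty?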

In 2014, we started implementing this pipeline design on top of Heroku. This made cloning environments really easy via the built-in ability to fork a copy of an app.
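
To make spinning one up a one-liner, you could wrap the fork in a Rake task along the following lines. This is a hypothetical sketch, not something from the original pipeline: the app names are placeholders, and the exact heroku fork invocation varied across toolbelt versions.

  desc "Spin up a dev environment as a fork of staging and deploy a branch to it"
  task :dev_env, [:name, :branch] do |_t, args|
    app = "myapp-#{args[:name]}"                       # e.g. myapp-dev1
    sh "heroku fork --from myapp-staging --to #{app}"  # copies code, config, and add-ons
    sh "git push https://git.heroku.com/#{app}.git #{args[:branch]}:master"
  end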

The Golden Triforce of Deployment Tools

Today, if you use GitHub and Heroku, you can get everything I described above right out of the box with Heroku Pipelines and Heroku Review Apps. Because of this, GitHub + Heroku is a killer stack for teams focused on building their product over their infrastructure.

I’d also throw in CircleCI for continuous integration. It’s a nearly zero-configuration CI service that can automatically split up a slow test suite and run it in parallel. All of these tools do a great job of guiding a team toward building a portable app, which makes it easy to move to another platform, like AWS, later.

Deploying with Confidence

In summary: Use GitHub + Heroku + CircleCI unless you have a really good reason not to. Keep staging clean with on-demand dev environments. Deploy with confidence.


Deploying from Git with Capistrano

Justin and I provide operational support for the SME Toolkit project, an education portal for small and medium-sized enterprises in developing countries, sponsored by the IFC (the private sector development branch of the World Bank Group).

Recently, the source code for the Rails-based web application was migrated from Subversion to Git. This also changed how we deploy the application. Previously, we deployed a snapshot of code from a tarball placed on a bastion server. With a few changes to our Capistrano configuration, we are able to deploy directly from the source code repository.

Basic Deployment

First we switch from:

  set :scm, :none

to:

  set :scm, :git

and change:

  set :repository, "/path/to/unzipped/snapshot"

to:

  set :repository, "git@server:project/repository.git"

We’ll also want to use:

  set :deploy_via, :remote_cache

so that each deploy only pulls down the commits made since the last one, rather than cloning the repository and pulling down the entire history every time.

Now we can deploy using a command like:

  cap stage_name deploy -S branch=master

Deploying a Specific Tagged Version

But what if we want to deploy a specific tagged version? We should be able to use cap stage_name deploy -S tag=3.2.1.

Unfortunately, it’s the ‘branch’ variable that Capistrano uses internally as the ref to resolve the revision being deployed. This is not well documented, but see lib/capistrano/recipes/deploy/scm/base.rb:77, lib/capistrano/recipes/deploy/scm/git.rb:119, and lib/capistrano/recipes/deploy.rb:29.

The simplest way to address this is just to set :branch, tag in config/deploy.rb. To still be able to run other tasks without specifying a tag, however, we should qualify that with a conditional: set :branch, tag if exists?(:tag).
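
Putting the pieces together, the relevant part of config/deploy.rb ends up looking something like this (the repository URL is a placeholder):

  set :scm, :git
  set :repository, "git@server:project/repository.git"
  set :deploy_via, :remote_cache

  # deploy a specific tag when one is given, otherwise fall back to the branch
  set :branch, tag if exists?(:tag)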

Updating the Version Number

There’s one last thing to be updated: the version number. Before, the developers would populate the REVISION file with a version number when creating a release tarball. Now the REVISION file is populated by Capistrano, with the SHA of the commit being deployed.

If we substitute a new file, say BUILD_VERSION, for REVISION in environment.rb, then we can again take control of the version number displayed in various places throughout the application. Rather than a long, seemingly meaningless string of letters and numbers, we’d like it to include the date the code was deployed and the semantic version number of the code.

We can do this with a small Capistrano task like:

  task :build_version, :except => { :no_release => true } do
    # produces e.g. "2013-05-28 3.2.1.build+<sha of the deployed commit>"
    deploy_date = Time.now.strftime('%F')
    build_version = "#{deploy_date} #{tag}.build+#{current_revision}"
    put(build_version, "#{current_release}/BUILD_VERSION")
  end

If we put this task in the deploy namespace, we can ensure that it runs at the end of each deployment with something like:

  after "deploy:create_symlink", "deploy:build_version"

Now when we deploy, the date and the tag are prepended to the SHA for a much more meaningful build version.


Devops in Munich

Devopsdays Mountain View sold out in a short 3 hours, but there are other events that will breathe devops this summer. DrupalCon in Munich will be one of them.

Some of you might have noticed that I'm co-chairing the devops track for DrupalCon Munich. The CFP is open till the 11th of this month, and we are still actively looking for speakers.

We're trying to bridge the gap between Drupal developers and the people who put their code into production at scale, but also to enhance Drupal developers' knowledge of the infrastructure components they depend on.

We're looking for talks on culture (both success stories and failures) and automation. We're specifically looking for people talking about Drupal deployments, e.g. using tools like Capistrano, Chef, or Puppet. We want to hear where Continuous Integration fits into your deployment, and whether you do Continuous Delivery of a Drupal environment. And how do you test? Yes, we'd like to hear a lot about testing: performance tests, security tests, application tests, and so on. Or have you solved the content vs. code vs. config deployment problem yet?

How are you measuring and monitoring these deployments and adding metrics to them so you can get good visibility into both the system and user actions on your platform? Have you built fancy dashboards showing your whole organisation the current state of your deployment?

We're also looking for people talking about introducing different data backends (NoSQL), scaling different search backends, and building your own CDN using smart filesystem setups, or making smart use of existing backends, such as tuning and scaling MySQL, memcached, and others.

So let's make it clear to the community that Drupal people do care about their code after they've committed it to source control!

Please submit your talks here

Drupal and Configuration Mgmt, we’re getting there …

For those who haven't noticed yet: I'm into devops. I'm also a little bit into Drupal (blame my last name), so one of the frustrations I've been having with Drupal (and much other software) is the automation of deployment and upgrades of Drupal sites.

So for the past couple of days I've been trying to catch up on the ongoing discussion regarding the results of the configuration mgmt sprint. I've been looking at it mainly from a systems point of view, with the use of Puppet, Chef, or similar tools in mind. I know I'm late to the discussion, but hey, some people take holidays in this season :) So below you can read a bunch of my comments and thoughts on the topic.

First of all, to me JSON looks like a valid option.
Initially there was a plan to wrap the JSON in a PHP header for "security" reasons, but that seems to be gone, even though nobody mentioned the problems it would have caused for external configuration management tools.
When thinking about external tools that should be capable of mangling the file, plenty of them support JSON but won't be able to recognize a JSON file with a weird header (thinking e.g. about Augeas (augeas.net)). I'm not talking about IDEs, GUIs, etc. here; I'm talking about system-level tools and libraries that are designed to mangle standard files. For Augeas we could create a separate lens to manage these files, but other tools might have bigger problems with the concept.

As catch suggests, a clean .htaccess should be capable of preventing people from accessing the .json files. There are other methods to figure out whether files have been tampered with; I'm not sure this even fits within Drupal (I'm thinking about reusing existing CA setups rather than having yet another security setup to manage).

In general, to me, tools such as Puppet should be capable of modifying config files and then activating that config with no human interaction required. Obviously drush is a good candidate here to trigger the system after the config files have been changed, but contrary to what some people think, having to browse to a web page to confirm the changes is not an acceptable solution. Just think about having to do this on multiple environments; manual actions are error prone.

Apart from that, I also think the storing of the certificates should not be part of the file. What about a meta file with the appropriate checksums? (Also, if I'm using Puppet or any other tool to manage my config files, then the security, i.e. preventing tampering with these files, is already covered by the configuration management tool.) I do understand that people want to build Drupal in the most secure way possible, but I don't think this belongs in any web application.

When I look at other discussions that wanted to provide a similarly secure setup, they ran into a lot of end-user problems with these kinds of setups. An alternative approach is to make this configurable and/or pluggable. The default should be to have it enabled, but more experienced users should have the opportunity to disable it or replace it with another framework. Making it pluggable upfront saves a lot of hassle later.

Someone in the discussion noted:
"One simple suggestion for enhancing security might be to make it possible to omit the secret key file and require the user to enter the key into the UI or drush in order to load configuration from disk."

Requiring the user to enter a key in the UI or drush would be counterproductive to the goal one wants to achieve; the last thing you want as a requirement is manual/human interaction when automating setups. Therefore a feature like this should never be implemented.

Luckily there seems to be a new idea around that doesn't plan on using a mangled JSON file. Instead of storing the config files in a standard place, we store them in a directory that is named using a hash of your site's private key, like sites/default/config_723fd490de3fb7203c3a408abee8c0bf3c2d302392. The files in this directory would still be protected via .htaccess/web.config, but if that protection failed, the files would still be essentially impossible to find. This means we could store pure, native .json files everywhere instead, to still bring the benefits of JSON (human editable, syntax checkable, interoperable with external configuration management tools, native and speedy encoding/decoding functions), without the confusing and controversial PHP wrapper.

Figuring out the directory name for the configs from a configuration mgmt tool could then be done with something similar to:

  cd sites/default/conf/$(ls sites/default/conf|head -1)

In general I think the proposed setup looks acceptable; it definitely goes in the right direction of providing systems people with a way to automate the deployment of Drupal sites and applications at scale.

I'll be keeping an eye on both the direction they are heading in and the evolution of the code!

Deploying to many servers

We currently deploy our app code to around 50 nodes using Capistrano. We use the "copy" deployment method, i.e. the code is checked out of Subversion onto the local deployment node, rolled into a tarball, then copied out to each target node, where it is unrolled into a release dir before the final symlink is put in place.

As you might imagine, copying to 50 nodes generates quite a bit of traffic, and it takes ~5 mins to do a full deploy.

I was reading this interesting link today; one bullet in particular jumped out at me:

  • "… the few hundred MB binary gets rapidly pushed bia [sic] bit torrent."

Now that's an interesting idea. I wonder if I can knock up something in Capistrano that deploys using BitTorrent?
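
Purely as a thought experiment (not something I've built), a Capistrano 2 task for that idea might be shaped roughly like this. The internal tracker URL, the transmission-create and transmission-cli commands on every node, and the seeding arrangement are all assumptions:

  namespace :deploy do
    task :push_via_bittorrent, :roles => :app do
      tarball = "#{release_name}.tar.gz"
      torrent = "#{tarball}.torrent"
      # create the torrent on the deployment node; something would also need to
      # keep seeding the tarball from here (left out of this sketch)
      system("transmission-create -o /tmp/#{torrent} -t http://tracker.internal/announce /tmp/#{tarball}") or abort "torrent creation failed"
      upload("/tmp/#{torrent}", "/tmp/#{torrent}")
      # each node pulls the release from the swarm instead of from the deploy node;
      # transmission-cli keeps seeding afterwards, so a real version would background or stop it
      run "transmission-cli -w /tmp /tmp/#{torrent}"
      run "mkdir -p #{release_path} && tar xzf /tmp/#{tarball} -C #{release_path}"
    end
  end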

Devops homebrew

There has been quite a bit of discussion about Devops and what it means. @blueben has suggested we start a Devops patterns cookbook so people can learn what worked or didn't work. This is the description of the environment we implemented at a previous job. Some of these things may or may not work for you. I will try to keep it short.

Environment background

7 distinct applications/products had to be deployed and tested, e.g. a base/core application, a messaging platform, a reporting app, etc. All applications were Java-based, running on either Tomcat or JBoss.

Application design for deployment

These are some of the key points:

  1. The application should have sane default configuration options, and any option should be overridable by an external file. In most cases you only need to override database credentials (host, username, password). The goal is to be able to use the same binary across multiple environments.
  2. The application should expose key internal metrics. We, for instance, asked for a simple key/value-pairs web page, e.g. JMSenqueue=OK. This is important because there are lots of things that can break inside the application which external monitoring may miss, like a JMS message that can't be enqueued (see the monitoring sketch after this list).
  3. Keep release-notes actions to a minimum. Release notes are often not followed, or only partially followed, so make sure point 1 is honored and/or try to automate everything else.
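
To illustrate what consuming such a status page could look like, here's a small check script in Ruby. It's not part of the original setup; the URL and the convention that every health key reports OK are assumptions:

  require "net/http"

  body = Net::HTTP.get(URI("http://app.internal:8080/status"))
  failures = body.each_line
                 .map { |line| line.strip.split("=", 2) }
                 .select { |key, value| key && value && value != "OK" }
                 .map(&:first)

  if failures.empty?
    puts "all internal checks OK"
    exit 0
  else
    warn "failing checks: #{failures.join(', ')}"
    exit 2  # Nagios-style CRITICAL
  end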

Continuous Integration

We used CruiseControl for Continuous Integration. It was used solely to make sure that someone didn't break the build.

Creating releases

Developers are in charge of building and packaging releases, primarily because QA or ops will not know what to do if a build fails (this is Java, remember). Each release has to be clearly labeled with the version and tagged in the repository. For example, Location 1.1.5 will be packaged as location-1.1.5.tar.gz. Archives should contain only WAR (Tomcat) or EAR (JBoss) files and DB patch files. Releases are to be deposited into an appropriate file share, e.g. /share/releases/location.

Deployment

In order to eliminate most manual deployment steps and support all the different applications, we decided to write our own deployment tool. We started off with a data model, which roughly broke down to:

  1. Applications – can use different app server containers (e.g. Tomcat/JBoss) and may have configuration files that are either key/value pairs or templates. For every application we also specified a start and a stop script (hot deploy was not an option due to bad experiences with our code).
  2. Domains/Customers – we wanted a single dashboard that would allow us to deploy to multiple environments, e.g. QA staging (current release), QA development (next scheduled release), dev playbox, etc. Each of these domains had its own set of applications it could deploy, with its own configuration options.

First we wrote a command-line tool that was capable of doing something like this:

  $ deployer --version 1.2.5 --server web10 --domain joedev --app base --action deploy

What this would do is the following (a rough sketch of such a driver appears after the list):

  1. Find and unpack the proper app server container, e.g. jboss-4.2.3.tar.gz
  2. Overlay the WAR/EAR files for the named version, e.g. base-1.2.5.tar.gz
  3. Build configuration files and scripts
  4. Stop the server on the remote box (if it's running)
  5. Rsync the contents of the packaged release
  6. Make sure the Apache AJP proxy is configured to proxy traffic and do an Apache reload
  7. Start up the server
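
Here is a rough, hypothetical sketch of such a driver, just to show how the steps above might hang together. Paths, container versions, and init scripts are placeholders, and the real tool was considerably more involved:

  #!/usr/bin/env ruby
  require "optparse"

  options = {}
  OptionParser.new do |o|
    o.on("--version VERSION") { |v| options[:version] = v }
    o.on("--server SERVER")   { |v| options[:server]  = v }
    o.on("--domain DOMAIN")   { |v| options[:domain]  = v }
    o.on("--app APP")         { |v| options[:app]     = v }
    o.on("--action ACTION")   { |v| options[:action]  = v }
  end.parse!

  app, version, server = options.values_at(:app, :version, :server)
  build_dir = "/tmp/#{app}-#{version}"

  # steps 1-2: unpack the app server container and overlay the release archive
  system("mkdir -p #{build_dir}") or abort "mkdir failed"
  system("tar xzf /share/containers/jboss-4.2.3.tar.gz -C #{build_dir}") or abort "container unpack failed"
  system("tar xzf /share/releases/#{app}/#{app}-#{version}.tar.gz -C #{build_dir}") or abort "release unpack failed"

  # step 3: building config files from the domain's overrides is elided here

  # steps 4-7: stop the remote server, push the build, reload Apache (AJP config elided), start it again
  system("ssh #{server} /etc/init.d/#{app} stop")
  system("rsync -a --delete #{build_dir}/ #{server}:/opt/#{app}/") or abort "rsync failed"
  system("ssh #{server} sudo /etc/init.d/httpd reload")
  system("ssh #{server} /etc/init.d/#{app} start") or abort "start failed"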

One of the main reasons we started off with a command-line tool is that we could easily write batch scripts to upgrade whole sets of machines. This was borne out of the pain of having to upgrade 200 instances via a web GUI at another job.

Once deployer was working, we wrote a web GUI that interfaced with it. You could do things like view the running config (what config options are actually on the app server), Stop, Restart, Deploy (a particular version), Reconfig (apply config changes), and Undeploy. We also added the ability to change or add configuration options to the application-specific override files. A picture is worth a thousand words; this is a tiny snippet of how it approximately looked for one domain.

This was a big win since QA or developers no longer needed to have someone from ops deploy software.

DB patching

Another big win was "automated" DB patching. Every application had a table called Patch with a list of DB patches that had already been applied. We also agreed that every app would have a dbpatches directory in the app archive containing the patches, named with the version and the order in which they should be applied, e.g.:

  • 2.54.01-addUserColumn.sql
  • 2.54.02-dropUidColumn.sql

During deployment, the startup script would compare the contents of the Patch table with the list of dbpatches and apply any missing ones. If a patch script failed, an e-mail would be sent to the QA or dev in charge of the particular domain.
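
A minimal sketch of that comparison logic might look like the following. It's illustrative only, not the original script; the mysql2 gem, the connection details, and the single-column Patch table are assumptions:

  require "mysql2"

  client  = Mysql2::Client.new(host: "localhost", username: "app", database: "app")
  applied = client.query("SELECT name FROM Patch").map { |row| row["name"] }

  # patch files are named so that a lexical sort gives the right apply order,
  # e.g. 2.54.01-addUserColumn.sql before 2.54.02-dropUidColumn.sql
  Dir.glob("dbpatches/*.sql").sort.each do |path|
    name = File.basename(path)
    next if applied.include?(name)

    puts "applying #{name}"
    system("mysql app < #{path}") or abort "#{name} failed -- e-mail the domain owner"
    client.query("INSERT INTO Patch (name) VALUES ('#{client.escape(name)}')")
  end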

A slightly modified process was used in production to try to reduce downtime, i.e. things like adding a column could be done at any time. The automated process was largely there to make QA's job easier.

QA and testing

When a release was ready, QA would deploy it themselves. If there was a deployment problem, they would attempt to troubleshoot it themselves and then contact the appropriate person. Most of the time it was an app problem, e.g. a particular library didn't get committed. This was a huge win, since we avoided a lot of "waterfall" problems by allowing QA to self-service.

Production

The production environment was strictly controlled. Only ops and a couple of key engineers had access to it, because we tried to keep the environment as stable as possible, so ad hoc changes were frowned upon. If you needed to make a change, you either had to commit a change to the configuration management system (Puppet) or use the deployment tool.

Production deployment

The day before the release, QA would open a ticket listing all the applications and versions that needed to be deployed. On the morning of the deployment (that was our low-traffic time), someone from ops, development, and the whole QA team engaged in deploying the app and resolving any observed issues.

Monitoring

Regular metrics such as CPU utilization, load, etc. were collected. In addition, we kept track of internal metrics and set up adequate alerts. This is an ongoing process, since over time you discover what your key metrics are and what their thresholds should be, e.g. number of threads, number of JDBC connections, etc.

Things that didn't work so well or were challenging

  1. One of the toughest parts was getting developers' attention to add "goodies" for ops. In particular, exposing application internals was often put off until we eventually had an outage, and the lack of that metric resulted in an extended outage.
  2. The deployment tool took a couple of tries to get right. Even then, there were a couple of things I would have done differently, e.g. not relying on a relational database for the data model, since it made it difficult to create diffs (you had to dump the whole DB). I'd likely go with JSON so that diffs could be easily reviewed and committed.
  3. Other issues I can't recall right now :-)

Wrapup

This is the shortest description I could write. There are a number of things I glossed over or omitted so that it would not be too long; I may write about those on another occasion. Perhaps the key takeaway should be that ops should focus on developing tools that either automate things or allow their customers (QA, dev, technical support, etc.) to serve themselves.