Removing an orphaned resource from terraform state

If you manually delete a resource that is being managed by terraform, it is not removed from the state file and becomes "orphaned".

You may see errors like this when running terraform:

1 error(s) occurred:
* aws_iam_role.s3_readonly (destroy): 1 error(s) occurred:
* aws_iam_role.s3_readonly (deposed #0): 1 error(s) occurred:
* aws_iam_role.s3_readonly (deposed #0): Error listing Profiles for IAM Role (s3_readonly) when trying to delete: NoSuchEntity: The role with name s3_readonly cannot be found.

This prevents terraform from running, even if you don't care about the missing resource, such as when you're trying to delete everything, i.e. running terraform destroy.

Fortunately, terraform has a command for exactly this situation, to remove a resource from the state file: terraform state rm <name of resource>

In the example above, the command would be terraform state rm aws_iam_role.s3_readonly

Ansible Communication with AWS EC2 Instances on a VPC

I’ve recently started using Ansible to manage Elastic Compute Cloud (EC2) hosts on Amazon Web Services (AWS). While it is possible to have public IP addresses for EC2 instances on an AWS Virtual Private Cloud (VPC), I opted to place the EC2 instances on a private VPC subnet which does not allow direct access from the Internet. This makes communicating with the EC2 instances a little more complicated.

While I could create a VPN connection to the VPC, this is rather cumbersome without a compatible hardware router. Instead, I opted to create a bastion host which allows me to connect to the VPC, and communicate securely with EC2 instances over SSH.

VPC Architecture

I run a fairly simple VPC architecture with four subnets, two public and two private, with one of each type paired in separate availability zones. The public subnets have direct Internet access, whereas the private subnets cannot be addressed directly, and must communicate with the Internet via a NAT gateway.

[Diagram: sample AWS VPC architecture]

In the diagram, my computer at 70.80.50.30 wants to run Ansible against an EC2 instance at 172.31.50.5 in “Private Subnet 2.” 172.31.0.0/16 lies within the RFC 1918 private address range 172.16.0.0/12; its addresses cannot be routed over the Internet. Furthermore, as “Private Subnet 2” has no direct access to the Internet (its outbound traffic goes via the NAT gateway at 172.31.32.2), there is no way to assign the instance a public IP address.

On this network, in order to communicate with 172.31.50.5, my computer must either be connected to the VPC with a VPN connection, or forward traffic via the bastion host. In my case, a VPN connection is not feasible, so I made use of the bastion host, which has both a publicly routable IP address (52.89.24.1) and a private address on the 172.31.0.0/16 network at 172.31.2.5.

SSH Jump Hosts

A common practice for reaching hosts on an internal network which are not directly accessible is to use an SSH jump host. Once an SSH connection is made to the jump host, additional connections can be made from it to hosts on the internal network.

Generally, this looks something like:


jk@localhost:~$ ssh ubuntu@52.50.10.5
ubuntu@52.50.10.5:~$ ssh ubuntu@192.168.0.20
ubuntu@192.168.0.20:~$ 

[Diagram: SSH jump host]

This could also be simplified as one command invocation:


jk@localhost:~$ ssh -t ubuntu@52.50.10.5 'ssh ubuntu@192.168.0.20'
ubuntu@192.168.0.20:~$ 

(Note the -t to force pseudo-TTY allocation.)

The connections from the jump host to other hosts do not necessarily need to be SSH connections. For example, a socket connection can be opened:


jk@localhost:~$ ssh ubuntu@52.50.10.5 'nc 192.168.0.20 22'
SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4

SSH ProxyCommand

Ansible makes use of SSH to connect to remote hosts. However, it does not support configuration of an explicit SSH jump host. This would make it impossible for Ansible to connect to a private IP address without other networking (e.g. VPN) magic. Fortunately, Ansible takes common SSH configuration options, and will respect the contents of a system SSH configuration file.

The ProxyCommand option for SSH allows specifying a command to execute in order to connect to a remote host. This lets us push the specifics of reaching the remote host down into SSH itself; SSH can then provide the jump host connection transparently to Ansible.

Essentially, ProxyCommand works by substituting the standard SSH socket connection with what is specified in the ProxyCommand option.


ssh -o ProxyCommand="ssh ubuntu@52.50.10.5 'nc 192.168.0.20 22'" ubuntu@nothing

The above command will, for example, first connect to 52.50.10.5 via SSH, and then open a socket to 192.168.0.20 on port 22. The socket connection (which is connected to the remote SSH server) is then passed to the original SSH client command invocation to utilize.
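To make the mechanics concrete, here is a minimal Python sketch using the paramiko library (an illustration of the same idea, not something the original setup relies on): the stdin/stdout of the proxy command stand in for the TCP socket that SSH would otherwise open directly.

import paramiko

# The ProxyCommand's stdin/stdout replace a direct TCP socket to the target host.
proxy = paramiko.ProxyCommand("ssh ubuntu@52.50.10.5 'nc 192.168.0.20 22'")

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect('192.168.0.20', username='ubuntu', sock=proxy)

stdin, stdout, stderr = client.exec_command('hostname')
print(stdout.read())
client.close()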

The ProxyCommand allows the interpolation of the original host and port to connect to via the %h and %p tokens.

Running:


ssh -o ProxyCommand="ssh ubuntu@52.50.10.5 'nc %h %p'" ubuntu@192.168.0.20

Is equivalent to running:


ssh -o ProxyCommand="ssh ubuntu@52.50.10.5 'nc 192.168.0.20 22'" ubuntu@192.168.0.20

SSH Configuration File

Using the ProxyCommand in conjunction with an SSH configuration file, we can make SSH connections to a private IP address appear seamless to whichever application is executing SSH.

For my VPC architecture described above, I could add the following to an SSH configuration file:


Host 172.31.2.5
  ProxyCommand ssh ubuntu@52.89.24.1 nc %h %p

This makes all SSH connections to the private IP address 172.31.2.5 seamless:


ssh -F ./mysshconfig_file ubuntu@172.31.2.5

And, if using the default .ssh/config for storing your SSH configuration options, you don’t even need to specify the -F option:


ssh ubuntu@172.31.2.5

All Together Now

Using the ProxyCommand option, it is simple to abstract away the details of the underlying connection to the EC2 instances on the private VPC subnet and allow Ansible to connect to those hosts normally. Any hosts on the private VPC subnet can be added explicitly to an SSH configuration file, or the pattern can be expanded. For example, we can apply the ProxyCommand option to all hosts on the 172.31.0.0/16 VPC subnet:


Host 172.31.*
  ProxyCommand ssh ubuntu@52.89.24.1 nc %h %p

When running Ansible, the host inventory can simply specify the private IP address (such as 172.31.2.5) as the connection hostname/address, and SSH will handle the necessary underlying connections to the bastion host.

Generally, the system or user SSH configuration file (~/.ssh/config) can be used, but Ansible-specific SSH configuration options can also be included in the ansible.cfg file.

This is particularly convenient when using dynamic host inventories with EC2, which can automatically return the private IP addresses of new EC2 instances from the AWS APIs.
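As a rough sketch of where those addresses come from (the real EC2 dynamic inventory script is more elaborate; the region here is an assumption), the private IPs of running instances can be pulled from the AWS API with boto3:

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

# Collect the private IP address of every running instance; these become the
# inventory addresses, and the SSH ProxyCommand handles reaching them.
private_ips = []
reservations = ec2.describe_instances(
    Filters=[{'Name': 'instance-state-name', 'Values': ['running']}])['Reservations']
for reservation in reservations:
    for instance in reservation['Instances']:
        private_ips.append(instance['PrivateIpAddress'])

print('\n'.join(private_ips))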

Additional SSH and nc flags can be added to the ProxyCommand option to enhance flexibility.

For example, adding in -A to enable SSH agent forwarding, -q to suppress extra SSH messages, -w to adjust the timeout for nc, and any other standard SSH configuration options:


Host 172.31.*
  User ec2-user
  ProxyCommand ssh -q -A ubuntu@52.89.24.1 nc -w 300 %h %p


Managing AWS CloudFront Security Group with AWS Lambda

One of our security groups on Amazon Web Services (AWS) allows access to an Elastic Load Balancer (ELB) from one of our Amazon CloudFront distributions. Traffic from CloudFront can originate from a number of different source IP addresses that Amazon publishes. However, there is no pre-built security group to allow inbound traffic from CloudFront.

I constructed an AWS Lambda function to periodically update our security group so that we can ensure all CloudFront IP addresses are permitted to access our ELB.

AWS Lambda

AWS Lambda allows you to execute functions in a few different languages (Python, Java, and Node.js) in response to events. One of these events can be the triggering of a regular schedule. In this case, I created a scheduled event with an Amazon CloudWatch rule to execute a lambda function on an hourly basis.

CloudWatch Schedule to Lambda Function
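For reference, here is a hedged boto3 sketch of wiring up such an hourly schedule; the rule name, function name and ARN are illustrative, not values from this setup.

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Illustrative function ARN - substitute your own.
function_arn = 'arn:aws:lambda:us-west-2:123456789012:function:update-cloudfront-sg'

# Create (or update) an hourly schedule rule.
rule = events.put_rule(Name='hourly-cloudfront-sg-update',
                       ScheduleExpression='rate(1 hour)')

# Allow CloudWatch Events to invoke the function, then point the rule at it.
lambda_client.add_permission(FunctionName='update-cloudfront-sg',
                             StatementId='hourly-cloudfront-sg-update',
                             Action='lambda:InvokeFunction',
                             Principal='events.amazonaws.com',
                             SourceArn=rule['RuleArn'])

events.put_targets(Rule='hourly-cloudfront-sg-update',
                   Targets=[{'Id': 'update-cloudfront-sg', 'Arn': function_arn}])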

The Idea

The core of my code involves calls to authorize_ingress and revoke_ingress using the boto3 library for AWS. AWS Lambda makes the boto3 library available for Python functions.


print("the following new ip addresses will be added:")
print(authorize_dict['ipranges'])
print("the following new ip addresses will be removed:")
print(revoke_dict['ipranges'])
security_group.authorize_ingress(ippermissions=[authorize_dict])
security_group.revoke_ingress(ippermissions=[revoke_dict])

Amazon publishes the IP address ranges of its various services online.


response = urllib2.urlopen('https://ip-ranges.amazonaws.com/ip-ranges.json')
json_data = json.loads(response.read())
new_ip_ranges = [ x['ip_prefix'] for x in json_data['prefixes'] if x['service'] == 'CLOUDFRONT' ]
print(new_ip_ranges)

I can easily compare the ingress address ranges currently allowed in an existing security group with those retrieved from the published ranges. The authorize_ingress and revoke_ingress functions then allow me to modify the security group to keep it up to date and permit traffic from CloudFront to reach my ELB.


for ip in new_ip_ranges:
    if ip not in current_ip_ranges:
        authorize_dict['IpRanges'].append({u'CidrIp': ip})
for ip in current_ip_ranges:
    if ip not in new_ip_ranges:
        revoke_dict['IpRanges'].append({u'CidrIp': ip})

The AWS Lambda Function

The full lambda function is written as a standard lambda_handler for AWS. In this case, the event and context are ignored, and the code is just executed on a regular schedule.

Lambda Function

Notice that the existing security group is directly referenced as sg-3xxexx5x.


from __future__ import print_function
import copy, json, urllib2, boto3

def lambda_handler(event, context):
    # Fetch the published AWS IP ranges and keep only the CloudFront prefixes.
    response = urllib2.urlopen('https://ip-ranges.amazonaws.com/ip-ranges.json')
    json_data = json.loads(response.read())
    new_ip_ranges = [ x['ip_prefix'] for x in json_data['prefixes'] if x['service'] == 'CLOUDFRONT' ]
    print(new_ip_ranges)

    # Load the security group and the CIDR ranges it currently allows.
    ec2 = boto3.resource('ec2')
    security_group = ec2.SecurityGroup('sg-3xxexx5x')
    current_ip_ranges = [ x['CidrIp'] for x in security_group.ip_permissions[0]['IpRanges'] ]
    print(current_ip_ranges)

    params_dict = {
        u'PrefixListIds': [],
        u'FromPort': 0,
        u'IpRanges': [],
        u'ToPort': 65535,
        u'IpProtocol': 'tcp',
        u'UserIdGroupPairs': []
    }

    # Deep copies, so the two permission dicts do not share the same IpRanges list.
    authorize_dict = copy.deepcopy(params_dict)
    for ip in new_ip_ranges:
        if ip not in current_ip_ranges:
            authorize_dict['IpRanges'].append({u'CidrIp': ip})

    revoke_dict = copy.deepcopy(params_dict)
    for ip in current_ip_ranges:
        if ip not in new_ip_ranges:
            revoke_dict['IpRanges'].append({u'CidrIp': ip})

    print("the following new ip addresses will be added:")
    print(authorize_dict['IpRanges'])
    print("the following new ip addresses will be removed:")
    print(revoke_dict['IpRanges'])

    security_group.authorize_ingress(IpPermissions=[authorize_dict])
    security_group.revoke_ingress(IpPermissions=[revoke_dict])
    return {'authorized': authorize_dict, 'revoked': revoke_dict}

The Security Policy

The above lambda function presumes it has permission to modify the referenced security group. These permissions can be configured with an AWS Identity and Access Management (IAM) policy applied to the role that the lambda function executes as.

Lambda function role

Notice that the security group resource, sg-3xxexx5x, is specifically scoped to the us-west-2 AWS region.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeNetworkAcls"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:RevokeSecurityGroupIngress"
            ],
            "Resource": "arn:aws:ec2:us-west-2:*:security-group/sg-3xxexx5x"
        }
    ]
}

Making It All Work

In order to get everything hooked up correctly, an appropriate security group needs to exist. Its identifier needs to be referenced in both the Lambda function and the policy attached to the role that the function executes as; note that the IAM policy references the Amazon Resource Name (ARN) rather than the bare security group identifier. The function also presumes that Amazon publishes changes to the CloudFront IP address ranges in a timely manner, and that running once per hour is sufficient to keep the ingress permissions on the security group up to date. If the CloudFront ranges change frequently, or the traffic is particularly crucial, the function should be run more often.


Four Stages of CloudFormation

AWS CloudFormation gives developers and systems administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion.
-- AWS CloudFormation Homepage

I've gone from never having used Amazon CloudFormation to building multi-tier, cross-region, many-availability-zone deployments in a couple of months, and while digging through official documentation, support requests, blog posts and sample templates I've put together what I've come to view as the 'Four Stages of CloudFormation'. If I'd known about these when I first started, I'd have saved myself some time and a few more of those few, semi-precious remaining hairs.

One Template to rule them

It begins innocuously. You decide to use CloudFormation and you start to put your resources into what will become the all-encompassing JSON file of darkness. You add the VPC, a couple of subnets and then you do a test build. It fails. You make the corrections and continue. A couple of times over the day you make enough progress to warrant another test run. Sometimes it fails and rolls back, sometimes it passes. You end the day with enough VPC in place to run an autoscaling web server group and all is well. You tear down all the testing resources and go home.

You start to add the application autoscaling groups and their requirements - security groups, subnets, launch configs etc. - and then you do a test run. You watch your email as tens of SNS notifications come in as the stack builds itself and then... it fails. You start getting the rollback emails. Something went wrong and now you get to see each stage of the build unwind itself. Your testing time has now grown to maybe 30 minutes per change. Sometimes you get an intermittent failure, like an EIP not getting attached, and a CloudFormation template that normally works folds like cheap paper in the rain. You can lose a day testing a handful of changes this way - especially when you involve RDS. So you decide to grow and change: you decide to have multiple templates.

Little Nightmares

So you look at your architecture diagram (or your wedge of JSON) and start to separate the resources into logical groupings: basic VPC config, webservers and supporting functionality, RDS and option groups, and so on. You run the basic VPC template and it goes through quickly and easily. Too easily.

You move further in and run the bastion host template. The reference errors begin. When everything was in a single template, requiring a couple of "Parameters" and using references everywhere inside the template ( "VpcId" : { "Ref" : "VPC" } ) was easy. Now you have to pass in a parameter for each bit of state you need in this new template: VPC id, public subnet ids, NAT route table ids. Your command lines start getting bigger, but you decide the shorter testing cycles and component separation are worth it. Then you discover that SourceSecurityGroupName is a lie across templates and needs to be SourceSecurityGroupId, which you also need to pass in.
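To make the pain concrete, here is a hedged sketch of launching one of those dependent templates with boto3 (a present-day library; the stack, template and parameter names are illustrative):

import boto3

cfn = boto3.client('cloudformation')

# Every piece of state created by the VPC template now has to be threaded
# through as a parameter to the dependent template (values are illustrative).
cfn.create_stack(
    StackName='bastion',
    TemplateBody=open('bastion.template').read(),
    Parameters=[
        {'ParameterKey': 'VpcId', 'ParameterValue': 'vpc-11111111'},
        {'ParameterKey': 'PublicSubnetIds', 'ParameterValue': 'subnet-aaaaaaaa,subnet-bbbbbbbb'},
        {'ParameterKey': 'NatRouteTableId', 'ParameterValue': 'rtb-22222222'},
        {'ParameterKey': 'SourceSecurityGroupId', 'ParameterValue': 'sg-33333333'},
    ],
)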

Between the carried state and the duplication - writing out one public subnet, NAT route and route association per AZ, then adding them all again with a 2 in the name because the template has no iteration - you decide that a little coding will make it all so much better.

The Wrapper

Some people start off with a script to build their infrastructure - often using boto or fog - and gradually add to it over time. Avoiding all the hassles of the built-in CloudFormation types and piles of JSON is an alluring prospect. However, this leads to the same kind of problems that Puppet and Chef solved over provisioning with shell scripts. Writing idempotent code against a big backend of different APIs is hard; you can end up with masses of exception-handling code. Scratching my personal itch also becomes a lot harder - I like to generate a view of what impact a changeset will have, which is quite easy to do if you have CloudFormation as your intermediate format (you're diffing two JSON files) but quite hard to do consistently well when you're using a REST API and making lots of individual calls.

While I've been quite down on a pure script-and-API-call approach, I think scripting your infrastructure with a library that abstracts CloudFormation is the current winning approach for me. When configuring an app that needs to run over three availability zones, I can call a method in a loop that generates the pile of CloudFormation boilerplate and keeps the three occurrences of it in perfect sync. I can even do the template upload and stack creation from the script itself, and create a set of post actions that turn on name resolution or run basic Nagios checks to confirm the stack works.
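As a hedged illustration (the post doesn't commit to a particular library), troposphere is one Python library that works this way; a minimal sketch of generating a subnet per availability zone might look like this:

from troposphere import Ref, Template, ec2

t = Template()
vpc = t.add_resource(ec2.VPC("Vpc", CidrBlock="10.0.0.0/16"))

# One loop generates the per-AZ boilerplate and keeps all three copies in sync
# (the CIDRs and availability zones are illustrative).
for i, az in enumerate(["us-east-1a", "us-east-1b", "us-east-1c"], start=1):
    t.add_resource(ec2.Subnet(
        "PublicSubnet%d" % i,
        VpcId=Ref(vpc),
        CidrBlock="10.0.%d.0/24" % i,
        AvailabilityZone=az,
    ))

print(t.to_json())  # upload this template and create the stack from the same script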

Compared to where I started, this feels like a much better tool chain and removes a lot of the painful scutwork, but it does feel like an intermediate step. Which leads us to...

The Future

So what's beyond this? I think that the libraries will improve in a couple of different directions. Firstly, something like an ActiveRecord / Clusto style syntax mapping of common stacks could save a lot of time and effort -



application = cf_lib('appname')
application.tier('web', asg=True, own_subnet=True)
application.tier('app', asg=True)

application.connect_sgs(from_tier=application.tier('web'), to_tier=application.tier('app'), ports=[80, 443])

This tiny chunk of code would hide masses of configuration-by-convention boilerplate: subnets, basic CloudWatch, notification topics, network ACLs and so on. It'd also allow easy specification of parameters across templates. Eventually it'd become available as a layer in Visio, and then you could 'draw' your new applications, diff the JSON, get it back in a nice graphical form, and I could put my pretty-printing script to sleep.

I think the second direction will be to fill in some of the gaps in the CloudFormation functionality. It's currently impossible to turn on the per-instance public FQDN option for a VPC using CloudFormation. You can build your entire stack, but to reach any of the hosts via an FQDN you have to use either the CLI or the web interface. I think library shims for this kind of missing functionality, masquerading as CloudFormation types, could be added and then listed as pre- or post-run actions or as a normal dependency. Once CloudFormation adds support you can then remove the shim.
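A hedged sketch of what such a post-run shim might look like with boto3 (a present-day library; the VPC id is illustrative):

import boto3

ec2 = boto3.client('ec2')

# Post-run shim: flip on the VPC DNS attributes that the templates themselves
# could not set. The VPC id is illustrative; only one attribute per API call.
ec2.modify_vpc_attribute(VpcId='vpc-11111111', EnableDnsSupport={'Value': True})
ec2.modify_vpc_attribute(VpcId='vpc-11111111', EnableDnsHostnames={'Value': True})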


Amazon Web Services, Hosting in the Cloud and Configuration Management

Amazon is probably the biggest cloud provider in the industry – they certainly have the most features and are adding more at an amazing rate.

Amongst the long list of services provided under the AWS (Amazon Web Services) banner are:

  • Elastic Compute Cloud (EC2) – scalable virtual servers based on the Xen Hypervisor.
  • Simple Storage Service (S3) – scalable cloud storage.
  • Elastic Load Balancing (ELB) – high availability load balancing and traffic distribution.
  • Elastic IP Addresses – re-assignable static IP addresses for EC2 instances.
  • Elastic Block Store (EBS) – persistent storage volumes for EC2.
  • Relational Database Service (RDS) – scalable MySQL compatible database services.
  • CloudFront – a Content Delivery Network (CDN) for serving content from S3.
  • Simple Email Service (SES) – for sending bulk e-mail.
  • Route 53 – high availability and scalable Domain Name System (DNS).
  • CloudWatch – monitoring of resources such as EC2 instances.

Amazon provides these services in 5 different regions:

  • US East (Northern Virginia)
  • US West (Northern California)
  • Europe (Ireland)
  • Asia Pacific (Tokyo)
  • Asia Pacific (Singapore)

Each region has its own pricing and available features.

Within each region, Amazon provides multiple “Availability Zones”. These different zones are completely isolated from each other – probably in separate data centers, as Amazon describes them as follows:

Q: How isolated are Availability Zones from one another?
Each availability zone runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone.

However, unless you have been offline for the past few days, you will no doubt have heard about the extended outage Amazon has been having in their US East region. The outage started on Thursday, 21st April 2011, taking down some big-name sites such as Reddit, Quora, Foursquare and Heroku, and the problems are still ongoing now, nearly two days later – with Reddit and Quora still running in an impaired state.

I have to confess, my first reaction was one of surprise that such big names didn't have more redundancy in place – however, once more information came to light, it became apparent that the outage was affecting multiple availability zones – something Amazon seems to imply above shouldn't happen.

You may well ask why such sites are not split across regions to give more isolation against such outages. The answer lies in the implementation of the zones and regions in AWS. Although isolated, the zones within a single region are close enough together that low-cost, low-latency links can be provided between them. Once you start trying to run services across regions, all inter-region communication goes over the normal Internet and is therefore comparatively slow, expensive and unreliable, so it becomes much more difficult and expensive to keep data reliably synchronised. This, coupled with Amazon's claims above about the isolation between zones and its recommended best practices, has led to the common setup being to split services over multiple availability zones within the same region – and what makes this outage worse is that US East is the most popular region, due to it being a convenient location for sites targeting both the US and Europe.

On the back of this, many people are giving both Amazon and cloud hosting a good bashing, in blog posts and on Twitter.

Where Amazon has let everyone down in this instance is that they let a problem (which in this case is largely centered around EBS) affect multiple availability zones, screwing everyone who either had not implemented redundancy or had followed Amazon's own guidelines and assurances of isolation. I also believe that their communication has been poor; had customers been aware it would take so long to get back online, they might have been in a position to take measures to get back online much sooner.

In reality though, both Amazon and cloud computing have less to do with this problem, and more specifically with the blame for it, than the commentary suggests. At the end of the day, we work in an industry that is susceptible to failure. Whether you are hosting on bare metal or in the cloud, you will experience failure sooner or later, and you need to take that into account in the design of any infrastructure. Failure will happen – it's all about mitigating the risk of that failure through measures like backups and redundancy. There is a trade-off between the cost, time and complexity of implementing multiple levels of redundancy versus the risk of failure and downtime. On each project or infrastructure setup, you need to work out where on this sliding scale you are.

In my opinion, cloud computing provides us an easy way out of such problems. Cloud computing gives us the ability to quickly spin up new services and server instances within minutes, pay by the hour for them and destroy them when they are no longer required. Gone are the days of having to order servers or upgrades and wait in a queue for a data center technician to deal with hardware. It was the norm to incur large setup costs and/or get locked into contracts. In the cloud, instances can be resized, provisioned or destroyed in minutes, often without human intervention, as most cloud computing providers also provide an API so users can manage their services programmatically. Under load, instances can be upgraded or additional instances brought online, and in quiet periods instances can be downgraded or destroyed, yielding a significant cost saving. Another huge bonus is that instances can be spun up for development, testing or to perform an intensive task, and thrown away afterwards.
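As a hedged, present-day sketch of that API-driven lifecycle (boto3 postdates this post, and the AMI id is illustrative):

import boto3

ec2 = boto3.resource('ec2', region_name='eu-west-1')

# Spin up a throwaway instance (the AMI id is illustrative)...
instances = ec2.create_instances(ImageId='ami-12345678', InstanceType='t2.micro',
                                 MinCount=1, MaxCount=1)
instance = instances[0]
instance.wait_until_running()
instance.reload()
print(instance.id, instance.private_ip_address)

# ...and destroy it when it is no longer required, ending the hourly billing.
instance.terminate()
instance.wait_until_terminated()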

Being able to spin new instances up in minutes is, however, less effective if you have to spend hours installing and configuring each instance before it can perform its task. This is especially true if more time is wasted chasing and debugging problems because something is set up differently or missed during the setup procedure. This is where configuration management tools and the 'infrastructure as code' principle come in. Tools such as Puppet and Chef were created to allow you to describe your infrastructure and configuration in code and have machines or instances provisioned or updated automatically.

Sure, with virtual machines and cloud computing, things have got a little easier thanks to re-usable machine images. You can set up a certain type of system once and re-use the image for any subsequent systems of the same type. This is, however, quite limiting: it's very time-consuming to later update that image with small changes or to cope with small variations between systems, and it's almost impossible to keep track of what changes have been made to which instances.

Configuration Management tools like Puppet and Chef manage system configuration centrally and can:

  • Be used to provision new machines automatically.
  • Roll out a configuration change across a number of servers.
  • Deal with small variations between systems or different types of systems (web, database, app, dns, mail, development etc).
  • Ensure all systems are in a consistent state.
  • Ensure consistency and repeatability.
  • Easily allow the use of source code control (version control) systems to keep a history of changes.
  • Easily allow the provisioning of development and staging environments which mimic production.

As time permits, I'll publish some follow-up posts which go into Puppet and Chef in more detail and look at how they can be used. I'll also be publishing a review of James Turnbull's new book, Pro Puppet, which is due to go to print at the end of the month.