Syndication of blogs and tweets by users of the Freenode ##infra-talk IRC channel

06 April 2013 ~ Comments Off

Learning to Type

One of my biggest struggles as a developer is probably something a lot of people, particularly software professionals, take for granted. I don’t know how to touch type. Growing up, I was never enrolled in any type of keyboarding class in grade school. I was also never more than a casual computer user until the end of my first year of college, which gave me plenty of time to form bad habits.

The worst part is that, even though my typing problems were always in the back of my mind, I never realized how serious they were until I began working full time at Atomic. While I can type marginally quickly (~30 WPM) using my practiced hunting and pecking, pair programming has revealed that I have a handicap compared to my fellow developers when it comes to putting out code quickly, and it’s embarrassing. I have also noticed that on particularly long days, especially when I am working by myself, I have begun to develop headaches from how often my eyes dart between my screen and keyboard.

While learning to type has been on my mental to-do list for a long time, I’ve realized of late that I need to push it to the top of my priorities. I’ve been doing online typing lessons at a leisurely pace for some time, but it has become apparent to me that this is not enough.

In order to force myself to learn, I’ve defined a strict regimen that I will stick to until I deem that I have become proficient at touch typing.

  1. Morning: Do a quick exercise first thing when I come into work: retype a brief email that is near the top of my inbox.
  2. Lunchtime: Do a typing lesson during lunchtime.
  3. Evening: Do a typing lesson before dinner. After dinner, retype an entire source code file that I worked on that day.

I’ve also decided that I’m going to learn how to type using a Dvorak keyboard. I have a couple of reasons for doing this:

  1. Dvorak is regarded as more efficient than QWERTY; since I already don’t know how to type properly, I might as well start learning on the best system possible.
  2. Dvorak forces me to break my habit of looking at my keyboard as I type, since a QWERTY keyboard offers no feedback on the position of Dvorak keys.

Some learning tools that I’ve found helpful include Juerd Waalboer’s Dvorak training site and GNU Typist. I will be using them for the typing lessons in my regimen.

With a job that involves using a keyboard all day long, not being able to touch type is just too great a handicap to ignore. It is my hope that a few weeks of firmly following the plan above will get me to the point of being able to touch type in my everyday work without worrying about slowing down my productivity. Once I get to that point, proficiency should come naturally from frequent use.
 

The post Learning to Type appeared first on Atomic Spin.

05 April 2013 ~ Comments Off

Why Mou Is My New Note-Taking App

Taking notes has been a part of my life since high school, in one form or another, but the tools I’ve used have varied quite a bit. Recently, a couple of my current project teammates and I have been trying out Mou for rapid note-taking during meetings and capturing decisions we make along the way. There are a few things that have made it a really good fit.

I even used it to write this blog post:

Blog post started in Mou Markdown editor for OS X.

Structure

My favorite note taking app has been Vim for a long time. I just whip open a new MacVim window and start typing away. Vim does nothing to constrain or inform the structure of my notes, and I’ve used a huge variety of undefined markup styles. The files end up littered with long dashed lines, indented text, asterisk bullets, and parenthesized phrases for tangential thoughts or statements. It’s not too difficult for me to understand, but sharing the notes usually takes some cleanup.

By having a target markup language, Markdown, I find I’m far more consistent with the structure of my notes. It’s easier to show multiple levels of headings, quotes, and emphasis, among many other things. And it takes less cleanup to get my notes into a sharable, printable form.

Immediate Feedback

Structure is good, but I could write Markdown in Vim. Mou, on the other hand, provides a split pane with the rendered HTML using a really good, simple default style. That immediate feedback makes it easy to consider the form and structure of my notes as I write, further reducing the need for later refactoring.

Nearly Wiki-ready

Sometimes we put notes back on our project wiki or post them back on Basecamp. By writing Markdown instead of my own inconsistent markup, this becomes much simpler.

It’s Quick and Simple

I haven’t tried Evernote yet. My preference is usually for simple tools that let me throw files into git with our other project assets, and Mou fits that model perfectly. All three of us on the team have put our notes into our git doc repository for this project, and I’ve really appreciated the improved readability of a well-structured document rendered in HTML.
 

The post Why Mou Is My New Note-Taking App appeared first on Atomic Spin.

05 April 2013 ~ Comments Off

Find files matching criteria, exclude NFS mounts

We have app servers with smallish local file systems and application data mounted over NFS.

Sometimes I want to find all files matching a particular set of criteria but don't want to traverse the NFS mounts.

Here's how to do it:

find / -group sophosav -print -o -fstype nfs -prune

Ordering is important, as is the explicit inclusion of -print. If you omit -print, find applies its implicit -print to the whole expression, so it will print the name(s) of the NFS mounts as well.

Change start location (/) and criteria (-group sophosav) to suit your own purposes.
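
For comparison, here is a sketch of the other common ordering, with the prune test first. The function name find_skip_nfs is made up for illustration, and it assumes a find that supports -fstype (GNU find does). Because the prune test comes first, anything on NFS is skipped before the criteria are even evaluated.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): search START for files
# matching the given find criteria, without descending into NFS mounts.
# Usage: find_skip_nfs START CRITERIA...   (at least one criterion is required)
find_skip_nfs() {
    start=$1
    shift
    # Prune anything on NFS first; otherwise apply the criteria and print.
    # The parentheses keep criteria that contain -o grouped together.
    find "$start" -fstype nfs -prune -o \( "$@" \) -print
}

# Example, equivalent in intent to the command above:
# find_skip_nfs / -group sophosav
```

The explicit -print still matters here, for the same reason as before: without it the pruned mount points themselves would be printed.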

04 April 2013 ~ Comments Off

A few common graphite problems and how they are already solved.

metrics often seem to lack details, such as units and metric types

looking at a metric name, it's often hard to know
  • the unit a metric is measured in (bits, queries per second, jiffies, etc)
  • the "type" (a rate, an ever increasing counter, gauge, etc)
  • the scale/prefix (absolute, relative, percentage, mega, milli, etc)
structured_metrics solves this by adding these tags to graphite metrics:
  • what
    what is being measured?: bytes, queries, timeouts, jobs, etc
  • target_type
    must be one of the existing clearly defined target_types (count, rate, counter, gauge)
    These match statsd metric types (i.e. rate is per second, count is per flushInterval)
In Graph-Explorer these tags are mandatory, so that it can show the unit along with the prefix (e.g. 'Gb/s') on the axis.
This will also allow you to request graphs in a different unit, and the dashboard will know how to convert (say, Mbps to GB/day).
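
As a worked example of the kind of conversion meant here: this is plain arithmetic, not Graph-Explorer's actual code. The function name is hypothetical, and it assumes Mbps means megabits per second and decimal SI prefixes (1 GB = 10^9 bytes).

```shell
#!/bin/sh
# Hypothetical helper: convert a rate in Mb/s (megabits per second) to GB/day.
mbps_to_gb_per_day() {
    # bits/s = Mb/s * 10^6; bytes/s = bits/s / 8;
    # bytes/day = bytes/s * 86400; GB/day = bytes/day / 10^9
    awk -v mbps="$1" 'BEGIN { printf "%.1f\n", mbps * 1e6 / 8 * 86400 / 1e9 }'
}

mbps_to_gb_per_day 1     # prints 10.8
mbps_to_gb_per_day 100   # prints 1080.0
```

The conversion is only possible because the unit and the target_type (a rate) are known up front, which is the point of making the tags mandatory.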

tree navigation/querying is cumbersome, metrics search is broken. How do I organize the tree anyway?

The tree is a simplistic model. There is simply too much dimensionality that can't be expressed in a flat tree. There's no way you can organize it so that it will satisfy all later needs. A tag space like structured_metrics makes it obsolete. With Graph-Explorer you can do (full-text) search on metric names, on any of their tags, and/or on added metadata. So in practice you can filter by things like server, service, or unit (e.g. anything expressed in bits/bytes per second, or anything denoting errors), all irrespective of the source of a metric or its "location in the tree".

no interactivity with graphs

timeserieswidget allows you to easily add interactive graphite graph objects to html pages. You get modern features like togglable/reorderable metrics, realtime switching between lines/stacked, information popups on hover, highlighting, smoothing, and (WIP) realtime zooming. It has a canvas (flot) and svg (rickshaw/d3) backend. So it basically provides a simpler api to use these libraries specifically with graphite.
There's a bunch of different graphite dashboards with different takes on graph composition/configuration and workflow, but the actual rendering of graphs usually comes down to plotting some graphite targets with a legend. timeserieswidget aims to be a drop-in plugin that brings all modern features so that different dashboards can benefit from a common, shared codebase, because static PNGs are a thing of the past.

events lack text annotations, they are simplistic and badly supported

Graphite is a great system for time series metrics. Not for events. metrics and events are very different things across the board. drawAsInFinite() is a bit of a hack.
  • anthracite is designed specifically to manage events.
    It brings extra features such as different submission scripts, outage annotations, various ways to see events and reports with uptime/MTTR/etc metrics.
  • timeserieswidget displays your events on graphs along with their metadata (which can be just some text or even html code).
    this is where client side rendering shines

cumbersome to compose graphs

There are basically two approaches:
  • interactive composing: with the graphite composer, you navigate through the tree and apply functions. This is painful; dashboards like descartes and graphiti can make it easier.
  • use a dashboard that uses predefined templates (gdash and others). They often impose a strict navigation path to reach pages which may or may not give you the information you need (usually less or way more).
With both approaches, you usually end up with an ever growing pile of graphs that you created and then keep for reference.
This becomes unwieldy but is useful for various use cases and needs.
However, neither approach is convenient for changing information needs.
Especially when troubleshooting, one day you might want to compare the rate of increase of open file handles on a set of specific servers to the traffic on given network switches, the next day it's something completely different.
With Graph-Explorer:
  • GE gives you a query interface on top of structured_metrics' tag space. This enables a bunch of things (see above)
  • you can yield arbitrary targets for each metric, to look at the same thing from a different angle (e.g. as a rate with `derivative()` or as a daily summary), and you can of course filter by angle
  • You can group metrics into graphs by arbitrary tags (e.g. you can see bytes used of all filesystems on a graph per server, or compare servers on a graph per filesystem). This feature gets a "wow, that's really cool" every time I show it
  • GE includes 'what' and 'target_type' in the group_by tags by default so basically, if things are in a different unit (B/s vs B vs b etc) it'll put them in separate graphs (controllable in query)
  • GE automatically generates the graph title and vertical title (always showing the 'what' and the unit), and shows all metrics' extra tags. This also gives you a lot of inspiration to modify or extend your query

limited options to request a specific time range

GE's query language supports freeform `from` and `to` clauses.

Referenced projects

  • anthracite:
    event/change logging/management with a bunch of ingestion scripts and outage reports
  • timeserieswidget:
    jquery plugin to easily get highly interactive graphite graphs onto html pages (dashboards)
  • structured_metrics:
    python library to convert graphite metrics tree into a tag space with clearly defined units and target types, and arbitrary metadata.
  • graph-explorer:
    dashboard that provides a query language so you can easily compose graphs on the fly to satisfy varying information needs.
All tools are designed for integration with other tools and each other. Timeserieswidget gets data from anthracite, graphite and elasticsearch. Graph-Explorer uses structured_metrics and timeserieswidget.

Future work

There's a whole lot going on in the monitoring space, but I'd like to highlight a few things I personally want to work more on:
  • I spoke with Michael Leinartas at Monitorama (and there's also a launchpad thread). We agreed that native tags in graphite are the way forward. This will address some of the pain points I'm already fixing with structured_metrics but in a more native way. I envision submitting metrics would move from:
    stats.serverdb123.mysql.queries.selects 895 1234567890
    
    to something more along these lines:
    host=serverdb123 service=mysql type=select what=queries target_type=rate 895 1234567890
    host=serverdb123 service=mysql type=select unit=Queries/s 895 1234567890
    h=serverdb123 s=mysql t=select queries r 895 1234567890
    
  • switch Anthracite backend to ElasticSearch for native integration with logstash data (and allow you to use kibana)
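
To illustrate why the tagged form is easier to work with, here is a minimal shell sketch that pulls one tag out of a line in the first proposed format above. The format is only a proposal in this post, not an existing graphite protocol, and tag_value is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the value of tag KEY from a line of the form
# "k1=v1 k2=v2 ... value timestamp". Prints nothing if the tag is absent.
tag_value() {
    # Put each space-separated field on its own line, then split on "=".
    printf '%s\n' "$1" | tr ' ' '\n' | awk -F= -v key="$2" '$1 == key { print $2 }'
}

line='host=serverdb123 service=mysql type=select what=queries target_type=rate 895 1234567890'
tag_value "$line" host          # prints serverdb123
tag_value "$line" target_type   # prints rate
```

With the dotted name, the same lookup depends on knowing which position in the tree holds the host, which is exactly the knowledge structured_metrics has to encode today.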

04 April 2013 ~ Comments Off

Getting Accustomed to Embedded Development

I have a bachelor’s degree with majors in both Computer Science and Computer Engineering, and have a pretty solid base of theory for programming at pretty much any level of computing. Most of my professional experience to date has been working on either desktop software or web applications. As such, I relished an opportunity that came up recently to switch gears and work with Atomic Embedded on the firmware for a project.

There has been quite a learning curve. This is the first time I’ve built a system of considerable size in C. Combine the intricacies of C with the challenges of static memory management (thou shalt not use malloc), a complex problem domain, and what are apparently normal issues with hardware quirks, and I’ve had to learn fast to keep up. Thankfully, that’s exactly why I signed up at AO.

One of my favorite things about embedded development has been the small set of functions available as part of the C language. When doing desktop software development, I constantly have some piece of core language API documentation open, figuring out the behavior of built-in data structures or what magic function call would massage an object into a different format. The standard API I work with in C seems to be much smaller, and all the documentation is available right from my system’s shell just by calling up a manpage.

As I’ve discussed with coworkers on multiple occasions, there is no magic in C. You get a very basic set of tools, and the magic is up to you. If you want a linked list, you are put in charge of making the links manually. If you want to implement some module such that it can be instantiated multiple times, but don’t want to use malloc internally, you can just pass in memory from the outside.

An unexpected joy I’ve found in embedded development is how easy it is to understand exactly how our mocking framework CMock works under the covers. If you understand how a linker works, you can easily imagine someone scanning through your headers for exported function declarations, and producing a linker object to stub out your functions. Many of the high-level languages I’ve used are not nearly as straightforward to understand.

All these things being said, there are definitely things I miss from the higher-level languages I’m used to. First and foremost, I miss functional programming-style collection manipulation. After working for a few years with map(), fold(), filter(), and company, it’s a bit hard to do everything with loops. The lack of first-class or even just anonymous functions has also been wearing on me. Sure, you can use function pointers, but it’s never quite the same. The final thing I’ve been missing is any support for type introspection or metaprogramming. If you want to do something like store data on a composite type that’s being serialized to disk, you need to get mighty creative.

Overall, I’ve quite enjoyed this switch in focus. It can be quite fun to figure out how to constrain a solution down to a limited set of resources. What have been your favorite new technologies or areas of expertise to learn?
 

The post Getting Accustomed to Embedded Development appeared first on Atomic Spin.

30 March 2013 ~ Comments Off

djbdns dnscache not resolving akamai-hosted domains

I was experiencing problems with dnscache not resolving certain domains. On inspection, it turned out to be akamai-hosted domains that were failing. A quick google turned up this thread from 2004 (!), and a little further digging turned up this patch.

I tweaked the patch a little to set QUERY_MAXLOOP to 1000 (original value: 100, value in patch: 160), and rebuilt.

All works just fine now:

 

[robin@dist ~]$ env DNSCACHEIP=192.168.1.90 dnsqr A www.cisco.com
1 www.cisco.com:
212 bytes, 1+5+0+0 records, response, noerror
query: 1 www.cisco.com
answer: www.cisco.com 0 CNAME www.cisco.com.akadns.net
answer: www.cisco.com.akadns.net 0 CNAME wwwds.cisco.com.edgekey.net
answer: wwwds.cisco.com.edgekey.net 0 CNAME wwwds.cisco.com.edgekey.net.globalredir.akadns.net
answer: wwwds.cisco.com.edgekey.net.globalredir.akadns.net 0 CNAME e144.dscb.akamaiedge.net
answer: e144.dscb.akamaiedge.net 12 A 2.19.144.170

25 March 2013 ~ Comments Off

Cisco Routers for the Desperate (2nd edition) – Short Review

Reviewing the second edition of Cisco Routers for the Desperate was quite hard for me as I have very little to add to the Cisco Routers for the Desperate 1st edition review I posted a few years ago. After reading through this update pretty much all those comments still stand. It's an excellent, useful, well written book and the author still has a -distinct- written tone.

I enjoyed the book; I must have, considering I bought the second edition! The material has been updated where needed, and it's still lacking a section on ACLs, so I'll stick to my score of 8/10 for people purchasing this book for the first time and look forward to another refresh in a couple of years' time. If you already own the first edition then your choice is a little harder - this book is still an excellent stepping on point for the cost, but don't expect much beyond a refresh of the same content.

Disclaimer: Part of my previous review is quoted in the marketing blurb at the front of the book. I did however pay for this book myself.

24 March 2013 ~ Comments Off

Hi Planet Devops and Infratalk

This blog just got added to planet devops and infra-talk, so for my new readers: you might know me as Dieterbe on irc, github or twitter. Since my move from Belgium to NYC (to do backend stuff at Vimeo) I've started writing more about devops-y topics (whereas I used to write more about general hacking and arch linux release engineering and (automated) installations). I'll mention some earlier posts you might be interested in: FWIW, I'm attending Monitorama next weekend in Boston.

24 March 2013 ~ Comments Off

Data failures, compartmentalisation challenges, monitoring pipelines

To recap, pipelines are a useful way of modelling monitoring systems.

Each compartment of the pipeline manipulates monitoring data before making it available to the next.

At a high level, this is how data flows between the compartments:

[Diagram: basic pipeline]

This design gives us a nice separation of concerns that enables scalability, fault tolerance, and clear interfaces.

The problem

What happens when there is no data available for the checks to query?

In this very concrete case, we can divide the problem into two distinct classes of failure:

  • Latency when accessing the metric storage layer, manifested as checks timing out.
  • Latency or failure when pushing metrics into the storage layer, manifested as checks being unable to retrieve fresh data.

There are two outcomes from this:

  • We need to provide clearer feedback to the people responding to alerts, to give them more insight into what's happening within the pipeline
  • We need to make the technical system more robust when dealing with either of the above cases

Alerting severity levels aren't granular or accurate in a modern monitoring context

There are entire classes of monitoring problems (like the one we're dealing with here) that map poorly into the existing levels. This is an artefact of an industry-wide cargo culting of the alerting levels from Nagios, and these levels may not make sense in a modern monitoring pipeline with distinctly compartmentalised stages.

For example, the Nagios plugin development guidelines state that UNKNOWN from a check can mean:

  • Invalid command line arguments were supplied to the plugin
  • Low-level failures internal to the plugin (such as unable to fork, or open a tcp socket) that prevent it from performing the specified operation.

"Low-level failures" is extremely broad, and it's important operationally to provide precise feedback to the people maintaining the monitoring system.

Adding an additional level (or levels) with contextual debugging information would help close this feedback loop.

In defence of the current practice, there are operational benefits to mapping problems into just 4 levels. For example, there are only ever 4 levels that an engineer needs to be aware of, as opposed to a system where there are 5 or 10 different levels that capture the nuance of a state, but engineers don't understand what that nuance actually is.

Compartmentalisation as the saviour and bane

The core idea driving the pipeline approach is compartmentalisation. We want to split out the different functions of monitoring into separate reliable compartments that have clearly defined interfaces.

The motivation for this approach comes from the performance limitations of traditional monitoring systems where all the functions essentially live on a single box that can only be scaled vertically. Eventually you will reach the vertical limit of hardware capacity.

This is bad.

[Diagram: a monolithic monitoring system]

Thus the pipeline approach:

Each stage of the pipeline is handled by a different compartment of monitoring infrastructure that analyses and manipulates the data before deciding whether to pass it onto the next compartment.

This sounds great, except that now we have to deal with the relationships between each compartment both in the normal mode of operation (fetching metrics, querying metrics, sending notifications, etc) and during failure scenarios (one or more compartments being down, incorrect or delayed information passed between compartments, etc).

The pipeline attempts to take this into account:

Ideally, failures and scalability bottlenecks are compartmentalised.

Where there are cascading failures that can't be contained, safeguards can be implemented in the surrounding compartments to dampen the effects.

For example, if the data storage infrastructure stops returning data, this causes the check infrastructure to return false negatives. Or false positives. Or false UNKNOWNs. Bad times.

We can contain the effects in the event processing infrastructure by detecting a mass failure and only sending out a small number of targeted notifications, rather than sending out alerts for each individual failing check.
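
A minimal sketch of that containment step, assuming the event processor can see one line per check in the form "check_name exit_status" (0 meaning OK). The function name, the input format, and the 50% default threshold are all assumptions made for illustration, not part of any particular tool.

```shell
#!/bin/sh
# Hypothetical sketch: read "check_name exit_status" lines on stdin and decide
# what to notify. If the share of failing checks reaches THRESHOLD percent,
# emit one summary line instead of one alert per failing check.
plan_notifications() {
    threshold=${1:-50}
    awk -v threshold="$threshold" '
        { total++ }
        $2 != 0 { failed[++n] = $1 }
        END {
            if (total > 0 && n * 100 / total >= threshold) {
                printf "MASS FAILURE: %d of %d checks failing\n", n, total
            } else {
                for (i = 1; i <= n; i++) printf "ALERT: %s\n", failed[i]
            }
        }'
}

# Two of three checks failing is above the 50% threshold:
printf 'disk 2\nload 3\nhttp 0\n' | plan_notifications 50
# prints: MASS FAILURE: 2 of 3 checks failing
```

A real implementation would also need to decide who receives the summary and how long to suppress the individual alerts, which is where the practical difficulty starts.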

While the design is in theory meant to allow this containment, the practicalities of doing this are not straightforward.

Some simple questions that need to be asked of each compartment:

  • How does the compartment deal with a response it hasn't seen before?
  • What is the adaptive capacity of each compartment? How robust is each compartment?
  • Does a failure in one compartment cascade into another? How far?

The initial answers won't be pretty, and the solutions won't be simple (ideal as that would be) or easily discovered.

Additionally, the robustness of each compartment in the pipeline will be different, so making each compartment fault tolerant is a hard slog with unique challenges in each compartment.

How are people solving this problem?

Netflix recently open sourced a project called Hystrix:

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

Specifically, Netflix talk about how they make this happen:

How does Hystrix accomplish this?

  • Wrap all calls to external systems (dependencies) in a HystrixCommand object (command pattern) which typically executes within a separate thread.
  • Time-out calls that take longer than defined thresholds. A default exists but for most dependencies is custom-set via properties to be just slightly higher than the measured 99.5th percentile performance for each dependency.
  • Maintain a small thread-pool (or semaphore) for each dependency and if it becomes full commands will be immediately rejected instead of queued up.
  • Measure success, failures (exceptions thrown by client), timeouts, and thread rejections.
  • Trip a circuit-breaker automatically or manually to stop all requests to that service for a period of time if error percentage passes a threshold.
  • Perform fallback logic when a request fails, is rejected, timed-out or short-circuited.
  • Monitor metrics and configuration change in near real-time.

Potential Solutions

We can apply many of the strategies from Hystrix to the monitoring pipeline:

  • Wrap all monitoring checks with a timeout that returns an UNKNOWN (assuming you stick with the existing severity levels)
  • Add some sort of signalling mechanism to the checks so they fail faster, e.g.
    • Stick a load balancer like HAProxy or Nginx in front of the data storage compartment
    • Cache the state of the data storage compartment, and have every monitoring check consult that cache before querying the compartment
  • Detect mass failures, and notify on-call and the monitoring system owners directly to shorten the MTTR. This is something Flapjack aims to do as part of the reboot.
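
The first item in the list above can be sketched in shell. This is only an illustration: run_check is a made-up name, it assumes the timeout utility from GNU coreutils (which exits with status 124 when the time limit is hit), and it maps a timeout to the Nagios UNKNOWN exit status of 3.

```shell
#!/bin/sh
# Hypothetical wrapper: run a check command with a time limit in seconds.
# Usage: run_check SECONDS COMMAND [ARGS...]
run_check() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        # timeout(1) killed the check: report UNKNOWN rather than hanging.
        echo "UNKNOWN: check timed out after ${limit}s"
        return 3
    fi
    # Otherwise pass the exit status of the check through unchanged.
    return "$status"
}

# Example (the check path is a placeholder):
# run_check 10 /path/to/check_something --warning 5 --critical 10
```

A check that finishes in time keeps its own exit status, so OK, WARNING and CRITICAL results are unaffected by the wrapper.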

I don't profess to have all (or even any) of the answers. This is new ground, and I'm very curious to hear how other people are solving this problem.

22 March 2013 ~ Comments Off

Sharing an SSH key, securely

Update: This isn't actually that much better than letting them access the private key, since nothing is stopping the user from running their own SSH agent, which can be run under strace. A better solution is in the works. Thanks to both Timo Juhani Lindfors and Bob Proulx for pointing this out.

At work, we have a shared SSH key between the different people manning the support queue. So far, this has just been a file in a directory where everybody could read it and people would sudo to the support user and then run SSH.

This has bugged me a fair bit, since there was nothing stopping a person from making a copy of the key onto their laptop, except policy.

Thanks to a tip, I got around to implementing this and figured writing up how to do it would be useful.

First, you need a directory readable by root only; I use /var/local/support-ssh here. The other bits you need are a small sudo snippet and a profile.d script.

My sudo snippet looks like:

Defaults!/usr/bin/ssh-add env_keep += "SSH_AUTH_SOCK"
%support ALL=(root)  NOPASSWD: /usr/bin/ssh-add /var/local/support-ssh/id_rsa

Everybody in group support can run ssh-add as root.

The profile.d script goes in /etc/profile.d/support.sh and looks like:

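# Only set this up for members of the support group.
if groups | grep -qE "(^| )support( |$)"; then
    export SSH_AUTH_ENV="$HOME/.ssh/agent-env"
    # Reuse the agent from an earlier session if its environment was saved.
    if [ -f "$SSH_AUTH_ENV" ]; then
        . "$SSH_AUTH_ENV"
    fi
    # ssh-add -l exits with status 2 when it cannot contact an agent;
    # in that case start a fresh agent and save its environment.
    ssh-add -l >/dev/null 2>&1
    if [ $? = 2 ]; then
        mkdir -p "$HOME/.ssh"
        rm -f "$SSH_AUTH_ENV"
        ssh-agent > "$SSH_AUTH_ENV"
        . "$SSH_AUTH_ENV"
    fi
    # Load the shared key as root, so the user never reads the key file.
    sudo ssh-add /var/local/support-ssh/id_rsa
fi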

The key is unavailable for the user in question because ssh-add is sgid and so runs with group ssh, and the process is only debuggable by root. Two things are still missing: there's no way to have the agent prompt before a key is used, and I would like the agent to die, or at least unload its keys, when the last session for a user is closed, but that doesn't seem trivial to do.