Syndication of blogs and tweets by users of the Freenode ##infra-talk IRC channel

31 October 2009

Using Cucumber as a scripting language

Yesterday at the excellent Devopsdays in Gent, Belgium, I proposed an open session to flesh out an idea I had a few weeks ago - to use Cucumber as a general scripting language.

Cucumber's Given/When/Then steps are well suited to procedural tasks like shell scripts, and you would be writing your "scripts" in straightforward language that non-technical users such as managers and clients could understand. Also, because a scenario without a Then to close it feels unbalanced, you'd fairly quickly get into the mindset of testing the actions of your "scripts".

With little more than the hypothesis above, a group of us found a room and started modeling some scenarios. Our focus was on file manipulation, as it was low-hanging fruit and something most scripts do.

We came up with this:

Feature: Copy files around
  
  Scenario: A single file
    Given I am in "/tmp"
    And the file "spoons" exists
    When I copy the file "spoons" to "forks"
    Then the file "forks" should exist
    And the file "forks" should be readable

  Scenario: Multiple files
    Given I am in "/tmp"
    Given the following table of tasty fruit:
      | filename                |
      | apples                  |
      | oranges                 |
      | bananas                 |
      | ananas                  |
      | file with lots o spaces |
      | spoons of : doom        |
    When I create the directory "/tmp/some_other_dir"
    When I copy the tasty fruit in the table to "/tmp/some_other_dir"
    Then the tasty fruit in the table should exist in "/tmp/some_other_dir"

The first scenario is fairly self explanatory, but the second one is where the interesting stuff starts happening.

In the implementation of the "following table" step, we create an instance variable that persists the list of files between steps. This way, we can reference the "tasty fruit" throughout our other steps:

Given /^the following table of (.+):$/ do |name, table|
  # Create the hash only once, so tables defined by earlier steps are kept.
  @tables ||= {}
  @tables[name] = table.hashes
end

We use the (.+) regex to capture the name of the table so we can poke at it later on. This design lets you easily use multiple tables throughout your steps that won't conflict with one another:

  Scenario: Multiple files from multiple tables
    Given the following table of tasty fruit:
      | filename |
      | apples   | 
      | oranges  |
    And the following table of baggy baggage:
      | filename |
      | suitcase | 
      | backpack |
    When I copy the baggy baggage in the table to "/tmp/some_other_dir"
    And I copy the tasty fruit in the table to "/tmp/some_other_dir"
    Then the tasty fruit in the table should exist in "/tmp/some_other_dir"
    And the baggy baggage in the table should exist in "/tmp/some_other_dir"

Other steps can reference data in the table by accepting a name and looking it up in the hash of tables:

Then /^the (.+) in the table should exist in "([^\"]*)"$/ do |name, destination|
  @tables[name].each do |file|
    File.exist?(File.join(destination, file["filename"])).should be_true
  end
end
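The matching copy step isn't shown above. As a rough sketch only, assuming the same structure as @tables (a hash of table name to an array of rows with a "filename" key), the work behind it could be plain Ruby along these lines. The helper name copy_table_files and the throwaway directories are made up for illustration and are not taken from the session's code:

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: copy every file listed in the named table from
# source_dir into destination. `tables` mirrors the @tables hash, e.g.
# { "tasty fruit" => [{ "filename" => "apples" }, ...] }
def copy_table_files(tables, name, source_dir, destination)
  # fetch raises KeyError if a step refers to a table that was never defined.
  tables.fetch(name).each do |row|
    FileUtils.cp(File.join(source_dir, row["filename"]), destination)
  end
end

# Try it out in temporary directories rather than the real /tmp paths.
Dir.mktmpdir do |source_dir|
  Dir.mktmpdir do |destination|
    tables = {
      "tasty fruit" => [
        { "filename" => "apples" },
        { "filename" => "file with lots o spaces" }
      ]
    }
    tables["tasty fruit"].each do |row|
      FileUtils.touch(File.join(source_dir, row["filename"]))
    end

    copy_table_files(tables, "tasty fruit", source_dir, destination)
    puts Dir.children(destination).sort.inspect
  end
end
```

Inside a real step definition the body would shrink to a single call with @tables, the captured table name and the destination.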

We also looked at handling permission problems:

  Scenario: Do things I'm not allowed to
    When I create the directory "/usr/bin/wtf"

Here the step will raise an Errno::EACCES exception, and as Cucumber uses a pretty formatter by default, the failed step will appear in red.

Finally we tried copying files with a glob. The initial implementation I banged out was very Unix focused (it used *, which is a very explicit globbing syntax), so we scrapped that idea and wrote our intentions in plain English:

  Scenario: Copy based on a pattern
    Given I am in "/tmp"
    When I create the directory "/tmp/pattern_dir"
    And I copy files beginning with the letters z,y,x to "/tmp/pattern_dir"
    Then they should exist there

The implementation is obvious, and the scenario is very understandable (and seemingly powerful) to someone with no knowledge of globbing.
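For the curious, here is one way the pattern step could work, as a sketch under assumptions rather than the code from the session. The helper names are hypothetical, and it filters directory entries by prefix instead of building a glob pattern:

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: names of regular files in `dir` that start with any
# of the comma-separated letters, e.g. "z,y,x".
def files_beginning_with(dir, letters)
  prefixes = letters.split(",").map(&:strip)
  Dir.children(dir).select do |name|
    File.file?(File.join(dir, name)) && name.start_with?(*prefixes)
  end.sort
end

# Copy the matching files and return their names, so that a later
# "Then they should exist there" step has something to check against.
def copy_files_beginning_with(dir, letters, destination)
  copied = files_beginning_with(dir, letters)
  copied.each { |name| FileUtils.cp(File.join(dir, name), destination) }
  copied
end

Dir.mktmpdir do |dir|
  Dir.mktmpdir do |pattern_dir|
    %w[zebra yak xylophone apple].each do |name|
      FileUtils.touch(File.join(dir, name))
    end
    copied = copy_files_beginning_with(dir, "z,y,x", pattern_dir)
    puts copied.inspect
  end
end
```

Returning the list of copied names is what couples this step to the next one, which is the tight coupling discussed next.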

People who have used Cucumber in web development will likely note that the above implementation is an example of tightly coupled steps, which is sometimes regarded as an anti-pattern. I'm of the opinion that this is a lot more painful in a web development context than in a procedural/scripting tool one.

From my recollection of Euruko earlier this year, when Aslak was asked whether he considers it an antipattern, he said it can be ok to use depending on the problem you're trying to solve, so I take that as tacit permission that it is ok in this context. :-)

I posted the results of the session to a Gist yesterday, and I have also published a repo with a bundler-ready install process, so people can hack on it more.

After the session I remembered that the feature file doesn't actually have to start with Feature, so it's possible to write standalone scenarios one after another.

When wrapping up, someone in the room pointed out that our implementation actually went one better than being readable by non-technical users - they could probably write the scripts themselves.

This is pretty powerful, and coupled with Cucumber's very cool step generation when running scenarios with undefined steps, makes it very easy to start prototyping a standard library of human readable scripting commands.

There was chatter on the Cucumber mailing list a few weeks ago about providing alternate interfaces for writing and executing Cucumber features, and it could be cool to see a drag-and-drop interface with a library of common tasks that calls out to Cucumber to execute them. You could even build something quite beautiful with HotCocoa.

Anyhow, if you think anything mentioned above is a cool idea, check out the code and start hacking!

25 October 2009

Starting a Non-Profit in the UK

Back in February, Jonty and I started the Hackspace Foundation to provide a legal structure for our efforts to create a hacker space in London. I’m going to try and document this process to make it a little less daunting for other organizations (and because people keep asking me). This is naturally very UK-specific.

Types of Organization

The Hackspace Foundation is a membership association — the company is controlled and run by its members. The easiest way to set one of these up is to start an unincorporated association, which is basically just a group of people who have registered a business name, with which you can create a bank account. This is the easiest way to go about creating an organizational structure (and One Click Orgs is making it even easier), but unincorporated associations can’t enter into contracts, and therefore can’t sign a lease or get a loan.

In order to be able to do that, you need an actual legal structure — a limited company. There are two types of private limited company: Limited by Share Capital and Limited by Guarantee. Limited by Share Capital is the structure most profit-making companies have. Companies Limited by Guarantee (CLG) are the other option, and that’s what we used with the Hackspace Foundation. Most non-profit companies, including charities, are CLGs. Instead of having shareholders, a CLG has members, all of whom are liable to contribute a nominal amount (£5 in our case) if the company goes under.

In addition to incorporating as a CLG, there are certain other actions that can be taken to make sure people won’t profit from a company. Doing this will not only reassure prospective members, but may also help with grants and taxes.

A quick word about objects

The objects of a company are the purpose under which it trades, and are recorded in the company’s Memorandum of Association. Trading outside of the objects of a company is illegal. Starting in October 2009, a company no longer needs to have objects; however, restricting the objects of a company to educational non-profit aims may help with getting grants and reducing tax.

Section 30 Companies

A Section 30 company refers to a company incorporated in accordance with section 30 of the Companies Act 1985. This rather odd section refers to the ability to omit the word “Limited” from the company name. However, it also adds restrictions on the objects you can use:

(3) Those requirements are that—
(a) the objects of the company are (or, in the case of a company about to be registered, are to be) the promotion of commerce, art, science, education, religion, charity or any profession, and anything incidental or conducive to any of those objects; and
(b) the company’s memorandum or articles—
(i) require its profits (if any) or other income to be applied in promoting its objects,
(ii) prohibit the payment of dividends to its members, and
(iii) require all the assets which would otherwise be available to its members generally to be transferred on its winding up either to another body with objects similar to its own or to another body the objects of which are the promotion of charity and anything incidental or conducive thereto (whether or not the body is a member of the company).

Section 30 companies are also exempt from sending details of their members to Companies House, and so are ideal for membership associations. The Hackspace Foundation is incorporated as a Section 30 company.

Community Interest Companies

Community Interest Companies (CIC) are a relatively new innovation which takes the Section 30 idea a bit further. Converting a CLG into a CIC is a one-way process which adds a statutory “asset lock” to the company’s assets. The company can only transfer its assets to another body for less than their market value if that body is also a CIC. This differs from Section 30 because a Section 30 company can convert itself back to a standard CLG (although this still requires a vote of all the members).

We haven’t gone down the CIC route with the Hackspace Foundation because we’re not ready to make that much commitment to our community-only business model just yet. We were also concerned about finding a relevant CIC to donate any remaining proceeds to if the company was wound down.

Charities

For completeness, I’ll just mention a few things about charities. Obviously being a charity is a bonus because donations are tax-deductible. However, charities are required to have a public benefit, and we don’t believe that hacker spaces necessarily pass that test. (The rules are quite complex.) Additionally, charities are required to submit more complex audited accounts, which are more costly.

Hopefully this is helpful to someone. We registered the Hackspace Foundation with UKPLC — they are cheap and very helpful, so I would definitely recommend them.

25 October 2009

Help us Free London’s Data

Yesterday morning I went to Help us Free London’s Data on the top floor of City Hall. Organized by the GLA, it was intended to get feedback from the development community about their upcoming Datastore project, which is slated for launch in January 2010.

The people from the GLA admitted that they don’t produce that much interesting data, and have limited influence over the authorities who do (like TfL and the borough councils), so their plan is to set a precedent. I don’t agree that all the GLA’s data is boring — I think that transparency and accountability for our elected representatives is important, even if it isn’t as alluring to everyone as the elusive transport data.

Discussion was mainly about the mundane details of how to go about publishing the data. Thankfully pragmatism prevailed and the anticipated hour-long semantic web ontology discussion/flamewar didn’t materialize.

I was impressed by the amount of grassroots support open data appears to have within the GLA — there were representatives of several departments in attendance — and they definitely seem to be taking the right route by publishing data when they can. I’m really interested to see what they come up with in January.

My biggest annoyance was that they started at 10am on a Saturday. Hackers don’t do mornings, guys.

23 October 2009

Letter to my MP about filesharing disconnection

This is the letter I’ve sent to my MP, Emily Thornberry, about the government’s current plans to disconnect file-sharers. You should write to your MP too. It’s important.

I’ve put it up here because sometimes it’s nice to have something to base your letter on when contacting your MP. That said, please don’t copy and paste this: MPs want to hear that you’re a real person and not some mindless sheep. So write your own damn letter. This is mine.

Dear Mrs. Thornberry,

I’m writing to you to express my concern about the new Digital Britain proposals to disconnect internet users who are accused of violating copyright.

These plans amount to taking disproportionate action against internet users for a crime they have not been found guilty of committing. Internet access is a vital utility, and disconnecting users due to alleged copyright violation would also cut off access to many other services, contrary to the government’s plans for universal access to broadband.

The recording industry is backing these heavy-handed tactics in preference to making their licensing rates more reasonable for small music streaming and download businesses. I believe the only viable way to effectively reduce music copyright infringement is to make music available on demand as a service. Services such as Spotify are doing this, but they are struggling to negotiate sensible rates from the music industry.

I’d appreciate it if you indicated your support for Tom Watson’s Early Day Motion 1997 which concerns these file-sharing plans.

Yours sincerely,

Russell Garrett

17 October 2009

Hadoop Talk – SkillsMatter 2009

After an embarrassing tale of misunderstanding, wrong locations and blind luck I recently ended up at the Introduction to data processing with Hadoop and Pig talk over at SkillsMatter - and it was excellent.

For those who don't know about Hadoop, it's an open source Java framework for data-intensive distributed applications. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers. I was aware of the basics, but even in an hour I learned enough to know where to look for more details. Pig, on the other hand, is (to me) like SQL but for Hadoop. It's a lot easier to use than writing your own Java apps, and simpler (and actually possible) for non-developers to read than the reams of classes required for custom jobs.

The speaker was excellent, the presentation was well timed, fluid, concise, paced just the way I like it and other than the question session the evening was very enjoyable. You can find the Hadoop slides online.

17 October 2009

JRuby Cookbook – Short Review

First, a disclaimer: I'm not a heavy Ruby or Java guy. Most of my coding for the last couple of years has been Perl and shell, because I write little things that I need right now and those two languages excel at that (CPAN is still THE decision clincher).

I recently became involved in a side project that is written in Ruby and Java, though, and by a happy coincidence of timing a friend returned my previously unread copy of the JRuby Cookbook. The book isn't an introduction to either Java or Ruby (there are already excellent online and dead tree resources for that), but it shows where the two can meet and how to get started at those points. It's not really a book to read cover to cover, but that is a good approach for a cookbook.

If you're curious as to how dynamic languages on static language VMs can complement each other this is a good book to flick through. Score - 6/10 - it's not the book for me right now but it does show a lot of entry points I'll probably come back to later.

15 October 2009

MySQL InnoDB and table renaming don’t play well…

At Days of Wonder we are huge fans of MySQL (and, for about a year now, of the various Open Query, Percona, Google and other community patches), to the point that we use MySQL for just about everything in production.

But since we moved to 5.0 three years ago, our production databases, which hold our website and online game systems, have had one persistent issue: the mysqld process uses more and more RAM, up to the point where the kernel OOM killer decides to kill the process.

You’d certainly think we are complete morons because we didn’t do anything in the last 3 years to fix the issue :-)

Unfortunately, I could never replicate the issue in the lab, mainly because it is difficult to reproduce the exact load the production server sees (largely because of the online game activity).

During those 3 years, I tried everything I could, from other allocators to valgrind, debug builds and so on, without any success.

What is nice is that we moved to an OurDelta build about a year ago, in which InnoDB is able to print more memory statistics than the default MySQL version.

For instance, it shows:

Internal hash tables (constant factor + variable factor)
    Adaptive hash index 1455381240      (118999688 + 1336381552)
    Page hash           7438328
    Dictionary cache    281544240       (89251896 + 192292344)
    File system         254712  (82672 + 172040)
    Lock system         18597112        (18594536 + 2576)
    Recovery system     0       (0 + 0)
    Threads             408056  (406936 + 1120)
    innodb_io_pattern   0       (0 + 0)

Several months ago, I analyzed this output to see which figures were growing, and found that the variable part of the Dictionary cache was increasing (slowly but steadily).

Surely fine MySQL experts would have been able to tell me exactly what, when and where the problem was, but since I’m not familiar with the code base, I looked up what this number was and where it was increased (all in dict0dict.c) and added a log line each time it was increased.

I then ran this version on a slave server for quite a long time (just to check that it wouldn’t crash in production). But this server didn’t print anything interesting, because it doesn’t see the exact same load as the production masters.

A couple of months after that, I moved this code to one of the masters and bingo! I found the operation and the tables exhibiting an increase:

mysqld[8131]: InnoDB: dict_table_rename_in_cache production/rank_tmp2 193330680 + 8112
mysqld[8131]: InnoDB: dict_table_rename_in_cache production/rank 193338792 + 8112

As soon as I saw the operation and the table (i.e. rank), I knew what the culprit was. We have a daemon that computes the player ranks for our online games every 10 seconds.

To do this, we’re using the following pattern:

-- compute the ranks
SELECT NULL, playerID
FROM game_score AS g
ORDER BY g.rankscore DESC
INTO OUTFILE "/tmp/rank_tmp.tmp";

-- load back the scores
LOAD DATA INFILE "/tmp/rank_tmp.tmp" INTO TABLE rank_tmp;

-- swap tables so that clients see new ranks atomically
RENAME TABLE rank TO rank_tmp2, rank_tmp TO rank, rank_tmp2 TO rank_tmp;

-- truncate the old ranks for a new pass
TRUNCATE TABLE rank_tmp;

-- go back to the select above

You might ask why I’m using such a convoluted system, especially the SELECT INTO OUTFILE and the LOAD DATA. It’s simply because INSERT … SELECT with InnoDB and the binlog enabled can cause transaction aborts (which we were getting tons of).

Back to the original issue: apparently it lies in the RENAME part of the daemon.

Looking at the dict_table_rename_in_cache function in dict0dict.c, we see:

ibool
dict_table_rename_in_cache(...)
...
  old_name = mem_heap_strdup(table->heap, table->name);
  table->name = mem_heap_strdup(table->heap, new_name);
...
}

Looking into the mem_heap code, I discovered that each table has an associated heap in which InnoDB allocates various things. This heap can only grow (in blocks of 8112 bytes, it seems), since the allocator is not a real one. This is done for performance reasons.

So each time we rename a table, the old name (why? since it is already allocated) is duplicated, along with the new name. Each time.
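To get a feel for the numbers, here is a toy model in Ruby (not InnoDB code). It assumes that only the two name strings plus a trailing NUL are copied on each rename, that names carry the database prefix as in the log lines above, and that all three tables share one heap. It ignores alignment and allocator bookkeeping. GrowOnlyHeap and rename_in_cache are illustrative names only:

```ruby
# Toy model of a grow-only heap: memory is reserved in fixed-size blocks
# and nothing is released until the whole heap is dropped.
class GrowOnlyHeap
  BLOCK_SIZE = 8112 # block size seen in the log lines above

  attr_reader :reserved_bytes, :used_bytes

  def initialize
    @reserved_bytes = 0
    @used_bytes = 0
  end

  # Copy a string into the heap, plus a trailing NUL as a C strdup would.
  def strdup(string)
    needed = string.bytesize + 1
    # Reserve another block whenever the reserved space would overflow.
    @reserved_bytes += BLOCK_SIZE while @used_bytes + needed > @reserved_bytes
    @used_bytes += needed
  end
end

# Model one rename: both the old and the new name are copied into the heap.
def rename_in_cache(heap, old_name, new_name)
  heap.strdup(old_name)
  heap.strdup(new_name)
end

heap = GrowOnlyHeap.new
passes_per_day = 24 * 60 * 60 / 10 # the daemon runs every 10 seconds

passes_per_day.times do
  rename_in_cache(heap, "production/rank", "production/rank_tmp2")
  rename_in_cache(heap, "production/rank_tmp", "production/rank")
  rename_in_cache(heap, "production/rank_tmp2", "production/rank_tmp")
end

puts "used after one day:     #{heap.used_bytes} bytes"
puts "reserved after one day: #{heap.reserved_bytes} bytes"
```

Under these assumptions each pass copies 114 bytes of names, which adds up to a bit under 1 MB a day. A new 8112-byte block is only needed every 70 passes or so, which would fit with growth that is slow but never stops.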

This heap is freed when the table is dropped, so it is possible to reclaim the used memory. That means this issue is not a memory leak per se.

By the way, I’ve filed this bug in the MySQL bug system.

One workaround, besides fixing the code itself, would be to drop the rank table instead of truncating it. The issue with dropping and creating InnoDB tables at a fast pace is that the dictionary cache itself will grow, because there is no way to purge old tables from it (except by running one of the Percona patches). So the more tables we create, the more memory we’ll use: back to square one, but worse.

So right now, I don’t really have any idea how to properly fix the issue. If anyone has an idea, please do not hesitate to comment on this blog post :-)

And please, don’t tell me to move to MyISAM…

13 October 2009

My Puppet Camp slides appearing on the slideshare homepage!

This morning I had the joy of seeing that my Puppet Camp 2009 slides had been selected by Slideshare to appear on their home page:

Wow. That really was a surprise. I guess those stock photos I used are the underlying reason for this.

Now that I’m talking about Puppet Camp again, I realize I forgot to link to some pictures taken during the event:

and
