Error 400 on SERVER: Exported resource Sshkey[foo] cannot override local resource on node bar.example.com

I'm sure we've all seen this message from time to time when using puppet with exported resources:

Error 400 on SERVER: Exported resource Sshkey[foo] cannot override local resource on node bar.example.com

It's actually pretty easy to fix. Simply delete the exported resource for node foo.

Assuming you are using MySQL for your DB, something like this will do the trick:

mysql -e "delete from resources where restype like 'sshkey' and exported=1 and host_id = (select id from hosts where name 'foo')" puppet

Updating Dell firmware when update_firmware fails

We sometimes see problems updating our Dell machines to the latest firmware, ie. update_firmware -y fails:

Running updates...
-	Installing dell_dup_componentid_00159 - 1.4.7Installation failed for
package: dell_dup_componentid_00159 - 1.4.7
aborting update...

The error message from the low-level command was:

Could not parse output, bad xml for package: dell_dup_componentid_00159

Dell have been unable to tell me why this is, or provide a fix or workaround.

Here's what I did to get the firmware installed:

Identify the component for which the update is being installed. In this case that is dell_dup_componentid_00159.

Find the update for that component under /usr/share/firmware/dell:

# find /usr/share/firmware/dell -name "dell_dup_componentid_00159*"
/usr/share/firmware/dell/dup/system_ven_0x1028_dev_0x028c/dell_dup_componentid_00159_version_1.4.7
# ls /usr/share/firmware/dell/dup/system_ven_0x1028_dev_0x028c/dell_dup_componentid_00159_version_1.4.7/*.hdr
/usr/share/firmware/dell/dup/system_ven_0x1028_dev_0x028c/dell_dup_componentid_00159_version_1.4.7/PER410-010407.hdr

Install the update:

# cd /usr/share/firmware/dell/dup/system_ven_0x1028_dev_0x028c/dell_dup_componentid_00159_version_1.4.7
# dellBiosUpdate -f PER410-010407.hdr -u

As this is a system BIOS update, it is necessary to reboot for the update to be finalised.

# reboot

Note: the firmware packages are installed from the Dell repositories using yum -y update $(bootstrap_firmware). However, they do not seem to be up-to-date as the latest BIOS update for the R410 is v1.4.8 which was released on Sept. 13, 2010. But that's a different issue!

NexentaStor (OpenSolaris) – changing a user’s uid and getting CIFS to work

I have a NexentaStor-based NAS device on my home network. It has 10 x 500GB SATA Drives in a raidz2 configuration giving me approx 4TB of usable storage.

I don't have many users at home (basically, just me!) so I don't bother with any central authentication mechanism – I simply make sure that use I the same login (robin) across all servers, and on unix/linux servers I make sure the UID is the same (10000).

When a user is added to NexentaStor, it is allocated the next available UID beginning at 1001 (1000 is allocated to the admin user). For the robin login, I need this to be 10000.

Not a problem – simply drop to a bash shell on the NAS and use vipw to change the UID of the login.

The bit I *always* forget is that the CIFS server uses a different passwd file (/var/smb/smbpasswd) and it is necessary to change the UID in that file too.

Deploying to many servers

We currently deploy our app code to around 50 nodes using capistrano. We use the "copy" deployment method, ie. the code is checked out of svn onto the local deployment node, rolled into a tarball, then copied out to each target node where it is unrolled into a release dir before the final symlink is put in place.

As you might imagine, copying to 50 nodes generates quite a bit of traffic, and it takes ~5 mins to do a full deploy.

I was reading this interesting link today; one bullet in particular jumped out at me:

  • "… the few hundred MB binary gets rapidly pushed bia [sic] bit torrent."

Now that's an interesting idea – I wonder if I can knock up something in capistrano that deploys using bittorrent?

Fedora 13 as a domu guest with xen 3.4.2 on CentOS 5.5

I wanted to install a Fedora 13 machine as a paravirtual domu guest on our CentOS 5.5, xen 3.4.2 host. I also wanted to provision it using koan/cobbler. I ran into a few problems along the way, but I got there in the end!

Firstly, createrepo in CentOS 5.5 does not work with either the Fedora 13 distribution image, or the Fedora 13 Updates repo. I got round that by rebuilding createrepo from Fedora 13 on CentOS 5.5 which was in itself a major undertaking as it also involved rebuilding several other packages which are dependencies of createrepo (yum, yum-utils, python-urlgrabber, python-pycurl, curl, libssh2).

Once I'd got the F13 distro imported into cobbler, and mirrored the F13 Updates repo, I created the cobbler system entry and ran koan to kick off the PXE install. All seemed to work fine and the installation completed. However, when I tried to start the domu it wouldn't start. In /var/log/xen/xend.log I found this was the last line:

[2010-09-01 16:43:05 9717] DEBUG (XendBootloader:113) Launching bootloader as ['/usr/bin/pygrub', '--output=/var/run/xend/boot/xenbl.11641', '/dev/mapper/vg_guests-mock--disk0']

On a hunch, I modified the kickstart to explicitly format the /boot partition as ext3 rather than the default ext4 and re-tried the install process. Bingo! pygrub in xen 3.4.2 on CentOS 5.5 doesn't work with ext4.

Updating Dell OMSA on CentOS

Dell distributes its OMSA software in RPM packages and even has a yum repo available so you'd think that updating to the next version would be as simple as yum update, right? Wrong!

You have to remove the old version first, and then install the new version. Oh, and you also need to stop the Dell services, restart ipmi, then restart the Dell services.

Something like this:

yum -y remove srvadmin-* \
  && rm -Rf /opt/dell \
  && yum -y install srvadmin-all dell_ft_install \
  && srvadmin-services.sh stop \
  && service ipmi restart \
  && srvadmin-services.sh start

Dell DRAC BBU auto-learn tests kill disk performance

We've had a bunch of new servers in place for around 3 months now. They seem to be working well and are performing just fine.

Then, out of the blue, our monitoring started throwing alerts on seemingly random servers. Our queues were building up – basically, database performance had dropped dramatically and our processing scripts couldn't stuff data into the DBs fast enough.

What could be causing it?

I took a quick glance at our monitoring and various statistics confirmed the problems. You can clearly see in the following graphs that CPU Wait and Load Average increased at around 14:30, and at the same time MySQL throughput and command counters dropped dramatically:

CPU Usage

Load average

MySQL Network Traffic

MySQL Command Counters

So, I've found the symptoms; what could be the cause?

Digging around in the system logs, I found the following lines (emphasis is mine):

Apr 26 12:55:31 b008 Server Administrator: Storage Service EventID: 2176  The controller battery Learn cycle has started.:  Battery 0 Controller 0
Apr 26 12:56:36 b008 Server Administrator: Storage Service EventID: 2266  Controller log file entry: Battery is discharging:  Battery 0 Controller 0
Apr 26 12:56:36 b008 Server Administrator: Storage Service EventID: 2248  The controller battery is executing a Learn cycle.:  Battery 0 Controller 0
Apr 26 14:21:52 b008 Server Administrator: Storage Service EventID: 2278  The controller battery charge level is below a normal threshold.:  Battery 0 Controller 0
Apr 26 14:21:53 b008 Server Administrator: Storage Service EventID: 2188  The controller write policy has been changed to Write Through.:  Battery 0 Controller 0
Apr 26 14:21:53 b008 Server Administrator: Storage Service EventID: 2199  The virtual disk cache policy has changed.:  Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC 6/i Adapter)
Apr 26 14:55:06 b008 Server Administrator: Storage Service EventID: 2177  The controller battery Learn cycle has completed.:  Battery 0 Controller 0
Apr 26 14:55:21 b008 Server Administrator: Storage Service EventID: 2247  The controller battery is charging.:  Battery 0 Controller 0
Apr 26 14:55:21 b008 Server Administrator: Storage Service EventID: 2278  The controller battery charge level is below a normal threshold.:  Battery 0 Controller 0
Apr 26 15:25:41 b008 Server Administrator: Storage Service EventID: 2279  The controller battery charge level is operating within normal limits:  Battery 0 Controller 0
Apr 26 15:25:41 b008 Server Administrator: Storage Service EventID: 2189  The controller write policy has been changed to Write Back.:  Battery 0 Controller 0
Apr 26 15:25:42 b008 Server Administrator: Storage Service EventID: 2199  The virtual disk cache policy has changed.:  Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC 6/i Adapter)
Apr 26 18:50:26 b008 Server Administrator: Storage Service EventID: 2358  The battery charge cycle is complete.:  Battery 0 Controller 0

Bingo!

The RAID controller is running a battery learn cycle. When battery charge drops below a certain level, cache Write Back is disabled, and that kills disk performance. It seems this test is enabled by default and is configured to run every 90 days. Of course, Sod's law dictates that it has to trigger in our busy period!

The Dell tools are not able to turn off this feature, but the LSI MegaCli tool can (Dell PERC 6/i controllers are re-badged LSI cards). I've run the following script on all servers (thanks to burr86 on ##infra-talk @ Freenode):

#!/bin/sh
TMPFILE=$(mktemp -p /tmp bbu.relearn.off.XXXXXXXXXX) || exit 1
echo "autoLearnMode=1" > $TMPFILE # or =0 to enable the bbu relearn
MegaCli -AdpBbuCmd -SetBbuProperties -f$TMPFILE -a0
rm -f $TMPFILE

I wrote a puppet class to run this script on all nodes:

class megacli::check {
    exec{'perc_bbu_autolearn.sh':
        require => Class['megacli::install'],
        cwd    => '/tmp',
        path   => '/usr/local/bin:/bin',
        unless => 'MegaCli -AdpBbuCmd -GetBbuProperties -a0 | grep -q "Auto-Learn Mode: Disabled"'
    }
}
All that remains to be done is to put a cron job in place that runs the battery learning cycle every three months at an off-peak time. The command to kick that off is:
omconfig storage battery action=startlearn controller=0 battery=0

MAC Addresses of embedded NICs on Dell servers through DRAC

I use cobbler to provision our new Dell servers, which is great but it needs the MAC addresses of the servers to identify each machine.

Previously, I have been doing this manually:

  1. log in to the DRAC web interface
  2. launch the java console
  3. rebooting the server
  4. go into the BIOS
  5. navigate to Embedded Devices
  6. manually record the MAC addresses

This takes quite a while, and is prone to error.

I recently had another 42 servers to deploy to I looked for a way to automate this process. I found one!

I got the inkling that this should be possible because I noticed that the System | Properties | System Details page in the DRAC web interface lists the MAC addresses in the Main System Chassis section:

Embedded NIC MAC Addresses
    NIC1  Ethernet  a4:ba:db:11:38:2d
          iSCSI     00:00:00:00:00:00
    NIC2  Ethernet  a4:ba:db:11:38:2e
          iSCSI     00:00:00:00:00:00
 
So, I checked in the DRAC6 documentation for a suitable command, and found it: racadm racdump

This dumps a whole load of information about the DRAC and the attached system. However, pass the output through a simple grep and … bingo!

$ racadm -r $DRAC -u root -p calvin racdump | egrep '^MAC Address|^NIC. Ethernet'
MAC Address             = a4:ba:db:11:38:2f
NIC1 Ethernet           = a4:ba:db:11:38:2d
NIC2 Ethernet           = a4:ba:db:11:38:2e
NIC3 Ethernet           = N/A
NIC4 Ethernet           = N/A