Linux Utilities for Diagnostics

I spend a fair amount of time troubleshooting issues on Linux and other Unix and Unix-like systems. While there are dozens of utilities I use for diagnosing and resolving issues, I consistently employ a small set of tools to do quick, high-level checks of system health. These checks are in the categories of disk utilization, memory and CPU utilization, and networking and connectivity. Triaging the health of the system in each of these categories allows me to quickly home in on where a problem may exist.

These utilities are usually available on all Linux systems. Most are available, or have analogues, on other Unix and Unix-like systems.

Disk Utilization

Generally, disk utilization is the first thing I check as a lack of free disk space spells certain doom for most user and kernel processes. I have seen more strange behavior from a lack of free disk space than anything else.

  • df reports filesystem disk space usage. This quickly allows me to see how much free space remains on each filesystem.

  • df -h displays, in human-readable format, the free space available on all mounted filesystems.

    
    $ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvda        47G   26G   19G  58% /
    devtmpfs        4.0G   12K  4.0G   1% /dev
    none            802M  184K  802M   1% /run
    none            5.0M     0  5.0M   0% /run/lock
    none            4.0G     0  4.0G   0% /run/shm
    

  • du estimates file space usage. This allows me to pinpoint which files and directories are taking up large amounts of disk space so I can investigate further.
  • du -sh * summarizes, in human-readable format, the space utilized by all files/folders in the current directory.
    
    $ du -sh *
    18M     bundle
    8.6M    cached-copy
    444M    log
    4.0K    pids
    4.0K    system
    
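When df shows a filesystem nearly full, the next step is usually to walk down the tree with du until the culprit appears. A common pattern for that (a sketch; GNU du and sort are assumed, and /var is just an example starting point):

```shell
# List the ten largest entries up to two levels below /var, biggest
# first. -x keeps du from crossing filesystem boundaries, and
# sort -rh sorts the human-readable sizes (K/M/G) correctly.
du -xh --max-depth=2 /var 2>/dev/null | sort -rh | head -n 10
```

Repeating this one level deeper each time homes in on the offending directory quickly.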

Memory, CPU Utilization, and I/O

Running out of available memory is also a major cause of performance problems and strange behavior on systems. CPU utilization and I/O rates can quickly provide clues as to whether performance problems are due to bottlenecks internal to a given system, or from external sources.

  • free reports the amount of free and used memory on the system. This provides immediate feedback on whether a system lacks free memory.
  • free -m displays, in megabytes, the amount of used and free physical and swap memory, and the amount of memory used for buffers/caching.
    
    $ free -m
                 total       used       free     shared    buffers     cached
    Mem:          8014       6339       1674          0        136       3887
    -/+ buffers/cache:       2314       5699
    Swap:          511        153        358
    
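For scripted health checks it helps to pull a single figure out of free rather than eyeball the whole table. A sketch, using awk to grab columns from the Mem: line (the total and used column positions are stable across free versions):

```shell
# Print total and used physical memory in megabytes.
free -m | awk '/^Mem:/ {print "total_mb=" $2, "used_mb=" $3}'
```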
  • vmstat reports on memory, swap, I/O, system activity, and CPU activity. This provides averages of various metrics since boot and can report continuously on current metrics. Analyzing the metrics can provide insight into what the system is doing at a given time (e.g. frequently swapping, waiting on I/O, etc.).
  • vmstat 1 will print out the metrics once every second until halted. (The -S M flag switches the memory columns from the default kilobytes to megabytes, as in the output below.)
    
    $ vmstat 1
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
    r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
    1  0     80     59     43   1791    0    0     0     0 1131 1057 15  2 83  0
    0  0     80     57     43   1791    0    0     8    96 1031  936 19  2 79  0
    0  0     80     60     43   1791    0    0    40    64 1666 1444  9  2 89  0
    0  0     80     60     43   1791    0    0     8     0  667  553  0  0 100  0
    1  0     80     57     43   1791   16    0    16   104  808  748 12  2 86  0
    0  0     80     59     43   1791    0    0    12  3028 1813 1723 44  5 50  0
    0  0     80     59     43   1791    0    0     0    56 1119 1066 17  1 81  0
    1  0     80     50     43   1791    0    0    68     0 1219 1024 25  4 71  0
    0  0     80     60     43   1791    0    0    52    68 1725 1435 12  1 86  0
    0  0     80     60     43   1791    0    0     8     0 2236 1699 35  5 60  0
    0  0     80     60     43   1791    0    0     0    68  163  209  0  0 99  0
    1  0     80     60     43   1791    0    0     0   140 1456 1379 22  3 74  0
    1  0     80     61     43   1791    0    0     0    56 1481 1242 24  4 72  0
    0  0     80     60     43   1791    0    0   356     0 1359  930 11  3 86  0
    0  0     80     60     43   1792    0    0   428     0 1619  992  2  1 97  0
    0  0     80     60     43   1792    0    0     8  2196  313  396  0  0 100  0
    0  0     80     60     43   1792    0    0     0     0  144  181  0  0 100  0
    
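The same columns can be sampled programmatically. This sketch takes a single one-second sample and reports the CPU idle and I/O-wait percentages (field positions assume the classic column layout shown above; newer vmstat builds append an st column after wa):

```shell
# vmstat's first output line is the average since boot, so take two
# samples one second apart and keep only the last line, then print
# the id (idle) and wa (I/O wait) CPU columns.
vmstat 1 2 | tail -n 1 | awk '{print "idle=" $15, "iowait=" $16}'
```

A consistently high wa alongside low us and sy is a strong hint that processes are blocked on disk rather than starved of CPU.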

Networking and Connectivity

Network connectivity and routing issues are usually apparent. However, trying to determine the exact nature of or reason for the issue can be a bit more difficult.

  • ping sends an ICMP echo request to a host. This provides immediate confirmation of whether or not a remote host is accessible.
  • ping 8.8.8.8 will ping Google’s DNS servers, which usually indicates with a high degree of certainty whether or not Internet connectivity is available.
    
    $ ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
    64 bytes from 8.8.8.8: icmp_req=1 ttl=54 time=0.681 ms
    64 bytes from 8.8.8.8: icmp_req=2 ttl=54 time=0.679 ms
    64 bytes from 8.8.8.8: icmp_req=3 ttl=54 time=0.703 ms
    64 bytes from 8.8.8.8: icmp_req=4 ttl=54 time=0.703 ms
    64 bytes from 8.8.8.8: icmp_req=5 ttl=54 time=0.677 ms
    
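Since ping exits non-zero when no replies come back, it drops straight into shell conditionals. A minimal reachability check (a sketch; 127.0.0.1 stands in for whichever host you care about):

```shell
# Send one echo request with a 2-second timeout and branch on the
# exit status rather than parsing the output.
if ping -c 1 -W 2 127.0.0.1 > /dev/null 2>&1; then
    echo "host reachable"
else
    echo "host unreachable"
fi
```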
  • mtr combines ping with traceroute and prints the route packet trace to a remote host, along with packet response times and loss percentages.

  • mtr -c 5 -r 8.8.8.8 will send five packets to Google’s DNS servers and report back the intermediate routers, with details about response times and packet loss along the way.

    
    $ mtr -c 5 -r  8.8.8.8
    HOST: localhost                 Loss%   Snt   Last   Avg  Best  Wrst StDev
    1.|-- router2-dal.linode.com     0.0%     5    0.9   0.7   0.6   0.9   0.2
    2.|-- ae2.car02.dllstx2.network  0.0%     5    0.3   6.3   0.3  30.5  13.5
    3.|-- po102.dsr01.dllstx2.netwo  0.0%     5    1.1   0.6   0.5   1.1   0.3
    4.|-- po21.dsr01.dllstx3.networ  0.0%     5    1.3   2.5   0.6   8.0   3.1
    5.|-- ae17.bbr02.eq01.dal03.net  0.0%     5    0.5   0.6   0.5   0.8   0.1
    6.|-- ae7.bbr01.eq01.dal03.netw  0.0%     5    0.5   0.6   0.5   0.7   0.1
    7.|-- 25.10.6132.ip4.static.sl-  0.0%     5    0.6   0.9   0.6   2.2   0.7
    8.|-- 216.239.50.89              0.0%     5    0.5   0.6   0.5   0.8   0.1
    9.|-- 64.233.174.69              0.0%     5    1.0   0.8   0.8   1.0   0.1
    10.|-- google-public-dns-a.googl  0.0%     5    0.8   0.8   0.7   0.8   0.0
    

  • netstat displays information about network connections, routing tables, and interfaces. While it is a very sophisticated tool which has many different possible applications, it provides an easy way to display a few important bits of data:
  • netstat -nlp displays information about processes that are currently listening on a socket.
    
    $ sudo netstat -nlp
    Active Internet connections (only servers)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
    tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN      2858/mysqld
    tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      2665/sshd
    tcp        0      0 0.0.0.0:25              0.0.0.0:*               LISTEN      3133/master
    tcp6       0      0 :::8080                 :::*                    LISTEN      3160/apache2
    tcp6       0      0 :::22                   :::*                    LISTEN      2665/sshd
    tcp6       0      0 :::25                   :::*                    LISTEN      3133/master
    tcp6       0      0 :::443                  :::*                    LISTEN      3160/apache2
    udp        0      0 0.0.0.0:68              0.0.0.0:*                           2633/dhclient3
    
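On newer distributions the net-tools package that provides netstat is often absent by default; the ss utility from iproute2 reports the same listening-socket details:

```shell
# Show listening TCP sockets with numeric addresses and, when run as
# root, the owning process for each (the ss analogue of netstat -nlp).
ss -tlnp
```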
  • netstat -rn displays the current routing table.
    
    $ netstat -rn
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
    0.0.0.0         173.255.206.1   0.0.0.0         UG        0 0          0 eth0
    173.255.206.0   0.0.0.0         255.255.255.0   U         0 0          0 eth0
    
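Where netstat is unavailable, the ip utility from iproute2 prints the same routing information:

```shell
# Display the kernel IP routing table, equivalent to netstat -rn.
ip route show
```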

Conclusion

The examples above show some of the most common ways these utilities can be used to perform diagnostics on systems based on disk utilization, memory and CPU utilization, and network activity and connectivity. Some of these utilities (particularly netstat) are quite powerful, and could be used to display or diagnose much more than shown in the examples above. Past troubleshooting experience, and the specific histories of given systems, guide the particular ways that I deploy these tools to assist in the investigation and resolution of system issues.

The post Linux Utilities for Diagnostics appeared first on Atomic Spin.

Graphing on the CLI

I’ve recently been thinking about ways to do graphs on the CLI. We’ve written a new Puppet Agent for MCollective that can gather all sorts of interesting data from your server estate and I’d really like to be able to show this data on the CLI. This post isn’t really about MCollective, though; the ideas apply to any data.

I already have sparklines in MCollective, here’s the distribution of ping times:

This shows you that most of the nodes responded quickly with a bit of a tail at the end being my machines in the US.

Sparklines are quite nice for a quick overview so I looked at adding some more of this to the UI and came up with this:

Which is quite nice – these are the nodes in my infrastructure stuck into buckets, and the node counts for each bucket are shown. We can immediately tell something is not quite right – the config retrieval time shows a bunch of slow machines, and the slowness does not correspond to resource counts etc. On investigation I found these are my dev machines – KVM nodes hosted on HP Micro Servers – so that’s to be expected.
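The bucketing behind that view is simple enough to sketch in awk. This toy version drops a stream of numbers into fixed-width buckets of 10 and prints a count per bucket (the real MCollective code is more involved; this just shows the idea):

```shell
# Bucket sample values into ranges of width 10 and count each bucket.
printf '%s\n' 3 7 12 15 18 41 44 |
  awk '{ b = int($1 / 10) * 10; count[b]++ }
       END { for (b in count) printf "%d-%d: %d\n", b, b + 9, count[b] }' |
  sort -n
```

For the sample values above this prints three buckets: 0-9 with 2 values, 10-19 with 3, and 40-49 with 2.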

I am not particularly happy with these graphs though so am still exploring other options, one other option is GNU Plot.

GNU Plot can target its graphs for different terminals like PNG and also line printers – since the Unix terminal is essentially a line printer we can use this.

Here are 2 graphs of config retrieval time produced by MCollective using the same data source that produced the spark line above – though obviously from a different time period. Note that the axis titles and graph title are supplied automatically using the MCollective DDL:

$ mco plot resource config_retrieval_time
 
                   Information about Puppet managed resources
  Nodes
    6 ++-*****----+----------+-----------+----------+----------+----------++
      +      *    +          +           +          +          +           +
      |       *                                                            |
    5 ++      *                                                           ++
      |       *                                                            |
      |        *                                                           |
    4 ++       *      *                                                   ++
      |        *      *                                                    |
      |         *    * *                                                   |
    3 ++        *    * *                                                  ++
      |          *  *  *                                                   |
      |           * *   *                                                  |
    2 ++           *    *                         *        *              ++
      |                 *                         **       **              |
      |                  *                       * *      *  *             |
    1 ++                 *               *       *  *     *   **        * ++
      |                  *              * *     *   *     *     **    **   |
      +           +       *  +         * + *    *   +*   *     +     *     +
    0 ++----------+-------*************--+--****----+*****-----+--***-----++
      0           10         20          30         40         50          60
                              Config Retrieval Time

So this is pretty serviceable for showing this data on the console! It wouldn’t scale to many lines but for just visualizing some arbitrary series of numbers it’s quite nice. Here’s the GNU Plot script that made the text graph:

set title "Information about Puppet managed resources"
set terminal dumb 78 24
set key off
set ylabel "Nodes"
set xlabel "Config Retrieval Time"
plot '-' with lines
3 6
6 6
9 3
11 2
14 4
17 0
20 0
22 0
25 0
28 0
30 1
33 0
36 0
38 2
41 0
44 0
46 2
49 1
52 0
54 0
57 1

The magic here comes from the second line, which sets the output terminal to dumb and supplies some dimensions. Very handy, and worth exploring some more and adding to your toolset for the CLI. I’ll look at writing a gem or something that supports both these modes.

There are a few other players in this space, I definitely recall coming across a Python tool to do graphs but cannot find it now, shout out in the comments if you know other approaches and I’ll add them to the post!

Updated: some links to related projects: sparkler, Graphite Spark

Rich data on the CLI

I’ve often wondered how things will change in a world where everything is a REST API and how relevant our Unix CLI tool chain will be in the long run. I’ve known we needed CLI ways to interact with data – like JSON data – and have given this a lot of thought.

MS Powershell does some pretty impressive object parsing on their CLI but I was never really sure how close we could get to that in Unix. I’ve wanted to start my journey with the grep utility as that seemed a natural starting point and my most used CLI tool.

I have no idea how to write parsers and matchers but luckily I have a very talented programmer working for me who was able to take my ideas and realize them awesomely. Pieter wrote a JSON grep and I want to show off a few bits of what it can do.

I’ll work with the document below:

[
  {"name":"R.I.Pienaar",
   "contacts": [
                 {"protocol":"twitter", "address":"ripienaar"},
                 {"protocol":"email", "address":"rip@devco.net"},
                 {"protocol":"msisdn", "address":"1234567890"}
               ]
  },
  {"name":"Pieter Loubser",
   "contacts": [
                 {"protocol":"twitter", "address":"pieterloubser"},
                 {"protocol":"email", "address":"foo@example.com"},
                 {"protocol":"msisdn", "address":"1234567890"}
               ]
  }
]

There are a few interesting things to note about this data:

  • The document is an array of hashes; this maps well to the stream-of-data paradigm we know from lines of text in a file. This is the basic structure jgrep works on.
  • Each document has another nested set of documents in an array – the contacts array.

Examples

The examples below show a few possible grep use cases:

A simple grep for a single key in the document:

$ cat example.json | jgrep "name='R.I.Pienaar'"
[
  {"name":"R.I.Pienaar",
   "contacts": [
                 {"protocol":"twitter", "address":"ripienaar"},
                 {"protocol":"email", "address":"rip@devco.net"},
                 {"protocol":"msisdn", "address":"1234567890"}
               ]
  }
]

We can extract a single key from the result:

$ cat example.json | jgrep "name='R.I.Pienaar'" -s name
R.I.Pienaar

A simple grep for 2 keys in the document:

% cat example.json | 
    jgrep "name='R.I.Pienaar' and contacts.protocol=twitter" -s name
R.I.Pienaar

The nested documents pose a problem though: if we were to search for contacts.protocol=twitter and contacts.address=1234567890, we would get both documents back when the correct answer is none. To search the sub-documents effectively, we need to ensure that these 2 values exist in the same sub-document.

$ cat example.json | 
     jgrep "[contacts.protocol=twitter and contacts.address=1234567890]"

Placing [] around the 2 terms works like () but restricts the search to the specific sub document. In this case there is no sub document in the contacts array that has both twitter and 1234567890.

Of course you can have many search terms:

% cat example.json | 
     jgrep "[contacts.protocol=twitter and contacts.address=1234567890] or name='R.I.Pienaar'" -s name
R.I.Pienaar

We can also construct entirely new documents:

% cat example.json | jgrep "name='R.I.Pienaar'" -s "name contacts.address"
[
  {
    "name": "R.I.Pienaar",
    "contacts.address": [
      "ripienaar",
      "rip@devco.net",
      "1234567890"
    ]
  }
]

Real World

So I am adding JSON output support to MCollective. Today I was rolling out a new Nagios check script to my nodes and wanted to be sure they all had it. I used the File Manager agent to fetch the stats for my file from all the machines, then printed the ones that didn’t match my expected MD5.

$ mco rpc filemgr status file=/.../check_puppet.rb -j | 
   jgrep 'data.md5!=a4fdf7a8cc756d0455357b37501c24b5' -s sender
box1.example.com

Eventually you will be able to pipe this output to mco again and call another agent. Here I take all the machines that didn’t yet have the right file and cause a puppet run to happen on them; this is very PowerShell-like and the eventual use case I am building this for:

$ mco rpc filemgr status file=/.../check_puppet.rb -j | 
   jgrep 'data.md5!=a4fdf7a8cc756d0455357b37501c24b5' |
   mco rpc puppetd runonce

I also wanted to know the total size of a logfile across my web servers to be sure I would have enough space to copy them all:

$ mco rpc filemgr status file=/var/log/httpd/access_log -W /apache/ -j |
    jgrep -s "data.size"|
    awk '{ SUM += $1} END { print SUM/1024/1024 " MB"}'
2757.9093 MB

Now how about interacting with a webservice like the GitHub API:

$ curl -s http://github.com/api/v2/json/commits/list/puppetlabs/marionette-collective/master|
   jgrep --start commits "author.name='Pieter Loubser'" -s id
52470fee0b9fe14fb63aeb344099d0c74eaf7513

Here I fetched the most recent commits in the marionette-collective GitHub repository, searched for ones by Pieter, and returned the IDs of those commits. The --start argument is needed because the top of the returned JSON is not the array we care about; --start tells jgrep to take the commits key and grep that.

Or since it’s Sysadmin Appreciation Day how about tweets about it:

% curl -s "http://search.twitter.com/search.json?q=sysadminday"|
   jgrep --start results -s "text"
 
RT @RedHat_Training: Did you know that today is Systems Admin Day?  A big THANK YOU to all our system admins!  Here's to you!  http://t.co/ZQk8ifl
RT @SinnerBOFH: #BOFHers RT @linuxfoundation: Happy #SysAdmin Day! You know who you are, rock stars. http://t.co/kR0dhhc #linux
RT @google: Hey, sysadmins - thanks for all you do. May your pagers be silent and your users be clueful today! http://t.co/N2XzFgw
RT @google: Hey, sysadmins - thanks for all you do. May your pagers be silent and your users be clueful today! http://t.co/y9TbCqb #sysadminday
RT @mfujiwara: http://www.sysadminday.com/
RT @mitchjoel: It's SysAdmin Day! Have you hugged your SysAdmin today? Make sure all employees follow the rules: http://bit.ly/17m98z #humor
? @mfujiwara: http://www.sysadminday.com/

Here, as before, we have to tell jgrep to grep the results array contained inside the response.

I can also find all the restaurants near my village via SimpleGEO:

curl -x localhost:8001 -s "http://api.simplegeo.com/1.0/places/51.476959,0.006759.json?category=Restaurant"|
   jgrep --start features "properties.distance<2.0" -s "properties.address \
                                      properties.name \
                                      properties.postcode \
                                      properties.phone \
                                      properties.distance"
[
  {
    "properties.address": "15 Stratheden Road",
    "properties.distance": 0.773576114771768,
    "properties.phone": "+44 20 8858 8008",
    "properties.name": "The Lamplight",
    "properties.postcode": "SE3 7TH"
  },
  {
    "properties.address": "9 Stratheden Parade",
    "properties.distance": 0.870622234751732,
    "properties.phone": "+44 20 8858 0728",
    "properties.name": "Sun Ya",
    "properties.postcode": "SE3 7SX"
  }
]

There’s a lot more I didn’t show; it supports all the usual operators (<=, >=, and so on) and a fair few other bits.

You can get this utility by installing the jgrep Ruby Gem or grab the code from GitHub. The Gem is a library so you can use these abilities in your ruby programs but also includes the CLI tool shown here.

It’s pretty new code and we’d totally love feedback, bugs and ideas! Follow the author on Twitter at @pieterloubser and send him some appreciation too.