Linux Troubleshooting 101, 2016 Edition

Back in 2006 I wrote a blog post about Linux troubleshooting. Bert Van Vreckem pointed out that it might be time for an update ..

There's not that much that has changed .. however :)

Everything is a DNS Problem

Everything is a Fscking DNS Problem
No really, Everything is a Fscking DNS Problem
If it's not a fucking DNS Problem ..
It's a Full Filesystem Problem
If your filesystem isn't full
It is a SELinux problem
If you have SELinux disabled
It might be an ntp problem
If it's not an ntp problem
It's an arp problem
If it's not an arp problem...
It is a Java Garbage Collection problem
If you ain't running Java
It's a natting problem
If you are already on IPv6
It's a Spanning Tree problem
If it's not a Spanning Tree problem...
It's a USB problem
If it's not a USB Problem
It's a sharing IRQ Problem
If it's not a sharing IRQ Problem
But most often .. it's a Freaking DNS Problem!

Length limit to iptables chain names

I ran into an odd issue today: my firewall build script was failing on our account master node.

It turns out that I was trying to use a chain name in iptables that exceeded the maximum length allowed. I wanted to use "REMOTE_ACCOUNT_SLAVES_ASHEVILLE" (31 chars) and the limit is 30 chars.

You can see this in /usr/include/linux/netfilter_ipv4/ip_tables.h and /usr/include/linux/netfilter/x_tables.h:

/usr/include/linux/netfilter_ipv4/ip_tables.h
22:#define IPT_FUNCTION_MAXNAMELEN XT_FUNCTION_MAXNAMELEN

/usr/include/linux/netfilter/x_tables.h
4:#define XT_FUNCTION_MAXNAMELEN 30
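
A quick way to confirm the limit from a shell (the shortened chain name here is just an example):

iptables -N REMOTE_ACCOUNT_SLAVES_ASHEVILLE   # 31 chars: rejected as too long
iptables -N REMOTE_ACCT_SLAVES_ASHEVILLE      # 28 chars: accepted
iptables -X REMOTE_ACCT_SLAVES_ASHEVILLE      # delete the test chain again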

This was on CentOS 5.6.

Issues with iptables stateful filtering

I hit a weird issue today. I have Apache configured as a reverse proxy using mod_proxy_balancer, which is a welcome addition in Apache 2.2.x. This forwards selected requests to more Apache instances running mod_perl, although this could be any sort of application layer; pretty standard stuff.
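
For reference, the balancer side of the config looks something like this (the balancer name and member addresses are illustrative, not my exact setup):

<Proxy balancer://appcluster>
    BalancerMember http://192.0.2.1:80 retry=60
    BalancerMember http://192.0.2.3:80 retry=60
</Proxy>
ProxyPass /app balancer://appcluster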

With ProxyStatus enabled and a reasonable amount of traffic pushed through the proxy, I started to notice that the application layer Apache instances would regularly get marked as being in an error state, removing them from the pool of available servers until their cooldown period expired and the proxy re-enabled them.

Investigating the logs turned up this error:

[Tue Sep 28 00:24:39 2010] [error] (113)No route to host: proxy: HTTP: attempt to connect to 192.0.2.1:80 (192.0.2.1) failed
[Tue Sep 28 00:24:39 2010] [error] ap_proxy_connect_backend disabling worker for (192.0.2.1)

A spot of the ol’ google-fu turned up descriptions of similar problems, some not even related to Apache. The problem looked related to iptables.

All the servers are running CentOS 5, and anyone who runs it is probably aware of the stock Red Hat iptables ruleset. With HTTP access enabled, it looks something like this:

 1  -A INPUT -j RH-Firewall-1-INPUT
 2  -A FORWARD -j RH-Firewall-1-INPUT
 3  -A RH-Firewall-1-INPUT -i lo -j ACCEPT
 4  -A RH-Firewall-1-INPUT -p icmp --icmp-type any -j ACCEPT
 5  -A RH-Firewall-1-INPUT -p 50 -j ACCEPT
 6  -A RH-Firewall-1-INPUT -p 51 -j ACCEPT
 7  -A RH-Firewall-1-INPUT -p udp --dport 5353 -d 224.0.0.251 -j ACCEPT
 8  -A RH-Firewall-1-INPUT -p udp -m udp --dport 631 -j ACCEPT
 9  -A RH-Firewall-1-INPUT -p tcp -m tcp --dport 631 -j ACCEPT
10  -A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
11  -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
12  -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
13  -A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited

Almost all of it is boilerplate apart from line 12, which I added; it is identical to the line above it that grants access to SSH. Analysing the live hit counts against each of these rules showed a large number hitting the catch-all rule on line 13, and indeed this is what was causing the Apache errors.
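
For what it's worth, those live hit counts come straight out of iptables' verbose listing:

iptables -L RH-Firewall-1-INPUT -v -n --line-numbers   # per-rule packet/byte counters
iptables -Z RH-Firewall-1-INPUT                        # zero the counters to watch them climb afresh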

Analysing the traffic with tcpdump/wireshark showed that the frontend Apache server only gets as far as sending the initial SYN packet, which fails to match either the dedicated rule on line 12 for HTTP traffic or the rule on line 10 matching related and previously established traffic, although I wouldn't really expect it to match the latter.
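
A capture filter along these lines (the interface name is an assumption) isolates the SYNs in question on the frontend:

tcpdump -n -i eth0 'host 192.0.2.1 and tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn'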

Adding a rule before the last one to match and log any HTTP packets considered to be in the state INVALID showed that, for some strange reason, iptables was indeed deciding that an initial SYN was invalid.
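
That diagnostic rule looked roughly like this; inserting at position 13 drops it in just before the final REJECT (adjust the position for your own ruleset):

iptables -I RH-Firewall-1-INPUT 13 -p tcp --dport 80 -m state --state INVALID -j LOG --log-prefix "INVALID HTTP: "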

More information about why it might be invalid can be coaxed from the kernel by issuing the following:

# echo 255 > /proc/sys/net/ipv4/netfilter/ip_conntrack_log_invalid

All this gave me, though, was some extra text saying “invalid state” along with the same output you get from the standard logging target:

ip_ct_tcp: invalid state IN= OUT= SRC=192.0.2.2 DST=192.0.2.1 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=21158 DF PROTO=TCP SPT=57351 DPT=80 SEQ=1402246598 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A291DC7600000000001030307)

Searching the netfilter Bugzilla for matching bug reports turned up a knob that relaxes the connection tracking, enabled with the following:

# echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal

This didn’t fix things, and it’s recommended only for extreme cases involving a broken router or firewall anyway. As all of these machines are on the same network segment with just a switch connecting them, there shouldn’t be anything mangling the packets.

With those avenues exhausted, the suggested workaround was to add a rule similar to the following before the last rule:

-A RH-Firewall-1-INPUT -m tcp -p tcp --dport 80 --syn -j REJECT --reject-with tcp-reset

This doesn’t really work either: the proxy still drops the servers from the pool as before, it just logs a different reason:

[Tue Sep 28 13:59:23 2010] [error] (111)Connection refused: proxy: HTTP: attempt to connect to 192.0.2.1:80 (192.0.2.1) failed
[Tue Sep 28 13:59:23 2010] [error] ap_proxy_connect_backend disabling worker for (192.0.2.1)

This isn’t really acceptable to me anyway, as it requires the proxy to retry in response to a server politely telling it to go away, which is going to be a performance hit. The whole point of using mod_proxy_balancer for me was so it could legitimately remove servers that are dead or administratively stopped; how is it supposed to tell the difference between that and a dodgy firewall?

The only solution that worked and was deemed acceptable was to simply remove the state requirement on matching the HTTP traffic, like so:

-A RH-Firewall-1-INPUT -m tcp -p tcp --dport 80 -j ACCEPT

This will match both new and invalid packets. However, it’s left me with a pretty low opinion of iptables now. Give me pf any day.

Count of denied connections with iptables

In my iptables configurations, I generally allow all traffic I am interested in and deny the rest, logging anything that is denied.

I found that this can get a bit noisy with loads of connections to udp:137, udp:500, etc., so I decided to deny the more common ports without logging. But which are the most common ports?

A typical log line looks like this:

Jan 12 09:42:01 b003 kernel: FAILSAFE -- DENY IN=bond0 OUT= MAC=00:26:b9:3d:fb:18:00:d0:c0:52:88:00:08:00 SRC=115.177.32.129 DST=a.b.c.d LEN=40 TOS=0x00 PREC=0x00 TTL=46 ID=22118 PROTO=TCP SPT=6643 DPT=20480 WINDOW=0 RES=0x00 ACK RST URGP=0

I wrote this Perl one-liner to count the number of blocked packets per port:

perl -e 'while(<>) {if ($_ =~ /PROTO=([^ ]+).*DPT=([^ ]+)/) {$proto=$1 ; $port=$2; $count{"$proto:$port"}++}} foreach $key (sort { $count{$b} <=> $count{$a} } keys %count) {print "$key: $count{$key}\n"}' /var/log/messages | head -10

Sample output:

UDP:137: 226619
UDP:500: 107056
UDP:33436: 25244
UDP:33435: 22203
UDP:33437: 16035
TCP:20480: 10782
TCP:30080: 8291
UDP:33438: 8162
UDP:33439: 7268
UDP:33440: 5885
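
With the top offenders identified, silent drops for the noisiest ports can go in just before the logging rule. A minimal sketch, assuming the rules live in a chain named FAILSAFE as the log prefix suggests (adjust to your own layout); the port range covers the classic UDP traceroute probes that make up the 334xx entries above:

-A FAILSAFE -p udp --dport 137 -j DROP
-A FAILSAFE -p udp --dport 500 -j DROP
-A FAILSAFE -p udp --dport 33434:33523 -j DROP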