nf_conntrack: table full - how the absence of rules can lead to unexpected behaviour

Posted by sam Tue, 15 Nov 2011 21:49:00 GMT

I recently observed the dreaded:

nf_conntrack: table full, dropping packet

message on a host that formed part of the external tier of an infrastructure, where we expected, managed and throttled many connections. The odd thing was that the hosts should not have been doing anything iptables-wise to track connections or otherwise generate this message. On both the behaving and misbehaving hosts, `iptables -L` showed a bunch of empty chains. Odd.

However, a few leaps of logic later led to the following being discovered on the well-behaved hosts:

# lsmod | egrep 'ip_tables|conntrack'
ip_tables               9899  1 iptable_filter
x_tables               14175  1 ip_tables

and, curiously, this on the misbehaving hosts:

# lsmod | egrep 'ip_tables|conntrack'
nf_conntrack_ipv4      10346  3 iptable_nat,nf_nat
nf_conntrack           60975  4 ipt_MASQUERADE,iptable_nat,nf_nat,nf_conntrack_ipv4
nf_defrag_ipv4          1073  1 nf_conntrack_ipv4
ip_tables               9899  2 iptable_nat,iptable_filter
x_tables               14175  3 ipt_MASQUERADE,iptable_nat,ip_tables

Sure enough, we can see why nf_conntrack is now involved in the network stack and why we might be filling up its connection table, but it doesn’t explain the disparity between the hosts.
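If you want to check a batch of hosts for this without eyeballing lsmod output, something along these lines should do. It is only a sketch: the function name `conntrack_loaded` is my own invention, and it assumes the module name is the first field on each line, as it is in both /proc/modules and lsmod output.

```shell
# Hypothetical helper: succeed if nf_conntrack appears in a module listing.
# Reads /proc/modules by default; pass a file path to check a saved listing.
# Only an exact match on the first field counts, so nf_conntrack_ipv4 alone
# does not trigger it.
conntrack_loaded() {
    awk '$1 == "nf_conntrack" { found = 1 } END { exit !found }' "${1:-/proc/modules}"
}
```

For example, `conntrack_loaded && echo "conntrack is in the packet path"` on a live host.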

In retrospect the explanation is blindingly obvious, craftily subtle and provable to boot. In short, when a rule is added to the ‘nat’ iptables table, the kernel modules required to support it are dynamically loaded. They remain loaded, and are therefore part of the packet path, even after the rules are flushed. What this means in practice is that, for a running kernel, once you have defined a nat iptables rule you are at the mercy of the connection tracking table size and other constraints for the lifetime of that kernel run. Or, put more simply, creating and flushing nat rules does not leave you in the same state as having never created them.

We can prove this in a rather ham-fisted way.

We’ll create a small dummy client and server in Ruby for the purposes of opening many concurrent connections. We’ll lower some of the limits so that we can reproduce the error without requiring massive live scale. The following scripts are best run under Ruby 1.9 so that we can make use of native threads.

#!/usr/bin/env ruby1.9
#
# Accept many connections
#
require 'socket'

server = TCPServer.open(7777)
loop {
    Thread.start(server.accept) do |client|
        loop {
            sleep 60    # do nothing
        }
    end
}

 

#!/usr/bin/env ruby1.9
#
# Connect many times
#

require 'socket'

host = 'localhost'
port = 7777

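# Keep a reference to every thread so that we can wait on them below
threads = 19998.times.map do
    Thread.start do
        TCPSocket.open(host, port)
        loop {
            sleep 60    # do nothing
        }
    end
end

# Without this the main thread exits and takes every connection with it
threads.each(&:join)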

As root, we’ll run the following to open up our ability to create connections:

ulimit -n 20000
echo 20000 > /proc/sys/kernel/threads-max
echo 0 > /proc/sys/net/ipv4/tcp_syncookies
iptables -L   # forces the 'ip_tables' kernel modules to be loaded with empty tables and chains

If you then run ./server.rb followed by ./client.rb and `watch -n 2 "dmesg | tail -10"` you’ll see, well, not much going on. However, if we introduce and then flush and delete a nat table iptables ruleset, we’ll see that the modules stay loaded and that the tests produce the expected error in the kernel ring buffer output:

iptables --table nat --append POSTROUTING --out-interface eth0 -j MASQUERADE
iptables --flush
iptables --table nat --flush
iptables --delete-chain
iptables --table nat --delete-chain
...
# lsmod | egrep 'ip_tables|conn'
nf_conntrack_ipv4      10346  3 iptable_nat,nf_nat
nf_conntrack           60975  4 ipt_MASQUERADE,iptable_nat,nf_nat,nf_conntrack_ipv4
nf_defrag_ipv4          1073  1 nf_conntrack_ipv4
ip_tables               9899  2 iptable_nat,iptable_filter
x_tables               14175  3 ipt_MASQUERADE,iptable_nat,ip_tables
...
sysctl net.netfilter.nf_conntrack_max=100

If we run the same tests again with the artificially low limit and monitor the kernel ring buffer with `watch -n 2 "dmesg | tail -10"` once again, you’ll quickly see the “nf_conntrack: table full, dropping packet” message.
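Rather than waiting for the message to turn up in dmesg, you can watch the table fill. The sketch below assumes the nf_conntrack_count and nf_conntrack_max entries under /proc/sys/net/netfilter, which are only present while nf_conntrack is loaded. The function name `conntrack_usage` is made up for this example.

```shell
# Hypothetical helper: print conntrack table usage as "count/max (pct%)".
# Reads both values from /proc unless they are passed as arguments.
conntrack_usage() {
    count=${1:-$(cat /proc/sys/net/netfilter/nf_conntrack_count)}
    max=${2:-$(cat /proc/sys/net/netfilter/nf_conntrack_max)}
    # Integer arithmetic, so the percentage is rounded down
    echo "$count/$max ($(( count * 100 / max ))%)"
}
```

You can pass the two numbers in by hand to try it out, e.g. `conntrack_usage 25 100`, or call it with no arguments on a host where the module is loaded.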

So, what have we learnt here? The short of it is that manipulating nat tables under iptables on a running kernel will change the behaviour of your network stack, and that clearing down any nat tables will not return the stack to the same previous state. In order to do that you’ll have to:

rmmod iptable_nat
rmmod ipt_MASQUERADE
rmmod nf_nat
rmmod nf_conntrack_ipv4
rmmod nf_conntrack
rmmod nf_defrag_ipv4
...
# lsmod | egrep 'ip_tables|conn'
ip_tables               9899  1 iptable_filter
x_tables               14175  1 ip_tables

to return things to the previous state, at least for this example. Whether that is preferable on a production world-facing system to a reboot or a recommission is open to debate.
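If you do go down the rmmod route on more than one host, it may be worth wrapping the sequence up so you can see what it would do first. This is a rough sketch only: `unload_nat_modules` is a name I have made up, the module list is the one from this example and may well differ on your kernel, and by default it only prints the commands.

```shell
# Hypothetical helper: unload the NAT/conntrack modules in the order used above.
# Prints the commands by default; pass "rmmod" as the first argument (as root)
# to actually run them. Stops at the first failure, e.g. a module still in use.
unload_nat_modules() {
    run=${1:-echo rmmod}
    for mod in iptable_nat ipt_MASQUERADE nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4; do
        $run "$mod" || return 1
    done
}
```

Calling `unload_nat_modules` on its own is a dry run; `unload_nat_modules rmmod` does it for real.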

EDIT:

Some further testing on my part has determined that even listing the nat tables with `iptables -t nat -L` will cause the conntrack modules to be probed into the kernel. For very busy world-facing hosts the only solution that I can see is to add the various conntrack modules to /etc/modprobe.d/blacklist.conf to ensure that they are never loaded.

I’ve seen this on CentOS/RHEL and Ubuntu, right up to the current server release of the latter.
 

Nice aliases for working with iptables

Posted by sam Fri, 20 Nov 2009 15:54:00 GMT

I’ve been using a couple of nice shell aliases when working with ad-hoc iptables rules. You could spruce them up into a shell script, but they’re fine for me as a quick and dirty way to manipulate rules.

alias ips="/sbin/iptables --line-numbers -vn -L INPUT | grep -i"
alias ipd="/sbin/iptables -D INPUT"

That’s all there is to it. You can then interrogate almost any aspect of the default INPUT filter with:

ips icmp
ips 10.64.0
ips drop

to view all ICMP rules, any rules relating to the 10.64.0 subnet, or all rules that drop packets.

The way I use these together, and the reason that `ips` includes the --line-numbers argument, is that I like to add rules and then easily delete them with:

# ips 192
30       0     0 DROP       all  --  *      *       192.0.2.0/24         0.0.0.0/0
# ipd 30

using the rule number as an easier way of deleting the rule without having to conjure up a matching rule specification.
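One thing to watch when deleting several rules by number is that the remaining rules are renumbered after each delete, so it is safest to work from the highest number down. A possible sketch of that follows. The function name `rule_numbers` is my own, and it assumes the listing format shown above, with the rule number in the first column.

```shell
# Hypothetical helper: read `iptables --line-numbers -vn -L INPUT` output on
# stdin and print the numbers of the rules matching a pattern, highest first.
# Lines whose first field is not a number (chain and column headers) are skipped.
rule_numbers() {
    grep -i -- "$1" | awk '$1 ~ /^[0-9]+$/ { print $1 }' | sort -rn
}
```

Used with the same command as the `ips` alias, that might look like `/sbin/iptables --line-numbers -vn -L INPUT | rule_numbers 192.0.2 | while read -r n; do /sbin/iptables -D INPUT "$n"; done`, though I would run the first two stages on their own and check the numbers before letting it delete anything.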