nf_conntrack: table full - how the absence of rules can lead to unexpected behaviour
I recently observed the dreaded:
nf_conntrack: table full, dropping packet
message on a host that formed part of the external tier of an infrastructure, where we expected, managed and throttled many connections. The odd thing was that the hosts should not have been doing anything iptables-wise that would track connections or otherwise generate this message. On both the well-behaved and the misbehaving hosts, `iptables -L` showed nothing but empty chains. Odd.
However, a few leaps of logic later led to the following being discovered on the well-behaved hosts:
# lsmod | egrep 'ip_tables|conntrack'
ip_tables 9899 1 iptable_filter
x_tables 14175 1 ip_tables
and curiously this on the misbehaving hosts:
# lsmod | egrep 'ip_tables|conntrack'
nf_conntrack_ipv4 10346 3 iptable_nat,nf_nat
nf_conntrack 60975 4 ipt_MASQUERADE,iptable_nat,nf_nat,nf_conntrack_ipv4
nf_defrag_ipv4 1073 1 nf_conntrack_ipv4
ip_tables 9899 2 iptable_nat,iptable_filter
x_tables 14175 3 ipt_MASQUERADE,iptable_nat,ip_tables
Sure enough, we can see why nf_conntrack is now involved in the network stack and why we might be filling up its table, but it doesn’t explain the disparity between the hosts.
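If you want to check a batch of hosts for the same disparity, a minimal Ruby sketch along these lines could do the comparison for you. The helper name `conntrack_modules` and the default pattern are my own choices for illustration; the sample text is the `lsmod` output shown above.

```ruby
# Sketch: list the conntrack-related modules that appear in `lsmod` output.
# `conntrack_modules` is an illustrative helper, not part of any library.
def conntrack_modules(lsmod_text, pattern = /conntrack/)
  lsmod_text.each_line
            .map { |line| line.split.first } # first column is the module name
            .compact                         # drop blank lines
            .select { |name| name =~ pattern }
end

well_behaved = <<~EOS
  ip_tables 9899 1 iptable_filter
  x_tables 14175 1 ip_tables
EOS

misbehaving = <<~EOS
  nf_conntrack_ipv4 10346 3 iptable_nat,nf_nat
  nf_conntrack 60975 4 ipt_MASQUERADE,iptable_nat,nf_nat,nf_conntrack_ipv4
  nf_defrag_ipv4 1073 1 nf_conntrack_ipv4
  ip_tables 9899 2 iptable_nat,iptable_filter
  x_tables 14175 3 ipt_MASQUERADE,iptable_nat,ip_tables
EOS

p conntrack_modules(well_behaved)
p conntrack_modules(misbehaving)
```

On a live host you would pass in the real output instead of the samples, for example the result of running `lsmod` from a backtick call.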
In retrospect the explanation is blindingly obvious, craftily subtle and provable to boot. In short, when a rule is added to the ‘nat’ iptables table, the kernel modules required to support it are loaded dynamically. They remain loaded, and therefore remain part of the packet processing path, even if the rules are flushed. What this means in practice is that, for a running kernel, once you have defined a nat iptables rule you are at the mercy of the connection tracking table size and other constraints for the lifetime of that kernel run. Or put more simply, creating and flushing nat rules does not leave you in the same state as having never created them.
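To see how close a host is to that limit, one option is to read the conntrack sysctls directly. The sketch below assumes the usual `/proc/sys/net/netfilter/nf_conntrack_count` and `nf_conntrack_max` files, which only exist once the conntrack module is loaded; the helper name `conntrack_usage` is hypothetical.

```ruby
# Sketch: report how full the connection tracking table is.
# Returns nil when the sysctl files are absent (conntrack not loaded),
# otherwise the current entry count, the limit and the fill ratio.
def conntrack_usage(dir = "/proc/sys/net/netfilter")
  count_file = File.join(dir, "nf_conntrack_count")
  max_file   = File.join(dir, "nf_conntrack_max")
  return nil unless File.readable?(count_file) && File.readable?(max_file)

  count = File.read(count_file).to_i
  max   = File.read(max_file).to_i
  { count: count, max: max, ratio: max.zero? ? 0.0 : count.to_f / max }
end

usage = conntrack_usage
if usage
  printf("conntrack: %d of %d entries (%.1f%%)\n",
         usage[:count], usage[:max], usage[:ratio] * 100)
else
  puts "nf_conntrack sysctls not present; connection tracking is probably not loaded"
end
```

The absence of the files is itself a useful signal here: on a host that has never touched the nat table there should be nothing to read.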
We can prove this in a rather ham-fisted way.
We’ll create a small dummy client and server in Ruby for the purposes of opening many concurrent connections. We’ll lower some of the limits so that we can reproduce the error without requiring massive live scale. The following scripts are best run under Ruby 1.9 so that we can make use of native threads.
#!/usr/bin/env ruby1.9
#
# Accept many connections
#
require 'socket'

server = TCPServer.open(7777)
loop {
  Thread.start(server.accept) do |client|
    loop {
      sleep 60 # do nothing
    }
  end
}
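#!/usr/bin/env ruby1.9
#
# Connect many times
#
require 'socket'

host = 'localhost'
port = 7777
19998.times do
  Thread.start do
    sock = TCPSocket.open(host, port) # hold a reference so the socket is not garbage collected
    loop {
      sleep 60 # do nothing
    }
  end
end
sleep # keep the main thread alive; the other threads are killed when it exits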
As root, we’ll run the following to open up our ability to create connections:
ulimit -n 20000
echo 20000 > /proc/sys/kernel/threads-max
echo 0 > /proc/sys/net/ipv4/tcp_syncookies
iptables -L # forces the 'ip_tables' kernel modules to be loaded with empty tables and chains
If you then run ./server.rb followed by ./client.rb and `watch -n 2 "dmesg | tail -10"` you’ll see, well, not much going on. However, if we introduce and then flush and delete a nat table iptables ruleset, we’ll see both that the modules are loaded and that the tests produce the expected error in the kernel ring buffer output:
iptables --table nat --append POSTROUTING --out-interface eth0 -j MASQUERADE
iptables --flush
iptables --table nat --flush
iptables --delete-chain
iptables --table nat --delete-chain
...
# lsmod | egrep 'ip_tables|conn'
nf_conntrack_ipv4 10346 3 iptable_nat,nf_nat
nf_conntrack 60975 4 ipt_MASQUERADE,iptable_nat,nf_nat,nf_conntrack_ipv4
nf_defrag_ipv4 1073 1 nf_conntrack_ipv4
ip_tables 9899 2 iptable_nat,iptable_filter
x_tables 14175 3 ipt_MASQUERADE,iptable_nat,ip_tables
...
sysctl net.netfilter.nf_conntrack_max=100
If we run the same tests again with the artificially low limit and monitor the kernel ring buffer with `watch -n 2 "dmesg | tail -10"` once again, you’ll quickly see the “nf_conntrack: table full, dropping packet” message.
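If staring at `watch` output is not your thing, a rough Ruby sketch like the following can count the message in a chunk of kernel log text instead. The helper name `table_full_count` is made up for this example, and the sample simply repeats the message quoted above.

```ruby
# Sketch: count conntrack "table full" messages in kernel log text.
# `table_full_count` is an illustrative helper name, not an existing tool.
def table_full_count(log_text, needle = "nf_conntrack: table full, dropping packet")
  log_text.each_line.count { |line| line.include?(needle) }
end

sample = <<~LOG
  nf_conntrack: table full, dropping packet
  nf_conntrack: table full, dropping packet
LOG

puts table_full_count(sample)
```

On a live host you could feed it the output of `dmesg` from a backtick call. As far as I know the kernel rate-limits this message, so treat the count as a lower bound on the number of dropped packets rather than an exact figure.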
So, what have we learnt here? The short of it is that manipulating nat tables under iptables on a running kernel will change the behaviour of your network stack, and that clearing down any nat tables will not return the stack to the same previous state. In order to do that you’ll have to:
rmmod iptable_nat
rmmod ipt_MASQUERADE
rmmod nf_nat
rmmod nf_conntrack_ipv4
rmmod nf_conntrack
rmmod nf_defrag_ipv4
...
# lsmod | egrep 'ip_tables|conn'
ip_tables 9899 1 iptable_filter
x_tables 14175 1 ip_tables
to return things to the previous state, at least for this example. Whether that is preferable on a production world-facing system to a reboot or a recommission is open to debate.
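The order of the `rmmod` calls matters, because a module can only be removed once nothing else is using it. As a hedged sketch, the order can be derived from the "used by" column of the full, unfiltered `lsmod` output. Both helper names are my own, and in the example data the user lists for `nf_nat`, `iptable_nat` and `ipt_MASQUERADE` are assumptions, since the filtered output above does not show those rows.

```ruby
# Sketch: derive a workable `rmmod` order from lsmod-style dependency data.

# Parses full `lsmod` output into { module_name => [modules using it] }.
def parse_lsmod(text)
  text.each_line.each_with_object({}) do |line, deps|
    name, _size, _refcount, used_by = line.split
    next if name.nil? || name == "Module" # skip blank lines and the header
    deps[name] = used_by.to_s.split(",")
  end
end

# Returns `wanted` reordered so that every module comes after all of its
# users. Raises if a module is still held by something outside `wanted`.
def removal_order(deps, wanted)
  remaining = wanted.dup
  order = []
  until remaining.empty?
    free = remaining.find { |mod| (deps.fetch(mod, []) - order).empty? }
    raise "still in use: #{remaining.join(', ')}" if free.nil?
    order << free
    remaining.delete(free)
  end
  order
end

deps = {
  "iptable_nat"       => [],                              # assumed: no users
  "ipt_MASQUERADE"    => [],                              # assumed: no users
  "nf_nat"            => %w[ipt_MASQUERADE iptable_nat],  # assumed
  "nf_conntrack_ipv4" => %w[iptable_nat nf_nat],
  "nf_conntrack"      => %w[ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4],
  "nf_defrag_ipv4"    => %w[nf_conntrack_ipv4]
}

removal_order(deps, deps.keys.reverse).each { |mod| puts "rmmod #{mod}" }
```

More than one order can be valid, so the output need not match the list above line for line. The sketch only prints the commands; whether to actually run them on a live host is the same judgment call as before.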
EDIT:
Some further testing on my part has determined that even listing the nat tables with `iptables -t nat -L` will cause the conntrack modules to be probed into the kernel. For very busy world-facing hosts the only solution that I can see is to add the various conntrack modules to /etc/modprobe.d/blacklist.conf to ensure that they are never loaded.
I’ve seen this on CentOS/RHEL and Ubuntu, right up to the current server release of the latter.
Nice aliases for working with iptables
I’ve been using a couple of nice shell aliases when working with ad-hoc iptables rules. You can spruce them up as a batch file, but they’re fine for me as a quick and dirty way to manipulate rules.
alias ips="/sbin/iptables --line-numbers -vn -L INPUT | grep -i"
alias ipd="/sbin/iptables -D INPUT"
That’s all there is to it. You can then interrogate almost any aspect of the default INPUT filter with:
ips icmp
ips 10.64.0
ips drop
to view all ICMP rules, any rules relating to the 10.64.0 subnet, or all rules that drop packets.
The way I use these together, and the reason that `ips` includes the `--line-numbers` argument, is that I like to add rules and then easily delete them with:
# ips 192
30 0 0 DROP all -- * * 192.0.2.0/24 0.0.0.0/0
# ipd 30
using the rule number as an easier way of deleting the rule without having to conjure up a matching rule specification.
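If you would rather script the lookup than eyeball it, the same idea can be sketched in Ruby. The helper name `matching_rules` is hypothetical, and the second rule in the sample listing is invented purely to give the filter something to reject. Note that it does a plain case-insensitive substring match, whereas `grep -i` treats its argument as a regular expression.

```ruby
# Sketch: find rule numbers in `iptables --line-numbers -vn -L INPUT` output.
# Returns [rule_number, line] pairs for rules containing `needle`.
def matching_rules(listing, needle)
  listing.each_line
         .map(&:strip)
         .select { |line| line =~ /\A\d+\s/ }                    # numbered rule lines only
         .select { |line| line.downcase.include?(needle.downcase) }
         .map { |line| [line.split.first.to_i, line] }
end

listing = <<~EOS
  30 0 0 DROP all -- * * 192.0.2.0/24 0.0.0.0/0
  31 0 0 ACCEPT icmp -- * * 0.0.0.0/0 0.0.0.0/0
EOS

matching_rules(listing, "drop").each do |number, _line|
  puts "/sbin/iptables -D INPUT #{number}" # print the delete command rather than run it
end
```

If you delete more than one rule this way, work from the highest number down, since the remaining rules are renumbered after each deletion.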

