Deprecating Nagios, Or Why Every Host In Your Estate Should Serve A RESTful API
Large infrastructures built on cloud architectures have already solved the problem of how to manage many thousands of hosts by using configuration management frameworks, such as Chef, Puppet and a bunch of other tools that prefer other underlying runtimes, paradigms or approaches. However, operational monitoring of error conditions across many thousands of instances is generally still handled by some Nagios-like (or latterly Icinga) style system executing local or remote rulesets to test individual conditions and escalate problems through to operations teams and developers. I’ll use Nagios as the typical “traditional” monitoring framework in this post as it is so widely deployed and understood, but I believe the ideas contained here apply equally well to other monitoring systems that follow in this vein.
Whilst it is possible to scale these traditional monitoring systems by variously overloading tests, moving work to instances via remote execution plugins such as NRPE, and delegating remote test execution to worker hosts, this approach is complicated, does not scale linearly and annexes monitoring and host status to a single monolithic mechanism that is not especially queryable or reportable. Even if one exports notification events to some external data store such as MySQL, the latency between an operational event and action upon that event increases as the monitoring infrastructure grows to support more hosts and a growing ecosystem of tools and dashboards interacting with it. All of this inefficiency comes at the cost of more instances, more data to manage and more data to move around.
Tools such as Chef coupled with useful RESTful public APIs (and this applies to private clouds and other providers as much as it does to Amazon’s EC2) have, and quite rightly, turned infrastructure management into a code and development task, albeit one that benefits from high-level languages and simple concepts and in which the engineer/developer must be grounded somewhat in the real world of machinery and its various foibles. Infrastructure teams Doing It Right today are providing dashboards and “public” APIs as documented and dynamic entry points into the infrastructure on which services execute and persist, so that they can use their knowledge, talents and specialization to abstract these problems away from the rest of the business while keeping it informed.
So, taking all of this progress and best-practice, it seems strange that operational monitoring has not followed in this vein, and generally our sum-total knowledge of the estate that we manage is provided by various scripts executed periodically and aggregated back to a single monolithic system. Sure, Chef certainly has the ability to provide HTTP callbacks within recipes (and indeed, anything that can be accomplished in plain Ruby - which is, well, anything) as part of its execution, and has a system of report handlers that further formalize this. However, it is very unlikely that your configuration management tool of choice is executing with enough frequency to provide useful operational knowledge of things such as daemon failures. And if it does, you’ve just moved your scaling problem from one service to another.
I believe the way forward is to use one of the small web frameworks in your operational language of choice (and for me that’s Ruby) to have every instance host its own API service, and to do this in a very particular way. Let’s consider this trivial 10 minute implementation of a monitoring API server, written using Sinatra:
#!/usr/bin/env ruby
#
# a very simple example of an HTTP status daemon
#
require 'rubygems'
require 'sinatra'
require 'json'
registered_metrics = ['hostname', 'uptime', 'load_average']

helpers do
  # each helper answers :status with a boolean and :metric with the raw value
  def hostname(function)
    name = %x['hostname'].chomp
    case function
    when :status
      # chomp never returns nil, so test for an empty string instead
      return false if name.empty?
      return true
    when :metric
      return name
    end
  end

  def uptime(function)
    # first field of /proc/uptime is seconds since boot
    time = IO.read('/proc/uptime').chomp.to_i
    case function
    when :status
      return false if time < 600
      return true
    when :metric
      return time
    end
  end

  def load_average(function)
    # first field of /proc/loadavg is the one minute load average
    lavg = IO.read('/proc/loadavg').chomp
    case function
    when :status
      return false if lavg.split(' ')[0].to_f > 2
      return true
    when :metric
      return lavg
    end
  end
end

def build_hash(type, metrics)
  status = Hash.new
  metrics.each do |metric|
    status[metric] = send(metric.to_sym, type)
  end
  return status
end

get '/status' do
  build_hash(:status, registered_metrics).to_json
end

get '/metrics' do
  build_hash(:metric, registered_metrics).to_json
end

get '/happy' do
  return false.to_json if build_hash(:status, registered_metrics).values.include?(false)
  return true.to_json
end
Start the process on an instance or your workstation, and hit http://localhost:4567/happy. If your host has a hostname, a load average no higher than 2 and has been up for more than 10 minutes then you’ll get back the JSON value true; if any of those tests fails you’ll get false. Hopefully the box is happy, but suppose it was not - let’s pretend your load average is stupidly high and you’re warming a drink on the case. http://localhost:4567/status will return a JSON representation of the boolean state of the three tests, with the failing load average test showing as false in the returned structure. If we call http://localhost:4567/metrics we can examine the actual values.
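As a rough sketch of the consuming side, the following client code interprets those two responses using only the Ruby standard library. The method names (fetch, happy?, failing_checks) are my own for illustration, and port 4567 is simply Sinatra's default as used in the URLs above.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Fetch the raw body of one endpoint on a host's status daemon.
# 4567 is Sinatra's default port, as used in the URLs above.
def fetch(host, path, port = 4567)
  Net::HTTP.get(URI("http://#{host}:#{port}#{path}"))
end

# /happy returns the bare JSON literal true or false.
def happy?(happy_body)
  happy_body.strip == 'true'
end

# Names of the checks that /status reports as false.
def failing_checks(status_body)
  JSON.parse(status_body).reject { |_name, ok| ok }.keys
end

# Hypothetical usage against a running daemon:
#   unless happy?(fetch('localhost', '/happy'))
#     puts "failing: #{failing_checks(fetch('localhost', '/status')).join(', ')}"
#   end
```

The cheap boolean is asked first and the more detailed /status call is only made when something is wrong, which is the drill-down pattern the endpoints are designed for.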
So what have we gained, apart from essentially re-implementing a very basic NRPE-like service over HTTP rather than NRPE’s own TCP protocol, in 10 minutes of Ruby?
The job of determining whether an instance is in a decent state has been offloaded to the instance itself, and we have a lightweight (for the systems monitoring and aggregating) method of interrogating this as a simple boolean value, plus a way of drilling down when we need to. More importantly, we no longer have to aggregate to a single consuming service. There is no reason why a graphing service cannot call http://localhost:4567/metrics across the estate whilst a high-level executive dashboard is polling http://localhost:4567/happy. Even for aggregating services over large estates, we have a lightweight manner of polling to help us scale, with a method of exposing detail when it is required.
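To illustrate how light the aggregating side can be, here is a minimal sketch of a poller that asks every host for /happy concurrently and returns the unhappy ones. The method names and the injectable fetch argument are assumptions of mine, there so that the HTTP call can be swapped out; treating an unreachable host as unhappy is likewise a design choice rather than anything the daemon dictates.

```ruby
require 'net/http'
require 'uri'

# Default fetcher: GET /happy from a host (assumes the daemon listens on 4567).
def fetch_happy(host)
  Net::HTTP.get(URI("http://#{host}:4567/happy"))
end

# Poll every host concurrently and return the ones that are not happy.
# A host that cannot be reached, or that answers anything other than
# the JSON literal true, counts as unhappy.
def unhappy_hosts(hosts, fetch = method(:fetch_happy))
  threads = hosts.map do |host|
    Thread.new do
      ok = begin
        fetch.call(host).strip == 'true'
      rescue StandardError
        false
      end
      [host, ok]
    end
  end
  threads.map(&:value).reject { |_host, ok| ok }.map(&:first)
end
```

One thread per host is fine for a sketch; a real poller over thousands of instances would want a bounded worker pool and explicit timeouts.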
We’re exposing information about a host in a universal format using a universal transport mechanism. From pointing a browser at a known port on the instance from your workstation through to writing dashboards, or even integrating with an existing Icinga or Nagios installation via a plugin that can’t be more than 20-30 lines in most modern scripting languages, a single mechanism can service many needs. Using a framework such as Sinatra, there is no reason why the toy example above cannot be extended to serve HTML markup when a client requests that MIME type, and JSON when it requests JSON. Instant dashboard-suitable views, built by your instances with data about themselves.
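The Nagios integration mentioned above might look something like the following sketch. It maps the body of /status onto the conventional Nagios plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN); the method name nagios_result and the message wording are illustrative assumptions of mine.

```ruby
require 'json'

# Map the body of /status to an [exit_code, message] pair using the
# conventional Nagios plugin exit codes: 0 OK, 2 CRITICAL, 3 UNKNOWN.
def nagios_result(status_body)
  checks = JSON.parse(status_body)
  return [3, 'UNKNOWN - expected a JSON object of checks'] unless checks.is_a?(Hash)

  failed = checks.reject { |_name, ok| ok }.keys
  if failed.empty?
    [0, "OK - #{checks.size} checks passing"]
  else
    [2, "CRITICAL - failing: #{failed.join(', ')}"]
  end
rescue JSON::ParserError
  [3, 'UNKNOWN - response was not valid JSON']
end

# Hypothetical wiring as a plugin script (needs net/http and a reachable host):
#   code, message = nagios_result(Net::HTTP.get(URI("http://#{ARGV[0]}:4567/status")))
#   puts message
#   exit code
```

Keeping the decision logic separate from the HTTP call means the same method could feed a dashboard or an alerting script without change.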
Of course there will always be the need for “external” checks for things that a host cannot be relied upon to determine about itself. In a large number of cases, rather than being truly external, what we’re really monitoring is the interaction between tiers that are strictly hierarchical. As such, for many cases it is acceptable to have an instance report the high-level status of the tiers with which it is required to interact. http://localhost:4567/application_servers returning a JSON array of application server instances to which it is possible to make a connection and fetch a status page is not inconceivable.
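A sketch of the first half of that idea, assuming the instance already knows which application servers it talks to: the code below only tests whether a TCP connection can be opened, and leaves fetching the status page out. The method names, the host names in the commented route and the one second timeout are all assumptions for illustration.

```ruby
require 'socket'

# True if a TCP connection to host:port can be opened within the timeout.
def reachable?(host, port, timeout = 1)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  false
end

# Filter a list of [host, port] pairs down to those accepting connections.
def reachable_servers(servers)
  servers.select { |host, port| reachable?(host, port) }
end

# Hypothetical Sinatra route using it (the server list is an assumption):
#   get '/application_servers' do
#     reachable_servers([['app1.internal', 8080], ['app2.internal', 8080]]).to_json
#   end
```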
It is not that much of a leap to take the above example and to extend it to be trivially RESTful, such that one can refer to resources like http://localhost:4567/cpu/load and http://localhost:4567/cpu/cores/1/steal and be returned a value, and also to refer to higher-level collection endpoints to be returned JSON structures as summaries. And from there it is not too much of a leap to extend the concept from purely representing state to manipulating it: http://localhost:4567/service/mysql/restart ..?
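One way to get both leaf values and collection summaries from the same structure is to resolve the URL path against a nested hash of metrics. This is only a sketch: the lookup method, the metrics_tree name in the commented route and the shape of the tree are hypothetical.

```ruby
# Resolve a URL path such as "/cpu/cores/1/steal" against a nested hash of
# metrics. Leaves come back as values, intermediate nodes as summary hashes.
# Returns nil when the path names nothing.
def lookup(tree, path)
  keys = path.split('/').reject(&:empty?)
  keys.reduce(tree) do |node, key|
    return nil unless node.is_a?(Hash) && node.key?(key)
    node[key]
  end
end

# Hypothetical Sinatra catch-all route built on it:
#   get '/*' do
#     value = lookup(metrics_tree, request.path_info)
#     halt 404 if value.nil?
#     value.to_json
#   end
```

With a tree such as { 'cpu' => { 'load' => ..., 'cores' => { ... } } }, a request for /cpu/load returns a single value while /cpu returns the whole subtree as a summary.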
This, I believe, is the future of monitoring frameworks in large dynamic virtualized infrastructure estates: it frees us from our scalability woes, sensibly enables us to extend the DRY concept across monitoring, graphing, dashboards and alerting, and finally reduces our toolset for customization and interaction to a decent HTTP library and a JSON parser, the very tools we’re already using and are familiar with when interacting with our configuration management installations and cloud provider APIs.
Installing Nagios on Ubuntu or Debian without Postfix
If you install the default ‘nagios3’ package from the repositories on a Debian-based distribution, you wind up with a full copy of postfix installed. This is fine if you’re simply trying to get the thing to work, but as part of a wider infrastructure you most likely do not want a full-fledged MTA arbitrarily popping up on your Nagios host - an MTA that you have to administer, monitor (!), patch and most importantly secure.
The dependency chain that causes postfix to be installed is:
nagios3 → nagios3-core → nagios3-common → bsd-mailx → default-mta | mail-transport-agent.
Why the package maintainers made bsd-mailx dependent on a fully-fledged MTA I will never know. Perhaps they wanted to ensure things “just worked”? It still seems a bit heavy handed to me, especially when one can configure .mailrc to point to a mailhost and be done with it.
In order to install nagios3 from the repositories and satisfy those dependencies without pulling in postfix you should install the ‘lsb-invalid-mta’ package, which provides ‘mail-transport-agent’ and satisfies the dependency chain above, in place of postfix. The package provides a sendmail binary that does nothing but return a non-zero return code, so you’ll never accidentally send mail from a local system, but you will have to configure your system to take advantage of a suitable MTA host.
Here is some Puppet to install nagios3 without postfix:
# /etc/puppet/modules/nagios-server/manifests/init.pp
#
# Class: nagios-server
#
# This class maintains a Nagios server.
#
# Parameters:
#   None
#
# Requires:
#   nagios-server::install
#
class nagios-server {
  include nagios-server::install

  service { 'apache2':
    ensure  => running,
    enable  => true,
    require => Class['nagios-server::install'],
  }

  service { 'nagios3':
    ensure  => running,
    enable  => true,
    require => Class['nagios-server::install'],
  }
}

# /etc/puppet/modules/nagios-server/manifests/install.pp
#
# Class: nagios-server::install
#
# This class will install a Nagios server from the repo packages
#
# Parameters:
#   None
#
# Requires:
#   Nothing
#
class nagios-server::install {
  # Prevent nagios3-common->mailx dependency from pulling in an MTA.
  package { 'lsb-invalid-mta':
    ensure => present,
  }

  $packages = ['nagios3', 'nagios-images', 'nagios-plugins', 'nagios3-doc',]

  package { $packages:
    ensure  => present,
    require => Package['lsb-invalid-mta'],
  }
}
Automated Deployment 1
Traditionally, automated deployment on UNIX systems has been the domain of cobbled-together shell scripts and bespoke solutions for each site. The power and scriptability of UNIX systems has, in this regard, worked against itself. Compared with the Windows world, which has traditionally been much harder to script, the ease with which the resourceful sysadmin can cobble a bunch of perl/shell scripts together into a "deployment system" invariably leads to a bloated, convoluted and tangled bespoke mess.
The Windows world has tackled the problem head-on by using native and proprietary package installers (and some novel outsiders such as AutoIt) to abstract the whole thing away behind a managed tool.
Enter the automated deployment system.
Now, let me be clear. I’m not talking about systems for deploying large farms of identical or managed boxes. I’m talking about a managed method of getting code, its dependencies, configuration and inevitable subsequent fixes through the various development and testing phases and into production in a managed manner. Tools to handle this job seem to be rather thin on the ground. I’ve managed to identify Capistrano, Puppet, Fabric, SmartFrog and Cfengine.
All of these tools have their disadvantages, and being the healthy pessimist I am I’ll tackle those first. All apart from Cfengine require an external runtime or VM: ruby (Capistrano, Puppet), python (Fabric) or Java (SmartFrog). These may or may not be part of your server build (CoolStack on Solaris anyone?). Still, I guess this is less brittle than the natively compiled Cfengine if you’re not using one of the supported platform variants and versions. It’s strange that the ubiquitous perl 5 is missing from that list. Of the tools, Capistrano and Fabric certainly have the least verbose of the domain-specific languages that all of these tools employ. Of course, as the Puppet team states, creating GUIs for such infinitely configurable tools is a nonsense task best left to those proprietary vendors selling new clothes to various large organisations.
Capistrano also seems to win out for the lower volume box situations, keeping closer to the traditional UNIX philosophy of building small tools to do one thing well. It does have a Rails bias, however. Fabric claims to be bias free, but I’m not in a position to say one way or the other.
I would be interested to hear from anyone who has experience with any of these deployment tools, especially on Solaris and working with Java applications. How do you guys push your code into live in a managed way with easy rollback?

