Deprecating Nagios, Or Why Every Host In Your Estate Should Serve A RESTful API

Posted by sam Sun, 15 Apr 2012 18:38:00 GMT

Large infrastructures built on cloud architectures have already solved the problem of how to manage many thousands of hosts by using configuration management frameworks, such as Chef, Puppet and a bunch of other tools that prefer other underlying runtimes, paradigms or approaches. However, operational monitoring of error conditions across many thousands of instances is generally still handled by some Nagios-like (or latterly Icinga) style system executing local or remote rulesets to test individual conditions and escalate problems through to operations teams and developers. I’ll use Nagios as the typical “traditional” monitoring framework in this post as it is so widely deployed and understood, but I believe the ideas contained here apply equally well to other monitoring systems that follow in this vein.

Whilst it is possible to scale these traditional monitoring systems by variously overloading tests, moving work to instances via remote execution plugins such as NRPE, and by delegating remote test execution to worker hosts, this approach is complicated, does not scale linearly and annexes monitoring and host status to a single monolithic mechanism that is not especially queryable or reportable. Even if one exports notification events to some external data store such as MySQL, the latency between an operational event and action upon that event increases as the monitoring infrastructure grows to support more hosts and a growing ecosystem of tools and dashboards interacting with it. All of this inefficiency comes at the cost of more instances, more data to manage and more data to move around.

Tools such as Chef coupled with useful RESTful public APIs (and this applies to private clouds and other providers as much as it does to Amazon’s EC2) have, and quite rightly, turned infrastructure management into a code and development task, albeit one that benefits from high-level languages and simple concepts and in which the engineer/developer must be grounded somewhat in the real world of machinery and its various foibles. Infrastructure teams Doing It Right today are providing dashboards and “public” APIs as documented and dynamic entry points into the infrastructure on which services execute and persist so that they can use their knowledge, talents and specialization to abstract and inform the rest of the business from these problems.

So, taking all of this progress and best-practice, it seems strange that operational monitoring has not followed in this vein, and generally our sum-total knowledge of the estate that we manage is provided by various scripts executed periodically and aggregated back to a single monolithic system. Sure, Chef certainly has the ability to provide HTTP callbacks within recipes (and indeed, anything that can be accomplished in plain Ruby - which is, well, anything) as part of its execution, and has a system of report handlers that further formalize this. However, it is very unlikely that your configuration management tool of choice is executing with enough frequency to provide useful operational knowledge of things such as daemon failures. And if it does, you’ve just moved your scaling problem from one service to another.

I believe the way forward is to use one of the small web frameworks in your operational language of choice (and for me that’s Ruby) to have every instance host its own API service, and to do this in a very particular way. Let’s consider this trivial 10 minute implementation of a monitoring API server, written using Sinatra:

#!/usr/bin/env ruby
#
# a very simple example of an HTTP status daemon
#
require 'rubygems'
require 'sinatra'
require 'json'

registered_metrics = ['hostname', 'uptime', 'load_average']

helpers do
  def hostname(function)
    name = %x['hostname'].chomp

    case function
    when :status
      # %x[] always returns a String, so test for an empty result rather than nil
      return false if name.empty?
      return true
    when :metric
      return name
    end
  end

  def uptime(function)
    time = IO.read('/proc/uptime').chomp.to_i

    case function
    when :status
      return false if time < 600
      return true
    when :metric
      return time
    end
  end

  def load_average(function)
    lavg = IO.read('/proc/loadavg').chomp

    case function
    when :status
      return false if lavg.split(' ')[0].to_f > 2
      return true
    when :metric
      return lavg
    end
  end
end


def build_hash(type, metrics)
  status = Hash.new
  metrics.each do |metric|
    status[metric] = send(metric.to_sym, type)
  end
  return status
end

get '/status' do
  build_hash(:status, registered_metrics).to_json
end

get '/metrics' do
  build_hash(:metric, registered_metrics).to_json
end

get '/happy' do
  return false.to_json if build_hash(:status, registered_metrics).values.include?(false)
  return true.to_json
end

Start the process on an instance or your workstation, and hit http://localhost:4567/happy. You’ll get back a JSON boolean reflecting whether the box is happy or not: true if your host has a hostname, a load average no higher than 2 and has been up for at least 10 minutes, and false otherwise. Hopefully it is happy, but suppose it was not - let’s pretend your load average is stupidly high and you’re warming a drink on the case. http://localhost:4567/status will return a JSON representation of the boolean state of the three tests, with the state of the load average test reflected in the returned structure. If we call http://localhost:4567/metrics we can examine the actual values.
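The drill-down described above can be sketched from the client side. This is a rough illustration rather than a definitive implementation: the helper names (fetch_json, failing_checks, diagnose) are made up for this example, and it assumes the daemon is listening on Sinatra's default port of 4567.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Fetch a path from a host's status daemon and parse the JSON body.
# Port 4567 is Sinatra's default; adjust if the daemon is bound elsewhere.
def fetch_json(host, path, port = 4567)
  body = Net::HTTP.get(URI("http://#{host}:#{port}#{path}"))
  JSON.parse(body)
end

# Given the hash returned by /status (check name => boolean),
# return the names of the checks that are failing.
def failing_checks(status)
  status.reject { |_name, ok| ok }.keys
end

# Ask the cheap question first and only drill down when the host is unhappy.
# Returns an empty hash for a happy host, otherwise the raw metrics for the
# failing checks. `fetcher` is injectable so the logic can run without a network.
def diagnose(host, fetcher = method(:fetch_json))
  return {} if fetcher.call(host, '/happy')

  failed  = failing_checks(fetcher.call(host, '/status'))
  metrics = fetcher.call(host, '/metrics')
  metrics.select { |name, _value| failed.include?(name) }
end
```

Calling diagnose('localhost') against the daemon above makes one request for a healthy host and three for an unhealthy one, which is the point: the expensive detail is only fetched when it is needed.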

So what have we gained, apart from essentially re-implementing a very basic NRPE-like service over HTTP, rather than over a bespoke TCP protocol, in 10 minutes in Ruby?

The job of determining whether an instance is in a decent state has been offloaded to the instance itself, and we have a lightweight (for the systems monitoring and aggregating) method of interrogating this as a simple boolean value, plus a way of drilling down when we need to. More importantly, we no longer have to aggregate to a single consuming service. There is no reason why a graphing service cannot call http://localhost:4567/metrics across the estate whilst a high-level executive dashboard is polling http://localhost:4567/happy. Even for aggregating services over large estates, we have a lightweight manner of polling to help us scale, with a method of exposing detail when it is required.

We’re exposing information about a host in a universal format using a universal transport mechanism. From pointing a browser at a known port on the instance from your workstation through to writing dashboards, or even integrating with an existing Icinga or Nagios installation via a plugin that can’t be more than 20-30 lines in most modern scripting languages, a single mechanism can service many needs. Using a framework such as Sinatra there is no reason why the toy example above cannot be extended to serve some markup when fetched with the correct MIME-type, and JSON likewise. Instant dashboard-suitable views, built by your instances with data about themselves.
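To give a feel for how small such a plugin could be, here is a minimal sketch of a Nagios-style check against /happy. The function names and message wording are invented for illustration; the exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin return codes. Treating an unreachable daemon as CRITICAL is an assumption you may prefer to change.

```ruby
require 'json'
require 'net/http'

# Standard Nagios plugin return codes (WARNING, 1, is unused here).
OK, CRITICAL, UNKNOWN = 0, 2, 3

# Translate the body of a /happy response into [exit_code, message].
def interpret_happy(body)
  case JSON.parse(body)
  when true  then [OK, 'HAPPY OK - host reports all checks passing']
  when false then [CRITICAL, 'HAPPY CRITICAL - host reports a failing check']
  else            [UNKNOWN, 'HAPPY UNKNOWN - unexpected response body']
  end
rescue JSON::ParserError
  [UNKNOWN, 'HAPPY UNKNOWN - response was not valid JSON']
end

# Poll a host's /happy endpoint. A non-2xx response maps to UNKNOWN;
# a connection failure or timeout maps to CRITICAL (an assumption).
def check_happy(host, port = 4567)
  response = Net::HTTP.start(host, port, open_timeout: 5, read_timeout: 5) do |http|
    http.get('/happy')
  end
  return [UNKNOWN, "HAPPY UNKNOWN - HTTP #{response.code}"] unless response.is_a?(Net::HTTPSuccess)

  interpret_happy(response.body)
rescue StandardError => e
  [CRITICAL, "HAPPY CRITICAL - #{e.class}: #{e.message}"]
end
```

Wired up as a plugin, the script would end by calling check_happy with the host name from the command line, printing the message and exiting with the code.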

Of course there will always be the need for “external” checks for things that a host cannot be relied upon to determine about itself. In a large number of cases, rather than being truly external, what we’re really monitoring is the interaction between tiers that are strictly hierarchical. As such, for many cases it is acceptable to have an instance report the high-level status of the tiers with which it is required to interact. http://localhost:4567/application_servers returning a JSON array of application server instances to which it is possible to make a connection and fetch a status page is not inconceivable.
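A sketch of what could sit behind such an endpoint follows. It only tests that a TCP connection can be opened; fetching the upstream's status page would be the next step. The function names and the shape of the server list (hashes with 'host' and 'port' keys) are assumptions made for this example.

```ruby
require 'json'
require 'socket'

# Return true if a TCP connection to host:port opens within `timeout` seconds.
def reachable?(host, port, timeout = 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  false
end

# Build the data for a hypothetical /application_servers endpoint: the subset
# of the configured upstream servers that this host can currently connect to.
# `checker` is injectable so the selection logic can be exercised offline.
def reachable_servers(servers, checker = method(:reachable?))
  servers.select { |server| checker.call(server['host'], server['port']) }
end
```

In the Sinatra example above, a route for /application_servers could return reachable_servers(list).to_json, where the list of upstream servers would come from wherever your configuration management writes it.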

It is not that much of a leap to take the above example and to extend it to be trivially RESTful, such that one can refer to resources like http://localhost:4567/cpu/load and http://localhost:4567/cpu/cores/1/steal and be returned a value, and also to refer to higher-level collection endpoints to be returned JSON structures as summaries. And from there it is not too much of a leap to extend the concept from purely representing state to manipulating it: http://localhost:4567/service/mysql/restart ..?
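One way to picture the read-only half of this is a nested structure of metrics walked by the segments of the request path. The following is a minimal sketch under that assumption; the resolve function and the sample values are invented for illustration and are not taken from any real host.

```ruby
require 'json'

# Sample data only: a real daemon would build this from /proc and friends.
SAMPLE_METRICS = {
  'cpu' => {
    'load'  => 0.42,
    'cores' => [{ 'steal' => 0.0 }, { 'steal' => 1.5 }]
  }
}.freeze

# Walk a nested metrics structure using the segments of a request path.
# Hash keys are matched by name and array elements by integer index.
# Returns nil as soon as a segment cannot be resolved.
def resolve(tree, path)
  path.split('/').reject(&:empty?).reduce(tree) do |node, segment|
    case node
    when Hash  then node[segment]
    when Array then segment.match?(/\A\d+\z/) ? node[segment.to_i] : nil
    else return nil
    end
  end
end

puts resolve(SAMPLE_METRICS, '/cpu/load').to_json          # a single value
puts resolve(SAMPLE_METRICS, '/cpu/cores/1/steal').to_json # a value inside a collection
puts resolve(SAMPLE_METRICS, '/cpu').to_json               # a collection summary
```

A leaf path returns a single value and a higher-level path returns the whole subtree as a summary, so one catch-all route could serve both, answering with a 404 when resolve returns nil.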

This, I believe, is the future of monitoring frameworks in large dynamic virtualized infrastructure estates as it frees us from our scalability woes, sensibly enables us to extend the DRY concept across monitoring, graphing, dashboards and alerting, and finally reduces our toolset for customization and interaction to a decent HTTP library and a JSON parser, the very tools we’re already using and are familiar with when interacting with our configuration management installations and cloud provider APIs.

Installing Nagios on Ubuntu or Debian without Postfix

Posted by sam Wed, 17 Aug 2011 11:25:00 GMT

If you install the default ‘nagios3’ package from the repositories on a Debian-based distribution, you wind up with a full copy of postfix installed. This is fine if you’re simply trying to get the thing to work, but as part of a wider infrastructure you most likely do not want a full-fledged MTA arbitrarily popping up on your Nagios host - an MTA that you have to administer, monitor (!), patch and most importantly secure.

The dependency chain that causes postfix to be installed is:

nagios3 → nagios3-core → nagios3-common → bsd-mailx → default-mta | mail-transport-agent.

Why the package maintainers made bsd-mailx dependent on a fully-fledged MTA I will never know. Perhaps they wanted to ensure things “just worked”? It still seems a bit heavy handed to me, especially when one can configure .mailrc to point to a mailhost and be done with it.

In order to install nagios3 from the repositories and satisfy those dependencies without pulling in postfix you should install the ‘lsb-invalid-mta’ package, which provides ‘mail-transport-agent’ and satisfies the dependency chain above, in place of postfix. The package provides a sendmail binary that does nothing but return a non-zero return code, so you’ll never accidentally send mail from a local system, but you will have to configure your system to take advantage of a suitable MTA host.

Here is some Puppet to install nagios3 without postfix:

# /etc/puppet/modules/nagios-server/manifests/init.pp
#
# Class: nagios-server
#
# This class maintains a Nagios server.
#
# Parameters:
#       None
#
# Requires:
#       nagios-server::install
#
class nagios-server {
        include nagios-server::install

        service { 'apache2':
                ensure => running,
                enable => true,
                require => Class['nagios-server::install'],
        }

        service { 'nagios3':
                ensure => running,
                enable => true,
                require => Class['nagios-server::install'],
        }
}

# /etc/puppet/modules/nagios-server/manifests/install.pp
#
# Class: nagios-server::install
#
# This class will install a Nagios server from the repo packages
#
# Parameters:
#       None
#
# Requires:
#       Nothing
#
class nagios-server::install {

        # Prevent nagios3-common->mailx dependency from pulling in an MTA.
        package { 'lsb-invalid-mta':
                ensure => present,
        }

        $packages = ['nagios3', 'nagios-images', 'nagios-plugins', 'nagios3-doc',]
        package { $packages:
                ensure => present,
                require => Package['lsb-invalid-mta'],
        }
}

I Keep Arriving Back at Perl

Posted by sam Mon, 07 Feb 2011 23:53:00 GMT

For some reason I keep finding myself writing Perl code. In 2011. Over the last couple of months I’ve written enough Ruby to make my head spin, and yet, in a fix, I find myself back in the arms of Perl.

Recently I needed to parse some Apache virtual host configurations into some Nagios configuration file stanzas, a BIND zonefile and a hosts file for a project for my employer. Perl. On some private boxes not yet big enough to warrant a full Nagios installation I needed a script to run the CLI from the Clickatell SMS Gem in response to various /proc values. Perl.

http://www.flickr.com/photos/reidrac/
Picture credit: reidrac

I like to think that my Perl is well written. I avoid implicits such as $_ and the like, `use strict`, pass “-w” and declare all of my variables up-front. Sometimes I even entertain taint checking. It’s not perfect by any means, and not everybody’s style, but if you’re familiar with a C-like language you can probably read it. If you’ve written pseudo-code, you can read it. If you’re a patterns-infected professional developer you’ll wince a bit. No FactoryFactories here. Object orientation has its place, but it isn’t knocking up a quick 5 minute script to remove all of the commas from a YAML file driven by SOAP calls via wget in a panic.

I don’t think that anybody really pretends that Perl is the best thing to begin a new large-scale development project with today. But that doesn’t mean that Perl is down and out. Whilst Ruby gained frameworks and conferences and spawned religions, Perl sat there in its varied and glorious assorted version 5 point releases on nearly every UNIX-like box on the planet, just waiting for someone to ignore the framework bling and me-too bits and get stuck into good old text manipulation and nicking bits off of CPAN.

I’m by no means a language fanatic; I like all sorts. And I’ll tackle ActiveState or Strawberry if you force me, too. Perhaps it’s familiarity. Perhaps I simply don’t know Ruby or Python or anything else as well as I think I do. Or perhaps for your common or garden UNIX-style, “text as a universal interface” basic string hackage, Perl is the crusty mig-welding old nutter who might just get you home?

Bug in Nagios check_http plugin before 1.4.14 with 301/302 HTTP redirects

Posted by sam Mon, 12 Apr 2010 13:23:00 GMT

There is a nasty bug in the Nagios check_http plugin before version 1.4.14 whereby the leading slash (/) of the URI parameter is left off of the string when encountering a 301 or 302 redirect. This can lead to errors such as:

HTTP WARNING - redirection creates an infinite loop

or others regarding HTTP redirection that is more than 15 levels deep (if you compiled with the default value). From the changelog:

2008-09-01  Holger Weiss 

* plugins/check_http.c: Under some circumstances, the 'url' path of
a redirection target missed a leading slash.  While this was fixed
later on, the incomplete 'url' was used for redirection loop
detection and error messages.  This is now fixed by adding the
missing slash immediately.

git-svn-id: https://nagiosplug.svn.sourceforge.net/svnroot/nagiosplug/nagiosplug/trunk@2049 f882894a-f735-0410-b71e-b25c423dba1c

Here is the full changelog.

Nagios Recurring Scheduled Downtime

Posted by sam Fri, 29 Jan 2010 13:27:00 GMT

I recently tried out Nic le Roux’s sched_downtime Perl script as a quick punt at getting recurring Nagios downtime working. Cool script, and it beats having to hand-roll something to append to the nagios.cmd command pipe file. Why reinvent the wheel, right?

There’s a small problem with the script, on my 3.0.6 install of Nagios anyway, whereby the command string that is produced is missing a field, leading to the commands not being acted upon. See:

[1264767246] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;sam;ssh;1264767600;1264768200;1;600;Sam;Testing service recurring scheduled downtime
[1264771220] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;sam;ssh;1264771800;1264772400;1;0;600;Sam;Testing service recurring scheduled downtime

See the extra field there? The same is true of host scheduled downtime too. If we check the documentation we can see that the extra field is a ‘trigger_id’.

I’ve created a patch that will enable you to use the script against Nagios versions that expect the 3.x format command strings. My patch deals with the possibility of being able to trigger downtime based on other downtime events by conveniently ignoring it, and hardcoding a zero in that field. I didn’t need it, sorry!
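For illustration, here is a minimal sketch of building the 3.x format command string, with the trigger_id field present and defaulting to zero. This is not the patched Perl script: the function name and keyword arguments are invented for this example, and the field order is taken from the second log line above.

```ruby
# Build a Nagios 3.x SCHEDULE_SVC_DOWNTIME external command line.
# Field order: host; service; start; end; fixed; trigger_id; duration; author; comment.
# trigger_id defaults to 0, meaning the downtime is not triggered by another downtime.
def svc_downtime_command(now:, host:, service:, start_time:, end_time:,
                         duration:, author:, comment:, fixed: 1, trigger_id: 0)
  fields = [host, service, start_time, end_time, fixed, trigger_id,
            duration, author, comment]
  "[#{now}] SCHEDULE_SVC_DOWNTIME;#{fields.join(';')}"
end

puts svc_downtime_command(
  now: 1_264_771_220, host: 'sam', service: 'ssh',
  start_time: 1_264_771_800, end_time: 1_264_772_400,
  duration: 600, author: 'Sam',
  comment: 'Testing service recurring scheduled downtime'
)
```

The resulting line would then be appended to the nagios.cmd command pipe, whose path varies by install.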