Deprecating Nagios, Or Why Every Host In Your Estate Should Serve A RESTful API
Large infrastructures built on cloud architectures have already solved the problem of how to manage many thousands of hosts by using configuration management frameworks, such as Chef, Puppet and a bunch of other tools that prefer other underlying runtimes, paradigms or approaches. However, operational monitoring of error conditions across many thousands of instances is generally still handled by some Nagios-like (or latterly Icinga) style system executing local or remote rulesets to test individual conditions and escalate problems through to operations teams and developers. I’ll use Nagios as the typical “traditional” monitoring framework in this post as it is so widely deployed and understood, but I believe the ideas contained here apply equally well to other monitoring systems that follow in this vein.
Whilst it is possible to scale these traditional monitoring systems by variously overloading tests, moving work to instances via remote execution plugins such as NRPE, and by delegating remote test execution to worker hosts, this is approach is complicated, does not scale linearly and annexes monitoring and host status to a single monolithic mechanism that is not especially queriable or reportable. Even if one exports notification events to some external data store such as MySQL, the latency between an operational event and action upon that event increases as the monitoring infrastructure grows to support more hosts and a growing ecosystem of tools and dashboards interacting with it. All of this inefficiency comes at the cost of more instances, more data to manage and more data to move around.
Tools such as Chef coupled with useful RESTful public APIs (and this applies to private clouds and other providers as much as it does to Amazon’s EC2) have, and quite rightly, turned infrastructure management into a code and development task, albeit one that benefits from high-level languages and simple concepts and in which the engineer/developer must be grounded somewhat in the real world of machinery and its various foibles. Infrastructure teams Doing It Right today are providing dashboards and “public” APIs as documented and dynamic entry points into the infrastructure on which services execute and persist so that they can use their knowledge, talents and specialization to abstract and inform the rest of the business from these problems.
So, taking all of this progress and best-practice, it seems strange that operational monitoring has not followed in this vein, and generally our sum-total knowledge of the estate that we manage is provided by various scripts executed periodically and aggregated back to a single monolithic system. Sure, Chef certainly has the ability to provide HTTP callbacks within recipes (and indeed, anything that can be accomplished in plain Ruby - which is, well, anything) as part of its execution, and has a system of report handlers that further formalize this. However, it is very unlikely that your configuration management tool of choice is executing with enough frequency to provide useful operational knowledge of things such as daemon failures. And if it does, you’ve just moved your scaling problem from one service to another.
I believe the way forward is to use one of the small web frameworks in your operational language of choice (and for me that’s Ruby) to have every instance host its own API service, and to do this in a very particular way. Let’s consider this trivial 10 minute implementation of a monitoring API server, written using Sinatra:
#!/usr/bin/env ruby
#
# a very simple example of an HTTP status daemon
#
require 'rubygems'
require 'sinatra'
require 'json'
registered_metrics = ['hostname', 'uptime', 'load_average']
helpers do
def hostname(function)
name = %x['hostname'].chomp
case function
when :status
return false if name == nil
return true
when :metric
return name
end
end
def uptime(function)
time = IO.read('/proc/uptime').chomp.to_i
case function
when :status
return false if time < 600
return true
when :metric
return time
end
end
def load_average(function)
lavg = IO.read('/proc/loadavg').chomp
case function
when :status
return false if lavg.split(' ')[0].to_f > 2
return true
when :metric
return lavg
end
end
end
def build_hash(type, metrics)
status = Hash.new
metrics.each do |metric|
status[metric] = send(metric.to_sym, type)
end
return status
end
get '/status' do
build_hash(:status, registered_metrics).to_json
end
get '/metrics' do
build_hash(:metric, registered_metrics).to_json
end
get '/happy' do
return false.to_json if build_hash(:status, registered_metrics).values.include?(false)
return true.to_json
end
Start the process on an instance or your workstation, and hit http://localhost:4567/happy. If your host has a hostname, a load average below 2 and has been up for more than 10 minutes then you’ll get the JSON string back reflecting whether the box is happy or not. Hopefully it is, but suppose it was not - let’s pretend your load average is stupidly high and you’re warming a drink on the case. http://localhost:4567/status will return a JSON representation of the boolean state of the three tests, with the state of the load average test reflected in the returned structure. If we call http://localhost:4567/metrics we can examine the actual values.
So what have we gained, apart from essentially re-implementing a very basic NRPE-like service over HTTP rather than TCP in 10 minutes in Ruby?
The job of determining whether an instance is in a decent state has been offloaded to the instance itself, and we have a lightweight (for the systems monitoring and aggregating) method of interrogating this as a simple boolean value, plus a way of drilling down when we need to. More importantly, we no longer have to aggregate to a single consuming service. There is no reason why a graphing service cannot call http://localhost:4567/metrics across the estate whilst a high-level executive dashboard is polling http://localhost:4567/happy. Even for aggregating services over large estates, we have a lightweight manner of polling to help us scale, with a method of exposing detail when it is required.
We’re exposing information about a host in a universal format using a universal transport mechanism. From pointing a browser at a known port on the instance from your workstation through to writing dashboards, or even integrating with an existing Icinga or Nagios installation via a plugin that can’t be more than 20-30 lines in most modern scripting languages, a single mechanism can service many needs. Using a framework such as Sinatra there is no reason why the toy example above cannot be extended to serve some markup when fetched with the correct MIME-type, and JSON likewise. Instant dashboard suitable views, built by your instances with data about themselves.
Of course there will always be the need for “external” checks for things that a host cannot be relied upon to determine about itself. In a large number of cases, rather than a being truly external, what we’re really monitoring is the interaction between tiers that are strictly hierarchical. As such, for many cases it is acceptable to have a instance report the high-level status of the tiers with which is is required to interact. http://localhost:4567/application_servers returning a JSON array of application server instances to which it is possible to make a connection and fetch a status page is not inconceivable.
It is not that much of a leap to take the above example and to extend it to be trivially RESTful, such that one can refer to resources like http://localhost:4567/cpu/load and http://localhost:4567/cpu/cores/1/steal and be returned a value, and also to refer to higher-level collection endpoints to be returned JSON structures as summaries. And from there it is not too much of a leap to extend the concept from purely representing state to manipulating it: http://localhost:4567/service/mysql/restart ..?
This, I believe, is the future of monitoring frameworks in large dynamic virtualized infrastructure estates as it both frees us from our scalability woes, sensibly enables us to extend the DRY concept across monitoring, graphing, dashboards and alerting, and finally reduces our toolset for customization and interaction to a decent HTTP library and a JSON parser, the very tools we’re already using and are familiar with when interacting with our configuration management installations and cloud provider APIs.
Trackbacks
Use the following link to trackback from your own site:
http://blog.sam-pointer.com/trackbacks?article_id=121

