Bang For Your Buck on Amazon EC2

Posted by sam Thu, 10 Jan 2013 19:24:00 GMT

I read an interesting paper that was presented at USENIX recently. It is worth going away to read - it’s only five pages long.

The rub, if you are lazy, is that not all instances of a given instance type on Amazon EC2 are created equal. There is a range of gear you can land on when you stand up an m1.xlarge or a c1.medium, and as expected, performance therefore varies (sometimes quite significantly) within an instance type.

This set me thinking. We have Chef and Ohai (although this applies equally to Puppet and Facter) storing a bunch of attributes about our nodes in our index. This could be used to see what the distribution of ‘good’ servers was over our estate and allow us to see where we were not getting the most performant kit for our outlay. This is important: perhaps if we landed on the most performant kit we could remove a few nodes in a tier and save ourselves some money. Or perhaps we might want to avoid a given instance type altogether.

I’m a big fan of using Sinatra to build quick dashes. I’m also a big fan of using Rails to front-up these self-contained dashes, but that is another post for another day. So, I sat down and wrote what amounted to 44 lines of ruby/sinatra using the chart_topper gem, which provides a nice way to work with gruff under Sinatra.

The Sinatra app queries the Chef API and therefore index and pulls out a bunch of CPU profile information. It then uses this to plot a distribution of hardware for each given instance type. The output looks like this.

bang for buck

The greater the surface area of the graph, the more performance we’re getting for our outlay over the whole estate. The further up a given axis, the better the range of kit we have for that instance type across the estate. To me this kind of thing demonstrates the 50% of the value of configuration management tools that most people miss: estate intelligence. We can collect innumerable metrics about individual nodes and whole estates through Ohai and Facter. Exposing this data in meaningful ways is a return on the investment in configuration management that the business as a whole can see, use and understand.
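The tallying behind a chart like this is tiny. Here is a minimal sketch of the grouping step, assuming Ohai’s `ec2` and `cpu` attribute layout; `tally_cpus` is a name of my own invention and the `nodes` array stands in for the results of a real Chef search:

```ruby
# Tally which CPU models we have landed on for each EC2 instance type,
# given an array of node attribute hashes as indexed from Ohai data.
def tally_cpus(nodes)
  dist = {}
  nodes.each do |node|
    itype = node['ec2']['instance_type']
    model = node['cpu']['0']['model_name']
    dist[itype] ||= {}
    dist[itype][model] = dist[itype].fetch(model, 0) + 1
  end
  dist
end

# A stand-in for real search results from the Chef index
nodes = [
  { 'ec2' => { 'instance_type' => 'm1.xlarge' },
    'cpu' => { '0' => { 'model_name' => 'Intel(R) Xeon(R) CPU E5507' } } },
  { 'ec2' => { 'instance_type' => 'm1.xlarge' },
    'cpu' => { '0' => { 'model_name' => 'Intel(R) Xeon(R) CPU E5645' } } },
]

puts tally_cpus(nodes).inspect
```

From a structure like this, plotting a per-instance-type distribution with gruff is a few lines more.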

Moreover, these kinds of views are compelling insofar as they are “real-time”, where real-time is your client execution interval and splay period.

I’m speaking on Tuesday the 15th of January at the LDN Devops conference in London on this and a number of other topics regarding Chef implementations at scale. If you can’t attend, then follow #ldndevops on Twitter and look out for the live stream.

Deprecating Nagios, Or Why Every Host In Your Estate Should Serve A RESTful API

Posted by sam Sun, 15 Apr 2012 18:38:00 GMT

Large infrastructures built on cloud architectures have already solved the problem of how to manage many thousands of hosts by using configuration management frameworks, such as Chef, Puppet and a bunch of other tools that prefer other underlying runtimes, paradigms or approaches. However, operational monitoring of error conditions across many thousands of instances is generally still handled by some Nagios-like (or latterly Icinga) style system executing local or remote rulesets to test individual conditions and escalate problems through to operations teams and developers. I’ll use Nagios as the typical “traditional” monitoring framework in this post as it is so widely deployed and understood, but I believe the ideas contained here apply equally well to other monitoring systems that follow in this vein.

Whilst it is possible to scale these traditional monitoring systems by variously overloading tests, moving work to instances via remote execution plugins such as NRPE, and by delegating remote test execution to worker hosts, this approach is complicated, does not scale linearly and annexes monitoring and host status to a single monolithic mechanism that is not especially queryable or reportable. Even if one exports notification events to some external data store such as MySQL, the latency between an operational event and action upon that event increases as the monitoring infrastructure grows to support more hosts and a growing ecosystem of tools and dashboards interacting with it. All of this inefficiency comes at the cost of more instances, more data to manage and more data to move around.

Tools such as Chef coupled with useful RESTful public APIs (and this applies to private clouds and other providers as much as it does to Amazon’s EC2) have quite rightly turned infrastructure management into a code and development task, albeit one that benefits from high-level languages and simple concepts, and in which the engineer/developer must be grounded somewhat in the real world of machinery and its various foibles. Infrastructure teams Doing It Right today provide dashboards and “public” APIs as documented and dynamic entry points into the infrastructure on which services execute and persist, using their knowledge, talents and specialization to abstract these problems away from the rest of the business while keeping it informed.

So, taking all of this progress and best-practice, it seems strange that operational monitoring has not followed in this vein, and generally our sum-total knowledge of the estate that we manage is provided by various scripts executed periodically and aggregated back to a single monolithic system. Sure, Chef has the ability to provide HTTP callbacks within recipes (and indeed, anything that can be accomplished in plain Ruby - which is, well, anything) as part of its execution, and has a system of report handlers that further formalize this. However, it is very unlikely that your configuration management tool of choice is executing with enough frequency to provide useful operational knowledge of things such as daemon failures. And if it does, you’ve just moved your scaling problem from one service to another.

I believe the way forward is to use one of the small web frameworks in your operational language of choice (and for me that’s Ruby) to have every instance host its own API service, and to do this in a very particular way. Let’s consider this trivial 10 minute implementation of a monitoring API server, written using Sinatra:

#!/usr/bin/env ruby
# a very simple example of an HTTP status daemon
require 'rubygems'
require 'sinatra'
require 'json'

registered_metrics = ['hostname', 'uptime', 'load_average']

helpers do
  def hostname(function)
    name = %x[hostname].chomp

    case function
    when :status
      return false if name.empty?
      return true
    when :metric
      return name
    end
  end

  def uptime(function)
    time ='/proc/uptime').chomp.to_i

    case function
    when :status
      return false if time < 600
      return true
    when :metric
      return time
    end
  end

  def load_average(function)
    lavg ='/proc/loadavg').chomp

    case function
    when :status
      return false if lavg.split(' ')[0].to_f > 2
      return true
    when :metric
      return lavg
    end
  end

  def build_hash(type, metrics)
    status = {}
    metrics.each do |metric|
      status[metric] = send(metric.to_sym, type)
    end
    return status
  end
end

get '/status' do
  build_hash(:status, registered_metrics).to_json
end

get '/metrics' do
  build_hash(:metric, registered_metrics).to_json
end

get '/happy' do
  return false.to_json if build_hash(:status, registered_metrics).values.include?(false)
  return true.to_json
end
Start the process on an instance or your workstation, and hit http://localhost:4567/happy. If your host has a hostname, a load average below 2 and has been up for more than 10 minutes then you’ll get the JSON string back reflecting whether the box is happy or not. Hopefully it is, but suppose it was not - let’s pretend your load average is stupidly high and you’re warming a drink on the case. http://localhost:4567/status will return a JSON representation of the boolean state of the three tests, with the state of the load average test reflected in the returned structure. If we call http://localhost:4567/metrics we can examine the actual values.

So what have we gained, apart from essentially re-implementing a very basic NRPE-like service over HTTP rather than TCP in 10 minutes in Ruby?

The job of determining whether an instance is in a decent state has been offloaded to the instance itself, and we have a lightweight (for the systems monitoring and aggregating) method of interrogating this as a simple boolean value, plus a way of drilling down when we need to. More importantly, we no longer have to aggregate to a single consuming service. There is no reason why a graphing service cannot call http://localhost:4567/metrics across the estate whilst a high-level executive dashboard is polling http://localhost:4567/happy. Even for aggregating services over large estates, we have a lightweight manner of polling to help us scale, with a method of exposing detail when it is required.

We’re exposing information about a host in a universal format using a universal transport mechanism. From pointing a browser at a known port on the instance from your workstation through to writing dashboards, or even integrating with an existing Icinga or Nagios installation via a plugin that can’t be more than 20-30 lines in most modern scripting languages, a single mechanism can service many needs. Using a framework such as Sinatra there is no reason why the toy example above cannot be extended to serve some markup when fetched with the correct MIME-type, and JSON likewise. Instant dashboard-suitable views, built by your instances with data about themselves.
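To give a feel for how small that Nagios integration could be, here is a rough sketch of a check against `/happy`. The exit codes follow the standard Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN); the helper name `happy_to_nagios` and the host/port arguments are my own assumptions, not part of any existing plugin:

```ruby
require 'net/http'
require 'json'

# Map the JSON boolean returned by /happy onto Nagios plugin exit codes:
# 0 = OK, 2 = CRITICAL, 3 = UNKNOWN (unreachable or unparseable response).
def happy_to_nagios(body)
  case JSON.parse(body)
  when true  then 0
  when false then 2
  else 3
  end
rescue JSON::ParserError
  3
end

# Invoked as a plugin with a hostname argument, e.g.: check_happy.rb web01
if __FILE__ == $0 && ARGV[0]
  body = Net::HTTP.get(ARGV[0], '/happy', 4567)
  status = happy_to_nagios(body)
  puts(status.zero? ? 'HAPPY OK' : 'HAPPY CRITICAL')
  exit status
end
```

Wire that up as a Nagios command definition and the monolith consumes the same endpoint as every other consumer on the estate.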

Of course there will always be the need for “external” checks for things that a host cannot be relied upon to determine about itself. In a large number of cases, rather than being truly external, what we’re really monitoring is the interaction between tiers that are strictly hierarchical. As such, for many cases it is acceptable to have an instance report the high-level status of the tiers with which it is required to interact. http://localhost:4567/application_servers returning a JSON array of application server instances to which it is possible to make a connection and fetch a status page is not inconceivable.

It is not that much of a leap to take the above example and to extend it to be trivially RESTful, such that one can refer to resources like http://localhost:4567/cpu/load and http://localhost:4567/cpu/cores/1/steal and be returned a value, and also to refer to higher-level collection endpoints to be returned JSON structures as summaries. And from there it is not too much of a leap to extend the concept from purely representing state to manipulating it: http://localhost:4567/service/mysql/restart ..?
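As a sketch of how those nested resources might hang together (the helper names `loadavg_tree` and `resolve` are inventions of mine, not part of Sinatra or any other framework): build a nested hash from `/proc/loadavg` and walk it with the request path, returning a bare value for a leaf and a JSON summary for a collection:

```ruby
require 'json'

# Parse the first three fields of a /proc/loadavg line into a nested
# structure that maps naturally onto URL paths such as /cpu/load/5.
def loadavg_tree(line)
  one, five, fifteen = line.split(' ')[0, 3].map(&:to_f)
  { 'cpu' => { 'load' => { '1' => one, '5' => five, '15' => fifteen } } }
end

# Resolve a request path against the tree: a leaf comes back as a bare
# value, a nested hash as a JSON summary of the collection.
def resolve(tree, path)
  node = path.split('/').reject(&:empty?)
             .reduce(tree) { |t, k| t.is_a?(Hash) ? t[k] : nil }
  node.is_a?(Hash) ? node.to_json : node
end

tree = loadavg_tree("0.15 0.10 0.05 1/123 4567\n")
puts resolve(tree, '/cpu/load/5')   # the five-minute load average
puts resolve(tree, '/cpu/load')     # JSON summary of the collection
```

Hooked up behind a splat route (`get '/cpu/*'` in Sinatra), the same walk serves every depth of the hierarchy without a route per metric.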

This, I believe, is the future of monitoring frameworks in large dynamic virtualized infrastructure estates: it frees us from our scalability woes, sensibly extends the DRY concept across monitoring, graphing, dashboards and alerting, and reduces our toolset for customization and interaction to a decent HTTP library and a JSON parser - the very tools we’re already using and are familiar with when interacting with our configuration management installations and cloud provider APIs.

formtastic-bootstrap with Rails 3.2 and Twitter Bootstrap 2.0

Posted by sam Sun, 12 Feb 2012 11:23:00 GMT


The Ruby world moves at an astounding pace. Pat Shaughnessy wrote an excellent series of articles in December 2011 documenting the options available for using Twitter’s Bootstrap framework version 1.3 with Rails 3.1. At the time of writing Bootstrap has moved on to version 2.0, Rails is on 3.2.1 and Pat’s example application no longer builds as described.

A Slight Digression

In the rest of this post I’ll explain what little needs to be done if you’d like to follow those articles but use Rails 3.2 and Bootstrap 2.0, but first a quick digression on Bootstrap.

For the visually inept and graphically challenged amongst us, a set of professional and consistent design elements is a God-send. I’ve been using Perl ( through to the later frameworks) and Rails since version 1.x to generate front-ends and dashboards and the like for all sorts of Infrastructure and traditional sysadmin tasks.

It just so happens that the Devops world follows, in part, the same route: assuming that system administrators can develop something other than shell script splattered with global variables, adopting Ruby, the language in which the most prominent tools are built, and absorbing a huge amount from the Rails world: be it RESTful web services, rapid development or DRY. So, it is nice to finally be able to produce tools that look nice if for no other reason than quite often some fantastic operational work is trivialized and missed as it’s fronted by a bunch of crap in cgi-bin spitting out table elements.

Digression over, here’s what you need to do.


cgunther has kindly done the hard work in getting mjbellantoni’s formtastic-bootstrap working with Bootstrap 2.0 in his bootstrap2-rails3-2-formtastic-2-1 branch. It requires the 2.1 version of formtastic, which is still at the release-candidate stage at the time of writing, but the following in your Gemfile should do it:

gem 'formtastic', :git => 'git://', :branch => '2.1-stable'
gem 'formtastic-bootstrap', :git => '', :branch => 'bootstrap2-rails3-2-formtastic-2-1'

Also, when editing your ./app/views/layouts/application.html.erb you should use the new Bootstrap 2.0 classes:


<div class="navbar navbar-fixed-top">
  <div class="navbar-inner">
    <div class="container">
      <a class="brand" href="#">OrigamiHub</a>
      <div class="nav-collapse">
        <%= tabs %>
      </div>
    </div>
  </div>
</div>

<div class="container">
  <%= yield %>
</div>

Finally, you should heed formtastic’s deprecation warnings and construct your semantic forms thus:

<%= semantic_form_for @widget do |f| %>
  <%= f.semantic_errors %>
  <%= f.inputs do %>
    <%= f.input :name, :hint => "The wangdoodle is a best-seller" %>
    <%= f.input :type, :hint => "We only do three sizes!" %>
  <% end %>
  <%= f.actions do %>
    <%= f.action(:submit) %>
  <% end %>
<% end %>


You’ll need to use the:

config.tabs_ul_class = "nav nav-pills"

option within tabulous to get the navigation bar to behave properly, along with the options recommended in the original article.


I’m sure the work above will be merged back into the mainline of each of the respective gems, and it’s a tribute to the GitHub community that such a lot of good work is given and fixed freely.

From the vault: "A script to monitor log files and add persistent offenders to /etc/hosts.deny"

Posted by sam Sat, 04 Feb 2012 18:56:00 GMT

I’ve been sorting through some old code, and apparently on the 10th of September 2008 at 08:49 I felt compelled to write a daemon in Perl that would add persistently connecting source IPs to hosts.deny if they continually abused sshd.

I remember doing this: I was between jobs and somebody somewhere whom I’ve long abandoned to my mail archives asked for it, and I had nothing better to do. So, for the sake of posterity, here is probably the last piece of significant Perl I ever wrote before making the move to Ruby - make of it what you will:

#!/usr/bin/perl -w
# - A script to monitor log files and add persistent offenders to /etc/hosts.deny
# Author:       Sam Pointer
# Contact:
# Version:      0.03
# Usage:
# Should a given IP address connect more than the specified number of times, add
# it to the TCP wrappers host.deny file.
# Note that this script simply parses the ssh_log file for the number of failures
# for a given IP address, tests that against a threshold, and adds a tcp wrappers
# rule if that threshold is exceeded. Therefore, if your log files roll around
# daily, more than $max_connections failures in that 24-hour period will cause a
# rule to be generated.
# It will look for and add a rule in the format:
#       sshd : : deny
# See the hosts_access manpages or TCP Wrappers documentation for more information.
# Generally the script should be invoked as UID 0 (root user), due to the permissions
# set on the hosts.deny and log files to be scanned.
# The configuration option @exception_list contains a list of full or partial IP
# addresses that will never be blocked. See the example configuration for the format.
# When started the script will detach from the console and become a daemon. It can
# be terminated via a SIGTERM/signal 15.
# I've tested this on my own machine and it works fine. Change the configuration below
# to some files you can afford to lose first. Only $deny_file is opened for writing,
# so to test I copied that to my home directory, set the path there and checked that
# the rules added were correct, without affecting my live /etc/hosts.deny file.

# -- Configuration ----------------------------
my $max_connections     = '3';                  # Maximum number of denied connections
my $failure_string      = 'Failed password';        # Always present on a failed connection
my $ssh_log             = '/var/log/secure';    # Log file to scan for failed connections
my $deny_file           = '/home/hosts.deny';    # Location of TCP Wrappers host.deny file
my $daemon_list         = 'ALL';                # daemon_list string to add to hosts.deny. See hosts_access(5)
my $sleep_period        = '30';                 # Value in seconds to sleep before parsing log again. Adjust to suit system load.

# A list of full or partial IP addresses to never block
my @exception_list      = ('192.168', '');
# ---------------------------------------------

# Internal Global Variables
my ($record);
my (%failed, %blocked);
my $ip_regex = '\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b';

# Perlisms
use strict;
use POSIX qw(setsid);                           # Required to daemonize the script

# Make ourselves into a daemon
daemonize();

# Loop perpetually as a daemon
while(1) {
        # Get a list of failed user IP addresses from the log file
        %failed = get_failed($ssh_log);

        # Get a list of what's already blocked
        %blocked = get_already_blocked($deny_file);

        # Move through the keys of our failure hash. Anything with
        # more than $max_connections should be written to the file,
        # providing it is not already blocked by a rule and is not in our exceptions list.
        open (DENYFILE, ">>$deny_file") || die "ssh-deny: Cannot open $deny_file for writing\n";

        foreach $record (keys(%failed)) {
                if ( ! $blocked{$record} && $failed{$record} > $max_connections) {
                        print DENYFILE "$daemon_list : $record \n";
                }
        }

        close DENYFILE;

        sleep $sleep_period;
}

# -- Subroutines ------------------------------
sub get_already_blocked {
# This subroutine will parse $deny_file looking for all IPs
# in a rule that matches $daemon_list. These are returned as
# a hash for matching against later on

        # Local Variables
        my ($deny, $record);
        my (@fields);
        my (%already_denied);

        # We expect a deny file to be passed. Open or die.
        $deny = pop(@_);
        open (DENY, "$deny") || die "ssh-deny: Cannot open $deny for reading\n";

        # Move through the file. For any rule that matches $daemon_list get the IPs
        while ($record = <DENY>) {
                @fields = split(/ /, $record);
                if ($fields[0] eq $daemon_list &&
                    $fields[2] =~ $ip_regex) {
                        $already_denied{$fields[2]} = 1;
                }
        }
        close DENY;

        return %already_denied;
}

sub get_failed {
# This subroutine retrieves a list of failed logins and returns a hash of IPs and
# failed connection counts.

        # Local Variable declarations
        my ($log, $record, $marked, $exception);
        my (@failure_records, @fields);
        my (%failure_stats);

        # We expect a path to be passed. Open the file or fail.
        $log = pop @_;
        open (LOG, "$log") || die "ssh-deny: Cannot open $log for reading\n";

        # Iterate through the log file, selecting rows that have failed connections
        while ($record = <LOG>) {
                if ($record =~ $failure_string) {
                        push @failure_records, $record;
                }
        }

        # Close the log file
        close LOG;

        # Build a hash for each IP that has a connection
        while ($record = pop(@failure_records)) {
                @fields = split(/ /,$record);           # Field index 13 is the IP address
                $failure_stats{$fields[13]}++;          # Increase failure counter for IP
        }

        # If any of the failed connections are in our @exception_list (never block)
        # ensure that the IP is deleted from the hash so that they aren't blocked.
        foreach $exception (@exception_list) {
                foreach $marked (keys(%failure_stats)) {
                        if ($marked =~ /$exception/) {
                                delete $failure_stats{$marked};
                        }
                }
        }

        # Return our hash with IP addresses as keys, and counts for each IP address as values
        return %failure_stats;
}

sub daemonize {
# This subroutine handles detaching the console, forking a new process, etc.
        chdir('/')                      || die "ssh-deny: Cannot chdir to /\n";
        open (STDIN, '/dev/null')       || die "ssh-deny: Cannot read /dev/null\n";

        # Uncomment the next two lines if you really don't want this to
        # echo anything out as a daemon. Probably best to leave it
        # so any error messages make it to the console.
        #open (STDOUT,'>>/dev/null')    || die "ssh-deny: Cannot write to /dev/null\n";
        #open (STDERR,'>>/dev/null')    || die "ssh-deny: Cannot write to /dev/null\n";

        defined(my $pid = fork)         || die "ssh-deny: Cannot fork\n";
        exit if $pid;
        setsid                          || die "ssh-deny: Cannot start a new session\n";
        umask 0;
}