I read an interesting paper that was presented at USENIX recently. It is worth going away to read it - it’s only five pages long.
The rub, if you are lazy, is that not all instances of a given instance type on Amazon EC2 are created equal. There is a range of gear you can land on when you stand up an m1.xlarge or a c1.medium, and as expected, performance therefore varies (sometimes quite significantly) within an instance type.
This set me thinking. We have Chef and Ohai (although this applies equally to Puppet and Facter) storing a bunch of attributes about our nodes in our index. This could be used to see what the distribution of ‘good’ servers was over our estate and allow us to see where we were not getting the most performant kit for our outlay. This is important, as perhaps if we landed on the most performant kit we could remove a few nodes in a tier and save ourselves some money. Or perhaps we might want to avoid a given instance type altogether.
I’m a big fan of using Sinatra to build quick dashes. I’m also a big fan of using Rails to front-up these self-contained dashes, but that is another post for another day. So, I sat down and wrote what amounted to 44 lines of ruby/sinatra using the chart_topper gem, which provides a nice way to work with gruff under Sinatra.
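For flavour, here is a sketch of the shape of the thing rather than the original 44 lines: it searches the Chef index for nodes of a given instance type, buckets them by CPU model, and serves a chart. I have used gruff’s bar chart directly to keep the example short (the real thing used chart_topper), and the config path and attribute names are the standard Chef/Ohai ones, but check them against your own setup:

#!/usr/bin/env ruby
# a sketch, not the original app: search the Chef index for nodes of a
# given EC2 instance type, count CPU models, and serve a chart as a PNG
require 'rubygems'
require 'sinatra'
require 'chef'
require 'chef/search/query'
require 'gruff'

# point at your client.rb (or knife.rb on a workstation)
Chef::Config.from_file('/etc/chef/client.rb')

get '/distribution/:instance_type' do
  # search returns [rows, start, total]; we only want the rows
  nodes = Chef::Search::Query.new.search(
    :node, "ec2_instance_type:#{params[:instance_type]}"
  ).first

  # count how many instances landed on each CPU model
  models = Hash.new(0)
  nodes.each { |n| models[n['cpu']['0']['model_name'].to_s.strip] += 1 }

  chart = Gruff::Bar.new(640)
  chart.title = "CPU models: #{params[:instance_type]}"
  labels = {}
  models.keys.each_with_index { |model, i| labels[i] = model }
  chart.labels = labels
  chart.data(params[:instance_type], models.values)

  content_type 'image/png'
  chart.to_blob
end

Hit /distribution/m1.xlarge and you get a picture of the spread for that instance type.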
The Sinatra app queries the Chef API, and therefore the index, and pulls out a bunch of CPU profile information. It then uses this to plot a distribution of hardware for each given instance type. The output looks like this.
The greater the surface area of the graph, the more performance we’re getting for our outlay over the whole estate. The further up a given axis, the better the range of kit we have for that instance type over the estate. To me this kind of thing demonstrates the 50% of configuration management’s value that most people miss: estate intelligence. We can collect innumerable metrics about individual nodes and whole estates through Ohai and Facter. Exposing this data in meaningful ways is a return on the investment in configuration management that the business as a whole can see, use and understand.
Moreover, these kinds of views are compelling insofar as they are “real-time”, where real-time is your client execution interval and splay period.
I’m speaking Tuesday 15th of January at the LDN Devops conference in London on this and a number of other topics regarding Chef implementations at scale. If you can’t attend, then follow #ldndevops on Twitter and look out for the live stream.
Large infrastructures built on cloud architectures have already solved the problem of how to manage many thousands of hosts by using configuration management frameworks, such as Chef, Puppet and a bunch of other tools that prefer other underlying runtimes, paradigms or approaches. However, operational monitoring of error conditions across many thousands of instances is generally still handled by some Nagios-like (or, latterly, Icinga-style) system executing local or remote rulesets to test individual conditions and escalate problems through to operations teams and developers. I’ll use Nagios as the typical “traditional” monitoring framework in this post as it is so widely deployed and understood, but I believe the ideas contained here apply equally well to other monitoring systems that follow in this vein.
Whilst it is possible to scale these traditional monitoring systems by variously overloading tests, moving work to instances via remote execution plugins such as NRPE, and delegating remote test execution to worker hosts, this approach is complicated, does not scale linearly and annexes monitoring and host status to a single monolithic mechanism that is not especially queryable or reportable. Even if one exports notification events to some external data store such as MySQL, the latency between an operational event and action upon that event increases as the monitoring infrastructure grows to support more hosts and a growing ecosystem of tools and dashboards interacting with it. All of this inefficiency comes at the cost of more instances, more data to manage and more data to move around.
Tools such as Chef, coupled with useful RESTful public APIs (and this applies to private clouds and other providers as much as it does to Amazon’s EC2), have quite rightly turned infrastructure management into a code and development task, albeit one that benefits from high-level languages and simple concepts, and in which the engineer/developer must be grounded somewhat in the real world of machinery and its various foibles. Infrastructure teams Doing It Right today are providing dashboards and “public” APIs as documented, dynamic entry points into the infrastructure on which services execute and persist, so that they can use their knowledge, talents and specialization to abstract these problems away from the rest of the business while keeping it informed.
So, taking all of this progress and best practice, it seems strange that operational monitoring has not followed in this vein, and generally our sum-total knowledge of the estate that we manage is provided by various scripts executed periodically and aggregated back to a single monolithic system. Sure, Chef has the ability to provide HTTP callbacks within recipes (and indeed, anything that can be accomplished in plain Ruby, which is, well, anything) as part of its execution, and has a system of report handlers that further formalizes this. However, it is very unlikely that your configuration management tool of choice is executing with enough frequency to provide useful operational knowledge of things such as daemon failures. And if it does, you’ve just moved your scaling problem from one service to another.
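To be concrete about what one of those report handlers looks like, here’s a minimal sketch. The collector endpoint is entirely hypothetical, and you’d register the handler in client.rb with something like report_handlers << RunReporter.new:

# a minimal sketch of a Chef report handler; the collector host and
# path are placeholders, not a real service
require 'chef/handler'
require 'net/http'
require 'json'

class RunReporter < Chef::Handler
  def report
    payload = {
      'node'    => node.name,
      'success' => run_status.success?,
      'elapsed' => run_status.elapsed_time
    }
    # POST the run outcome to wherever you aggregate (hypothetical)
    http = Net::HTTP.new('monitor.example.com', 80)
    http.post('/runs', payload.to_json, 'Content-Type' => 'application/json')
  end
end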
I believe the way forward is to use one of the small web frameworks in your operational language of choice (and for me that’s Ruby) to have every instance host its own API service, and to do this in a very particular way. Let’s consider this trivial 10 minute implementation of a monitoring API server, written using Sinatra:
#!/usr/bin/env ruby
#
# a very simple example of an HTTP status daemon
#
require 'rubygems'
require 'sinatra'
require 'json'

registered_metrics = ['hostname', 'uptime', 'load_average']

helpers do
  # each metric helper answers two questions: am I healthy? (:status)
  # and what is my value? (:metric)
  def hostname(function)
    name = %x[hostname].chomp
    case function
    when :status
      return false if name.empty?
      return true
    when :metric
      return name
    end
  end

  def uptime(function)
    time = IO.read('/proc/uptime').chomp.to_i
    case function
    when :status
      return false if time < 600
      return true
    when :metric
      return time
    end
  end

  def load_average(function)
    lavg = IO.read('/proc/loadavg').chomp
    case function
    when :status
      # first field is the one-minute load average
      return false if lavg.split(' ').first.to_f > 2
      return true
    when :metric
      return lavg
    end
  end
end

# call each registered metric in either :status or :metric mode and
# collect the results into a hash ready for serialization
def build_hash(type, metrics)
  status = Hash.new
  metrics.each do |metric|
    status[metric] = send(metric.to_sym, type)
  end
  return status
end

get '/status' do
  build_hash(:status, registered_metrics).to_json
end

get '/metrics' do
  build_hash(:metric, registered_metrics).to_json
end

get '/happy' do
  return false.to_json if build_hash(:status, registered_metrics).values.include?(false)
  return true.to_json
end
Start the process on an instance or your workstation, and hit http://localhost:4567/happy. If your host has a hostname, a load average below 2 and has been up for more than 10 minutes then you’ll get the JSON string back reflecting whether the box is happy or not. Hopefully it is, but suppose it was not - let’s pretend your load average is stupidly high and you’re warming a drink on the case. http://localhost:4567/status will return a JSON representation of the boolean state of the three tests, with the state of the load average test reflected in the returned structure. If we call http://localhost:4567/metrics we can examine the actual values.
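To illustrate (the values here are invented), a host in that state might answer along these lines:

$ curl http://localhost:4567/happy
false
$ curl http://localhost:4567/status
{"hostname":true,"uptime":true,"load_average":false}
$ curl http://localhost:4567/metrics
{"hostname":"web-01","uptime":86400,"load_average":"9.53 8.77 7.20 3/142 12345"}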
So what have we gained, apart from essentially re-implementing a very basic NRPE-like service over HTTP rather than TCP in 10 minutes in Ruby?
The job of determining whether an instance is in a decent state has been offloaded to the instance itself, and we have a lightweight (for the systems monitoring and aggregating) method of interrogating this as a simple boolean value, plus a way of drilling down when we need to. More importantly, we no longer have to aggregate to a single consuming service. There is no reason why a graphing service cannot call http://localhost:4567/metrics across the estate whilst a high-level executive dashboard is polling http://localhost:4567/happy. Even for aggregating services over large estates, we have a lightweight manner of polling to help us scale, with a method of exposing detail when it is required.
We’re exposing information about a host in a universal format using a universal transport mechanism. From pointing a browser at a known port on the instance from your workstation through to writing dashboards, or even integrating with an existing Icinga or Nagios installation via a plugin that can’t be more than 20-30 lines in most modern scripting languages, a single mechanism can service many needs. Using a framework such as Sinatra there is no reason why the toy example above cannot be extended to serve some markup when fetched with the correct MIME-type, and JSON likewise. Instant dashboard suitable views, built by your instances with data about themselves.
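To make the plugin claim concrete, here is a sketch of roughly what such a check might look like; the exit codes follow the Nagios convention, and the host and port arguments are illustrative:

#!/usr/bin/env ruby
# a sketch of a Nagios/Icinga check: ask a host's /happy endpoint
# whether it is content and translate the answer into an exit code
require 'net/http'

host = ARGV[0] || 'localhost'
port = (ARGV[1] || 4567).to_i

begin
  body = Net::HTTP.get(host, '/happy', port).strip
  if body == 'true'
    puts "OK: #{host} reports happy"
    exit 0
  else
    puts "CRITICAL: #{host} reports unhappy"
    exit 2
  end
rescue StandardError => e
  puts "UNKNOWN: could not query #{host}:#{port} (#{e.message})"
  exit 3
end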
Of course there will always be the need for “external” checks for things that a host cannot be relied upon to determine about itself. In a large number of cases, though, rather than being truly external, what we’re really monitoring is the interaction between tiers that are strictly hierarchical. As such, for many cases it is acceptable to have an instance report the high-level status of the tiers with which it is required to interact. http://localhost:4567/application_servers returning a JSON array of application server instances to which it is possible to make a connection and fetch a status page is not inconceivable.
It is not that much of a leap to take the above example and to extend it to be trivially RESTful, such that one can refer to resources like http://localhost:4567/cpu/load and http://localhost:4567/cpu/cores/1/steal and be returned a value, and also to refer to higher-level collection endpoints to be returned JSON structures as summaries. And from there it is not too much of a leap to extend the concept from purely representing state to manipulating it: http://localhost:4567/service/mysql/restart ..?
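The read-only half of that leap might amount to no more than a couple of extra routes in the toy app above; the paths and structure here are illustrative, not a finished design:

# routes to add to the example app above
get '/cpu/load' do
  # a single resource returns a single plain value:
  # the one-minute load average, first field of /proc/loadavg
  IO.read('/proc/loadavg').split(' ').first
end

get '/cpu' do
  # a higher-level collection endpoint returns a JSON summary
  one, five, fifteen = IO.read('/proc/loadavg').split(' ')[0..2]
  { 'load' => { '1m' => one, '5m' => five, '15m' => fifteen } }.to_json
end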
This, I believe, is the future of monitoring frameworks in large dynamic virtualized infrastructure estates as it both frees us from our scalability woes, sensibly enables us to extend the DRY concept across monitoring, graphing, dashboards and alerting, and finally reduces our toolset for customization and interaction to a decent HTTP library and a JSON parser, the very tools we’re already using and are familiar with when interacting with our configuration management installations and cloud provider APIs.
The Ruby world moves at an astounding pace. Pat Shaughnessy wrote an excellent series of articles in December 2011 documenting the options available for using Twitter’s Bootstrap framework version 1.3 with Rails 3.1. At the time of writing, Bootstrap has moved on to version 2.0, Rails is on 3.2.1 and Pat’s example application no longer builds as described.
A Slight Digression
In the rest of this post I’ll explain what little needs to be done if you’d like to follow those articles but use Rails 3.2 and Bootstrap 2.0, but first a quick digression on Bootstrap.
For the visually inept and graphically challenged amongst us, a set of professional and consistent design elements is a godsend. I’ve been using Perl (CGI.pm through to the later frameworks) and Rails since version 1.x to generate front-ends and dashboards and the like for all sorts of infrastructure and traditional sysadmin tasks.
It just so happens that the Devops world follows, in part, the same route: assuming that system administrators can develop something other than shell script splattered with global variables, adopting Ruby, the language in which the most prominent tools are built, and absorbing a huge amount from the Rails world, be it RESTful web services, rapid development or DRY. So, it is nice to finally be able to produce tools that look nice, if for no other reason than that quite often some fantastic operational work is trivialized and missed because it’s fronted by a bunch of crap in cgi-bin spitting out table elements.
Digression over, here’s what you need to do.
cgunther has kindly done the hard work in getting mjbellantoni’s formtastic-bootstrap working with Bootstrap 2.0 in his bootstrap2-rails3-2-formtastic-2-1 branch. It requires version 2.1 of formtastic, which is still a release candidate at the time of writing, but the following in your Gemfile should do it:
gem 'formtastic', :git => 'git://github.com/justinfrench/formtastic.git', :branch => '2.1-stable'
gem 'formtastic-bootstrap', :git => 'https://github.com/cgunther/formtastic-bootstrap.git', :branch => 'bootstrap2-rails3-2-formtastic-2-1'
Also, when editing your ./app/views/layouts/application.html.erb you should use the new Bootstrap 2.0 classes:
<body>
  <div class="navbar navbar-fixed-top">
    <div class="navbar-inner">
      <div class="container">
        <a class="brand" href="#">OrigamiHub</a>
        <div class="nav-collapse">
          <%= tabs %>
        </div>
      </div>
    </div>
  </div>
  <div class="container">
    <%= yield %>
  </div>
</body>
Finally, you should heed formtastic’s deprecation warnings, and construct your semantic forms thus:
<%= semantic_form_for @widget do |f| %>
  <%= f.semantic_errors %>
  <%= f.inputs do %>
    <%= f.input :name, :hint => "The wangdoodle is a best-seller" %>
    <%= f.input :type, :hint => "We only do three sizes!" %>
  <% end %>
  <%= f.actions do %>
    <%= f.action(:submit) %>
  <% end %>
<% end %>
You’ll need to use the:
config.tabs_ul_class = "nav nav-pills"
option within tabulous to get the navigation bar to behave properly, along with the options recommended in the original article.
I’m sure the work above will be merged back into the mainline of each of the respective gems, and it’s a tribute to the GitHub community that so much good work is given and fixed freely.
For some reason I keep finding myself writing Perl code. In 2011. Over the last couple of months I’ve written enough Ruby to make my head spin, and yet, in a fix, I find myself back in the arms of Perl.
Recently I needed to parse some Apache virtual host configurations into some Nagios configuration file stanzas, a BIND zonefile and a hosts file for a project for my employer. Perl. On some private boxes not yet big enough to warrant a full Nagios installation I needed a script to run the CLI from the Clickatell SMS Gem in response to various /proc values. Perl.
I like to think that my Perl is well written. I avoid implicits such as $_, I `use strict`, pass “-w” and declare all of my variables up-front. Sometimes I even entertain taint checking. It’s not perfect by any means, and not everybody’s style, but if you’re familiar with a C-like language you can probably read it. If you’ve written pseudo-code, you can read it. If you’re a patterns-infected developer professional you’ll wince a bit. No FactoryFactories here. Object orientation has its place, but that place isn’t knocking up a quick five-minute script, in a panic, to remove all of the commas from a YAML file driven by SOAP calls via wget.
I don’t think that anybody really pretends that Perl is the best thing to begin a new large-scale development project with today. But that doesn’t mean that Perl is down and out. Whilst Ruby gained frameworks and conferences and spawned religions, Perl sat there in its varied and glorious assorted version 5 point releases on nearly every UNIX-like box on the planet, just waiting for someone to ignore the framework bling and me-too bits and get stuck into good old text manipulation and nicking bits off of CPAN.
I’m by no means a language fanatic; I like all sorts. And I’ll tackle ActiveState or Strawberry if you force me, too. Perhaps it’s familiarity. Perhaps I simply don’t know Ruby or Python or anything else as well as I think I do. Or perhaps for your common or garden UNIX-style, “text as a universal interface” basic string hackage, Perl is the crusty mig-welding old nutter who might just get you home?
I’ve recently had cause to try to parse the output of the `mailq’ command, by dint of it being the common queue output format between sendmail and postfix, and a desire to interoperate with them both. It is a complete bastard!
Here is some sample output to save you running the command:
-Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
637F5CFF9C*    1497 Sat Dec 18 21:40:34  firstname.lastname@example.org
                                         email@example.com

4C4D5A32E6*    1481 Sat Dec 18 21:39:45  firstname.lastname@example.org
                                         email@example.com

EBBE4D5C93*    1481 Sat Dec 18 21:51:52  firstname.lastname@example.org
                                         email@example.com

...

36BAAA3C9E     1471 Sat Dec 18 18:19:16  firstname.lastname@example.org
(delivery temporarily suspended: host mailxx-xx.xxxx.aol.com[xxxxxx] refused to talk to me: 421 4.7.1 : (DYN:T1) http://postmaster.info.aol.com/errors/421dynt1.html)
                                         email@example.com

350FE10A66C    1438 Wed Dec 15 01:33:38  firstname.lastname@example.org
(lost connection with mx-c1.xxxx.net[61.20.xx.xx] while receiving the initial server greeting)
                                         email@example.com

...

-- 359752 Kbytes in 174586 Requests.
This is awkward to parse because:
The queue to which a mail identified by a given queue-id belongs is denoted by a single character appended to the queue-id. Except if you’re in the deferred queue, in which case there is no special character: an attribute by absence, if you like.
The ‘arrival time’ doesn’t include a year. We either have to assume that the year is the current year and open ourselves up to year-end bugs, or we have to compensate for edge cases with time and boundary checks.
Records span multiple lines, and how many depends on the queue that the mail is in. This leads to nasty “if the last character of the first field of the following line was not an asterisk or exclamation mark, read two more lines, otherwise read three more lines. And that field that was on the second line before is now on the third” logic in parsers.
Whilst parsable with somewhat icky code, it just seems designed not to be parsed or piped: “records” spanning multiple lines is a definite no-no when you’re ‘xargs’-ing or piping through awk or anything else remotely UNIXy. I wonder what was going through the minds of the original authors of the `sendmail -bp` option, that `mailq` ultimately emulates?
I’m currently working on some Ruby code to parse the output of `mailq` and to populate an ActiveRecord model with the appropriate attributes. Wrapped up as a gem, it would be a convenient way to provide Rails applications access to the mail queue.
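As a taste of the direction, here is a rough sketch of the parsing core, without the ActiveRecord side. The regexes are illustrative and postfix-flavoured, and multi-line deferral reasons would need more care than shown here:

#!/usr/bin/env ruby
# a sketch of a mailq parser, not the gem: records start on a queue-id
# line; continuation lines (deferral reasons, recipients) attach to the
# current record
require 'time'

QUEUE_LINE = /^([0-9A-F]+)([*!]?)\s+(\d+)\s+(\w{3} \w{3} +\d+ [\d:]+)\s+(\S+)/

def parse_mailq(output)
  records = []
  output.each_line do |line|
    if line =~ QUEUE_LINE
      queue = case $2
              when '*' then :active
              when '!' then :hold
              else :deferred # the attribute-by-absence case
              end
      # mailq omits the year: assume the current one, then step back a
      # year if that would place the arrival in the future
      arrival = Time.parse("#{$4} #{Time.now.year}")
      arrival = Time.parse("#{$4} #{Time.now.year - 1}") if arrival > Time.now
      records << { :queue_id => $1, :queue => queue, :size => $3.to_i,
                   :arrival => arrival, :sender => $5,
                   :recipients => [], :reason => nil }
    elsif line =~ /^\s*\((.+)\)\s*$/ && records.last
      records.last[:reason] = $1
    elsif line =~ /^\s+(\S+@\S+)\s*$/ && records.last
      records.last[:recipients] << $1
    end
  end
  records
end

puts parse_mailq(%x[mailq]).length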