Creating a Sensu Check

I recently gave a lightning talk at DevOpsMtl about creating a Sensu check. Sensu is a monitoring framework that lets you build your monitoring solution exactly the way you need it.

Despite tons of checks being available already (see the ‘plugins’ directory of the sensu-community-plugins repo), you may run into situations where you need to build your own. Thankfully, it’s pretty easy.

What are Sensu checks?

Sensu checks run periodically and can check anything from basic system state (CPU, memory, disk usage, load, etc.) to the state of the software you’re running (PostgreSQL, MySQL, ElasticSearch, Redis, etc.). You can also go further and use Sensu to monitor external systems you depend on, even if you don’t control them directly. That way you can at least initiate a failover (e.g. switch to Mailgun if SendGrid is down).

The official documentation on checks is pretty comprehensive. This post is meant as a primer.

So, what’s a Sensu check? Let’s start with how they’re configured:

{
  "checks": {
    "ram_usage": {
      "command": "check-ram.rb -w 50 -c 15",
      "interval": 60,
      "subscribers": [ "linux" ]
    }
  }
}

The command parameter is simply an executable command that will be run on the server. Here we have an executable Ruby script (properly hash-banged), invoked with a few parameters. It doesn’t have to be Ruby: it can be Bash, Python, a compiled program, anything. Heck, if you feel like it, you could cram a whole Bash one-liner into your command parameter and not need to upload a script to the monitored node.

The API

The script must behave a certain way for Sensu to be able to make sense of it. Before going any further, I’ll mention that there are two kinds of checks: standard checks and metric checks. This post covers only the first; see the Sensu documentation to read about metric checks.

For those familiar with Nagios, standard Sensu checks are compatible with Nagios checks. So if you know of a Nagios check that does what you need, you can stop right here and go grab that. Otherwise, let’s continue.

The exit status of a Sensu check should be:

0: ok
1: warning
2: critical
3 or more: unknown

A Sensu check can also output text describing the state, on stdout or stderr. Ideally, your check should output at least one line, but it can output more.

Example outputs of check-ram.rb:

Exit status  Output
0            CheckRAM OK: 65% free RAM left
1            CheckRAM WARNING: 40% free RAM left
2            CheckRAM CRITICAL: 13% free RAM left
3            CheckRAM UNKNOWN: invalid percentage
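
To make that contract concrete, here is a minimal Ruby sketch of how a runner could interpret a check’s result. It is not Sensu’s own code: state_for and run_check are names made up for this illustration, and the inline shell command merely stands in for a real check.

```ruby
require 'open3'

# Exit statuses 0, 1 and 2 have a fixed meaning; anything else is "unknown".
STATES = { 0 => 'ok', 1 => 'warning', 2 => 'critical' }.freeze

# Hypothetical helper: translate an exit status into a state name.
def state_for(exit_status)
  STATES.fetch(exit_status, 'unknown')
end

# Hypothetical helper: run a command through the shell, capturing stdout and
# stderr together along with the exit status.
def run_check(command)
  output, status = Open3.capture2e(command)
  { state: state_for(status.exitstatus), output: output.strip }
end

# Stand-in for a real check: prints one line and exits with status 1.
result = run_check('echo "CheckRAM WARNING: 40% free RAM left"; exit 1')
puts "#{result[:state]}: #{result[:output]}"
```

Capturing stdout and stderr together mirrors the fact that a check may write its description to either stream.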

Dumb check example

Here’s an example of a dummy Bash check that monitors the value of $RANDOM. Anything over 22000 is ok, over 11000 is a warning and below that is critical.

#!/bin/bash

# Check the state
rand=$RANDOM # 0 to 32767

# Report how dire the situation is
if [ "$rand" -gt 22000 ]; then
  echo "Ok: random number generated was high enough ($rand)"
  exit 0
elif [ "$rand" -gt 11000 ]; then
  echo "Warning: random number generated was $rand"
  exit 1
else
  echo "Critical: random number generated was $rand"
  exit 2
fi

# Every branch above exits, so this is only a safety net
echo "Unknown: How did I get here?"
exit 3 # or higher

Ruby is nicer

If you prefer using Ruby, the Sensu folks created a Ruby gem you can start from.

gem install sensu-plugin

Here’s a simplified version of the check-ram.rb plugin:

#!/usr/bin/env ruby
#
# Check free RAM Plugin
#

require 'sensu-plugin/check/cli'

class CheckRAM < Sensu::Plugin::Check::CLI

  option :warn,
    :short => '-w WARN',
    :proc => proc {|a| a.to_i },
    :default => 10

  option :crit,
    :short => '-c CRIT',
    :proc => proc {|a| a.to_i },
    :default => 5

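  def run
    total_ram, free_ram = 0, 0

    # Parse `free -m`: total comes from the "Mem:" line, free from the
    # "-/+ buffers/cache:" line (printed by older versions of free)
    `free -m`.split("\n").drop(1).each do |line|
      free_ram = line.split[3].to_i if line =~ /^-\/\+ buffers\/cache:/
      total_ram = line.split[1].to_i if line =~ /^Mem:/
    end

    # Thresholds are percentages, so anything above 100 makes no sense
    unknown "invalid percentage" if config[:crit] > 100 || config[:warn] > 100

    percents_left = free_ram*100/total_ram
    message "#{percents_left}% free RAM left"

    # Each helper exits right away, so the most severe state is tested first
    critical if percents_left < config[:crit]
    warning if percents_left < config[:warn]
    ok
  end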
end

You inherit from Sensu::Plugin::Check::CLI, then define a run method and you’re off to the races.

CLI parameters

The option method lets you declare how your script should be invoked. Here are a few things you can specify:

a default value
whether the argument is required
short and long argument names (-v or --version)
a description for the CLI help
a proc to run on the argument, commonly used to call .to_i or .to_f (otherwise, all values you get in the config hash are strings)
whether it is a boolean argument (the only case where you don’t get a string)

Using option also has the nice side effect of producing a useful --help for your script.

The workings of option will sound familiar to fans of Thor, but in our case option is a feature provided by Opscode’s mixlib-cli gem. See that gem’s documentation for everything option supports.

The instance method config is also provided by mixlib-cli. It returns a hash of all resolved CLI parameters (the value given on the command line, or the default otherwise). Because config is an instance method, it’s available inside run or any other method you care to define, without needing to pass the hash around.

Reporting the state

So you have defined your parameters and you know how to access them in your code. The sensu-plugin gem also gives you a few helper methods for reporting the observed state in a consistent manner.

The basics are:

ok(message)
warning(message)
critical(message)
unknown(message)

Each of those corresponds exactly to one of the exit statuses described in the API section. One thing to know is that calling any of them stops the execution of the script right away.

The check-ram.rb example uses a slightly different approach. Since the message is always exactly the same, it calls message(message) once, and then calls ok (or one of the other helpers) without a message.

If you look back at the code for check-ram.rb, you’ll notice that the message reported is “65% free RAM left”. The sensu-plugin gem automatically normalizes the output with the check name and status: “CheckRAM OK: 65% free RAM left”.
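
To show what those helpers amount to, here is a toy stand-in written in plain Ruby. It is not the sensu-plugin gem’s code, only a sketch of the behaviour described in this section: print one normalized line, then exit with the matching status. ToyCheck and CheckDemo are hypothetical names, and the thresholds in run are arbitrary defaults.

```ruby
# Toy stand-in for the reporting helpers. This is NOT the sensu-plugin gem's
# code, only an illustration of the behaviour described above.
class ToyCheck
  EXIT_CODES = { 'OK' => 0, 'WARNING' => 1, 'CRITICAL' => 2, 'UNKNOWN' => 3 }.freeze

  # Store a message for a later status call made without one.
  def message(text)
    @message = text
  end

  # Define ok, warning, critical and unknown: print one normalized line,
  # then exit with the matching status.
  EXIT_CODES.each do |status, code|
    define_method(status.downcase) do |text = nil|
      puts "#{self.class.name} #{status}: #{text || @message}"
      exit code
    end
  end
end

# Hypothetical check built on the toy class, using the same
# message-then-status style as check-ram.rb.
class CheckDemo < ToyCheck
  def run(percents_left, warn = 50, crit = 15)
    message "#{percents_left}% free RAM left"
    critical if percents_left < crit
    warning if percents_left < warn
    ok
  end
end
```

Calling CheckDemo.new.run(40) prints “CheckDemo WARNING: 40% free RAM left” and exits with status 1. Because each helper exits, the ok at the end of run is only reached when neither threshold was crossed.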

Testing your check

Since Sensu checks are simple executables, all you need to do is run your check manually in each relevant scenario and verify that it behaves as expected.

No one’s created an automated test harness yet for Sensu checks. So manual testing will have to do for now :-)
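
That said, the manual runs are easy to script. The following is a minimal sketch, not an established tool: failed_scenarios is a made-up helper, and the echo commands are stand-ins that you would replace with real invocations of your check on hosts in a known state.

```ruby
require 'open3'

# Hypothetical helper: run every scenario and return the ones whose exit
# status differs from the expected one.
# Each scenario is a pair: [command, expected_exit_status].
def failed_scenarios(scenarios)
  scenarios.reject do |command, expected|
    _output, status = Open3.capture2e(command)
    status.exitstatus == expected
  end
end

# Stand-in commands; swap in real invocations of your own check.
scenarios = [
  ['echo "Demo OK: all good"; exit 0', 0],
  ['echo "Demo WARNING: getting close"; exit 1', 1],
  ['echo "Demo CRITICAL: on fire"; exit 2', 2]
]

failures = failed_scenarios(scenarios)
if failures.empty?
  puts 'all scenarios passed'
else
  puts "unexpected exit status for: #{failures.map(&:first).join(', ')}"
end
```

Only the exit status is compared here, since that is what Sensu acts on; comparing the output text as well would be a natural extension.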

Last step

The last step of creating a Sensu check is to visit the sensu-community-plugins repo, create a pull request and share your check with the world. You’re probably not the only one who needs to monitor whatever you just created a check for, so why not give back to the community?