Monitor fleet aging


Background

Generally speaking, I refresh most of my systems pretty regularly, spurred on by security concerns, general hygiene, a desire to make sure the automation doesn't age out, and certificate expiration.

Although I don't need to refresh systems due to certificate expiration, it has historically been the easiest indicator of systems that are getting a little long in the tooth.

Working on some systems this weekend, I noticed some out-of-date copies of postgresql... really out of date... like close to a year old. This is what sent me off on this weekend's adventure.

What do you mean by refresh and why?

Given our penchant for building everything using Ansible, when I indicate I'm refreshing a system, that means the old VM gets taken down and a new one is built to then-current specifications as a replacement.

Rob and I have nurtured this workflow for years (ever since we moved to Ansible for automation). In all cases, I build staging environments before production, and in most cases there are some reasonable automated tests for that process.

As to why? The answer is mostly one of convenience, although there are security arguments as well: a rebuild picks up the latest versions of libraries whose older releases may contain vulnerabilities, and it dislodges anything bad that may be sitting on the virtual machines.

Monitoring the fleet age

Based on the recent discovery of some aging systems, I figured that I should find a way to add this process to our monitoring system, the venerable Nagios.

This didn't need to be particularly complex, but I needed the nagios server to reach out to the SmartOS Global Zones in order to get information about the running VMs. Historically, we've done this with captive SSH, using dedicated keys and lines in ~/.ssh/authorized_keys which take advantage of the command= option in order to run a fixed program, potentially with information from the incoming SSH connection. Results are sent back as text, preferably encoded in JSON or similar.
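As a rough illustration of the monitoring side of such a captive-SSH request, here is a minimal sketch. It is not the actual tool: the function names, the "nagios" login, and the timeouts are all assumptions made for the example, and it assumes the far end answers with a single JSON object.

```python
import json
import subprocess


def build_ssh_argv(gz_host, key_path, vm_name, user="nagios"):
    """Build the ssh command line for one captive-SSH request.

    The user name and option values are illustrative assumptions.
    """
    return [
        "ssh",
        "-i", key_path,             # dedicated key tied to a command= entry
        "-o", "BatchMode=yes",      # never prompt; fail instead
        "-o", "ConnectTimeout=10",
        f"{user}@{gz_host}",
        vm_name,                    # seen remotely as SSH_ORIGINAL_COMMAND
    ]


def parse_reply(text):
    """Decode the reply, insisting on a JSON object."""
    reply = json.loads(text)
    if not isinstance(reply, dict):
        raise ValueError("expected a JSON object in the reply")
    return reply


def query_vm(gz_host, key_path, vm_name):
    """Run the request over ssh and return the decoded reply."""
    result = subprocess.run(
        build_ssh_argv(gz_host, key_path, vm_name),
        capture_output=True, text=True, timeout=30, check=True,
    )
    return parse_reply(result.stdout)
```

Because the key is pinned to one program by command=, whatever the client puts on the command line is only ever handed to that program as data (sshd exposes it in the SSH_ORIGINAL_COMMAND environment variable), so the remote side still has to validate it.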

A new python framework for ssh requests

Most of our previous commands piggy-backed on the check_by_ssh checker, which is a standard nagios plugin. However, that command assumes that we put all of the intelligence at the other end of the line (on the recipient) and basically run the checks there. That could be done, but the need to do date math made coming up with an appropriate one-liner a bit ridiculous, so I decided to go with python.

The python code was straightforward, and I used my existing poetry-based environment as a starting point, creating a couple of new commands which I'd install on the nagios servers: one for SmartOS and another for AWS.

By making use of my existing poetry workflows, I got a number of things for free, including updating release notes, packaging releases in gitlab, etc.
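The date math that pushed this into python can be sketched roughly as follows. This is a sketch under assumptions rather than the real command: it assumes the VM's creation time arrives as an ISO 8601 string, and the 90 and 180 day thresholds and the function names are made up for the example. The exit codes are the standard nagios plugin ones.

```python
from datetime import datetime, timedelta, timezone

# Standard nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def vm_age(created_iso, now=None):
    """Return the VM's age, given an ISO 8601 creation timestamp."""
    # Older pythons reject a trailing "Z", so spell out the UTC offset.
    created = datetime.fromisoformat(created_iso.replace("Z", "+00:00"))
    if created.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        created = created.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - created


def classify(age, warn_days=90, crit_days=180):
    """Map an age onto a status code and a one-line message.

    The default thresholds are illustrative, not the real ones.
    """
    if age >= timedelta(days=crit_days):
        return CRITICAL, f"CRITICAL - VM is {age.days} days old"
    if age >= timedelta(days=warn_days):
        return WARNING, f"WARNING - VM is {age.days} days old"
    return OK, f"OK - VM is {age.days} days old"
```

A plugin built on this would print the message and exit with the status code, which is how nagios picks up the state of the check.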

Integrating with nagios

The nagios integration should have been simple, but for one small issue: I needed to parameterize the global zone host so that the command could be directed at the right one for each VM.

After some digging through the documentation for nagios, I found the section on custom macro variables, which is exactly what I needed in this case. I wanted to add a new variable _GZHOST to my existing host definitions which would indicate which host to query about the underlying VM. I already had this information in the PARENTS field, which I thought I could use as $HOSTPARENTS$, but it turns out that for some reason that's not exposed.

In this case, I was able to use $_HOSTGZHOST$ in my command definition in commands.cfg, resulting in:

define command {
       command_name check_smartos_vm_age
       command_line /opt/local/bin/ct-smartos-vm -H $_HOSTGZHOST$ $ARG2$ -i $USER5$/smartos-age-check-key $HOSTNAME$
}

With:

  • $_HOSTGZHOST$ having the Global Zone host
  • $ARG2$ being a placeholder for optional parameters (such as overriding the timelines)
  • $USER5$ pointing to our directory for storing ssh keys
  • $HOSTNAME$ the name of the VM to check

Results

In the end, I found a few more systems that were out of date than I was expecting, including one I could have sworn I'd refreshed just earlier this week. So, I'm pretty happy with the system.