Houston: a Health Check Service

Development

Houston, we have a problem

...or not. Houston is my self-hosted, open source health check service. Why? Well, recently I changed the setup for my server, and while thinking about how to make it as nice and fancy as possible (nice interfaces and all that), I decided that a central service that could alert me to failures in other services would be very helpful. Now I could have, like any reasonable person, looked for an existing solution for self-hosted health checks, but being a programmer first and sysadmin second, I immediately decided that most existing solutions would probably be both too complicated and too limiting. I don't know whether that's actually the case, but I do know that Houston can do everything I need and want it to, and because I wrote it myself it's both infinitely configurable and its configuration is as intuitive as it could be for me.

So how does it work? Well, I decided to start with a description of exactly what I wanted it to do: periodically perform some test, and if the result meets some condition, perform some other action based on the results of that test. You might notice this is very general, but in this case I think that actually makes the design simpler rather than more complicated. All I really needed to implement for the core service was something with a timer, a test (and condition) format, and an action format, plus a central piece to put it all together. So, about those pieces specifically:

  • Timer: This one's pretty simple, but even so I decided it would be best if it worked in a modular and predictable way. So I made it operate on even intervals (specified in the config) relative to the Unix epoch (midnight on the first day of 1970), with an offset that can have a random component. If you want a test to run sometime between 3 and 4 am every day, you'd set the interval to 24 hours, the offset to 3 hours, and the random offset to one hour (see the sketch just after this list).
  • Test format: For the tests I decided to simply have a list of "checks" that run for every test. Each check can be one of several different types (e.g. HTTP), which I'll go into later, and each check returns either success or failure, along with a message in the case of failure.
  • Condition format: What I mean by this is a way to match some data against an expected result, which can be reused in multiple places throughout the config. I ended up writing a basic recursive matching system that lets you do things like parse strings into JSON objects, compare equality as well as numeric less-than and greater-than, check string length, and so on (a small sketch of the idea follows after the list).
  • Action format: This one is very similar to the test format, in that there are several different types of actions (one to send an email over Amazon SES2, one to execute a command over SSH, etc.), and each action has parameters which can contain placeholders, used to fill in information from the test (or tests) that failed and triggered the action.
  • Central piece: For the central design I decided on a structure where every configuration file stored in a directory is recursively scanned and used; the actual location of a file within that directory doesn't matter. Each file can (optionally) contain two lists, one for tests and one for actions, both of which have names associated with them so that multiple tests can reference the same action by name (there's a sketch of what such a file might look like a bit further down).
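
To make the timer behaviour concrete, here's a minimal sketch of how the next run time could be computed from an interval, a fixed offset, and a random offset. This isn't Houston's actual code, and the names interval, offset and random_offset are illustrative rather than the real config keys:

    import random
    import time

    def next_run(interval, offset, random_offset):
        """Compute the next run time (seconds since the Unix epoch) for a
        timer aligned to even multiples of `interval`. Simplified sketch."""
        now = time.time()
        # Start of the next interval boundary, counted from the epoch.
        next_boundary = (int(now // interval) + 1) * interval
        # Shift it by the fixed offset plus a random component.
        return next_boundary + offset + random.uniform(0, random_offset)

    # Example: once a day, somewhere between 3 and 4 am UTC.
    print(next_run(24 * 3600, 3 * 3600, 3600))

Anchoring to the epoch like this keeps the schedule predictable: restarting the service doesn't shift when the tests run.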

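The recursive condition matcher is easier to show than to describe. Houston's real matcher has its own schema, but the general shape is roughly this (the keys parse_json, eq, lt, gt and length are made up for the sketch):

    import json

    def matches(condition, value):
        """Recursively match `value` against a tiny condition language."""
        if "parse_json" in condition:
            # Parse the string, then match the result against the nested condition.
            return matches(condition["parse_json"], json.loads(value))
        if "eq" in condition:
            return value == condition["eq"]
        if "lt" in condition:
            return value < condition["lt"]
        if "gt" in condition:
            return value > condition["gt"]
        if "length" in condition:
            return matches(condition["length"], len(value))
        return False

    # Example: parse a JSON body and check that it equals {"status": "ok"}.
    print(matches({"parse_json": {"eq": {"status": "ok"}}}, '{"status": "ok"}'))
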
The reasoning behind this design is that it's as modular as I could come up with while keeping the goal (of notifying me when stuff breaks, and maybe even restarting broken things for me) in mind.
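
To show how it all fits together, here's a sketch of what one of those config files might look like, written out as a Python dict; every field name here is illustrative, not Houston's actual schema:

    # One config file: an optional list of tests and an optional list of actions.
    # Tests reference actions by name, so several tests can share one action.
    config = {
        "tests": [
            {
                "name": "website-up",
                "interval": "24h",
                "offset": "3h",
                "random_offset": "1h",
                "checks": [
                    {"type": "http",
                     "url": "https://example.com/health",
                     "expect": {"status": {"eq": 200}}},
                ],
                "on_failure": ["email-me"],
            },
        ],
        "actions": [
            {
                "name": "email-me",
                "type": "ses2",
                # Placeholders get filled in from the failing test.
                "subject": "Health check failed: {test_name}",
                "body": "{message}",
            },
        ],
    }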

Tests

So what tests or checks does it support? For a complete list you can check the GitLab repository, but here's a summary, along with why I chose to include each test.

  • Always fail/Always succeed: These aren't too important and I just added them for testing.
  • HTTP: Performs an HTTP request and optionally checks the returned data. This one is a no-brainer: most services have an endpoint that simply reports whether the service is running properly, which is what this check makes the most sense for, but it also works for just checking that the main page loads, etc. (a sketch of roughly what this boils down to appears after this list).
  • SSH: Connects to an SSH server and runs a set of commands, checking the exit status and optionally the output. I added this so I can do things like check the status of Docker using its CLI, for example. It also works as a catch-all for any unimplemented tests.
  • TCP: Connects to a TCP server, sends a payload, and checks the response. I added this so I can check custom protocols and the like; in general, most servers that aren't HTTP will work with this.
  • RCON: Like SSH, but uses Valve's Source RCON protocol (also used by many other notable games such as Minecraft). I added this because of the multiple game servers I'm hosting.
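
At their core the individual checks are small. An HTTP check, for instance, boils down to something like this sketch (using Python's standard library, not Houston's actual code; the expected-status handling is simplified):

    import urllib.error
    import urllib.request

    def http_check(url, expected_status=200):
        """Return (success, message) for a simple HTTP health check."""
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                if response.status == expected_status:
                    return True, ""
                return False, f"unexpected status {response.status}"
        except urllib.error.URLError as err:
            return False, f"request failed: {err}"

    ok, message = http_check("https://example.com/health")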

Actions

For completeness's sake, here are the supported actions (a small sketch of how an action fills in its placeholders follows the list):

  • Log: This simply logs the failure to standard out.
  • SES2: I wanted to be notified about failures, so I decided to just use Amazon's SES2 service; that's all this does.
  • SSH: Just runs commands over SSH, much like the test but without any checking.
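
The actions themselves aren't much more involved than the checks; the interesting part is filling in the placeholders from the failed test. That can be as simple as this sketch (with made-up field names, not Houston's real implementation):

    def run_log_action(template, failure):
        """Fill placeholders like {test_name} and {message} from the failed
        test's details, then log the result to standard out."""
        print(template.format(**failure))

    run_log_action(
        "test {test_name} failed: {message}",
        {"test_name": "website-up", "message": "unexpected status 503"},
    )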

Closing thoughts

While there are almost certainly ways to do this without writing my own server, and there are a few features it'd be nice to add in the future (like a way to prevent spam when the same test fails over and over), I'm overall very happy with how this project turned out. It's quite modular and, as mentioned above, can do everything I need for health checks. I know self-hosting a health check service is generally ill-advised, but it's not mission critical and should still catch most issues if it runs on a battery-backed Raspberry Pi, so I'm convinced that won't be a problem for me. Overall I'd say this project is a success, if a little simple, and as always you can take a look at the code yourself at the project page here.

I know this article could be longer, but I'm a bit tired, so if you have any questions feel free to shoot me a message or an email.