Woot! Monitoring and alerting for our infrastructure, powered by Datadog

I’m proud to announce we now have comprehensive monitoring and alerting for the machines running our community services, thanks to Datadog!

Good boy, Ops Dog

To run our community services (e.g. Talk, the issue tracker, the wiki, OpenMRS ID, continuous integration, and test servers), we have a fleet of around 20 machines maintained by infrastructure volunteers. Until now, we’ve had very little alerting: we were only notified when a service was completely down, which kept us in ‘reactive mode.’ There was no way to tell how CPU utilization or memory usage was changing over time, so investigation was limited to the current state of a machine and involved a lot of guesswork.

I’ve always said that one can only love their infrastructure as much as they love their monitoring/alerting system. Without such systems in place, infrastructure teams are effectively running production in complete darkness, with no visibility into problems.

I was looking for a platform that would notify us even if all our machines failed. About a month ago, Datadog opened their open source program and OpenMRS was approved! It’s a perfect fit for OpenMRS because it’s a cloud-based product, so it keeps watching even when our own machines are down. I’ve been using Datadog for a few months now and I can say I’m a huge fan of the product.

Since our infrastructure is deployed using Ansible, I used the official Ansible role to deploy the agent to all machines, which was easy. (There’s a Puppet module and a Chef cookbook too, or you can install agents manually if that’s your jam!)

Without any configuration, the Datadog Agent sends several kinds of data to the Datadog servers, such as operating system metrics and events. I quickly enabled the Docker integration for our Docker hosts, and I discovered a lot of new things about our infrastructure:

  • Quite a few machines were using huge amounts of swap; that explains the perceived slowness I had never stopped to investigate properly. In almost all cases the fix was to change the JVM heap sizes and give the machines more memory. If our CI feels more responsive, it’s a direct effect of this discovery!
  • Certain machines had very high CPU utilization. As they were running JVMs, I started changing the JVM memory size and configuration, with some success. Java GC can consume quite a lot of CPU if it isn’t properly configured for your application’s needs. Here’s an example from our wiki, which hopefully is more stable now:

With metrics in hand, it was time to create alerts for CPU usage, memory usage (beware: free memory is not the metric you are looking for on Linux), disk utilization, swap, host down, and Docker daemon up. We are now alerted with enough time to act and prevent outages caused by simple system-level failures.
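To illustrate the free-memory caveat, here is a minimal Python sketch that compares a usage figure based on MemFree with one based on MemAvailable, both standard fields in /proc/meminfo on reasonably recent Linux kernels. The function names and the sample numbers are made up for illustration; this is not how the Datadog Agent computes its metrics.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def free_based_usage_percent(meminfo):
    """Usage computed from MemFree; counts page cache as 'used'."""
    total = meminfo["MemTotal"]
    return 100.0 * (total - meminfo["MemFree"]) / total


def available_based_usage_percent(meminfo):
    """Usage computed from MemAvailable; reclaimable cache is not counted."""
    total = meminfo["MemTotal"]
    return 100.0 * (total - meminfo["MemAvailable"]) / total


# Made-up sample values, not taken from any real machine.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:          400000 kB
MemAvailable:    6000000 kB
Buffers:          200000 kB
Cached:          5000000 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print(f"free-based usage:      {free_based_usage_percent(info):.1f}%")
    print(f"available-based usage: {available_based_usage_percent(info):.1f}%")
```

With these sample numbers, the free-based figure looks alarming (95%) while the available-based one is a calm 25%, because most of the "used" memory is page cache the kernel can reclaim. On a real host you would pass the contents of /proc/meminfo instead of SAMPLE.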

A week ago, one of our infrastructure providers had a power outage. We received a notification from Datadog that 3 machines in the same datacenter were down. No time was wasted investigating: it was very clearly a problem with the provider.

More recently, we discovered that DNS configuration breaks on some machines every so often. It’s not clear when that happens, or on which machines exactly, so I enabled the DNS check to help us investigate and respond quickly.
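For a feel of what a DNS check verifies, here is a rough standalone sketch in Python. It is not the Datadog check itself, only an illustration using the standard library’s socket.getaddrinfo; the check_dns name and the injectable resolver parameter are my own assumptions, added so the function can be exercised without network access.

```python
import socket


def check_dns(hostname, resolver=socket.getaddrinfo):
    """Try to resolve hostname.

    Returns (True, sorted list of addresses) on success,
    or (False, error message) if resolution fails.
    """
    try:
        results = resolver(hostname, None)
    except socket.gaierror as exc:
        return False, str(exc)
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address is the first element of sockaddr.
    addresses = sorted({entry[4][0] for entry in results})
    if not addresses:
        return False, "no addresses returned"
    return True, addresses


if __name__ == "__main__":
    # example.org is a placeholder; use a hostname your machines must resolve.
    ok, detail = check_dns("example.org")
    print("OK" if ok else "FAILED", detail)
```

Run periodically on each machine, something like this would show when resolution breaks and on which hosts, which is the information we were missing.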

There are a lot of Datadog features we’ll be using soon, such as http_checks and alerting for all our services, nicely designed dashboards, and Java-specific metrics that will let us investigate GC and memory problems.

Last but not least, tagging the machines carefully lets me analyze our infrastructure much better. I added information about each machine’s status (how it was created, which provider it runs on, its level of automation, and the services it runs), which makes for some pretty nice maps in Datadog.
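The idea behind grouping by tags can be sketched in a few lines of Python. Everything here is hypothetical: the host names, the tag keys ("created", "automation"), and their values are invented for illustration and are not our real inventory or Datadog’s API.

```python
from collections import defaultdict

# Hypothetical hosts and tags, for illustration only.
HOSTS = {
    "host-a": {"created": "manually", "automation": "partial"},
    "host-b": {"created": "manually", "automation": "partial"},
    "host-c": {"created": "terraform", "automation": "full"},
    "host-d": {"created": "terraform", "automation": "partial"},
}


def group_hosts(hosts, *tag_keys):
    """Group host names by the values of the given tag keys, in order."""
    groups = defaultdict(list)
    for name, tags in hosts.items():
        # Hosts missing a tag land in an explicit "untagged" bucket.
        key = tuple(tags.get(k, "untagged") for k in tag_keys)
        groups[key].append(name)
    return {key: sorted(names) for key, names in sorted(groups.items())}


if __name__ == "__main__":
    for key, names in group_hosts(HOSTS, "created", "automation").items():
        print(" / ".join(key), "->", ", ".join(names))
```

Grouping first by how a machine was created and then by automation level gives the same kind of nested view as a host map, and the "untagged" bucket makes it obvious which machines still need tags.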

Here are the machines grouped by how they were created (manually vs. Terraform) and, within each group, by level of automation (partial vs. full):

Yeah, we have a long way to go!

Thanks again, Datadog!
