How to Automate Uptime Display in Statuspage via Synthetic Monitors in Datadog in 4 Steps

I was given the task of converting a 1/0 metric on our Statuspage page, sourced from Datadog Metrics via the Route53 health check, into an actual percentage uptime display in Statuspage, or at least something similarly meaningful to the end user.

First stop: Service Level Objectives

When browsing around our current monitors and dashboards, one thing that stood out was “service level objectives.” In combination with synthetics, they provide an uptime percentage over a period of time that can be embedded on a dashboard. (We’ll come back to synthetics with a different approach.)

SLO Synthetics Uptime Display in Datadog

Next stop: Trying to embed those SLOs

The System Metrics integration on the Statuspage side seems to be built only for flat point-in-time queries, not for values aggregated over days or weeks. An aws.route53.health_check_status query that produced either a 1 or a 0 at any given point in time was fine, but coming up with a way to “query” for a 24-hour or 90-day uptime was a different story (impossible to do via direct integration between the two apps?).
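To make the gap concrete: a point-in-time health check yields only a 1 or a 0, while an uptime figure is an aggregate over a window. Here is a minimal sketch of that aggregation, assuming evenly spaced samples. The function name and the sample data are hypothetical and do not come from Datadog or Route53.

```python
def uptime_percent(samples):
    """Return the share of healthy (1) samples as a percentage.

    Assumes samples are evenly spaced 1/0 readings over the window.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return 100.0 * sum(samples) / len(samples)


# Hypothetical data: one day of one-minute samples with 3 failed checks.
day = [1] * 1437 + [0] * 3
print(f"{uptime_percent(day):.2f}%")  # prints 99.79%
```

This is the kind of calculation an SLO does for you over 24 hours or 90 days, and it is what a flat point-in-time query cannot express.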

Third stop: UptimeRobot and Similar

Jyll over @ suggested some experimentation with Uptime Robot and similar services with my own free instance of Statuspage. It was in stripping away the extra configuration, and being able to feed a simple up/down email or webhook to Statuspage, that I came back to the idea of checking whether I could email or webhook synthetic alerts from Datadog to Statuspage. (Spoiler: You can!)

Final stop (and the actual steps needed!): Automating Datadog to Send Status to Get Uptime Display in Statuspage

1. Add a component in your Statuspage account

2. Click on the “Automation” button to get the automation email. Copy that email:
Click the Automation button to reveal your automation email

3. (Create a synthetic monitor that checks a heartbeat route if you don’t already have one)

4. Go to your synthetic monitor in Datadog. Under Step 6 is “Notify your team”. Your monitor name needs to use the template variables {{#is_alert}}DOWN{{/is_alert}}{{#is_recovery}}UP{{/is_recovery}} for Statuspage automation to understand the message. The rest of the monitor name is irrelevant (as long as DOWN or UP isn’t a fixed part of that name!)

The automation email needs to be mentioned in the message body with an @ in front of it.
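As a rough illustration of step 4, the sketch below builds a monitor name and message body in the shape described above, then previews what the name should resolve to for each state. The label and email address are placeholders. The `preview` helper is only a local approximation of how the conditional blocks behave, not Datadog's actual rendering.

```python
import re

# Placeholder values: substitute your own label and your component's automation email.
MONITOR_LABEL = "Heartbeat"
AUTOMATION_EMAIL = "component+example@notifications.example.com"

# Monitor name carrying the DOWN/UP template variables from step 4.
monitor_name = (
    MONITOR_LABEL
    + " {{#is_alert}}DOWN{{/is_alert}}{{#is_recovery}}UP{{/is_recovery}}"
)

# The automation email is mentioned in the message body with an @ in front.
monitor_message = f"Synthetic check state changed. @{AUTOMATION_EMAIL}"


def preview(template, state):
    """Local approximation: keep the block matching `state`, drop the others."""
    def keep_or_drop(match):
        return match.group(2) if match.group(1) == state else ""

    pattern = r"\{\{#(\w+)\}\}(.*?)\{\{/\1\}\}"
    return re.sub(pattern, keep_or_drop, template).strip()


print(preview(monitor_name, "is_alert"))     # prints: Heartbeat DOWN
print(preview(monitor_name, "is_recovery"))  # prints: Heartbeat UP
```

The preview also shows why DOWN or UP must not be a fixed part of the name: a fixed word would appear in both the alert and the recovery subject.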

Monitor alert settings
No, that’s not a valid automation email.

Inauspicious Start for Oracle Cloud Free Tier Sign up

Oracle Cloud Free Tier offerings

I heard via word of mouth and Twitter of a new Oracle Cloud Free Tier (with permanently free services [for now]). The always free services looked enticing enough:

AMD and ARM and object storage!

The challenge was, “Who can afford ‘free’ services?” Time is worth something. But I can always make use of another cloud server to run experiments on.

Problem #1: Email confirmation didn’t go through

Self-explanatory, but, yes… I checked my spam and all the auto-sorting tabs. The email confirmation link, which is only good for 30 minutes, wasn’t delivered in time. On the second attempt, the email showed up immediately.

Problem #2: Password Too Strong

My first 30-character randomly generated password didn’t pass the test:

I think I met the requirements??

Problem #3: Wouldn’t validate my debit card

Maybe there’s a payment glitch right now? Maybe I don’t have enough in the account for a “free” account? Worse… the “try again” link makes you start over from the very first step of creating your account.

Problem #4: Declined my credit card

After going through 2-3 times with a debit card, I tried with a credit card. Maybe I needed five figures of available credit for a free account? This is Oracle, after all.

Upon resubmitting, I’m back to “Error processing transaction”

Aha! Moment

(b) in the above error message was the clue that eventually led me to the right answer… My VPN (still US-based) was active, which possibly set off alarm bells with the payment processor. I’m in now and ready to try some VMs!

If 5 nines is a myth, what is 3 nines?

“Burned By Gmail Outage? Google Will (Almost) Buy You a Postage Stamp.” Apparently the SLA for Google Apps will get you 3 days of service credit if uptime dips below 99.9% for the month. At $50 annually for the service, that’s $0.41. (Google has decided to pay out 15 days of credit anyway.)

Service credits from the SLA:

Monthly Uptime Percentage | Days of Service added to the end of the Service term, at no charge to Customer
< 99.9% – ≥ 99.0% | 3
< 99.0% – ≥ 95.0% | 7
< 95.0% | 15

That’s a $0.41, $0.96, and $2.05 credit for over 43.2 minutes, 7.2 hours, and 36 hours of downtime, respectively, in a 30-day month. I realize that across an entire business, that could potentially be a thousand credits or more, but what business would see a $2.05 per user credit as adequate for 36 hours of downtime in a month?
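The arithmetic can be checked with a short sketch. It assumes a 365-day year for pricing the credit days and a 30-day month for the downtime thresholds. The constant and function names are illustrative only.

```python
ANNUAL_PRICE = 50.0               # dollars per user per year, as quoted above
DAYS_PER_YEAR = 365               # assumption used to price one day of service
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def credit_value(days):
    """Dollar value of `days` of free service at the annual price."""
    return ANNUAL_PRICE / DAYS_PER_YEAR * days


def downtime_minutes(uptime_percent):
    """Minutes of downtime in the month at exactly `uptime_percent` uptime."""
    return MINUTES_PER_MONTH * (1 - uptime_percent / 100)


# (uptime threshold, credit days) pairs from the SLA tiers above.
for uptime, days in [(99.9, 3), (99.0, 7), (95.0, 15)]:
    minutes = downtime_minutes(uptime)
    print(
        f"below {uptime}%: over {minutes:.1f} min ({minutes / 60:.2f} h) down, "
        f"credit worth ${credit_value(days):.2f}"
    )
```

Changing `MINUTES_PER_MONTH` to a 31-day month shifts the downtime thresholds slightly, but the credit values stay the same.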

From these two Register articles, “Google’s email service goes down” and “Google blames Gmail outage on data centre collapse,” it looks like the downtime was about 2 hours and 15 minutes, but there are some reports of outages as long as 4 hours.

There is a nice article on High Availability on Wikipedia to compare uptimes on a weekly, monthly, and yearly basis.

Is cloud computing a threat to Microsoft?

Washington Times – “KELLNER: Cloud computing a Microsoft threat?” From what I’ve seen, Microsoft is actively combating this with their Software + Services offering.

The battle really comes down to whether you want to drive the car of your choice to work (desktop apps), or depend on public transportation (cloud). Software + Services is like having a Hummer H3 and an unlimited booklet of bus passes. (Personally, I’d prefer cheaper personal transport.)

Of course, there are risks with cloud computing as well. (What’s your “Web Service Tanked” contingency plan?) Nothing like a transportation worker strike to kill public transportation and leave you stranded with no way home.

Cloud wars are on the horizon. Yes, running standalone apps is an expensive and inflexible form of transportation. However, will the business side tolerate staying in the clouds all the time without the option of being able to operate for extended periods on the ground? I understand that there are offline modes for the best offerings, but they’re short-term contingencies, not long-term operational options.