Incident details
On November 18th, ChargeOver suffered a partial service outage.
During the outage, many customers were unable to access most parts of the ChargeOver application.
Root cause
ChargeOver has a cluster of redundant ingress servers that accept all application traffic and route it internally to different backend services.
These ingress servers run HAProxy and are backed by a distributed network file store (GlusterFS), which holds the encrypted SSL/TLS certificates HAProxy needs to serve secure connections for incoming HTTPS requests.
Sometime between November 7th and November 17th, two of our ingress load balancers lost their connection to the distributed network file store. We are still investigating the cause of the disconnection.
On November 18th, we deployed a routine update to our ingress load balancers. The deployment failed on 2 of the 3 redundant ingress load balancers due to the lost connection to the distributed network file store. This caused approximately 2/3 of traffic to ChargeOver to be dropped. The third ingress load balancer stayed active and continued to serve traffic.
Automatic DNS fail-over removed the 2 affected ingress load balancers from serving traffic within 5 minutes of the outage.
ChargeOver staff were notified immediately. We restarted the two affected ingress load balancers to re-establish their connection to the distributed network file store, and then re-deployed the ingress load balancer update.
This restored access to all services.
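The automatic DNS fail-over described above works because our public hostnames are round-robin DNS names: a single name resolves to one address per ingress load balancer, and removing an unhealthy address steers new connections to the remaining servers. The sketch below illustrates the idea with a placeholder hostname (app.example.com is not ChargeOver's actual DNS name, and the code is illustrative rather than part of our fail-over tooling):

```python
import socket

# Illustration only: "app.example.com" is a placeholder hostname, not
# ChargeOver's actual DNS entry.
def resolve_all(hostname: str, port: int = 443) -> list[str]:
    """Return every address currently published in DNS for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # getaddrinfo may return duplicate addresses (one per socket type), so
    # de-duplicate before returning.
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    # A round-robin name normally resolves to one address per load balancer.
    # When DNS fail-over removes an unhealthy load balancer, its address
    # simply stops appearing here, and new connections go to the remaining
    # servers.
    print(resolve_all("app.example.com"))
```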
Incident timeline
- Sometime between November 7th and November 17th - connection between load balancers and distributed network file store is lost
- November 18 - 3:29pm CT - Our team starts a routine deployment to our ingress load balancers
- November 18 - 3:33pm CT - The deployment completes, and we are immediately notified of a problem with 2 of the 3 load balancers
- November 18 - 3:38pm CT - Automatic DNS fail-over removes 2 of our 3 ingress load balancers from DNS, as expected
- November 18 - 3:44pm CT - We post an initial update about the partial outage to https://status.chargeover.com/
- November 18 - 3:55pm CT - After assessing the situation, we reboot the two problematic load balancers
- November 18 - 3:59pm CT - Re-deploy of the ingress load balancer update succeeds across all 3 servers
- November 18 - 4:02pm CT - Our team confirms everything looks good, and DNS fail-over automatically adds the 2 recovered load balancers back to DNS
- November 18 - 4:07pm CT - Update posted to https://status.chargeover.com/ indicating the issue has been resolved
Remediation plan
We’ve identified a number of takeaways from this incident.
- We are implementing additional monitoring to proactively detect when the connection between an ingress load balancer and the distributed network file store is lost (a minimal sketch of such a check follows this list).
- We are discussing possible solutions to reduce or remove the load balancers' dependency on the distributed file store.
- Several internal DNS names pointed at specific ingress load balancers rather than using the more fault-tolerant round-robin DNS that our public services use. This hampered our ability to recover quickly, so we are moving those internal DNS names to round-robin DNS as well.
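As an illustration of the first item above, a basic health check for the certificate mount could look something like the sketch below. The mount path and certificate naming are assumptions for the sake of example, not our actual production layout.

```python
import os
import sys

# Hypothetical mount point for illustration; not ChargeOver's actual layout.
CERT_MOUNT = "/mnt/gluster/haproxy-certs"

def mount_is_attached(path: str) -> bool:
    # If the distributed file store connection drops, the directory is often
    # left as a plain local directory rather than an active mount point.
    return os.path.ismount(path)

def certs_are_readable(path: str) -> bool:
    # Basic read probe: the mount should list at least one certificate file.
    try:
        return any(name.endswith(".pem") for name in os.listdir(path))
    except OSError:
        return False

if __name__ == "__main__":
    healthy = mount_is_attached(CERT_MOUNT) and certs_are_readable(CERT_MOUNT)
    status = "OK" if healthy else "FAIL"
    print(f"{status}: certificate mount {CERT_MOUNT}")
    sys.exit(0 if healthy else 1)
```

A check like this can run periodically on each load balancer and raise an alert well before any deployment depends on the mount being healthy.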