Incident details
On November 18th, ChargeOver suffered a partial service outage.
During the outage, many customers were unable to access most parts of the ChargeOver application.
Root cause
ChargeOver has a cluster of redundant ingress servers that accept all application traffic and route it internally to different backend services.
These ingress servers run HAProxy and are backed by a distributed network file store (GlusterFS), which holds the encrypted SSL/TLS certificates HAProxy needs to serve secure connections for incoming HTTPS requests.
Sometime between November 7th and November 17th, two of our ingress load balancers lost their connection to the distributed network file store. We are still investigating the cause of the disconnection.
On November 18th, we deployed a routine update to our ingress load balancers. The deployment failed on 2 of the 3 redundant ingress load balancers due to the lost connection to the distributed network file store. This caused approximately 2/3 of traffic to ChargeOver to be dropped. The third ingress load balancer stayed active and continued to serve traffic.
Automatic DNS fail-over removed the 2 affected ingress load balancers from serving traffic within 5 minutes of the outage.
ChargeOver staff were notified immediately. We restarted the two affected ingress load balancers to re-establish their connection to the distributed network file store, and then re-deployed the ingress load balancer update.
This restored access to all services.
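The automatic DNS fail-over described above works because our public hostnames are round-robin DNS names: a single name resolves to one address per ingress load balancer, and removing an unhealthy address steers new connections to the remaining servers. The sketch below illustrates the idea with a placeholder hostname (app.example.com is not ChargeOver's actual DNS name, and the code is illustrative rather than part of our fail-over tooling):

```python
import socket

# Illustration only: "app.example.com" is a placeholder hostname, not
# ChargeOver's actual DNS entry.
def resolve_all(hostname: str, port: int = 443) -> list[str]:
    """Return every address currently published in DNS for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # getaddrinfo may return duplicate addresses (one per socket type), so
    # de-duplicate before returning.
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    # A round-robin name normally resolves to one address per load balancer.
    # When DNS fail-over removes an unhealthy load balancer, its address
    # simply stops appearing here, and new connections go to the remaining
    # servers.
    print(resolve_all("app.example.com"))
```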
Incident timeline
- Sometime between November 7th and November 17th - connection between load balancers and distributed network file store is lost
- November 18 - 3:29pm CT - Our team starts a routine deployment to our ingress load balancers
- November 18 - 3:33pm CT - The deployment completes, and we are immediately notified of a problem with 2 of the 3 load balancers
- November 18 - 3:38pm CT - Automatic DNS fail-over removes 2 of our 3 ingress load balancers from DNS, as expected
- November 18 - 3:44pm CT - We post an initial update about the partial outage to https://status.chargeover.com/
- November 18 - 3:55pm CT - After assessing the situation, we reboot the two problematic load balancers
- November 18 - 3:59pm CT - Re-deploy of the ingress load balancer update succeeds across all 3 servers
- November 18 - 4:02pm CT - Our team confirms everything looks good, and DNS fail-over automatically adds the 2 recovered load balancers back to DNS
- November 18 - 4:07pm CT - Update posted to https://status.chargeover.com/ indicating the issue has been resolved
Remediation plan
We’ve identified a number of takeaways from this incident.
- We are implementing additional monitoring to proactively detect when the connection between an ingress load balancer and the distributed network file store is lost (a minimal sketch of such a check follows this list).
- We are discussing possible solutions to reduce or remove the load balancers' dependency on the distributed file store.
- Several internal DNS names pointed at specific ingress load balancers rather than using the more fault-tolerant round-robin DNS that our public services use. This hampered our ability to recover quickly, so we are moving those internal DNS names to round-robin DNS as well.
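As an illustration of the first item above, a basic health check for the certificate mount could look something like the sketch below. The mount path and certificate naming are assumptions for the sake of example, not our actual production layout.

```python
import os
import sys

# Hypothetical mount point for illustration; not ChargeOver's actual layout.
CERT_MOUNT = "/mnt/gluster/haproxy-certs"

def mount_is_attached(path: str) -> bool:
    # If the distributed file store connection drops, the directory is often
    # left as a plain local directory rather than an active mount point.
    return os.path.ismount(path)

def certs_are_readable(path: str) -> bool:
    # Basic read probe: the mount should list at least one certificate file.
    try:
        return any(name.endswith(".pem") for name in os.listdir(path))
    except OSError:
        return False

if __name__ == "__main__":
    healthy = mount_is_attached(CERT_MOUNT) and certs_are_readable(CERT_MOUNT)
    status = "OK" if healthy else "FAIL"
    print(f"{status}: certificate mount {CERT_MOUNT}")
    sys.exit(0 if healthy else 1)
```

A check like this can run periodically on each load balancer and raise an alert well before any deployment depends on the mount being healthy.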