ChargeOver main application is unavailable

Incident Report for ChargeOver

Postmortem

Incident details

The ChargeOver team follows an agile software development lifecycle and routinely rolls out new features and updates to the platform multiple times per week.

Our typical continuous integration/continuous deployment roll-outs look something like this:

Developers make changes
Code review by other developers
Automated linting and security testing are performed
Senior-level developer code review before deploying features to production
Automated deploy of new updates to production environment
If an error occurs, roll back the changes to a previously known-good configuration

ChargeOver uses Docker containers and a redundant set of Docker Swarm nodes for production deployments.

On July 19th at 10:24am CT we deployed a reviewed update that passed all automated acceptance tests but caused the deployment of Docker container images to fail silently. This made the application unavailable, and a 503 Service Unavailable error message was shown to all users. The deployment appeared successful to automated systems, but because of a syntax error it only removed the existing application servers rather than replacing them with the new software version. No automated roll-back occurred, because the deployment appeared successful even though it had failed silently.

Root cause

A single extra space (a single errant spacebar press!) was accidentally added to the very beginning of a docker-compose YAML file, which made the file invalid YAML.

The single-space change was subtle enough to be missed when reviewing the code change. All automated tests passed, because the automated tests do not use the production docker-compose deployment configuration file.
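To illustrate how small the change was and how mechanically it can be caught, here is a minimal sketch of an indentation check. It is a heuristic written for this report, not the validation our team deployed and not a YAML parser; a real pipeline would more likely run a full YAML parser or the compose tooling's own validation. The function name, the sample file contents, and the image name example/app are all hypothetical.

```python
def find_outdented_lines(text: str) -> list[int]:
    """Return 1-based numbers of lines indented less than the first content line.

    Heuristic only: blank lines, comments, and document markers are skipped,
    and tabs or flow-style documents are not handled.
    """
    base_indent = None
    offenders = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or stripped in ("---", "..."):
            continue
        indent = len(line) - len(line.lstrip(" "))
        if base_indent is None:
            # The first content line sets the indentation of the top-level block.
            base_indent = indent
        elif indent < base_indent:
            # A later line sits to the left of the top-level block.
            offenders.append(number)
    return offenders


# Hypothetical file contents; the real configuration file is not shown here.
good = "version: '3'\nservices:\n  app:\n    image: example/app\n"
bad = " " + good  # the single leading space described above

print(find_outdented_lines(good))  # []
print(find_outdented_lines(bad))   # [2]
```

In the second case the first line is indented by one space, so the top-level services key on line 2 falls outside the block that line 1 opened.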

When deploying the service to Docker Swarm, Docker Swarm interpreted the invalid syntax in the YAML file as an empty set of services to deploy, rather than a set of valid application services to be deployed. This caused the deployment to look successful (it successfully deployed, removing all existing application servers, and replacing them with nothing) and thus automated roll-back to a known-good set of services did not happen.
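The failure mode described above suggests two simple guards: refuse to deploy a definition that contains no services, and after deploying, confirm that every expected service actually has running replicas instead of trusting the deploy step's reported success. The sketch below assumes the stack definition has already been parsed into a Python dictionary and that replica counts are gathered elsewhere; all names in it are hypothetical and it does not describe our actual pipeline.

```python
class EmptyDeploymentError(Exception):
    """Raised when a stack definition would deploy no services."""


def require_services(stack: object) -> list[str]:
    """Return the service names in a parsed stack definition, or raise if none."""
    services = stack.get("services") if isinstance(stack, dict) else None
    if not isinstance(services, dict) or not services:
        raise EmptyDeploymentError("stack definition contains no services")
    return sorted(services)


def missing_after_deploy(expected: list[str], running: dict[str, int]) -> list[str]:
    """List expected services that have no running replicas after a deploy."""
    return [name for name in expected if running.get(name, 0) < 1]


# Hypothetical service names and replica counts, for illustration only.
expected = require_services({"services": {"app": {}, "worker": {}}})
print(missing_after_deploy(expected, {"app": 2, "worker": 0}))  # ['worker']
```

A non-empty result from the second check is the signal that should trigger a roll-back, even when the deploy command itself reported no error.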

Incident timeline

10:21am CT - Change was reviewed and merged from a staging branch to our production branch.
10:24am CT - Change was deployed to production, immediately causing an outage.
10:29am CT - Our team posted a status update here, notifying affected customers.
10:36am CT - Our team identified the related errant change, and started to revert to a known-good set of services.
11:06am CT - All services became operational again after redeploying the last known-good configuration.

At this time, all services were restored and operational.

11:09am CT - Our team identified exactly what was wrong - an accidentally added single space character at the beginning of a configuration file, causing the file to be invalid YAML syntax.
11:56am CT - Our team made a change to validate the syntax of the YAML configuration files, to ensure this failure scenario cannot happen again.

Remediation plan

There are several things that our team has identified as part of a remediation plan:

We have already deployed multiple checks to ensure that invalid YAML syntax and/or configuration errors cannot pass automated tests, and thus cannot reach testing/UAT or production environments.
Our team will work to improve the very generic 503 Service Unavailable message that customers received, directing affected customers to https://status.chargeover.com where they can see real-time updates regarding any system outages.
Customers logging in via https://app.chargeover.com received a generic "The credentials that you provided are not correct." message instead of a notification of the outage. This will be improved.
Our team will do a review of our deployment pipelines, to see if we can identify any other similar potential failure points.
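As a rough illustration of the two messaging improvements above, the sketch below builds an outage response that points to the status page and chooses a login error that only blames credentials when the backend is actually reachable. The function names, the message wording, and the Retry-After value are assumptions made for this example, not the final implementation.

```python
from http import HTTPStatus

STATUS_PAGE_URL = "https://status.chargeover.com"


def outage_response(retry_after_seconds: int = 60):
    """Build a (status, headers, body) triple for requests made during an outage."""
    body = (
        "ChargeOver is temporarily unavailable. "
        f"Real-time updates are posted at {STATUS_PAGE_URL}."
    )
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        # Hypothetical hint to clients about when to try again.
        "Retry-After": str(retry_after_seconds),
    }
    return HTTPStatus.SERVICE_UNAVAILABLE, headers, body


def login_error_message(backend_available: bool) -> str:
    """Pick the login error text; only blame credentials when the backend is up."""
    if not backend_available:
        return f"We are experiencing an outage. See {STATUS_PAGE_URL} for updates."
    return "The credentials that you provided are not correct."
```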

Posted Jul 19, 2023 - 13:33 CDT

Resolved

The incident has been resolved.

A postmortem will be provided.

Posted Jul 19, 2023 - 11:17 CDT

Monitoring

All systems have been restored.

We are monitoring the fix. A postmortem will be posted.

Posted Jul 19, 2023 - 11:12 CDT

Update

We are continuing to work on a fix for this issue.

Posted Jul 19, 2023 - 10:52 CDT

Identified

We have identified the problem.

ETA to resolution is less than 30 minutes.

Posted Jul 19, 2023 - 10:36 CDT

Investigating

We are aware of the problem, and are investigating.

We will post further updates as we have them.

Posted Jul 19, 2023 - 10:29 CDT

This incident affected: Main Application, Search, Email Sending, Payment Processing, and Integrations.