The ChargeOver team follows an agile software development lifecycle, and we routinely roll out new features and updates to the platform multiple times per week.
Our typical continuous integration/continuous deployment roll-outs look something like this:
Docker containers and a redundant set of Docker Swarm nodes for production deployments.
On July 19th at 10:24am CT, we reviewed and deployed an update which, although it passed all automated acceptance tests, caused the deployment of Docker container images to silently fail. This caused the application to become unavailable, and a 503 Service Unavailable error message was shown to all users. The deployment appeared successful to automated systems, but due to a syntax error it only removed the existing application servers rather than replacing them with the new software version. No automated roll-back occurred, because the deployment appeared successful even though it had failed silently.
A single extra space (a single errant spacebar press!) was accidentally added to the very beginning of a YAML file, which caused the docker-compose file to be invalid.
The single-space change was subtle enough to be missed when reviewing the code change. All automated tests passed, because the automated tests do not use the production docker-compose deployment configuration file.
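To illustrate how a change this small could be caught mechanically, here is a minimal sketch of a lint check that flags a configuration file whose first content line is indented. It is a heuristic based on the convention that top-level keys start in the first column, not a full YAML validator, and the function names and sample file contents are hypothetical rather than taken from our actual pipeline.

```python
from typing import Optional


def first_content_line_indent(text: str) -> Optional[int]:
    """Return the indent width of the first non-blank, non-comment line.

    Returns None when the text contains no content lines at all.
    """
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines, comments, and the optional document marker.
        if not stripped or stripped.startswith("#") or stripped == "---":
            continue
        return len(line) - len(line.lstrip(" \t"))
    return None


def check_top_level_indent(text: str) -> None:
    """Raise ValueError if the file is empty or its first content line is indented."""
    indent = first_content_line_indent(text)
    if indent is None:
        raise ValueError("configuration file has no content")
    if indent > 0:
        raise ValueError(
            f"first content line is indented by {indent} character(s); "
            "top-level keys are expected to start in the first column"
        )


# Hypothetical sample contents, for illustration only.
good = 'version: "3.8"\nservices:\n  app:\n    image: example/app:1.2.3\n'
bad = " " + good  # the same file with one errant leading space

check_top_level_indent(good)  # passes silently
try:
    check_top_level_indent(bad)
except ValueError as exc:
    print(f"rejected: {exc}")
```

A check like this only helps if it runs against the same file that is deployed to production, which is the gap described above.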
When deploying the service, Docker Swarm interpreted the invalid syntax in the YAML file as an empty set of services to deploy, rather than as a set of valid application services. This made the deployment look successful (it did deploy successfully, removing all existing application servers and replacing them with nothing), and thus the automated roll-back to a known-good set of services did not happen.
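The failure mode above suggests two guards: refuse to deploy a configuration that resolves to zero services, and after deploying, compare what is actually running against what was expected instead of trusting the deployment's own success status. The sketch below shows both as plain functions. It assumes the configuration has already been parsed into a mapping and that the list of running service names is obtained separately (for example from the Docker CLI); all names here are hypothetical and this is not a description of our production tooling.

```python
from collections.abc import Mapping
from typing import Any, Iterable, List


class EmptyDeploymentError(RuntimeError):
    """Raised when a parsed configuration would deploy no services."""


def require_services(config: Any) -> List[str]:
    """Return the sorted service names, or raise if there are none.

    `config` is assumed to be the already-parsed compose configuration.
    """
    if not isinstance(config, Mapping):
        raise EmptyDeploymentError("configuration did not parse to a mapping")
    services = config.get("services") or {}
    if not services:
        raise EmptyDeploymentError("configuration defines no services")
    return sorted(services)


def missing_after_deploy(expected: Iterable[str], running: Iterable[str]) -> List[str]:
    """Return the expected services that are not running after a deployment."""
    return sorted(set(expected) - set(running))


# Hypothetical parsed configuration, for illustration only.
config = {"services": {"app": {"image": "example/app:1.2.3"},
                       "worker": {"image": "example/worker:1.2.3"}}}

expected = require_services(config)
# Simulate the incident: the deployment "succeeded" but nothing is running.
missing = missing_after_deploy(expected, running=[])
if missing:
    print(f"deployment incomplete, roll back; missing: {missing}")
```

With the second check in place, a deployment that removes every service would be treated as a failure and could trigger a roll-back, even if the deployment step itself reported success.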
At this time, all services were restored and operational.
YAML configuration files, to ensure this failure scenario cannot happen again.
There are several things that our team has identified as part of a remediation plan:
YAML syntax and/or configuration errors cannot pass automated tests, and thus cannot reach testing/UAT or production environments.
503 Service Unavailable message that customers received, directing affected customers to https://status.chargeover.com where they can see real-time updates regarding any system outages.
"The credentials that you provided are not correct." messages, instead of a notification of the outage. This will be improved.