The ChargeOver team follows an agile software development lifecycle, and we routinely roll out new features and updates to the platform multiple times per week.
Our typical continuous integration/continuous deployment roll-outs look something like this:
ChargeOver uses Docker containers and a redundant set of Docker Swarm nodes for production deployments.
On July 19th at 10:24am CT, we reviewed and deployed an update which, although it passed all automated acceptance tests, caused the deployment of Docker container images to fail silently. This made the application unavailable, and a 503 Service Unavailable error message was shown to all users. The deployment appeared successful to automated systems, but due to a syntax error it only removed the existing application servers rather than replacing them with the new software version. No automated roll-back occurred, because the deployment appeared successful even though it had failed silently.
A single extra space (a single errant spacebar press!) was accidentally added to the very beginning of a docker-compose YAML file, which made the file invalid YAML.
The single-space change was subtle enough to be missed when reviewing the code change. All automated tests passed, because the automated tests do not use the production docker-compose deployment configuration file.
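As an illustration of how this kind of change could be caught before review, here is a minimal sketch of a parser-free check that flags a configuration file whose first content line starts with whitespace. It is a narrow heuristic, not a YAML validator, and it assumes that top-level keys in the deployment file always start in the first column. The function name `leading_whitespace_problem` is hypothetical and is not part of ChargeOver's tooling.

```python
def leading_whitespace_problem(text):
    """Return a description of the problem, or None if the first key looks fine.

    Heuristic only: assumes top-level keys always begin in column 0.
    """
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and comments do not affect YAML block structure.
        if not stripped or stripped.startswith("#"):
            continue
        # Only the first content line is inspected.
        if line[0] in " \t":
            return f"line {number}: unexpected leading whitespace before {stripped!r}"
        return None
    return "file contains no configuration"
```

A check like this would only be useful if it runs against the same file that is deployed to production. A full YAML parser or the deployment tool's own validation would cover far more cases than this single heuristic.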
When the service was deployed, Docker Swarm interpreted the invalid syntax in the YAML file as an empty set of services to deploy, rather than a set of valid application services. This caused the deployment to look successful (it successfully deployed, removing all existing application servers and replacing them with nothing), and thus automated roll-back to a known-good set of services did not happen.
At this time, all services were restored and operational.
[…] YAML syntax.
[…] YAML configuration files, to ensure this failure scenario cannot happen again. There are several things that our team has identified as part of a remediation plan:
Ensure that YAML syntax and/or configuration errors cannot pass automated tests, and thus cannot reach testing/UAT or production environments.
Improve the 503 Service Unavailable message that customers received, directing affected customers to https://status.chargeover.com where they can see real-time updates regarding any system outages.
Some users received "The credentials that you provided are not correct." messages, instead of a notification of the outage. This will be improved.
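To make the idea of a more helpful 503 Service Unavailable response concrete, here is a minimal sketch of a WSGI application that answers every request with an outage notice pointing at the status page. This is an illustration only and does not describe how ChargeOver serves its error pages. The name `outage_app`, the message wording, and the `Retry-After` value of 120 seconds are all assumptions.

```python
STATUS_PAGE_URL = "https://status.chargeover.com"


def outage_app(environ, start_response):
    """Minimal WSGI app that answers every request with a 503 outage notice."""
    body = (
        "Service temporarily unavailable. "
        f"See {STATUS_PAGE_URL} for real-time updates."
    ).encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Retry-After is a standard HTTP header; 120 seconds is an arbitrary example.
        ("Retry-After", "120"),
    ]
    start_response("503 Service Unavailable", headers)
    return [body]
```

Serving a fallback like this from a component that does not depend on the application servers would mean customers see the outage notice even when every application container is gone.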