ChargeOver primarily uses MariaDB for data storage. At 11:09 CST, the MariaDB process crashed on our primary database server with the following error message, which is still being investigated:
InnoDB: Failing assertion: templ->clust_rec_field_no != ULINT_UNDEFINED
ChargeOver staff were immediately notified, and we began to investigate the issue.
The database restarted automatically and began running automated data integrity checks to ensure that no data was lost or corrupted before it resumed servicing requests.
Although we had the ability to fail over to a secondary database server, we decided to let the integrity checks complete, as the estimated downtime was very short.
The automated data integrity checks took much longer than expected. Our estimated time to recovery was less than an hour; instead, the database server needed approximately 3 hours and 40 minutes, from 11:09 CST to 14:49 CST, to complete its integrity checks.
After the data checks were complete, our team ran through our recovery checklist, and started the database server.
Service was restored in a degraded state at 15:01 CST, and was fully operational at 15:15 CST.
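For readers who want the durations implied by the timestamps above, the arithmetic can be sketched as follows. This is only an illustration: the helper names are hypothetical, and it assumes all four timestamps fall on the same day in the same time zone.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M"  # timestamps as written in this report (CST, same day assumed)


def elapsed_minutes(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds() // 60)


def as_hours_minutes(minutes: int) -> str:
    """Format a minute count as e.g. '3h 40m'."""
    hours, rest = divmod(minutes, 60)
    return f"{hours}h {rest:02d}m"


# Timestamps taken from the report.
CRASH = "11:09"
CHECKS_COMPLETE = "14:49"
DEGRADED_SERVICE = "15:01"
FULL_SERVICE = "15:15"

durations = {
    "integrity checks": elapsed_minutes(CRASH, CHECKS_COMPLETE),
    "crash to degraded service": elapsed_minutes(CRASH, DEGRADED_SERVICE),
    "crash to full service": elapsed_minutes(CRASH, FULL_SERVICE),
    "degraded period": elapsed_minutes(DEGRADED_SERVICE, FULL_SERVICE),
}

for label, minutes in durations.items():
    print(f"{label}: {as_hours_minutes(minutes)}")
```

By this calculation, the integrity checks account for 3h 40m of the 4h 06m between the crash and full restoration of service.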
We recognize that there is much to learn from this incident. We are working to put measures in place to avoid long check times, and to revise our fail-over processes so that they better account for integrity checks that run longer than estimated.
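One way a fail-over process could account for long check times is a time-boxed rule: fail over once recovery has run well past its estimate. The sketch below is purely illustrative and is not ChargeOver's actual policy. The function name, the grace factor of 1.5, and the 60-minute estimate (standing in for "less than an hour") are all assumptions.

```python
def should_fail_over(elapsed_min: float, estimated_min: float,
                     grace_factor: float = 1.5) -> bool:
    """Hypothetical rule: fail over once recovery exceeds its estimate by a margin.

    elapsed_min   -- minutes since the primary began recovering
    estimated_min -- original recovery estimate, in minutes
    grace_factor  -- how far past the estimate to wait (assumed value)
    """
    if estimated_min <= 0 or grace_factor <= 0:
        raise ValueError("estimated_min and grace_factor must be positive")
    return elapsed_min > estimated_min * grace_factor


# With an assumed 60-minute estimate, this rule triggers after 90 minutes.
print(should_fail_over(30, 60))   # still within the estimate
print(should_fail_over(91, 60))   # past the 90-minute deadline
```

A fixed deadline of this kind trades some risk of an unnecessary fail-over for a bound on how long an underestimated recovery can extend an outage.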
Please make sure to subscribe to updates at https://status.ChargeOver.com to be notified of service disruptions in the future.