Service Outage

Incident Report for ChargeOver

Postmortem

ChargeOver primarily uses the MariaDB database for data storage. The MariaDB process crashed on our primary database server at 11:09 CST. The process crashed with the following error message, which is still being investigated:

InnoDB: Failing assertion: templ->clust_rec_field_no != ULINT_UNDEFINED

ChargeOver staff were immediately notified, and we began to investigate the issue.

The database was automatically restarted immediately and began to run automated data integrity checks to ensure that no data was lost or corrupted before it began servicing requests again.

Although we had the ability to fail-over to a secondary database server, a decision was made to let the process complete, as the estimated downtime was very short.

This automated data integrity check took much longer than originally expected. Our estimated time to recovery was less than an hour; instead, the database server took approximately 3 hours and 40 minutes to complete its integrity checks and restart safely. The data integrity checks ran from 11:09 CST to 14:49 CST.

After the data checks were complete, our team ran through our recovery checklist, and started the database server.

Service was restored in a degraded state at 15:01 CST, and fully operational at 15:15 CST.
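
For readers who want to check the durations, the timeline can be tallied with a short script. This is only an illustrative sketch: it uses the four times stated in this postmortem and assumes they all fall on the same day in the same time zone. The variable and function names are hypothetical.

```python
from datetime import datetime

# Times as stated in the postmortem (assumed: same day, same time zone).
FMT = "%H:%M"
crash = datetime.strptime("11:09", FMT)        # MariaDB process crashed
checks_done = datetime.strptime("14:49", FMT)  # integrity checks finished
degraded = datetime.strptime("15:01", FMT)     # service restored, degraded
restored = datetime.strptime("15:15", FMT)     # service fully operational


def minutes_between(start, end):
    """Return the whole number of minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


timeline = {
    "integrity checks": minutes_between(crash, checks_done),
    "crash to degraded service": minutes_between(crash, degraded),
    "crash to full service": minutes_between(crash, restored),
}

for label, mins in timeline.items():
    print(f"{label}: {mins // 60} h {mins % 60:02d} min")
```

By this count the integrity checks account for 220 of the 246 minutes between the crash and full restoration.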

We recognize that there is much to learn from this incident. We are working on measures to avoid long integrity check times, and we are revising our fail-over processes to better account for the possibility of long checks in the future.

Please make sure to subscribe to updates at https://status.ChargeOver.com to be notified of service disruptions in the future.

Posted Sep 10, 2020 - 08:32 CDT

Resolved

This issue has been resolved and all services are operational. Root cause and postmortem will follow.

Posted Sep 09, 2020 - 15:39 CDT

Update

All ChargeOver services are operational now. We continue to monitor the situation.

A postmortem will follow.

Posted Sep 09, 2020 - 15:25 CDT

Monitoring

We continue to monitor the situation as services are being restored.

Posted Sep 09, 2020 - 15:17 CDT

Update

Services are being restored. We are continuing to monitor the situation and will provide further updates.

Posted Sep 09, 2020 - 15:16 CDT

Update

We are continuing to work towards restoration of service. More updates to follow.

Posted Sep 09, 2020 - 13:43 CDT

Identified

We have identified the issue, and are working to resolve the outage as quickly as possible.

Posted Sep 09, 2020 - 11:38 CDT

Investigating

We are aware of an issue, and are investigating.

Further information to follow.

Posted Sep 09, 2020 - 11:15 CDT

This incident affected: Main Application and Payment Processing.