System inaccessible

Incident Report for ChargeOver

Postmortem

Incident details

ChargeOver experienced outages on June 7th and June 8th as a result of planned data center power maintenance combined with an unplanned networking-stack failure.

On Saturday, June 7th, ChargeOver was inaccessible from approximately 13:12 CT to 16:53 CT.

On Sunday, June 8th, ChargeOver was inaccessible from approximately 13:32 CT to 17:05 CT.

Root cause

ChargeOver’s primary data center (Flexential in Chaska, MN) performed a planned maintenance task which de-energized the four uninterruptible power supply (UPS) systems serving the entire data center, one UPS system at a time.

ChargeOver’s network infrastructure is designed to be redundant: ChargeOver receives two separate power feeds from the four data center UPS systems, and each of ChargeOver’s stacked network switches is plugged into one of the two power feeds to provide redundancy.

When one of ChargeOver’s redundant network switches lost power during the data center UPS maintenance, the expectation was that traffic would automatically fail over to the second switch. Either the switches did not fail over correctly, or a switch lost its configuration on fail-over, which resulted in ChargeOver’s systems becoming unavailable to the outside world.

Incident timeline

  • May 29 - Flexential informed ChargeOver of anticipated maintenance. Due to a miscommunication, it was unclear to ChargeOver staff that the maintenance would impact our environment.

  • June 7 - 13:12 CT - Flexential de-energizes and then re-energizes the first power feed, causing a network switch to go offline. The network switch either fails to fail over correctly, or loses its configuration as the power feed comes back on. ChargeOver staff are immediately notified.
  • June 7 - 13:19 CT - Attempts to reset the network switch fail to restore service.
  • June 7 - 16:51 CT - One network switch is removed, and the configuration on the second switch is re-loaded.
  • June 7 - 16:51 CT - Network availability is restored.

  • June 8 - 13:32 CT - Flexential de-energizes and then re-energizes the second power feed, causing a network switch to go offline. The network switch either fails to boot properly, or again loses its configuration as the power feed comes back on. ChargeOver staff are immediately notified.
  • June 8 - 14:41 CT - Attempts to reset the network switch fail to restore service.
  • June 8 - 17:03 CT - The configuration on the second switch is re-loaded.
  • June 8 - 17:05 CT - Network availability is restored.

Remediation plan

We deeply apologize for the downtime, and are already working on a number of improvements to our infrastructure.

  • We have replaced a suspected-faulty network switch.
  • We are working internally to establish better procedures for coordinating planned maintenance windows in the future.
  • We are investigating network switch options and fail-over configurations to identify improvements to our network stack that would avoid future service interruptions of this type.
  • We are investigating possible UPS upgrades or replacements to provide extra power redundancy beyond what is already provided by the Flexential data center.
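Outages like these are typically detected by external reachability monitoring, which is how staff were notified within minutes in the timeline above. As a minimal sketch (the endpoints and alerting here are hypothetical, not ChargeOver's actual tooling), such a probe can be as simple as attempting TCP connections from outside the affected network:

```python
import socket

def is_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_endpoints(endpoints):
    """Probe each (host, port) pair; return the list of pairs that failed."""
    return [(h, p) for h, p in endpoints if not is_reachable(h, p)]

if __name__ == "__main__":
    # Hypothetical endpoint list; a real monitor would run from multiple
    # external vantage points and page on-call staff when checks fail.
    failed = check_endpoints([("example.com", 443)])
    print("DOWN:", failed if failed else "none")
```

A probe like this must run outside the monitored data center, since an in-rack monitor would lose connectivity along with the switches it is meant to watch.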
Posted Jun 27, 2025 - 16:16 CDT

Resolved

This incident has been resolved.

A postmortem will follow.
Posted Jun 07, 2025 - 17:21 CDT

Update

We are continuing to monitor for any further issues.
Posted Jun 07, 2025 - 17:11 CDT

Monitoring

A networking fix has been applied, and services are coming back up.

We are monitoring to ensure everything is accessible.
Posted Jun 07, 2025 - 17:11 CDT

Identified

We have identified a networking issue, and are working on a resolution.
Posted Jun 07, 2025 - 16:42 CDT

Investigating

We are currently investigating this issue.
Posted Jun 07, 2025 - 16:00 CDT
This incident affected: Main Application, Search, Email Sending, Payment Processing, Developer Docs, and Integrations.