On July 5th, a firewall-related issue caused ChargeOver to be inaccessible for approximately 90 minutes.
Customers were unable to log in or access the ChargeOver application during this time frame.
At approximately 3:27pm CT, one of our high-availability firewalls experienced an error that caused it to stop passing traffic into and out of ChargeOver’s internal network. The firewall was configured in a way that should trigger automatic fail-over to a secondary firewall (we use industry-standard Netgate firewalls, which use CARP to share IP and network status information and to monitor for fail-over), but the automatic fail-over did not occur.
ChargeOver staff were immediately and automatically notified and, after troubleshooting, resolved the outage by power-cycling the affected firewall.
We are still investigating what caused the error that took the firewall offline, as well as why it did not fail over correctly to the secondary firewall when the error occurred. As we investigate further, we’ll have a better idea of whether the firewall needs to be replaced or reconfigured.
We are improving our documentation so that our engineering team can troubleshoot and isolate root causes more quickly in future incidents.
We are also working with data center staff to establish better monitoring tools, which will further help us troubleshoot and isolate root causes more quickly.