Summary
Between 07:30 and 08:20 on September 17th, a central in-memory database that multiple processes across the platform depend on developed a fault. The fault prevented failover to the backup replicas, which caused several platform anomalies: significant delays in message processing (requiring manual resends), failed automation events, login issues, and other disruptions.
Customer Impact
Customers with messages or activities scheduled between 07:30 and 08:20 were the main group affected and experienced delays in execution. Scheduled jobs were disrupted during this period. Many self-healed on their next scheduled run, but certain critical activities, such as birthday automation events, may not have triggered as expected, potentially impacting some customers.
Root Cause and Mitigation
Root Cause
The root cause was a failure in a central in-memory database, which entered a degraded state. This database is designed to store data for rapid access under high-load, low-latency conditions. As the database is essential to numerous platform processes, the issue spread across multiple areas of the system, although only specific use cases were significantly impacted from a user perspective.
The database operates in a redundant primary-replica configuration, with automatic failover to a replica when the primary encounters issues. In this instance, all servers shifted into a replica state, leaving no active primary, which led to widespread disruptions.
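The failure state described above, where every server reports itself as a replica and none as primary, can be checked for explicitly. The following is a minimal sketch only, assuming each server's role is available as a plain string. The function name `classify_cluster`, the node names, and the role values are hypothetical and are not taken from the platform or from any specific database product.

```python
from collections import Counter


def classify_cluster(roles: dict[str, str]) -> str:
    """Classify a primary/replica cluster from the role each node reports.

    `roles` maps a node name to "primary" or "replica" (hypothetical format).
    """
    counts = Counter(roles.values())
    primaries = counts.get("primary", 0)
    if primaries == 1:
        return "healthy"
    if primaries == 0:
        # The state seen in this incident: no node is acting as primary.
        return "no_primary"
    # More than one primary is a different fault and needs separate handling.
    return "multiple_primaries"


if __name__ == "__main__":
    reported = {"node-a": "replica", "node-b": "replica", "node-c": "replica"}
    print(classify_cluster(reported))
```

A check of this kind could feed an alert, so that a cluster with no primary is noticed even when every individual server still responds to health checks.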
Mitigation
Enforce primary: To resolve the issue, we manually designated a new primary within the configuration. This restored normal platform operations, allowing messages and scheduled activities to resume as expected.
Next Steps
To prevent future occurrences, we have enhanced how the system handles downtime of our in-memory database. This improvement ensures that the system automatically returns to the correct state once the database is back online, reducing the risk of similar disruptions going forward.
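As an illustration of what automatic recovery from a state with no primary might look like, the sketch below promotes the most up-to-date reachable replica when no primary exists. It is not a description of the change that was actually deployed. The `Node` type, the `offset` field (a stand-in for how much data a replica has received), and the `recover` function are all hypothetical, and a real promotion would be a command sent to the database rather than a field update.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    role: str               # "primary" or "replica" (hypothetical values)
    offset: int             # hypothetical measure of how up to date the node is
    reachable: bool = True


def recover(nodes: list[Node]) -> Optional[Node]:
    """Promote one replica if no reachable node is currently primary.

    Returns the promoted node, or None if nothing was needed or possible.
    """
    live = [n for n in nodes if n.reachable]
    if any(n.role == "primary" for n in live):
        return None  # a primary already exists, nothing to do
    if not live:
        return None  # no reachable node to promote
    # Prefer the most up-to-date replica; break ties by name so that
    # repeated runs on the same input pick the same node.
    candidate = max(live, key=lambda n: (n.offset, n.name))
    candidate.role = "primary"  # stand-in for a real promotion command
    return candidate


if __name__ == "__main__":
    cluster = [
        Node("node-a", "replica", offset=10),
        Node("node-b", "replica", offset=25),
        Node("node-c", "replica", offset=30, reachable=False),
    ]
    promoted = recover(cluster)
    print(promoted.name if promoted else "no action")
```

Running the same routine again after a successful promotion takes no action, which is the property that makes it safe to call periodically.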