Between 09:12 and 09:52 on February 2nd we encountered an issue affecting a central in-memory database used by many processes in the platform. The issue left the service in a state that did not trigger failover to backup services, causing various anomalies throughout the platform. Among these were a large number of delayed messages (requiring manual resend), automation events not triggering, reported login issues, and more.
The issue mainly affected customers whose message sends and activity executions fell within the 09:12 - 09:52 time frame; those were processed with a delay.
Root Cause
The root cause of the issue was the central in-memory database ending up in a bad state. The database stores data for quick access throughout the platform in a high-load, low-latency configuration. Because this data is needed by various processes, the effect spread over multiple parts of the platform, though only specific use cases were severely affected from a user perspective. The database has a redundant setup, with one primary and multiple replicas, where failover to a replica is automatic should the primary run into issues. In this instance, all servers in the setup ended up in a replica state, with no primary active, which caused the issue.
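The failure mode described above can be sketched in a few lines: automatic failover assumes there is a primary to fail away from, so a cluster where every node reports itself as a replica never trips the failover logic. This is a minimal illustration only; the node names, roles, and the find_primary helper are assumptions, not the platform's actual tooling.

```python
def find_primary(node_roles: dict) -> "str | None":
    """Return the name of the primary node, or None if every node is a replica."""
    primaries = [name for name, role in node_roles.items() if role == "primary"]
    return primaries[0] if primaries else None


# Healthy setup: one primary, so failover has something to react to if it dies.
healthy = {"db-1": "primary", "db-2": "replica", "db-3": "replica"}

# The state in this incident: every node reports "replica". Automatic failover
# never fires, because it is triggered by a primary failing, not by one missing.
degraded = {"db-1": "replica", "db-2": "replica", "db-3": "replica"}
```

A monitoring check built on this idea (alerting when find_primary returns None) would catch the "no primary anywhere" state directly, rather than waiting for a primary-failure event that never comes.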
Mitigation
Enforce primary: To remediate the issue we enforced a new primary in the configuration, which returned Engage to a normal state where messages and execution of activities functioned as expected.
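The report does not name the database product or the exact command used, so as a hedged sketch only: "enforcing a primary" typically means promoting one chosen replica and leaving the rest as replicas of it. The enforce_primary helper below is hypothetical; in a Redis-style setup the promotion step would be a command such as REPLICAOF NO ONE issued on the chosen node.

```python
def enforce_primary(node_roles: dict, chosen: str) -> dict:
    """Return a new role map with `chosen` promoted to primary.

    Hypothetical helper for illustration; a real operation would issue the
    database's own promotion command on the chosen node and repoint the others.
    """
    if chosen not in node_roles:
        raise ValueError(f"unknown node: {chosen}")
    return {name: ("primary" if name == chosen else "replica")
            for name in node_roles}


# The stuck state from the incident: all replicas, no primary.
stuck = {"db-1": "replica", "db-2": "replica", "db-3": "replica"}
recovered = enforce_primary(stuck, "db-1")
```

After this step the cluster again has exactly one primary, so writes resume and the automatic failover logic has a valid topology to protect.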
Unfortunately, this has happened before, and although we took action to prevent a recurrence, it happened again. Even after investigation, the root cause is still not clear. We now intend to review our current in-memory database setup and take action to upgrade and update it.