Summary
On the morning of February 9th, triggered warnings alerted us to system slowness. It initially appeared to be linked to a single tenant's large-scale send-out, but further investigation showed that multiple tenants were affected. A later alert indicated that a shared in-memory database, which helps process messages efficiently, was unavailable. This delayed message processing for approximately 130 tenants. Most messages were eventually processed automatically, but some required manual intervention. Delays reached up to three hours, though delays of that length affected only a small number of messages for a few tenants.
Customer Impact
Customers experienced delays in their scheduled and automated send-outs, for both SMS and email. The shared in-memory database became unavailable, which paused message processing. Once processing resumed, the resulting backlog caused further delays. Our on-call team manually resent the messages that had become stuck, but a small portion of messages for a few tenants could not be resent. These customers were contacted directly.
Delays ranged from five minutes to a maximum of 240 minutes.
Root Cause
The issue was caused by an unexpected problem during a data handover in a shared in-memory database, which left the database unable to track some messages. This database is designed to handle a high volume of messages quickly and efficiently. Normally, if there’s an issue, the system switches to a backup automatically. In this case, however, some data was lost during the switch. As a result, the system could not reliably determine which messages had already been sent and which were still in progress, which led to delays.
Messages scheduled for processing after the disruption were handled as expected once the system recovered.
Mitigation
Because the issue was caused by an automatic switch to a backup system, the system recovered on its own. However, our team had to manually resend the messages that had become stuck during the switch.
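For readers who want a more concrete picture of what "stuck" means here, the following is a minimal illustrative sketch, not the platform's actual code. It assumes a hypothetical record per message with a scheduled time and an optional confirmed send time, and treats a message as stuck if it was scheduled at or before the switch to the backup and has no confirmed send. All names (`MessageRecord`, `find_stuck_messages`, `failover_at`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class MessageRecord:
    # Hypothetical shape of a message record; not the real schema.
    message_id: str
    scheduled_at: datetime
    sent_at: Optional[datetime] = None  # None means no confirmed send


def find_stuck_messages(
    records: List[MessageRecord], failover_at: datetime
) -> List[str]:
    """Return IDs of messages that may need a manual resend.

    A message counts as stuck when it was scheduled at or before the
    switch to the backup and has no confirmed send time. Messages
    scheduled after the switch are left alone, since they were
    processed normally once the system recovered.
    """
    return [
        r.message_id
        for r in records
        if r.scheduled_at <= failover_at and r.sent_at is None
    ]


if __name__ == "__main__":
    failover = datetime(2025, 2, 9, 9, 0)  # illustrative timestamp only
    records = [
        MessageRecord("a", datetime(2025, 2, 9, 8, 50), datetime(2025, 2, 9, 8, 51)),
        MessageRecord("b", datetime(2025, 2, 9, 8, 55)),  # no confirmed send
        MessageRecord("c", datetime(2025, 2, 9, 9, 30)),  # scheduled after switch
    ]
    print(find_stuck_messages(records, failover))
```

In this sketch, only message `b` would be flagged: `a` has a confirmed send, and `c` was scheduled after the switch.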
Next Steps
We are currently evaluating improvements in the following areas:
We appreciate your patience and understanding. We remain committed to providing a reliable and seamless experience on the Engage platform. If you have any further questions or concerns, please reach out to our support team.