Voyado Engage - Emails delayed

Incident Report for Voyado

Postmortem

Summary
On the morning of February 9th, triggered warnings alerted us to system slowness. Initially, it appeared to be linked to a single tenant's large-scale send-out, but further investigation revealed that multiple tenants were affected. A later alert indicated that a shared in-memory database, which helps process messages efficiently, was unavailable. This caused delays in message processing, impacting approximately 130 tenants. While most messages were eventually processed automatically, some required manual intervention. Delays reached up to three hours, though this only affected a small number of messages for a few tenants.

Customer Impact
Customers experienced delays in their scheduled and automated message send-outs, including both SMS and emails. The disruption was due to a shared in-memory database becoming unavailable, which paused message processing. Once the system resumed, a backlog caused further delays. Our on-call team manually resent messages that got stuck, but a small portion of messages for a few tenants could not be resent. These customers were contacted directly.

Delays ranged from five minutes to a maximum of 240 minutes.

Root Cause
The issue was caused by an unexpected data handover problem in a shared in-memory database, which temporarily lost track of some messages. This database is designed to handle a high volume of messages quickly and efficiently. Normally, if there’s an issue, the system switches to a backup automatically. However, in this case, when the switch happened, some data was lost. As a result, the system had trouble determining which messages had been sent and which were still in progress, leading to delays.

Messages scheduled for processing after the disruption were handled as expected once the system recovered.
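To illustrate the kind of ambiguity described above, here is a minimal sketch of sorting scheduled messages by what a tracking store still knows after a failover. It is not our actual implementation: the function name, the status values, and the idea of comparing a durable list of scheduled IDs against the in-memory tracking data are all assumptions made for the example.

```python
from enum import Enum


class Status(Enum):
    SENT = "sent"
    IN_PROGRESS = "in_progress"
    UNKNOWN = "unknown"  # tracking entry missing, e.g. lost during a failover


def classify_after_failover(scheduled_ids, tracking):
    """Group scheduled message IDs by what the tracking store still knows.

    scheduled_ids: IDs from a durable record of scheduled messages (assumed).
    tracking: dict of message ID -> status string from the in-memory store.
    """
    result = {status: [] for status in Status}
    for message_id in scheduled_ids:
        raw = tracking.get(message_id)
        # A missing entry means we cannot tell whether the message was sent.
        status = Status(raw) if raw is not None else Status.UNKNOWN
        result[status].append(message_id)
    return result


groups = classify_after_failover(
    ["m1", "m2", "m3"],
    {"m1": "sent", "m2": "in_progress"},
)
print(groups[Status.UNKNOWN])
```

In this sketch, messages in the unknown group are the ones that need a decision: resending them risks a duplicate, while skipping them risks a message that is never delivered.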

Mitigation
Since the issue was caused by an automatic switch to a backup system, the system recovered on its own. However, our team had to manually resend messages that had gotten stuck in the process.

Next Steps
We are currently evaluating improvements in the following areas:

  • Enhancing system robustness to minimize the risk of data loss during failovers.
  • Implementing automatic resending of delayed messages to quickly mitigate the effects of any disruption.
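As a rough illustration of what automatic resending could look like, the sketch below finds messages that have been in progress for longer than a threshold and resends each one at most once. Everything here is hypothetical: the five-minute threshold, the message fields, and the `send` callback are assumptions for the example and do not describe a decided design.

```python
from datetime import datetime, timedelta

STUCK_AFTER = timedelta(minutes=5)  # hypothetical threshold, not a real setting


def find_stuck(messages, now, stuck_after=STUCK_AFTER):
    """Return IDs of messages that started processing but never finished."""
    return [
        m["id"]
        for m in messages
        if m["status"] == "in_progress" and now - m["started_at"] > stuck_after
    ]


def resend_stuck(messages, now, send, already_resent):
    """Resend each stuck message once; already_resent guards against duplicates."""
    resent = []
    for message_id in find_stuck(messages, now):
        if message_id in already_resent:
            continue  # handled in an earlier pass
        send(message_id)
        already_resent.add(message_id)
        resent.append(message_id)
    return resent


now = datetime(2025, 2, 9, 10, 0)
messages = [
    {"id": "a", "status": "in_progress", "started_at": now - timedelta(minutes=30)},
    {"id": "b", "status": "in_progress", "started_at": now - timedelta(minutes=1)},
    {"id": "c", "status": "sent", "started_at": now - timedelta(minutes=30)},
]
outbox = []
print(resend_stuck(messages, now, outbox.append, set()))
```

The set of already resent IDs is what makes repeated passes safe in this sketch: running the same check again does not send the same message twice.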

We appreciate your patience and understanding. We remain committed to providing a reliable and seamless experience on the Engage platform. If you have any further questions or concerns, please reach out to our support team.

Posted Feb 25, 2025 - 08:13 CET

Resolved

The issue has now been resolved, and we have processed the delayed messages.
Posted Feb 09, 2025 - 11:26 CET

Identified

We recently experienced an issue that has now been resolved. As a result, some messages may be delayed. We are actively working to send them out as soon as possible.
Posted Feb 09, 2025 - 10:32 CET
This incident affected: Engage (Messaging).