Summary
On the morning of April 21st, Voyado Engage experienced an issue causing delays in the delivery of email messages. This primarily impacted messages sent through automation workflows. While no messages were lost, many were delivered later than intended. The situation was fully resolved the same day, and we are taking steps to ensure it does not recur.
Customer Impact
Approximately fifty percent of our customer base was affected by the incident. Most of the delays impacted automated email workflows, though some manual send-outs were also affected. While all messages were eventually delivered, delays ranged from about 30 minutes to as much as 3 hours for some customers.
Root Cause
The incident was mainly caused by inefficient memory management in the mail-processing application code. Over time, servers' memory usage steadily increased, peaking on April 21st. Combined with a few exceptionally large email campaigns, the system experienced severe resource pressure:
Memory Leaks: Memory was not properly released, causing sustained high usage that led to time-outs and storage delays as well as high CPU load.
Importantly, no failures were detected in our cloud infrastructure, and no messages were lost.
Mitigation
Once the incident was identified:
By 19:00 CEST on April 21st, all messages had been successfully sent and the system was back to a healthy operational state.
Next Steps
To prevent similar issues in the future, we are evaluating, and where needed adjusting, memory utilization in the application, and we are fine-tuning our monitoring of memory and storage health. We are also updating our incident management process to enable faster mitigation should similar symptoms appear.
We appreciate your patience and understanding, and apologize for any inconvenience. We remain committed to providing a stable and reliable platform experience.