We are currently experiencing a delay in sending messages.

Incident Report for Voyado

Postmortem

Summary
On the morning of April 21st, Voyado Engage experienced an issue causing delays in the delivery of email messages. This primarily impacted messages sent through automation workflows. While no messages were lost, many were delivered later than intended. The situation was fully resolved the same day, and we are taking steps to ensure it does not recur.

Customer Impact
Approximately fifty percent of our customer base was affected by the incident. Most of the delays impacted automated email workflows, though some manual send-outs were also affected. While all messages were eventually delivered, delays ranged from about 30 minutes to as much as 3 hours for some customers.

Root Cause
The incident was mainly caused by inefficient memory management in the mail-processing application code. Over time, servers' memory usage steadily increased, peaking on April 21st. Combined with a few exceptionally large email campaigns, the system experienced severe resource pressure:

  • Memory Leaks: Memory was not properly released, causing sustained high usage that in turn led to timeouts, storage delays and high CPU load:

    • Timeouts and Storage Delays: Because of the high memory usage, the platform could not write data to storage fast enough, resulting in application slowdowns.
    • CPU Load: Some mail servers reached abnormally high CPU usage, worsening the delays.

Importantly, no failures were detected in our cloud infrastructure, and no messages were lost.

Mitigation
Once the incident was identified:

  • A full application deploy was initiated to free up memory and stabilize the system, essentially performing a reboot of the application.
  • On-call engineers monitored the queues and gradually cleared all delayed messages.
  • Additional manual steps were taken to restart any stuck processes, ensuring no message was left behind.

By 19:00 CEST on April 21st, all messages had been successfully sent and the system was back to a healthy operational state.

Next Steps
To prevent similar issues in the future, we are taking several actions to evaluate and potentially adjust memory utilization in the application, in addition to fine-tuning our monitoring of memory and storage health. We are also updating our incident management process to enable faster mitigation should similar symptoms appear.
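
To illustrate the kind of memory monitoring mentioned above, here is a minimal sketch of a trend-based check that flags steadily rising memory usage before it reaches a limit. It is an illustration only and not a description of our production monitoring: the function names, the units (MiB), the sampling interval and the thresholds are all assumptions made for this example.

```python
from statistics import mean


def memory_growth_per_sample(samples):
    """Least-squares slope of memory usage across equally spaced samples.

    `samples` is a hypothetical list of memory readings (e.g. in MiB),
    oldest first. The result is the average growth per sampling interval.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def should_alert(samples, limit, horizon):
    """Return True if the fitted trend would reach `limit` within
    `horizon` further sampling intervals.

    `limit` and `horizon` are illustrative thresholds, not real settings.
    """
    if not samples:
        return False
    slope = memory_growth_per_sample(samples)
    if slope <= 0:
        # Flat or shrinking usage: nothing to project.
        return False
    projected = samples[-1] + slope * horizon
    return projected >= limit


if __name__ == "__main__":
    readings = [100, 110, 120, 130]  # made-up readings in MiB
    print(should_alert(readings, limit=200, horizon=7))
```

A check like this alerts on the direction of memory usage rather than only on a fixed ceiling, which is one way slow growth over several days could be noticed before it combines with unusually large campaigns.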

We appreciate your patience and understanding, and apologize for any inconvenience. We remain committed to providing a stable and reliable platform experience.

Posted May 02, 2025 - 10:24 CEST

Resolved

This incident has been resolved.
Posted Apr 21, 2025 - 23:58 CEST

Update

The degradation has been mitigated, and we are currently working on the aftermath to make sure all delayed messages are sent.
Posted Apr 21, 2025 - 15:16 CEST

Update

We are continuing to investigate this issue and working on a solution.
Posted Apr 21, 2025 - 12:53 CEST

Investigating

We are currently experiencing a delay in sending messages. We are investigating this and working on a solution.
Posted Apr 21, 2025 - 11:32 CEST
This incident affected: Engage (Messaging).