Voyado Engage - Emails delayed

Incident Report for Voyado

Postmortem

Summary

The issue began at around 15:40 CET on Nov 28, 2024, causing delays in processing, sending and storing messages.

We discovered that the resources used to send messages were under unexpectedly high load, which exceeded their capacity to handle requests efficiently. This primarily led to delays in sending emails. A small number of SMS messages were also indirectly affected by the delays.

Our first attempt at mitigating the issue by scaling resources didn't work, as the problem quickly resurfaced. To address this, we made a targeted adjustment to the code so that it managed the workload better. This update was tested on one affected resource and proved effective, so the fix was then rolled out to all affected resources.

With the system stabilized, we were able to resend all delayed messages, completing the process later that evening. These actions resolved the problem and restored normal email sending from Engage.

Customer Impact

Delays, primarily in email messaging, impacted a large number of tenants. Some messages were delayed by several hours.

Root Cause and Mitigation

The trigger for the issue was high load on the email messaging chain. This load activated a code path with non-ideal resource usage, and in combination the two pushed CPU load and thread counts far beyond anticipated and sustainable levels. This was mitigated by a code fix that reduced the number of threads used, which in turn reduced the load on the servers.
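As an illustration of the general idea behind the mitigation (capping the number of threads so that load stays bounded), here is a minimal Python sketch. It is not the actual Engage code, which this report does not show; the names `send_all`, `ConcurrencyTracker` and `MAX_WORKERS`, and the cap of 4, are assumptions made for the example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # hypothetical cap; the real limit is not stated in the report


class ConcurrencyTracker:
    """Records the highest number of sends running at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self._active -= 1
        return False  # do not swallow exceptions raised by the send


def send_all(message_ids, send, max_workers=MAX_WORKERS):
    """Send every message using at most max_workers threads.

    Returns the peak number of concurrent sends that was observed.
    """
    tracker = ConcurrencyTracker()

    def tracked_send(message_id):
        with tracker:
            send(message_id)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() consumes the results so any exception from send() surfaces here.
        list(pool.map(tracked_send, message_ids))
    return tracker.peak
```

With a fixed-size pool, a burst of messages waits in the executor's queue instead of spawning one thread per message, so the thread count stays at or below the cap regardless of incoming load.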

Next Steps

We will implement further improvements to automatically resend delayed messages, so that the effects of any disturbance can be mitigated quickly. We will also look at architectural changes to reduce the general load in the messaging chain, to be better prepared for future high-load scenarios.
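One way automatic resending could work is sketched below in Python, purely as an illustration. The report does not describe the planned design, so the `QueuedMessage` type, the `resend_delayed` function and the 15-minute threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional


@dataclass
class QueuedMessage:
    message_id: str
    scheduled_at: datetime
    sent_at: Optional[datetime] = None  # None means not yet sent


def resend_delayed(
    messages: List[QueuedMessage],
    send: Callable[[QueuedMessage], None],
    now: datetime,
    max_delay: timedelta = timedelta(minutes=15),  # hypothetical threshold
) -> List[str]:
    """Resend unsent messages that are overdue by more than max_delay.

    The oldest messages go first. Returns the ids that were resent, in order.
    """
    overdue = [
        m for m in messages
        if m.sent_at is None and now - m.scheduled_at > max_delay
    ]
    overdue.sort(key=lambda m: m.scheduled_at)

    resent = []
    for message in overdue:
        send(message)
        message.sent_at = now
        resent.append(message.message_id)
    return resent
```

Running such a routine periodically would pick up delayed messages without manual intervention. Messages still within the threshold are left alone, so the normal send path is not duplicated.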

Posted Dec 04, 2024 - 09:07 CET

Resolved

All delayed messages have now been sent. The incident has been resolved.

Posted Nov 28, 2024 - 23:51 CET

Update

The implemented fix is still having a good effect, and sendouts are now going out as planned. We are still processing the queue of previously delayed messages.

Posted Nov 28, 2024 - 20:45 CET

Monitoring

We are seeing a good effect from a fix we implemented. Engage is returning to normal performance. Sendouts are now going out as planned, and queues are being processed. We will continue to monitor and will post updates on the send queue status.

Posted Nov 28, 2024 - 19:44 CET

Update

We are still seeing delays because of high load. Email sendouts are being queued. We are working on different solutions to make sure emails go out. At the moment, no other parts of Engage are affected.

Posted Nov 28, 2024 - 19:06 CET

Update

Posted Nov 28, 2024 - 18:39 CET

Update

Posted Nov 28, 2024 - 18:07 CET

Update

We are still working to find a solution. The current email sendout delay is approximately 1.5 hours.

Posted Nov 28, 2024 - 17:42 CET

Update

We’re still seeing delays and are working to find a solution. At the moment, no other parts of Engage are affected.

Posted Nov 28, 2024 - 17:33 CET

Update

We’re still seeing delays and are working to find a solution. At the moment, no other parts of Engage are affected.

Posted Nov 28, 2024 - 17:09 CET

Investigating

We are currently experiencing delays in sending emails. We are working to resolve the issue.

Posted Nov 28, 2024 - 16:34 CET

This incident affected: Engage (Messaging).