[Engage] Message sendout delays

Incident Report for Voyado

Postmortem

Summary

On January 29th, between approximately 09:30 and 20:00 CET, we experienced delays in message processing within the Engage platform. As a result, messages such as emails and SMS, including automated communications, were sent later than expected. Some customers also saw promotions being assigned with a delay.

Customer Impact

Customers faced delays of 30-60 minutes in message delivery, affecting both scheduled and automated communications such as welcome emails and order confirmations. Additionally, some customers had trouble assigning promotions. All messages were eventually delivered; our incident response team manually resent the few that had become stuck.

Root Cause and Mitigation

Root cause

The issue was caused by a recent update to the system responsible for distributing messages between available resources. The update, which upgraded a component to the latest Microsoft version, unintentionally slowed down message processing and created a bottleneck. As message volume grew throughout the day, the backlog compounded and the delays became more noticeable.

Mitigation

Once our monitoring systems flagged the delay, our incident response team immediately began investigating. To resolve the issue, we:
· Deployed a temporary fix to gain better insight into what was causing the slowdown.
· Identified that some areas of the system were handling messages more efficiently than others and adjusted message distribution to relieve pressure on the slowest parts.
· Made system adjustments to ensure messages could be processed at normal speed again, significantly reducing the delays.

Next Steps

After the incident, we applied permanent system updates to prevent this issue from happening again. These changes have been successfully implemented and are being closely monitored. Moving forward, we will:
· Continue working with our system providers to ensure the platform remains stable.
· Improve our ability to detect similar issues earlier, in our testing environments, so that potential delays are caught before they affect customers.

We appreciate your patience during this incident and remain committed to providing a reliable and seamless experience on the Engage platform. If you have any further questions or concerns, please do not hesitate to reach out to our support team.

Posted Feb 06, 2025 - 13:31 CET

Resolved

The incident has now been resolved, following active monitoring of the system after our recent fix. We have handled the delayed messages: the majority have been resent, and in a few cases they have been rescheduled.
Posted Jan 29, 2025 - 22:17 CET

Monitoring

The fix implemented in our recent deploy has shown the desired effect, and new messages (email and SMS) are being sent as expected. However, there is an accumulated queue of messages that should have been sent throughout the day, and these will be delayed. We are working on managing the delayed messages.
Posted Jan 29, 2025 - 20:41 CET

Update

The status of the delays remains unchanged. We are actively troubleshooting the issue and are waiting to see the effect of our recent update.
Posted Jan 29, 2025 - 20:16 CET

Update

We are rolling out an update that includes fixes to help address the issue and additional monitoring tools to better understand the root cause.
Posted Jan 29, 2025 - 18:12 CET

Update

We are still actively troubleshooting, and delays are still present. Currently, the average delay is 45 minutes on all outgoing messages.
Posted Jan 29, 2025 - 16:47 CET

Update

We are still in active troubleshooting and delays are unchanged.
Posted Jan 29, 2025 - 15:50 CET

Update

Our attempt to mitigate the issue with the implemented fix did not have the desired effect, and we are continuing our troubleshooting. Delays in messaging are unchanged.
Posted Jan 29, 2025 - 15:08 CET

Update

We are still seeing a delay after implementing an additional fix. Our efforts to mitigate and resolve the issue continue.
Posted Jan 29, 2025 - 14:27 CET

Update

We have identified and implemented a fix to mitigate the issue and are seeing some improvement; however, messages are still being sent with a delay. Troubleshooting is ongoing to confirm that our mitigation efforts are correct and to identify the root cause.
Posted Jan 29, 2025 - 13:01 CET

Update

We are continuing to investigate this issue.
Posted Jan 29, 2025 - 12:26 CET

Update

We are continuing to investigate this issue.
Posted Jan 29, 2025 - 11:50 CET

Investigating

We are currently seeing delays in sending both SMS and email messages. Troubleshooting is ongoing.
Posted Jan 29, 2025 - 11:24 CET
This incident affected: Engage (Messaging).