Engage - Problems with SMS messages in automations

Incident Report for Voyado

Postmortem

Summary

On January 29, 2025, between approximately 09:30 and 20:00 CET, message processing in the Engage platform was delayed. As a result, emails and SMS messages, including automated communications, were sent later than expected. Some customers also saw promotions assigned with a delay.

Customer Impact

Messages were delivered 30-60 minutes late, affecting both scheduled and automated communications such as welcome emails and order confirmations. Some customers also had trouble assigning promotions. All messages were eventually delivered; our incident response team manually resent the few that had become stuck.

Root Cause and Mitigation

Root cause

The issue was caused by a recent update to the system that distributes messages across available resources. The update, which upgraded a component to the latest Microsoft version, unintentionally slowed message processing and created a bottleneck. The delays grew as message volume increased throughout the day.

Mitigation

Once our monitoring systems flagged the delay, our incident response team immediately began investigating. To resolve the issue, we:
· Deployed a temporary fix to gain better insight into what was causing the slowdown.
· Identified that some areas of the system were handling messages more efficiently than others and adjusted message distribution to relieve pressure on the slowest parts.
· Made system adjustments to ensure messages could be processed at normal speed again, significantly reducing the delays.
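
For readers who want a concrete picture of what adjusting message distribution can look like, the following is a minimal illustrative sketch in Python. It is not the Engage implementation: the partition names, the throughput figures, and both function names are hypothetical, and the approach shown (sending each part of the system a share of new messages proportional to its measured throughput) is only one common way to relieve pressure on slower parts.

```python
def rebalance_weights(throughput_per_partition):
    """Return each partition's share of new messages,
    proportional to its measured throughput (messages per second)."""
    total = sum(throughput_per_partition.values())
    if total <= 0:
        # No usable measurements: fall back to an even split.
        n = len(throughput_per_partition)
        return {name: 1.0 / n for name in throughput_per_partition}
    return {name: rate / total
            for name, rate in throughput_per_partition.items()}


def plan_batch(batch_size, weights):
    """Split batch_size messages across partitions by weight.

    Uses the largest-remainder method so the per-partition
    counts always add up to exactly batch_size.
    """
    exact = {name: batch_size * w for name, w in weights.items()}
    counts = {name: int(x) for name, x in exact.items()}
    leftover = batch_size - sum(counts.values())
    # Hand the remaining messages to the partitions with the
    # largest fractional parts.
    by_remainder = sorted(
        exact, key=lambda name: exact[name] - counts[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical measurements: one fast partition, two slow ones.
measured = {"partition-a": 300.0, "partition-b": 100.0, "partition-c": 100.0}
weights = rebalance_weights(measured)
plan = plan_batch(10, weights)
```

In this sketch the slower partitions receive fewer new messages, so their backlog can drain while the faster partition absorbs more of the load.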

Next Steps

After the incident, we applied permanent system updates to prevent this issue from happening again. These changes have been successfully implemented and are being closely monitored. Moving forward, we will:
· Continue working with our system providers to ensure the platform remains stable.
· Improve our ability to detect similar issues earlier in our testing environments to catch potential delays before they affect customers.

We appreciate your patience during this incident and remain committed to providing a reliable and seamless experience on the Engage platform. If you have any further questions or concerns, please do not hesitate to reach out to our support team.

Posted Feb 06, 2025 - 13:30 CET

Resolved

We have fixed the issue with sending messages and are back to a normal state. Messages that were not sent due to the issue are being resent.
Posted Jan 29, 2025 - 10:03 CET

Investigating

There are currently problems with SMS messages sent via automations, affecting all tenants.
We are troubleshooting.
Posted Jan 29, 2025 - 09:52 CET
This incident affected: Engage (Messaging, Automations).