[Engage] Service degradation
Incident Report for Voyado
Postmortem

Summary

During the morning of Black Friday, we were alerted about delays in the sending of SMS messages. Shortly thereafter, API alerts were also triggered, indicating a service disruption on several endpoints.

Upon further investigation, we identified that a component in the messaging chain, a database used to hold the messages queued for sending, was under heavy load and did not scale as intended, causing degraded SMS delivery. Manual scaling had an immediate effect and the platform started recovering.

As a result of the service degradation, many messages were only partially sent, and manual efforts had to be taken to rectify this. The degradation also caused some messages to be partially duplicated due to automatic resends.

Customer Impact

The immediate effect of the service degradation was delayed SMS messaging and a partial outage on our APIs between 10:22 and 10:51 CET. During this period, many requests to the application could not be handled and received 5xx errors, while the majority of traffic was handled, although with substandard response times.

For actions performed through the API, such as looking up contact data or redeeming promotions, the customer impact was failing or delayed requests.
The customer impact of the SMS incident was that many messages were only partially sent. Measured at platform level, about 5% of the intended recipients did not receive their message, although the failure rate of individual sendouts varied both above and below that figure.

While we were working on a way to resend the affected messages, we learned that the service degradation had led to an unanticipated side effect: automatic retries, designed to make sure all messages are always delivered, had caused batches of messages to be sent multiple times, bypassing the checks in place to prevent such duplicates. Because of this, manual resending of failed messages was paused, and later cancelled, to avoid the risk of sending further duplicates (a risk increased by the general delays in delivery receipts from operators during Black Friday). All efforts were redirected towards finding the bug that had allowed the duplicates to be sent. As the day progressed and we remained unable to resend messages without risking duplicates, we reached the point where we deemed that the failed SMS messages could no longer be sent.
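To illustrate the class of problem, the sketch below shows a common idempotency-key pattern for batch sends. This is a minimal, hypothetical Python example, not our actual implementation: if the duplicate check itself cannot be answered in time and the retry path fails open, a batch that was already handed off can be delivered again.

  sent_batch_keys = set()  # stands in for a persistent store of idempotency keys

  class DedupeLookupTimeout(Exception):
      """Raised when the store answering "was this batch already sent?" is too slow."""

  def already_sent(batch_key):
      # Placeholder for a real lookup; under heavy database load this query
      # is what starts timing out.
      return batch_key in sent_batch_keys

  def send_batch(batch_key, messages, deliver):
      try:
          if already_sent(batch_key):
              return "skipped (duplicate)"
      except DedupeLookupTimeout:
          # Failing open here is what lets retries produce duplicates;
          # failing closed (postponing the send) is the safer behaviour.
          pass
      deliver(messages)               # hand-off to the SMS operator
      sent_batch_keys.add(batch_key)  # recorded only after a successful hand-off
      return "sent"

In a sketch like this, a retry of a batch whose first attempt timed out on the lookup, or stopped before the key was recorded, is handed to the operator a second time.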

Once this decision had been made, a manual effort was started to change the information shown in the application from "sending" to "sent", to indicate that no further delivery attempts would be made. Due to the nature of the incident, with duplicates and manual intervention, the statistics for the affected messages will never be fully accurate. The actual deviation will vary between messages, but all affected messages will show erroneous delivery statistics, making it very hard to follow up on the effect of the sendout.

Root Cause and Mitigation

  • A sudden burst of requests and severe load on the SMS SQL server led to long response times and eventually connection timeouts, when a resource designed to auto-scale failed to do so sufficiently. Manual scaling mitigated the immediate problem (see the sketch after this list).
  • Several steps were taken to prevent duplicate SMS messages in the event of service degradation, and the functionality is now more robust. This includes code changes to Engage and related services.
  • Monitoring of the affected SQL server has been increased.
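As a simplified illustration of why a purely reactive scaling rule can lag behind a sudden burst, the following hypothetical Python sketch (not a description of the actual scaling configuration) raises capacity only after utilisation has been high for a full evaluation window:

  SCALE_UP_THRESHOLD = 0.80  # assumed utilisation trigger (80%)
  EVALUATION_WINDOW = 5      # assumed number of consecutive samples required

  def should_scale_up(utilisation_samples):
      # Scale up only once utilisation has stayed above the threshold for a
      # whole window of recent samples.
      recent = utilisation_samples[-EVALUATION_WINDOW:]
      return (len(recent) == EVALUATION_WINDOW
              and all(sample > SCALE_UP_THRESHOLD for sample in recent))

  # A burst that saturates the server within one or two samples causes
  # timeouts well before should_scale_up() returns True, which is why
  # manual scaling gave immediate relief.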

Next Steps

  • Further investigations will be made to fully understand why the scaling didn’t work as intended for the affected resources in the Messaging pipeline. 
  • We will investigate whether, and why, this pipeline degradation influenced API performance, and take action to mitigate the risks indicated by those findings.
  • We're implementing further improvements to prevent unintended duplicates in similar situations.
Posted Dec 06, 2024 - 22:42 CET

Resolved
Our APIs are still stable and back to normal operation.
SMS sending is also back to a normal state and we are sending at full capacity.
Teams are making sure that any delayed messages are delivered to their intended recipients, albeit somewhat late in some cases.

We sincerely apologize for the inconvenience caused by this service degradation.
We are fully staffed, monitoring the platform closely, and taking swift action to mitigate any problems caused by the Black Friday load. This day is as important to us as it is to you.
Posted Nov 29, 2024 - 11:55 CET
Update
We have been seeing signs of improvement for the APIs since approximately 10:51 CET.
Sending of SMS is still degraded.
Our efforts continue at full force.
Posted Nov 29, 2024 - 10:58 CET
Update
We are continuing to investigate the issue, using all resources available to us.
Monitoring shows that our APIs are affected as well as the sending of SMS messages.

SMS messages are being sent, but at a lower rate than normal.
Posted Nov 29, 2024 - 10:50 CET
Investigating
We are currently investigating service degradation on our APIs.
More information will follow.
Posted Nov 29, 2024 - 10:34 CET
This incident affected: Engage (API, Messaging).