During the morning of Black Friday, we were alerted to delays in the sending of SMS messages. Shortly thereafter, API alerts were also triggered, indicating a service disruption on several endpoints.
Upon further investigation, we identified that one component in the messaging chain, a database that holds the messages to be sent, was under heavy load and did not scale the way it was supposed to, which degraded SMS delivery. Manual scaling had an immediate effect, and the platform started recovering.
As a result of the service degradation, many messages were only partially sent, and manual efforts were required to rectify this. In addition, some messages were partially duplicated due to automatic resends.
The immediate effect of the service degradation was delays to SMS messaging and a partial outage on our APIs between 10:22 and 10:51 CET. During this period many requests to the application could not be handled and received 5xx errors, while the majority of traffic was handled but with sub-standard response times.
The customer impact of the API incident was failed or delayed actions performed through the API, such as looking up contact data or redeeming promotions.
The customer impact of the SMS incident was that many messages were only partially sent. Measured at the platform level, about 5% of the intended recipients did not receive their message, but individual messages had higher or lower failed delivery rates.
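To illustrate how a platform-level figure can differ from the rate for an individual message, here is a minimal sketch. The message names and recipient counts are hypothetical and chosen only for illustration; they are not figures from the incident.

```python
def failure_rate(intended: int, delivered: int) -> float:
    """Share of intended recipients that did not receive the message."""
    return (intended - delivered) / intended


# Hypothetical (intended, delivered) counts per message; not incident data.
sends = {
    "message_a": (1000, 990),
    "message_b": (200, 150),
}

# Failure rate for each individual message.
per_message = {name: failure_rate(i, d) for name, (i, d) in sends.items()}

# Platform-level failure rate: all recipients pooled together.
total_intended = sum(i for i, _ in sends.values())
total_delivered = sum(d for _, d in sends.values())
platform_rate = failure_rate(total_intended, total_delivered)
```

With these made-up numbers, one message fails for 1% of its recipients and the other for 25%, while the pooled platform-level rate is 5%.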
While we were working on a way to resend the affected messages, we were alerted to an unanticipated side effect of the service degradation. Automatic retries, designed to make sure all messages are always delivered, had caused batches of messages to be sent multiple times, somehow bypassing the checks in place to prevent such duplicates. Because of this, manual resending of failed messages was paused, and later cancelled, to avoid the risk of sending further duplicates (a risk further increased by the general delays in delivery receipts from operators during Black Friday). All efforts were redirected towards finding the bug that had caused the duplicates. As the day progressed and we remained unable to resend messages without risking duplicates, we reached a point where we deemed that the failed SMS messages could no longer be sent.
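As an illustration of the kind of duplicate check that retries normally rely on, the following is a minimal sketch of an idempotent batch send. It is not the platform's actual implementation; the class `DedupingSender`, its fields, and the in-memory set are all hypothetical. It assumes each delivery can be identified by a (message ID, recipient) pair that is recorded before the SMS is handed off.

```python
from dataclasses import dataclass, field


@dataclass
class DedupingSender:
    """Hypothetical sender that skips (message_id, recipient) pairs already handed off."""

    # Keys already handed off. A real system would need a durable store
    # with an atomic insert-if-absent operation instead of an in-memory set.
    sent_keys: set = field(default_factory=set)
    # Stand-in for the hand-off to an SMS operator.
    outbox: list = field(default_factory=list)

    def send_batch(self, message_id: str, recipients: list) -> int:
        """Send to each recipient at most once; return how many were newly sent."""
        newly_sent = 0
        for recipient in recipients:
            key = (message_id, recipient)
            if key in self.sent_keys:
                continue  # A retry of an already-sent delivery is skipped.
            self.sent_keys.add(key)
            self.outbox.append(key)
            newly_sent += 1
        return newly_sent


sender = DedupingSender()
first_attempt = sender.send_batch("m1", ["+100", "+200"])
# An automatic retry of the same batch, plus one recipient that was missed.
retry_attempt = sender.send_batch("m1", ["+100", "+200", "+300"])
```

In this sketch the retry only reaches the recipient that was missed the first time. A check of this kind only holds if the record of sent keys stays available and consistent when the system is under load.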
Once this decision had been made, we started a manual effort to change the status shown in the application for the affected messages from "sending" to "sent", to indicate that no further delivery attempts would be made. Due to the nature of the incident, with duplicates and manual intervention, the stats for the affected messages will never be fully accurate. The actual deviation will vary from message to message, but all affected messages will show erroneous data in the delivery stats, making it very hard to follow up on the effect of the delivery.