The issue began at around 3:40 PM CET, causing delays in processing, sending and storing messages.
We discovered that resources used to send messages were experiencing unexpectedly high load, which overwhelmed their capacity to handle requests efficiently. This primarily led to delays in sending e-mails. A small number of SMS messages were also indirectly affected by the delays.
Our first attempt at mitigating the issue by scaling resources didn't work, as the problem quickly resurfaced. To address this, we made a targeted code change to manage the workload better. The change was tested on one affected resource and proved effective, so it was then rolled out to all affected resources.
With the system stabilized, we were able to resend all delayed messages, completing the process later that evening. These actions resolved the problem and restored normal email sending from Engage.
Delays, primarily in email messaging, impacted a large number of tenants. Some messages were delayed by several hours.
The trigger for the issue was high load on the email messaging chain. Combined with the activation of a code path that used resources inefficiently, this pushed CPU load and thread counts far beyond anticipated and sustainable levels. The issue was mitigated by a code fix that reduced the number of threads used, which in turn reduced the load on the servers.
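As an illustration of the kind of fix described above, the sketch below caps the number of threads used for sending by routing all work through a bounded pool. It is a minimal example and not the actual Engage code: `send_all`, `MAX_WORKERS`, and the `send` callable are hypothetical names, and the cap of 4 is an arbitrary value chosen for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cap; a real value would be tuned to the server's capacity.
MAX_WORKERS = 4


def send_all(messages, send, max_workers=MAX_WORKERS):
    """Send every message using at most `max_workers` threads.

    `send` is an assumed callable that delivers a single message.
    Results are returned in the same order as `messages`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, messages))


def demo():
    """Send 100 fake messages and count how many threads were used."""
    lock = threading.Lock()
    seen_threads = set()

    def fake_send(message):
        # Record which worker thread handled this message.
        with lock:
            seen_threads.add(threading.get_ident())
        return f"sent:{message}"

    results = send_all(range(100), fake_send)
    return results, len(seen_threads)
```

With a bounded pool, a spike in message volume lengthens the queue of pending work instead of increasing the thread count, so CPU load and thread counts stay within a known limit.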
We will implement further improvements to automatically resend delayed messages, so that the effects of any disturbance can be mitigated quickly. We will also look at architectural changes to reduce overall load in the messaging chain, so that we are better prepared for future high-load scenarios.
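One possible shape for automatic resending is sketched below, purely as an illustration of the idea and not as the planned design. The names `resend_delayed`, `send`, and `max_attempts` are hypothetical, and a production version would likely add backoff between attempts and persistent storage for the queue.

```python
from collections import deque


def resend_delayed(delayed, send, max_attempts=3):
    """Retry each delayed message until it is sent or attempts run out.

    `send` is an assumed callable that raises an exception on failure.
    Returns a tuple (sent, failed) of message lists.
    """
    queue = deque((message, 0) for message in delayed)
    sent, failed = [], []
    while queue:
        message, attempts = queue.popleft()
        try:
            send(message)
            sent.append(message)
        except Exception:
            attempts += 1
            if attempts < max_attempts:
                # Put the message at the back so others are not blocked.
                queue.append((message, attempts))
            else:
                failed.append(message)
    return sent, failed
```

Messages that still fail after `max_attempts` tries are returned separately so that they can be flagged for manual follow-up rather than retried indefinitely.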