Engage - slowness in automations

Incident Report for Voyado

Postmortem

Summary

On the evening of December 3rd, our Automations experienced a service degradation. The incident affected all tenants running workflows in the same way: contacts moved through the workflows at a much slower pace than normal, in many cases with delays of 1-3 hours.

Customer Impact

All workflows ran during the incident, but at a much slower pace, resulting in potential delays of 1-3 hours. Workflows that normally execute “immediately” could take up to 3 hours to finish during the incident, possibly affecting time-critical use cases of Automations, such as assignment of promotions at checkout.

Root Cause and Mitigation

Root Cause

We have not been able to identify a single root cause at this point. Instead, we have found indications of several contributing factors leading up to the incident:

A part of the workflow initiation started experiencing minor delays. Because Automations is always under high load, these minor delays added up to a significant total delay, which caused queues to build.
Functions meant to ensure integrity and functionality when problems occur inadvertently made the situation worse. Locks taken to prevent duplicates added further delays, and past a certain point the retries designed to ensure functionality added to the load, causing even more queues and throttling of traffic by underlying Azure components.
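
To illustrate how these two factors can compound, here is a minimal sketch of a single queue. It is not a model of the actual Automations services: the function name, the tick-based timing, the retry threshold and all numbers are hypothetical and chosen only for illustration. It shows a backlog that grows steadily once capacity drops slightly below the arrival rate, and grows faster once waiting items start being retried.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick, ticks,
                     retry_threshold=None, retry_percent=0):
    """Return the backlog size after each tick for one queue (illustrative only)."""
    backlog = 0
    history = []
    for _ in range(ticks):
        incoming = arrivals_per_tick
        # Hypothetical retry behaviour: once the backlog exceeds the threshold,
        # a percentage of the waiting items is submitted again as extra load.
        if retry_threshold is not None and backlog > retry_threshold:
            incoming += backlog * retry_percent // 100
        # Process as much as capacity allows; the rest stays queued.
        backlog = max(0, backlog + incoming - capacity_per_tick)
        history.append(backlog)
    return history


# Healthy: capacity slightly above the arrival rate, so no queue forms.
healthy = simulate_backlog(100, 105, ticks=10)

# Minor slowdown: capacity slightly below the arrival rate, so the queue grows every tick.
degraded = simulate_backlog(100, 95, ticks=10)

# Same slowdown, with retries adding load once the backlog passes the threshold.
with_retries = simulate_backlog(100, 95, ticks=10,
                                retry_threshold=20, retry_percent=10)

print(healthy[-1], degraded[-1], with_retries[-1])
```

In this sketch a capacity change of only a few percent is enough to turn an empty queue into a growing one, and the retry traffic makes the final backlog larger than the slowdown alone would.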

Mitigation

When the Automation services were redeployed, stale traffic and locks were reset, which led to shrinking queues as resources were forcibly made available again. This meant we could once again process workflows in a normal fashion, but it took some time to work through the traffic queued up during the incident, even at full throttle.
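
As a rough illustration of why recovery is not instant, the time needed to clear a backlog depends on the spare capacity left over after new traffic has been handled. The sketch below assumes constant arrival and processing rates; the function name and the numbers are hypothetical and do not reflect measured values from the incident.

```python
import math


def ticks_to_drain(backlog, arrivals_per_tick, capacity_per_tick):
    """Estimate how many ticks it takes to clear a backlog at constant rates."""
    # Only the capacity not consumed by new arrivals can work on the backlog.
    spare = capacity_per_tick - arrivals_per_tick
    if spare <= 0:
        raise ValueError("the backlog never drains without spare capacity")
    return math.ceil(backlog / spare)


# Hypothetical numbers: 50 queued items, 100 new items per tick, capacity of 105 per tick.
print(ticks_to_drain(50, 100, 105))
```

Because a system that is always under high load has little spare capacity, even a modest backlog can take many ticks to clear in this model.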

Actions Taken & Next Steps

Changes have been made to the functionality involved in the incident. We will monitor these preventive actions, through a combination of further testing and live monitoring, to see whether they have mitigated the risk of ending up in the same situation in the future.

Posted Dec 13, 2024 - 11:02 CET

Resolved

We have resolved the issue and will continue to work on the root cause.

Posted Dec 03, 2024 - 20:51 CET

Update

We are continuing to monitor for any further issues.

Posted Dec 03, 2024 - 18:37 CET

Update

We are continuing to monitor for any further issues.

Posted Dec 03, 2024 - 18:34 CET

Monitoring

We are back at normal performance for automations and will continue to monitor closely.

Posted Dec 03, 2024 - 18:30 CET

Update

We are still working to remediate the issue. Performance status is unchanged.

Posted Dec 03, 2024 - 17:44 CET

Identified

We have identified the issue and are currently verifying whether our actions have resolved the degraded performance in automations.

Posted Dec 03, 2024 - 17:06 CET

Investigating

We are currently experiencing larger-than-usual queues in our automations. We are investigating this.

Posted Dec 03, 2024 - 16:31 CET

This incident affected: Engage (Automations).