On the evening of December 3rd, our Automations experienced a service degradation. The incident affected all tenants running workflows in the same way: contacts moved through their workflows at a much slower pace than normal, in many cases with delays of 1-3 hours.
All workflows continued to run during the incident, but at a much slower pace, resulting in delays of 1-3 hours. Workflows that normally execute “immediately” could take up to 3 hours to finish, possibly affecting time-critical use cases of Automations, such as the assignment of promotions at checkout.
We have not been able to identify a single root cause at this point; instead, we have found indications of several contributing factors leading up to the incident:
Upon redeploys of the Automation services, stale traffic and locks were reset, which led to the queues shrinking as resources were forcibly made available again. This meant we could once again process workflows normally, but it took some time to work through the traffic that had queued up during the incident, even at full throttle.
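For illustration only, below is a minimal sketch of the general pattern for preventing stale locks from holding resources until a redeploy: giving each lock a time-to-live so it expires on its own if the worker that took it crashes or is replaced. The key name, TTL, and Redis-based locking shown here are assumptions for the example, not the actual Automations implementation.

```python
# Sketch: a worker lock that expires on its own, so a crashed or redeployed
# worker cannot hold a queue partition indefinitely. Key names and TTL are
# hypothetical values chosen for the example.
import uuid
import redis

r = redis.Redis()

LOCK_KEY = "automations:partition:42:lock"   # hypothetical key
LOCK_TTL_SECONDS = 60                        # lock expires even if never released

def acquire_lock() -> str | None:
    """Try to take the lock; returns a token on success, None if already held."""
    token = str(uuid.uuid4())
    # NX = only set if the key does not already exist, EX = expire after TTL seconds
    if r.set(LOCK_KEY, token, nx=True, ex=LOCK_TTL_SECONDS):
        return token
    return None

def release_lock(token: str) -> None:
    """Release the lock only if we still own it (our token still matches)."""
    release_script = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    end
    return 0
    """
    r.eval(release_script, 1, LOCK_KEY, token)
```

With an expiry like this, a stale lock can block processing for at most the TTL, rather than until the next redeploy forcibly clears it.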
Changes have been made to the functionality involved in the incident, and we will monitor the preventive actions taken to verify that we have mitigated the risk of ending up in the same situation in the future. This will be done through a combination of further testing and live monitoring.