[Engage] Investigating slow response times for certain endpoints

Incident Report for Voyado

Resolved

Having monitored the platform for some time, we can conclude that the mitigating actions taken have had a positive effect on the endpoints where we saw slowness in the higher percentiles (for example, the slowest 1% of requests).
The actions taken have improved the distribution of load in our underlying structures to make better use of all available resources, which is especially important in high-load scenarios. This work cannot eliminate slow responses entirely, but we see a positive impact in most scenarios, which makes us confident that we will see fewer slow responses going forward. We will continue working in this area, but consider this incident closed.

Posted Mar 13, 2024 - 17:10 CET

Monitoring

We are currently monitoring the results of changes in the underlying infrastructure performed yesterday and during the night.
The changes aim to distribute load more evenly, mitigating bottlenecks that can arise under high load and affect a subset of requests.

Posted Mar 12, 2024 - 08:16 CET

Investigating

Our monitoring indicates intermittent slowness for certain endpoints in the Engage APIs during the past 48 hours, and we are currently investigating the cause.

The main endpoints affected are those used to create contacts, create orders and redeem reward vouchers, but other less frequently used endpoints are also affected.

The problem does not affect all calls; the slowness comes and goes, but it is visible in our average and percentile monitoring, which prompted us to take action. Investigations are ongoing, but we suspect that we are seeing the effects of resource throttling due to load in the deeper (infrastructural) layers of our cloud platform.
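
As an illustration of why intermittent slowness tends to stand out in percentile monitoring more than in the average, here is a minimal Python sketch. The latency figures and the helper name `percentile` are hypothetical and are not taken from the actual monitoring data; the nearest-rank method used here is only one of several common ways to compute a percentile.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds (assumed values):
# 980 calls answer quickly, 20 calls (2%) are slow.
latencies_ms = [50] * 980 + [2000] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

With these assumed numbers the median stays at the fast value, the average rises only moderately, and the 99th percentile jumps to the slow value, which is why a problem affecting a small subset of calls is easiest to spot in the higher percentiles.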

We will provide more information on the matter as it becomes available to us.

Posted Mar 08, 2024 - 09:36 CET

This incident affected: Engage (API, 3rd Party Integrations).