[Engage] Service degradation for a subset of customers

Incident Report for Voyado

Postmortem

Summary

One of the SQL servers reported high CPU utilization during the evening and night of November 5th. This affected a number of tenants residing on that server, in the form of longer response times for several API endpoints. DPA pointed to one query as the most likely culprit. The issue resolved itself after some time, most likely due to a decrease in the number of API calls. The next day, a missing database index was identified as the probable root cause.

Customer Impact

All tenants on the affected SQL server would in theory have experienced decreased performance for all operations against the database, including API calls, messages, and automations. However, since the incident happened during a period of low activity in the system, it did not result in any major problems for end users.

Root Cause and Mitigation

The incident was caused by an increase in queries against the Contact table that used a WHERE clause not covered by any index. The missing index on the affected table was identified and added.
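To illustrate the mechanism in general terms, the sketch below shows how a filter on an unindexed column forces a full table scan, and how adding an index changes the query plan. It is not the actual query or schema from the incident: it uses an in-memory SQLite database instead of the production SQL server, and the `contact` table layout, the `email` column, and the index name `idx_contact_email` are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the 'detail' text of each row from SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


# Hypothetical stand-in for the Contact table (assumed columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contact (id INTEGER PRIMARY KEY, email TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO contact (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

# A query filtering on a column that no index covers.
sql = "SELECT id, name FROM contact WHERE email = ?"
params = ("user42@example.com",)

# Without an index, the planner has to scan every row of the table.
plan_before = query_plan(conn, sql, params)

# Adding an index on the filtered column lets the planner seek directly.
conn.execute("CREATE INDEX idx_contact_email ON contact (email)")
plan_after = query_plan(conn, sql, params)

print("before:", plan_before)
print("after: ", plan_after)
```

The plan before the index reports a scan of the whole table, while the plan after it reports a search using `idx_contact_email`. The exact wording of the plan text varies between SQLite versions, and other database engines expose query plans through their own tooling.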

Next Steps

We will measure how much the added index has improved database performance and investigate whether additional indexes are needed. The API usage of tenants on the affected server will also be evaluated to see whether an inefficient implementation is causing unnecessary load.

Posted Nov 08, 2024 - 08:21 CET

Resolved

We conclude that the incident has been resolved.
Response times have been normal for at least the past 30 minutes.

Posted Nov 05, 2024 - 23:36 CET

Monitoring

We are seeing signs of improvement, and most of the affected customers are now back at more normal levels.
We are continuing to monitor the situation closely.

Posted Nov 05, 2024 - 23:17 CET

Identified

We have identified the issue and are working on a solution to mitigate it.
The service degradation is currently affecting a subset of customers linked to a specific resource cluster, while most customers are unaffected at this point. The affected customers are experiencing higher-than-normal response times in our APIs as well as in the web applications.

Posted Nov 05, 2024 - 22:48 CET

Investigating

We are currently investigating an issue causing service degradation (slowness, unresponsiveness) for a subset of customers.
Work is ongoing and information will be provided as soon as possible.

Posted Nov 05, 2024 - 22:20 CET

This incident affected: Engage (API, Web Application, 3rd Party Integrations).