[Engage] - Disturbance identified

Incident Report for Voyado

Postmortem

Summary
Between 07:30 and 08:20 CEST on September 17th, we encountered an issue affecting a central in-memory database critical to multiple processes across the platform. The issue prevented failover to backup services, leading to various platform anomalies. These included significant delays in message processing (requiring manual resends), failure of automation events, login issues, and other disruptions.

Customer Impact
Customers who had messages or activities scheduled between 07:30 and 08:20 were primarily affected, experiencing delays in execution. Scheduled jobs were disrupted during this period. Many were able to self-heal upon their next scheduled run, but certain critical activities, such as birthday automation events, may not have triggered as expected, potentially impacting some customers.

Root Cause and Mitigation

Root Cause
The root cause was a failure in a central in-memory database, which entered a degraded state. This database is designed to store data for rapid access under high-load, low-latency conditions. As the database is essential to numerous platform processes, the issue spread across multiple areas of the system, although only specific use cases were significantly impacted from a user perspective.

The database operates in a redundant primary-to-replica configuration, with automatic failover to replicas when the primary encounters issues. In this instance, all servers shifted into a replica state with no primary resource active, leading to widespread disruptions.
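The failure mode described above, where every server reports itself as a replica and none acts as primary, can be illustrated with a minimal sketch. It is not the platform's actual monitoring code: the `Role`, `Node`, and `classify_cluster` names and the node identifiers are hypothetical, and the sketch assumes each node's reported role has already been collected by some health check.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    REPLICA = "replica"


@dataclass(frozen=True)
class Node:
    name: str   # hypothetical node identifier
    role: Role  # role the node currently reports about itself


def classify_cluster(nodes: list[Node]) -> str:
    """Classify a cluster by how many nodes report the primary role."""
    primaries = [n for n in nodes if n.role is Role.PRIMARY]
    if len(primaries) == 1:
        return "healthy"
    if not primaries:
        # The state described in this incident: only replicas remain.
        return "no_primary"
    # More than one primary is a different failure mode, flagged separately.
    return "multiple_primaries"


# Hypothetical three-node cluster in which every node reports "replica".
cluster = [
    Node("node-a", Role.REPLICA),
    Node("node-b", Role.REPLICA),
    Node("node-c", Role.REPLICA),
]
print(classify_cluster(cluster))  # prints "no_primary"
```

A check of this kind treats "no primary" as its own alertable condition rather than relying on each node looking healthy individually, since every replica can be up while the cluster as a whole cannot accept writes.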

Mitigation
Enforce primary: To resolve the issue, we manually enforced the designation of a new primary resource within the configuration. This restored normal platform operations, allowing messages and scheduled activities to resume as expected.

Next steps
To prevent future occurrences, we have enhanced the system's downtime handling for our in-memory database. This improvement ensures that the system will automatically return to the correct state once back online, mitigating the risk of similar disruptions going forward.
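As a rough illustration of what returning to the correct state automatically could look like, the sketch below promotes a replica when no primary is present. It is an assumption-laden example, not the change that was actually deployed: `NodeState`, `choose_promotion_candidate`, `reconcile`, and the `replication_offset` field are hypothetical, and the policy of picking the online replica with the highest replication offset is one common choice rather than the one confirmed by this report.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class NodeState:
    name: str                # hypothetical node identifier
    role: str                # "primary" or "replica"
    replication_offset: int  # assumed measure of how up to date the node is
    online: bool = True


def choose_promotion_candidate(nodes: list[NodeState]) -> Optional[NodeState]:
    """Return the replica to promote, or None if no promotion is needed or possible."""
    if any(n.online and n.role == "primary" for n in nodes):
        return None  # a primary already exists; nothing to do
    candidates = [n for n in nodes if n.online and n.role == "replica"]
    if not candidates:
        return None  # no online replica available to promote
    # Prefer the most up-to-date replica; ties go to the first one listed.
    return max(candidates, key=lambda n: n.replication_offset)


def reconcile(nodes: list[NodeState]) -> list[NodeState]:
    """Return the desired cluster state, promoting one replica if no primary exists."""
    candidate = choose_promotion_candidate(nodes)
    if candidate is None:
        return list(nodes)
    return [replace(n, role="primary") if n == candidate else n for n in nodes]


# Hypothetical cluster after a restart: all nodes came back as replicas.
after_restart = [
    NodeState("node-a", "replica", replication_offset=100),
    NodeState("node-b", "replica", replication_offset=250),
    NodeState("node-c", "replica", replication_offset=180),
]
for node in reconcile(after_restart):
    print(node.name, node.role)
```

In this sketch `reconcile` only computes the desired roles; applying them to a real database would go through that product's own promotion mechanism, which this report does not name.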

Posted Sep 25, 2024 - 13:43 CEST

Resolved

We are now in a stable state and the incident is closed. We will continue to monitor the system during the day. Some statistics data for the period 07:30 - 08:20 CEST is incorrect and needs to be handled manually; the team will continue working to correct these errors.

Posted Sep 17, 2024 - 12:26 CEST

Monitoring

Engage has been up and running since approx. 08:30 CEST. We continue to monitor the applications following our mitigations. We are still working on a solution for the activities scheduled between 07:30 and 08:30 CEST that were affected.

Posted Sep 17, 2024 - 09:48 CEST

Update

Engage is in a stable state and has been working normally since approx. 08:30 CEST. However, activities scheduled between 07:30 and 08:20 are affected, and we are evaluating how to handle them.

Posted Sep 17, 2024 - 09:43 CEST

Identified

We have mitigated some effects of the incident and the situation is now more stable. Login is working, and automations and messaging have been fully functional since approx. 08:30 CEST. We are continuing our work on the incident and any lingering effects.

Posted Sep 17, 2024 - 08:55 CEST

Update

We are continuing to investigate this issue.

Posted Sep 17, 2024 - 08:43 CEST

Update

We are continuing to investigate the issue.

Posted Sep 17, 2024 - 08:39 CEST

Investigating

We have identified a disturbance in Voyado and are currently investigating the issue.

Posted Sep 17, 2024 - 08:01 CEST

This incident affected: Engage (API, Web Application, Messaging, Automations).