[Engage] - Disturbance identified, affecting Messages and Automations

Incident Report for Voyado

Postmortem

Summary

Between 09:12 and 09:52 on February 2nd we encountered an issue affecting a central in-memory database used by many processes in the platform. The issue left the service in a state that didn't trigger failover to backup services, causing various anomalies throughout the platform. Among those were a large number of delayed messages (requiring manual resend), automation events not triggering, reported login issues, and more.

Customer Impact

The issue mainly affected customers who had messages or activities scheduled to execute between 09:12 and 09:52; these were delayed.

Root Cause and Mitigation

Root Cause

The root cause of the issue was a central in-memory database ending up in a bad state. The database stores data for quick access throughout the platform in a high-load, low-latency configuration. As this data is needed by various processes, the effect spread over multiple parts of the platform, but only specific use cases were greatly affected from a user perspective. The database has a redundant setup with one primary and multiple replicas, where failover to a replica is automatic should the primary run into issues. In this instance all servers in the setup ended up in a replica state, with no primary active, which caused the issue.
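
The failure mode described above, where every server reports itself as a replica, can be caught by a check that looks at the cluster as a whole rather than at whether each server is alive. The following is a minimal sketch under stated assumptions, not the monitoring actually used in Engage: the `Role` enum, the `classify_topology` function and the node names are hypothetical, and the role each node reports is assumed to come from whatever status command the database in use provides.

```python
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    REPLICA = "replica"


def classify_topology(roles: dict[str, Role]) -> str:
    """Classify a cluster from the role each reachable node reports.

    `roles` maps a (hypothetical) node name to its self-reported role.
    """
    if not roles:
        return "unreachable"  # no node answered at all
    primaries = [name for name, role in roles.items() if role is Role.PRIMARY]
    if len(primaries) == 0:
        return "no-primary"   # every node is a replica: the state in this incident
    if len(primaries) > 1:
        return "split-brain"  # more than one node believes it is primary
    return "healthy"


# Every node is up and answering, yet none of them is the primary.
incident = {"node-a": Role.REPLICA, "node-b": Role.REPLICA, "node-c": Role.REPLICA}
print(classify_topology(incident))
```

A probe that only asks each server whether it is running would pass in this state, which may be why automatic failover was not triggered; alerting on the `no-primary` result is one way to close that gap.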

Mitigation

Enforce primary: To remediate the issue, we enforced a new primary resource in the configuration, which returned Engage to a normal state where messages and activity execution functioned as expected.
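
When a primary has to be enforced manually, one common heuristic is to promote the reachable replica that has applied the most replicated data. The sketch below illustrates that heuristic only; this report does not state how the new primary was chosen, and `Replica`, `replication_offset` and `pick_promotion_candidate` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    replication_offset: int  # hypothetical measure of how much data was applied
    reachable: bool


def pick_promotion_candidate(replicas: list[Replica]) -> Replica:
    """Return the reachable replica with the highest replication offset."""
    candidates = [r for r in replicas if r.reachable]
    if not candidates:
        raise RuntimeError("no reachable replica to promote")
    # max() returns the first maximal element, so ties go to the earliest listed.
    return max(candidates, key=lambda r: r.replication_offset)


replicas = [
    Replica("node-a", replication_offset=1200, reachable=True),
    Replica("node-b", replication_offset=1500, reachable=True),
    Replica("node-c", replication_offset=1800, reachable=False),
]
print(pick_promotion_candidate(replicas).name)
```

Here `node-c` has the highest offset but is skipped because it is unreachable, so the sketch selects `node-b`.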

Next steps

Unfortunately, this has happened before, and although we took action to prevent a recurrence, it happened again. After investigation, the root cause is still not clear. We now intend to review our current in-memory database setup and to upgrade and update it.

Posted Mar 12, 2025 - 09:26 CET

Resolved

All messages have now been resent. The incident is now resolved.
Posted Mar 02, 2025 - 12:24 CET

Update

As previously stated, operations are back to normal. We are still working on the aftermath (i.e. resending email messages that were not sent due to the outage). Automations are fully synced and are running as normal.
The next update will come when all messages have been resent.
Posted Mar 02, 2025 - 10:58 CET

Update

The processing of Messages and Automations is still looking good after the implemented fix. There are still some delays from queued-up messages that are being processed, but Automations are back to normal.
Posted Mar 02, 2025 - 10:25 CET

Monitoring

A fix has been implemented with promising results. We are seeing that Messages are being sent again and Automations are being processed.
We will continue to monitor the situation.
Posted Mar 02, 2025 - 10:00 CET

Update

We are continuing to investigate this issue.
Posted Mar 02, 2025 - 09:55 CET

Investigating

We have identified a disturbance in Voyado affecting Message sendouts and Automations, which are currently not being processed.
We are currently investigating the issue.
Posted Mar 02, 2025 - 09:36 CET
This incident affected: Engage (Messaging, Automations).