Voyado unreachable to some customers

Incident Report for Voyado

Postmortem

During the night of January 13, Microsoft rolled out patches to their machines in Azure to mitigate an issue that is not known to us. The update failed on one of the two database servers in a pair serving a number of Voyado customers, which caused the incident.

Databases in Voyado reside on database servers that work in pairs to provide fallback functionality and high resiliency in the application. This essentially means that each database consists of two actual databases: a primary database on one server and a mirror on the other server in the pair. All activity takes place in the primary database, while the mirror provides a copy for failover in case of trouble. The primary and mirror databases are spread out over the two servers in the pair to maximize load capacity.
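The pairing described above can be illustrated with a minimal sketch. It is not Voyado's actual placement logic: the server names, the customer names, and the choice to alternate primaries between the two servers are all assumptions made for illustration. The report only says that primaries and mirrors are spread over the pair.

```python
from dataclasses import dataclass


@dataclass
class MirroredDatabase:
    """One logical database: a primary on one server, a mirror on the other."""
    name: str
    primary_server: str
    mirror_server: str


def place_databases(names, servers=("server-a", "server-b")):
    """Alternate primaries between the two servers of a pair.

    Alternating is an assumed strategy; it simply ensures that each
    server hosts roughly half of the primaries and half of the mirrors.
    """
    first, second = servers
    placed = []
    for index, name in enumerate(names):
        if index % 2 == 0:
            primary, mirror = first, second
        else:
            primary, mirror = second, first
        placed.append(MirroredDatabase(name, primary, mirror))
    return placed


# Hypothetical customer databases, used only to show the resulting layout.
layout = place_databases(["customer-1", "customer-2", "customer-3", "customer-4"])
for db in layout:
    print(f"{db.name}: primary on {db.primary_server}, mirror on {db.mirror_server}")
```

With a layout like this, the loss of one server affects only the databases whose primary sits on that server, which matches the report's statement that some, not all, customers were affected.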

In this case the failed system update caused one of the two servers in the affected pair to end up in a failed state at about 03:06 CET. Normally this should have led to the mirrors taking over operations from the primary databases on the faulty server, but for reasons not yet established this did not happen, leaving the customers whose primary databases were on the faulty server in an undesirable state.
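The failover that should have taken place can be sketched as follows. This is a simplified model with hypothetical names, not the mechanism Azure or Voyado actually uses: when a server fails, every database whose primary is on that server has its mirror promoted, and databases with their primary on the healthy server are left alone.

```python
from dataclasses import dataclass


@dataclass
class MirroredDatabase:
    name: str
    primary_server: str
    mirror_server: str


def fail_over(databases, failed_server):
    """Promote the mirror of every database whose primary was on failed_server.

    Returns the names of the databases that were switched over.
    """
    promoted = []
    for db in databases:
        if db.primary_server == failed_server:
            # The mirror becomes the new primary; the old primary's server
            # takes the mirror role once it is healthy again.
            db.primary_server, db.mirror_server = db.mirror_server, db.primary_server
            promoted.append(db.name)
    return promoted


# Hypothetical pair: one primary on each server.
databases = [
    MirroredDatabase("customer-1", "server-a", "server-b"),
    MirroredDatabase("customer-2", "server-b", "server-a"),
]
promoted = fail_over(databases, failed_server="server-a")
print(promoted)
```

In the incident this promotion step did not occur, so databases in the position of "customer-1" in the sketch stayed tied to the failed server until it was restarted.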

We were alerted to the issue by a high number of errors in our monitoring setup, and on-call staff were brought in to investigate the incident. To mitigate the issue we eventually had to restart the entire server and service. Full operations were restored at 07:37 CET.

A support case has been registered with Microsoft to establish why their update caused the malfunction and why availability was affected because the fallback did not function as intended.

Posted Jan 13, 2021 - 17:20 CET

Resolved

This incident has been resolved.

Posted Jan 13, 2021 - 09:52 CET

Monitoring

At 07:37 CET we succeeded in returning full functionality to the failing database server, mitigating the issue.
We are currently looking into the root cause and will provide a postmortem detailing the issue as soon as possible.

Posted Jan 13, 2021 - 07:47 CET

Investigating

Voyado is currently unreachable to some customers due to a malfunctioning database cluster.
We're working on the problem and will update the status as we go.

Posted Jan 13, 2021 - 07:33 CET

This incident affected: Engage (API, Web Application, Messaging, Automations).