We're investigating a major disturbance in the force

Incident Report for Voyado

Postmortem

Background:

Our system health monitors alerted us on 2020-01-20 at 16:02 CET that Voyado could not be reached from Stockholm (SE). Monitors based in other locations did not show the same problem.

Engineers started looking into the problem immediately, but all monitoring pointed to external services as the culprit. Voyado was still receiving requests, though fewer than before the alerts were triggered. A full health check of Voyado did not indicate any errors in the platform itself, but outgoing requests to external sources (such as address lookups) appeared to be affected, as those could not be completed.

On 2020-01-20 at 16:19 CET, all previously failing monitors reported being back online, without any action having been taken by Voyado engineers.

Conclusions:

According to information from Microsoft Azure, the incident appears to have been caused by multiple fiber cuts that affected network traffic in the Nordic region.

The full statement from Azure:
_“Summary of impact: Between 15:00 and 16:30 UTC on 20 Jan 2020, a subset of customers in Sweden, Norway and Russia may have experienced difficulties connecting to Azure services.

Preliminary root cause: Multiple fiber cuts affected network traffic routing for the Nordic region resulting in intermittent connectivity issues during the impacted time.

Mitigation: Engineers failed over traffic to alternate sites to mitigate.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. A full RCA will be provided within approximately 72 hours.”_

For more information:
https://status.azure.com/en-us/status/history/ (Tracking ID 0TSQ-TT0)
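
The Azure statement gives its impact window in UTC, while the monitor times in this report are in CET. As a rough check, the following Python sketch converts both to the same time zone and confirms that the alert period falls inside the Azure window. It assumes CET is UTC+1 on this date (no daylight saving time in January); the variable and function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CET is UTC+1 on 2020-01-20 (no daylight saving time in January).
CET = timezone(timedelta(hours=1), "CET")

# Impact window as given in the Azure statement (UTC).
azure_start = datetime(2020, 1, 20, 15, 0, tzinfo=timezone.utc)
azure_end = datetime(2020, 1, 20, 16, 30, tzinfo=timezone.utc)

# First alert and recovery times reported by the Voyado monitors (CET).
alert_start = datetime(2020, 1, 20, 16, 2, tzinfo=CET)
alert_end = datetime(2020, 1, 20, 16, 19, tzinfo=CET)


def fmt(ts):
    """Format a timezone-aware timestamp as HH:MM in CET."""
    return ts.astimezone(CET).strftime("%H:%M %Z")


# Timezone-aware datetimes compare correctly across different offsets.
within_window = azure_start <= alert_start and alert_end <= azure_end

print(f"Azure window:  {fmt(azure_start)} to {fmt(azure_end)}")
print(f"Voyado alerts: {fmt(alert_start)} to {fmt(alert_end)}")
print(f"Alerts inside Azure window: {within_window}")
```

Expressed in CET, the Azure window runs from 16:00 to 17:30, so the 16:02 to 16:19 alert period lies within it.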

Actions:

Because the incident was caused by network issues outside the platform, no immediate actions are planned on the Voyado platform as a result.

Posted Jan 22, 2020 - 12:27 CET

Resolved

After analyzing logs, it seems that Voyado became unreachable for many users, but possibly not all. Voyado was up and running and receiving traffic throughout the incident, but appears to have been unreachable for many. We suspect a routing issue or a hiccup in the traffic to our cloud provider from some locations, since our monitors in locations other than Stockholm remained functional throughout the incident.
The incident seems to have occurred at the network or provider level and will therefore be declared resolved for now. Any information on the cause of the incident will be backfilled.

Posted Jan 20, 2020 - 16:35 CET

Update

It seems we're back in business: Voyado is back online and reachable.
The reason for the disturbance is still unclear and under investigation.

Posted Jan 20, 2020 - 16:24 CET

Investigating

We are currently experiencing a major disturbance in the availability of Voyado.
We are investigating the issue, and our main suspect right now is a general availability problem at our cloud provider.

Posted Jan 20, 2020 - 16:12 CET

This incident affected: Engage (API, Web Application, Messaging, Automations, FTP, Other).