[Engage] Application unreachable

Incident Report for Voyado

Postmortem

Summary of the incident

On Tuesday, April 16, 2024, we ran our regular biweekly update to release new code to Engage. Unfortunately, an unexpected issue caused significant slowdowns and downtime in our user interface and APIs, which affected the functionality of Engage for all our customers.

The issue started at 09:25 and was resolved by 12:22 CEST, a downtime of approximately three hours. All endpoints related to API integrations with Engage were significantly affected, so POS and e-commerce platforms could not communicate with Engage, which in turn impacted consumer interactions.

Incident timeline

08:49 Release update started.
09:25 The issue started occurring.
09:28 First alerts of malfunction received; incident lead responded.
09:37 Task force assigned following escalation from the incident lead, working nonstop to solve the issue.
11:47 Fix completed and deployed to production.
12:10 Early indicators showed the issue was resolved.
12:22 Issue verified as solved, normal operation levels confirmed, and case closed.
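The durations implied by the timeline can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of Voyado's tooling: the helper name `minutes_between` and the metric labels are illustrative assumptions, and all timestamps are taken directly from the timeline above (same day, local time).

```python
from datetime import datetime

# Timestamps copied from the incident timeline (same day, local time).
RELEASE_STARTED = "08:49"
ISSUE_STARTED = "09:25"
FIRST_ALERT = "09:28"
FIX_DEPLOYED = "11:47"
VERIFIED_SOLVED = "12:22"


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both given as HH:MM."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Hypothetical metric labels, used here only for illustration.
metrics = {
    "release_start_to_issue": minutes_between(RELEASE_STARTED, ISSUE_STARTED),
    "time_to_first_alert": minutes_between(ISSUE_STARTED, FIRST_ALERT),
    "time_to_fix_deployed": minutes_between(ISSUE_STARTED, FIX_DEPLOYED),
    "total_downtime": minutes_between(ISSUE_STARTED, VERIFIED_SOLVED),
}

for name, minutes in metrics.items():
    print(f"{name}: {minutes // 60} h {minutes % 60} min")
```

The total of 177 minutes (2 h 57 min) matches the "approximately three hours" stated in the summary.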

The incident team followed set processes and provided regular updates on the Voyado Status page to communicate with affected customers. However, we acknowledge the significant impact that the loss of functionality has on our customers. As a result, we have formed a post-incident team to review our processes and procedures to prevent future incidents.

Posted Apr 19, 2024 - 14:06 CEST

Resolved

We are still seeing normal operations and are therefore declaring this incident resolved.
We regained normal operations at approximately 12:10 CEST.

We will keep working on the incident to minimize the risk of this happening again.
A postmortem for this incident will be appended as soon as we have mapped not only what happened, but also why.

Posted Apr 16, 2024 - 13:12 CEST

Monitoring

The initial signs of improvement are holding. At the moment we are seeing normal response times throughout the solution.

We will keep working on this but are changing the status of the incident to monitoring.
If we see any degradation, we will revert this status.

Posted Apr 16, 2024 - 12:23 CEST

Update

The aforementioned fix has been rolled out and we're seeing some initial signs of improvement.
We will keep working on achieving a full resolution.

Posted Apr 16, 2024 - 12:12 CEST

Update

We are continuing to work hard on mitigating this disturbance. A new attempted fix is about to be rolled out. ETA: 30 minutes.

We can see that all parts of the platform are affected by the disturbance, even though we are not completely down. Some requests are getting through, some are slow, and some are not getting through the door, so to speak. We are, however, treating this as if we were completely unavailable.

The slowness and unavailability affect all endpoints in our core APIs as well as the graphical user interface of Engage.

Posted Apr 16, 2024 - 11:36 CEST

Investigating

Unfortunately, the applied fix was not enough to get us out of the woods.
The same symptoms have returned after the fix was applied.

Work is still ongoing with top priority.
Any information made available will be communicated as soon as possible.

Posted Apr 16, 2024 - 11:08 CEST

Update

A first fix is currently being applied to try to mitigate some of the symptoms we can see through our monitoring.
As the application and its resources are updated, we hope to see a change for the better.
We are, however, also looking into further actions.

Posted Apr 16, 2024 - 10:41 CEST

Identified

We have identified a possible cause for the disturbance and are implementing a fix.

Posted Apr 16, 2024 - 10:24 CEST

Update

We are continuing to investigate this issue.

Posted Apr 16, 2024 - 10:17 CEST

Update

Unfortunately, we are still seeing major increases in response times from the application.
This leads to a traffic build-up, causing some requests to receive 503 responses.

Mitigating the incident is our top priority and we have all hands on deck.
No ETA at the moment. Information will be provided as soon as it's available.
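
For integrations that receive 503 responses during a disturbance like this one, a common client-side pattern is to retry with exponential backoff and jitter, so that retries do not add to the traffic build-up. The following is a minimal sketch under stated assumptions, not official Voyado guidance: `call_with_backoff` and the `send` callable are hypothetical names, and `send` stands in for whatever function performs the actual request and returns a status code and body.

```python
import random
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-503 status or attempts run out.

    send is a hypothetical callable returning (status_code, body).
    The last response is returned, even if it is still a 503.
    """
    status, body = 503, ""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter: wait a random time
            # between 0 and min(max_delay, base_delay * 2**attempt).
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    return status, body
```

The `sleep` parameter is injectable so the function can be exercised without real waiting. In production code, the caller would typically also queue or persist requests that still fail after the last attempt.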

Posted Apr 16, 2024 - 10:16 CEST

Investigating

We have been experiencing longer response times since around 09:18 AM.

Posted Apr 16, 2024 - 09:31 CEST

This incident affected: Engage (API, Web Application, 3rd Party Integrations).