Partial outage due to extraordinary load

Incident Report for Wordly Inc.

Postmortem

We want to explain what happened during this Wordly service interruption and what we are doing to ensure it does not happen again.

First, we would like to apologize to those of you who had difficulties using Wordly during this time. We understand that you depend on Wordly to facilitate communication and any interruption in service can cause serious problems. Our goal is to be available whenever you need us and we are working to ensure that the failure that caused this outage does not happen again.

What happened

The simple reason for this service interruption was an unusually large number of attendees all joining a session at the same time. Within just one or two minutes, many tens of thousands of attendees joined, and this sudden load slowed down the Wordly backend enough that new attendees and presenters were unable to connect until the load subsided.

Unfortunately, this sudden load caused a cascade of problems that took some time to repair. The Wordly backend is implemented as a collection of redundant services scaled to handle large numbers of attendees. In this case the number was unusually large, and the services slowed down while they were setting up all the new connections. The slowdown lasted long enough that our watchdog service decided the services were no longer working correctly and began to restart them. These restarts slowed things even more, and the backend became unable to catch up.

During this time new attendees for other sessions were also caught in the slowdown, as was the Portal web application, which made it difficult for users to find out what was happening.

What has been done already

We have made a number of changes already to prevent this cascade of problems from happening again:

We have changed the watchdog service so it will not try to restart a slow service in this situation. This should prevent backend services from falling further and further behind.
We have made changes to the process that creates an attendee connection. This will speed up the connection and enable the backend services to handle larger loads before slowing down.
We have updated the number of backend services to better handle unexpectedly large loads. There may still be temporary slowdowns in very unusual cases but the new services should be able to catch up quickly.
We have changed our internal process for handling outages so that this status page is updated more quickly and you will be able to follow our recovery process in real-time.
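
For readers curious about the watchdog change, the idea can be illustrated with a minimal sketch. This is not the actual Wordly watchdog; the names HealthSample and should_restart, and both thresholds, are hypothetical and chosen only for illustration. The sketch assumes the watchdog can see whether a service answered a health probe, how long the probe took, and how many connections are still being set up. A service that is slow because it is busy is left alone, while one that does not answer at all, or is slow without being busy, is restarted.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """One health probe result (hypothetical shape, for illustration only)."""
    responded: bool           # did the service answer the probe at all?
    latency_s: float          # how long the probe took, in seconds
    pending_connections: int  # connections still being set up


def should_restart(sample: HealthSample,
                   slow_threshold_s: float = 5.0,
                   busy_threshold: int = 1000) -> bool:
    """Return True only when the service looks broken rather than merely busy.

    Both thresholds are illustrative assumptions, not production values.
    """
    if not sample.responded:
        # No answer at all: treat the service as dead.
        return True
    is_slow = sample.latency_s > slow_threshold_s
    is_busy = sample.pending_connections >= busy_threshold
    if is_slow and is_busy:
        # Slow because of load: a restart would discard work in progress
        # and push the backend further behind.
        return False
    # Slow with little work pending suggests a genuine fault.
    return is_slow
```

With this rule, a probe that takes 10 seconds while 50,000 connections are pending does not trigger a restart, whereas the same latency with only a handful of pending connections does.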

What is being done now

We are also working on a number of additional changes to improve how we handle future incidents. These changes will roll out over the next few weeks:

We will make this status page easier to find so that you will be better able to get information about problems in a timely manner.
We will make changes to the Portal so that it behaves better when it cannot talk to our backend services. This will enable much easier access to immediate help and support.
We will be separating the backend services used by the Portal from those used by attendee connections so that problems in one area do not prevent access to other Wordly services.
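
The separation described in the last item is often called a bulkhead. The sketch below is one possible illustration under stated assumptions, not a description of how Wordly implements it: the Bulkhead class, the pool names, and the slot counts are all hypothetical. Each area gets its own fixed number of slots, so a flood of attendee connections can exhaust only the attendee pool and requests from the Portal are still accepted.

```python
import threading
from typing import Any, Callable, Tuple


class Bulkhead:
    """A fixed-size pool of slots for one area of the system (hypothetical)."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_run(self, task: Callable[[], Any]) -> Tuple[bool, Any]:
        """Run task if a slot is free; otherwise reject immediately.

        Returns (accepted, result). A rejected call returns (False, None)
        so the caller can show a "busy" message instead of waiting.
        """
        if not self._slots.acquire(blocking=False):
            return (False, None)
        try:
            return (True, task())
        finally:
            self._slots.release()


# Separate pools: saturating one does not affect the other.
# The slot counts here are illustrative assumptions.
attendee_pool = Bulkhead("attendee-connections", max_concurrent=1)
portal_pool = Bulkhead("portal", max_concurrent=1)
```

Rejecting immediately rather than queueing is a design choice made for this sketch: it keeps the saturated pool from accumulating a backlog, at the cost of asking the caller to retry.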

Posted Jul 04, 2021 - 14:17 PDT

Resolved

The backend services needed to access the Portal and join a session as an attendee slowed down to a crawl and, as a result, many customers were unable to start new sessions.

Previously running sessions with existing attendees were unaffected, although new attendees had difficulty joining those sessions.

The incident was detected quickly but it took about 15 minutes to diagnose the specific cause and about 30 minutes to generate a fix. It took about 20 more minutes to deploy the fix and restore service to all affected users.

Posted Jun 29, 2021 - 07:30 PDT