We want to explain what happened during this Wordly service interruption and what we are doing to ensure it does not happen again.
First, we would like to apologize to those of you who had difficulties using Wordly during this time. We understand that you depend on Wordly to facilitate communication and that any interruption in service can cause serious problems. Our goal is to be available whenever you need us, and we are working to ensure that the failure that caused this outage does not happen again.
What happened
The simple reason for this service interruption was an unusually large number of attendees joining a session at the same time. Within just one or two minutes, many tens of thousands of attendees joined together, and this sudden load slowed down the Wordly backend enough that new attendees and presenters were unable to connect until the load subsided.
Unfortunately, this sudden load caused a cascade of problems that took some time to repair. The Wordly backend is implemented as a collection of redundant services scaled to handle large numbers of attendees. In this case the number was unusually large, and the services slowed down while they were setting up all the new connections. The slowdown lasted long enough that our watchdog service decided the services were no longer working correctly and began to restart them. These restarts slowed things even more, and the backend became unable to catch up.
During this time new attendees for other sessions were also caught in the slowdown, as was the Portal web application, which made it difficult for users to find out what was happening.
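For readers who want a more concrete picture of the cascade described above, here is a toy simulation. It is only an illustrative sketch and not Wordly's actual code: the function name `simulate`, all of its parameters, and the numbers used are hypothetical, and it assumes that a restarted service loses the connections it set up since the previous restart, so those attendees have to reconnect.

```python
def simulate(joiners, capacity_per_tick, watchdog_timeout, restart_ticks,
             max_ticks=1000):
    """Toy model of a service working through a backlog of new connections.

    Returns the number of ticks needed to clear the backlog, or None if the
    backlog is not cleared within max_ticks.
    All names and numbers here are hypothetical, for illustration only.
    """
    backlog = joiners      # connections still waiting to be set up
    busy_ticks = 0         # consecutive ticks spent working with a backlog
    restarting = 0         # ticks left before a restarted service is back

    for tick in range(1, max_ticks + 1):
        if restarting > 0:
            # The service is down while it restarts; no progress is made.
            restarting -= 1
            continue

        backlog = max(0, backlog - capacity_per_tick)
        if backlog == 0:
            return tick

        busy_ticks += 1
        if watchdog_timeout is not None and busy_ticks >= watchdog_timeout:
            # The watchdog mistakes "slow under load" for "broken" and
            # restarts the service. Assumption: connections set up since
            # the last restart are dropped and must be set up again.
            backlog += capacity_per_tick * busy_ticks
            busy_ticks = 0
            restarting = restart_ticks

    return None


# Without a watchdog, the spike is slow but eventually clears.
print(simulate(30000, 1000, None, 5))
# With an impatient watchdog, the same spike never clears.
print(simulate(30000, 1000, 10, 5))
# A smaller spike clears before the watchdog ever reacts.
print(simulate(5000, 1000, 10, 5))
```

In this model the watchdog only causes harm when the backlog takes longer to clear than the watchdog is willing to wait, which matches the pattern described above: a slowdown that would have resolved on its own was turned into a longer outage by the restarts.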
What has been done already
We have made a number of changes already to prevent this cascade of problems from happening again:
What is being done now
We are also working on a number of additional changes to improve how we handle future incidents that might arise. These changes will roll out over the next few weeks: