We updated our app servers starting at 15:21 UTC using a changed deployment script. Unfortunately, the new script used the wrong configuration, which pointed our apps at the wrong database tables. As a result, the apps could no longer authenticate our users and returned a “403 error” instead. Fortunately, the error included a message to contact firstname.lastname@example.org, and a few customers notified us there starting at 15:44 UTC.
The cause of the issue was identified very quickly because the deployment script change had only just been rolled out. We started deploying the correct configuration at 16:03 UTC and finished at 16:15 UTC, when the last servers running the wrong configuration shut down. No data was lost; only workflow triggers and other webhooks (e.g. Slack notifications) may have been affected.
We’re very sorry that this happened at all, especially since we hadn’t had a critical outage in more than a year. It was also the first time we were able to communicate an outage on our statuspage, which helped us quickly keep our customers informed of the status.
To prevent an outage like this in the future, we’ll definitely be more careful when rolling out deployment changes. We also want to be alerted earlier, so that we don’t rely on customers to flag a critical outage for us. We’ll achieve this by implementing live tests that continuously exercise our apps and report any errors back to us. We’re already alerted when the number of 403 (and other 4xx) errors is unusually high, but it would probably have taken another hour before we realized that a critical outage was ongoing.
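A live test of this kind can be as simple as a small script that periodically calls an app endpoint and classifies the response, paging us immediately on auth failures like the one in this outage. A minimal sketch, where the endpoint URL and the severity mapping are illustrative assumptions rather than our actual monitoring setup:

```python
import urllib.request
import urllib.error

# Hypothetical health-check endpoint; a real live test would hit an
# actual app URL, ideally authenticated as a dedicated test user.
HEALTH_URL = "https://example.org/app/health"


def classify_response(status: int) -> str:
    """Map an HTTP status code to an alert severity for the live test."""
    if 200 <= status < 300:
        return "ok"
    if status in (401, 403):
        # Exactly this outage's failure mode: authentication broken app-wide.
        return "critical"
    return "warning"


def run_live_test(url: str = HEALTH_URL) -> str:
    """Call the endpoint once and return the severity of the result."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return classify_response(resp.status)
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; classify the error status instead.
        return classify_response(err.code)
    except urllib.error.URLError:
        # App unreachable counts as critical, too.
        return "critical"
```

Run on a schedule (e.g. every minute from a cron job or an external monitor), a "critical" result would trigger an alert within minutes instead of waiting for the 4xx error-rate threshold to trip.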
We’re committed to making high-quality apps for the Atlassian Marketplace and will continue to increase the reliability of our apps in the future.