Merge Agent for Jira and Dynamic Custom Fields for Jira were partially unavailable between March 11, 23:39 UTC and March 12, 01:21 UTC. Both apps run on the same set of backend servers, which were flooded during that time with Jira webhook requests from a single Dynamic Custom Fields for Jira customer. As a result, our backend servers became unresponsive to web requests.
At around 00:40 UTC, our primary pager duty contact was woken up by the ongoing alerts. He investigated them and decided to restart the affected servers. Because the traffic spike also subsided during that time, the apps started working again around 01:21 UTC. Because the alerts came in during the night (01:40 local time) and he was able to resolve the problem quickly, we did not communicate with affected customers via our status page or our support channels.
Further investigation on the next business day revealed that the flood of webhook requests caused a backlog of requests to Jira, which was itself overwhelmed and responded with status 429 “Too Many Requests”. Because we wait for some time and then re-send any failed request to Jira, each throttled webhook request stayed open on our backend servers for the duration of those retry delays. Too many requests were open at the same time, and the servers eventually stopped responding to other requests.
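To illustrate why retry delays were so costly, here is a rough sketch based on Little's law (average requests in flight = arrival rate × time each request stays open). The arrival rate, call duration and retry delays below are hypothetical numbers chosen only for illustration; they are not measurements from the incident.

```python
def hold_time_with_retries(call_time_s: float, retry_delays_s: list) -> float:
    """Worst case: every attempt is answered with 429, so the handler
    makes one call per attempt and waits out every retry delay."""
    attempts = len(retry_delays_s) + 1
    return attempts * call_time_s + sum(retry_delays_s)


def open_requests(arrival_rate_per_s: float, hold_time_s: float) -> float:
    """Little's law: average number of requests open at the same time."""
    return arrival_rate_per_s * hold_time_s


# Hypothetical values, not taken from the incident.
RATE = 200          # incoming webhook requests per second
CALL_TIME = 0.25    # seconds for one request to Jira

healthy = open_requests(RATE, hold_time_with_retries(CALL_TIME, []))
throttled = open_requests(RATE, hold_time_with_retries(CALL_TIME, [1, 2, 4]))
```

Under these assumed numbers, the same incoming rate keeps 50 requests open when Jira answers on the first attempt, but 1600 once every attempt is throttled and retried three times.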
Before the incident, we already processed those webhook requests in an asynchronous queue, but only after making a single request to Jira. After the incident, we concluded that we cannot prevent another flood of webhook requests from Jira and therefore must not make any requests to Jira during their initial processing. In the week following the incident, we improved our webhook processing so that the initial processing is very quick and involves no interaction with Jira. We can now handle a far greater number of incoming Jira webhook requests. Even if our auto-scaling queue could no longer keep up, it would fail much faster and requests would be closed much more quickly, making it unlikely that our backend servers would be overwhelmed.
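The general pattern described above can be sketched as follows. This is a minimal illustration, not our production code: the in-process `queue.Queue` stands in for the auto-scaling queue, and the names `handle_webhook` and `webhook_queue`, the queue bound, and the 202/503 status codes are all assumptions made for the example.

```python
import queue

# Hypothetical bound; a full queue rejects new items instead of making them wait.
webhook_queue = queue.Queue(maxsize=1000)


def handle_webhook(payload: dict, q: queue.Queue = webhook_queue) -> int:
    """Accept a webhook without calling Jira and return the HTTP status to send."""
    try:
        q.put_nowait(payload)  # never blocks, so the request is not held open
    except queue.Full:
        return 503  # fail fast: close the request immediately
    return 202  # accepted; a separate worker talks to Jira later
```

For example, with a queue bounded to two items, `handle_webhook` returns 202 for the first two payloads and 503 for the third. In both cases the request is closed right away, whether or not Jira is currently throttling.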
We’re very sorry that we did not respond during the night to customers who were worried about the outage. We’re aware that the outage happened during business hours in other parts of the world. In the future, we’ll open an incident on our status page right away, no matter the local time, and try to respond to any support requests until the status page has been updated. This will hopefully give our customers confidence that we’re actively working on incident mitigation.
We’re committed to making high-quality apps for the Atlassian Marketplace and will continue to improve the reliability of our apps.