Between approximately 00:14 UTC on 27th February 2021 and 11:30 UTC on 1st March 2021, Gearset experienced an issue that prevented all

  • Change Monitors,

  • Test Monitors and

  • CI jobs

from running during that time for some of our users. We'd like to apologise for the disruption and explain what happened:

The scheduling service is responsible for making jobs run at their scheduled times. It places jobs onto a queue, from which a separate system picks them up and actually runs them.
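To illustrate the scheduler/worker split described above, here is a minimal sketch using an in-process queue in place of the real AWS queueing infrastructure. All names are illustrative, not Gearset's actual code.

```python
import queue
import threading
import time

# In-process stand-in for the real queueing infrastructure.
job_queue: "queue.Queue[str]" = queue.Queue()

def scheduler(jobs: list[tuple[float, str]]) -> None:
    """Enqueue each job once its scheduled time arrives."""
    for run_at, job_id in sorted(jobs):
        time.sleep(max(0.0, run_at - time.monotonic()))
        job_queue.put(job_id)

def worker() -> None:
    """Pick jobs off the queue and run them."""
    while True:
        job_id = job_queue.get()
        print(f"running {job_id}")  # stand-in for the real job runner
        job_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
now = time.monotonic()
scheduler([(now + 0.1, "change-monitor-42"), (now + 0.2, "ci-job-7")])
job_queue.join()  # wait until both jobs have been run
```

The point of the split is that either side can restart independently: jobs already on the queue survive a worker restart, and a restarted scheduler only needs to resume enqueueing.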

Just after midnight, the scheduler suffered an unexpected restart. This does happen on occasion, and the system was designed with this in mind: it would normally recover within a few seconds and no jobs would be missed.

Unfortunately, in this particular instance the part of our AWS infrastructure that handles the queueing also failed. Again, we would normally handle this gracefully and reconnect, but because both disconnection events happened at the same time, they exposed a gap in our error handling.
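One common way to close this kind of gap is to make startup itself retry the queue connection with backoff, so a restart that coincides with a queue outage simply waits for the queue to return. The sketch below shows that pattern under assumed, hypothetical names; `connect_to_queue` here is a stub that simulates the queue recovering, not a real client.

```python
import itertools
import time

class QueueUnavailable(Exception):
    """Raised while the queueing infrastructure is unreachable."""

_attempts = itertools.count(1)

def connect_to_queue() -> str:
    # Stub standing in for a real queue client: fail twice, then
    # succeed, simulating the queue infrastructure recovering.
    if next(_attempts) < 3:
        raise QueueUnavailable("queue not reachable yet")
    return "queue-connection"

def start_scheduler() -> str:
    """Reconnect with exponential backoff on startup, so a scheduler
    restart that coincides with a queue outage still recovers once
    the queue comes back."""
    backoff = 0.1
    while True:
        try:
            return connect_to_queue()
        except QueueUnavailable:
            time.sleep(backoff)
            backoff = min(backoff * 2, 5.0)

if __name__ == "__main__":
    print("connected:", start_scheduler())
```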

We also had a gap in our alerting system, which failed to spot this issue and warn us sooner. Our development team quickly identified and fixed the gap in the logic, and we have added further alerts so that if something like this happens again we will be warned much sooner and can react immediately.
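A common pattern for catching "silently stopped" failures like this one, where per-error alerts never fire because nothing is running, is a heartbeat check: alert if the scheduler has not enqueued anything recently. A minimal sketch, with hypothetical names and threshold:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: how long the scheduler may go without
# enqueueing a job before we treat it as unhealthy.
MAX_SILENCE = timedelta(minutes=10)

def scheduler_is_healthy(last_enqueue_at: datetime) -> bool:
    """Return False (i.e. fire an alert) if nothing has been enqueued
    for longer than MAX_SILENCE."""
    return datetime.now(timezone.utc) - last_enqueue_at <= MAX_SILENCE

if __name__ == "__main__":
    stale = datetime.now(timezone.utc) - timedelta(minutes=30)
    print("healthy:", scheduler_is_healthy(stale))  # False -> alert
```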

Data backup jobs and scheduled deployments were not affected.
