Failures and Fixes: A Study of Software System Incident Response
Jonathan Sillito, Esdras Kutomi

TL;DR
This study analyzes 30 software system incidents to understand failure causes, detection, and mitigation, providing insights to improve system engineering and support practices.
Contribution
It offers a qualitative analysis of failures and challenges in incident response, highlighting key observations to enhance software system resilience.
Findings
Failures can cascade, causing major outages.
Engineers often do not understand the scaling limits of the systems they support.
Current practices face challenges in failure detection and mitigation.
Abstract
This paper presents the results of a research study related to software system failures, with the goal of understanding how we might better evolve, maintain and support software systems in production. We have qualitatively analyzed thirty incidents: fifteen collected through in-depth interviews with engineers, and fifteen sampled from publicly published incident reports (generally produced as part of postmortem reviews). Our analysis focused on understanding and categorizing how failures occurred, and how they were detected, investigated and mitigated. We also captured analytic insights related to the current state of the practice and associated challenges in the form of 11 key observations. For example, we observed that failures can cascade through a system, leading to major outages, and that engineers often do not understand the scaling limits of systems they are supporting until those…
