The Art of Reliability War

02 Waging War

The Art of Incident Management

In the theater of digital operations, where systems and services form the backbone of modern enterprises, Site Reliability Engineering (SRE) stands as a sentinel guarding against chaos. Within SRE, incident management is the strategic battlefield, where swift decisions, tight coordination, and sound resource allocation determine the outcome.

Two key metrics quantify the efficiency of incident response: MTTO (Mean Time to Organize) and MTTR (Mean Time to Restore).
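
As a rough illustration of how these two metrics might be computed, the sketch below averages per-incident durations. It rests on assumptions the text does not spell out: that both clocks start at detection, that MTTO ends when the response team has been assembled, and that MTTR ends when service is restored. The Incident record, its field names, and the sample timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative assumptions.
    detected_at: datetime   # when the incident was first detected
    organized_at: datetime  # when the response team was assembled (assumed meaning)
    restored_at: datetime   # when service was restored


def mean_duration(durations):
    """Average a non-empty list of timedeltas."""
    return timedelta(seconds=mean(d.total_seconds() for d in durations))


def mtto(incidents):
    """Mean Time to Organize: detection -> team organized (assumed definition)."""
    return mean_duration([i.organized_at - i.detected_at for i in incidents])


def mttr(incidents):
    """Mean Time to Restore: detection -> service restored (assumed definition)."""
    return mean_duration([i.restored_at - i.detected_at for i in incidents])


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2022, 1, 1, 10, 0), datetime(2022, 1, 1, 10, 5), datetime(2022, 1, 1, 10, 45)),
    Incident(datetime(2022, 1, 2, 14, 0), datetime(2022, 1, 2, 14, 15), datetime(2022, 1, 2, 15, 15)),
]

print(mtto(incidents))  # 0:10:00
print(mttr(incidents))  # 1:00:00
```

Tracking both numbers side by side shows how much of the total restoration time is spent simply getting the right people together.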

With these metrics in mind, let's explore the three primary recovery strategies: failover, fallback, and fix.

The choice of strategy often hinges on the incident's nature, system architecture, and potential user impact. An effective SRE team typically employs a combination of these strategies, ensuring they can address a broad spectrum of incidents with agility and efficiency.
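
One way to picture this choice is as an ordered plan rather than a single pick. The sketch below is only an illustration under stated assumptions: it takes failover to mean shifting traffic to a healthy standby, fallback to mean reverting to the last known-good version, and fix to mean repairing the fault in place. The IncidentContext fields and the ordering rule are hypothetical, not a prescribed procedure.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    FAILOVER = "failover"  # assumed: shift traffic to a healthy standby
    FALLBACK = "fallback"  # assumed: revert to the last known-good version
    FIX = "fix"            # assumed: repair the fault in place


@dataclass
class IncidentContext:
    # Hypothetical inputs standing in for "system architecture" and "incident nature".
    healthy_standby_available: bool  # is there a redundant system to switch to?
    caused_by_recent_change: bool    # did a recent release or config change trigger it?


def plan_strategies(ctx):
    """Return the strategies to attempt, in order (illustrative rule only)."""
    plan = []
    if ctx.healthy_standby_available:
        plan.append(Strategy.FAILOVER)
    if ctx.caused_by_recent_change:
        plan.append(Strategy.FALLBACK)
    # A direct fix is always kept as the last resort in this sketch.
    plan.append(Strategy.FIX)
    return plan


plan = plan_strategies(IncidentContext(healthy_standby_available=False,
                                       caused_by_recent_change=True))
print([s.value for s in plan])  # ['fallback', 'fix']
```

Returning a list instead of a single value mirrors the idea of combining strategies: if the first option does not restore service, the team moves on to the next.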


After the battle, a wise general reflects on the engagement, analyzing successes and areas for improvement. Similarly, post-incident reviews in the SRE realm ensure that lessons are learned, processes refined, and systems fortified against future incidents.


Incident management in SRE is a dynamic and challenging domain in which sound strategy, quick decisions, and effective resource allocation are pivotal. Sun Tzu's principles are a reminder that strategic thinking is timeless, whether in ancient battles or in today's digital operations.


The Art of Reliability War, v1, 2022