The Art of Reliability War

04 Tactical Dispositions

Tactical Preparedness for Incident Management

In Site Reliability Engineering (SRE), the tactical landscape comprises systems, services, and the strategies employed to ensure their reliability, rather than physical terrains and armies. Just as Sun Tzu emphasized the importance of positioning in warfare, SREs must strategically position their systems for maximum uptime and reliability. Here, we explore the tactical strategies of failover, fallback, and fixing, and their relationship to key metrics like MTTD, MTTO, and MTTR.


Failover and Fallback: The Defensive Tactics




Fixing: The Offensive Tactic

While failover and fallback are defensive strategies, fixing is offensive. It involves pinpointing the root cause of an issue and resolving it. It's essential to employ defensive tactics swiftly before embarking on this more time-consuming path.
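The ordering described above (defend first, fix later) can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation. It assumes that failover means switching to a redundant replica and that fallback means serving a degraded or previously known-good response. The names `serve_with_defenses`, `primary`, `secondary`, and `cached` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def serve_with_defenses(
    replicas: Sequence[Callable[[], T]],
    fallback: Callable[[], T],
) -> Tuple[T, str, List[Exception]]:
    """Try each replica in order (failover); if all fail, use the fallback.

    Returns the result, a label for the path that served it, and the errors
    collected along the way so the root cause can be fixed afterwards.
    """
    errors: List[Exception] = []
    for index, replica in enumerate(replicas):
        try:
            return replica(), f"replica-{index}", errors
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(exc)
    # Every replica failed: degrade gracefully instead of surfacing an error.
    return fallback(), "fallback", errors


# Hypothetical backends used only for illustration.
def primary() -> str:
    raise ConnectionError("primary datastore unreachable")


def secondary() -> str:
    return "fresh data from secondary"


def cached() -> str:
    return "stale data from cache"


result, path, errors = serve_with_defenses([primary, secondary], cached)
```

In this sketch the user is served by the secondary replica right away, while the collected `errors` list preserves the evidence needed for the slower, offensive work of fixing the root cause.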


In the tactical resolution of issues, two metrics are paramount:



Advanced Deployment Strategies: The Art of Deception


In SRE, deception does not mean misleading adversaries; it means testing changes in ways that reduce risk:
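As one illustration of a risk-reducing change test, a canary-style rollout exposes a change to a small, stable slice of traffic and promotes it only if the slice behaves about as well as the baseline. The sketch below is written under stated assumptions: the function names, the hash-based bucketing, and the 1% error-rate tolerance are all hypothetical choices, not a description of any particular deployment tool.

```python
import hashlib


def in_canary(request_id: str, canary_percent: int) -> bool:
    """Deterministically place a request in the canary group.

    Hashing keeps the same request_id in the same group across calls.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < canary_percent


def should_promote(
    canary_errors: int,
    canary_total: int,
    baseline_errors: int,
    baseline_total: int,
    tolerance: float = 0.01,  # assumed threshold, tune per service
) -> bool:
    """Promote only if the canary error rate is within tolerance of baseline."""
    if canary_total == 0 or baseline_total == 0:
        return False  # not enough traffic to judge; keep the change contained
    canary_rate = canary_errors / canary_total
    baseline_rate = baseline_errors / baseline_total
    return canary_rate <= baseline_rate + tolerance
```

Deterministic bucketing is a deliberate choice here: a given user or request identifier stays on one side of the experiment, which makes any regression easier to attribute to the change.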






After high- and critical-priority incidents, SRE teams must conduct retrospectives that address two vital questions:

A retrospective or post-mortem that does not produce actionable answers to those two questions is of little use.


In SRE's tactical world, the strategies employed can separate a resilient system from one prone to extended outages. By understanding and applying tactics like failover, fallback, and advanced deployment strategies, SREs can position their systems for optimal reliability and uptime. Echoing Sun Tzu's philosophy, the objective is to achieve reliability (victory) with minimal conflict and loss.

The Art of Reliability War, v1, 2022