04 Tactical Dispositions
Tactical Preparedness for Incident Management
In Site Reliability Engineering (SRE), the tactical landscape comprises systems, services, and the strategies employed to ensure their reliability, rather than physical terrains and armies. Just as Sun Tzu emphasized the importance of positioning in warfare, SREs must strategically position their systems for maximum uptime and reliability. Here, we explore the tactical strategies of failover, fallback, and fixing, and their relationship to key metrics like MTTD, MTTO, and MTTR.
Failover and Fallback: The Defensive Tactics
Failover: This refers to the automatic transition to a redundant system or backup component when the primary system fails. It's akin to having a reserve army ready to replace the main force if compromised. With a failover strategy, services can persist even when a system component fails, thus reducing the Mean Time To Restore (MTTR).
Fallback: Fallback strategies provide alternative methods to handle requests or processes if the primary method fails. It's analogous to having alternative routes for an army to retreat or advance if the primary path is obstructed. With a fallback, systems can continue to deliver, sometimes at a diminished capacity, further curtailing MTTR.
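The two defensive tactics above can be sketched together in a few lines of Python. This is a minimal illustration, not a production pattern: the function names (call_with_failover, primary, replica, cached) are hypothetical, and real code would catch narrower exception types and add timeouts and health checks.

```python
from typing import Callable, Sequence, Tuple

Handler = Callable[[str], str]


def call_with_failover(
    backends: Sequence[Handler], fallback: Handler, request: str
) -> Tuple[str, str]:
    """Try each redundant backend in order (failover).

    If every backend fails, serve a degraded answer (fallback).
    Returns a (path_taken, response) pair so callers can see which tactic applied.
    """
    for index, backend in enumerate(backends):
        try:
            path = "primary" if index == 0 else f"failover-{index}"
            return path, backend(request)
        except Exception:
            # Assumption for brevity: any exception means "this backend is down".
            continue
    return "fallback", fallback(request)


# Hypothetical handlers used only to exercise the sketch.
def primary(request: str) -> str:
    raise ConnectionError("primary is down")


def replica(request: str) -> str:
    return f"fresh result for {request}"


def cached(request: str) -> str:
    return f"cached result for {request}"


if __name__ == "__main__":
    print(call_with_failover([primary, replica], cached, "q"))
    print(call_with_failover([primary], cached, "q"))
```

Returning the path taken alongside the response is a deliberate choice here: it lets monitoring count how often requests are served by a replica or in degraded mode, which is itself a useful signal.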
Fixing: The Offensive Tactic
While failover and fallback are defensive strategies, fixing is offensive. It involves pinpointing the root cause of an issue and resolving it. It's essential to employ defensive tactics swiftly before embarking on this more time-consuming path.
In the tactical resolution of issues, three measures are paramount:
Blast Radius: This metaphorically denotes how many users, services, or system components an incident or change affects. When using tactical options for service restoration, SREs should consider actions that might temporarily diminish the blast radius, offering relief to a portion or majority of users.
MTTO (Mean Time To Organize): This metric indicates the time required to assemble the necessary resources and teams to address a detected issue. Clearly defined SLOs, monitors, alerts, and a coordinated team with explicit on-call and escalation paths can significantly reduce MTTO. For high-priority incidents, paging several tiger teams simultaneously can decrease MTTO, though it sometimes produces larger-than-necessary teams. Striking a balance is an art.
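MTTR (Mean Time To Restore): This is the average time from the start of an incident until service is restored. By optimizing both MTTD (Mean Time To Detect) and MTTO and prioritizing defensive techniques, the overall MTTR can be minimized.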
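As a rough illustration of how these measures might be computed from incident records, consider the sketch below. The Incident fields and the exact measurement boundaries are assumptions made for this example: MTTD is taken from fault start to detection, MTTO from detection to an assembled team, MTTR from fault start to restoration, and blast radius as the fraction of users affected. Teams often draw these boundaries differently.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Dict, List


@dataclass
class Incident:
    started: datetime    # when the fault began (assumed known after the fact)
    detected: datetime   # when monitoring raised the alert
    organized: datetime  # when the responding team was assembled
    restored: datetime   # when service was restored for users
    users_affected: int
    users_total: int


def mean_minutes(deltas) -> float:
    """Average a collection of timedeltas and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def summarize(incidents: List[Incident]) -> Dict[str, float]:
    return {
        "mttd_min": mean_minutes(i.detected - i.started for i in incidents),
        "mtto_min": mean_minutes(i.organized - i.detected for i in incidents),
        "mttr_min": mean_minutes(i.restored - i.started for i in incidents),
        "mean_blast_radius": mean(i.users_affected / i.users_total for i in incidents),
    }


# Two made-up incidents, used only to exercise the sketch.
EXAMPLE_INCIDENTS = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5),
             datetime(2024, 1, 1, 10, 15), datetime(2024, 1, 1, 10, 45), 200, 1000),
    Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 3),
             datetime(2024, 1, 2, 14, 8), datetime(2024, 1, 2, 14, 20), 50, 1000),
]

if __name__ == "__main__":
    print(summarize(EXAMPLE_INCIDENTS))
```

With these boundaries, MTTD and MTTO are both contained in MTTR, which is why shortening either one shortens the overall time to restore.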
Advanced Deployment Strategies: The Art of Deception
In SRE, "deception" is not about misleading adversaries. It refers to deployment techniques that reduce the risk of a change by controlling who is exposed to it and when:
Feature Toggles: These enable specific features to be activated or deactivated without deploying fresh code, akin to adaptable scouts.
Blue/Green Deployments: This strategy uses two production environments: one (blue) running the current version and another (green) running the new version. Traffic can be switched between them, which allows near-zero downtime and a fast rollback if problems emerge.
Canary Deployments: New changes are introduced to a small user subset before full deployment, reminiscent of sending a small detachment ahead of the main army.
A/B Testing: This method compares two webpage or app versions to determine which performs better based on user engagement or other metrics.
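Two of the techniques above, feature toggles and canary deployments, can be combined in a small routing sketch. Everything named here is hypothetical (the "new_checkout" flag, the choose_version function, the 100-bucket scheme); it simply shows one common way to pick a stable, repeatable subset of users by hashing their IDs.

```python
import hashlib
from typing import Dict, Optional

# Hypothetical in-memory flag store; real systems usually read flags from a service.
FLAGS: Dict[str, bool] = {"new_checkout": True}


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets) using a hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def in_canary(user_id: str, percent: int) -> bool:
    """True if the user falls inside the first `percent` of 100 buckets."""
    return bucket(user_id) < percent


def choose_version(
    user_id: str, canary_percent: int, flags: Optional[Dict[str, bool]] = None
) -> str:
    """Route a user to 'canary' or 'stable'.

    The feature toggle acts as a kill switch: when it is off, every user gets
    the stable version regardless of the canary percentage.
    """
    active_flags = FLAGS if flags is None else flags
    if not active_flags.get("new_checkout", False):
        return "stable"
    return "canary" if in_canary(user_id, canary_percent) else "stable"


if __name__ == "__main__":
    for uid in ("alice", "bob", "carol"):
        print(uid, choose_version(uid, canary_percent=10))
```

Hashing the user ID, rather than choosing randomly per request, keeps each user on the same version across requests, and raising canary_percent only ever adds users to the canary group. Turning the flag off shrinks the blast radius to zero without a deployment.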
After high- and critical-priority incidents, SRE teams must conduct retrospectives that address two vital questions:
How can we prevent this issue's recurrence?
How can we reduce the blast radius, MTTD, MTTO, and MTTR from this issue?
A retrospective or post-mortem that does not produce actionable answers to both questions has little value.
In SRE's tactical world, the employed strategies can differentiate a resilient system from one susceptible to extended outages. By understanding and applying tactics like failover, fallback, and advanced deployment strategies, SREs can position their systems for optimal reliability and uptime. Echoing Sun Tzu's philosophy, the objective is achieving reliability (victory) with minimal conflict and loss.