08 Variation in TACTICS

The ACT (Assess, Commit, Try) Framework

In the chapter “Variation in Tactics”, Sun Tzu stresses flexibility in tactics: knowing when to engage, adapting to the situation, avoiding predictability, and weighing all factors. The same situational awareness is vital to success in SRE. SREs can draw on a range of proactive and reactive approaches to achieve it.

Proactive measures are those that are taken in anticipation of potential issues, aiming to prevent them. Reactive measures, on the other hand, are responses to issues after they've occurred, aiming to mitigate and learn from them.

Proactive Measures:

Service Level Objectives (SLOs) and Service Level Indicators (SLIs): Setting clear expectations about system reliability.
Error Budgets: Balancing between pushing new features and ensuring reliability.
Capacity Planning: Forecasting future needs to ensure infrastructure can handle demand.
Automation: Reducing human error by automating manual tasks and processes.
Infrastructure as Code (IaC): Managing infrastructure through code to ensure consistency and repeatability.
Proactive Testing:
- Chaos Engineering: Intentionally introducing failures to test system resilience.
- Load Testing: Simulating high traffic to ensure the system can handle it.
Performance Engineering: Regularly profiling and optimizing applications and infrastructure components.
Change Management: Understanding the impact of changes to reduce potential outages.
Documentation: Keeping documentation clear and up-to-date to ensure everyone has the necessary information before incidents occur.
Collaboration with Development Teams: Working closely with developers to address reliability and operational concerns during the software development lifecycle.
Continuous Learning and Training: Keeping the SRE team updated with the latest knowledge and best practices.
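
The error budget item in the list above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a request-based SLO; the function name error_budget, its parameters, the example numbers, and the "release only while budget remains" policy are all hypothetical and chosen for illustration, not taken from any particular tool.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize the error budget implied by an SLO over one measurement window.

    slo_target is the fraction of requests that must succeed (e.g. 0.999).
    All names here are illustrative assumptions.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed_fraction = failed_requests / allowed_failures

    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "consumed_fraction": consumed_fraction,
        # Assumed policy: keep shipping features only while budget remains.
        "can_release": consumed_fraction < 1.0,
    }


# Example: a 99.9% SLO over one million requests tolerates about 1,000 failures.
budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(budget)
```

With 250 failures against a budget of roughly 1,000, about a quarter of the budget is spent, so under the assumed policy the team still has room to push new features.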

Reactive Measures:

Monitoring and Alerting: Detecting anomalies and ensuring alerts are actionable.
Incident Management: Having clear incident response protocols.
Postmortems: Conducting blameless postmortems to understand root causes and prevent recurrence.
On-call Rotations: Ensuring there's always someone available to address critical issues when they arise.
Feedback Loops: Incorporating lessons from incidents back into the development and operational processes.
Tooling: Utilizing tools for monitoring, logging, tracing, and visualization to gain insights into system behavior and performance after events have occurred.
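
One common way to keep alerts actionable, as the monitoring and alerting item above asks, is to page on how fast the error budget is being burned rather than on every individual error. The sketch below assumes a request-based SLO; the function names burn_rate and should_page and the default threshold of 10.0 are hypothetical values picked for illustration, and a real threshold would depend on the SLO window and the team's tolerance.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Return how many times faster than 'allowed' the error budget is burning.

    A burn rate of 1.0 means errors arrive exactly at the rate the SLO permits.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        return 0.0  # no traffic in the window, so nothing is being burned
    observed_error_rate = failed / total
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def should_page(failed: int, total: int, slo_target: float, threshold: float = 10.0) -> bool:
    """Page a human only when the burn rate crosses the (assumed) threshold."""
    return burn_rate(failed, total, slo_target) >= threshold


# Example window of 10,000 requests against a 99.9% SLO.
print(should_page(failed=200, total=10_000, slo_target=0.999))
print(should_page(failed=5, total=10_000, slo_target=0.999))
```

In the first window, 2% of requests fail against an allowance of 0.1%, a burn rate of about 20, so the on-call engineer is paged. In the second, the burn rate is about 0.5 and no page is sent.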

The ACT (Assess, Commit, Try) Framework

In SRE, the ACT (Assess, Commit, Try) loop guides teams through the work of maintaining and improving system reliability. This iterative approach, reminiscent of feedback loops in other engineering and management disciplines, emphasizes assessment, decision-making, action, and review. This essay walks through each step of the ACT loop, explaining why it matters and how the cyclical nature of the framework fosters continuous improvement.

Assess: The Foundation of Informed Action

The initial step, 'Assess', is the bedrock upon which subsequent actions are based. In this phase, SREs evaluate the current state of the system, gathering data from monitoring tools, logs, and user feedback. This assessment provides a holistic view of system health, performance bottlenecks, and potential vulnerabilities. Without a thorough assessment, teams risk making decisions based on incomplete or outdated information, leading to ineffective or even detrimental actions.

Commit: The Power of Decisive Action

Once the situation is understood, the next step is to 'Commit' to a course of action. Decision-making in SRE is often a balance between maintaining current reliability and pushing for improvements or new features. Committing means making informed choices about which issues to address, which strategies to employ, and which resources to allocate. This phase is crucial because indecision or prolonged deliberation can lead to missed opportunities, deteriorating system health, or escalating incidents.

Try: The Courage to Implement

With a decision in hand, the next step is to 'Try', that is, to implement the chosen action. This could be anything from deploying a new patch, to scaling infrastructure, to rolling back a recent change. The 'Try' phase is where theories are tested and plans are put into motion. It is also a phase that requires courage, as there is always inherent risk in making changes to a live system. However, with a thorough assessment and a clear commitment behind them, these actions are calculated risks, taken with the expectation of positive outcomes.

Review and Repeat: The Essence of Iteration

After the action is taken, it's imperative to review the results. Did the change have the desired effect? Were there any unforeseen consequences? This review phase, while not explicitly named in the ACT acronym, is implied in the cyclical nature of the loop. The insights garnered from this review feed directly back into the 'Assess' phase, ensuring that the loop is a continuous cycle of learning and improvement.

The beauty of the ACT loop lies in its iterative nature. Systems, user behaviors, and environments are in a state of constant flux. What worked yesterday might not be effective today. By continuously cycling through assessment, commitment, and action, SREs ensure that they remain adaptive, proactive, and effective in their roles.
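
The cycle described above can be sketched as a small loop. This is a toy illustration under stated assumptions, not a definitive implementation: the function act_loop, the in-memory system dictionary, the action names "rollback" and "scale_up", and the error-rate thresholds are all hypothetical. The review step is not a separate function here; it happens when the next pass calls assess again and observes the effect of the previous action.

```python
from typing import Callable, List, Optional, Tuple


def act_loop(
    assess: Callable[[], float],
    commit: Callable[[float], Optional[str]],
    try_action: Callable[[str], None],
    max_iterations: int = 10,
) -> List[Tuple[float, str]]:
    """Run Assess -> Commit -> Try until commit decides nothing more is needed."""
    history: List[Tuple[float, str]] = []
    for _ in range(max_iterations):
        observed = assess()        # Assess: gather the current state
        action = commit(observed)  # Commit: choose one course of action
        if action is None:         # Healthy enough: stop iterating
            break
        try_action(action)         # Try: apply the change
        history.append((observed, action))
        # Review is implicit: the next assess() sees the result of this action.
    return history


# Toy system whose only signal is an error rate (hypothetical numbers).
system = {"error_rate": 0.08}


def assess() -> float:
    return system["error_rate"]


def commit(error_rate: float) -> Optional[str]:
    if error_rate > 0.05:
        return "rollback"
    if error_rate > 0.01:
        return "scale_up"
    return None


def try_action(action: str) -> None:
    # Pretend each action brings the error rate down to a fixed level.
    effects = {"rollback": 0.02, "scale_up": 0.005}
    system["error_rate"] = effects[action]


history = act_loop(assess, commit, try_action)
print(history)
```

In this toy run the loop first commits to a rollback, reassesses, then scales up, and stops once the error rate falls below the lowest threshold. The max_iterations cap is a simple guard against a loop that never converges.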

The goal in SRE is to balance proactive and reactive measures. Proactive measures prevent many issues, while reactive measures ensure that when issues do occur, they are handled efficiently and the lessons learned strengthen proactive strategies. The ACT loop framework, in turn, ensures that SREs get better every time.

The Art of Reliability War, v1, 2022