The Art of Reliability War

08 Variation in TACTICS

The ACT (Assess, Commit, Try) Framework

In the chapter, “Variation in Tactics”, Sun Tzu talks about the importance of flexibility in tactics, knowing when to engage, adapting to the situation, avoiding predictability and considering all factors. Similar situational awareness is vital in the field of SRE to be successful. SREs can use a bunch of proactive and reactive approaches to achieve that.


Proactive measures are those that are taken in anticipation of potential issues, aiming to prevent them. Reactive measures, on the other hand, are responses to issues after they've occurred, aiming to mitigate and learn from them.


Proactive Measures:



Reactive Measures:



The ACT (Assess, Commit, Try) Framework


The ACT (Assess, Commit, Try) loop framework emerges as a pivotal methodology, guiding teams through the complexities of maintaining and enhancing system reliability. This iterative approach, reminiscent of the feedback loops in various engineering and management disciplines, underscores the importance of assessment, decision-making, action, and review. This essay delves into the intricacies of the ACT loop, elucidating why each step is crucial and how the cyclical nature of the framework fosters continuous improvement.


The initial step, 'Assess', is the bedrock upon which subsequent actions are based. In this phase, SREs evaluate the current state of the system, gathering data from monitoring tools, logs, and user feedback. This assessment provides a holistic view of system health, performance bottlenecks, and potential vulnerabilities. Without a thorough assessment, teams risk making decisions based on incomplete or outdated information, leading to ineffective or even detrimental actions.

Once the situation is understood, the next step is to 'Commit' to a course of action. Decision-making in SRE is often a balance between maintaining current reliability and pushing for improvements or new features. Committing means making informed choices about which issues to address, which strategies to employ, and which resources to allocate. This phase is crucial because indecision or prolonged deliberation can lead to missed opportunities, deteriorating system health, or escalating incidents.

With a decision in hand, the next step is to 'Try' or implement the chosen action. This could range from deploying a new patch, scaling infrastructure, or even rolling back a recent change. The 'Try' phase is where theories are tested, and plans are put into motion. It's also a phase that requires courage, as there's always inherent risk in making changes to a live system. However, with the thorough assessment and clear commitment, these actions are calculated risks, taken with the expectation of positive outcomes.

After the action is taken, it's imperative to review the results. Did the change have the desired effect? Were there any unforeseen consequences? This review phase, while not explicitly named in the ACT acronym, is implied in the cyclical nature of the loop. The insights garnered from this review feed directly back into the 'Assess' phase, ensuring that the loop is a continuous cycle of learning and improvement.


The beauty of the ACT loop lies in its iterative nature. Systems, user behaviors, and environments are in a state of constant flux. What worked yesterday might not be effective today. By continuously cycling through assessment, commitment, and action, SREs ensure that they remain adaptive, proactive, and effective in their roles.


The goal in SRE is to have a balance between proactive and reactive measures. While proactive measures help in preventing many issues, reactive measures ensure that when issues do occur, they're handled efficiently and lessons are learned to further enhance proactive strategies. And the ACT loop framework ensures SREs get better everytime



The Art of Reliability War, v1, 2022