Blue Work / Red Work

SRE Through the Lens of 'Red Work' and 'Blue Work'

L. David Marquet's "Leadership Is Language" offers a fresh perspective on leadership and collaboration in the modern workplace. Central to his thesis is the distinction between 'Red Work' and 'Blue Work'. Red Work is the doing: execution-focused tasks that are procedural and have a clear start and finish. Blue Work, on the other hand, is the thinking: decision-making, planning, and strategizing. Applied to Site Reliability Engineering (SRE), a discipline that bridges software development and operations, this dichotomy clarifies which of an SRE team's responsibilities are execution and which are deliberation.

1. Red Work in SRE: Execution at its Best

Red Work represents the tasks that are about execution, production, and getting things done. In the world of SRE, this translates to:

Incident Response: When systems go down or performance issues arise, SREs jump into action. This is Red Work in its purest form: following established protocols, using tools, and applying known solutions to restore service.

Routine Maintenance: Regular updates, patches, and backups fall under this category. These tasks are procedural and often scripted.

Monitoring: Watching system health metrics, confirming that everything is running smoothly, and responding to alerts are all quintessential Red Work.

Deployment: Rolling out new features, updates, or fixes, especially when using automated CI/CD pipelines, is about execution.
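To make "procedural and often scripted" concrete, here is a minimal sketch of one routine maintenance task: deciding which backups fall outside a retention window. The function name `backups_to_delete`, the `keep_days` parameter, and the seven-day default are all hypothetical choices made for illustration; they do not come from Marquet's book or from any particular backup tool.

```python
from datetime import date, timedelta


def backups_to_delete(backup_dates, today, keep_days=7):
    """Return the backup dates older than the retention window, oldest first.

    keep_days is an assumed retention policy, not a recommendation.
    """
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in backup_dates if d < cutoff)


if __name__ == "__main__":
    today = date(2024, 1, 31)
    # Pretend one backup was taken on each of the last ten days.
    backups = [today - timedelta(days=n) for n in range(10)]
    for stale in backups_to_delete(backups, today):
        print("would delete backup from", stale.isoformat())
```

The script only reports what it would delete. The retention policy itself (how many days to keep, and why) is a decision made beforehand, which is where the Blue Work sits.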

2. Blue Work in SRE: The Thinking Behind the Doing

While Red Work is crucial, it's the Blue Work that often defines the success of SRE teams. This involves:

Strategizing Reliability: Before jumping into solutions, SREs need to define what reliability means for a given system. This involves setting Service Level Objectives (SLOs) and determining the right Service Level Indicators (SLIs).

Planning for Capacity: Forecasting future demand, understanding traffic patterns, and ensuring that infrastructure can handle growth all require deliberate analysis rather than rote execution.

Architectural Decisions: Deciding on the right infrastructure, tools, and technologies is Blue Work. It's about evaluating options, considering trade-offs, and making informed decisions.

Postmortem Analysis: After incidents, SREs engage in blameless postmortems. The goal isn't to find fault but to understand root causes, learn from mistakes, and plan future improvements.

Security and Risk Assessment: Evaluating potential threats, vulnerabilities, and risks requires a deep understanding of the system, foresight, and strategic planning.
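The SLO and SLI item above becomes easier to reason about with numbers attached. The sketch below assumes a simple request-based availability SLI (good requests divided by total requests) and computes how much of the resulting error budget remains. The function names and the sample figures are illustrative assumptions, not part of the original discussion.

```python
def availability_sli(good_requests, total_requests):
    """Request-based availability SLI: the fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def error_budget_remaining(slo_target, good_requests, total_requests):
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A 100% target (or zero traffic) leaves no room for any failure.
        return 0.0 if actual_failures else 1.0
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 600 failures, 99.9% SLO.
    total, good, slo = 1_000_000, 999_400, 0.999
    print(f"SLI: {availability_sli(good, total):.2%}")
    print(f"Budget remaining: {error_budget_remaining(slo, good, total):.1%}")
```

Choosing the target (99.9% here) and deciding what counts as a "good" request are the Blue Work. Once those are agreed, computing the numbers is mechanical.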

3. Balancing Red and Blue Work in SRE

For SRE teams to be effective, a balance between Red and Blue Work is essential. While the execution (Red Work) ensures that systems remain operational and reliable, it's the thinking (Blue Work) that drives continuous improvement, innovation, and adaptability.

Automation as a Bridge: One of the ways SRE teams manage the balance is through automation. By automating routine Red Work tasks, SREs free up time and mental bandwidth for Blue Work. For instance, automated deployment pipelines reduce manual deployment tasks, allowing SREs to focus on strategizing for reliability.

Collaborative Decision-Making: Marquet emphasizes the importance of collaborative environments where teams can question, discuss, and decide together. SREs often work in cross-functional teams, collaborating with developers, security experts, and business stakeholders. This collaborative Blue Work ensures that decisions are holistic and well-informed.
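One way to see whether the balance described in this section is actually shifting is to tally how a team's hours split between the two kinds of work. The following is a minimal sketch that assumes time is already logged as (kind, hours) pairs. The `work_share` name, the 'red' and 'blue' labels as data values, and the sample hours are all hypothetical.

```python
from collections import defaultdict


def work_share(entries):
    """Return the fraction of logged hours spent on 'red' and on 'blue' work.

    entries is an iterable of (kind, hours) pairs, where kind is 'red' or 'blue'.
    """
    totals = defaultdict(float)
    for kind, hours in entries:
        if kind not in ("red", "blue"):
            raise ValueError(f"unknown work kind: {kind!r}")
        totals[kind] += hours
    total = sum(totals.values())
    if total == 0:
        return {"red": 0.0, "blue": 0.0}
    return {kind: totals[kind] / total for kind in ("red", "blue")}


if __name__ == "__main__":
    # Invented example: a week before and a week after automating deployments.
    before = [("red", 30), ("blue", 10)]
    after = [("red", 18), ("blue", 22)]
    print("before:", work_share(before))
    print("after: ", work_share(after))
```

The numbers only describe where time went. Deciding what a healthy split looks like for a given team remains a judgment call.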

L. David Marquet's distinction between Red Work and Blue Work offers a valuable lens for understanding the multifaceted roles and responsibilities of Site Reliability Engineers. Execution-driven Red Work keeps the digital wheels turning, while strategic Blue Work ensures that SRE practices evolve, adapt, and continue to deliver value in an ever-changing tech landscape. For organizations and SRE leaders, the key to long-term reliability and resilience is recognizing the importance of both and giving teams the time, tools, and culture to engage in both.