Glossary
List of Terms and Abbreviations Used in Site Reliability Engineering (SRE), Sorted Alphabetically
ACL: Access Control List - A list of permissions attached to an object.
Alerting: The practice of notifying the responsible parties when metrics fall outside of acceptable levels.
Automation: The use of technology to perform tasks without human intervention.
Blameless Culture: An approach to postmortems where the focus is on learning from incidents rather than assigning blame.
Blue-Green Deployment: A release management strategy that involves two production environments, as identical as possible, to mitigate risk during deployments.
Caching: Storing copies of frequently accessed data in high-speed memory.
Canary Release: A technique to reduce the risk of introducing a new software version by gradually rolling out the change to a small subset of users before making it available to everyone.
Capacity Planning: The process of determining the production capacity needed to meet changing demands for a service.
Chaos Engineering: The discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions.
CI/CD: Continuous Integration and Continuous Delivery (or Continuous Deployment) - A set of practices that involve automatically building, testing, and deploying code changes.
Containerization: A lightweight form of virtualization that allows for the packaging and distribution of applications and their dependencies.
Dashboard: A graphical representation of key metrics and performance indicators.
Disaster Recovery (DR): Strategies and procedures to recover systems and data after a disaster.
Distributed Systems: Systems in which components located on networked computers communicate and coordinate their actions by passing messages.
Downtime: The time during which a system is not operational or available.
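Error Budget: The amount of unreliability a service is allowed over a given period, equal to 100% minus the SLO target. For example, a 99.9% availability SLO leaves an error budget of 0.1%.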
Failover: The automatic switching to a redundant or standby system upon the failure or abnormal termination of the previously active system.
High Availability (HA): Systems designed to be available and operational for a very high percentage of the time.
Hotfix: An urgent, narrowly scoped fix applied to an operational system outside the normal release cycle.
Incident: An unplanned disruption or degradation of service.
Infrastructure as Code (IaC): The management of infrastructure through code rather than manual processes.
Latency: The time that elapses between issuing a request and receiving its response.
Load Balancer (LB): A device or service that distributes incoming network or application traffic across multiple servers.
Microservices: An architectural style that structures an application as a collection of loosely coupled, independently deployable services.
Monitoring: The practice of collecting, analyzing, and using data to track a system's performance and make improvements.
MTBF: Mean Time Between Failures - The average time between system breakdowns.
MTTR: Mean Time To Repair (also expanded as Mean Time To Recovery or Mean Time To Restore) - The average time taken to fix a failed component or system.
Observability: The ability to understand the internal state of a system from its external outputs.
Orchestration: The automated arrangement, coordination, and management of complex computer systems, middleware, and services.
Postmortem: A written record of an incident, its impact, the actions taken to mitigate or resolve it, and the lessons learned.
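QoS: Quality of Service - The measured overall performance of a service, or the mechanisms used to prioritize traffic so that a given level of performance is maintained.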
Rate Limiting: Controlling the rate at which a client can make requests to a service.
Redundancy: The duplication of critical components to ensure continued service during component failures.
Replication: Copying data to multiple locations and keeping the copies in sync to improve reliability and availability.
Rollback: The process of reverting to a previous version of code or configuration to resolve an issue.
RPO: Recovery Point Objective - The maximum acceptable amount of data loss after a disaster, expressed as the span of time before the incident for which data may be unrecoverable.
RTO: Recovery Time Objective - The targeted duration of time within which a business process must be restored after a disaster.
Scalability: The ability of a system to handle increased load by adding resources, without an unacceptable loss of performance.
Sharding: A database architecture technique where data is partitioned across multiple servers or clusters.
SLA: Service Level Agreement - A formal commitment between a provider and a customer regarding the level of service the customer can expect.
SLI: Service Level Indicator - A specific metric used to measure the level of service provided.
SLO: Service Level Objective - A target value or range for a service's reliability, measured by one or more SLIs.
SRE: Site Reliability Engineering - A discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.
Throttling: Intentionally limiting the speed or frequency of requests.
Throughput: The number of requests a system handles per unit of time.
Toil: Repetitive operational work that provides no enduring value and scales linearly with service growth.
Uptime: The time during which a service is available and operational.
Version Control: The practice of tracking and managing changes to code.
VPC: Virtual Private Cloud - An on-demand, configurable pool of shared computing resources allocated within a public cloud environment and logically isolated from other tenants.
VPN: Virtual Private Network - An encrypted connection over a public network that enables secure remote access to a private network.
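
Worked example: Several entries in this glossary (SLO, Error Budget, MTBF, MTTR, Uptime) are related by simple arithmetic. The sketch below illustrates those relationships under stated assumptions: the SLO is a time-based availability target, the measurement window defaults to 30 days, and availability is estimated with the common steady-state formula MTBF / (MTBF + MTTR). The function names are hypothetical and do not come from any particular library.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    window_minutes = window_days * 24 * 60
    # The error budget is whatever the SLO does not promise: 1 - target.
    return (1.0 - slo_target) * window_minutes


def budget_remaining(budget_minutes: float, downtime_minutes: float) -> float:
    """Error budget left after subtracting downtime already incurred.

    A negative result means the budget has been overspent.
    """
    return budget_minutes - downtime_minutes


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability estimated from MTBF and MTTR."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)  # 99.9% SLO over 30 days
    print(f"Error budget: {budget:.1f} minutes")
    print(f"Remaining after a 10-minute outage: {budget_remaining(budget, 10):.1f} minutes")
    print(f"Availability with MTBF=999h, MTTR=1h: {availability(999, 1):.3%}")
```

With a 99.9% target, the 30-day window of 43,200 minutes leaves roughly 43.2 minutes of allowable downtime, which is why even a short outage can consume a large share of the budget.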