SRE Practice Guide

SLI / SLO / SLA Definitions

Term | Definition | Example
SLI (Indicator) | A measurable metric that indicates service health | Request success rate, latency p99, error rate
SLO (Objective) | Target value for the SLI over a time window | 99.9% availability over 30 days
SLA (Agreement) | Contractual commitment with consequences for missing the SLO | 99.9% uptime; 10% service credit if below
Error Budget | 1 - SLO = allowable downtime/errors | 99.9% SLO = 43.2 min over 30 days (43.8 min per average month)

Common SLIs

Service Type | Key SLIs
Request/Response (API) | Availability (2xx/total), latency p99, error rate
Data Pipeline | Freshness (time since last successful run), correctness
Storage | Durability (data loss rate), read/write availability, latency
Batch Processing | Throughput, completion rate, success rate
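
As a rough illustration of the Request/Response row, the sketch below computes availability (2xx/total) and latency p99 from a list of request records. The `Request` type, the function names, and the nearest-rank percentile method are assumptions made for this example; a real monitoring system would normally derive these values from its own metrics store.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code of the response
    latency_ms: float  # observed latency in milliseconds


def availability(requests):
    """Fraction of requests answered with a 2xx status (2xx / total)."""
    if not requests:
        raise ValueError("no requests in the window")
    good = sum(1 for r in requests if 200 <= r.status < 300)
    return good / len(requests)


def latency_percentile(requests, pct=99.0):
    """Nearest-rank percentile of latency (one of several common definitions)."""
    if not requests:
        raise ValueError("no requests in the window")
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 99 successful requests and one server error.
sample = [Request(200, float(i)) for i in range(1, 100)] + [Request(500, 100.0)]
print(availability(sample))        # 0.99
print(latency_percentile(sample))  # 99.0
```

Comparing the returned availability against the SLO target (for example 0.999) shows whether the window met the objective.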

Error Budget Calculation

# SLO: 99.9% availability over 30 days
Error Budget = (1 - 0.999) × 30 × 24 × 60 = 43.2 minutes

# Current burn rate (1 means the budget runs out exactly at the end of the SLO window)
Burn Rate = Error Rate / (1 - SLO)

# Share of the budget consumed during an alert window
Budget Consumed = Burn Rate × (window / SLO window)

# Alert: fast burn (last 1h burning 2% of the 30-day budget)
Fast Burn Alert: burn_rate > 14.4 for 1h → page on-call

# Alert: slow burn (6h window burning 5% of the 30-day budget)
Slow Burn Alert: burn_rate > 6 for 6h → create ticket
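
The calculation above can be sketched in Python as follows. The function names are hypothetical, and the thresholds (14.4 over 1h, 6 over 6h) are taken directly from the alert rules in this section. Burn rate is treated here as the observed error rate divided by the allowed error rate, and the share of budget consumed as burn rate times the fraction of the SLO window covered by the alert window.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an SLO over the given window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(error_rate, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1 - slo)


def budget_consumed(rate, alert_window_hours, slo_window_days=30):
    """Fraction of the total error budget used at this burn rate in the window."""
    return rate * alert_window_hours / (slo_window_days * 24)


def alert_action(rate_1h, rate_6h):
    """Map burn rates over the 1h and 6h windows to the actions listed above."""
    if rate_1h > 14.4:
        return "page"
    if rate_6h > 6:
        return "ticket"
    return "ok"


print(round(error_budget_minutes(0.999), 1))  # 43.2
print(round(budget_consumed(14.4, 1), 3))     # 0.02
print(round(budget_consumed(6, 6), 3))        # 0.05
```

Checking the fast-burn rule first means a severe incident pages on-call even when the slower window would only have opened a ticket.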

Availability Numbers

Availability | Downtime/Year | Downtime/Month | Downtime/Week
99% (two nines) | 3.65 days | 7.31 hours | 1.68 hours
99.9% (three nines) | 8.77 hours | 43.8 min | 10.1 min
99.95% | 4.38 hours | 21.9 min | 5.04 min
99.99% (four nines) | 52.6 min | 4.38 min | 1.01 min
99.999% (five nines) | 5.26 min | 26.3 sec | 6.05 sec
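
The figures in this table can be reproduced with a short calculation. The sketch below assumes a 365.25-day year and a month of one twelfth of a year, which appears to match the rounded values shown (a 30-day month gives 43.2 min at 99.9% instead of 43.8 min). The constant and function names are chosen for this example only.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600    # assumption: average year length
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12  # assumption: average month
SECONDS_PER_WEEK = 7 * 24 * 3600


def downtime_seconds(availability_pct, period_seconds):
    """Allowed downtime in seconds for an availability percentage over a period."""
    return (1 - availability_pct / 100) * period_seconds


# Three nines, matching the 99.9% row of the table.
print(round(downtime_seconds(99.9, SECONDS_PER_YEAR) / 3600, 2))  # 8.77 (hours)
print(round(downtime_seconds(99.9, SECONDS_PER_MONTH) / 60, 1))   # 43.8 (minutes)
print(round(downtime_seconds(99.9, SECONDS_PER_WEEK) / 60, 1))    # 10.1 (minutes)
```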