Error budget
The amount of unreliability allowed by an SLO. It is the gap between 100% and the SLO target over a time window.
Definition
An error budget is the amount of unreliability your SLO allows.
Example: a 99.9% SLO leaves a 0.1% budget, so the service can be down or failing for 0.1% of the window.
Concrete numbers (so it is not abstract)
If the window is 30 days:
- Total time is 30 × 24 × 60 = 43,200 minutes.
- 0.1% of that is 43.2 minutes.
So over a 30-day window, a 99.9% SLO lets you “spend” about 43 minutes on user-visible failure.
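The exact math depends on how you define the SLI (errors, latency, or both). With a request-based SLI, for example, the budget is counted in failed requests rather than minutes of downtime. The intuition holds either way.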
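The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the function names are hypothetical, the SLO target is assumed to be given as a fraction (0.999 for 99.9%), and the request-based variant assumes an SLI that counts failed requests.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Fraction of the window that may fail, e.g. 0.999 -> about 0.001."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return 1.0 - slo_target


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Time-based budget: minutes of allowed failure in the window."""
    total_minutes = window_days * 24 * 60  # 43,200 for a 30-day window
    return error_budget_fraction(slo_target) * total_minutes


def error_budget_requests(slo_target: float, total_requests: int) -> float:
    """Request-based budget (assumed SLI): requests allowed to fail."""
    return error_budget_fraction(slo_target) * total_requests


if __name__ == "__main__":
    print(f"{error_budget_minutes(0.999):.1f} minutes")  # 43.2 minutes
    print(f"{error_budget_requests(0.999, 1_000_000):.0f} requests")
```

The results are floating-point values, so format or round them for display rather than comparing them for exact equality.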
Why it matters
Error budgets turn reliability into a tradeoff you can reason about.
If you are within budget, you can ship faster.
If you are out of budget, you should slow down and fix reliability.
How teams use it in real life
- If the budget is healthy, teams take more release risk (ship features).
- If the budget is burned, teams reduce risk (stabilize, fix incidents, pay down reliability debt).
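The healthy-versus-burned decision can be sketched as a small check. This is one possible shape under stated assumptions, not a prescribed policy: the function names, the event-count inputs, and the cutoff at zero remaining budget are all hypothetical, and real teams often add intermediate thresholds.

```python
def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("a 100% SLO or an empty window leaves no budget to measure")
    return 1.0 - bad_events / allowed_bad


def release_posture(remaining: float) -> str:
    """Map remaining budget to a posture; the cutoff at zero is an assumption."""
    return "ship" if remaining > 0 else "stabilize"


if __name__ == "__main__":
    # 250 failed requests out of 1,000,000 against a 99.9% SLO
    remaining = budget_remaining(0.999, 1_000_000, 250)
    print(f"{remaining:.0%} of budget left -> {release_posture(remaining)}")
```

Returning a negative value when the budget is overspent, instead of clamping at zero, keeps the size of the overrun visible to whoever reads the number.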