Error Budget
An error budget is the amount of downtime you're allowed to have while still meeting your SLO. It's calculated from your SLO (uptime goal) and gives you a clear number: if your budget is 43 minutes of downtime per month, you have 43 minutes before you miss the objective.
Definition
An error budget is the maximum amount of downtime (or errors) a service can incur during a measured period while still meeting its SLO. It's calculated as the complement of the SLO: error budget = 100% - SLO%.
For example, a 99.9% uptime SLO implies a 0.1% error budget. Over a 30-day month, this translates to approximately 43 minutes of allowed downtime. Once your service experiences 43 minutes of downtime, you've exhausted your budget and missed the SLO. If your SLA promises the same 99.9%, you are also in SLA violation.
Calculating Your Error Budget
Error budget is straightforward to calculate. Here's the math:
The Basic Formula
Error Budget % = 100% - SLO%
Example:
SLO = 99.9%
Error Budget = 100% - 99.9% = 0.1%
This gives you a percentage. To convert to actual time, multiply by the period length.
Converting to Time Periods
Error Budget (in time) = Error Budget % × Period Length
For 99.9% SLO (0.1% error budget):
Per month (30 days): 0.1% × 30 days = 0.72 hours = 43.2 minutes
Per year (365 days): 0.1% × 365 days = 8.76 hours
Per week (7 days): 0.1% × 7 days = 10.08 minutes
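The conversion above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `error_budget` is made up for this example.

```python
from datetime import timedelta


def error_budget(slo_percent: float, period: timedelta) -> timedelta:
    """Return the allowed downtime for a given SLO over a period.

    `error_budget` is a hypothetical helper name used for illustration.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    # Error budget % = 100% - SLO%, expressed here as a fraction.
    budget_fraction = (100 - slo_percent) / 100
    # Error budget (in time) = error budget % x period length.
    return period * budget_fraction


if __name__ == "__main__":
    for days in (7, 30, 365):
        budget = error_budget(99.9, timedelta(days=days))
        print(f"{days:>3} days: {budget.total_seconds() / 60:.2f} minutes")
```

For a 99.9% SLO this reproduces the figures listed above: roughly 10.08 minutes per week, 43.2 minutes per 30-day month, and 8.76 hours (525.6 minutes) per year.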
Key insight: Compliance is binary within a period. You don't get "partial credit" for uptime. If your SLO is 99.9% per month and you achieve 99.89%, you've missed it for the entire month, even though you overshot the budget by only about 4 minutes. This is why error budgets change team behavior: once most of the budget is spent, even a few minutes of unplanned downtime can push you over the line.
Common SLOs and Their Error Budgets
Here's a reference table showing error budgets for common SLOs:
| SLO | Error Budget | Per Month (30d) | Per Year | Typical For |
|---|---|---|---|---|
| 99% | 1% | 7.2 hours | 3.65 days | Internal tools, staging |
| 99.5% | 0.5% | 3.6 hours | 1.83 days | Standard services |
| 99.9% | 0.1% | 43.2 minutes | 8.76 hours | Production services |
| 99.95% | 0.05% | 21.6 minutes | 4.38 hours | Critical services |
| 99.99% | 0.01% | 4.32 minutes | 52.6 minutes | Mission-critical systems |
| 99.999% | 0.001% | 25.9 seconds | 5.26 minutes | Extreme reliability (rare) |
Observation: Notice how error budgets shrink with each additional "9". A 99.99% SLO allows 10x less downtime than 99.9%. Going from 99.9% to 99.99% typically requires substantially more investment in redundancy, failover systems, and operational excellence.
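Measured by allowed downtime, the ratio between two SLOs follows directly from the basic formula. As a worked example for 99.9% versus 99.99%:

```latex
\frac{100\% - 99.9\%}{100\% - 99.99\%} = \frac{0.1\%}{0.01\%} = 10
```

Each additional nine divides the error budget by ten, which matches the table: 43.2 minutes per month becomes 4.32 minutes.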
Error Budget, SLO, and SLA: Understanding the Relationship
These three concepts are related but distinct. Understanding their relationship is crucial:
SLO (Service Level Objective)
Your internal target for uptime. "We want to achieve 99.9% uptime." This is what you build and operate systems to achieve. It is often set stricter than the SLA so that you have a safety margin before any contractual penalty applies.
SLA (Service Level Agreement)
Your contractual obligation to customers. "We guarantee 99.9% uptime, or you get service credits." SLAs often have penalties for violations.
Error Budget
The downtime allowance derived from your SLO. "99.9% SLO = 43.2 minutes/month allowed." Your team uses this to decide when to take risks.
The relationship: Your SLO determines your error budget, which guides engineering decisions. If your SLA promises the same level as your SLO, exceeding the error budget means violating the SLA and possibly owing customers compensation. If the SLO is stricter than the SLA, the gap between them is your safety margin. Error budgets translate business commitments (SLA) into engineering guidelines (SLO and actual decisions).
Using Error Budget for Decisions
Error budgets are most valuable when they drive decision-making. Here's how teams use them in practice:
Scenario: Budget Available (Plenty Remaining)
Situation: It's the 5th of the month. You have 40 minutes of error budget remaining (plenty for the rest of the month).
Decision: You approve a deployment of a new feature that you estimate has a 0.2% chance of causing a 5-minute outage. This is a calculated risk you can afford.
Reasoning: You have budget to spend. Taking measured risks to deliver features is the right call.
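One way to put numbers on this scenario is to compare both the expected cost and the worst-case cost of the change against the remaining budget. The sketch below is only an illustration; the function names are hypothetical and the risk estimates are whatever your team supplies.

```python
def expected_budget_cost(outage_probability: float, outage_minutes: float) -> float:
    """Average downtime (in minutes) a risky change is expected to cost."""
    if not 0 <= outage_probability <= 1:
        raise ValueError("outage_probability must be between 0 and 1")
    return outage_probability * outage_minutes


def worst_case_fits(outage_minutes: float, remaining_minutes: float) -> bool:
    """True if the budget survives even when the outage actually happens."""
    return outage_minutes <= remaining_minutes


if __name__ == "__main__":
    # Scenario from the text: 0.2% chance of a 5-minute outage, 40 minutes left.
    cost = expected_budget_cost(0.002, 5)
    print(f"Expected cost: {cost:.2f} minutes")
    print(f"Worst case fits: {worst_case_fits(5, 40)}")
```

With the scenario's numbers the expected cost is 0.01 minutes, and even the worst case (5 minutes) leaves 35 minutes of budget, so the deployment is easy to justify.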
Scenario: Budget Tight (Little Remaining)
Situation: It's the 20th of the month. You've had an outage on the 18th (20 minutes down). You have 23 minutes remaining.
Decision: You freeze deployments except for critical bug fixes. The team shifts focus to reliability improvements.
Reasoning: You're close to SLA violation. Risk is unacceptable. Investing in stability is more important than new features for the next 10 days.
Scenario: Budget Exhausted
Situation: It's the 25th. You've already used 50 minutes (7 minutes more than budget). You're in SLA violation.
Decision: All hands on deck. Post-mortem and prevention become top priority. No feature development until root causes are understood and fixed.
Reasoning: The damage is done. You're in violation and may owe customers refunds. Preventing recurrence is critical.
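The three scenarios can be condensed into a simple policy function. This is a sketch under stated assumptions: the `freeze_below` threshold of 60% remaining is a hypothetical value chosen so that the scenarios above land where the text puts them, and every team will want to pick its own.

```python
from enum import Enum


class Action(Enum):
    SHIP = "ship features and take measured risks"
    FREEZE = "freeze risky deployments and focus on reliability"
    RECOVER = "post-mortem and prevention before any feature work"


def recommend_action(
    remaining_minutes: float,
    budget_minutes: float,
    freeze_below: float = 0.6,  # hypothetical threshold, not a standard value
) -> Action:
    """Map the remaining error budget to one of the three scenarios."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    if remaining_minutes <= 0:
        return Action.RECOVER
    if remaining_minutes / budget_minutes < freeze_below:
        return Action.FREEZE
    return Action.SHIP


if __name__ == "__main__":
    for remaining in (40, 23, -7):
        print(remaining, "->", recommend_action(remaining, 43).value)
```

A fixed threshold ignores how much of the period is left; a policy that compares budget used against time elapsed is a reasonable alternative.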
Burn Rate Alerts: Proactive Error Budget Management
Waiting until you've exhausted your error budget is too late. Many teams implement burn rate alerts to warn when error budget is being consumed too quickly:
What is Burn Rate?
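Burn rate is the rate at which you're consuming your error budget, relative to the rate that would use it up exactly at the end of the period. If your monthly error budget is 43.2 minutes, the sustainable rate is 43.2 / 30 = 1.44 minutes/day. If you lose 10 minutes on day 1, you are burning budget about 7x faster than that.
Burn Rate = (Downtime in Window / Window Length) / (Total Budget / Period Length)
Example:
Lost 10 minutes on day 1 of month
Remaining budget: 33.2 minutes
Burn rate: (10 min / 1 day) / (43.2 min / 30 days) ≈ 6.9x expected rate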
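As a rough sketch, burn rate can be computed as a ratio: actual consumption per day divided by the consumption per day that would use up the budget exactly at the end of the period. This ratio definition is a common SRE convention and is assumed here; the function name is illustrative.

```python
def burn_rate(
    consumed_minutes: float,
    window_days: float,
    budget_minutes: float,
    period_days: float,
) -> float:
    """Return how many times faster than sustainable the budget is burning.

    1.0 means the budget runs out exactly at the end of the period;
    values above 1.0 mean it runs out early.
    """
    if window_days <= 0 or period_days <= 0 or budget_minutes <= 0:
        raise ValueError("window, period, and budget must be positive")
    sustainable_per_day = budget_minutes / period_days
    actual_per_day = consumed_minutes / window_days
    return actual_per_day / sustainable_per_day


if __name__ == "__main__":
    # 10 minutes lost on day 1 of a 30-day month with a 43.2-minute budget.
    print(f"{burn_rate(10, 1, 43.2, 30):.1f}x")
```

For 10 minutes lost in one day against a 43.2-minute monthly budget, the sustainable rate is 1.44 minutes/day and the burn rate comes out at roughly 6.9x.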
Common Burn Rate Alert Thresholds
Teams often set multi-tiered burn rate alerts:
- High burn rate (1h window): Consuming budget 10x faster than expected. Alert immediately.
- Medium burn rate (6h window): Consuming budget 3x faster than expected. Alert to on-call.
- Low burn rate (30d window): Already consumed 50% of budget. Alert to engineering management.
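The tiers above could be evaluated with a small rule table. This is a sketch only: the thresholds are copied from the list above, while the `BurnRateRule` type, the function name, and the input shape (a mapping from window length in hours to measured burn rate) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    name: str
    window_hours: int
    threshold: float  # burn rate multiple that triggers the alert
    action: str


# Thresholds taken from the tiers listed above.
RULES = [
    BurnRateRule("high", 1, 10.0, "alert immediately"),
    BurnRateRule("medium", 6, 3.0, "alert on-call"),
]


def triggered_alerts(
    burn_rates: dict[int, float],
    budget_consumed_fraction: float,
) -> list[str]:
    """Return the alerts that fire for the measured burn rates.

    `burn_rates` maps a window length in hours to the burn rate measured
    over that window; windows with no measurement never fire.
    """
    alerts = [
        f"{rule.name}: {rule.action}"
        for rule in RULES
        if burn_rates.get(rule.window_hours, 0.0) >= rule.threshold
    ]
    # The 30-day tier is based on total consumption, not on a rate.
    if budget_consumed_fraction >= 0.5:
        alerts.append("low: alert engineering management")
    return alerts


if __name__ == "__main__":
    print(triggered_alerts({1: 12.0, 6: 2.0}, budget_consumed_fraction=0.2))
```

In a real setup the burn rates would come from your monitoring system; here they are passed in directly to keep the example self-contained.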
Smart alerting: Burn rate alerts let you take proactive action before violating SLA. High burn rate today might mean you'll violate SLA by Friday. You can then decide: should we stabilize immediately, or investigate the root cause first?
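To estimate when the budget would run out if nothing changes, divide what is left by the current loss rate. The sketch below assumes the loss rate stays constant, which real incidents rarely do, so treat the result as a rough projection; the function name is illustrative.

```python
def days_until_exhausted(remaining_minutes: float, minutes_lost_per_day: float) -> float:
    """Project how many days remain before the error budget hits zero.

    Assumes a constant loss rate. Returns infinity when nothing is being lost.
    """
    if minutes_lost_per_day <= 0:
        return float("inf")
    return max(remaining_minutes, 0.0) / minutes_lost_per_day


if __name__ == "__main__":
    # 33.2 minutes left while losing 10 minutes per day.
    print(f"{days_until_exhausted(33.2, 10):.2f} days")
```

With 33.2 minutes left and 10 minutes lost per day, the budget lasts a little over three days, which is the kind of "by Friday" warning described above.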
Error Budget Best Practices
Teams that use error budgets effectively see better outcomes. Here are proven practices:
Be Conservative with SLOs
Choose an SLO you can consistently beat. A 99.9% SLO that you violate every other month is meaningless. Better to commit to 99% and exceed it. You can always improve later.
Make Error Budget Visible
Track remaining error budget in a dashboard. Show it during planning meetings. The team should know: "We have 20 minutes left this month, is this deployment worth it?" Visibility drives better decisions.
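The number behind such a dashboard can be as simple as the budget minus the sum of recorded outages. The sketch below is illustrative: `BudgetStatus` and `summarize` are hypothetical names, and the outage durations would normally come from your monitoring data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    budget_minutes: float
    used_minutes: float

    @property
    def remaining_minutes(self) -> float:
        return self.budget_minutes - self.used_minutes

    @property
    def used_percent(self) -> float:
        return 100 * self.used_minutes / self.budget_minutes


def summarize(budget_minutes: float, outage_minutes: list[float]) -> BudgetStatus:
    """Summarize budget usage from a list of outage durations in minutes."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return BudgetStatus(budget_minutes, sum(outage_minutes))


if __name__ == "__main__":
    status = summarize(43.2, [15.0, 8.2])
    print(
        f"{status.remaining_minutes:.1f} minutes left "
        f"({status.used_percent:.0f}% of budget used)"
    )
```

Two outages of 15.0 and 8.2 minutes against a 43.2-minute budget leave 20 minutes, the figure a team would weigh a deployment against.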
Link Error Budget to Feature Development
Make error budget part of engineering culture. "We hit our deployments, but we're out of error budget now. Let's stabilize before the next release." This aligns incentives: everyone benefits from reliable services.
Use Different SLOs for Different Services
Not all services need the same SLO. Your payment API might need 99.99%, but your blog can get away with 99%. Allocate error budgets based on business impact. This optimizes resource spending.
Invest in Reliability When Running Low
When error budget is tight, shift team focus to reliability. Fix flaky tests, upgrade infrastructure, improve observability. These investments reduce burn rate and give you more breathing room.
Post-Incident: Error Budget Analysis
After incidents, analyze error budget impact. "This incident cost us 15% of our monthly budget. Can we prevent this? Should we invest in mitigation?" This ties incidents directly to business decisions.
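Expressing an incident as a share of the period's budget is a one-line calculation. The helper below is a minimal sketch with an illustrative name.

```python
def incident_budget_share(incident_minutes: float, budget_minutes: float) -> float:
    """Return the percentage of the period's error budget one incident used."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return 100 * incident_minutes / budget_minutes


if __name__ == "__main__":
    # A 6.48-minute incident against a 43.2-minute monthly budget.
    print(f"{incident_budget_share(6.48, 43.2):.0f}% of monthly budget")
```

Against a 99.9% monthly SLO (43.2 minutes), an incident costing 15% of the budget corresponds to about 6.5 minutes of downtime.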
Frequently Asked Questions
What is an error budget?
An error budget is the amount of downtime you're allowed to have while still meeting your SLO (Service Level Objective). It's calculated as 100% minus your SLO. For example, a 99.9% uptime SLO gives you a 0.1% error budget, which translates to about 43 minutes of allowed downtime per 30-day month.
How do I calculate error budget?
Error budget = (100% - SLO%). For example: 99.9% SLO = 0.1% error budget. To translate to time: 0.1% of 30 days = 0.72 hours = 43.2 minutes per month. Or 0.1% of a year = 8.76 hours per year. Use this formula: (100% - SLO%) × (total time period) = error budget in time.
Why is error budget important?
Error budget aligns engineering with business priorities. If you have only 43 minutes of error budget per month and you've used 35 of them, you should freeze new feature development and focus on reliability. Error budgets prevent teams from shipping unreliable features just to hit deadlines.
What happens if I exceed my error budget?
If you exceed your error budget, you've missed your SLO. If your SLA guarantees the same level, you've also violated the SLA: you're no longer meeting the uptime guarantee you promised to customers. You might owe service credits, face penalties, or damage your reputation. More importantly, it signals that your reliability investments weren't sufficient.
Can I use my error budget to deploy?
Yes, intentionally. If you have error budget available (say 40 minutes) and a deployment risks 10 minutes of downtime, you can do it. Error budgets let teams make deliberate risk/reward decisions: 'We have budget, we're deploying to get this feature to customers.' This is better than deployments that accidentally exceed error budget.
What's a typical error budget for different SLOs?
99% SLO = 3.65 days/year, 99.9% SLO = 8.76 hours/year, 99.99% SLO = 52.6 minutes/year. Higher SLOs are much stricter. A 99.99% SLO leaves almost no room for error, requiring substantial investment in redundancy and failover systems.
How do burn rate alerts relate to error budget?
Burn rate measures how fast you're consuming error budget. If your error budget is 43 minutes/month and you lose 10 minutes in a day, your burn rate is very high. Burn rate alerts can warn you before you've exhausted your entire budget, giving you time to take action (stability focus, limited deployments, etc.).