Error Budget

An error budget is the amount of downtime you're allowed to have while still meeting your SLO (uptime goal), and therefore any SLA set at the same level. It's calculated from the SLO and gives you a clear number: if you've budgeted 43 minutes of downtime per month, you have 43 minutes before you miss the target.

Definition

An error budget is the maximum amount of downtime (or errors) a service can incur during a measured period while still meeting its SLO. It's calculated as the complement of the SLO: error budget = 100% - SLO%.

For example, a 99.9% uptime SLO implies a 0.1% error budget. Over a 30-day month, this translates to approximately 43 minutes of allowed downtime. Once your service has been down for more than 43 minutes, you've exhausted your budget and missed your SLO; if your SLA guarantees the same 99.9%, you're in SLA violation as well.

Calculating Your Error Budget

Error budget is straightforward to calculate. Here's the math:

The Basic Formula

Error Budget % = 100% - SLO%

Example:

SLO = 99.9%

Error Budget = 100% - 99.9% = 0.1%

This gives you a percentage. To convert to actual time, multiply by the period length.

Converting to Time Periods

Error Budget (in time) = Error Budget % × Period Length

For 99.9% SLO (0.1% error budget):

Per month (30 days): 0.1% × 30 days = 0.72 hours = 43.2 minutes

Per year (365 days): 0.1% × 365 days = 8.76 hours

Per week (7 days): 0.1% × 7 days = 10.08 minutes
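
The conversion above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name error_budget is hypothetical, and the period lengths (7, 30, and 365 days) are the same simplifications used in the text.

```python
from datetime import timedelta


def error_budget(slo_percent: float, period: timedelta) -> timedelta:
    """Allowed downtime for an SLO over a period: (100% - SLO%) x period."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    budget_fraction = (100 - slo_percent) / 100
    # timedelta supports multiplication by a float (rounded to microseconds).
    return period * budget_fraction


for label, days in [("week", 7), ("month", 30), ("year", 365)]:
    budget = error_budget(99.9, timedelta(days=days))
    minutes = budget.total_seconds() / 60
    print(f"99.9% over a {days}-day {label}: {minutes:.2f} minutes")
```

For a 99.9% SLO this reports 10.08, 43.20, and 525.60 minutes (8.76 hours), matching the figures above.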

Key insight: Error budget is binary within a period. You don't get "partial credit" for uptime. If your SLO is 99.9% per month and you achieve 99.89%, you've missed the target for the entire month, even though you were only about four minutes over budget. This is why error budgets change team behavior: once most of the budget is spent, a single short unplanned outage can push the whole month into violation.
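
To make the pass/fail nature concrete, here is a small sketch that turns measured downtime into an availability percentage and compares it to the SLO. The function names are hypothetical, and the 47.52-minute figure is simply the downtime that corresponds to 99.89% over a 30-day month (43,200 minutes). Downtime that lands exactly on the boundary is subject to floating-point rounding, which this sketch does not try to handle.

```python
def availability_percent(downtime_minutes: float, period_minutes: float) -> float:
    """Achieved availability over the period, as a percentage."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return 100 * (1 - downtime_minutes / period_minutes)


def meets_slo(downtime_minutes: float, period_minutes: float, slo_percent: float) -> bool:
    """True if achieved availability is at or above the SLO. There is no partial credit."""
    return availability_percent(downtime_minutes, period_minutes) >= slo_percent


MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day month

for downtime in (40.0, 47.52):
    achieved = availability_percent(downtime, MONTH_MINUTES)
    verdict = "met" if meets_slo(downtime, MONTH_MINUTES, 99.9) else "missed"
    print(f"{downtime} min down -> {achieved:.2f}% -> SLO {verdict}")
```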

Common SLOs and Their Error Budgets

Here's a reference table showing error budgets for common SLOs:

SLO	Error Budget	Per Month (30d)	Per Year	Typical For
99%	1%	7.2 hours	3.65 days	Internal tools, staging
99.5%	0.5%	3.6 hours	1.83 days	Standard services
99.9%	0.1%	43.2 minutes	8.76 hours	Production services
99.95%	0.05%	21.6 minutes	4.38 hours	Critical services
99.99%	0.01%	4.32 minutes	52.6 minutes	Mission-critical systems
99.999%	0.001%	25.9 seconds	5.26 minutes	Extreme reliability (rare)

Observation: Notice how error budgets shrink dramatically with each additional "9". A 99.99% SLO is 10x stricter than 99.9%: the monthly allowance drops from 43.2 minutes to 4.32 minutes. Going from 99.9% to 99.99% typically requires far more investment in redundancy, failover systems, and operational excellence.

Error Budget, SLO, and SLA: Understanding the Relationship

These three concepts are related but distinct. Understanding their relationship is crucial:

SLO (Service Level Objective)

Your internal target for uptime. "We want to achieve 99.9% uptime." This is the goal you build and operate your systems to achieve; it is not by itself a promise to customers.

Example: We aim for 99.9% uptime

SLA (Service Level Agreement)

Your contractual obligation to customers. "We guarantee 99.9% uptime, or you get service credits." SLAs often have penalties for violations.

Example: If uptime falls below 99.9%, customers receive 10% refund

Error Budget

The downtime allowance derived from your SLO. "99.9% SLO = 43.2 minutes/month allowed." Your team uses this to decide when to take risks.

Example: We have 43 minutes of downtime budget. If we've used 40, we should focus on stability before deploying risky changes.

The relationship: Your SLO determines your error budget, which guides engineering decisions. If you exceed your error budget, you have missed your SLO; if your SLA guarantees the same level, you have also violated the SLA and may owe customers compensation. Error budgets translate business commitments (SLA) into engineering guidelines (the SLO and day-to-day decisions).

Using Error Budget for Decisions

Error budgets are most valuable when they drive decision-making. Here's how teams use them in practice:

Scenario: Budget Available (Plenty Remaining)

Situation: It's the 5th of the month. You have 40 minutes of error budget remaining (plenty for the rest of the month).

Decision: You approve a deployment of a new feature that you estimate has a 0.2% chance of causing a 5-minute outage. This is a calculated risk you can afford.

Reasoning: You have budget to spend. Taking measured risks to deliver features is the right call.
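
The arithmetic behind this call can be sketched as follows. The function name deployment_risk is hypothetical; the inputs are the figures from the scenario (40 minutes remaining, a 0.2% chance of a 5-minute outage). It reports both the expected cost and the worst case, since a team may care about either.

```python
from typing import Tuple


def deployment_risk(
    remaining_minutes: float,
    outage_probability: float,
    outage_minutes: float,
) -> Tuple[float, float]:
    """Return (expected downtime cost, budget left if the outage actually happens)."""
    if not 0 <= outage_probability <= 1:
        raise ValueError("outage_probability must be between 0 and 1")
    expected_cost = outage_probability * outage_minutes
    worst_case_remaining = remaining_minutes - outage_minutes
    return expected_cost, worst_case_remaining


expected, worst_case = deployment_risk(40, 0.002, 5)
print(f"Expected cost: {expected:.2f} min; worst case leaves {worst_case:.0f} min")
```

With these inputs the expected cost is 0.01 minutes, and even the worst case leaves 35 minutes of budget.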

Scenario: Budget Tight (Little Remaining)

Situation: It's the 20th of the month. You've had an outage on the 18th (20 minutes down). You have 23 minutes remaining.

Decision: You freeze deployments except for critical bug fixes. The team shifts focus to reliability improvements.

Reasoning: You're close to SLA violation. Risk is unacceptable. Investing in stability is more important than new features for the next 10 days.

Scenario: Budget Exhausted

Situation: It's the 25th. You've already used 50 minutes (7 minutes more than budget). You're in SLA violation.

Decision: All hands on deck. Post-mortem and prevention become top priority. No feature development until root causes are understood and fixed.

Reasoning: The damage is done. You're in violation and may owe customers refunds. Preventing recurrence is critical.
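
The three scenarios can be summarized as a simple policy function. This is only a sketch: the name deployment_policy and the 60% freeze cutoff are assumptions chosen so that the function reproduces the three decisions above, not a recommended threshold. Real policies often also weigh how much of the period is left.

```python
def deployment_policy(
    remaining_minutes: float,
    budget_minutes: float,
    freeze_below: float = 0.6,  # hypothetical cutoff: fraction of budget remaining
) -> str:
    """Map remaining error budget to one of the three scenario decisions."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    if remaining_minutes <= 0:
        return "exhausted: stop feature work, run post-mortem, fix root causes"
    if remaining_minutes / budget_minutes < freeze_below:
        return "tight: freeze deployments except critical bug fixes"
    return "available: approve measured-risk deployments"


MONTHLY_BUDGET = 43.2  # minutes, for a 99.9% SLO over 30 days

for remaining in (40, 23, -7):
    print(f"{remaining:>4} min left -> {deployment_policy(remaining, MONTHLY_BUDGET)}")
```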

Burn Rate Alerts: Proactive Error Budget Management

Waiting until you've exhausted your error budget is too late. Many teams implement burn rate alerts to warn when error budget is being consumed too quickly:

What is Burn Rate?

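Burn rate is how fast you're consuming your error budget, measured against the pace that would use it up exactly at the end of the period. A burn rate of 1x spends the whole budget by the last day; anything above 1x exhausts it early. If your monthly error budget is 43.2 minutes, the sustainable pace is 1.44 minutes/day, so losing 10 minutes on day 1 is a burn rate of about 6.9x.

Burn Rate = (Budget Consumed / Time Elapsed) / (Total Budget / Period Length)

Example:

Lost 10 minutes on day 1 of month

Remaining budget: 33.2 minutes

Burn rate: (10 min / 1 day) / (43.2 min / 30 days) ≈ 6.9x expected rate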
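
As a rough sketch, burn rate can be computed as the observed pace of consumption divided by the even-spend pace (total budget divided by period length). The function names here are hypothetical. Under this definition, losing 10 minutes on day 1 of a 30-day month with a 43.2-minute budget works out to roughly 6.9 times the sustainable pace.

```python
def burn_rate(
    consumed_minutes: float,
    elapsed_days: float,
    budget_minutes: float,
    period_days: float,
) -> float:
    """Observed consumption pace divided by the pace that lasts the whole period."""
    if min(elapsed_days, budget_minutes, period_days) <= 0:
        raise ValueError("elapsed_days, budget_minutes and period_days must be positive")
    observed_pace = consumed_minutes / elapsed_days      # minutes per day
    sustainable_pace = budget_minutes / period_days      # minutes per day
    return observed_pace / sustainable_pace


def days_until_exhausted(remaining_minutes: float, pace_minutes_per_day: float) -> float:
    """How long the remaining budget lasts if the current pace continues."""
    if pace_minutes_per_day <= 0:
        raise ValueError("pace_minutes_per_day must be positive")
    return remaining_minutes / pace_minutes_per_day


rate = burn_rate(consumed_minutes=10, elapsed_days=1, budget_minutes=43.2, period_days=30)
print(f"Burn rate: {rate:.1f}x")
print(f"Budget gone in {days_until_exhausted(33.2, 10):.1f} more days at this pace")
```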

Common Burn Rate Alert Thresholds

Teams often set multi-tiered burn rate alerts:

High burn rate (1h window): Consuming budget 10x faster than expected. Alert immediately.
Medium burn rate (6h window): Consuming budget 3x faster than expected. Alert to on-call.
Low burn rate (30d window): Already consumed 50% of budget. Alert to engineering management.
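
The tiers above could be evaluated with something like the following sketch. The BudgetStatus type, its field names, and the alert strings are all hypothetical; the thresholds (10x over 1 hour, 3x over 6 hours, 50% consumed over 30 days) are the ones listed above. How the per-window burn rates are measured is left to your monitoring system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BudgetStatus:
    burn_rate_1h: float           # burn rate measured over the last hour
    burn_rate_6h: float           # burn rate measured over the last 6 hours
    consumed_fraction_30d: float  # share of the 30-day budget already used (0.0 to 1.0)


def alerts_for(status: BudgetStatus) -> List[str]:
    """Return every alert tier that the current status triggers."""
    alerts = []
    if status.burn_rate_1h >= 10:
        alerts.append("high: alert immediately")
    if status.burn_rate_6h >= 3:
        alerts.append("medium: alert on-call")
    if status.consumed_fraction_30d >= 0.5:
        alerts.append("low: alert engineering management")
    return alerts


print(alerts_for(BudgetStatus(burn_rate_1h=12, burn_rate_6h=4, consumed_fraction_30d=0.2)))
print(alerts_for(BudgetStatus(burn_rate_1h=1, burn_rate_6h=1, consumed_fraction_30d=0.1)))
```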

Smart alerting: Burn rate alerts let you take proactive action before violating SLA. High burn rate today might mean you'll violate SLA by Friday. You can then decide: should we stabilize immediately, or investigate the root cause first?

Error Budget Best Practices

Teams that use error budgets effectively see better outcomes. Here are proven practices:

Be Conservative with SLOs

Choose an SLO you can consistently beat. A 99.9% SLO that you violate every other month is meaningless. Better to commit to 99% and exceed it. You can always improve later.

Make Error Budget Visible

Track remaining error budget in a dashboard. Show it during planning meetings. The team should know: "We have 20 minutes left this month, is this deployment worth it?" Visibility drives better decisions.

Link Error Budget to Feature Development

Make error budget part of engineering culture. "We hit our deployments, but we're out of error budget now. Let's stabilize before the next release." This aligns incentives: everyone benefits from reliable services.

Use Different SLOs for Different Services

Not all services need the same SLO. Your payment API might need 99.99%, but your blog can get away with 99%. Allocate error budgets based on business impact. This optimizes resource spending.

Invest in Reliability When Running Low

When error budget is tight, shift team focus to reliability. Fix flaky tests, upgrade infrastructure, improve observability. These investments reduce burn rate and give you more breathing room.

Post-Incident: Error Budget Analysis

After incidents, analyze error budget impact. "This incident cost us 15% of our monthly budget. Can we prevent this? Should we invest in mitigation?" This ties incidents directly to business decisions.
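
A post-incident review might express the cost of an outage as a share of the period's budget. The sketch below assumes a 43.2-minute monthly budget (99.9% over 30 days) and a hypothetical 6.48-minute incident, which is what a 15% cost would correspond to; the function name is illustrative.

```python
def budget_share(incident_minutes: float, budget_minutes: float) -> float:
    """Fraction of the period's error budget consumed by one incident."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return incident_minutes / budget_minutes


share = budget_share(incident_minutes=6.48, budget_minutes=43.2)
print(f"This incident cost {share:.0%} of the monthly budget")
```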

Frequently Asked Questions

What is an error budget?

An error budget is the amount of downtime you're allowed to have while still meeting your SLO (Service Level Objective), and any SLA set at the same level. It's calculated as 100% minus your SLO. For example, a 99.9% uptime SLO gives you a 0.1% error budget, which translates to about 43 minutes of allowed downtime per month.

How do I calculate error budget?

Error budget = (100% - SLO%). For example: 99.9% SLO = 0.1% error budget. To translate to time: 0.1% of 30 days = 0.72 hours = 43.2 minutes per month. Or 0.1% of a year = 8.76 hours per year. Use this formula: (100% - SLO%) × (total time period) = error budget in time.

Why is error budget important?

Error budget aligns engineering with business priorities. If you have only 43 minutes of error budget per month and you've used 35 of them, you should freeze new feature development and focus on reliability. Error budgets prevent teams from shipping unreliable features just to hit deadlines.

What happens if I exceed my error budget?

If you exceed your error budget, you've missed your SLO and, if your SLA guarantees the same level, violated your SLA. You're no longer meeting the uptime guarantee you promised to customers. You might owe service credits, face penalties, or damage your reputation. More importantly, it signals that your reliability investments weren't sufficient.

Can I use my error budget to deploy?

Yes, intentionally. If you have error budget available (say 40 minutes) and a deployment risks 10 minutes of downtime, you can do it. Error budgets let teams make deliberate risk/reward decisions: 'We have budget, we're deploying to get this feature to customers.' This is better than deployments that accidentally exceed error budget.

What's a typical error budget for different SLOs?

99% SLO = 3.65 days/year, 99.9% SLO = 8.76 hours/year, 99.99% SLO = 52.6 minutes/year. Higher SLOs are much stricter: each additional "9" cuts the budget by a factor of ten. A 99.99% SLO leaves almost no room for error, requiring substantial investment in redundancy and failover systems.

How do burn rate alerts relate to error budget?

Burn rate measures how fast you're consuming error budget. If your error budget is 43 minutes/month and you lose 10 minutes in a day, your burn rate is very high. Burn rate alerts can warn you before you've exhausted your entire budget, giving you time to take action (stability focus, limited deployments, etc.).
