What is an SLA (Service Level Agreement)?
An SLA is a legal contract between a service provider and customer that defines performance commitments. It's the promise you make about how reliable and fast your service will be.
Definition
A Service Level Agreement (SLA) is a contractual commitment that defines the expected performance levels, availability, and support response times for a service. It specifies metrics like uptime percentage, response time thresholds, and remedies if targets aren't met.
SLAs are binding agreements: if a provider breaches an SLA, customers are typically entitled to the credits, refunds, or other compensation that the agreement specifies.
Why SLAs Matter
SLAs serve three critical purposes:
Trust Building
Customers need confidence that your service is reliable. An SLA backed by compensation creates accountability.
Risk Mitigation
Credits compensate customers for the impact of downtime. This protects the customer's business and reduces disputes, because the remedy is agreed in advance.
Engineering Focus
SLA commitments drive reliability investments. Teams prioritize uptime when they have public targets to meet.
Common SLA Metrics
Most SLAs include these key metrics:
1. Uptime Percentage
The percentage of time your service is available. Example: "99.9% uptime" means the service can be down for about 43 minutes in a 30-day month. This is usually the headline metric of an SLA.
2. Response Time (Latency)
The maximum acceptable time from request to response. Example: "95% of requests respond within 500ms". This ensures users get fast experiences.
3. Mean Time to Recovery (MTTR)
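The average time to restore service after an outage is detected. Example: "Restore service within 30 minutes of incident detection". Note that the example is worded as a per-incident limit, while MTTR itself is the average across incidents. Either way, it defines how quickly you commit to recovering.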
4. Support Response Time
How fast your support team responds to issues. Example: "Critical issues acknowledged within 15 minutes". This is separate from technical MTTR.
5. Error Rate
Percentage of requests that fail. Example: "Error rate shall not exceed 0.1%". This ensures quality, not just availability.
6. Data Backup & Recovery
Frequency of backups and time to restore data. Example: "Daily backups, restore within 1 hour of request". Critical for data-sensitive services.
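The latency, error-rate, and MTTR examples above can be checked with simple arithmetic over request and incident data. Below is a minimal sketch in Python. The `Request` record, the function names, and the sample numbers are all hypothetical, and the latency check counts every request regardless of whether it succeeded, which a real SLA may define differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    latency_ms: float  # time from request to response
    ok: bool           # False if the request failed


def fraction_within(requests: list[Request], threshold_ms: float) -> float:
    """Share of requests answered within threshold_ms (0.0 to 1.0)."""
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


def error_rate(requests: list[Request]) -> float:
    """Share of requests that failed (0.0 to 1.0)."""
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def mttr_minutes(recovery_minutes: list[float]) -> float:
    """Mean time to recovery: average of per-incident recovery durations."""
    return mean(recovery_minutes)


# Hypothetical sample: 1,000 requests, 30 of them slow, 1 failed.
sample = (
    [Request(120.0, True)] * 969
    + [Request(800.0, True)] * 30
    + [Request(50.0, False)]
)

latency_ok = fraction_within(sample, 500.0) >= 0.95  # "95% within 500ms"
errors_ok = error_rate(sample) <= 0.001              # "shall not exceed 0.1%"
mttr = mttr_minutes([12.0, 30.0, 18.0])              # three past incidents
```

In this sample 97% of requests finish within 500ms and exactly 0.1% fail, so both example targets are met, and the three incidents average 20 minutes to recover.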
Common SLA Tiers & Downtime Allowances
Here are the most common SLA tiers you'll see from cloud providers and SaaS companies. Downtime figures assume a 30-day month and a 365-day year, rounded down:
| SLA Tier | Uptime % | Downtime/Month | Downtime/Year | Typical Credit |
|---|---|---|---|---|
| Basic | 99% | 7 hours 12 min | 3.6 days | 5-10% |
| Standard | 99.9% | 43 minutes | 8 hours 45 min | 10-25% |
| Premium | 99.95% | 21 minutes 36 sec | 4 hours 22 min | 25-50% |
| Enterprise | 99.99% | 4 minutes 19 sec | 52 minutes | 50-100% |
Key insight: Many mature SaaS companies commit to around 99.9%. 99.99% is typically reserved for enterprise services with redundancy across regions. Each additional nine cuts the downtime budget tenfold, so the cost of meeting the target rises steeply.
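The downtime allowances in the table follow directly from the uptime percentage. Here is a small Python sketch of that arithmetic. The function name is illustrative, and the 30-day month and 365-day year are assumptions, since an actual SLA defines its own measurement period.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted in period_days at the given uptime target."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - uptime_percent / 100)


for target in (99.0, 99.9, 99.95, 99.99):
    per_month = allowed_downtime_minutes(target, 30)
    per_year = allowed_downtime_minutes(target, 365)
    print(f"{target}%: {per_month:.1f} min/month, {per_year / 60:.2f} h/year")
```

For a 99.9% target this gives roughly 43.2 minutes per 30-day month, which is the "43 minutes" in the Standard row.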
Components of a Good SLA
A comprehensive SLA should include:
1. Clear Service Description
What is covered? What isn't? Example: "Uptime for API endpoints only; web dashboard excluded from uptime guarantee".
2. Specific Metrics
Use measurable numbers: "99.9% uptime" not "best-effort uptime". Include measurement methodology: "measured from 5+ external locations".
3. Exclusions
Define which outages don't count toward a breach: scheduled maintenance, force majeure, customer misuse, third-party failures, etc.
4. Credit Terms
How much refund/credit for each tier of breach? Example: "99%-99.9% uptime = 5% credit; 98%-99% = 10% credit; <98% = 25% credit".
5. Support Terms
Separate support SLA from technical SLA. Example: "Critical issues supported 24/7 with 15-min response; non-critical within 24 hours".
6. Claim Process
How do customers request credits? Within what timeframe? Example: "Customers must request credits within 30 days of breach with documentation".
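The exclusion, credit, and claim terms above can be combined into one calculation. The following Python sketch uses the example tiers and the 30-day claim window from this section. Everything else is an assumption: the function names are hypothetical, excluded minutes are simply subtracted from counted downtime, and an uptime of exactly 99% is placed in the 5% tier because the example leaves that boundary open.

```python
from datetime import date, timedelta

# Credit tiers from the example above: (minimum uptime %, credit %).
# Checked from the highest threshold down; the first match wins.
CREDIT_TIERS = [
    (99.9, 0),   # target met: no credit
    (99.0, 5),   # 99% up to (not including) 99.9%
    (98.0, 10),  # 98% up to (not including) 99%
]
FLOOR_CREDIT = 25  # anything below 98%

CLAIM_WINDOW = timedelta(days=30)


def measured_uptime(period_min: float, downtime_min: float, excluded_min: float) -> float:
    """Uptime % after removing excluded downtime (e.g. scheduled maintenance)."""
    counted = max(downtime_min - excluded_min, 0.0)
    return 100 * (1 - counted / period_min)


def credit_percent(uptime_percent: float) -> int:
    """Credit owed, as a percentage, for the measured uptime."""
    for minimum, credit in CREDIT_TIERS:
        if uptime_percent >= minimum:
            return credit
    return FLOOR_CREDIT


def claim_is_timely(breach_day: date, claim_day: date) -> bool:
    """True if the claim falls within the window after the breach."""
    return breach_day <= claim_day <= breach_day + CLAIM_WINDOW


# Hypothetical month: 200 minutes down, 60 of them scheduled maintenance.
uptime = measured_uptime(30 * 24 * 60, 200, 60)
credit = credit_percent(uptime)
```

With 140 counted minutes of downtime in a 30-day month, uptime lands between 99% and 99.9%, so the sketch returns the 5% tier.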
Real-World SLA Examples
AWS (Amazon Web Services)
99.99% region-level uptime SLA for EC2. Other services such as S3 and RDS have their own SLAs with different targets. All of them cover infrastructure only, not your code.
EC2 credit: 10% (99%-99.99%), 30% (95%-99%), 100% (<95%). Measured per region.
Stripe (Payment Processing)
99.9% uptime guarantee with transparent uptime tracking. Proven ~99.97% actual uptime.
Credits available but rarely needed—Stripe over-delivers on commitments.
GitHub (Version Control)
99.9% uptime for platform-critical features. Excludes scheduled maintenance windows.
Measured monthly. Support SLA: 1-hour response for critical issues.
Google Workspace (Enterprise)
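99.9% monthly uptime for covered services such as Gmail, Drive, and Docs. The SLA applies to paid Workspace editions, not to consumer accounts.
Credit structure: service days added to the end of the term rather than a percentage refund: 3 days (99%-99.9%), 7 days (95%-99%), 15 days (<95%).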
How to Monitor Your SLA Compliance
Monitoring that runs on your own infrastructure shares its failures: when your infrastructure goes down, your monitoring system might too. Here's how to verify SLA compliance independently:
Use Third-Party External Monitoring
Services like AtomPing check your service from multiple regions. This provides independent, credible proof of your uptime for customer disputes.
Benefits:
- Credibility: If customers dispute uptime, third-party data is more defensible
- Visibility: See outages from customer perspective, not just internal metrics
- Automation: Auto-generate SLA reports for customers and finance
- Historical data: Track trends and improve SLA targets
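To illustrate how checks from multiple regions turn into a single uptime figure, here is a minimal Python sketch. It is not AtomPing's method or API. The probe format, the location names, and the rule that a check interval counts as down only when at least `min_failing` locations fail are all assumptions chosen for the example.

```python
from collections import defaultdict

# Each probe result: (check_time_seconds, location, succeeded). Hypothetical format.
Probe = tuple[int, str, bool]


def uptime_percent(probes: list[Probe], min_failing: int = 2) -> float:
    """Uptime % where a check time counts as down only if at least
    min_failing locations failed at that time."""
    if not probes:
        raise ValueError("no probe results to evaluate")
    failures_by_time: dict[int, int] = defaultdict(int)
    for check_time, _location, succeeded in probes:
        failures_by_time[check_time] += 0 if succeeded else 1
    total = len(failures_by_time)
    down = sum(1 for n in failures_by_time.values() if n >= min_failing)
    return 100 * (total - down) / total


probes = [
    (0, "us-east", True), (0, "eu-west", True), (0, "ap-south", True),
    # One location fails: treated as a local network blip, not an outage.
    (30, "us-east", False), (30, "eu-west", True), (30, "ap-south", True),
    # Two locations fail: counted as downtime.
    (60, "us-east", False), (60, "eu-west", False), (60, "ap-south", True),
    (90, "us-east", True), (90, "eu-west", True), (90, "ap-south", True),
]

uptime = uptime_percent(probes)
```

Requiring agreement between locations keeps a single probe's network problem from being recorded as an outage. Setting `min_failing=1` gives the stricter reading, where any failed check counts against uptime.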
AtomPing's SLA Monitoring Features
- Check service from 25+ global locations every 30 seconds
- Generate automated SLA compliance reports
- Create public status pages showing real-time uptime
- Export uptime data for audits and customer disputes
- Track response time and error rates, not just uptime
Deliver on Your SLA Promises
AtomPing provides the external monitoring and status page tools you need to monitor, track, and communicate your SLA compliance to customers.