MTTI (Mean Time to Identify)
MTTI measures how long it takes your team to figure out what went wrong during an incident. The faster you can diagnose the problem, the faster you can fix it. The gain is largest when diagnosis dominates the incident: if the fix itself takes 40 minutes, reducing MTTI from 60 minutes to 10 minutes cuts total incident duration from 100 minutes to 50.
Definition
MTTI (Mean Time to Identify) is the average time elapsed between when an incident occurs (or is detected) and when the root cause is identified. It measures how quickly your team can diagnose the underlying problem causing the incident.
MTTI = (time root cause identified) - (incident start time). For example, if an outage starts at 10:00 AM and the team identifies the root cause at 10:18 AM, the MTTI for that incident is 18 minutes. Calculate MTTI for all incidents over a period (month, quarter, year) and take the average to get your Mean Time to Identify.
MTTI vs MTTA vs MTTR: Understanding the Differences
These metrics measure different parts of the incident response process. Understanding the distinction is crucial:
MTTA (Mean Time to Acknowledge)
Measures how fast your on-call engineer notices the alert and acknowledges it. This is about response speed.
Goal: Keep MTTA under 5 minutes for critical services. Depends on the alert delivery system (email, Slack, PagerDuty) and on-call culture.
MTTI (Mean Time to Identify)
Measures how fast your team diagnoses the root cause. This is about investigative speed.
Goal: Keep MTTI under 15 minutes for critical services and under 30 minutes for other important services. Depends on observability, runbooks, and team expertise.
MTTR (Mean Time to Recovery)
Measures how fast your team fixes the problem and restores service. This is about fix speed.
Goal: Keep MTTR under 1 hour for critical services. MTTR = MTTI + fix implementation time + deployment time.
Key insight: MTTI is always a component of MTTR. You can't fix a problem until you know what it is. If you want to reduce MTTR, reducing MTTI is one of the most effective approaches. For many teams, improving observability to reduce MTTI has a bigger impact than optimizing deployment speed.
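To make the decomposition concrete, here is a minimal Python sketch of MTTR = MTTI + fix implementation time + deployment time. The helper name `mttr_breakdown` and all durations are hypothetical illustration values, not measurements.

```python
def mttr_breakdown(identify_min: float, fix_min: float, deploy_min: float) -> dict:
    """Sum the three phases (in minutes) and report how much of the total is diagnosis."""
    total = identify_min + fix_min + deploy_min
    if total <= 0:
        raise ValueError("phase durations must sum to a positive number")
    return {"mttr_min": total, "identify_share": identify_min / total}


# Hypothetical incident: 30 min to diagnose, 10 min to write the fix, 10 min to roll out.
before = mttr_breakdown(identify_min=30, fix_min=10, deploy_min=10)
# The same incident if diagnosis had taken 10 minutes instead.
after = mttr_breakdown(identify_min=10, fix_min=10, deploy_min=10)

print(before["mttr_min"], after["mttr_min"])  # 50 30
```

In this made-up case diagnosis accounts for 60% of recovery time, so shortening it moves MTTR more than shaving a few minutes off the deployment would.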
Why MTTI Matters
MTTI is one of the most important metrics for incident response. Here's why:
Directly Impacts Customer Impact
Every minute you're in the dark is a minute your service is likely still down. If your MTTI is 60 minutes, customers are suffering for at least 60 minutes before you even know what the problem is. Reduce MTTI to 10 minutes, and you can start fixing within 10 minutes.
Enables Faster Recovery
You can't fix a problem you don't understand. A short MTTI means you reach the "we know what's wrong" point quickly, allowing you to implement a fix immediately. A long MTTI means wasted time spinning on investigation while your service is down.
Indicator of System Observability
A team with excellent observability (good logs, metrics, traces, dashboards) has a much lower MTTI. A team with poor observability gets stuck investigating. MTTI is a proxy for how well you can see what's happening in your systems.
Reflects Team Expertise
A team that knows its systems well and has good runbooks for common issues will identify root causes quickly. A team new to the codebase or dealing with a novel issue will take longer. MTTI improves as team expertise and system understanding grow.
SLA Impact
If your SLA allows 1 hour of downtime per month and a single incident takes 45 minutes to diagnose, you have only 15 minutes left to implement and deploy a fix, with nothing to spare for any other incident that month. A long MTTI forces you to fix issues extremely quickly just to meet SLAs. A short MTTI gives you time to implement proper fixes instead of quick hacks.
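The budget arithmetic can be sketched in a few lines of Python. The function name `remaining_fix_budget` is hypothetical, and the sketch assumes a single incident consumes the whole monthly downtime allowance.

```python
def remaining_fix_budget(downtime_budget_min: float, time_to_identify_min: float) -> float:
    """Minutes left to implement and deploy a fix once the root cause is known.

    Assumes one incident uses the entire downtime budget; returns 0.0 when
    diagnosis alone has already exhausted it.
    """
    return max(0.0, float(downtime_budget_min) - float(time_to_identify_min))


print(remaining_fix_budget(60, 45))  # 15.0
print(remaining_fix_budget(60, 10))  # 50.0
```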
How to Calculate MTTI
Calculating MTTI requires tracking incident timeline data. Here's the process:
Step 1: Record Incident Timeline
For each incident, record:
- Incident start: When the problem began (from logs or metrics)
- Alert time: When the alerting system detected the issue
- Acknowledgment time: When the on-call engineer acknowledged the alert
- Root cause identified: When the team figured out what went wrong
- Fix deployed: When the fix was implemented and rolled out
- Service recovered: When service returned to normal
Step 2: Calculate Individual MTTI
For each incident:
MTTI = (root cause identified time) - (incident start time)
Example: Root cause identified 10:18 AM, incident started 10:00 AM → MTTI = 18 minutes
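One way to hold the Step 1 timeline and derive the Step 2 value is sketched below in Python. The class and field names are hypothetical, and only the 10:00 start and 10:18 root-cause times come from the example; the date and the other timestamps are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """One incident's timestamps, mirroring the fields listed in Step 1."""
    incident_start: datetime
    alert_time: datetime
    acknowledged: datetime
    root_cause_identified: datetime
    fix_deployed: datetime
    service_recovered: datetime

    def time_to_identify(self) -> timedelta:
        # Root cause identified time minus incident start time.
        return self.root_cause_identified - self.incident_start


day = datetime(2024, 1, 15)  # arbitrary date, for illustration only
incident = IncidentTimeline(
    incident_start=day.replace(hour=10, minute=0),
    alert_time=day.replace(hour=10, minute=2),             # invented
    acknowledged=day.replace(hour=10, minute=4),           # invented
    root_cause_identified=day.replace(hour=10, minute=18),
    fix_deployed=day.replace(hour=10, minute=35),          # invented
    service_recovered=day.replace(hour=10, minute=40),     # invented
)

minutes = incident.time_to_identify().total_seconds() / 60
print(minutes)  # 18.0
```

Keeping all six timestamps on one record means the same data can later feed acknowledgment and recovery calculations without re-collecting anything.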
Step 3: Calculate Mean MTTI
Over a time period (month, quarter), calculate the average:
Mean MTTI = (sum of all MTTI values) / (number of incidents)
Example: 5 incidents with MTTI of 15, 22, 18, 45, 12 minutes → Mean MTTI = (15+22+18+45+12)/5 = 22.4 minutes
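The same average, as a minimal Python sketch using the five values from the example. The function name `mean_time_to_identify` is hypothetical.

```python
from statistics import mean


def mean_time_to_identify(minutes_per_incident: list[float]) -> float:
    """Average the per-incident identification times (in minutes) for a period."""
    if not minutes_per_incident:
        raise ValueError("no incidents recorded in this period")
    return mean(minutes_per_incident)


print(mean_time_to_identify([15, 22, 18, 45, 12]))  # 22.4
```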
Pro tip: Track MTTI for different incident categories separately (e.g., database issues vs API failures). This reveals which types of problems you diagnose quickly and which ones stump the team.
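A per-category breakdown might look like the Python sketch below. The category labels, and which incident belongs to which category, are hypothetical; the minute values reuse the five from the example.

```python
from collections import defaultdict
from statistics import mean

# (category, minutes to identify) pairs; the category assignment is invented.
incidents = [
    ("database", 45),
    ("database", 22),
    ("api", 15),
    ("api", 18),
    ("api", 12),
]

# Group identification times by category, then average each group.
by_category = defaultdict(list)
for category, minutes in incidents:
    by_category[category].append(minutes)

mtti_by_category = {cat: mean(vals) for cat, vals in by_category.items()}
print(mtti_by_category)
```

With this invented split, database incidents average 33.5 minutes against 15 for API incidents, which is the kind of gap the overall 22.4-minute mean hides.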
Typical MTTI Targets by Service Type
Good MTTI targets depend on service criticality and SLA. Here's a framework:
Tier 1 (Critical Services)
Target MTTI: 5-10 minutes
Payment APIs, login systems, core platform. Every minute of downtime impacts revenue. Requires excellent observability and on-call training.
Tier 2 (Important Services)
Target MTTI: 15-30 minutes
Main website, dashboards, APIs. Downtime impacts user experience but not transactions. Good observability and runbooks required.
Tier 3 (Standard Services)
Target MTTI: 30-60 minutes
Internal tools, reporting systems, non-critical APIs. Some downtime is acceptable. Basic observability sufficient.
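These targets can be encoded so that reports flag services that miss them. The Python sketch below is one possible shape; the names `TARGET_MTTI_MIN` and `meets_target` are hypothetical, and it assumes the upper end of each range is the pass/fail threshold.

```python
# Upper bound of each tier's target range, in minutes (assumed to be the threshold).
TARGET_MTTI_MIN = {1: 10, 2: 30, 3: 60}


def meets_target(tier: int, mtti_minutes: float) -> bool:
    """True when the measured MTTI is within the target for the given tier."""
    return mtti_minutes <= TARGET_MTTI_MIN[tier]


print(meets_target(1, 18))  # False
print(meets_target(2, 18))  # True
```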
Strategies to Reduce MTTI
MTTI isn't just a metric to measure; it's something you can actively improve. Here are strategies that work for most teams:
Improve Observability
This usually has the biggest impact. Better logs, metrics, and traces let you see what's happening. Invest in structured logging, comprehensive metrics, and distributed tracing. Create dashboards that show system health at a glance. When an incident hits, good observability can cut investigation time from hours to minutes.
Create Runbooks
Document common incidents and their solutions. A runbook for "Database connection pool exhausted" tells your team exactly what to check and how to fix it. For incidents a runbook covers, MTTI can drop from "60 minutes of investigation" to "5 minutes of following the playbook".
Smart Alerting
Don't just alert on symptoms (high error rate). Alert on root causes when possible. Instead of "Error rate is 5%", alert on "Database query response time > 1000ms". Smart alerts cut investigation time because they point toward the cause.
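The difference between the two alerts can be sketched as a tiny rule evaluator in Python. This is not any alerting product's API: `AlertRule`, `firing`, the metric keys, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    metric: str       # key in the metrics snapshot (hypothetical names)
    threshold: float  # the rule fires when the value is strictly above this


def firing(rules, snapshot):
    """Return the names of rules whose metric exceeds its threshold."""
    return [r.name for r in rules if snapshot.get(r.metric, 0.0) > r.threshold]


rules = [
    AlertRule("Error rate above 5%", "error_rate_pct", 5.0),                     # symptom
    AlertRule("Database query response time > 1000ms", "db_query_ms", 1000.0),  # likely cause
]

snapshot = {"error_rate_pct": 7.2, "db_query_ms": 1840.0}  # made-up sample values
print(firing(rules, snapshot))
```

When both rules fire together, the second one already tells the on-call engineer where to look first.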
Team Training
Your team should understand your systems well. Run incident simulations (gamedays). Document architecture decisions. Have code reviews. When your team knows the codebase deeply, they identify issues faster. Expertise reduces MTTI.
Build Context Dashboards
When an alert fires, your dashboards should immediately show context: recent deployments, error trends, resource usage, dependency health. A dashboard that answers "What changed in the last hour?" helps your team diagnose faster.
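Answering "What changed in the last hour?" is essentially a time-window filter over a change log. The Python sketch below assumes events are available as (timestamp, description) tuples; the function name `changes_since` and the event data are hypothetical.

```python
from datetime import datetime, timedelta


def changes_since(events, now, window=timedelta(hours=1)):
    """Return events that fall within `window` before `now`, newest first."""
    cutoff = now - window
    recent = [e for e in events if cutoff <= e[0] <= now]
    return sorted(recent, key=lambda e: e[0], reverse=True)


now = datetime(2024, 1, 15, 10, 5)  # arbitrary "alert fired" time
events = [
    (datetime(2024, 1, 15, 8, 30), "config change: cache TTL"),
    (datetime(2024, 1, 15, 9, 40), "deploy: api v2.3.1"),
    (datetime(2024, 1, 15, 9, 58), "feature flag enabled: new-checkout"),
]

for ts, what in changes_since(events, now):
    print(ts.strftime("%H:%M"), what)
```

Here the 08:30 config change falls outside the window, so only the deploy and the feature flag are listed as candidates.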
Reduce Complexity
Fewer components means fewer places for things to break and faster diagnosis when they do. Simplify architecture when possible. Reduce third-party dependencies. Less complexity = shorter MTTI.
Frequently Asked Questions
What is MTTI (Mean Time to Identify)?
MTTI is the average time between when an incident starts and when the root cause is identified. It measures how quickly your team can diagnose the problem. A short MTTI means you understand and can fix problems fast. A long MTTI means you're spending a lot of time in the dark, unable to proceed with a fix.
How does MTTI differ from MTTA?
MTTA (Mean Time to Acknowledge) measures how fast your on-call team notices and acknowledges an incident. MTTI (Mean Time to Identify) measures how fast they figure out what's wrong. You might acknowledge an incident in 1 minute but take 30 minutes to identify the root cause. MTTA is about response speed; MTTI is about diagnostic speed.
Why does MTTI matter?
The faster you identify the root cause, the faster you can fix it. If your MTTI is 2 minutes, you can start implementing a fix within 2 minutes. If your MTTI is 60 minutes, your service is down for at least 60 minutes (plus the time to implement the fix). Reducing MTTI directly reduces incident duration and customer impact.
How do I calculate MTTI?
Track the time when an incident started (or was detected) and the time when the root cause was identified. The difference is the MTTI for that incident. MTTI = (root cause identified time) - (incident start time). Calculate this for all incidents over a period (month, quarter) and take the average to get your mean MTTI.
What's a good MTTI target?
It depends on your service and team. For critical systems, aim for 5-15 minutes. For less critical systems, 30-60 minutes is acceptable. The best target is one that leaves enough time to implement and deploy a fix within your acceptable downtime window. If your SLA allows 1 hour of downtime per month, your MTTI should be much less than 1 hour.
How can I reduce MTTI?
Improve observability (better logs, metrics, traces), create runbooks for common issues, implement dashboards that show system health, set up better alerting that pinpoints the problem, and invest in training your on-call team. Tools matter less than process and team knowledge.
What's the difference between MTTI and MTTR?
MTTI is the time to identify (diagnose) the problem. MTTR is the time to recover (fix the problem and restore service). You need to identify before you can fix, so MTTI is always part of MTTR. MTTR = MTTI + (time to implement fix) + (time to deploy/rollout).