AI API endpoints aren't like regular REST APIs. They're slow (2-30 seconds for a response), expensive (you pay per token), and subject to rate limiting. If your application depends on OpenAI or Anthropic, their downtime means your downtime. If you don't monitor them, you'll find out about failures from production errors.
Monitoring AI endpoints differs from regular API monitoring. You need to check not just "does the endpoint respond" but "did it actually process my prompt", "did the response pass validation", and "how many tokens did this cost me". And because requests are expensive, you also need to watch costs and rate limits before they spiral out of control.
Why AI APIs Need Special Monitoring
Slowness
OpenAI GPT-4 usually responds in 5-10 seconds, sometimes 30. If your HTTP timeout is 5 seconds (standard for regular APIs), you'll get constant timeout errors even when the API is working. Set the timeout to 60+ seconds and track actual latency to detect slowdowns.
Cost
GPT-4 costs ~$0.015 per 1000 input tokens and ~$0.06 per 1000 output tokens. If your health check costs $0.05 and you run it every minute, that's $72/day (over $2,000/month) just on monitoring. You need cost tracking to keep spending under control, and response validation so you aren't paying for tokens wasted on broken responses.
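The arithmetic above can be sketched as a small cost helper. The rates and function names are illustrative assumptions; always check the provider's current pricing page:

```python
# Rough per-request cost estimate using GPT-4-class rates.
# These rates are illustrative, not authoritative pricing.
GPT4_INPUT_PER_1K = 0.015   # $ per 1000 input tokens
GPT4_OUTPUT_PER_1K = 0.06   # $ per 1000 output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of a single chat completion."""
    return (input_tokens / 1000) * GPT4_INPUT_PER_1K \
         + (output_tokens / 1000) * GPT4_OUTPUT_PER_1K

def monitoring_cost(cost_per_check: float, checks_per_day: int, days: int = 30) -> float:
    """Total monitoring spend over a billing period."""
    return cost_per_check * checks_per_day * days

# A $0.05 check every minute is 1440 checks/day:
print(round(monitoring_cost(0.05, 1440, days=1), 2))  # → 72.0
```

Running this for a full 30-day month gives $2,160, which is why a cheap model and a 10+ minute interval matter for health checks.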
Rate Limiting
OpenAI rate-limits by plan. Free tier: 3 requests/minute. Paid: 3500 requests/minute. Exceed the limit and you get 429 Too Many Requests: not downtime on OpenAI's side, but still a failure for your app. Monitor rate limits and alert as you approach them.
Response Quality
The API can respond, yet the response might be garbage. Prompt: "Generate SQL to fetch users"; response: "I can't help with that". The endpoint technically works (returns 200), but the answer is useless. You need keyword validation: check that the response contains expected text (e.g., "SELECT").
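A minimal keyword check can be a few lines. This is a sketch; the function name and the loose substring matching are assumptions, not a specific product API:

```python
def passes_keyword_check(response_text: str, expected_keywords: list[str]) -> bool:
    """True if any expected keyword appears in the response (case-insensitive)."""
    text = response_text.lower()
    return any(kw.lower() in text for kw in expected_keywords)

# A real SQL answer passes; a refusal fails despite HTTP 200.
print(passes_keyword_check("SELECT * FROM users;", ["SELECT"]))    # → True
print(passes_keyword_check("I can't help with that", ["SELECT"]))  # → False
```

Substring matching is deliberately loose; for stricter checks you could match whole words or a regex.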
Monitoring OpenAI API
Health Check Pattern
OpenAI doesn't provide a /health endpoint, so send a real (minimal) API call:
Simple prompt: curl -X POST https://api.openai.com/v1/chat/completions -H "Authorization: Bearer $OPENAI_API_KEY" -H "Content-Type: application/json" -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Respond with OK"}], "max_tokens": 10}'
Expected response: JSON with "choices" array, first choice has "message" object with "content". Content should be roughly "OK" or similar.
Keyword validation: response should contain "OK" (or variants: "ok", "Okay"). If response "I can't process this", it's failure despite HTTP 200.
Response time assertion: max 30 seconds. If slower, something wrong (OpenAI overloaded or network issue).
Frequency: every 10 minutes (not every minute, cost exceeds value). Or use cheaper model (GPT-3.5) for health checks.
Cost optimization: use GPT-3.5 instead of GPT-4 for checks. ~10x cheaper. Make check simple (max_tokens=10) to minimize tokens.
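Putting the pattern above together, a health check might look like the sketch below. It assumes OPENAI_API_KEY is set in the environment; the model choice, 30-second timeout, and the shape of the returned dict are example choices, not a prescribed implementation:

```python
import json
import os
import time
import urllib.request

def validate_content(content: str) -> bool:
    """Keyword validation: accept 'OK' and close variants (loose substring match)."""
    return any(v in content.lower() for v in ("ok", "okay"))

def openai_health_check(timeout_s: float = 30.0) -> dict:
    """One cheap health-check call against the chat completions endpoint."""
    payload = json.dumps({
        "model": "gpt-3.5-turbo",  # ~10x cheaper than gpt-4 for checks
        "messages": [{"role": "user", "content": "Respond with OK"}],
        "max_tokens": 10,          # keep the check tiny to minimize token cost
    }).encode("utf-8")
    req = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        status = resp.status
        body = json.load(resp)
    latency = time.monotonic() - start
    content = body["choices"][0]["message"]["content"]
    return {
        "http_ok": status == 200,
        "latency_s": latency,          # compare against your baseline
        "content_ok": validate_content(content),
    }
```

A check only passes when all three fields are healthy: HTTP 200, latency under the assertion threshold, and keyword validation on the content.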
Monitor Rate Limits
OpenAI returns headers with rate limit info. Check these and alert approaching limit:
Response headers: x-ratelimit-limit-requests, x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens. Parse and alert if remaining less than 10% of limit.
Example: if limit is 1000 requests/minute, alert when remaining ≤ 100. Gives time to respond and reduce traffic.
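The header parsing above is simple enough to sketch directly. The header names follow OpenAI's x-ratelimit-* convention from the source; the function name and 10% default are assumptions:

```python
def rate_limit_alert(headers: dict, threshold: float = 0.10) -> bool:
    """True when remaining requests drop to <= threshold of the limit."""
    limit = int(headers["x-ratelimit-limit-requests"])
    remaining = int(headers["x-ratelimit-remaining-requests"])
    return remaining <= threshold * limit

# With a 1000 req/min limit, the alert fires at 100 remaining:
print(rate_limit_alert({"x-ratelimit-limit-requests": "1000",
                        "x-ratelimit-remaining-requests": "100"}))  # → True
```

The same pattern applies to x-ratelimit-remaining-tokens for token-based limits.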
Monitoring Anthropic Claude API
Monitoring the Anthropic Claude API follows the same pattern as OpenAI:
API endpoint: https://api.anthropic.com/v1/messages
Simple check: POST a simple prompt with max_tokens=10 (include your x-api-key and anthropic-version headers), expect a response with a "content" array.
Response time: usually 2-8 seconds (faster than OpenAI). If suddenly 20+ sec, alert.
Rate limits: Anthropic also has limits by requests and tokens. Check headers.
Cost: slightly cheaper than GPT-4, but still need monitoring.
Monitoring Custom/Self-Hosted LLMs
If you self-host a model (Llama, Mistral, etc.), monitoring is even more critical because you own the whole stack and there's no provider ops team watching it for you.
Endpoint format: usually OpenAI-compatible API (if using vLLM or Text Generation WebUI). Check POST /v1/chat/completions.
Timeout: depends on model size. Mistral 7B: 1-3 sec. Llama 70B: 10-30 sec. Set timeout with margin.
Memory monitoring: large models can OOM. Monitor VRAM on host — if near 100%, model will crash soon.
Concurrency: how many simultaneous requests can it handle? More traffic = slower responses due to queuing.
Keyword validation: very important. Self-hosted model may be uncalibrated and return useless answers. Validate response contains expected text.
AtomPing AI Agent Probe
AtomPing has built-in AI Agent Probe check type for monitoring AI endpoints. Instead of writing custom scripts:
Provider: choose Anthropic, OpenAI, or custom endpoint.
Endpoint URL: if custom, provide address.
API Key: encrypted and stored securely. Never logged, only used for checks.
Prompt: simple prompt for checking. Example: "What is 2+2?"
Expected response: keywords for validation. Example: "4" or "four".
Model & tokens: which model (gpt-4, claude-3, etc.), max_tokens for cost control.
Frequency: how often to check. Recommended 10+ minutes for cost savings.
Results: response time, token count, cost per check, response validation. All in one dashboard.
Cost Tracking
If using expensive models (GPT-4, claude-opus), track costs. AtomPing AI Cost Monitoring integrates with OpenAI and Anthropic APIs to track usage in real-time.
Setup: provide read-only API key from OpenAI/Anthropic, AtomPing pulls usage every 15 minutes. Never sends requests on your behalf, only reads usage.
Tracking: see breakdown by model (gpt-4, gpt-3.5, claude-opus, etc.), by day, by week.
Alerts: set thresholds ($100/day, $500/week) and alert if exceeded. Catch runaway costs BEFORE surprise billing.
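The threshold logic above can be sketched as a small check. The $100/day and $500/week defaults come from the example in the text; the function name and return shape are assumptions:

```python
def cost_alerts(daily_spend: float, weekly_spend: float,
                daily_limit: float = 100.0, weekly_limit: float = 500.0) -> list[str]:
    """Return a message for each spend threshold that has been breached."""
    alerts = []
    if daily_spend > daily_limit:
        alerts.append(f"daily spend ${daily_spend:.2f} exceeds ${daily_limit:.2f}")
    if weekly_spend > weekly_limit:
        alerts.append(f"weekly spend ${weekly_spend:.2f} exceeds ${weekly_limit:.2f}")
    return alerts

# A runaway day trips the daily alert even if the week is still under budget:
print(cost_alerts(120.0, 300.0))  # → ['daily spend $120.00 exceeds $100.00']
```

Run this against usage pulled from the provider's billing API and route any non-empty result to your alert channel.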
Multiple providers: if using OpenAI and Anthropic, see total costs and breakdown.
Response Time Baselines
When monitoring AI endpoints, know what's normal to detect abnormal.
OpenAI GPT-4 (streaming): 500ms-2s to first token, then ~50ms per token. Total for 100-token response: 5-10 seconds.
OpenAI GPT-3.5: 200ms-1s to first token. Total: 2-5 seconds per response.
Anthropic Claude: 500ms-2s to first token. Total: 3-8 seconds per response.
Self-hosted small (7B): 100-500ms. Total: 1-3 seconds.
Self-hosted large (70B+): 2-5s to first token. Total: 10-30 seconds.
Red flags: if response time is suddenly 3x the baseline, the API is overloaded or there's a network issue. Alert on 50% deviation from baseline.
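The 50% deviation rule above reduces to a one-line comparison. This is a sketch; the function name and default are assumptions:

```python
def deviation_alert(current_s: float, baseline_s: float,
                    max_deviation: float = 0.5) -> bool:
    """True when response time deviates more than max_deviation (50%) from baseline."""
    return current_s > baseline_s * (1 + max_deviation)

# Against a 5s GPT-4 baseline: 6s is fine, 8s trips the alert, 15s (3x) definitely does.
print(deviation_alert(6.0, 5.0))   # → False
print(deviation_alert(15.0, 5.0))  # → True
```

In practice the baseline should be a rolling median of recent checks rather than a fixed number, so gradual drift updates it while sudden spikes still alert.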
Alerting Strategy
Critical alerts: API completely down (returns 5xx or doesn't respond), or keyword validation fails (response lacks expected text). Send to Slack/email immediately.
Warning alerts: response time > 2x baseline, or approaching rate limit. Send to Slack, not urgent.
Info alerts: cost tracking (daily costs breached threshold), or provider status updates. Daily email digest.
Mute during maintenance: OpenAI sometimes does scheduled maintenance. Can temporarily disable alerts or reduce sensitivity.
Monitor Provider Status Page
OpenAI status: status.openai.com
Anthropic status: status.anthropic.com
Most support RSS feeds or webhooks. Monitoring provider status separately from health checks helps because:
Planned maintenance: provider warns in advance ("Maintenance Sunday 10pm UTC"). Alert customers and disable alerts for that time.
Degradation: "API responding slowly due to high load". Health check may pass (API responds), but status page says problem exists.
Regional issues: "Issues affecting users in EU". Health check in US may pass, but EU users suffer.
Practical Example: GPT-4 Monitoring
Setup in AtomPing:
Check type: AI Agent Probe
Provider: OpenAI
Model: gpt-3.5-turbo (cheaper for health checks)
Prompt: "Respond with OK"
Expected response: "OK"
Max tokens: 5
Timeout: 30 seconds
Frequency: every 15 minutes
Alerts: Slack #ai-team when fails
Cost: roughly $0.001 per check (gpt-3.5, 5 output tokens). 4 checks/hour, 96/day = ~$0.10/day = ~$3/month on monitoring. A small price to catch an outage before your users do.
Related Articles
API Monitoring Guide — monitor regular API endpoints
Response Time Monitoring — detect performance degradation
Health Check Endpoint Design — create good health checks
Complete Uptime Monitoring Guide — monitoring fundamentals
Webhook Monitoring — monitor async callbacks
AI Cost Monitoring — track API spending
FAQ
What's the difference between monitoring OpenAI API availability vs rate limits?
Availability: is the endpoint up and responding? Rate limits: how many requests can you make per minute/hour/month? Both matter. Endpoint can be up but rate-limited (429 response), meaning your application stops working. Monitor both: health check for availability, and token counting/cost tracking for rate limit warnings.
How do I monitor custom/self-hosted LLM endpoints?
Same as any API: HTTP health check with keyword validation. Send a simple prompt (e.g., 'What is 2+2?'), validate response contains expected text (e.g., 'four' or '4'). Check response time (LLM endpoints are slow—expect 500ms to 30s depending on model size). Monitor at 5-10 minute intervals to catch crashes without overwhelming the endpoint.
Why does my AI API check fail intermittently?
Three common causes: (1) Rate limiting: the service is up but rejecting requests due to quota. (2) Model loading: large models take 10-30s to load, timing out your check. (3) Regional latency: network paths from some regions to the API endpoint are slow. Use longer timeouts (30s+), implement exponential backoff, and monitor from multiple regions to detect patterns.
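The exponential backoff mentioned above can be sketched as a small retry wrapper. The function is a generic example (any zero-argument callable that raises on failure), not a specific library API:

```python
import time

def call_with_backoff(fn, max_attempts: int = 4, base_delay_s: float = 1.0):
    """Retry fn() with exponential backoff (1s, 2s, 4s, ...) before giving up.

    Distinguishes a transient blip (one failed call) from a real outage
    (every retry exhausted), so intermittent failures don't page anyone.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # all retries exhausted: this is a real failure
            time.sleep(base_delay_s * (2 ** attempt))
```

Usage: wrap your health-check call, e.g. `call_with_backoff(openai_health_check)`; a check that fails once due to a 429 but succeeds on retry never fires an alert.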
How do I validate that an LLM actually processed my prompt?
Keyword-based response validation: send a test prompt with a unique word or phrase, check that response contains it. Example: send prompt 'Respond with TESTWORD12345', validate response contains 'TESTWORD12345'. This ensures the model actually processed your request, not just returned a cached/default response.
What's a normal response time for different LLM API endpoints?
OpenAI GPT-4: 2-10 seconds per response. Anthropic Claude: 2-8 seconds. Self-hosted small models (3-7B): 500ms-3s. Self-hosted large models (30B+): 10-30s. These are end-to-end response times, not time to first token. Set your timeout and alert threshold accordingly. If Claude suddenly takes 30s, something is wrong: alert.
Should I monitor the status page of my LLM provider separately?
Yes. OpenAI status page (status.openai.com), Anthropic status page, etc. These are public and update when there are issues. Monitor the public status page via RSS feed or webhook, so your team sees outages even if your application doesn't try to call the API. Many SaaS apps monitor their providers' status pages independently of their own health checks.