Your database backup was supposed to run every night at 2:00 AM. It hasn't run for the last 17 days. Nobody noticed because cron jobs don't scream when they break — they just silently stop working. You only find out when you need to restore from backup and the last valid one is three weeks old.
This isn't hypothetical. Cron jobs are the invisible workhorses of infrastructure: backups, queue processing, email delivery, data synchronization, log rotation. When they work, you don't think about them. When they stop, the consequences are catastrophic.
Why cron jobs fail silently
Standard uptime monitoring won't help. It checks if a server responds — but cron jobs don't respond to requests. They run on schedule, do their work, and exit. If they don't run, nothing happens. Literally: no process, no error, no log, no alert. Silence.
Common reasons cron jobs stop working:
Server restarted — the crontab loaded, but script paths changed or environment variables didn't.
Disk full — the cron daemon can't create its log file and silently skips the task.
Script fails with an error — Python throws an exception, bash hits an error. Cron ran the task, but it exited with a non-zero code. By default cron emails the error — but who configures an MTA on servers in 2026?
Previous run hasn't finished — a task is supposed to run every 5 minutes but takes 7. Without flock or a similar lock, a second copy starts, the two copies interfere, and both fail.
Someone edited the crontab — while updating one entry, they accidentally deleted another. Or a deployment overwrote the crontab and dropped an important task.
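The overlapping-runs failure is the one flock guards against. The same idea can be sketched in Python with fcntl.flock (Unix only; the function name and lock path here are illustrative, not a specific tool's API):

```python
import fcntl

def run_exclusively(lock_path, task):
    """Run task only if no other copy holds the lock; skip otherwise."""
    lock_file = open(lock_path, "w")
    try:
        # Non-blocking exclusive lock: raises BlockingIOError if already held
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return False  # previous run still in progress, skip this one
    try:
        task()
        return True
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
        lock_file.close()
```

The shell equivalent is a one-liner in the crontab itself: wrap the command in `flock -n /tmp/job.lock ...` so a second copy exits immediately instead of piling up.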
Heartbeat monitoring: the "dead man's switch" principle
The concept is simple and elegant. Instead of actively checking the cron job (technically hard — how do you check a process that runs for 2 seconds once an hour?), you flip the model: the cron job reports that it's alive.
How it works:
1. Create a heartbeat monitor with expected interval (e.g., "every 60 minutes").
2. Get a unique ping URL (e.g., https://atomping.com/heartbeat/abc123).
3. Add curl of this URL to the end of your cron script.
4. If the ping arrives on time, all good. If it doesn't, the heartbeat monitor opens an incident and sends an alert.
It's called "dead man's switch" — like the mechanism on trains: if the operator stops pressing a button every N seconds, the train stops automatically. Your cron job is the operator. Heartbeat URL is the button. Monitoring is the automatic brake.
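The automatic-brake side of this switch is just a time comparison. A sketch of the detection logic (a simplification; is_overdue and its parameters are illustrative names, not AtomPing's actual API):

```python
from datetime import datetime, timedelta, timezone

def is_overdue(last_ping, interval_minutes, grace_minutes, now=None):
    """Dead man's switch check: no ping within interval + grace means an incident.

    last_ping and now must use the same timezone convention.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    deadline = last_ping + timedelta(minutes=interval_minutes + grace_minutes)
    return now > deadline
```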
What to monitor: top cron jobs that need attention
Backups
Absolute top priority. A broken backup is worse than no backup because you think you have one. Monitor not just the start, but completion: add ping at the very end of the script, after verifying backup file size. If the script crashes halfway, ping doesn't go out.
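One way to implement "verify before ping" is a size check between the dump and the curl call (a sketch; the 1 KB threshold is an arbitrary example, tune it to your dump size):

```python
import os

MIN_BACKUP_BYTES = 1024  # assumption: a real dump is never this small, tune to yours

def backup_looks_valid(backup_path, min_bytes=MIN_BACKUP_BYTES):
    """Run this check before pinging: the file exists and has a plausible size."""
    return os.path.exists(backup_path) and os.path.getsize(backup_path) >= min_bytes
```

In the backup script the ping then becomes: if backup_looks_valid("/backups/db.sql"): requests.get(ping_url, timeout=10). If the dump is missing or truncated, no ping goes out and the alert fires.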
Queue and email processing
Celery beat, sidekiq, bull — scheduled workers that process background tasks. If they stop, emails don't send, orders don't process, notifications don't arrive. Heartbeat on each worker process is minimal insurance.
Data synchronization
ETL pipelines, CRM-to-database sync, external API imports. If a sync task stops working, data diverges — and you find out when the financial report shows the wrong numbers.
Cleanup and rotation
Logrotate, /tmp cleanup, old session deletion, Docker image pruning. Seems unimportant until disk fills to 100% and everything breaks. Heartbeat check on cleanup task is 30 seconds of setup that prevents disk full incidents.
SSL renewal
Certbot renewal is also a cron job. If it stops working, in 30-60 days your SSL certificate expires. Double protection: heartbeat on certbot cron + SSL monitoring for tracking expiry date.
Adding heartbeat monitoring to existing cron jobs
Integration takes literally one line. Here are examples for different scenarios:
Simple bash script
#!/bin/bash
# Your task
pg_dump mydb > /backups/db-$(date +%F).sql
# Monitoring ping — only if backup succeeds
if [ $? -eq 0 ]; then curl -fsS --retry 3 https://atomping.com/heartbeat/YOUR-ID > /dev/null; fi
Python script
import requests

def main():
    # ... your logic ...
    process_data()
    # Ping after successful completion
    requests.get("https://atomping.com/heartbeat/YOUR-ID", timeout=10)

if __name__ == "__main__":
    main()
Directly in crontab
0 2 * * * /scripts/backup.sh && curl -fsS https://atomping.com/heartbeat/YOUR-ID > /dev/null
Key point: && ensures the ping is sent only if the previous command succeeded. If backup.sh exits with a non-zero code, curl doesn't run, the monitor never gets its ping, and it sends an alert.
Grace period: accounting for reality
Cron jobs don't run with second-level precision. A backup that usually takes 30 seconds might take 5 minutes under heavy disk load. ETL task might stretch due to large data volume. Set grace period too short and you get false alarms. Too long and you discover problems too late.
Rule: grace period = 2x maximum observed task execution time. If backup usually takes 1 minute but sometimes stretches to 3, set grace period to 6 minutes. If task runs every hour, grace period of 10-15 minutes is usually sufficient.
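The rule is trivial to encode, which also makes it easy to recompute as your runtimes drift:

```python
def grace_period_minutes(observed_runtimes_minutes):
    """Grace period = 2x the maximum observed execution time (the rule above)."""
    return 2 * max(observed_runtimes_minutes)
```

For the backup example: a task that usually takes 1 minute but has been seen taking 3 gives grace_period_minutes([1, 1, 3]) == 6.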
Monitoring patterns for different architectures
One server, one crontab
Simplest case. One heartbeat per critical task. Everything is transparent: if the ping doesn't arrive, the task didn't run on this server.
Multiple servers, one task (HA cron)
Three servers, but task should run on only one (leader election via Redis lock, consul, or simply one designated server). Create one heartbeat that only the server actually running the task pings. If leader fails and failover doesn't happen, no one sends the ping.
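The pattern can be sketched with the lock injected as a callable; in practice try_acquire_lock would be a Redis SET NX with a TTL or a consul lock, as mentioned above (all names here are illustrative):

```python
def run_on_leader(try_acquire_lock, task, send_ping):
    """HA-cron pattern: only the node holding the lock runs the task and pings."""
    if not try_acquire_lock():
        return False  # not the leader, stay silent
    task()            # an exception here means no ping, so the alert fires
    send_ping()
    return True
```

Because followers stay silent, a failed failover looks exactly like a failed task: no ping arrives and the single monitor opens an incident.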
Celery Beat / Sidekiq / Bull scheduler
For task schedulers in frameworks the approach is the same: add an HTTP call to the heartbeat URL at the end of the task. In Celery this can be a separate task that runs after the main one via chain or chord. In Sidekiq, send the ping at the end of the job's perform method or from server middleware (the after_perform callback belongs to ActiveJob, not Sidekiq itself).
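A framework-agnostic way to express "ping at the end of the task" is a decorator that can wrap a Celery task body, a job's perform logic, or any scheduled function (send_ping stands in for the curl/requests call):

```python
import functools

def heartbeat(send_ping):
    """Decorator: ping only when the wrapped task completes without raising."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)  # an exception here skips the ping
            send_ping()
            return result
        return wrapper
    return decorator
```

The try-nothing design is deliberate: exceptions propagate to the scheduler as before, and the missing ping is what triggers the alert.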
Kubernetes CronJob
A Kubernetes CronJob creates a Pod on schedule. Problem: the Pod might be created but never run (ImagePullBackOff, resource limits), or start and crash (OOMKilled). A heartbeat at the end of the container command is the only reliable way to know the task didn't just start but completed successfully.
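A sketch of such a CronJob manifest (the name, image, and script path are placeholders):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: my-backup-image  # placeholder
              command:
                - /bin/sh
                - -c
                # ping fires only if backup.sh succeeds; a Pod that
                # never runs or gets OOMKilled sends nothing
                - /scripts/backup.sh && curl -fsS --retry 3 https://atomping.com/heartbeat/YOUR-ID
```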
Common mistakes in cron job monitoring
Ping at the beginning of the script. If the ping fires before the main logic, monitoring reports "all good" even if the task crashes with an error seconds later. Always put the ping at the very end, after all checks.
Unconditional ping. backup.sh; curl heartbeat (with a semicolon) sends the ping even if backup.sh fails. Use && so the ping only goes out on success.
No retry on the ping. A network hiccup makes curl fail, and monitoring raises a false alert. Add --retry 3 to the curl command.
Monitoring startup, not completion. Knowing the cron daemon started the task is useful but insufficient. The task could hang, crash halfway, or run forever. Monitor successful completion.
Getting started
Right now, open a terminal and run crontab -l on your servers. List all the tasks. Mark the critical ones — those where failure costs money, data, or user trust. For each critical task, create a heartbeat monitor in AtomPing and add one curl line at the end of the script.
Total setup is 5 minutes per task. And peace of mind that your backups, sync, and cleanup actually work is priceless. Because the worst time to discover a backup isn't running is when you need to restore from it.
FAQ
What is cron job monitoring?
Cron job monitoring (also called heartbeat monitoring) tracks whether your scheduled tasks actually run on time. Instead of checking a URL, it works in reverse: your cron job sends a ping to a monitoring endpoint after completing. If the ping doesn't arrive within the expected window, the monitor alerts you that the job failed or is late.
How does heartbeat monitoring work?
You create a heartbeat monitor with an expected interval (e.g., every 5 minutes). The monitor generates a unique URL. You add a curl/wget call to that URL at the end of your cron job script. The monitoring service tracks when pings arrive. If no ping arrives within the expected interval plus a grace period, it triggers an alert.
What's the difference between cron monitoring and uptime monitoring?
Uptime monitoring is active — it sends requests to your servers. Cron monitoring is passive — it waits for your jobs to check in. Uptime monitoring answers 'is my server reachable?' while cron monitoring answers 'did my scheduled task actually run?' They complement each other and cover different failure modes.
Can I monitor cron jobs that run on multiple servers?
Yes. Create one heartbeat monitor per server, or use a single monitor for the job and have only one server ping it (the leader). For redundant cron setups where any server can run the job, use a single monitor — the first server that completes the job sends the ping.
What cron jobs should I monitor?
Any cron job whose failure impacts your business or data integrity. Priority targets: database backups, payment processing, email sending queues, report generation, cache warming, data sync between systems, SSL certificate renewal, log rotation, and cleanup tasks. If you'd only find out the job stopped days later — it needs monitoring.
How do I add monitoring to an existing cron job?
Add a single line at the end of your cron script: curl -fsS --retry 3 https://atomping.com/heartbeat/YOUR-ID > /dev/null. This sends a silent HTTP request to the monitoring endpoint after your job completes. If the job fails or hangs before reaching this line, no ping is sent, and the monitor alerts you.