Your site went down at 2:47 AM. A customer tweeted about it at 6:15 AM. You found out at 9:30 AM, from the tweet.
That gap — between the moment something breaks and the moment you know — is the entire reason uptime monitoring exists. This guide covers how it actually works under the hood, the false-positive traps that make naive monitors cry wolf, and what a public status page does for the trust equation. It's written for developers shipping side projects and SaaS apps, not for SRE teams with a PagerDuty rotation.
Uptime monitoring is an external service requesting your site on a fixed interval, recording whether it responded, how fast, and with what status code — and alerting you when the answer changes.
The key word is external. Your own infrastructure can't reliably report its own death. A health check running on the same box, the same cluster, or even the same cloud region as your app shares its failure modes. Monitoring has to live outside.
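A good monitor records four things per check: whether the site responded, how long the response took, the HTTP status code, and the error class when the check failed.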
That last one matters more than people expect. "Your site is down" is barely actionable. "DNS resolution failed" points at your registrar or nameserver change from yesterday. "TLS chain incomplete" points at a botched certificate renewal. "Connection refused" points at a dead process or a firewall rule. The error class is the first clue of the postmortem, captured automatically.
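As a rough illustration of how an error class might be captured, here is a minimal Python sketch that maps the exception raised by a failed probe to a coarse label. The function name `classify_error` and the label strings are hypothetical, not taken from any particular monitoring tool; the exception types are from the Python standard library.

```python
import socket
import ssl


def classify_error(exc: BaseException) -> str:
    """Map a probe exception to a coarse error class (labels are illustrative)."""
    # urllib wraps low-level failures in URLError; the real cause sits in .reason.
    reason = getattr(exc, "reason", None)
    if isinstance(reason, BaseException):
        exc = reason
    # Certificate verification errors subclass ssl.SSLError, so one check covers both.
    if isinstance(exc, ssl.SSLError):
        return "tls_error"
    # Name resolution failures from getaddrinfo surface as socket.gaierror.
    if isinstance(exc, socket.gaierror):
        return "dns_failure"
    if isinstance(exc, ConnectionRefusedError):
        return "connection_refused"
    # socket.timeout is an alias of TimeoutError on Python 3.10+; list both for older versions.
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout"
    return "unknown"
```

Storing this label next to each failed check is what lets an incident report say "DNS resolution failed" instead of just "down".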
Monitors typically probe somewhere between every 30 seconds and every 5 minutes. The trade-off is simple: detection latency versus noise and load.
At a 5-minute interval, a 4-minute outage can pass completely undetected — and your measured downtime is wrong by up to 5 minutes on each edge of every incident. At 60-second checks, detection lag is bounded by about a minute, short blips actually register, and the load on your site is negligible: one request per minute is 1,440 requests a day, less than a single crawler bot.
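The arithmetic above is easy to check. This is a small sketch with hypothetical helper names; it assumes evenly spaced probes and ignores request duration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def checks_per_day(interval_seconds: int) -> int:
    """Number of probes sent per day at a fixed interval."""
    return SECONDS_PER_DAY // interval_seconds


def can_miss_outage(outage_seconds: int, interval_seconds: int) -> bool:
    """An outage shorter than the interval can fall entirely between two probes."""
    return outage_seconds < interval_seconds


print(checks_per_day(60))         # probes per day at 60-second checks
print(can_miss_outage(240, 300))  # 4-minute outage, 5-minute interval
print(can_miss_outage(240, 60))   # same outage, 60-second interval
```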
Sub-30-second intervals are mostly theater for typical web apps. Your alerting pipeline (email, Slack) adds tens of seconds anyway, and the fix takes minutes at minimum. Spend the engineering budget on better failure classification instead.
A detail worth checking in any monitor: what HTTP method does it use? Well-behaved monitors send a HEAD request first — it returns headers only, costing your server almost nothing — and fall back to GET when a server answers HEAD with 405 or 501 (some frameworks and CDNs do).
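The HEAD-then-GET behavior can be sketched in a few lines of Python. This is an illustration under assumptions, not any specific monitor's implementation: `probe_status` and `urllib_send` are hypothetical names, and the 10-second timeout is just a reasonable default. The sender is passed in as a function so the fallback logic can be exercised without touching the network.

```python
import urllib.error
import urllib.request
from typing import Callable, Tuple

# Statuses that suggest the server does not support HEAD.
HEAD_UNSUPPORTED = {405, 501}


def probe_status(url: str, send: Callable[[str, str], int]) -> Tuple[str, int]:
    """Return (method_used, status_code), trying HEAD before falling back to GET."""
    status = send("HEAD", url)
    if status in HEAD_UNSUPPORTED:
        return "GET", send("GET", url)
    return "HEAD", status


def urllib_send(method: str, url: str) -> int:
    """One possible sender built on the standard library."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still available.
        return err.code
```

A call such as `probe_status("https://example.com", urllib_send)` would perform a real request. Transport failures (DNS, TLS, timeouts) propagate as exceptions and are left for the caller to classify.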
The fastest way to make a monitor useless is letting it lie. Two classic traps:
Your WAF rate-limits an aggressive client and returns 429. Your login-protected staging area returns 401. Cloudflare challenges a datacenter IP with 403. None of these mean your site is down — they mean it's up and making decisions. The server is alive, terminating TLS, running logic.
A monitor tuned for signal treats 4xx responses as UP and reserves DOWN for what actually indicates an outage: 5xx responses and transport-level failures (DNS, TLS, connection refused, timeout). If your monitoring tool pages you because your own bot-protection blocked its probe, you'll mute it within a week — and then miss the real outage.
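Expressed as code, the rule is short. The sketch below uses a hypothetical `is_up` function and assumes the probe reports either an HTTP status code or a transport error label, never neither.

```python
from typing import Optional


def is_up(status_code: Optional[int], transport_error: Optional[str] = None) -> bool:
    """Treat any HTTP response below 500 as UP; 5xx and transport failures are DOWN."""
    if transport_error is not None or status_code is None:
        # DNS, TLS, connection refused, timeout: no HTTP response was received.
        return False
    # 401, 403, and 429 mean the server is alive and responding deliberately.
    return status_code < 500
```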
Networks blip. A single dropped packet between one probe location and your host is not an incident. The standard fix is a consecutive-failure threshold: a site is only marked down after N failed checks in a row (2 is a sane default; at 60-second intervals, that's confirmation within ~2 minutes). Recovery, by contrast, should flip on the first successful check, because a false "recovered" self-corrects within a check or two, while a slow one keeps the incident open after the site is back and overstates your downtime.
This is a state machine, and it's worth being precise about: UP → DOWN requires sustained evidence; DOWN → UP requires one good probe. Each transition opens or closes an incident with a start time, end time, duration, and the last observed error class. Incidents — not raw checks — are what you alert on, report on, and show on a status page.
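A minimal sketch of that state machine in Python follows. The class and field names are hypothetical, and one design choice is an assumption not stated above: the incident start is backdated to the first failed check of the streak rather than the check that confirmed the outage.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    started_at: float
    last_error: str
    ended_at: Optional[float] = None

    @property
    def duration(self) -> Optional[float]:
        return None if self.ended_at is None else self.ended_at - self.started_at


@dataclass
class Monitor:
    threshold: int = 2  # consecutive failures required before UP -> DOWN
    state: str = "UP"
    failures: int = 0
    first_failure_at: Optional[float] = None
    incidents: List[Incident] = field(default_factory=list)

    def record(self, ok: bool, at: float, error: str = "") -> Optional[str]:
        """Feed one check result; return "down" or "recovered" on a transition."""
        if ok:
            self.failures = 0
            self.first_failure_at = None
            if self.state == "DOWN":
                # DOWN -> UP on the first good probe; close the open incident.
                self.state = "UP"
                self.incidents[-1].ended_at = at
                return "recovered"
            return None
        self.failures += 1
        if self.first_failure_at is None:
            self.first_failure_at = at
        if self.state == "DOWN":
            # Already down: keep the most recent error class on the incident.
            self.incidents[-1].last_error = error
            return None
        if self.failures >= self.threshold:
            # UP -> DOWN only after sustained evidence; open an incident.
            self.state = "DOWN"
            self.incidents.append(Incident(self.first_failure_at, error))
            return "down"
        return None
```

The return value of `record` is the natural hook for alerting: a caller sends a notification only when it gets `"down"` or `"recovered"`, never on an individual failed check.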
The alert email everyone designs first: "🔴 yoursite.com is DOWN." The alert everyone forgets: "🟢 yoursite.com recovered after 14 minutes."
Recovery alerts matter for three reasons. They close the loop when you're mid-firefight (no more manual refreshing). They give you the incident duration for the postmortem without log spelunking. And they're the difference between "monitoring as anxiety" and "monitoring as a record."
Beyond down/recovered, resist the urge to add more alert types. Latency-degradation warnings and certificate-expiry notices are useful signals, but they belong in dashboards and digests, not in the same channel as "the site is unreachable." Alert fatigue is the leading cause of death for monitoring setups.
A status page is your uptime data, published. At minimum: current state, a response-time figure, a 90-day uptime history, and a log of recent incidents with durations.
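It feels counterintuitive to publish your failures. In practice it works the other way: a status page cuts support volume during incidents, gives procurement the reliability evidence they ask for, and shows skeptics a visible track record instead of a marketing claim.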
The numbers on a status page come from daily rollups: checks performed, checks failed, average latency per day. Showing per-day bars for 90 days, with 7-day and 30-day uptime percentages, is the industry-standard presentation because it answers both "is it down now?" and "is this thing generally reliable?" in one glance.
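To make the rollup idea concrete, here is a small Python sketch that computes an uptime percentage from per-day records. The `DailyRollup` shape and the function name are hypothetical, and the calculation is check-based (successful checks over total checks), which is one common convention rather than the only one.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DailyRollup:
    checks: int            # checks performed that day
    failed: int            # checks that failed that day
    avg_latency_ms: float  # average response time that day


def uptime_percent(days: Sequence[DailyRollup]) -> Optional[float]:
    """Share of successful checks across the given days, or None with no data."""
    total = sum(day.checks for day in days)
    if total == 0:
        return None
    failed = sum(day.failed for day in days)
    return 100.0 * (total - failed) / total


# Oldest day first: six clean days, then a day with 14 failed one-minute checks.
history = [DailyRollup(1440, 0, 120.0)] * 6 + [DailyRollup(1440, 14, 180.0)]
last_7_days = uptime_percent(history[-7:])
last_30_days = uptime_percent(history[-30:])
```

Slicing the same list with `[-7:]`, `[-30:]`, or `[-90:]` gives the windows a status page typically shows, and each individual rollup drives one bar in the per-day chart.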
The DIY version is a cron job hitting your URL plus a notification script — a fun afternoon, and most of them die from the problems above: they run inside your infrastructure, they alert on single failures, they treat 429 as an outage, they have no incident model, no history, no status page, and they stop working silently when the cron box has a bad day. Monitoring your monitor is a real problem; the practical answer is using infrastructure that isn't yours.
We built uptime monitoring into CheckVibe because security scanning and availability monitoring answer the same customer question — "is my site okay?" — and the monitoring product was already watching for security regressions on a schedule. The uptime layer adds:
HEAD with automatic GET fallback, 10-second timeout
It rides along with the uptime monitoring check in your project dashboard, next to the security, performance, and domain health panels: one place instead of four tools.
Run a free scan to see the rest of the platform; uptime monitoring switches on per project in the dashboard.
Every 60 seconds is the practical sweet spot for web apps. Five-minute intervals can miss short outages entirely and blur your downtime math; sub-30-second intervals add noise and load without changing your real-world response time, since alert delivery and human reaction dominate.
Transport failures (DNS resolution, TLS handshake errors, connection refused, timeouts) and HTTP 5xx responses. Well-designed monitors treat 4xx — including 401, 403, and 429 — as up, because the server is alive and responding deliberately; counting WAF challenges as outages is the most common source of false alarms.
99.9% — under 45 minutes of downtime a month — is achievable on modern hosting without heroics and is what most customers implicitly expect. Chasing 99.99% as a small team usually costs more in architecture than it returns; communicating honestly via a status page returns more trust per hour invested.
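The downtime budget behind those percentages is simple arithmetic. This sketch assumes a 30-day month and uses a hypothetical function name.

```python
def downtime_budget_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed over the period at a given uptime target."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


three_nines = downtime_budget_minutes(99.9)   # about 43.2 minutes per 30 days
four_nines = downtime_budget_minutes(99.99)   # about 4.3 minutes per 30 days
```

At 60-second checks with a two-failure threshold, a single confirmed incident already consumes a large part of a 99.99% monthly budget, which is part of why that target is expensive for a small team.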
If anyone pays you, yes. It cuts support volume during incidents, gives procurement the reliability evidence they ask for, and a visible 90-day track record converts skeptics better than a marketing claim. The main objection — "it shows my failures" — inverts in practice: hiding failures is what erodes trust.
A health check inside your own infrastructure shares failure modes with your app — region outages, DNS misconfigurations, expired certificates, and full-host failures all take the monitor down with the site. External probing is the entire point of uptime monitoring.