Website Uptime Monitoring: A Practical Guide

Your site went down at 2:47 AM. A customer tweeted about it at 6:15 AM. You found out at 9:30 AM, from the tweet.

That gap — between the moment something breaks and the moment you know — is the entire reason uptime monitoring exists. This guide covers how it actually works under the hood, the false-positive traps that make naive monitors cry wolf, and what a public status page does for the trust equation. It's written for developers shipping side projects and SaaS apps, not for SRE teams with a PagerDuty rotation.

What uptime monitoring actually is

Uptime monitoring is an external service requesting your site on a fixed interval, recording whether it responded, how fast, and with what status code — and alerting you when the answer changes.

The key word is external. Your own infrastructure can't reliably report its own death. A health check running on the same box, the same cluster, or even the same cloud region as your app shares its failure modes. Monitoring has to live outside.

A good monitor records four things per check:

Reachability — did the request complete at all?
Status code — what did the server say?
Latency — how long did it take?
Error class — if it failed, how? DNS failure, TLS error, connection refused, timeout, and HTTP 5xx are very different incidents with very different fixes.

That last one matters more than people expect. "Your site is down" is barely actionable. "DNS resolution failed" points at your registrar or nameserver change from yesterday. "TLS chain incomplete" points at a botched certificate renewal. "Connection refused" points at a dead process or a firewall rule. The error class is the first clue of the postmortem, captured automatically.
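As a rough illustration of capturing the error class automatically, a probe can map the exception it caught to a short label. This is a minimal sketch using Python's standard exception types; the function name `classify_error` and the label strings are assumptions made for the example, not any particular monitor's API.

```python
import socket
import ssl
import urllib.error


def classify_error(exc: BaseException) -> str:
    """Map a probe failure to a coarse error class (labels are hypothetical)."""
    # An HTTP error response carries a status code; only 5xx signals an outage here.
    if isinstance(exc, urllib.error.HTTPError):
        return "http_5xx" if exc.code >= 500 else "http_other"
    # urllib wraps transport failures in URLError; unwrap to the underlying cause.
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, BaseException):
        exc = exc.reason
    if isinstance(exc, socket.gaierror):
        return "dns"
    if isinstance(exc, ssl.SSLError):
        return "tls"
    if isinstance(exc, ConnectionRefusedError):
        return "connection_refused"
    # socket.timeout is a separate class before Python 3.10, so check both.
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout"
    return "unknown"
```

Storing the returned label with each failed check is what lets an incident later report its last observed error class.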

Check frequency: why every minute is the sweet spot

Monitors typically probe somewhere between every 30 seconds and every 5 minutes. The trade-off is simple: detection latency versus noise and load.

At a 5-minute interval, a 4-minute outage can pass completely undetected — and your measured downtime is wrong by up to 5 minutes on each edge of every incident. At 60-second checks, detection lag is bounded by about a minute, short blips actually register, and the load on your site is negligible: one request per minute is 1,440 requests a day, less than a single crawler bot.
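The arithmetic behind those two claims can be checked with a small sketch. It assumes an idealized monitor that probes at exact multiples of the interval and sees an outage only if a probe lands inside it; the function names are made up for the example.

```python
import math


def requests_per_day(interval_s: int) -> int:
    """Number of probes sent in 24 hours at a fixed interval."""
    return 86_400 // interval_s


def outage_is_seen(start_s: int, duration_s: int, interval_s: int) -> bool:
    """True if a probe (fired at 0, interval, 2*interval, ...) lands inside the outage."""
    first_probe_at_or_after_start = math.ceil(start_s / interval_s) * interval_s
    return first_probe_at_or_after_start < start_s + duration_s


# A 4-minute outage starting 30 seconds after a probe:
missed_at_5_min = not outage_is_seen(start_s=30, duration_s=240, interval_s=300)
caught_at_1_min = outage_is_seen(start_s=30, duration_s=240, interval_s=60)
```

In this model the 5-minute monitor misses the outage and the 60-second monitor catches it. Real monitors also have probe timeouts and failure thresholds, which this sketch ignores.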

Sub-30-second intervals are mostly theater for typical web apps. Your alerting pipeline (email, Slack) adds tens of seconds anyway, and the fix takes minutes at minimum. Spend the engineering budget on better failure classification instead.

A detail worth checking in any monitor: what HTTP method does it use? Well-behaved monitors send a HEAD request first — it returns headers only, costing your server almost nothing — and fall back to GET when a server answers HEAD with 405 or 501 (some frameworks and CDNs do).
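A HEAD-first probe with GET fallback might look like the following sketch, built on Python's standard `urllib`. The names `send_request` and `probe_status` and the 10-second default timeout are assumptions for illustration; the sender is passed in as a parameter so the fallback logic can be exercised without touching the network.

```python
import urllib.error
import urllib.request


def send_request(url: str, method: str, timeout: float = 10.0) -> int:
    """Send one request and return the HTTP status code.

    Transport failures (DNS, TLS, refused, timeout) propagate as exceptions.
    """
    req = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still the answer we want.
        return err.code


def probe_status(url: str, send=send_request, timeout: float = 10.0) -> int:
    """Try HEAD first; retry with GET if the server rejects HEAD."""
    status = send(url, "HEAD", timeout)
    if status in (405, 501):  # Method Not Allowed / Not Implemented
        status = send(url, "GET", timeout)
    return status
```

Calling `probe_status("https://example.com")` would issue a real request; in tests, a fake `send` callable can stand in for the network.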

The false-positive problem

The fastest way to make a monitor useless is letting it lie. Two classic traps:

Trap 1: treating every non-200 as "down"

Your WAF rate-limits an aggressive client and returns 429. Your login-protected staging area returns 401. Cloudflare challenges a datacenter IP with 403. None of these mean your site is down — they mean it's up and making decisions. The server is alive, terminating TLS, running logic.

A monitor tuned for signal treats 4xx responses as UP and reserves DOWN for what actually indicates an outage: 5xx responses and transport-level failures (DNS, TLS, connection refused, timeout). If your monitoring tool pages you because your own bot-protection blocked its probe, you'll mute it within a week — and then miss the real outage.
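The rule above fits in a few lines. This is a sketch, with `verdict` and its parameters as hypothetical names: a check is DOWN only when the transport failed or the server answered with a 5xx.

```python
from typing import Optional


def verdict(status_code: Optional[int], transport_error: Optional[str] = None) -> str:
    """Return "UP" or "DOWN" for a single check result."""
    # No response at all (DNS, TLS, connection refused, timeout) is an outage signal.
    if transport_error is not None or status_code is None:
        return "DOWN"
    # 5xx means the server itself failed; anything below that was a deliberate answer.
    return "DOWN" if status_code >= 500 else "UP"
```

Under this rule a 429 from a rate limiter or a 403 from a bot challenge counts as UP, while a 503 or a DNS failure counts as DOWN.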

Trap 2: alerting on a single failed check

Networks blip. A single dropped packet between one probe location and your host is not an incident. The standard fix is a consecutive-failure threshold: a site is only marked down after N failed checks in a row (2 is a sane default; at 60-second intervals, that's confirmation within about 2 minutes). Recovery, by contrast, should flip on the first successful check, because a false "recovered" corrects itself within the next couple of checks, while a slow one overstates your downtime.

This is a state machine, and it's worth being precise about: UP → DOWN requires sustained evidence; DOWN → UP requires one good probe. Each transition opens or closes an incident with a start time, end time, duration, and the last observed error class. Incidents — not raw checks — are what you alert on, report on, and show on a status page.
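The state machine described above can be sketched as follows. The class and field names are hypothetical, and one detail is an assumption rather than something the text specifies: the incident's start time is taken from the first failed check of the streak, not from the check that crossed the threshold.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    started_at: float
    error_class: str
    ended_at: Optional[float] = None


@dataclass
class Monitor:
    threshold: int = 2                      # consecutive failures needed for UP -> DOWN
    state: str = "UP"
    failures: int = 0
    first_failure_at: Optional[float] = None
    incidents: List[Incident] = field(default_factory=list)

    def record(self, ok: bool, at: float, error_class: str = "") -> Optional[str]:
        """Feed one check result; return "down" or "recovered" on a transition."""
        if ok:
            self.failures = 0
            self.first_failure_at = None
            if self.state == "DOWN":        # one good probe closes the incident
                self.state = "UP"
                self.incidents[-1].ended_at = at
                return "recovered"
            return None
        self.failures += 1
        if self.first_failure_at is None:
            self.first_failure_at = at
        if self.state == "DOWN":            # still down: keep the last observed error class
            self.incidents[-1].error_class = error_class
            return None
        if self.failures >= self.threshold: # sustained evidence: open an incident
            self.state = "DOWN"
            self.incidents.append(Incident(self.first_failure_at, error_class))
            return "down"
        return None
```

Only the two return values, "down" and "recovered", would trigger alerts; a lone failed check followed by a success produces no transition and no incident.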

Alerting: down is half the story

The alert email everyone designs first: "🔴 yoursite.com is DOWN." The alert everyone forgets: "🟢 yoursite.com recovered after 14 minutes."

Recovery alerts matter for three reasons. They close the loop when you're mid-firefight (no more manual refreshing). They give you the incident duration for the postmortem without log spelunking. And they're the difference between "monitoring as anxiety" and "monitoring as a record."

Beyond down/recovered, resist the urge to add more alert types. Latency-degradation warnings and certificate-expiry notices are useful signals, but they belong in dashboards and digests, not in the same channel as "the site is unreachable." Alert fatigue is the leading cause of death for monitoring setups.

Public status pages: monitoring as marketing

A status page is your uptime data, published. At minimum: current state, a response-time figure, a 90-day uptime history, and a log of recent incidents with durations.

It feels counterintuitive to publish your failures. In practice it works the other way:

During an incident, a status page absorbs support load. Users who can see "we know, it's been 6 minutes" don't email you. Users staring at a spinning tab do.
Between incidents, a 90-day green bar with a 99.9% figure is sales collateral. Enterprise procurement asks for exactly this. So do savvy indie customers comparing two tools.
Internally, a public number keeps you honest. Uptime you don't measure is uptime you imagine — and self-reported "we're basically always up" beliefs rarely survive first contact with a monitor.

The numbers on a status page come from daily rollups: checks performed, checks failed, average latency per day. Showing per-day bars for 90 days, with 7-day and 30-day uptime percentages, is the industry-standard presentation because it answers both "is it down now?" and "is this thing generally reliable?" in one glance.
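One plausible way to turn daily rollups into the 7-day and 30-day percentages is to weight by check count, as in this sketch. The `DailyRollup` fields mirror the three quantities named above; the names and the check-weighted formula are assumptions, and a service could equally compute uptime from incident durations.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DailyRollup:
    checks: int            # checks performed that day
    failed: int            # checks that counted as DOWN
    avg_latency_ms: float  # average latency for the day


def uptime_percent(days: Sequence[DailyRollup], window: int) -> Optional[float]:
    """Check-weighted uptime over the most recent `window` days (oldest first)."""
    recent = days[-window:]
    total = sum(d.checks for d in recent)
    if total == 0:
        return None  # no data: better to show nothing than a made-up 100%
    failed = sum(d.failed for d in recent)
    return 100.0 * (total - failed) / total


# 29 clean days, then one day where 144 of 1,440 checks failed.
history = [DailyRollup(1440, 0, 120.0) for _ in range(29)] + [DailyRollup(1440, 144, 310.0)]
seven_day = uptime_percent(history, 7)
thirty_day = uptime_percent(history, 30)
```

The same bad day weighs more heavily on the 7-day figure than on the 30-day one, which is why showing both windows side by side is informative.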

DIY vs. a service

The DIY version is a cron job hitting your URL plus a notification script. It's a fun afternoon, but most of these setups die from the problems above: they run inside your infrastructure, alert on single failures, treat 429 as an outage, have no incident model, history, or status page, and stop working silently when the cron box has a bad day. Monitoring your monitor is a real problem; the practical answer is using infrastructure that isn't yours.

We built uptime monitoring into CheckVibe because security scanning and availability monitoring answer the same customer question — "is my site okay?" — and the monitoring product was already watching for security regressions on a schedule. The uptime layer adds:

Checks every 60 seconds from outside your infrastructure — HEAD with automatic GET fallback, 10-second timeout
A real incident model — consecutive-failure threshold before opening an incident, instant recovery detection, durations and error classes recorded
False-positive resistance — 4xx (including WAF challenges and rate limits) counts as up; only 5xx and transport failures count as down
Down + recovered email alerts with the incident duration and the probable cause class
A public status page per project — live state, 90-day daily history, uptime percentages, recent incidents

It sits in your project dashboard, next to the security, performance, and domain health panels: one place instead of four tools.

Run a free scan to see the rest of the platform; uptime monitoring switches on per project in the dashboard.

FAQ

How often should uptime checks run?

Every 60 seconds is the practical sweet spot for web apps. Five-minute intervals can miss short outages entirely and blur your downtime math; sub-30-second intervals add noise and load without changing your real-world response time, since alert delivery and human reaction dominate.

What counts as "down"?

Transport failures (DNS resolution, TLS handshake errors, connection refused, timeouts) and HTTP 5xx responses. Well-designed monitors treat 4xx — including 401, 403, and 429 — as up, because the server is alive and responding deliberately; counting WAF challenges as outages is the most common source of false alarms.

What's a reasonable uptime target for a small SaaS?

99.9% (about 43 minutes of downtime in a 30-day month) is achievable on modern hosting without heroics and is what most customers implicitly expect. Chasing 99.99% as a small team usually costs more in architecture than it returns; communicating honestly via a status page returns more trust per hour invested.
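The downtime budget behind an uptime target is simple arithmetic. A small sketch, assuming a 30-day month and a hypothetical helper name:

```python
def downtime_budget_minutes(target_percent: float, days: int = 30) -> float:
    """Minutes of allowed downtime over `days` for a given uptime target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - target_percent / 100)


three_nines = downtime_budget_minutes(99.9)   # roughly 43.2 minutes
four_nines = downtime_budget_minutes(99.99)   # roughly 4.32 minutes
```

Going from three nines to four shrinks the monthly budget by a factor of ten, which is where the extra architecture cost comes from.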

Do I really need a public status page?

If anyone pays you, yes. It cuts support volume during incidents, gives procurement the reliability evidence they ask for, and a visible 90-day track record converts skeptics better than a marketing claim. The main objection — "it shows my failures" — inverts in practice: hiding failures is what erodes trust.

Can't I just monitor from my own server?

A health check inside your own infrastructure shares failure modes with your app — region outages, DNS misconfigurations, expired certificates, and full-host failures all take the monitor down with the site. External probing is the entire point of uptime monitoring.

