Service Health Check Runbook - IT Operations Notes
OPS-002 · Operations Runbook

Service Health Check Runbook

A reusable first-response checklist for validating public web availability, DNS, Cloudflare, tunnel connectivity, Apache, and Linux host health.

  • Validation Layers: 6
  • Initial Triage Target: 10 minutes
  • Severity Model: P1-P3
  • Host Platform: Linux
Runbook Purpose

Find the failing layer before changing the system.

This runbook provides a consistent first-response sequence for a self-hosted web service. It starts from the user-visible endpoint and moves inward through DNS, Cloudflare, the tunnel, Apache, and the Linux host.

  • Confirm whether the incident is external, origin-side, or host-wide.
  • Collect evidence before restarting services.
  • Record a clear result for handover or incident review.
Target Environment

Designed for the munyakazi.org service path.

  • Public endpoint: HTTPS website and response code
  • Edge & DNS: Cloudflare resolution and proxy path
  • Connector: cloudflared service and tunnel logs
  • Origin: Apache configuration, process, and ports
  • Host: CPU, memory, disk, load, and failed units
  • Evidence: Timestamp, commands, result, and action
Validation Path

Check from outside in

Stop at the first failed layer, investigate it, and avoid unrelated changes farther down the stack.
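
The outside-in rule can be sketched as a small driver, assuming each layer's checks are wrapped in a shell function that returns non-zero on failure. The first_failed_layer name and the check_* functions in the commented wiring are hypothetical and not part of this runbook.

```shell
# Run named check functions in order and stop at the first one that fails.
# Each argument is the name of a shell function or command that returns
# non-zero on failure. Prints progress, then returns 1 at the first failure.
first_failed_layer() {
    for layer in "$@"; do
        if ! "$layer"; then
            echo "FAILED: $layer"
            return 1
        fi
        echo "ok: $layer"
    done
    echo "all layers passed"
}

# Hypothetical wiring (not run here); each function wraps one layer's commands:
# check_public_url() { curl -sS -f -o /dev/null --max-time 15 https://munyakazi.org; }
# check_dns()        { [ -n "$(dig +short munyakazi.org)" ]; }
# first_failed_layer check_public_url check_dns
```

Layers after the first failure are never executed, which keeps the evidence focused on the layer that actually broke.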

LAYER 01 · Public URL: Reachability, HTTPS status, latency, and certificate.
LAYER 02 · DNS: Name resolution and expected Cloudflare records.
LAYER 03 · Cloudflare: Edge response and error classification.
LAYER 04 · Tunnel: cloudflared state, connections, and recent logs.
LAYER 05 · Apache: Service state, config syntax, local HTTP, and ports.
LAYER 06 · Linux Host: Load, memory, disk, failed units, and kernel warnings.
Phase 1 · External Validation

Confirm the user-visible condition

Run these checks from a system outside the origin server. Capture the timestamp and exact status before logging into the host.
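
One way to capture the timestamp and status consistently is a small logging helper. This is a minimal sketch: the log_evidence name, the record layout, and the EVIDENCE_FILE default path are illustrative assumptions, not an existing tool.

```shell
# Append one evidence record: UTC timestamp, check, observed result, action taken.
# EVIDENCE_FILE is a hypothetical path; point it at wherever incident notes are kept.
EVIDENCE_FILE="${EVIDENCE_FILE:-./incident-evidence.log}"

log_evidence() {
    # $1 = check or command, $2 = observed result, $3 = action taken (default: none)
    printf '%s | check=%s | result=%s | action=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "${3:-none}" >> "$EVIDENCE_FILE"
}

# Example (not run here):
# log_evidence "curl https://munyakazi.org" "HTTP=502 TOTAL=0.41s"
```

Logging in UTC avoids ambiguity when host logs and the responder's workstation are in different time zones.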

01 · Check HTTPS response and timing
Confirm whether the endpoint responds, which HTTP status is returned, and how long the request takes.

    curl -sS -o /dev/null --max-time 15 \
      -w 'HTTP=%{http_code} DNS=%{time_namelookup}s CONNECT=%{time_connect}s TOTAL=%{time_total}s\n' \
      https://munyakazi.org

Expected: HTTP 200 or the intended redirect, completed within the timeout.
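
The status code from the check above can be turned into a first-pass triage hint. The sketch below assumes only that curl reports 000 in %{http_code} when no HTTP response arrived; the classify_http_code name and the hint wording are illustrative.

```shell
# Map a curl %{http_code} value to a first-pass triage hint.
classify_http_code() {
    case "$1" in
        000) echo "no-response" ;;   # DNS, TCP, TLS, or timeout failure
        2??) echo "ok" ;;
        3??) echo "redirect" ;;      # acceptable only if the redirect is intended
        4??) echo "client-error" ;;
        5??) echo "server-error" ;;  # continue with Cloudflare, tunnel, and origin
        *)   echo "unknown" ;;
    esac
}

# Example against the live endpoint (not run here):
# code="$(curl -sS -o /dev/null --max-time 15 -w '%{http_code}' https://munyakazi.org)"
# classify_http_code "$code"
```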

02 · Check DNS resolution
Verify that the public name resolves consistently. Compare results from a second resolver if DNS is suspected.

    dig +short munyakazi.org
    dig @1.1.1.1 +short munyakazi.org
    dig @8.8.8.8 +short munyakazi.org

Expected: consistent Cloudflare-fronted resolution without SERVFAIL or timeout. With +short, a SERVFAIL shows up only as an empty answer, so rerun without +short to see the response status when nothing is printed.
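
Comparing resolver answers by eye is error-prone when records come back in a different order. A minimal sketch of an order-insensitive comparison follows; answers_match is a hypothetical helper. A mismatch is a prompt to look closer rather than proof of a fault, since resolvers may legitimately return different records.

```shell
# Compare two sets of resolver answers, ignoring order and duplicates.
# Returns 0 only when the first set is non-empty and both sets are identical.
answers_match() {
    a="$(printf '%s\n' "$1" | sort -u)"
    b="$(printf '%s\n' "$2" | sort -u)"
    [ -n "$1" ] && [ "$a" = "$b" ]
}

# Example against live resolvers (not run here):
# answers_match "$(dig @1.1.1.1 +short munyakazi.org)" \
#               "$(dig @8.8.8.8 +short munyakazi.org)" && echo "resolvers agree"
```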
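
03 · Classify the Cloudflare response
Record the HTTP status and any Cloudflare error code shown on the error page. A Cloudflare 1033 error or an HTTP 502/504 response narrows the investigation toward the tunnel or origin path. Note that curl -I returns headers only; Cloudflare 1xxx codes such as 1033 appear in the error page body, so fetch the body as well if the headers are inconclusive.

    curl -I --max-time 15 https://munyakazi.org

Record: HTTP status, server headers, Cloudflare Ray ID, and timestamp.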
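
To pull the Ray ID and server header out of the response for the incident record, a small filter over the curl -I output is enough. This is a sketch; header_value is a hypothetical helper, and it assumes the Ray ID arrives in the cf-ray response header.

```shell
# Print the value of one response header from `curl -I` output on stdin.
# Header names are matched case-insensitively; carriage returns are stripped.
header_value() {
    tr -d '\r' | awk -v name="$1" -F': *' '
        tolower($1) == tolower(name) { sub(/^[^:]*: */, ""); print; exit }'
}

# Example against the live endpoint (not run here):
# curl -sI --max-time 15 https://munyakazi.org | header_value cf-ray
```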
Phase 2 · Connector & Origin

Validate the tunnel and Apache independently

If DNS resolves but the public endpoint fails, log into the host and check the connector before restarting anything.

04 · Check cloudflared service state
Confirm that the service is active and review recent tunnel messages for connection or origin errors.

    sudo systemctl is-active cloudflared
    sudo systemctl status cloudflared --no-pager
    sudo journalctl -u cloudflared --since '-15 minutes' --no-pager

Expected: active service with connected tunnel sessions and no repeating origin errors.

05 · Check Apache service and configuration
Validate configuration syntax before considering a reload or restart.

    sudo systemctl is-active apache2
    sudo systemctl status apache2 --no-pager
    sudo apachectl configtest

Expected: active and “Syntax OK”.

06 · Test the origin locally
Bypass Cloudflare and the tunnel. If localhost fails, the fault is at Apache, its virtual host, or the local application.

    curl -I --max-time 10 http://127.0.0.1
    sudo ss -ltnp | grep -E ':(80|443)\b'

Expected: local HTTP response and Apache listening on the configured origin port.
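
When the origin listens on a port other than 80 or 443, the grep above needs adjusting. A sketch of a per-port check over ss output follows; port_listening is a hypothetical helper, and it assumes the ss -ltn column layout in which the fourth column is the local address and port.

```shell
# Succeed when `ss -ltn` output on stdin shows a listener on the given TCP port.
# Column 4 is the local address and port, for example 0.0.0.0:80 or [::]:443.
port_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit found ? 0 : 1 }'
}

# Example on the host (not run here):
# ss -ltn | port_listening 80 && echo "origin port is listening"
```

Anchoring the match at the end of the column keeps port 80 from matching 8080.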

07 · Review Apache logs
Look for shutdown signals, configuration failures, permission problems, or repeated application errors around the incident time.

    sudo journalctl -u apache2 --since '-30 minutes' --no-pager
    sudo tail -n 100 /var/log/apache2/error.log

Record: first relevant error, timestamp, and service state transition.
Phase 3 · Host Health

Rule out system-wide pressure or failure

08 · Check load, memory, swap, and disk
A healthy service can still become unreachable when the host is resource constrained or the filesystem is full.

    uptime
    free -h
    df -h
    df -i

Investigate: sustained high load, exhausted memory/swap, disk above 85%, or inode exhaustion.
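
The 85% disk threshold can be checked mechanically rather than by scanning the table. The sketch below reads df -P output, which prints one line per filesystem with the usage percentage in column 5 and the mount point in column 6; over_threshold is a hypothetical helper and does not handle mount points that contain spaces.

```shell
# List mount points whose usage exceeds a threshold, reading `df -P` on stdin.
# $1 = threshold in percent (default 85). Prints "<mount point> <use%>" per hit.
over_threshold() {
    awk -v limit="${1:-85}" '
        NR > 1 {
            use = $5
            sub(/%$/, "", use)
            if (use + 0 > limit + 0) print $6 " " $5
        }'
}

# Example on the host (not run here):
# df -P | over_threshold 85
```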

09 · Check failed units and kernel events
Identify whether another failed dependency, OOM event, storage error, or network issue contributed to the outage.

    systemctl --failed
    sudo journalctl -p warning --since '-30 minutes' --no-pager
    sudo dmesg --level=err,warn | tail -n 50

Expected: no related failed units, OOM kills, I/O errors, or interface failures.
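
OOM kills are easy to miss in a long kernel log. A rough filter is sketched below; count_oom_events is a hypothetical helper, and the patterns are common kernel message fragments chosen as an assumption, not an exhaustive or version-proof list.

```shell
# Count kernel log lines on stdin that look like OOM-killer activity.
# grep -c prints 0 and exits non-zero when nothing matches, hence `|| true`.
count_oom_events() {
    grep -Eic 'out of memory|oom-kill|invoked oom-killer' || true
}

# Example on the host (not run here):
# sudo dmesg | count_oom_events
```

A non-zero count is a reason to read the matching lines in full and note which process was killed.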
Recovery & Validation

Restore only the failed component

Use the least disruptive corrective action supported by the evidence. Do not restart the complete host when a single service can be safely recovered.

10 · Apply targeted recovery
Examples: gracefully reload Apache after a configuration change, restart Apache if stopped, or restart cloudflared only when the connector is the failed layer.

    sudo systemctl reload apache2
    # or, when stopped/faulted:
    sudo systemctl restart apache2
    # tunnel only when evidence points to it:
    sudo systemctl restart cloudflared

11 · Repeat internal and external tests
Confirm local origin response first, then the public URL. Monitor logs for several minutes to ensure the recovery is stable.

    curl -I http://127.0.0.1
    curl -I https://munyakazi.org
    sudo journalctl -u apache2 -u cloudflared --since '-5 minutes' --no-pager

Close only when the endpoint is reachable and no repeating errors remain.
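
Stability over several minutes can be checked by repeating a probe at a fixed interval. The sketch below is one way to do that; stable_check is a hypothetical helper, and the attempt count and interval in the example are arbitrary choices rather than values from this runbook.

```shell
# Run a check repeatedly and fail as soon as one attempt fails.
# $1 = attempts, $2 = seconds between attempts, remaining args = check command.
stable_check() {
    attempts="$1"
    interval="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if ! "$@"; then
            echo "attempt $i/$attempts failed"
            return 1
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
    echo "stable for $attempts attempts"
}

# Example (not run here): probe the public URL 10 times, 30 seconds apart.
# stable_check 10 30 curl -sS -f -o /dev/null --max-time 15 https://munyakazi.org
```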
Severity & Escalation

Classify the operational impact

P1 · Unavailable

Public service is fully unreachable. Begin the complete runbook immediately and preserve incident evidence.

P2 · Degraded

Intermittent errors, high latency, or partial function. Investigate before users experience full downtime.

P3 · Warning

Service is available, but logs or resource thresholds show increased operational risk.

Check First. Change Second.

This runbook is part of IT Operations Notes, documenting practical response procedures for real infrastructure and service issues.