Service Health Check Runbook

Phase 1 · External Validation

Confirm the user-visible condition

Run these checks from a system outside the origin server. Capture the timestamp and exact status before logging into the host.

Check HTTPS response and timing

Confirm whether the endpoint responds, which HTTP status is returned, and how long the request takes.

curl -sS -o /dev/null --max-time 15 \
  -w 'HTTP=%{http_code} DNS=%{time_namelookup}s CONNECT=%{time_connect}s TOTAL=%{time_total}s\n' \
  https://munyakazi.org

Expected: HTTP 200 or the intended redirect (3xx), completed within the 15-second timeout
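
The status printed by the check above can be bucketed before it goes into the incident notes. This is a minimal sketch, not part of the original runbook: the function name classify_http_code and its labels are hypothetical. It relies only on the fact that curl's %{http_code} is 000 when no HTTP reply was received.

```shell
# classify_http_code CODE
# Prints a coarse label for a value taken from curl's %{http_code}.
# The function name and labels are illustrative assumptions.
classify_http_code() {
  case "$1" in
    000) echo "no-response" ;;   # curl got no HTTP reply (DNS, TCP, TLS, or timeout)
    2??) echo "ok" ;;
    3??) echo "redirect" ;;      # acceptable only if it is the intended redirect
    4??) echo "client-error" ;;
    5??) echo "server-error" ;;  # includes 502/504 returned by the edge
    *)   echo "unknown" ;;
  esac
}

# Example use (not executed here):
#   code=$(curl -sS -o /dev/null --max-time 15 -w '%{http_code}' https://munyakazi.org)
#   echo "$(date -u +%FT%TZ) $code $(classify_http_code "$code")"
```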

Check DNS resolution

Verify that the public name resolves consistently. Compare results from a second resolver if DNS is suspected.

dig +short munyakazi.org
dig @1.1.1.1 +short munyakazi.org
dig @8.8.8.8 +short munyakazi.org

Expected: consistent Cloudflare-fronted resolution without SERVFAIL or timeout
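
Comparing resolver answers by eye is error-prone when records come back in a different order. The sketch below assumes the answers were captured with dig +short; the helper name same_answers is hypothetical. Differing answers are a prompt to look closer, not proof of a fault.

```shell
# same_answers A B
# Succeeds when two `dig +short` outputs contain the same set of records,
# ignoring order and duplicates. An empty answer never counts as a match.
# The function name is an illustrative assumption.
same_answers() {
  sa_a=$(printf '%s\n' "$1" | sort -u)
  sa_b=$(printf '%s\n' "$2" | sort -u)
  [ -n "$sa_a" ] && [ "$sa_a" = "$sa_b" ]
}

# Example use (not executed here):
#   cf=$(dig @1.1.1.1 +short munyakazi.org)
#   gg=$(dig @8.8.8.8 +short munyakazi.org)
#   same_answers "$cf" "$gg" && echo "resolvers agree" || echo "resolvers differ or returned nothing"
```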

Classify the Cloudflare response

Record the Cloudflare error code shown on the error page. A Cloudflare 1033 error or an HTTP 502/504 response narrows the investigation toward the tunnel or origin path.

curl -I --max-time 15 https://munyakazi.org

Record: HTTP status, server headers, Cloudflare Ray ID, and timestamp
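
To make the recorded fields consistent between incidents, the header output can be reduced to the values named above. This is a sketch under assumptions: the helper name summarize_headers is hypothetical, and it expects a single response header block on stdin, as produced by curl -sI without -L. The Ray ID is read from the cf-ray response header that Cloudflare adds.

```shell
# summarize_headers
# Reads one HTTP response header block on stdin and prints the status code,
# the server header, and the cf-ray header as key=value lines.
# The function name is an illustrative assumption.
summarize_headers() {
  tr -d '\r' | awk '
    NR == 1                  { print "status=" $2 }   # e.g. "HTTP/2 502"
    tolower($1) == "server:" { print "server=" $2 }
    tolower($1) == "cf-ray:" { print "ray=" $2 }
  '
}

# Example use (not executed here):
#   date -u +%FT%TZ
#   curl -sI --max-time 15 https://munyakazi.org | summarize_headers
```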

Phase 2 · Connector & Origin

Validate the tunnel and Apache independently

If DNS resolves but the public endpoint fails, log into the host and check the connector before restarting anything.

Check cloudflared service state

Confirm that the service is active and review recent tunnel messages for connection or origin errors.

sudo systemctl is-active cloudflared
sudo systemctl status cloudflared --no-pager
sudo journalctl -u cloudflared --since '-15 minutes' --no-pager

Expected: active service with connected tunnel sessions and no repeating origin errors

Check Apache service and configuration

Validate configuration syntax before considering a reload or restart.

sudo systemctl is-active apache2
sudo systemctl status apache2 --no-pager
sudo apachectl configtest

Expected: active and “Syntax OK”

Test the origin locally

Bypass Cloudflare and the tunnel. If localhost fails, the fault is at Apache, its virtual host, or the local application.

curl -I --max-time 10 http://127.0.0.1
sudo ss -ltnp | grep -E ':(80|443)\b'

Expected: local HTTP response and Apache listening on the configured origin port
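
The Phase 1 and Phase 2 results combine into a simple decision about where to look next. The sketch below encodes that reasoning; the function name likely_layer, its arguments, and its output labels are hypothetical, and the result is a starting point rather than a diagnosis.

```shell
# likely_layer PUBLIC LOCAL TUNNEL
#   PUBLIC  "ok" or "fail": result of the external HTTPS check
#   LOCAL   "ok" or "fail": result of the curl against 127.0.0.1
#   TUNNEL  output of `systemctl is-active cloudflared`
# Names and labels are illustrative assumptions.
likely_layer() {
  if [ "$2" != "ok" ]; then
    echo "origin"          # localhost fails: Apache, its virtual host, or the application
  elif [ "$3" != "active" ]; then
    echo "connector"       # origin answers locally but cloudflared is not running
  elif [ "$1" != "ok" ]; then
    echo "tunnel-or-edge"  # local checks pass: review tunnel logs and the Cloudflare side
  else
    echo "none"            # all three checks pass
  fi
}

# Example use (not executed here):
#   likely_layer fail ok "$(systemctl is-active cloudflared)"
```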

Review Apache logs

Look for shutdown signals, configuration failures, permission problems, or repeated application errors around the incident time.

sudo journalctl -u apache2 --since '-30 minutes' --no-pager
sudo tail -n 100 /var/log/apache2/error.log

Record: first relevant error, timestamp, and service state transition
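
Finding the first relevant error can be narrowed with a filter on the log level. This sketch assumes the default Apache 2.4 error log layout, in which each line carries a [module:level] field; a customized ErrorLogFormat would need a different pattern. The helper name first_apache_error is hypothetical, and -m 1 relies on GNU or BSD grep.

```shell
# first_apache_error
# Reads Apache error log lines on stdin and prints the first one logged at
# level error or higher. Exits non-zero when no such line is found.
# Assumes the default [module:level] field; the function name is illustrative.
first_apache_error() {
  grep -E -m 1 '\[[a-z0-9_-]*:(emerg|alert|crit|error)\]'
}

# Example use (not executed here):
#   sudo tail -n 100 /var/log/apache2/error.log | first_apache_error
```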

Phase 3 · Host Health

Rule out system-wide pressure or failure

Check load, memory, swap, and disk

A healthy service can still become unreachable when the host is resource constrained or the filesystem is full.

uptime
free -h
df -h
df -i

Investigate: sustained high load, exhausted memory/swap, disk above 85%, or inode exhaustion
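
The 85% disk threshold is easy to check mechanically. The sketch below filters the POSIX-format output of df -P, where the fifth column is the capacity percentage and the sixth is the mount point; the helper name over_threshold is hypothetical, and mount points containing spaces are not handled.

```shell
# over_threshold LIMIT
# Reads `df -P` output on stdin and prints "mountpoint use%" for every
# filesystem whose usage is strictly above LIMIT percent.
# The function name is an illustrative assumption.
over_threshold() {
  awk -v limit="$1" '
    NR > 1 {                       # skip the header line
      use = $5
      sub(/%/, "", use)            # "90%" -> "90"
      if (use + 0 > limit + 0) print $6 " " $5
    }
  '
}

# Example use (not executed here):
#   df -P | over_threshold 85
```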

Check failed units and kernel events

Identify whether another failed dependency, OOM event, storage error, or network issue contributed to the outage.

systemctl --failed
sudo journalctl -p warning --since '-30 minutes' --no-pager
sudo dmesg --level=err,warn | tail -n 50

Expected: no related failed units, OOM kills, I/O errors, or interface failures
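
Kernel output is noisy, so a keyword filter helps surface the events listed above. This is a rough sketch: the helper name kernel_red_flags is hypothetical, and the patterns cover only common wording for OOM kills, I/O errors, and link-down events. A clean result does not prove that the kernel log is free of problems.

```shell
# kernel_red_flags
# Reads kernel log text on stdin and prints lines that mention the OOM killer,
# I/O errors, or a network link going down. Exits non-zero when nothing matches.
# The pattern list is an illustrative, non-exhaustive assumption.
kernel_red_flags() {
  grep -Ei 'out of memory|oom-kill|oom_reaper|i/o error|link is down'
}

# Example use (not executed here):
#   sudo dmesg --level=err,warn | kernel_red_flags | tail -n 20
```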

Recovery & Validation

Restore only the failed component

Use the least disruptive corrective action that the evidence supports. Do not reboot the entire host when a single service can be recovered safely.

Apply targeted recovery

Examples: gracefully reload Apache after a configuration change, restart Apache if stopped, or restart cloudflared only when the connector is the failed layer.

sudo systemctl reload apache2
# or, when stopped/faulted:
sudo systemctl restart apache2
# tunnel only when evidence points to it:
sudo systemctl restart cloudflared
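
The choice between these commands can be written down as a small rule set. The sketch below only prints the suggested command and never runs it; the function name recovery_command, its arguments, and the "fix-config-first" and "no-action" labels are hypothetical. It assumes the configuration has already been checked with apachectl configtest.

```shell
# recovery_command LAYER APACHE_STATE CONFIG_OK
#   LAYER         "origin" or "connector" (the layer the evidence points to)
#   APACHE_STATE  output of `systemctl is-active apache2`
#   CONFIG_OK     "yes" if `apachectl configtest` reported Syntax OK
# Prints a suggestion only. Names and labels are illustrative assumptions.
recovery_command() {
  case "$1" in
    connector)
      echo "systemctl restart cloudflared" ;;
    origin)
      if [ "$3" != "yes" ]; then
        echo "fix-config-first"           # never reload or restart on a failing configtest
      elif [ "$2" = "active" ]; then
        echo "systemctl reload apache2"   # running service: graceful reload
      else
        echo "systemctl restart apache2"  # stopped or failed service
      fi ;;
    *)
      echo "no-action" ;;                 # no evidence yet: keep investigating
  esac
}

# Example use (not executed here):
#   recovery_command origin "$(systemctl is-active apache2)" yes
```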

Repeat internal and external tests

Confirm local origin response first, then the public URL. Monitor logs for several minutes to ensure the recovery is stable.

curl -I --max-time 10 http://127.0.0.1
curl -I --max-time 15 https://munyakazi.org
sudo journalctl -u apache2 -u cloudflared --since '-5 minutes' --no-pager

Close the incident only when the endpoint is reachable and no repeating errors remain
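
A single successful request does not show that the recovery is stable. One way to make the closing criterion concrete is to sample the public URL several times and require every sample to succeed. The sketch below assumes 2xx and 3xx count as healthy; the helper name all_healthy, the sample count, and the interval are hypothetical.

```shell
# all_healthy CODE...
# Succeeds only when at least one status code is given and every code
# is 2xx or 3xx. The function name is an illustrative assumption.
all_healthy() {
  [ "$#" -gt 0 ] || return 1
  for ah_code in "$@"; do
    case "$ah_code" in
      2??|3??) ;;          # healthy sample, keep checking
      *) return 1 ;;       # any other value (including 000) fails the run
    esac
  done
  return 0
}

# Example use (not executed here): five samples, one minute apart.
#   codes=""
#   for i in 1 2 3 4 5; do
#     codes="$codes $(curl -s -o /dev/null --max-time 15 -w '%{http_code}' https://munyakazi.org)"
#     sleep 60
#   done
#   all_healthy $codes && echo "stable" || echo "still failing: $codes"
```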

Severity & Escalation

Classify the operational impact

P1 · Unavailable

Public service is fully unreachable. Begin the complete runbook immediately and preserve incident evidence.

P2 · Degraded

Intermittent errors, high latency, or partial function. Investigate before users experience full downtime.

P3 · Warning

Service is available, but logs or resource thresholds show increased operational risk.
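
The three levels above can be applied in order, from most to least severe. The sketch below is one way to express that ordering; the function name severity and its yes/no arguments are hypothetical, and the judgment about what counts as degraded still belongs to the responder.

```shell
# severity REACHABLE DEGRADED WARNINGS   (each "yes" or "no")
#   REACHABLE  public service answers at all
#   DEGRADED   intermittent errors, high latency, or partial function
#   WARNINGS   logs or resource thresholds show increased risk
# Prints P1, P2, P3, or "none". Names are illustrative assumptions.
severity() {
  if [ "$1" != "yes" ]; then
    echo "P1"
  elif [ "$2" = "yes" ]; then
    echo "P2"
  elif [ "$3" = "yes" ]; then
    echo "P3"
  else
    echo "none"
  fi
}

# Example use (not executed here):
#   severity yes no yes
```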

Service Health Check
Runbook

Find the failing layer before changing the system.

Designed for the munyakazi.org service path.

Check from outside in

Confirm the user-visible condition

Validate the tunnel and Apache independently

Rule out system-wide pressure or failure

Restore only the failed component

Classify the operational impact

Check First. Change Second.
