How I monitor a small fleet of self-hosted projects on one Linux box: what runs, why it is wired the way it is, and the handful of decisions that took a few outages to get right.
There is no product to sell here and nothing exotic. Everything is open source, pinned to a version, and reproduced from a single Git repo. The interesting part is not the component list, it is the wiring: how an alert reaches me when the thing that is broken is the thing that delivers alerts, and how a fired alert becomes a tracked task instead of a notification I swipe away.
The starting point: a stock Ubuntu host running Docker
The baseline for everything below is a plain server: a current Ubuntu LTS, Docker Engine with the Compose plugin, and a reverse proxy already terminating TLS. In my case that is a single Contabo VPS in Germany, the same instance I have upgraded since 2019, never reinstalled, just improved in place. Nothing here is provider-specific; any VPS or bare-metal box with Docker will do.
A stock Docker host gives you containers and not much else. It will happily keep running long after something inside it has quietly died. You do not know:
- whether a public site is actually answering, or how fast
- when a TLS certificate is about to expire
- whether the host is running out of CPU, memory, or disk
- whether a container restarted at 3am
- why something broke, because the logs are scattered across docker logs, journalctl, and files under /var/log
This stack is the layer you add on top to answer those questions and, more importantly, to tell you about them before a user does. Two principles drive the whole design:
- An alert has to survive the failure of its own delivery channel. On a self-hosted box, alert email is often delivered by your own mail server. The day the mail server goes down, the alert that says so is an email, routed to the dead mail server. You never see it. So critical alerts get a second, fully independent path.
- An alert is a work item, not a toast. Every firing alert opens a tracked issue and closes itself when the problem resolves. The issue tracker becomes a live, accurate list of what is broken right now, and goes quiet on its own when things recover.
What runs
| Component | Job | Pinned version |
|---|---|---|
| Prometheus | Metrics collection and alert evaluation | v2.48.1 |
| Grafana | Dashboards and visualisation | 10.2.3 |
| Alertmanager | Alert routing, grouping, deduplication | v0.26.0 |
| Loki | Log aggregation | 3.0.0 |
| Promtail | Log shipping | 2.9.0 |
| cAdvisor | Per-container metrics | v0.47.2 |
| Node Exporter | Host metrics | v1.7.0 |
| Blackbox Exporter | HTTP / TLS / SMTP probing | v0.25.0 |
| ntfy | Self-hosted push for critical alerts | v2.11.0 |
| webhook-bridge | Turns alerts into self-closing GitHub Issues | internal |
Everything is pinned. Versions live in .env, so an upgrade is an explicit edit and a redeploy, never a surprise on a docker compose pull. Dependabot proposes bumps; I apply them on purpose. cAdvisor is pinned for a concrete reason covered later.
Architecture at a glance
                              +--------------+
host + containers --->        |  Prometheus  | --> alert rules
(node-exporter,               +------+-------+
 cAdvisor)                           | fires
                                     v
public URLs ---> Blackbox ->  +--------------+   email   +-------------+
(HTTP/TLS/SMTP)  exporter     | Alertmanager | --------> | mail server |
                              +------+-------+           +-------------+
                                     | webhook
                                     v
                              +---------------+ --> GitHub Issues (open/close)
                              | webhook-bridge| --> ntfy push (critical only)
                              +---------------+

syslog / auth / proxy ---> Promtail -> Loki -> Grafana (dashboards + Explore)

Metrics flow into Prometheus, which evaluates rules and hands firing alerts to Alertmanager. Alertmanager fans out to two receivers: email through the mail server, and a webhook. The webhook bridge files GitHub Issues and, for critical alerts only, pushes to a self-hosted ntfy server that shares nothing with the mail stack. Logs flow separately through Promtail into Loki, queried from Grafana.
Networks and the reverse proxy
The stack joins two pre-existing external Docker networks: web (where the reverse proxy lives) and mail_internal (a private network shared with the mail server). Create them once if they do not exist:
docker network create web
docker network create mail_internal

Only two services are ever reachable from the public internet: Grafana and ntfy. Everything else (Prometheus, Alertmanager, the exporters, the webhook bridge) is internal-only and carries traefik.enable=false. Grafana sits behind an IP-allowlist middleware on top of its own login, because a login page on the public internet is just a brute-force target waiting to be found. All admin URLs are IP-restricted; that is a standing rule across the whole box. The labels below assume Traefik; for nginx-proxy or Caddy the routing changes but nothing else does.
The Compose file
This is the single file that defines the stack. Substitute your own domains and paths.
services:
  grafana:
    image: grafana/grafana:${GRAFANA_VERSION:-10.2.3}
    restart: unless-stopped
    volumes:
      - grafana_data:/var/lib/grafana
      - ./config/provisioning:/etc/grafana/provisioning:ro
    environment:
      GF_SECURITY_ADMIN_USER: ${GF_SECURITY_ADMIN_USER:-admin}
      GF_SECURITY_ADMIN_PASSWORD: ${GF_SECURITY_ADMIN_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SMTP_ENABLED: "true"
      GF_SMTP_HOST: mailserver:587   # Grafana takes host:port in this one setting
      GF_SMTP_USER: ${MAIL_SMTP_USER}
      GF_SMTP_PASSWORD: ${MAIL_SMTP_PASSWORD}
      GF_SMTP_FROM_ADDRESS: ${ALERT_FROM_EMAIL}
      GF_SMTP_FROM_NAME: "Grafana Alerts"
      GF_SMTP_STARTTLS_POLICY: MandatoryStartTLS
      GF_SMTP_SKIP_VERIFY: "true"
      GF_LOG_LEVEL: warn
    networks: [web, default, mail_internal]
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.grafana.rule=Host(`${GRAFANA_DOMAIN}`)"
      - "traefik.http.routers.grafana.tls.certresolver=lets-encrypt"
      - "traefik.http.routers.grafana.middlewares=admin-ipallowlist@file"
      - "traefik.http.services.grafana.loadbalancer.server.port=3000"
      - "traefik.docker.network=web"

  prometheus:
    image: prom/prometheus:${PROMETHEUS_VERSION:-v2.48.1}
    restart: unless-stopped
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./config/alerts.rules:/etc/prometheus/alerts.rules:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    networks: [web, default]
    labels: ["traefik.enable=false"]

  alertmanager:
    image: prom/alertmanager:${ALERTMANAGER_VERSION:-v0.26.0}
    restart: unless-stopped
    env_file: .env
    entrypoint: ["/bin/sh", "/etc/alertmanager/alertmanager-entrypoint.sh"]
    volumes:
      - ./config/alertmanager.yml.template:/etc/alertmanager/alertmanager.yml.template:ro
      - ./config/alertmanager-entrypoint.sh:/etc/alertmanager/alertmanager-entrypoint.sh:ro
      - alertmanager_data:/alertmanager
    networks: [default, mail_internal]
    labels: ["traefik.enable=false"]

  loki:
    image: grafana/loki:${LOKI_VERSION:-3.0.0}
    restart: unless-stopped
    volumes:
      - loki_data:/loki
      - ./config/loki.yml:/etc/loki/loki.yml:ro
    command: -config.file=/etc/loki/loki.yml -log.level=warn
    labels: ["traefik.enable=false"]

  promtail:
    image: grafana/promtail:${PROMTAIL_VERSION:-2.9.0}
    restart: unless-stopped
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./config/promtail.yml:/etc/promtail/config.yml:ro
      - promtail_positions:/var/lib/promtail
      - ${PROXY_LOGS_PATH:-/srv/proxy/logs}:/proxy-logs:ro
    command: -config.file=/etc/promtail/config.yml
    labels: ["traefik.enable=false"]

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:${CADVISOR_VERSION:-v0.47.2}
    restart: unless-stopped
    privileged: true
    mem_limit: 1g
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    command:
      - '--docker_only=true'
      - '--housekeeping_interval=30s'
    labels: ["traefik.enable=false"]

  webhook-bridge:
    image: python:3.11-alpine
    restart: unless-stopped
    volumes:
      - ./services/webhook-bridge/app.py:/app/app.py:ro
    working_dir: /app
    command: python app.py
    environment:
      GITHUB_TOKEN: ${GITHUB_TOKEN}
      GITHUB_REPO: ${GITHUB_REPO}
      NTFY_URL: ${NTFY_URL:-http://ntfy}
      NTFY_TOPIC: ${NTFY_TOPIC:-infra-alerts}
      NTFY_TOKEN: ${NTFY_TOKEN}
    labels: ["traefik.enable=false"]

  ntfy:
    image: binwiederhier/ntfy:${NTFY_VERSION:-v2.11.0}
    restart: unless-stopped
    command: serve
    environment:
      TZ: ${TZ:-UTC}
    volumes:
      - ./config/ntfy-server.yml:/etc/ntfy/server.yml:ro
      - ntfy_data:/var/lib/ntfy
    networks: [web, default]
    healthcheck:
      test: ["CMD-SHELL", "wget -q -O - http://localhost:80/v1/health 2>/dev/null | grep -q '\"healthy\":true' || exit 1"]
      interval: 60s
      timeout: 10s
      retries: 3
      start_period: 30s
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.ntfy.rule=Host(`${NTFY_DOMAIN}`)"
      - "traefik.http.routers.ntfy.tls.certresolver=lets-encrypt"
      - "traefik.http.services.ntfy.loadbalancer.server.port=80"
      - "traefik.docker.network=web"

  blackbox-exporter:
    image: prom/blackbox-exporter:${BLACKBOX_VERSION:-v0.25.0}
    restart: unless-stopped
    networks: [default, mail_internal, web]
    volumes:
      - ./config/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
    labels: ["traefik.enable=false"]

  node-exporter:
    image: prom/node-exporter:${NODE_EXPORTER_VERSION:-v1.7.0}
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    labels: ["traefik.enable=false"]

networks:
  web: { external: true }
  mail_internal: { external: true }

volumes:
  grafana_data:
  prometheus_data:
  loki_data:
  alertmanager_data:
  promtail_positions:
  ntfy_data:

A few choices worth pointing out:
- The Blackbox exporter sits on three networks on purpose. Being on the internal networks lets it probe other containers by name (service:port) without that traffic crossing the host firewall or making a pointless round-trip out through the public reverse proxy. The same placement lets the mail-port probes reach the mail server directly over mail_internal.
- node-exporter runs with pid: host and read-only mounts of /proc, /sys, and /, with the pseudo-filesystems excluded from the filesystem collector so you do not alert on tmpfs and friends.
- cAdvisor gets a mem_limit and --docker_only=true. It will otherwise account for everything on the host and grow unbounded over time.
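To see what the node-exporter exclusion actually filters, here is a minimal Python sketch that applies the same pattern to a handful of mount points. The mount list is made up for illustration, and Python's re module stands in for the Go regex engine node-exporter uses; for a pattern this simple the two should agree, but treat that as an assumption.

```python
import re

# Same pattern as in the Compose file; Compose's "$$" escape is a plain "$" here.
EXCLUDE = re.compile(r"^/(sys|proc|dev|host|etc)($|/)")

def is_excluded(mount_point: str) -> bool:
    """True if the filesystem collector would skip this mount point."""
    return EXCLUDE.search(mount_point) is not None

# Hypothetical mount points, chosen to exercise the edge cases.
mounts = ["/", "/sys/fs/cgroup", "/proc", "/dev/shm",
          "/etc/hostname", "/var/lib/docker", "/devices"]
kept = [m for m in mounts if not is_excluded(m)]
print(kept)
```

The trailing ($|/) group is what keeps a path like /devices from being swallowed by the dev alternative: the match has to end at a path boundary.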
Secrets live in .env, and .env is never committed
Nothing sensitive goes into Git. The repo ships a .env.example template; the real .env is git-ignored and only exists on the server. Every password, token, and SMTP credential is an environment variable.
# Grafana
GRAFANA_VERSION=10.2.3
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=
GRAFANA_DOMAIN=grafana.example.com
# Pin everything; upgrade deliberately
PROMETHEUS_VERSION=v2.48.1
ALERTMANAGER_VERSION=v0.26.0
LOKI_VERSION=3.0.0
PROMTAIL_VERSION=2.9.0
CADVISOR_VERSION=v0.47.2
BLACKBOX_VERSION=v0.25.0
NODE_EXPORTER_VERSION=v1.7.0
NTFY_VERSION=v2.11.0
# SMTP for email alerts (Grafana + Alertmanager)
MAIL_SMTP_USER=
MAIL_SMTP_PASSWORD=
ALERT_FROM_EMAIL=alerts@example.com
ALERT_TO_EMAIL=you@example.com
# GitHub Issues bridge - fine-grained PAT, Issues read and write on the repo
GITHUB_TOKEN=
GITHUB_REPO=youruser/your-infra-repo
# Self-hosted push (independent of the mail stack)
NTFY_DOMAIN=ntfy.example.com
NTFY_TOPIC=infra-alerts
NTFY_TOKEN=
NTFY_URL=http://ntfy
# Reverse-proxy access log dir, bind-mounted into Promtail
PROXY_LOGS_PATH=/srv/proxy/logs
TZ=UTC

If a value is blank above, it is a secret you fill in on the server and nowhere else.
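Because a blank value means "secret still to be filled in", a pre-deploy check can list the blanks before the stack starts. This is a minimal sketch, not part of the repo: blank_keys is a hypothetical helper, and it assumes plain KEY=VALUE lines with # comments and no quoting or multi-line values.

```python
def blank_keys(env_text: str) -> list[str]:
    """Return keys whose value is empty in KEY=VALUE text, skipping comments."""
    missing = []
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if not value.strip():
            missing.append(key.strip())
    return missing

# A shortened sample in the same shape as the template above.
sample = """\
# Grafana
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=
GITHUB_TOKEN=
TZ=UTC
"""
print(blank_keys(sample))
```

Run against the real .env on the server, an empty result would mean every secret has a value; run against .env.example, it lists exactly what a new deployment has to supply.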
Scraping metrics and probing endpoints
Prometheus scrapes three kinds of target: the internal exporters by container name, and two flavours of Blackbox probe (web URLs and mail ports). The Blackbox jobs use relabel_configs to rewrite each scrape so Prometheus asks the exporter to probe the real target. That is the standard multi-target exporter pattern, and it is the same shape for every probe type.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'primary'

alerting:
  alertmanagers:
    - static_configs:
        - targets: [alertmanager:9093]

rule_files:
  - /etc/prometheus/alerts.rules

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'loki'
    static_configs:
      - targets: ['loki:3100']

  # Web uptime / latency / TLS expiry
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://www.example.com
          - https://app.example.com
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # Mail submission port, probed by name over mail_internal
  - job_name: 'blackbox_mail_submission'
    metrics_path: /probe
    params:
      module: [smtp_starttls]
    scrape_interval: 60s
    scrape_timeout: 20s   # must exceed the module's own 15s timeout
    static_configs:
      - targets: [mailserver:587]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

Repeat the mail block for ports 25 (tcp_connect), 465 (smtps), and 993 (imaps), one job per module.
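The three relabel steps are easier to follow when traced by hand. The sketch below is a simplified Python model of what they do to one target, not Prometheus code: relabel_for_blackbox and scrape_url are hypothetical helper names, and the model ignores every relabel feature the config above does not use.

```python
from urllib.parse import urlencode

def relabel_for_blackbox(target: str, module: str,
                         exporter: str = "blackbox-exporter:9115") -> dict:
    """Apply the three relabel steps from the scrape config to one target."""
    labels = {"__address__": target, "__param_module": module}
    # 1. Copy the real target into the ?target= query parameter.
    labels["__param_target"] = labels["__address__"]
    # 2. Keep the real target as the human-readable instance label.
    labels["instance"] = labels["__param_target"]
    # 3. Point the actual scrape at the exporter instead of the target.
    labels["__address__"] = exporter
    return labels

def scrape_url(labels: dict, metrics_path: str = "/probe") -> str:
    """Build the URL Prometheus ends up requesting from the exporter."""
    query = urlencode({"module": labels["__param_module"],
                       "target": labels["__param_target"]})
    return f"http://{labels['__address__']}{metrics_path}?{query}"

labels = relabel_for_blackbox("mailserver:587", "smtp_starttls")
print(labels["instance"])   # the label the alert rules see
print(scrape_url(labels))   # the request that actually goes out
```

The point of the pattern is visible in the two printed values: alerts carry the real target as instance, while the HTTP request goes to the exporter with the target as a parameter.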
scrape_timeout must be larger than the probe module's own timeout. If it is not, Prometheus kills the scrape mid-handshake and you get phantom "down" readings on a mail server that is perfectly healthy. A STARTTLS dialogue legitimately takes several seconds, so a 15s module timeout needs a 20s scrape timeout around it.
The probe modules
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []   # empty = default, any 2xx (redirects are followed first)
      method: GET
      follow_redirects: true
      preferred_ip_protocol: "ip4"
      tls_config:
        insecure_skip_verify: false   # real cert validation for public sites

  tcp_connect:
    prober: tcp
    timeout: 10s
    tcp:
      preferred_ip_protocol: "ip4"

  smtp_starttls:
    prober: tcp
    timeout: 15s   # the STARTTLS dialogue can take several seconds
    tcp:
      preferred_ip_protocol: "ip4"
      query_response:
        - expect: "^220 "
        - send: "EHLO blackbox\r\n"
        - expect: "^250"
        - send: "STARTTLS\r\n"
        - expect: "^220"
        - starttls: true
      tls_config:
        insecure_skip_verify: true   # internal hop; only testing reachability

  smtps:
    prober: tcp
    timeout: 15s
    tcp:
      preferred_ip_protocol: "ip4"
      tls: true
      tls_config:
        insecure_skip_verify: true

  imaps:
    prober: tcp
    timeout: 10s
    tcp:
      preferred_ip_protocol: "ip4"
      tls: true
      tls_config:
        insecure_skip_verify: true

The smtp_starttls module actually speaks the protocol: it waits for the 220 greeting, sends EHLO, issues STARTTLS, then upgrades the connection. That proves the mail server is genuinely answering SMTP, not just accepting a bare TCP connection, which is the difference between "mail works" and "the port is open but Postfix is wedged." For public web targets I validate certificates for real; for the internal mail hops I do not, because there I am only testing reachability over a private network.
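The expect/send sequence is a small state machine, and it can be replayed against canned server output to see where a wedged server fails. This is an illustrative Python model only: run_dialogue is a hypothetical helper, the server lines are invented, and the final TLS upgrade step is left out because it needs a real socket.

```python
import re

# The expect/send steps from the smtp_starttls module, minus the TLS upgrade.
STEPS = [
    ("expect", r"^220 "),
    ("send", "EHLO blackbox\r\n"),
    ("expect", r"^250"),
    ("send", "STARTTLS\r\n"),
    ("expect", r"^220"),
]

def run_dialogue(steps, server_lines):
    """Replay the steps against canned server lines.

    Returns (ok, sent): ok is False as soon as an expected line never
    arrives; sent lists what the prober would have written so far.
    """
    incoming = iter(server_lines)
    sent = []
    for kind, value in steps:
        if kind == "send":
            sent.append(value)
            continue
        pattern = re.compile(value)
        for line in incoming:          # read lines until one matches
            if pattern.search(line):
                break
        else:                          # server ran dry: the probe fails
            return False, sent
    return True, sent

healthy = ["220 mail.example.com ESMTP", "250-mail.example.com",
           "250-STARTTLS", "250 SMTPUTF8", "220 2.0.0 Ready to start TLS"]
wedged = []  # port accepts the connection but never sends a greeting

print(run_dialogue(STEPS, healthy))
print(run_dialogue(STEPS, wedged))
```

A plain tcp_connect probe would report both servers as up; only the dialogue distinguishes them.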
Alert rules
These cover the host, the public endpoints, and the containers. The thresholds and the for: windows matter as much as the expressions; they are what separate a useful page from a pager that cries wolf.
groups:
  - name: node
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU above 85% for 5 minutes (current: {{ $value }}%)"

      - alert: CriticalCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Critical CPU usage on {{ $labels.instance }}"
          description: "CPU above 95% for 2 minutes (current: {{ $value }}%)"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory above 85% (current: {{ $value }}%)"

      - alert: LowDiskSpace
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 80
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk above 80% on / (current: {{ $value }}%)"

      - alert: CriticalDiskSpace
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 90
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Critical disk space on {{ $labels.instance }}"
          description: "Disk above 90% on / (current: {{ $value }}%)"

  - name: blackbox
    rules:
      - alert: SiteDown
        expr: probe_success == 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Site down: {{ $labels.instance }}"
          description: "{{ $labels.instance }} unreachable for 5 minutes"

      - alert: SlowResponse
        expr: probe_duration_seconds{job="blackbox"} > 2
        for: 3m
        labels: { severity: warning }
        annotations:
          summary: "Slow response: {{ $labels.instance }}"
          description: "{{ $labels.instance }} responding in {{ $value | printf \"%.2f\" }}s"

      - alert: SiteFlapping
        expr: changes(probe_success{job="blackbox"}[30m]) > 4
        for: 0m
        labels: { severity: warning }
        annotations:
          summary: "Site flapping: {{ $labels.instance }}"
          description: "{{ $labels.instance }} changed state >4 times in 30 minutes"

      - alert: SSLCertExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels: { severity: warning }
        annotations:
          summary: "SSL cert expiring: {{ $labels.instance }}"
          description: "Certificate expires in {{ $value | printf \"%.0f\" }} days"

      - alert: SSLCertExpiryCritical
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 7
        for: 1h
        labels: { severity: critical }
        annotations:
          summary: "SSL cert < 7 days: {{ $labels.instance }}"
          description: "Certificate expires in {{ $value | printf \"%.0f\" }} days"

  - name: containers
    rules:
      - alert: ContainerDown
        expr: absent(container_last_seen{name!=""}) or time() - container_last_seen{name!=""} > 60
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Container {{ $labels.name }} is down"
          description: "{{ $labels.name }} not seen for over a minute"

      - alert: ContainerHighMemory
        expr: (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}) * 100 > 85 and container_spec_memory_limit_bytes{name!=""} > 0
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Container {{ $labels.name }} high memory"
          description: "{{ $labels.name }} memory above 85% of its limit"

Two lessons are baked into those rules:
- Scope the latency rule to web probes with job="blackbox". Without that label, SlowResponse also fires on the mail TLS/SMTP probes, which legitimately take two to three and a half seconds. Leaving the label off once produced hundreds of noise alerts over a few days and buried the alerts that actually mattered.
- SiteDown uses for: 5m. A single failed scrape (a momentary timeout, a reverse-proxy reload) should not page anyone. A real outage outlasts five minutes comfortably. The flapping rule catches the in-between case where a service is bouncing up and down.
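The effect of the for: window can be modelled in a few lines. The sketch below is a simplified Python stand-in for how a rule moves from pending to firing, assuming one evaluation every 15 seconds as in the global config; first_firing_time is a hypothetical helper, and real Prometheus evaluation has more moving parts (staleness, missed scrapes) than this captures.

```python
def first_firing_time(samples, interval_s=15, for_s=300):
    """Seconds from start at which SiteDown would fire, or None.

    samples holds one probe_success value (1 or 0) per rule evaluation.
    The alert is pending from the first failure and fires once the
    condition has held continuously for at least for_s seconds.
    """
    pending_since = None
    for i, ok in enumerate(samples):
        t = i * interval_s
        if ok == 0:
            if pending_since is None:
                pending_since = t
            if t - pending_since >= for_s:
                return t
        else:
            pending_since = None   # any success resets the window
    return None

blip = [1, 1, 0, 0, 1, 1]          # two failed scrapes, then recovery
outage = [1] + [0] * 25            # fails from t=15s and stays down

print(first_firing_time(blip))
print(first_firing_time(outage))
```

The blip never leaves the pending state, while the sustained outage fires five minutes after its first failed evaluation.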
Routing alerts: two receivers, and injecting secrets safely
Alertmanager groups, deduplicates, and routes. The important move is the two receivers: email and a webhook. Email is the primary, human-readable channel; the webhook is where the issue tracking and push notification happen.
global:
  smtp_smarthost: 'mailserver:587'
  smtp_require_tls: true

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match: { severity: critical }
      receiver: 'default'
      repeat_interval: 1h   # nag more often for criticals

receivers:
  - name: 'default'
    email_configs:
      - to: '${ALERT_TO_EMAIL}'
        from: '${ALERT_FROM_EMAIL}'
        smarthost: 'mailserver:587'
        auth_username: '${MAIL_SMTP_USER}'
        auth_password: '${MAIL_SMTP_PASSWORD}'
        require_tls: true
        tls_config: { insecure_skip_verify: true }
        send_resolved: true
    webhook_configs:
      - url: 'http://webhook-bridge:5001/webhook'
        send_resolved: true

inhibit_rules:
  - source_match: { severity: 'critical' }
    target_match: { severity: 'warning' }
    equal: ['alertname', 'instance']

The inhibit_rules block is a small quality-of-life win: while a critical alert is firing, the warning-level alert with the same alertname and instance is suppressed. One event, one notification. Note that equal compares label values exactly, so a warning/critical pair with different names (HighCPUUsage and CriticalCPUUsage in the rules above) is only deduplicated if the two rules share an alertname or equal is narrowed to ['instance'], which in turn mutes every warning for that instance while any critical is active.
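How the equal list decides what gets muted is worth checking against concrete alerts. The following is a simplified Python model of the inhibition rule, not Alertmanager's implementation: inhibited is a hypothetical helper, severity is the only matcher it knows, and the alerts are sample data.

```python
def inhibited(alerts, equal=("alertname", "instance")):
    """Return the warning alerts muted by a firing critical alert.

    A warning is muted when some critical alert carries the same value
    for every label named in `equal`.
    """
    criticals = [a for a in alerts if a.get("severity") == "critical"]
    muted = []
    for alert in alerts:
        if alert.get("severity") != "warning":
            continue
        if any(all(alert.get(k) == c.get(k) for k in equal) for c in criticals):
            muted.append(alert)
    return muted

alerts = [
    {"alertname": "CriticalCPUUsage", "instance": "node-exporter:9100",
     "severity": "critical"},
    {"alertname": "HighCPUUsage", "instance": "node-exporter:9100",
     "severity": "warning"},
]

print(inhibited(alerts))                        # alertname differs
print(inhibited(alerts, equal=("instance",)))   # match on instance only
```

With both labels in equal, the differently named CPU rules do not inhibit each other; matching on instance alone does, at the price of muting unrelated warnings on the same instance.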
Injecting secrets without baking them into the image
Alertmanager does not expand environment variables in its config file. Rather than commit credentials, the container runs a tiny entrypoint that substitutes them at startup. It uses awk with index/substr, plain string replacement, not regex, so a password containing special characters is handled correctly instead of being mangled.
#!/bin/sh
# Substitute env vars into the Alertmanager config at startup.
# awk index/substr (not regex) so passwords with $, &, \, | survive intact.
# Values are read from ENVIRON rather than passed with -v, because awk
# processes backslash escapes in -v assignments.
awk '
function replace(line, placeholder, value,    idx, len) {
  len = length(placeholder)
  while ((idx = index(line, placeholder)) > 0)
    line = substr(line, 1, idx-1) value substr(line, idx+len)
  return line
}
{
  line = $0
  line = replace(line, "${ALERT_TO_EMAIL}",    ENVIRON["ALERT_TO_EMAIL"])
  line = replace(line, "${ALERT_FROM_EMAIL}",  ENVIRON["ALERT_FROM_EMAIL"])
  line = replace(line, "${MAIL_SMTP_USER}",    ENVIRON["MAIL_SMTP_USER"])
  line = replace(line, "${MAIL_SMTP_PASSWORD}", ENVIRON["MAIL_SMTP_PASSWORD"])
  print line
}
' /etc/alertmanager/alertmanager.yml.template > /tmp/alertmanager-resolved.yml

exec /bin/alertmanager \
  --config.file=/tmp/alertmanager-resolved.yml \
  --storage.path=/alertmanager

The template with placeholders is what lives in Git. The resolved file, with real credentials in it, exists only at /tmp inside the running container; it is never written to the repository or to a bind-mounted path.
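The reason for literal replacement over regex is easy to demonstrate outside awk. This Python sketch is an analogy, not the entrypoint itself: render is a hypothetical helper, and the password is an invented value chosen to contain characters that regex replacement treats specially.

```python
import re

def render(template: str, values: dict) -> str:
    """Fill ${NAME} placeholders by literal string replacement."""
    for name, value in values.items():
        template = template.replace("${" + name + "}", value)
    return template

template = "auth_password: '${MAIL_SMTP_PASSWORD}'"
password = r"p&ss\1w|rd$"   # hypothetical password full of awkward characters

rendered = render(template, {"MAIL_SMTP_PASSWORD": password})
print(rendered)

# The regex route treats the replacement as a template: "\1" is read as a
# group reference, and since the pattern has no groups, re.sub raises.
try:
    re.sub(r"\$\{MAIL_SMTP_PASSWORD\}", password, template)
    regex_ok = True
except re.error:
    regex_ok = False
print(regex_ok)
```

Literal replacement copies the password through byte for byte; the regex version either fails outright, as here, or silently rewrites parts of the value.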
The webhook bridge: alerts become self-closing issues
This is the piece that turns notifications into tracked work. It is a single dependency-free Python file, standard library only, which is why it can run on a bare python:3.11-alpine image with the script bind-mounted in, no build step. It does three things: it opens a GitHub Issue on a firing alert, closes the matching issue on a resolved alert, and for critical alerts also pushes to the self-hosted ntfy server.
#!/usr/bin/env python3
"""
Alertmanager -> GitHub Issues bridge.
Creates/closes GitHub issues automatically; critical alerts also push via ntfy.
Standard library only; internal-only (not exposed via the reverse proxy).
"""
import json, os, sys, threading, time
import urllib.request, urllib.error
from http.server import HTTPServer, BaseHTTPRequestHandler

GITHUB_TOKEN = os.environ.get('GITHUB_TOKEN', '')
GITHUB_REPO = os.environ.get('GITHUB_REPO', '')
GITHUB_API = 'https://api.github.com'
NTFY_URL = os.environ.get('NTFY_URL', 'http://ntfy').rstrip('/')
NTFY_TOPIC = os.environ.get('NTFY_TOPIC', 'infra-alerts')
NTFY_TOKEN = os.environ.get('NTFY_TOKEN', '')

_lock = threading.Lock()
_recent = {}  # title -> timestamp, a 5-minute in-flight dedup window


def log(msg):
    print(msg, flush=True)


def gh(method, path, payload=None):
    data = json.dumps(payload).encode() if payload else None
    req = urllib.request.Request(
        f'{GITHUB_API}{path}', data=data, method=method,
        headers={
            'Authorization': f'Bearer {GITHUB_TOKEN}',
            'Accept': 'application/vnd.github.v3+json',
            'Content-Type': 'application/json',
            'X-GitHub-Api-Version': '2022-11-28',
        })
    try:
        with urllib.request.urlopen(req) as r:
            return json.loads(r.read())
    except urllib.error.HTTPError as e:
        log(f"GitHub {method} {path} -> HTTP {e.code}: {e.read().decode()}")
        return None
    except Exception as e:
        log(f"GitHub {method} {path} error: {e}")
        return None


def ntfy(title, message, priority='5', tags='rotating_light'):
    """Push to self-hosted ntfy. Title/tags stay ASCII (HTTP headers are latin-1);
    emoji come from tag names. Body may be UTF-8. Best-effort; never blocks."""
    if not NTFY_TOKEN:
        log("ntfy skipped: NTFY_TOKEN not set")
        return
    try:
        req = urllib.request.Request(
            f'{NTFY_URL}/{NTFY_TOPIC}',
            data=(message or title).encode('utf-8'), method='POST',
            headers={
                'Authorization': f'Bearer {NTFY_TOKEN}',
                'Title': title, 'Priority': str(priority), 'Tags': tags,
            })
        urllib.request.urlopen(req, timeout=10)
        log(f"ntfy sent ({priority}): {title}")
    except Exception as e:
        log(f"ntfy publish error: {e}")


def issue_title(alert):
    labels = alert.get('labels', {})
    name = labels.get('alertname', 'Alert')
    severity = labels.get('severity', 'warning').upper()
    instance = labels.get('instance', '')
    title = f"[{severity}] {name}"
    if instance:
        title += f": {instance}"
    return title


def find_open_issue(title):
    # List API, not search API: fine-grained PATs can't use the search endpoint.
    page = 1
    while True:
        issues = gh('GET', f'/repos/{GITHUB_REPO}/issues?state=open&per_page=100&page={page}')
        if not issues:
            return None
        for issue in issues:
            if issue.get('title') == title:
                return issue
        if len(issues) < 100:
            return None
        page += 1


def handle_firing(alert):
    title = issue_title(alert)
    labels = alert.get('labels', {})
    severity = labels.get('severity', 'warning')
    instance = labels.get('instance', 'N/A')
    summary = alert.get('annotations', {}).get('summary', '')
    desc = alert.get('annotations', {}).get('description', '')

    with _lock:
        now = time.time()
        for k in list(_recent):  # evict stale dedup entries
            if now - _recent[k] > 300:
                del _recent[k]
        if title in _recent:
            log(f"Deduplicated (in-flight): {title}")
            return
        if find_open_issue(title):
            log(f"Issue already open: {title}")
            return
        _recent[title] = now

    body = f"""## Alert Firing

| | |
|---|---|
| **Severity** | `{severity}` |
| **Instance** | `{instance}` |
| **Summary** | {summary} |

{desc}

---
*Auto-created by Alertmanager. Closes automatically when the alert resolves.*
"""
    # Try with labels first; fall back to no labels if they don't exist on the repo.
    result = gh('POST', f'/repos/{GITHUB_REPO}/issues',
                {'title': title, 'body': body, 'labels': ['monitoring', severity]})
    if not result:
        result = gh('POST', f'/repos/{GITHUB_REPO}/issues', {'title': title, 'body': body})
    if result:
        log(f"Created issue #{result['number']}: {title}")

    if severity == 'critical':
        ntfy(title, f"{summary}\n\n{desc}".strip() or instance,
             priority='5', tags='rotating_light')


def handle_resolved(alert):
    title = issue_title(alert)
    severity = alert.get('labels', {}).get('severity', 'warning')
    issue = find_open_issue(title)
    if not issue:
        log(f"No open issue to close for: {title}")
        return
    number = issue['number']
    gh('POST', f'/repos/{GITHUB_REPO}/issues/{number}/comments',
       {'body': ':green_circle: Alert resolved - closing automatically.'})
    gh('PATCH', f'/repos/{GITHUB_REPO}/issues/{number}', {'state': 'closed'})
    log(f"Closed issue #{number}: {title}")
    if severity == 'critical':
        ntfy(f"Resolved: {title}", "Alert resolved - issue closed automatically.",
             priority='3', tags='white_check_mark')


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        try:
            data = json.loads(self.rfile.read(length))
            for alert in data.get('alerts', []):
                if alert.get('status') == 'firing':
                    handle_firing(alert)
                elif alert.get('status') == 'resolved':
                    handle_resolved(alert)
        except Exception as e:
            log(f"Webhook error: {e}")
        self.send_response(200); self.end_headers(); self.wfile.write(b'ok')

    def do_GET(self):
        self.send_response(200); self.end_headers()
        self.wfile.write(json.dumps({'status': 'ok', 'repo': GITHUB_REPO}).encode())

    def log_message(self, *_):
        pass  # suppress default request logging


if __name__ == '__main__':
    if not GITHUB_TOKEN:
        log("ERROR: GITHUB_TOKEN not set - issues will not be created")
        sys.exit(1)
    state = f"ntfy -> {NTFY_URL}/{NTFY_TOPIC}" if NTFY_TOKEN else "ntfy DISABLED"
    log(f"Webhook bridge on :5001 -> {GITHUB_REPO} | critical -> {state}")
    HTTPServer(('0.0.0.0', 5001), Handler).serve_forever()

Three details that are not obvious until you hit them:
- Use the issues list API, not the search API. GitHub's fine-grained personal access tokens do not have access to the search endpoint, so matching an open issue by title means paging through state=open issues a hundred at a time. The first time I built this with the search API it returned 403s that looked like a permissions bug and were not.
- Two layers of deduplication. An in-memory five-minute window catches in-flight bursts cheaply, and a find_open_issue check catches anything already filed. Without both, a flapping service fills the tracker with duplicates faster than you can close them.
- The label fallback. Creating an issue with labels fails if those labels do not exist on the repo, so the bridge retries without them. The issue still gets filed.
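The two dedup layers are easier to reason about with the clock and the tracker lookup made explicit. This is a standalone Python sketch of the same idea, not the bridge's code: the Dedup class and its should_file method are hypothetical names, and the set of open titles stands in for the GitHub list call.

```python
class Dedup:
    """Two-layer duplicate check: a short in-memory window, then the tracker."""

    def __init__(self, window_s=300):
        self.window_s = window_s
        self.recent = {}   # title -> time the alert was last accepted

    def should_file(self, title, now, open_titles):
        # Evict entries older than the window.
        self.recent = {t: ts for t, ts in self.recent.items()
                       if now - ts <= self.window_s}
        if title in self.recent:      # layer 1: burst still in flight
            return False
        if title in open_titles:      # layer 2: issue already filed
            return False
        self.recent[title] = now
        return True

d = Dedup()
title = "[CRITICAL] SiteDown: https://example.com"
results = [
    d.should_file(title, 0, set()),       # first sighting: file it
    d.should_file(title, 10, set()),      # burst: tracker may not list it yet
    d.should_file(title, 400, {title}),   # window expired, but issue is open
    d.should_file(title, 400, set()),     # issue closed since: file again
]
print(results)
```

The second call is the case the in-memory window exists for: the issue has been requested but may not show up in the open list yet, so the tracker check alone would let a duplicate through.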
Why the second channel is the whole point
Here is the trap this design exists to avoid. On a self-hosted box, the monitoring stack sends alert email through your own mail server. That is fine for almost every alert, until the alert you most need to receive is "the mail server is down." That alert is an email. It routes to the mail server. The mail server is down. You never hear about it. The monitoring did its job perfectly and you still found out from a user.
ntfy breaks that circular dependency. It is a small self-hosted push server; you install its app on your phone, subscribe to a topic, and the webhook bridge POSTs critical alerts straight to it over a path that shares nothing with the mail stack. Mail can be completely dead and the push still lands. Because the ntfy server has to be reachable from the public internet, it is locked to deny-all and given an explicit publisher:
base-url: https://ntfy.example.com
auth-default-access: deny-all

Then create the publishing user and token once, on the server:
docker exec -it <ntfy-container> ntfy user add --role=user bridge # set a password
docker exec -it <ntfy-container> ntfy access bridge infra-alerts rw # publish + read
docker exec -it <ntfy-container> ntfy token add bridge # prints tk_...

Put the printed token into .env as NTFY_TOKEN and redeploy so the bridge picks it up. On the phone: install the ntfy app, add the server, subscribe to the topic, sign in. Then prove the independent path actually works:
docker exec <ntfy-container> ntfy publish --token tk_... infra-alerts "test"

If the phone buzzes, the channel that survives a mail outage is real. The token never gets committed; it goes in .env only.
Logs: Loki and Promtail
Metrics tell you that something broke; logs tell you why. Promtail tails the host logs plus every container's stdout/stderr and ships them to Loki, which Grafana queries.
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: varlogs
    static_configs:
      - targets: [localhost]
        # Globs are quoted: braces and commas cannot appear unquoted inside a { } mapping.
        labels: { job: varlogs, host: server, __path__: "/var/log/{syslog,kern.log}" }
  # Auth log - SSH logins, sudo, PAM. Contains IPs/usernames; gated by Grafana auth.
  - job_name: auth
    static_configs:
      - targets: [localhost]
        labels: { job: auth, host: server, __path__: /var/log/auth.log }
  # Reverse-proxy JSON access logs -> web analytics.
  # Keep cardinality low: only host + method become labels; everything else
  # (path, status, duration, client IP) is parsed at query time via | json.
  - job_name: proxy-access
    pipeline_stages:
      - json:
          expressions:
            request_host: RequestHost
            request_method: RequestMethod
      - labels:
          host: request_host
          method: request_method
    static_configs:
      - targets: [localhost]
        labels: { job: proxy-access, __path__: /proxy-logs/access.log }
  - job_name: containers
    pipeline_stages:
      - docker: {}
    static_configs:
      - targets: [localhost]
        labels: { job: containers, host: server, __path__: "/var/lib/docker/containers/*/*-json.log" }
    relabel_configs:
      - source_labels: [__path__]
        regex: '/var/lib/docker/containers/([^/]+)/.+'
        target_label: container_id
        replacement: '$1'
        action: replace
The Loki config itself is mostly defaults: filesystem storage, single-binary mode, retention around 30 days. The one thing to plan for is that Loki's schema versions change between major releases. Keep dated schema_config entries, one per period, so that old data stays readable after an upgrade instead of needing a migration.
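To make the container_id relabel rule in the Promtail config concrete, here is a small sketch that applies the same regex with sed. The function name is hypothetical. Relabel regexes are fully anchored, which the explicit ^ and $ emulate, and the first capture group is what ends up in the label.

```shell
#!/bin/sh
# Sketch: emulate the relabel rule's regex on a path string.
# Prints the captured container ID, or nothing when the regex does not match
# (in which case the rule would leave container_id unset).
extract_container_id() {
  printf '%s\n' "$1" |
    sed -nE 's#^/var/lib/docker/containers/([^/]+)/.+$#\1#p'
}

# Usage (not run here):
#   extract_container_id /var/lib/docker/containers/abc123/abc123-json.log
#   extract_container_id '/var/lib/docker/containers/*/*-json.log'
```

On a concrete log path the capture is the container ID; on the unexpanded glob it is a literal *. It is worth checking in Grafana which of the two the container_id label actually holds.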
Dashboards as code
Grafana provisions its datasources and dashboards from files on disk. No click-ops, and the entire observability layer comes back identically from Git on every redeploy.
# Datasources
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100

# Dashboard provider
apiVersion: 1
providers:
  - name: 'default'
    folder: ''
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
Drop dashboard JSON files next to that provider and they appear on startup. The workflow is: build a dashboard in the UI, export its JSON, commit it. Every redeploy then comes up identical to the last. The dashboards I keep provisioned:
- Site Uptime: a probe_success grid per target, HTTP response duration, uptime percentage over 24h / 7d / 30d, and TLS-expiry-in-days.
- Infrastructure: host CPU, memory, disk gauges, network, container counts.
- Logs: Loki log rates, container error rates, auth log, syslog.
- Traefik / proxy: request rates, status-code breakdowns, p50/p95/p99 latency, 5xx rate.
Each UP/DOWN tile carries a data link straight to the site it represents, so the dashboard doubles as a launcher.
Deploying: manual, GitOps-style, no agents on the box
Deploys are deliberate. There is no auto-deploy on push; production only changes when I trigger it. A small "Deploy to Production" workflow dispatches an event to the orchestration repo, which runs on a self-hosted runner that SSHes into the VPS, syncs the named config files out of Git, pulls images, and brings the stack up. The pattern is identical for every project on the box, which is most of why a new project can be running end to end in about an hour.
name: Deploy to Production
on:
  workflow_dispatch:
    inputs:
      force_recreate:
        description: 'Force recreate containers'
        type: boolean
        default: false
jobs:
  trigger-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger deploy
        uses: peter-evans/repository-dispatch@v4
        with:
          token: ${{ secrets.DEPLOY_TOKEN }}
          repository: youruser/infra-repo
          event-type: platform-deploy
          client-payload: |
            {
              "project_dir": "monitoring",
              "project_type": "platform",
              "sync_code": "true",
              "sync_files": "docker-compose.yml config/prometheus.yml config/alerts.rules config/blackbox.yml config/loki.yml config/promtail.yml config/ntfy-server.yml services/webhook-bridge/app.py config/alertmanager.yml.template config/alertmanager-entrypoint.sh config/provisioning/...",
              "force_recreate": "${{ inputs.force_recreate || false }}"
            }
The receiving deploy job, running on the self-hosted runner, does the boring-but-essential things in order: verify SSH, sync only the listed files, pull, bring up, then run a health check that fails the run if any container is in an Exit / unhealthy / restarting state.
jobs:
  deploy:
    runs-on: self-hosted
    environment: production
    timeout-minutes: 30
    steps:
      # Earlier steps (parsing the dispatch payload into steps.vars, verifying SSH) are omitted here.
      - name: Sync config files from repo
        if: steps.vars.outputs.sync_code == 'true'
        run: |
          ssh -i ~/.ssh/id_rsa_vps -p ${{ secrets.SSH_PORT }} \
            ${{ secrets.SSH_USER }}@${{ secrets.VPS_IP }} \
            "cd ~/docker/${{ steps.vars.outputs.project_dir }} && \
             git fetch origin main --depth=1 && \
             git checkout origin/main -- ${{ steps.vars.outputs.sync_files }}"
      - name: Pull images and deploy
        run: |
          ssh -i ~/.ssh/id_rsa_vps -p ${{ secrets.SSH_PORT }} \
            ${{ secrets.SSH_USER }}@${{ secrets.VPS_IP }} \
            "cd ~/docker/${{ steps.vars.outputs.project_dir }} && \
             docker compose pull && \
             docker compose up -d --remove-orphans"
      - name: Health check
        run: |
          sleep 30
          ssh -i ~/.ssh/id_rsa_vps -p ${{ secrets.SSH_PORT }} \
            ${{ secrets.SSH_USER }}@${{ secrets.VPS_IP }} \
            "cd ~/docker/${{ steps.vars.outputs.project_dir }} && \
             STOPPED=\$(docker compose ps | grep -cE 'Exit|unhealthy|restarting' || true) && \
             if [ \"\$STOPPED\" -gt 0 ]; then docker compose ps && exit 1; fi && \
             echo 'All containers healthy'"
Every host, port, and user in there is a GitHub secret, never a literal, and the SSH key itself lives only on the runner. The box runs no deploy agent of its own and exposes no deploy webhook. The deployment surface is one SSH key and one short-lived token. There is also a separate local Compose file for working on dashboards: Grafana only, anonymous admin, no Traefik, live provisioning reload, on localhost:3000.
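The health check above waits a fixed 30 seconds and then looks once. A possible variant, sketched here under the assumption that polling suits the stack better than a single sleep, checks repeatedly and succeeds as soon as the status is clean. The function names are hypothetical; the grep pattern is the one from the workflow.

```shell
#!/bin/sh
# Sketch: poll container status instead of sleeping for a fixed interval.

# Count status lines showing a bad state; reads `docker compose ps` output on stdin.
count_bad() {
  grep -cE 'Exit|unhealthy|restarting' || true   # grep -c prints 0 but exits 1 on no match
}

# wait_until_clean ATTEMPTS DELAY_SECONDS STATUS_COMMAND...
# Returns 0 as soon as STATUS_COMMAND succeeds and reports no bad state,
# 1 if that never happens within ATTEMPTS checks.
wait_until_clean() (
  attempts=$1
  delay=$2
  shift 2
  status=""
  i=1
  while [ "$i" -le "$attempts" ]; do
    # A failing status command counts as "not clean", never as "healthy".
    if status=$("$@") && [ "$(printf '%s\n' "$status" | count_bad)" -eq 0 ]; then
      echo "All containers healthy after $i check(s)"
      exit 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  printf '%s\n' "$status"   # show the last status in the job log
  exit 1
)

# Usage on the server (not run here):
#   wait_until_clean 10 6 docker compose ps
```

Ten checks six seconds apart give a slow container up to a minute to settle, while a healthy stack passes on the first check.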
Backups, so a dead box is not a dead project
Monitoring tells you something broke; snapshots let you put it back. A scheduled GitHub Actions job takes a provider-level snapshot of the VPS every week during quiet hours, with a manual trigger for ad-hoc snapshots before a risky change. The provider here is Contabo via its API, but the shape is generic; any provider with a snapshot API works the same way.
name: Weekly VPS Snapshots
on:
  schedule:
    - cron: '0 2 * * 0' # Sundays 02:00 UTC, low traffic
  workflow_dispatch:
    inputs:
      target:
        type: choice
        options: [both, vps1, vps2]
        default: both
jobs:
  snapshot:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - run: sudo apt-get install -y jq uuid-runtime
      - name: Create snapshots
        env:
          # all provider credentials are GitHub secrets, never literals
          PROVIDER_CLIENT_ID: ${{ secrets.PROVIDER_CLIENT_ID }}
          PROVIDER_CLIENT_SECRET: ${{ secrets.PROVIDER_CLIENT_SECRET }}
          VPS_INSTANCE_ID: ${{ secrets.VPS_INSTANCE_ID }}
        run: ./scripts/backup/snapshot.sh "${{ github.event.inputs.target || 'both' }}"
Same discipline as everywhere else: every credential is a secret, the schedule is deliberate, and there is a manual escape hatch.
The discipline that keeps it boring
A few habits matter more than any single component:
- Pin every image and upgrade on purpose. cAdvisor is the cautionary tale: its latest tag has shipped builds with a constant CPU overhead of around 13%, so it is pinned to v0.47.2 and only moves after a tested bump. Dependabot proposes; nothing applies itself.
- Verify the alert path end to end before trusting it. Stop a non-critical container, watch a GitHub Issue open, start it again, watch the issue close. Then make the mail stack unreachable and confirm the ntfy push still lands. An alerting pipeline you have never seen fire is a guess, not a safety net.
- Keep secrets out of Git, always. Templates and .env.example live in the repo; real values live only on the server and in Actions secrets. The entrypoint resolves credentials into /tmp at runtime so they never sit on disk in plaintext config.
- Reproduce, do not repair. Because the whole stack is one Compose file plus pinned images plus tracked config, recovering from a bad state is a redeploy, not a debugging session.
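The runtime secret resolution mentioned in the list above can be sketched as follows. This is one way an entrypoint could render a tracked template into /tmp, not the actual config/alertmanager-entrypoint.sh. The function name, the ${NAME} placeholder syntax, and the variable names in the usage line are assumptions.

```shell
#!/bin/sh
# Sketch: render a config template by substituting ${NAME} placeholders
# with values taken from the environment.
# render_template TEMPLATE OUTPUT VAR_NAME...
render_template() (
  template=$1
  output=$2
  shift 2
  umask 077                       # rendered file is readable by this user only
  cp "$template" "$output"
  for name in "$@"; do
    eval "value=\${$name-}"       # names are fixed in the script, never user input
    if [ -z "$value" ]; then
      echo "render_template: $name is not set" >&2
      rm -f "$output"             # do not leave a half-rendered config behind
      exit 1
    fi
    # Escape characters that are special in a sed replacement: \, & and the | delimiter.
    # Values containing newlines are not handled by this sketch.
    escaped=$(printf '%s' "$value" | sed 's/[\\&|]/\\&/g')
    sed "s|[\$]{$name}|$escaped|g" "$output" > "$output.tmp"
    mv "$output.tmp" "$output"
  done
)

# Usage inside an entrypoint (not run here; variable names are hypothetical):
#   render_template /etc/alertmanager/alertmanager.yml.template \
#     /tmp/alertmanager.yml SMTP_PASSWORD BRIDGE_TOKEN
#   exec alertmanager --config.file=/tmp/alertmanager.yml
```

Failing hard on an unset variable is deliberate: a container that refuses to start is easier to notice than an Alertmanager running with an empty password.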
What this design buys
- One docker compose up reproduces the entire observability stack, dashboards and all, from pinned images and tracked config.
- No secret is ever committed; everything sensitive is an environment variable injected at runtime.
- Alerts survive the failure of their own delivery channel, because criticals have a second path that shares no dependencies with email.
- Every alert is a tracked, self-closing work item, so the issue tracker is a live picture of what is broken right now and goes quiet on its own when things recover.
None of the components are unusual. The value is in the wiring: the two-channel fan-out, the runtime secret injection, the cardinality-aware log labels, and the bridge that closes the loop between "something fired" and "someone is tracking it." That is the part most tutorials skip, and it is the part that decides whether your monitoring actually wakes you up when it matters, and stops bothering you when it does not.
The result, on one VPS I have been carrying forward since 2019, is uptime sitting at 97-100% month after month. Not because the hardware is special, but because the box tells me the moment it is not.
Open source throughout, version-pinned, and updated regularly via Dependabot and deliberate manual bumps.
