Self-hosted monitoring on a single VPS

How I monitor a small fleet of self-hosted projects on one Linux box: what runs, why it is wired the way it is, and the handful of decisions that took a few outages to get right.

There is no product to sell here and nothing exotic. Everything is open source, pinned to a version, and reproduced from a single Git repo. The interesting part is not the component list but the wiring: how an alert reaches me when the thing that is broken is the thing that delivers alerts, and how a fired alert becomes a tracked task instead of a notification I swipe away.

The starting point: a stock Ubuntu host running Docker

The baseline for everything below is a plain server: a current Ubuntu LTS, Docker Engine with the Compose plugin, and a reverse proxy already terminating TLS. In my case that is a single Contabo VPS in Germany, the same instance I have upgraded since 2019, never reinstalled, just improved in place. Nothing here is provider-specific; any VPS or bare-metal box with Docker will do.

A stock Docker host gives you containers and not much else. It will happily keep running long after something inside it has quietly died. You do not know:

  • whether a public site is actually answering, or how fast
  • when a TLS certificate is about to expire
  • whether the host is running out of CPU, memory, or disk
  • whether a container restarted at 3am
  • why something broke, because the logs are scattered across docker logs, journalctl, and files under /var/log

This stack is the layer you add on top to answer those questions and, more importantly, to tell you about them before a user does. Two principles drive the whole design:

  1. An alert has to survive the failure of its own delivery channel. On a self-hosted box, alert email is often delivered by your own mail server. The day the mail server goes down, the alert that says so is an email, routed to the dead mail server. You never see it. So critical alerts get a second, fully independent path.
  2. An alert is a work item, not a toast. Every firing alert opens a tracked issue and closes itself when the problem resolves. The issue tracker becomes a live, accurate list of what is broken right now, and goes quiet on its own when things recover.

What runs

Component           Job                                           Pinned version
Prometheus          Metrics collection and alert evaluation       v2.48.1
Grafana             Dashboards and visualisation                  10.2.3
Alertmanager        Alert routing, grouping, deduplication        v0.26.0
Loki                Log aggregation                               3.0.0
Promtail            Log shipping                                  2.9.x
cAdvisor            Per-container metrics                         v0.47.2
Node Exporter       Host metrics                                  v1.7.0
Blackbox Exporter   HTTP / TLS / SMTP probing                     v0.25.0
ntfy                Self-hosted push for critical alerts          v2.11.0
webhook-bridge      Turns alerts into self-closing GitHub Issues  internal

Everything is pinned. Versions live in .env, so an upgrade is an explicit edit and a redeploy, never a surprise on a docker compose pull. Dependabot proposes bumps; I apply them on purpose. cAdvisor is pinned for a concrete reason covered later.

Architecture at a glance

                          +--------------+
   host + containers ---> |  Prometheus  | --> alert rules
   (node-exporter,        +------+-------+
    cAdvisor)                    | fires
                                 v
   public URLs ---> Blackbox -> +--------------+   email   +-------------+
   (HTTP/TLS/SMTP) exporter     | Alertmanager | --------> | mail server |
                                +------+-------+           +-------------+
                                       | webhook
                                       v
                                +---------------+ --> GitHub Issues (open/close)
                                | webhook-bridge| --> ntfy push (critical only)
                                +---------------+
   syslog / auth / proxy ---> Promtail -> Loki -> Grafana (dashboards + Explore)

Metrics flow into Prometheus, which evaluates rules and hands firing alerts to Alertmanager. Alertmanager fans out to two receivers: email through the mail server, and a webhook. The webhook bridge files GitHub Issues and, for critical alerts only, pushes to a self-hosted ntfy server that shares nothing with the mail stack. Logs flow separately through Promtail into Loki, queried from Grafana.

Networks and the reverse proxy

The stack joins two pre-existing external Docker networks: web (where the reverse proxy lives) and mail_internal (a private network shared with the mail server). Create them once if they do not exist:

bash
docker network create web
docker network create mail_internal

Only two services are ever reachable from the public internet: Grafana and ntfy. Everything else (Prometheus, Alertmanager, the exporters, the webhook bridge) is internal-only and carries traefik.enable=false. Grafana sits behind an IP-allowlist middleware on top of its own login, because a login page on the public internet is just a brute-force target waiting to be found. All admin URLs are IP-restricted; that is a standing rule across the whole box. The labels below assume Traefik; for nginx-proxy or Caddy the routing changes but nothing else does.

The Compose file

This is the single file that defines the stack. Substitute your own domains and paths.

docker-compose.yml
services:
  grafana:
    image: grafana/grafana:${GRAFANA_VERSION:-10.2.3}
    restart: unless-stopped
    volumes:
      - grafana_data:/var/lib/grafana
      - ./config/provisioning:/etc/grafana/provisioning:ro
    environment:
      GF_SECURITY_ADMIN_USER: ${GF_SECURITY_ADMIN_USER:-admin}
      GF_SECURITY_ADMIN_PASSWORD: ${GF_SECURITY_ADMIN_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SMTP_ENABLED: "true"
      GF_SMTP_HOST: mailserver
      GF_SMTP_PORT: "587"
      GF_SMTP_USER: ${MAIL_SMTP_USER}
      GF_SMTP_PASSWORD: ${MAIL_SMTP_PASSWORD}
      GF_SMTP_FROM_ADDRESS: ${ALERT_FROM_EMAIL}
      GF_SMTP_FROM_NAME: "Grafana Alerts"
      GF_SMTP_STARTTLS_POLICY: MandatoryStartTLS
      GF_SMTP_SKIP_VERIFY: "true"
      GF_LOG_LEVEL: warn
    networks: [web, default, mail_internal]
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.grafana.rule=Host(`${GRAFANA_DOMAIN}`)"
      - "traefik.http.routers.grafana.tls.certresolver=lets-encrypt"
      - "traefik.http.routers.grafana.middlewares=admin-ipallowlist@file"
      - "traefik.http.services.grafana.loadbalancer.server.port=3000"
      - "traefik.docker.network=web"

  prometheus:
    image: prom/prometheus:${PROMETHEUS_VERSION:-v2.48.1}
    restart: unless-stopped
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./config/alerts.rules:/etc/prometheus/alerts.rules:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    networks: [web, default]
    labels: ["traefik.enable=false"]

  alertmanager:
    image: prom/alertmanager:${ALERTMANAGER_VERSION:-v0.26.0}
    restart: unless-stopped
    env_file: .env
    entrypoint: ["/bin/sh", "/etc/alertmanager/alertmanager-entrypoint.sh"]
    volumes:
      - ./config/alertmanager.yml.template:/etc/alertmanager/alertmanager.yml.template:ro
      - ./config/alertmanager-entrypoint.sh:/etc/alertmanager/alertmanager-entrypoint.sh:ro
      - alertmanager_data:/alertmanager
    networks: [default, mail_internal]
    labels: ["traefik.enable=false"]

  loki:
    image: grafana/loki:${LOKI_VERSION:-3.0.0}
    restart: unless-stopped
    volumes:
      - loki_data:/loki
      - ./config/loki.yml:/etc/loki/loki.yml:ro
    command: -config.file=/etc/loki/loki.yml -log.level=warn
    labels: ["traefik.enable=false"]

  promtail:
    image: grafana/promtail:${PROMTAIL_VERSION:-2.9.0}
    restart: unless-stopped
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./config/promtail.yml:/etc/promtail/config.yml:ro
      - promtail_positions:/var/lib/promtail
      - ${PROXY_LOGS_PATH:-/srv/proxy/logs}:/proxy-logs:ro
    command: -config.file=/etc/promtail/config.yml
    labels: ["traefik.enable=false"]

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:${CADVISOR_VERSION:-v0.47.2}
    restart: unless-stopped
    privileged: true
    mem_limit: 1g
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    command:
      - '--docker_only=true'
      - '--housekeeping_interval=30s'
    labels: ["traefik.enable=false"]

  webhook-bridge:
    image: python:3.11-alpine
    restart: unless-stopped
    volumes:
      - ./services/webhook-bridge/app.py:/app/app.py:ro
    working_dir: /app
    command: python app.py
    environment:
      GITHUB_TOKEN: ${GITHUB_TOKEN}
      GITHUB_REPO: ${GITHUB_REPO}
      NTFY_URL: ${NTFY_URL:-http://ntfy}
      NTFY_TOPIC: ${NTFY_TOPIC:-infra-alerts}
      NTFY_TOKEN: ${NTFY_TOKEN}
    labels: ["traefik.enable=false"]

  ntfy:
    image: binwiederhier/ntfy:${NTFY_VERSION:-v2.11.0}
    restart: unless-stopped
    command: serve
    environment:
      TZ: ${TZ:-UTC}
    volumes:
      - ./config/ntfy-server.yml:/etc/ntfy/server.yml:ro
      - ntfy_data:/var/lib/ntfy
    networks: [web, default]
    healthcheck:
      test: ["CMD-SHELL", "wget -q -O - http://localhost:80/v1/health 2>/dev/null | grep -q '\"healthy\":true' || exit 1"]
      interval: 60s
      timeout: 10s
      retries: 3
      start_period: 30s
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.ntfy.rule=Host(`${NTFY_DOMAIN}`)"
      - "traefik.http.routers.ntfy.tls.certresolver=lets-encrypt"
      - "traefik.http.services.ntfy.loadbalancer.server.port=80"
      - "traefik.docker.network=web"

  blackbox-exporter:
    image: prom/blackbox-exporter:${BLACKBOX_VERSION:-v0.25.0}
    restart: unless-stopped
    networks: [default, mail_internal, web]
    volumes:
      - ./config/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
    labels: ["traefik.enable=false"]

  node-exporter:
    image: prom/node-exporter:${NODE_EXPORTER_VERSION:-v1.7.0}
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    labels: ["traefik.enable=false"]

networks:
  web: { external: true }
  mail_internal: { external: true }

volumes:
  grafana_data:
  prometheus_data:
  loki_data:
  alertmanager_data:
  promtail_positions:
  ntfy_data:

A few choices worth pointing out:

  • The Blackbox exporter sits on three networks on purpose. Being on the internal networks lets it probe other containers by name (service:port) without that traffic crossing the host firewall or making a pointless round-trip out through the public reverse proxy. The same placement lets the mail-port probes reach the mail server directly over mail_internal.
  • node-exporter runs with pid: host and read-only mounts of /proc, /sys, and /, with the pseudo-filesystems excluded from the filesystem collector so you do not alert on tmpfs and friends.
  • cAdvisor gets a mem_limit and --docker_only=true. Without them it accounts for everything on the host, and its memory use grows without bound over time.

Secrets live in .env, and .env is never committed

Nothing sensitive goes into Git. The repo ships a .env.example template; the real .env is git-ignored and only exists on the server. Every password, token, and SMTP credential is an environment variable.

.env.example
# Grafana
GRAFANA_VERSION=10.2.3
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=
GRAFANA_DOMAIN=grafana.example.com

# Pin everything; upgrade deliberately
PROMETHEUS_VERSION=v2.48.1
ALERTMANAGER_VERSION=v0.26.0
LOKI_VERSION=3.0.0
PROMTAIL_VERSION=2.9.0
CADVISOR_VERSION=v0.47.2
BLACKBOX_VERSION=v0.25.0
NODE_EXPORTER_VERSION=v1.7.0
NTFY_VERSION=v2.11.0

# SMTP for email alerts (Grafana + Alertmanager)
MAIL_SMTP_USER=
MAIL_SMTP_PASSWORD=
ALERT_FROM_EMAIL=alerts@example.com
ALERT_TO_EMAIL=you@example.com

# GitHub Issues bridge - fine-grained PAT, Issues read and write on the repo
GITHUB_TOKEN=
GITHUB_REPO=youruser/your-infra-repo

# Self-hosted push (independent of the mail stack)
NTFY_DOMAIN=ntfy.example.com
NTFY_TOPIC=infra-alerts
NTFY_TOKEN=
NTFY_URL=http://ntfy

# Reverse-proxy access log dir, bind-mounted into Promtail
PROXY_LOGS_PATH=/srv/proxy/logs

TZ=UTC

If a value is blank above, it is a secret you fill in on the server and nowhere else.
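
A preflight check along these lines can catch a half-filled .env before a deploy. This is a minimal sketch, not part of the repo described here: the function names parse_env and blank_or_missing are hypothetical, and the parser only understands the simple KEY=value lines used in the template above.

```python
def parse_env(text):
    """Parse simple KEY=value lines; skip blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def blank_or_missing(example_text, real_text):
    """Keys declared in the template that the real file omits or leaves empty."""
    example = parse_env(example_text)
    real = parse_env(real_text)
    return sorted(key for key in example if not real.get(key))


# Inline demo data; in practice the two strings would be read from
# .env.example and .env on the server.
template = "GRAFANA_VERSION=10.2.3\nGITHUB_TOKEN=\nNTFY_TOKEN=\n"
deployed = "GRAFANA_VERSION=10.2.3\nGITHUB_TOKEN=example-value\n"
for key in blank_or_missing(template, deployed):
    print(f"unset: {key}")
```

Run against the real files before docker compose up, a non-empty result is a reason to stop: the bridge exits without GITHUB_TOKEN and silently skips push without NTFY_TOKEN.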

Scraping metrics and probing endpoints

Prometheus scrapes three kinds of target: the internal exporters by container name, and two flavours of Blackbox probe (web URLs and mail ports). The Blackbox jobs use relabel_configs to rewrite each scrape so Prometheus asks the exporter to probe the real target. That is the standard multi-target exporter pattern, and it is the same shape for every probe type.

config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'primary'

alerting:
  alertmanagers:
    - static_configs:
        - targets: [alertmanager:9093]

rule_files:
  - /etc/prometheus/alerts.rules

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'loki'
    static_configs:
      - targets: ['loki:3100']

  # Web uptime / latency / TLS expiry
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://www.example.com
          - https://app.example.com
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # Mail submission port, probed by name over mail_internal
  - job_name: 'blackbox_mail_submission'
    metrics_path: /probe
    params:
      module: [smtp_starttls]
    scrape_interval: 60s
    scrape_timeout: 20s   # must exceed the module's own 15s timeout
    static_configs:
      - targets: [mailserver:587]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

Repeat the mail block for ports 25 (tcp_connect), 465 (smtps), and 993 (imaps), one job per module.
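
To make the relabelling concrete, here is a small Python sketch that replays the three relabel steps on a plain dictionary and shows the request Prometheus ends up making. The function name probe_url is made up for illustration; the label names and the exporter address are the ones from the config above.

```python
from urllib.parse import urlencode


def probe_url(target, module, exporter="blackbox-exporter:9115"):
    """Replay the three relabel_configs steps from the blackbox jobs."""
    labels = {"__address__": target}
    # 1. Copy the listed target into the ?target= query parameter.
    labels["__param_target"] = labels["__address__"]
    # 2. Keep the real target as the instance label, so alerts name the site.
    labels["instance"] = labels["__param_target"]
    # 3. Point the actual scrape at the exporter instead of the target.
    labels["__address__"] = exporter
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}/probe?{query}", labels["instance"]


url, instance = probe_url("https://example.com", "http_2xx")
print(url)
print(instance)
```

The scrape goes to the exporter, the exporter probes the site, and the resulting series still carry the site as their instance label, which is what the alert annotations print.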

Lesson: scrape_timeout must be larger than the probe module's own timeout. If it is not, Prometheus kills the scrape mid-handshake and you get phantom "down" readings on a mail server that is perfectly healthy. A STARTTLS dialogue legitimately takes several seconds, so a 15s module timeout needs a 20s scrape timeout around it. Prometheus also rejects a scrape_timeout longer than the job's scrape_interval, so the mail jobs cannot inherit the global 15s interval; here they use 60s.
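
The timeout relationship is easy to check mechanically. The helper below is a hypothetical sketch (check_timeouts is not a Prometheus or Blackbox API) that encodes two constraints: the scrape timeout has to exceed the module timeout, and it may not exceed the scrape interval, which Prometheus enforces when it loads the config.

```python
def check_timeouts(module_timeout, scrape_timeout, scrape_interval):
    """Return a list of problems for one probe job; all values in seconds."""
    problems = []
    if scrape_timeout <= module_timeout:
        problems.append("scrape_timeout must exceed the module timeout")
    if scrape_timeout > scrape_interval:
        problems.append("scrape_timeout must not exceed scrape_interval")
    return problems


# The mail submission job above: 15s module, 20s scrape timeout, 60s interval.
print(check_timeouts(15, 20, 60))
# The same module squeezed into a 10s timeout on a 15s interval.
print(check_timeouts(15, 10, 15))
```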

The probe modules

config/blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []        # empty = default, any 2xx (redirects are followed first)
      method: GET
      follow_redirects: true
      preferred_ip_protocol: "ip4"
      tls_config:
        insecure_skip_verify: false  # real cert validation for public sites

  tcp_connect:
    prober: tcp
    timeout: 10s
    tcp:
      preferred_ip_protocol: "ip4"

  smtp_starttls:
    prober: tcp
    timeout: 15s   # the STARTTLS dialogue can take several seconds
    tcp:
      preferred_ip_protocol: "ip4"
      query_response:
        - expect: "^220 "
        - send: "EHLO blackbox\r\n"
        - expect: "^250"
        - send: "STARTTLS\r\n"
        - expect: "^220"
        - starttls: true
      tls_config:
        insecure_skip_verify: true   # internal hop; only testing reachability

  smtps:
    prober: tcp
    timeout: 15s
    tcp:
      preferred_ip_protocol: "ip4"
      tls: true
      tls_config:
        insecure_skip_verify: true

  imaps:
    prober: tcp
    timeout: 10s
    tcp:
      preferred_ip_protocol: "ip4"
      tls: true
      tls_config:
        insecure_skip_verify: true

The smtp_starttls module actually speaks the protocol: it waits for the 220 greeting, sends EHLO, issues STARTTLS, then upgrades the connection. That proves the mail server is genuinely answering SMTP, not just accepting a bare TCP connection, which is the difference between "mail works" and "the port is open but Postfix is wedged." For public web targets I validate certificates for real; for the internal mail hops I do not, because there I am only testing reachability over a private network.
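
For readers who want to see the dialogue outside the exporter, the sketch below walks the same steps in Python. It is an illustration of the sequence, not how the Blackbox exporter is implemented; run_dialogue and probe are hypothetical names, and probe skips certificate verification to mirror insecure_skip_verify: true.

```python
import re
import socket
import ssl

# Same steps as the smtp_starttls module: (bytes to send or None, regex to wait for).
DIALOGUE = [
    (None, rb"^220 "),
    (b"EHLO blackbox\r\n", rb"^250"),
    (b"STARTTLS\r\n", rb"^220"),
]


def run_dialogue(readline, send):
    """Walk the dialogue; raise if the peer closes before a step matches."""
    for payload, pattern in DIALOGUE:
        if payload is not None:
            send(payload)
        while True:
            line = readline()
            if not line:
                raise ConnectionError(f"closed while waiting for {pattern!r}")
            if re.search(pattern, line):
                break  # lines that do not match (extra 250- lines) are skipped


def probe(host, port, timeout=15.0):
    """Return the negotiated TLS version, or raise on any failure."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("rb")
        run_dialogue(reader.readline, sock.sendall)
        ctx = ssl.create_default_context()
        ctx.check_hostname = False       # reachability only, as in the module
        ctx.verify_mode = ssl.CERT_NONE
        with ctx.wrap_socket(sock) as tls:
            return tls.version()
```

Calling probe("mailserver", 587) from a container on mail_internal should either return a TLS version string or raise, which is the same pass/fail signal probe_success carries.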

Alert rules

These cover the host, the public endpoints, and the containers. The thresholds and the for: windows matter as much as the expressions; they are what separate a useful page from a pager that cries wolf.

config/alerts.rules
groups:
  - name: node
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU above 85% for 5 minutes (current: {{ $value }}%)"

      - alert: CriticalCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Critical CPU usage on {{ $labels.instance }}"
          description: "CPU above 95% for 2 minutes (current: {{ $value }}%)"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory above 85% (current: {{ $value }}%)"

      - alert: LowDiskSpace
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 80
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk above 80% on / (current: {{ $value }}%)"

      - alert: CriticalDiskSpace
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 90
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Critical disk space on {{ $labels.instance }}"
          description: "Disk above 90% on / (current: {{ $value }}%)"

  - name: blackbox
    rules:
      - alert: SiteDown
        expr: probe_success == 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Site down: {{ $labels.instance }}"
          description: "{{ $labels.instance }} unreachable for 5 minutes"

      - alert: SlowResponse
        expr: probe_duration_seconds{job="blackbox"} > 2
        for: 3m
        labels: { severity: warning }
        annotations:
          summary: "Slow response: {{ $labels.instance }}"
          description: "{{ $labels.instance }} responding in {{ $value | printf \"%.2f\" }}s"

      - alert: SiteFlapping
        expr: changes(probe_success{job="blackbox"}[30m]) > 4
        for: 0m
        labels: { severity: warning }
        annotations:
          summary: "Site flapping: {{ $labels.instance }}"
          description: "{{ $labels.instance }} changed state >4 times in 30 minutes"

      - alert: SSLCertExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels: { severity: warning }
        annotations:
          summary: "SSL cert expiring: {{ $labels.instance }}"
          description: "Certificate expires in {{ $value | printf \"%.0f\" }} days"

      - alert: SSLCertExpiryCritical
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 7
        for: 1h
        labels: { severity: critical }
        annotations:
          summary: "SSL cert < 7 days: {{ $labels.instance }}"
          description: "Certificate expires in {{ $value | printf \"%.0f\" }} days"

  - name: containers
    rules:
      - alert: ContainerDown
        expr: absent(container_last_seen{name!=""}) or time() - container_last_seen{name!=""} > 60
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Container {{ $labels.name }} is down"
          description: "{{ $labels.name }} not seen for over a minute"

      - alert: ContainerHighMemory
        expr: (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}) * 100 > 85 and container_spec_memory_limit_bytes{name!=""} > 0
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Container {{ $labels.name }} high memory"
          description: "{{ $labels.name }} memory above 85% of its limit"

Two lessons are baked into those rules:

  • Scope the latency rule to web probes with job="blackbox". Without that label, SlowResponse also fires on the mail TLS/SMTP probes, which legitimately take two to three and a half seconds. Leaving the label off once produced hundreds of noise alerts over a few days and buried the alerts that actually mattered.
  • SiteDown uses for: 5m. A single failed scrape, a momentary timeout or a reverse-proxy reload, should not page anyone. A real outage outlasts five minutes comfortably. The flapping rule catches the in-between case where a service is bouncing up and down.
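
The effect of a for: window and of the flapping rule can be simulated in a few lines. This is a simplified model, not Prometheus code: firing_times treats each list entry as one rule evaluation, and changes counts value flips the way the changes() function does over a range.

```python
def firing_times(samples, interval_s, for_s):
    """Timestamps (seconds) at which an alert with the given for: is firing.

    samples holds one boolean per evaluation: True when the alert
    expression returned a result, False when it did not.
    """
    firing = []
    pending_since = None
    for i, active in enumerate(samples):
        t = i * interval_s
        if not active:
            pending_since = None     # any healthy evaluation resets the clock
            continue
        if pending_since is None:
            pending_since = t        # alert enters the pending state
        if t - pending_since >= for_s:
            firing.append(t)
    return firing


def changes(values):
    """Number of times consecutive samples differ."""
    return sum(1 for a, b in zip(values, values[1:]) if a != b)


blip = [True, True, False, False]        # two failed evaluations, then recovery
outage = [True] * 25                     # failing for just over six minutes
print(firing_times(blip, 15, 300))
print(firing_times(outage, 15, 300)[:1])
print(changes([1, 0, 1, 0, 1, 0]))
```

A blip never leaves the pending state, a sustained outage starts firing once the window has elapsed, and a bouncing service is the case SiteDown misses because every recovery resets the clock, which is why SiteFlapping exists.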

Routing alerts: two receivers, and injecting secrets safely

Alertmanager groups, deduplicates, and routes. The important move is the two receivers: email and a webhook. Email is the primary, human-readable channel; the webhook is where the issue tracking and push notification happen.

config/alertmanager.yml.template
global:
  smtp_smarthost: 'mailserver:587'
  smtp_require_tls: true

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match: { severity: critical }
      receiver: 'default'
      repeat_interval: 1h     # nag more often for criticals

receivers:
  - name: 'default'
    email_configs:
      - to: '${ALERT_TO_EMAIL}'
        from: '${ALERT_FROM_EMAIL}'
        smarthost: 'mailserver:587'
        auth_username: '${MAIL_SMTP_USER}'
        auth_password: '${MAIL_SMTP_PASSWORD}'
        require_tls: true
        tls_config: { insecure_skip_verify: true }
        send_resolved: true
    webhook_configs:
      - url: 'http://webhook-bridge:5001/webhook'
        send_resolved: true

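inhibit_rules:
  - source_match: { severity: 'critical' }
    target_match: { severity: 'warning' }
    # The paired rules have different names (CriticalCPUUsage vs HighCPUUsage),
    # so including alertname here would never inhibit anything.
    equal: ['instance']

The inhibit_rules block is a small quality-of-life win: when CPU on a host goes critical, you do not also get the warning-level alert for the same instance. One event, one notification. The rule matches on instance alone because the critical and warning rules carry different alert names; the trade-off is that any critical alert on an instance also mutes unrelated warnings on that instance while it fires.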
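
How the equal list decides whether one alert mutes another can be modelled in a few lines. This is a simplified sketch of the matching logic, not Alertmanager's code; inhibits is a hypothetical helper and the label sets are taken from the rules file above.

```python
def inhibits(source, target, equal):
    """True if a firing critical source alert mutes a warning target alert.

    An inhibit rule applies only when both alerts carry the same value for
    every label in `equal`; a label missing on both sides counts as equal.
    """
    if source.get("severity") != "critical":
        return False
    if target.get("severity") != "warning":
        return False
    return all(source.get(name, "") == target.get(name, "") for name in equal)


critical = {"alertname": "CriticalCPUUsage", "severity": "critical",
            "instance": "node-exporter:9100"}
warning = {"alertname": "HighCPUUsage", "severity": "warning",
           "instance": "node-exporter:9100"}

print(inhibits(critical, warning, ["alertname", "instance"]))
print(inhibits(critical, warning, ["instance"]))
```

Because the critical and warning rules are named differently, a list that includes alertname never matches this pair; matching on instance alone does.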

Injecting secrets without baking them into the image

Alertmanager does not expand environment variables in its config file. Rather than commit credentials, the container runs a tiny entrypoint that substitutes them at startup. It uses awk with index/substr, plain string replacement, not regex, so a password containing special characters is handled correctly instead of being mangled.

config/alertmanager-entrypoint.sh
#!/bin/sh
# Substitute env vars into the Alertmanager config at startup.
# awk index/substr (not regex) so passwords with $, &, \, | survive intact.
# Values are read from ENVIRON rather than passed with -v, because -v
# assignments process backslash escapes and would mangle a \ in a password.
awk '
  function replace(line, placeholder, value,    idx, len, out) {
    len = length(placeholder)
    out = ""
    # Consume the line left to right so an inserted value is never rescanned.
    while ((idx = index(line, placeholder)) > 0) {
      out = out substr(line, 1, idx-1) value
      line = substr(line, idx+len)
    }
    return out line
  }
  {
    line = $0
    line = replace(line, "${ALERT_TO_EMAIL}",     ENVIRON["ALERT_TO_EMAIL"])
    line = replace(line, "${ALERT_FROM_EMAIL}",    ENVIRON["ALERT_FROM_EMAIL"])
    line = replace(line, "${MAIL_SMTP_USER}",      ENVIRON["MAIL_SMTP_USER"])
    line = replace(line, "${MAIL_SMTP_PASSWORD}",  ENVIRON["MAIL_SMTP_PASSWORD"])
    print line
  }
  ' /etc/alertmanager/alertmanager.yml.template > /tmp/alertmanager-resolved.yml

exec /bin/alertmanager \
  --config.file=/tmp/alertmanager-resolved.yml \
  --storage.path=/alertmanager

The template with placeholders is what lives in Git. The resolved file, with real credentials in it, only ever exists at /tmp inside the running container; it is never committed, never baked into the image, and disappears when the container is removed.
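
The reason for insisting on literal replacement is easier to see side by side. The Python sketch below is an analogy, not the entrypoint itself: fill_literal mirrors the index/substr approach, while fill_regex shows what a regex-based substitution does to a password that contains a backslash sequence. Both function names are hypothetical.

```python
import re


def fill_literal(template, values):
    """Replace ${NAME} placeholders as plain strings; values are never parsed."""
    for name, value in values.items():
        template = template.replace("${" + name + "}", value)
    return template


def fill_regex(template, values):
    """For contrast only: re.sub interprets escapes in the replacement text."""
    for name, value in values.items():
        template = re.sub(re.escape("${" + name + "}"), value, template)
    return template


template = "auth_password: '${MAIL_SMTP_PASSWORD}'"
secret = {"MAIL_SMTP_PASSWORD": r"a\nb&c"}   # contains a literal backslash and n

print(repr(fill_literal(template, secret)))
print(repr(fill_regex(template, secret)))
```

The literal version returns the password byte for byte; the regex version turns the backslash sequence into a newline, which is the kind of silent mangling that produces an SMTP auth failure with a password that looks correct in .env.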

The webhook bridge: alerts become self-closing issues

This is the piece that turns notifications into tracked work. It is a single dependency-free Python file, standard library only, which is why it can run on a bare python:3.11-alpine image with the script bind-mounted in, no build step. It does three things: it opens a GitHub Issue on a firing alert, closes the matching issue on a resolved alert, and for critical alerts also pushes to the self-hosted ntfy server.

services/webhook-bridge/app.py
#!/usr/bin/env python3
"""
Alertmanager -> GitHub Issues bridge.
Creates/closes GitHub issues automatically; critical alerts also push via ntfy.
Standard library only; internal-only (not exposed via the reverse proxy).
"""
import json, os, sys, threading, time
import urllib.request, urllib.error
from http.server import HTTPServer, BaseHTTPRequestHandler

GITHUB_TOKEN = os.environ.get('GITHUB_TOKEN', '')
GITHUB_REPO  = os.environ.get('GITHUB_REPO', '')
GITHUB_API   = 'https://api.github.com'

NTFY_URL   = os.environ.get('NTFY_URL', 'http://ntfy').rstrip('/')
NTFY_TOPIC = os.environ.get('NTFY_TOPIC', 'infra-alerts')
NTFY_TOKEN = os.environ.get('NTFY_TOKEN', '')

_lock   = threading.Lock()
_recent = {}  # title -> timestamp, a 5-minute in-flight dedup window


def log(msg):
    print(msg, flush=True)


def gh(method, path, payload=None):
    data = json.dumps(payload).encode() if payload else None
    req  = urllib.request.Request(
        f'{GITHUB_API}{path}', data=data, method=method,
        headers={
            'Authorization': f'Bearer {GITHUB_TOKEN}',
            'Accept': 'application/vnd.github.v3+json',
            'Content-Type': 'application/json',
            'X-GitHub-Api-Version': '2022-11-28',
        })
    try:
        with urllib.request.urlopen(req, timeout=15) as r:  # never hang the single-threaded server
            return json.loads(r.read())
    except urllib.error.HTTPError as e:
        log(f"GitHub {method} {path} -> HTTP {e.code}: {e.read().decode()}")
        return None
    except Exception as e:
        log(f"GitHub {method} {path} error: {e}")
        return None


def ntfy(title, message, priority='5', tags='rotating_light'):
    """Push to self-hosted ntfy. Title/tags stay ASCII (HTTP headers are latin-1);
    emoji come from tag names. Body may be UTF-8. Best-effort; never blocks."""
    if not NTFY_TOKEN:
        log("ntfy skipped: NTFY_TOKEN not set")
        return
    try:
        req = urllib.request.Request(
            f'{NTFY_URL}/{NTFY_TOPIC}',
            data=(message or title).encode('utf-8'), method='POST',
            headers={
                'Authorization': f'Bearer {NTFY_TOKEN}',
                'Title': title, 'Priority': str(priority), 'Tags': tags,
            })
        urllib.request.urlopen(req, timeout=10)
        log(f"ntfy sent ({priority}): {title}")
    except Exception as e:
        log(f"ntfy publish error: {e}")


def issue_title(alert):
    labels   = alert.get('labels', {})
    name     = labels.get('alertname', 'Alert')
    severity = labels.get('severity', 'warning').upper()
    instance = labels.get('instance', '')
    title    = f"[{severity}] {name}"
    if instance:
        title += f": {instance}"
    return title


def find_open_issue(title):
    # List API, not search API: fine-grained PATs can't use the search endpoint.
    page = 1
    while True:
        issues = gh('GET', f'/repos/{GITHUB_REPO}/issues?state=open&per_page=100&page={page}')
        if not issues:
            return None
        for issue in issues:
            if issue.get('title') == title:
                return issue
        if len(issues) < 100:
            return None
        page += 1


def handle_firing(alert):
    title    = issue_title(alert)
    labels   = alert.get('labels', {})
    severity = labels.get('severity', 'warning')
    instance = labels.get('instance', 'N/A')
    summary  = alert.get('annotations', {}).get('summary', '')
    desc     = alert.get('annotations', {}).get('description', '')

    with _lock:
        now = time.time()
        for k in list(_recent):              # evict stale dedup entries
            if now - _recent[k] > 300:
                del _recent[k]
        if title in _recent:
            log(f"Deduplicated (in-flight): {title}")
            return
        if find_open_issue(title):
            log(f"Issue already open: {title}")
            return
        _recent[title] = now

    body = f"""## Alert Firing

| | |
|---|---|
| **Severity** | `{severity}` |
| **Instance** | `{instance}` |
| **Summary** | {summary} |

{desc}

---
*Auto-created by Alertmanager. Closes automatically when the alert resolves.*
"""
    # Try with labels first; fall back to no labels if they don't exist on the repo.
    result = gh('POST', f'/repos/{GITHUB_REPO}/issues',
                {'title': title, 'body': body, 'labels': ['monitoring', severity]})
    if not result:
        result = gh('POST', f'/repos/{GITHUB_REPO}/issues', {'title': title, 'body': body})
    if result:
        log(f"Created issue #{result['number']}: {title}")

    if severity == 'critical':
        ntfy(title, f"{summary}\n\n{desc}".strip() or instance,
             priority='5', tags='rotating_light')


def handle_resolved(alert):
    title    = issue_title(alert)
    severity = alert.get('labels', {}).get('severity', 'warning')
    issue = find_open_issue(title)
    if not issue:
        log(f"No open issue to close for: {title}")
        return
    number = issue['number']
    gh('POST', f'/repos/{GITHUB_REPO}/issues/{number}/comments',
       {'body': ':green_circle: Alert resolved - closing automatically.'})
    gh('PATCH', f'/repos/{GITHUB_REPO}/issues/{number}', {'state': 'closed'})
    log(f"Closed issue #{number}: {title}")

    if severity == 'critical':
        ntfy(f"Resolved: {title}", "Alert resolved - issue closed automatically.",
             priority='3', tags='white_check_mark')


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        try:
            data = json.loads(self.rfile.read(length))
            for alert in data.get('alerts', []):
                if alert.get('status') == 'firing':
                    handle_firing(alert)
                elif alert.get('status') == 'resolved':
                    handle_resolved(alert)
        except Exception as e:
            log(f"Webhook error: {e}")
        self.send_response(200); self.end_headers(); self.wfile.write(b'ok')

    def do_GET(self):
        self.send_response(200); self.end_headers()
        self.wfile.write(json.dumps({'status': 'ok', 'repo': GITHUB_REPO}).encode())

    def log_message(self, *_):
        pass  # suppress default request logging


if __name__ == '__main__':
    if not GITHUB_TOKEN:
        log("ERROR: GITHUB_TOKEN not set - issues will not be created")
        sys.exit(1)
    state = f"ntfy -> {NTFY_URL}/{NTFY_TOPIC}" if NTFY_TOKEN else "ntfy DISABLED"
    log(f"Webhook bridge on :5001 -> {GITHUB_REPO} | critical -> {state}")
    HTTPServer(('0.0.0.0', 5001), Handler).serve_forever()

Three details that are not obvious until you hit them:

  • Use the issues list API, not the search API. GitHub's fine-grained personal access tokens do not have access to the search endpoint, so matching an open issue by title means paging through state=open issues a hundred at a time. The first time I built this with the search API it returned 403s that looked like a permissions bug and were not.
  • Two layers of deduplication. An in-memory five-minute window catches in-flight bursts cheaply, and a find_open_issue check catches anything already filed. Without both, a flapping service fills the tracker with duplicates faster than you can close them.
  • The label fallback. Creating an issue with labels fails if those labels do not exist on the repo, so the bridge retries without them. The issue still gets filed.
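
As a standalone illustration of the first two points, here is a minimal sketch of how the two deduplication layers can fit together. It is not the bridge's own code: `DedupWindow`, `find_open_issue_paged`, `should_file`, and the injected `fetch_page` callable are hypothetical names, and `fetch_page(page, per_page)` stands in for whatever performs the `state=open` issues list request for one page.

```python
import time


class DedupWindow:
    """Layer 1: remember alert titles seen in the last `window` seconds."""

    def __init__(self, window=300, clock=time.monotonic):
        self.window = window
        self.clock = clock   # injectable so the window can be tested
        self._seen = {}      # title -> time it was first seen

    def seen_recently(self, title):
        now = self.clock()
        # Forget expired entries so the dict cannot grow without bound.
        self._seen = {t: ts for t, ts in self._seen.items()
                      if now - ts < self.window}
        if title in self._seen:
            return True
        self._seen[title] = now
        return False


def find_open_issue_paged(title, fetch_page, per_page=100):
    """Layer 2: page through open issues until a title matches."""
    page = 1
    while True:
        issues = fetch_page(page, per_page)
        for issue in issues:
            # The issues list endpoint also returns pull requests; skip them.
            if 'pull_request' in issue:
                continue
            if issue.get('title') == title:
                return issue
        if len(issues) < per_page:   # a short page means it was the last one
            return None
        page += 1


def should_file(title, window, fetch_page):
    """True only if neither layer has already seen this alert."""
    if window.seen_recently(title):
        return False
    return find_open_issue_paged(title, fetch_page) is None
```

The in-memory check runs first because it costs nothing; the paged lookup only happens for alerts that survive it, which keeps a flapping service from turning into a burst of API calls.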

Why the second channel is the whole point

Here is the trap this design exists to avoid. On a self-hosted box, the monitoring stack sends alert email through your own mail server. That is fine for almost every alert, until the alert you most need to receive is "the mail server is down." That alert is an email. It routes to the mail server. The mail server is down. You never hear about it. The monitoring did its job perfectly and you still found out from a user.

ntfy breaks that circular dependency. It is a small self-hosted push server; you install its app on your phone, subscribe to a topic, and the webhook bridge POSTs critical alerts straight to it over a path that shares nothing with the mail stack. Mail can be completely dead and the push still lands. Because the ntfy server has to be reachable from the public internet, it is locked to deny-all and given an explicit publisher:

config/ntfy-server.yml
base-url: https://ntfy.example.com
auth-default-access: deny-all

Then create the publishing user and token once, on the server:

bash
docker exec -it <ntfy-container> ntfy user add --role=user bridge      # set a password
docker exec -it <ntfy-container> ntfy access bridge infra-alerts rw    # publish + read
docker exec -it <ntfy-container> ntfy token add bridge                 # prints tk_...

Put the printed token into .env as NTFY_TOKEN and redeploy so the bridge picks it up. On the phone: install the ntfy app, add the server, subscribe to the topic, sign in. Then prove the independent path actually works:

bash
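# Use the full URL: a bare topic name goes to the CLI's default host, not this server.
docker exec <ntfy-container> ntfy publish --token tk_... https://ntfy.example.com/infra-alerts "test"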

If the phone buzzes, the channel that survives a mail outage is real. The token never gets committed; it goes in .env only.
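
The same check can be made over plain HTTP, which is closer to what a bridge would do than the CLI is. The sketch below is an assumption-laden illustration, not the bridge's actual `ntfy` helper: `build_ntfy_request` and `send` are hypothetical names, and the server URL and topic are the example values used above. It relies only on ntfy's documented publish interface: a POST to the topic URL with the message as the body, and `Authorization`, `Title`, `Priority`, and `Tags` headers.

```python
import urllib.request


def build_ntfy_request(base_url, topic, token, title, message,
                       priority='3', tags=''):
    """Build (but do not send) an authenticated ntfy publish request."""
    headers = {
        'Authorization': f'Bearer {token}',   # the tk_... access token
        'Title': title,
        'Priority': priority,                 # '1' (min) to '5' (max)
    }
    if tags:
        headers['Tags'] = tags                # comma-separated tag names
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{topic}",
        data=message.encode('utf-8'),
        headers=headers,
        method='POST',
    )


def send(request, timeout=10):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Calling `send(build_ntfy_request('https://ntfy.example.com', 'infra-alerts', token, 'Test', 'independent path check'))` from a machine outside the VPS exercises the public path end to end, including the reverse proxy.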

Logs: Loki and Promtail

Metrics tell you that something broke; logs tell you why. Promtail tails the host logs plus every container's stdout/stderr and ships them to Loki, which Grafana queries.

config/promtail.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: varlogs
    static_configs:
      - targets: [localhost]
        labels: { job: varlogs, host: server, __path__: /var/log/{syslog,kern.log} }

  # Auth log - SSH logins, sudo, PAM. Contains IPs/usernames; gated by Grafana auth.
  - job_name: auth
    static_configs:
      - targets: [localhost]
        labels: { job: auth, host: server, __path__: /var/log/auth.log }

  # Reverse-proxy JSON access logs -> web analytics.
  # Keep cardinality low: only host + method become labels; everything else
  # (path, status, duration, client IP) is parsed at query time via | json.
  - job_name: proxy-access
    pipeline_stages:
      - json:
          expressions:
            request_host:   RequestHost
            request_method: RequestMethod
      - labels:
          host:   request_host
          method: request_method
    static_configs:
      - targets: [localhost]
        labels: { job: proxy-access, __path__: /proxy-logs/access.log }

  - job_name: containers
    pipeline_stages:
      - docker: {}
    static_configs:
      - targets: [localhost]
        labels: { job: containers, host: server, __path__: /var/lib/docker/containers/*/*-json.log }
    relabel_configs:
      - source_labels: [__path__]
        regex: '/var/lib/docker/containers/([^/]+)/.+'
        target_label: container_id
        replacement: '$1'
        action: replace

Discipline: keep label cardinality low. Labels are an index. High-cardinality values (request paths, durations, client IPs, status codes) as labels will blow Loki up. Promote only low-cardinality fields (virtual host, HTTP method) to labels and parse everything else at query time with | json.
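
To make the cardinality argument concrete, here is a small back-of-the-envelope sketch. Every distinct combination of label values is its own stream, so the product of distinct values per label is an upper bound on the stream count. The counts below are invented for illustration, and `DownstreamStatus` is assumed to be the status field in the proxy's JSON access log (it is the name Traefik uses); adjust it to whatever your proxy emits.

```python
from math import prod


def max_streams(distinct_values_per_label):
    """Upper bound on Loki streams: product of distinct values per label."""
    return prod(distinct_values_per_label.values())


# Hypothetical small box: 8 virtual hosts, 5 HTTP methods as labels.
low = {'host': 8, 'method': 5}

# The same logs if path, status, and client IP were promoted to labels too.
high = dict(low, path=2000, status=15, client_ip=5000)

# Query-time parsing instead: the status is extracted by `| json` when the
# query runs, so it never becomes part of the index.
QUERY_5XX = '{job="proxy-access", host="example.com"} | json | DownstreamStatus >= 500'

if __name__ == '__main__':
    print('streams with low-cardinality labels: ', max_streams(low))
    print('streams with everything as a label:  ', max_streams(high))
    print('5xx query:', QUERY_5XX)
```

The real number of streams is lower than the bound because not every combination occurs, but the gap between the two cases is what matters: tens of streams versus a number that grows with every new path and visitor.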

The Loki config itself is mostly defaults: filesystem storage, single-binary mode, retention around 30 days. The one thing to plan for is that Loki's schema versions change between major releases. Keep the existing dated schema_config entries and add a new entry with a later from date when you move to a newer schema, so old data stays readable after an upgrade instead of needing migration.

Dashboards as code

Grafana provisions its datasources and dashboards from files on disk. No click-ops, and the entire observability layer comes back identically from Git on every redeploy.

provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100

provisioning/dashboards/provider.yml
apiVersion: 1
providers:
  - name: 'default'
    folder: ''
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards

Drop dashboard JSON files next to that provider and they appear on startup. The workflow is: build a dashboard in the UI, export its JSON, commit it. Every redeploy then comes up identical to the last. The dashboards I keep provisioned:

  • Site Uptime: a probe_success grid per target, HTTP response duration, uptime percentage over 24h / 7d / 30d, and TLS-expiry-in-days.
  • Infrastructure: host CPU, memory, disk gauges, network, container counts.
  • Logs: Loki log rates, container error rates, auth log, syslog.
  • Traefik / proxy: request rates, status-code breakdowns, p50/p95/p99 latency, 5xx rate.

Each UP/DOWN tile carries a data link straight to the site it represents, so the dashboard doubles as a launcher.
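
One thing that tends to break this workflow is a dashboard exported from the UI that references a datasource uid the provisioning files do not define. A small pre-commit check can catch that. The sketch below is an assumption, not part of the author's repo: `datasource_uids` and `unknown_datasources` are hypothetical helpers, the allowed uids are the two from the datasource file above, and `-- Grafana --` is included on the assumption that it is the uid of Grafana's built-in annotation source.

```python
import json

# uids defined in provisioning/datasources/prometheus.yml above.
PROVISIONED_UIDS = {'prometheus', 'loki'}
# Assumed uid of Grafana's built-in annotation datasource.
BUILTIN_UIDS = {'-- Grafana --'}


def datasource_uids(node):
    """Recursively collect every datasource uid referenced in a dashboard."""
    found = set()
    if isinstance(node, dict):
        ds = node.get('datasource')
        if isinstance(ds, dict) and 'uid' in ds:
            found.add(ds['uid'])
        for value in node.values():
            found |= datasource_uids(value)
    elif isinstance(node, list):
        for item in node:
            found |= datasource_uids(item)
    return found


def unknown_datasources(dashboard_json, allowed=PROVISIONED_UIDS | BUILTIN_UIDS):
    """Return uids the dashboard uses that provisioning does not define."""
    return datasource_uids(json.loads(dashboard_json)) - set(allowed)
```

Running `unknown_datasources(open(path).read())` over each committed dashboard file and failing on a non-empty result keeps a redeploy from coming up with panels that silently show no data.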

Deploying: manual, GitOps-style, no agents on the box

Deploys are deliberate. There is no auto-deploy on push; production only changes when I trigger it. A small "Deploy to Production" workflow dispatches an event to the orchestration repo, which runs on a self-hosted runner that SSHes into the VPS, syncs the named config files out of Git, pulls images, and brings the stack up. The pattern is identical for every project on the box, which is most of why a new project can be running end to end in about an hour.

.github/workflows/deploy-prod.yml
name: Deploy to Production
on:
  workflow_dispatch:
    inputs:
      force_recreate:
        description: 'Force recreate containers'
        type: boolean
        default: false
jobs:
  trigger-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger deploy
        uses: peter-evans/repository-dispatch@v4
        with:
          token: ${{ secrets.DEPLOY_TOKEN }}
          repository: youruser/infra-repo
          event-type: platform-deploy
          client-payload: |
            {
              "project_dir": "monitoring",
              "project_type": "platform",
              "sync_code": "true",
              "sync_files": "docker-compose.yml config/prometheus.yml config/alerts.rules config/blackbox.yml config/loki.yml config/promtail.yml config/ntfy-server.yml services/webhook-bridge/app.py config/alertmanager.yml.template config/alertmanager-entrypoint.sh config/provisioning/...",
              "force_recreate": "${{ inputs.force_recreate || false }}"
            }
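
For reference, the dispatch step boils down to a single call to GitHub's repository dispatch endpoint. The sketch below shows that call from Python; it is an illustration of the REST request, not the action's implementation, and `build_dispatch_request` is a hypothetical helper. The repository name and payload mirror the placeholder values in the workflow above.

```python
import json
import urllib.request


def build_dispatch_request(repo, token, event_type, client_payload):
    """Build (but do not send) a repository_dispatch request."""
    body = json.dumps({
        'event_type': event_type,          # matched by the receiving workflow
        'client_payload': client_payload,  # arbitrary JSON passed through
    }).encode('utf-8')
    return urllib.request.Request(
        f'https://api.github.com/repos/{repo}/dispatches',
        data=body,
        headers={
            'Accept': 'application/vnd.github+json',
            'Authorization': f'Bearer {token}',
        },
        method='POST',
    )


example = build_dispatch_request(
    'youruser/infra-repo', 'token-goes-here', 'platform-deploy',
    {'project_dir': 'monitoring', 'project_type': 'platform'},
)
```

Sending it with `urllib.request.urlopen(example)` returns an empty 204 response on success; the receiving workflow reads the payload from `github.event.client_payload`.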

The receiving deploy job, running on the self-hosted runner, does the boring-but-essential things in order: verify SSH, sync only the listed files, pull, bring up, then a health check that fails the run if any container is in an Exit / unhealthy / restarting state.

deploy job (skeleton)
jobs:
  deploy:
    runs-on: self-hosted
    environment: production
    timeout-minutes: 30
    steps:
      - name: Sync config files from repo
        if: steps.vars.outputs.sync_code == 'true'
        run: |
          ssh -i ~/.ssh/id_rsa_vps -p ${{ secrets.SSH_PORT }} \
            ${{ secrets.SSH_USER }}@${{ secrets.VPS_IP }} \
            "cd ~/docker/${{ steps.vars.outputs.project_dir }} && \
             git fetch origin main --depth=1 && \
             git checkout origin/main -- ${{ steps.vars.outputs.sync_files }}"

      - name: Pull images and deploy
        run: |
          ssh -i ~/.ssh/id_rsa_vps -p ${{ secrets.SSH_PORT }} \
            ${{ secrets.SSH_USER }}@${{ secrets.VPS_IP }} \
            "cd ~/docker/${{ steps.vars.outputs.project_dir }} && \
             docker compose pull && \
             docker compose up -d --remove-orphans"

      - name: Health check
        run: |
          sleep 30
          ssh -i ~/.ssh/id_rsa_vps -p ${{ secrets.SSH_PORT }} \
            ${{ secrets.SSH_USER }}@${{ secrets.VPS_IP }} \
            "cd ~/docker/${{ steps.vars.outputs.project_dir }} && \
             STOPPED=\$(docker compose ps | grep -ciE 'Exit|unhealthy|restarting' || true) && \
             if [ \"\$STOPPED\" -gt 0 ]; then docker compose ps && exit 1; fi && \
             echo 'All containers healthy'"
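
The matching logic in that health check is easy to get subtly wrong, because Docker capitalizes some states ("Exited", "Restarting") and not others ("unhealthy"). Here is the same rule as a small, testable Python sketch. `failing_containers` is a hypothetical helper and the sample status lines in the usage note are invented; it matches case-insensitively so the capitalization does not matter.

```python
import re

# Any of these words in a `docker compose ps` line means the deploy failed.
BAD_STATE = re.compile(r'exit|unhealthy|restarting', re.IGNORECASE)


def failing_containers(ps_lines):
    """Return the status lines that describe a container in a bad state."""
    return [line for line in ps_lines if BAD_STATE.search(line)]
```

Given lines such as `grafana  Up 2 minutes (healthy)` and `loki  Restarting (1) 5 seconds ago`, only the second is returned. Note that "healthy" does not match the "unhealthy" pattern, but a container whose name contains one of the words would, so keep service names clear of them.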

Every host, port, and user in there is a GitHub secret, never a literal, and the private key lives only on the runner. The box runs no deploy agent of its own and exposes no deploy webhook. The deployment surface is one SSH key and one short-lived token. There is also a separate local Compose file for working on dashboards: Grafana only, anonymous admin, no Traefik, live provisioning reload, on localhost:3000.

Backups, so a dead box is not a dead project

Monitoring tells you something broke; snapshots let you put it back. A scheduled GitHub Actions job takes a provider-level snapshot of the VPS every week during quiet hours, with a manual trigger for ad-hoc snapshots before a risky change. The provider here is Contabo via its API, but the shape is generic; any provider with a snapshot API works the same way.

.github/workflows/snapshots.yml
name: Weekly VPS Snapshots
on:
  schedule:
    - cron: '0 2 * * 0'   # Sundays 02:00 UTC, low traffic
  workflow_dispatch:
    inputs:
      target:
        type: choice
        options: [both, vps1, vps2]
        default: both
jobs:
  snapshot:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - run: sudo apt-get install -y jq uuid-runtime
      - name: Create snapshots
        env:
          # all provider credentials are GitHub secrets, never literals
          PROVIDER_CLIENT_ID:     ${{ secrets.PROVIDER_CLIENT_ID }}
          PROVIDER_CLIENT_SECRET: ${{ secrets.PROVIDER_CLIENT_SECRET }}
          VPS_INSTANCE_ID:        ${{ secrets.VPS_INSTANCE_ID }}
        run: ./scripts/backup/snapshot.sh "${{ github.event.inputs.target || 'both' }}"

Same discipline as everywhere else: every credential is a secret, the schedule is deliberate, and there is a manual escape hatch.

The discipline that keeps it boring

A few habits matter more than any single component:

  • Pin every image and upgrade on purpose. cAdvisor is the cautionary tale: its latest tag has shipped builds with a constant CPU overhead of around 13%, so it is pinned to v0.47.2 and only moves after a tested bump. Dependabot proposes; nothing applies itself.
  • Verify the alert path end to end before trusting it. Stop a non-critical container, watch a GitHub Issue open, start it again, watch the issue close. Then make the mail stack unreachable and confirm the ntfy push still lands. An alerting pipeline you have never seen fire is a guess, not a safety net.
  • Keep secrets out of Git, always. Templates and .env.example live in the repo; real values live only on the server and in Actions secrets. The entrypoint renders credentials into a config file under /tmp at runtime, so they never sit in the tracked config.
  • Reproduce, do not repair. Because the whole stack is one Compose file plus pinned images plus tracked config, recovering from a bad state is a redeploy, not a debugging session.
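
The bridge half of the end-to-end check can also be exercised without stopping a container, by posting a hand-built Alertmanager-style payload to it. The sketch below is an assumption about what the bridge reads: it fills in the fields visible in the handler (`status`, `labels.severity`) plus `alertname`, `instance`, and `summary`/`description` annotations, which are guesses at what the title and body are built from. `build_test_payload` and `post_to_bridge` are hypothetical names, and port 5001 is the one the bridge listens on.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_payload(status, alertname='BridgeSelfTest',
                       severity='warning', instance='test-instance'):
    """Build a minimal Alertmanager-style webhook body as bytes."""
    alert = {
        'status': status,   # 'firing' or 'resolved'
        'labels': {'alertname': alertname,
                   'severity': severity,
                   'instance': instance},
        'annotations': {'summary': 'Bridge self-test',
                        'description': 'Synthetic alert, safe to ignore.'},
        'startsAt': datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps({'version': '4', 'status': status,
                       'alerts': [alert]}).encode('utf-8')


def post_to_bridge(payload, url='http://localhost:5001/'):
    """POST the payload to the bridge and return the HTTP status code."""
    req = urllib.request.Request(
        url, data=payload,
        headers={'Content-Type': 'application/json'}, method='POST')
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Posting a `firing` payload and then a `resolved` one with the same labels should open and close a real issue, so run it against a repo where that noise is acceptable. It does not replace the container-stop test, because it skips the Prometheus and Alertmanager half of the path.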

What this design buys

  • One docker compose up reproduces the entire observability stack, dashboards and all, from pinned images and tracked config.
  • No secret is ever committed; everything sensitive is an environment variable injected at runtime.
  • Alerts survive the failure of their own delivery channel, because criticals have a second path that shares no dependencies with email.
  • Every alert is a tracked, self-closing work item, so the issue tracker is a live picture of what is broken right now and goes quiet on its own when things recover.

None of the components are unusual. The value is in the wiring: the two-channel fan-out, the runtime secret injection, the cardinality-aware log labels, and the bridge that closes the loop between "something fired" and "someone is tracking it." That is the part most tutorials skip, and it is the part that decides whether your monitoring actually wakes you up when it matters, and stops bothering you when it does not.

The result, on one VPS I have been carrying forward since 2019, is uptime sitting at 97-100% month after month. Not because the hardware is special, but because the box tells me the moment it is not.

Open source throughout, version-pinned, and updated regularly via Dependabot and deliberate manual bumps.