[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"$fNB6lrgveKJoh6AL1gsM8_YurWTUXmjLGYrnTN2nJbXY":3},{"id":4,"title":5,"teaser":6,"body":7,"slug":8,"date":9,"coverImage":10,"tags":15},"305bae00-b32b-4288-b202-9ad23615ea6e","Self-hosted monitoring on a single VPS","","\u003Carticle class=\"wrap\">\u003Cheader class=\"post\">\u003Cp class=\"lede\">How I monitor a small fleet of self-hosted projects on one Linux box: what runs, why it is wired the way it is, and the handful of decisions that took a few outages to get right.\u003C\u002Fp>\u003C\u002Fheader>\u003Cnav class=\"toc\">\u003Cp>Contents\u003C\u002Fp>\u003Col>\u003Cli data-list-item-id=\"eefe2cd4ba55cbcfc51bdb97c76fb5f33\">\u003Ca href=\"#start\">Starting point: stock Ubuntu + Docker\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"e9b13569bd35b45b0bac2427842e037e1\">\u003Ca href=\"#runs\">What runs\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"e1509c66f6cfaf02556a6282dce855a85\">\u003Ca href=\"#arch\">Architecture at a glance\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"e0e810a8e9156040e7e8c78932592ef6e\">\u003Ca href=\"#net\">Networks and the reverse proxy\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"e628821f4c16a2528158cb39a5fe283db\">\u003Ca href=\"#compose\">The Compose file\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"e9c135226fcce59be75aad3ac9898eb13\">\u003Ca href=\"#env\">Secrets live in .env\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"e4a35e344122fa9f2a7c6b2a8d6045960\">\u003Ca href=\"#scrape\">Scraping and probing\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"efbe2fb56ef7aa06cff15634f168efa7c\">\u003Ca href=\"#rules\">Alert rules\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"e0e65e89911df4c609ac1c4a9eae92776\">\u003Ca href=\"#route\">Routing alerts\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"e64ad2f40b9a1ccaa28a68b4a084580dd\">\u003Ca href=\"#bridge\">The webhook 
bridge\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"ea402be686fc82de33cf6f7883fb48c7c\">\u003Ca href=\"#ntfy\">The second channel\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"ea3b04bb32186cce7068fce1e74caf570\">\u003Ca href=\"#logs\">Logs: Loki + Promtail\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"efa64f5babecfcb5f7874e00001071950\">\u003Ca href=\"#dash\">Dashboards as code\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"e2e91517e756b42373eab81bc0fd13de2\">\u003Ca href=\"#deploy\">Deploying\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"e935d4ae90e30916c88f907fab6a9f6a8\">\u003Ca href=\"#backup\">Backups\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"e78c619cfce484715551018fbd0a0ed74\">\u003Ca href=\"#discipline\">The discipline\u003C\u002Fa>\u003C\u002Fli>\u003Cli data-list-item-id=\"ecb5639fbd7d60eceb9370bef005f12ac\">\u003Ca href=\"#buys\">What it buys\u003C\u002Fa>\u003C\u002Fli>\u003C\u002Fol>\u003C\u002Fnav>\u003Cmain>\u003Cp>There is no product to sell here and nothing exotic. Everything is open source, pinned to a version, and reproduced from a single Git repo. The interesting part is not the component list, it is the wiring: how an alert reaches me when the thing that is broken is the thing that delivers alerts, and how a fired alert becomes a tracked task instead of a notification I swipe away.\u003C\u002Fp>\u003Ch2 id=\"start\">The starting point: a stock Ubuntu host running Docker\u003C\u002Fh2>\u003Cp>The baseline for everything below is a plain server: a current Ubuntu LTS, Docker Engine with the Compose plugin, and a reverse proxy already terminating TLS. In my case that is a single Contabo VPS in Germany, the same instance I have upgraded since 2019, never reinstalled, just improved in place. Nothing here is provider-specific; any VPS or bare-metal box with Docker will do.\u003C\u002Fp>\u003Cp>A stock Docker host gives you containers and not much else. 
It will happily keep running long after something inside it has quietly died. You do not know:\u003C\u002Fp>\u003Cul>\u003Cli data-list-item-id=\"ed39e9b4c7d689e89c59eb042e5953897\">whether a public site is actually answering, or how fast\u003C\u002Fli>\u003Cli data-list-item-id=\"e72545bef076fe32213bdfa6a97b5d398\">when a TLS certificate is about to expire\u003C\u002Fli>\u003Cli data-list-item-id=\"edf98545b1ce4b0c26cb84970af1f2fe3\">whether the host is running out of CPU, memory, or disk\u003C\u002Fli>\u003Cli data-list-item-id=\"ef5e8fc942be653af9404acca25b8103d\">whether a container restarted at 3am\u003C\u002Fli>\u003Cli data-list-item-id=\"e1087bb6caf72553ca748f15746bac6d3\">\u003Cem>why\u003C\u002Fem> something broke, because the logs are scattered across \u003Ccode>docker logs\u003C\u002Fcode>, \u003Ccode>journalctl\u003C\u002Fcode>, and files under \u003Ccode>\u002Fvar\u002Flog\u003C\u002Fcode>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>This stack is the layer you add on top to answer those questions and, more importantly, to tell you about them before a user does. Two principles drive the whole design:\u003C\u002Fp>\u003Col class=\"principles\">\u003Cli data-list-item-id=\"e71071e8f993f4018faef2862fe2f42aa\">\u003Cstrong>An alert has to survive the failure of its own delivery channel.\u003C\u002Fstrong> On a self-hosted box, alert email is often delivered by your own mail server. The day the mail server goes down, the alert that says so is an email, routed to the dead mail server. You never see it. So critical alerts get a second, fully independent path.\u003C\u002Fli>\u003Cli data-list-item-id=\"ef9cbc12f0b54e2ecb53ae30dae943a3e\">\u003Cstrong>An alert is a work item, not a toast.\u003C\u002Fstrong> Every firing alert opens a tracked issue and closes itself when the problem resolves. 
The issue tracker becomes a live, accurate list of what is broken right now, and goes quiet on its own when things recover.\u003C\u002Fli>\u003C\u002Fol>\u003Ch2 id=\"runs\">What runs\u003C\u002Fh2>\u003Cdiv class=\"tablewrap\">\u003Ctable class=\"table\">\u003Cthead>\u003Ctr>\u003Cth>Component\u003C\u002Fth>\u003Cth>Job\u003C\u002Fth>\u003Cth>Pinned version\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>Prometheus\u003C\u002Ftd>\u003Ctd>Metrics collection and alert evaluation\u003C\u002Ftd>\u003Ctd>\u003Ccode>v2.48.1\u003C\u002Fcode>\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Grafana\u003C\u002Ftd>\u003Ctd>Dashboards and visualisation\u003C\u002Ftd>\u003Ctd>\u003Ccode>10.2.3\u003C\u002Fcode>\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Alertmanager\u003C\u002Ftd>\u003Ctd>Alert routing, grouping, deduplication\u003C\u002Ftd>\u003Ctd>\u003Ccode>v0.26.0\u003C\u002Fcode>\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Loki\u003C\u002Ftd>\u003Ctd>Log aggregation\u003C\u002Ftd>\u003Ctd>\u003Ccode>3.0.0\u003C\u002Fcode>\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Promtail\u003C\u002Ftd>\u003Ctd>Log shipping\u003C\u002Ftd>\u003Ctd>\u003Ccode>2.9.0\u003C\u002Fcode>\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>cAdvisor\u003C\u002Ftd>\u003Ctd>Per-container metrics\u003C\u002Ftd>\u003Ctd>\u003Ccode>v0.47.2\u003C\u002Fcode>\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Node Exporter\u003C\u002Ftd>\u003Ctd>Host metrics\u003C\u002Ftd>\u003Ctd>\u003Ccode>v1.7.0\u003C\u002Fcode>\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Blackbox Exporter\u003C\u002Ftd>\u003Ctd>HTTP \u002F TLS \u002F SMTP probing\u003C\u002Ftd>\u003Ctd>\u003Ccode>v0.25.0\u003C\u002Fcode>\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>ntfy\u003C\u002Ftd>\u003Ctd>Self-hosted push for critical alerts\u003C\u002Ftd>\u003Ctd>\u003Ccode>v2.11.0\u003C\u002Fcode>\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>webhook-bridge\u003C\u002Ftd>\u003Ctd>Turns alerts into self-closing 
GitHub Issues\u003C\u002Ftd>\u003Ctd>\u003Ccode>internal\u003C\u002Fcode>\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003C\u002Fdiv>\u003Cp>Everything is pinned. Versions live in \u003Ccode>.env\u003C\u002Fcode>, so an upgrade is an explicit edit and a redeploy, never a surprise on a \u003Ccode>docker compose pull\u003C\u002Fcode>. Dependabot proposes bumps; I apply them on purpose. cAdvisor is pinned for a concrete reason covered later.\u003C\u002Fp>\u003Ch2 id=\"arch\">Architecture at a glance\u003C\u002Fh2>\u003Cfigure class=\"diagram\">\u003Cpre>                          +--------------+\n   host + containers ---&gt; |  Prometheus  | --&gt; alert rules\n   (node-exporter,        +------+-------+\n    cAdvisor)                    | fires\n                                 v\n   public URLs ---&gt; Blackbox -&gt; +--------------+   email   +-------------+\n   (HTTP\u002FTLS\u002FSMTP) exporter     | Alertmanager | --------&gt; | mail server |\n                                +------+-------+           +-------------+\n                                       | webhook\n                                       v\n                                +---------------+ --&gt; GitHub Issues (open\u002Fclose)\n                                | webhook-bridge| --&gt; ntfy push (critical only)\n                                +---------------+\n   syslog \u002F auth \u002F proxy ---&gt; Promtail -&gt; Loki -&gt; Grafana (dashboards + Explore)\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>Metrics flow into Prometheus, which evaluates rules and hands firing alerts to Alertmanager. Alertmanager fans out to two receivers: email through the mail server, and a webhook. The webhook bridge files GitHub Issues and, for critical alerts only, pushes to a self-hosted ntfy server that shares nothing with the mail stack. 
Logs flow separately through Promtail into Loki, queried from Grafana.\u003C\u002Fp>\u003Ch2 id=\"net\">Networks and the reverse proxy\u003C\u002Fh2>\u003Cp>The stack joins two pre-existing external Docker networks: \u003Ccode>web\u003C\u002Fcode> (where the reverse proxy lives) and \u003Ccode>mail_internal\u003C\u002Fcode> (a private network shared with the mail server). Create them once if they do not exist:\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>bash\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">docker network create web\ndocker network create mail_internal\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>Only two services are ever reachable from the public internet: Grafana and ntfy. Everything else (Prometheus, Alertmanager, the exporters, the webhook bridge) is internal-only and carries \u003Ccode>traefik.enable=false\u003C\u002Fcode>. Grafana sits behind an IP-allowlist middleware on top of its own login, because a login page on the public internet is just a brute-force target waiting to be found. All admin URLs are IP-restricted; that is a standing rule across the whole box. The labels below assume Traefik; for nginx-proxy or Caddy the routing changes but nothing else does.\u003C\u002Fp>\u003Ch2 id=\"compose\">The Compose file\u003C\u002Fh2>\u003Cp>This is the single file that defines the stack. 
Substitute your own domains and paths.\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>docker-compose.yml\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">services:\n  grafana:\n    image: grafana\u002Fgrafana:${GRAFANA_VERSION:-10.2.3}\n    restart: unless-stopped\n    volumes:\n      - grafana_data:\u002Fvar\u002Flib\u002Fgrafana\n      - .\u002Fconfig\u002Fprovisioning:\u002Fetc\u002Fgrafana\u002Fprovisioning:ro\n    environment:\n      GF_SECURITY_ADMIN_USER: ${GF_SECURITY_ADMIN_USER:-admin}\n      GF_SECURITY_ADMIN_PASSWORD: ${GF_SECURITY_ADMIN_PASSWORD}\n      GF_USERS_ALLOW_SIGN_UP: \"false\"\n      GF_SMTP_ENABLED: \"true\"\n      GF_SMTP_HOST: mailserver\n      GF_SMTP_PORT: \"587\"\n      GF_SMTP_USER: ${MAIL_SMTP_USER}\n      GF_SMTP_PASSWORD: ${MAIL_SMTP_PASSWORD}\n      GF_SMTP_FROM_ADDRESS: ${ALERT_FROM_EMAIL}\n      GF_SMTP_FROM_NAME: \"Grafana Alerts\"\n      GF_SMTP_STARTTLS_POLICY: MandatoryStartTLS\n      GF_SMTP_SKIP_VERIFY: \"true\"\n      GF_LOG_LEVEL: warn\n    networks: [web, default, mail_internal]\n    labels:\n      - \"traefik.enable=true\"\n      - \"traefik.http.routers.grafana.rule=Host(`${GRAFANA_DOMAIN}`)\"\n      - \"traefik.http.routers.grafana.tls.certresolver=lets-encrypt\"\n      - \"traefik.http.routers.grafana.middlewares=admin-ipallowlist@file\"\n      - \"traefik.http.services.grafana.loadbalancer.server.port=3000\"\n      - \"traefik.docker.network=web\"\n\n  prometheus:\n    image: prom\u002Fprometheus:${PROMETHEUS_VERSION:-v2.48.1}\n    restart: unless-stopped\n    volumes:\n      - .\u002Fconfig\u002Fprometheus.yml:\u002Fetc\u002Fprometheus\u002Fprometheus.yml:ro\n      - .\u002Fconfig\u002Falerts.rules:\u002Fetc\u002Fprometheus\u002Falerts.rules:ro\n      - prometheus_data:\u002Fprometheus\n    command:\n      - '--config.file=\u002Fetc\u002Fprometheus\u002Fprometheus.yml'\n      - '--storage.tsdb.path=\u002Fprometheus'\n      - '--storage.tsdb.retention.time=30d'\n      - 
'--web.enable-lifecycle'\n    networks: [web, default]\n    labels: [\"traefik.enable=false\"]\n\n  alertmanager:\n    image: prom\u002Falertmanager:${ALERTMANAGER_VERSION:-v0.26.0}\n    restart: unless-stopped\n    env_file: .env\n    entrypoint: [\"\u002Fbin\u002Fsh\", \"\u002Fetc\u002Falertmanager\u002Falertmanager-entrypoint.sh\"]\n    volumes:\n      - .\u002Fconfig\u002Falertmanager.yml.template:\u002Fetc\u002Falertmanager\u002Falertmanager.yml.template:ro\n      - .\u002Fconfig\u002Falertmanager-entrypoint.sh:\u002Fetc\u002Falertmanager\u002Falertmanager-entrypoint.sh:ro\n      - alertmanager_data:\u002Falertmanager\n    networks: [default, mail_internal]\n    labels: [\"traefik.enable=false\"]\n\n  loki:\n    image: grafana\u002Floki:${LOKI_VERSION:-3.0.0}\n    restart: unless-stopped\n    volumes:\n      - loki_data:\u002Floki\n      - .\u002Fconfig\u002Floki.yml:\u002Fetc\u002Floki\u002Floki.yml:ro\n    command: -config.file=\u002Fetc\u002Floki\u002Floki.yml -log.level=warn\n    labels: [\"traefik.enable=false\"]\n\n  promtail:\n    image: grafana\u002Fpromtail:${PROMTAIL_VERSION:-2.9.0}\n    restart: unless-stopped\n    volumes:\n      - \u002Fvar\u002Flog:\u002Fvar\u002Flog:ro\n      - \u002Fvar\u002Flib\u002Fdocker\u002Fcontainers:\u002Fvar\u002Flib\u002Fdocker\u002Fcontainers:ro\n      - .\u002Fconfig\u002Fpromtail.yml:\u002Fetc\u002Fpromtail\u002Fconfig.yml:ro\n      - promtail_positions:\u002Fvar\u002Flib\u002Fpromtail\n      - ${PROXY_LOGS_PATH:-\u002Fsrv\u002Fproxy\u002Flogs}:\u002Fproxy-logs:ro\n    command: -config.file=\u002Fetc\u002Fpromtail\u002Fconfig.yml\n    labels: [\"traefik.enable=false\"]\n\n  cadvisor:\n    image: gcr.io\u002Fcadvisor\u002Fcadvisor:${CADVISOR_VERSION:-v0.47.2}\n    restart: unless-stopped\n    privileged: true\n    mem_limit: 1g\n    volumes:\n      - \u002F:\u002Frootfs:ro\n      - \u002Fvar\u002Frun:\u002Fvar\u002Frun:ro\n      - \u002Fsys:\u002Fsys:ro\n      - 
\u002Fvar\u002Flib\u002Fdocker:\u002Fvar\u002Flib\u002Fdocker:ro\n    command:\n      - '--docker_only=true'\n      - '--housekeeping_interval=30s'\n    labels: [\"traefik.enable=false\"]\n\n  webhook-bridge:\n    image: python:3.11-alpine\n    restart: unless-stopped\n    volumes:\n      - .\u002Fservices\u002Fwebhook-bridge\u002Fapp.py:\u002Fapp\u002Fapp.py:ro\n    working_dir: \u002Fapp\n    command: python app.py\n    environment:\n      GITHUB_TOKEN: ${GITHUB_TOKEN}\n      GITHUB_REPO: ${GITHUB_REPO}\n      NTFY_URL: ${NTFY_URL:-http:\u002F\u002Fntfy}\n      NTFY_TOPIC: ${NTFY_TOPIC:-infra-alerts}\n      NTFY_TOKEN: ${NTFY_TOKEN}\n    labels: [\"traefik.enable=false\"]\n\n  ntfy:\n    image: binwiederhier\u002Fntfy:${NTFY_VERSION:-v2.11.0}\n    restart: unless-stopped\n    command: serve\n    environment:\n      TZ: ${TZ:-UTC}\n    volumes:\n      - .\u002Fconfig\u002Fntfy-server.yml:\u002Fetc\u002Fntfy\u002Fserver.yml:ro\n      - ntfy_data:\u002Fvar\u002Flib\u002Fntfy\n    networks: [web, default]\n    healthcheck:\n      test: [\"CMD-SHELL\", \"wget -q -O - http:\u002F\u002Flocalhost:80\u002Fv1\u002Fhealth 2&gt;\u002Fdev\u002Fnull | grep -q '\\\"healthy\\\":true' || exit 1\"]\n      interval: 60s\n      timeout: 10s\n      retries: 3\n      start_period: 30s\n    labels:\n      - \"traefik.enable=true\"\n      - \"traefik.http.routers.ntfy.rule=Host(`${NTFY_DOMAIN}`)\"\n      - \"traefik.http.routers.ntfy.tls.certresolver=lets-encrypt\"\n      - \"traefik.http.services.ntfy.loadbalancer.server.port=80\"\n      - \"traefik.docker.network=web\"\n\n  blackbox-exporter:\n    image: prom\u002Fblackbox-exporter:${BLACKBOX_VERSION:-v0.25.0}\n    restart: unless-stopped\n    networks: [default, mail_internal, web]\n    volumes:\n      - .\u002Fconfig\u002Fblackbox.yml:\u002Fetc\u002Fblackbox_exporter\u002Fconfig.yml:ro\n    labels: [\"traefik.enable=false\"]\n\n  node-exporter:\n    image: prom\u002Fnode-exporter:${NODE_EXPORTER_VERSION:-v1.7.0}\n    restart: 
unless-stopped\n    pid: host\n    volumes:\n      - \u002Fproc:\u002Fhost\u002Fproc:ro\n      - \u002Fsys:\u002Fhost\u002Fsys:ro\n      - \u002F:\u002Frootfs:ro\n    command:\n      - '--path.procfs=\u002Fhost\u002Fproc'\n      - '--path.rootfs=\u002Frootfs'\n      - '--path.sysfs=\u002Fhost\u002Fsys'\n      - '--collector.filesystem.mount-points-exclude=^\u002F(sys|proc|dev|host|etc)($$|\u002F)'\n    labels: [\"traefik.enable=false\"]\n\nnetworks:\n  web: { external: true }\n  mail_internal: { external: true }\n\nvolumes:\n  grafana_data:\n  prometheus_data:\n  loki_data:\n  alertmanager_data:\n  promtail_positions:\n  ntfy_data:\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>A few choices worth pointing out:\u003C\u002Fp>\u003Cul>\u003Cli data-list-item-id=\"e5c3b34aab9685c643e44864e9f443795\">\u003Cstrong>The Blackbox exporter sits on three networks on purpose.\u003C\u002Fstrong> Being on the internal networks lets it probe other containers by name (\u003Ccode>service:port\u003C\u002Fcode>) without that traffic crossing the host firewall or making a pointless round-trip out through the public reverse proxy. 
The same placement lets the mail-port probes reach the mail server directly over \u003Ccode>mail_internal\u003C\u002Fcode>.\u003C\u002Fli>\u003Cli data-list-item-id=\"e07240895ab6fe47a6a1538f6db54e93d\">\u003Cstrong>node-exporter runs with \u003C\u002Fstrong>\u003Ccode>\u003Cstrong>pid: host\u003C\u002Fstrong>\u003C\u002Fcode> and read-only mounts of \u003Ccode>\u002Fproc\u003C\u002Fcode>, \u003Ccode>\u002Fsys\u003C\u002Fcode>, and \u003Ccode>\u002F\u003C\u002Fcode>, with the pseudo-filesystems excluded from the filesystem collector so you do not alert on tmpfs and friends.\u003C\u002Fli>\u003Cli data-list-item-id=\"e1719ceced228509bbbcbeb5c15838a61\">\u003Cstrong>cAdvisor gets a \u003C\u002Fstrong>\u003Ccode>\u003Cstrong>mem_limit\u003C\u002Fstrong>\u003C\u002Fcode>\u003Cstrong> and \u003C\u002Fstrong>\u003Ccode>\u003Cstrong>--docker_only=true\u003C\u002Fstrong>\u003C\u002Fcode>\u003Cstrong>.\u003C\u002Fstrong> It will otherwise account for everything on the host and grow unbounded over time.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2 id=\"env\">Secrets live in .env, and .env is never committed\u003C\u002Fh2>\u003Cp>Nothing sensitive goes into Git. The repo ships a \u003Ccode>.env.example\u003C\u002Fcode> template; the real \u003Ccode>.env\u003C\u002Fcode> is git-ignored and only exists on the server. 
Every password, token, and SMTP credential is an environment variable.\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>.env.example\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\"># Grafana\nGRAFANA_VERSION=10.2.3\nGF_SECURITY_ADMIN_USER=admin\nGF_SECURITY_ADMIN_PASSWORD=\nGRAFANA_DOMAIN=grafana.example.com\n\n# Pin everything; upgrade deliberately\nPROMETHEUS_VERSION=v2.48.1\nALERTMANAGER_VERSION=v0.26.0\nLOKI_VERSION=3.0.0\nPROMTAIL_VERSION=2.9.0\nCADVISOR_VERSION=v0.47.2\nBLACKBOX_VERSION=v0.25.0\nNODE_EXPORTER_VERSION=v1.7.0\nNTFY_VERSION=v2.11.0\n\n# SMTP for email alerts (Grafana + Alertmanager)\nMAIL_SMTP_USER=\nMAIL_SMTP_PASSWORD=\nALERT_FROM_EMAIL=alerts@example.com\nALERT_TO_EMAIL=you@example.com\n\n# GitHub Issues bridge - fine-grained PAT, Issues read and write on the repo\nGITHUB_TOKEN=\nGITHUB_REPO=youruser\u002Fyour-infra-repo\n\n# Self-hosted push (independent of the mail stack)\nNTFY_DOMAIN=ntfy.example.com\nNTFY_TOPIC=infra-alerts\nNTFY_TOKEN=\nNTFY_URL=http:\u002F\u002Fntfy\n\n# Reverse-proxy access log dir, bind-mounted into Promtail\nPROXY_LOGS_PATH=\u002Fsrv\u002Fproxy\u002Flogs\n\nTZ=UTC\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>If a value is blank above, it is a secret you fill in on the server and nowhere else.\u003C\u002Fp>\u003Ch2 id=\"scrape\">Scraping metrics and probing endpoints\u003C\u002Fh2>\u003Cp>Prometheus scrapes three kinds of target: the internal exporters by container name, and two flavours of Blackbox probe (web URLs and mail ports). The Blackbox jobs use \u003Ccode>relabel_configs\u003C\u002Fcode> to rewrite each scrape so Prometheus asks the exporter to probe the real target. 
That is the standard multi-target exporter pattern, and it is the same shape for every probe type.\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>config\u002Fprometheus.yml\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">global:\n  scrape_interval: 15s\n  evaluation_interval: 15s\n  external_labels:\n    monitor: 'primary'\n\nalerting:\n  alertmanagers:\n    - static_configs:\n        - targets: [alertmanager:9093]\n\nrule_files:\n  - \u002Fetc\u002Fprometheus\u002Falerts.rules\n\nscrape_configs:\n  - job_name: 'prometheus'\n    static_configs:\n      - targets: ['localhost:9090']\n\n  - job_name: 'node-exporter'\n    static_configs:\n      - targets: ['node-exporter:9100']\n\n  - job_name: 'cadvisor'\n    static_configs:\n      - targets: ['cadvisor:8080']\n\n  - job_name: 'loki'\n    static_configs:\n      - targets: ['loki:3100']\n\n  # Web uptime \u002F latency \u002F TLS expiry\n  - job_name: 'blackbox'\n    metrics_path: \u002Fprobe\n    params:\n      module: [http_2xx]\n    static_configs:\n      - targets:\n          - https:\u002F\u002Fexample.com\n          - https:\u002F\u002Fwww.example.com\n          - https:\u002F\u002Fapp.example.com\n          - https:\u002F\u002Fapi.example.com\u002Fhealth\n    relabel_configs:\n      - source_labels: [__address__]\n        target_label: __param_target\n      - source_labels: [__param_target]\n        target_label: instance\n      - target_label: __address__\n        replacement: blackbox-exporter:9115\n\n  # Mail submission port, probed by name over mail_internal\n  - job_name: 'blackbox_mail_submission'\n    metrics_path: \u002Fprobe\n    params:\n      module: [smtp_starttls]\n    scrape_interval: 60s\n    scrape_timeout: 20s   # must exceed the module's own 15s timeout\n    static_configs:\n      - targets: [mailserver:587]\n    relabel_configs:\n      - source_labels: [__address__]\n        target_label: __param_target\n      - source_labels: [__param_target]\n        
target_label: instance\n      - target_label: __address__\n        replacement: blackbox-exporter:9115\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>Repeat the mail block for ports 25 (\u003Ccode>tcp_connect\u003C\u002Fcode>), 465 (\u003Ccode>smtps\u003C\u002Fcode>), and 993 (\u003Ccode>imaps\u003C\u002Fcode>), one job per module.\u003C\u002Fp>\u003Cdiv class=\"note\">\u003Cstrong>Lesson:\u003C\u002Fstrong> \u003Ccode>scrape_timeout\u003C\u002Fcode> must be larger than the probe module's own \u003Ccode>timeout\u003C\u002Fcode>. If it is not, Prometheus kills the scrape mid-handshake and you get phantom \"down\" readings on a mail server that is perfectly healthy. A STARTTLS dialogue legitimately takes several seconds, so a 15s module timeout needs a 20s scrape timeout around it.\u003C\u002Fdiv>\u003Ch3>The probe modules\u003C\u002Fh3>\u003Cfigure class=\"code\">\u003Cfigcaption>config\u002Fblackbox.yml\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">modules:\n  http_2xx:\n    prober: http\n    timeout: 10s\n    http:\n      valid_http_versions: [\"HTTP\u002F1.1\", \"HTTP\u002F2.0\"]\n      valid_status_codes: []        # empty = default: only 2xx passes (redirects are followed first)\n      method: GET\n      follow_redirects: true\n      preferred_ip_protocol: \"ip4\"\n      tls_config:\n        insecure_skip_verify: false  # real cert validation for public sites\n\n  tcp_connect:\n    prober: tcp\n    timeout: 10s\n    tcp:\n      preferred_ip_protocol: \"ip4\"\n\n  smtp_starttls:\n    prober: tcp\n    timeout: 15s   # the STARTTLS dialogue can take several seconds\n    tcp:\n      preferred_ip_protocol: \"ip4\"\n      query_response:\n        - expect: \"^220 \"\n        - send: \"EHLO blackbox\\r\\n\"\n        - expect: \"^250\"\n        - send: \"STARTTLS\\r\\n\"\n        - expect: \"^220\"\n        - starttls: true\n      tls_config:\n        insecure_skip_verify: true   # internal hop; only testing reachability\n\n  smtps:\n    prober: tcp\n    timeout: 
15s\n    tcp:\n      preferred_ip_protocol: \"ip4\"\n      tls: true\n      tls_config:\n        insecure_skip_verify: true\n\n  imaps:\n    prober: tcp\n    timeout: 10s\n    tcp:\n      preferred_ip_protocol: \"ip4\"\n      tls: true\n      tls_config:\n        insecure_skip_verify: true\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>The \u003Ccode>smtp_starttls\u003C\u002Fcode> module actually speaks the protocol: it waits for the \u003Ccode>220\u003C\u002Fcode> greeting, sends \u003Ccode>EHLO\u003C\u002Fcode>, issues \u003Ccode>STARTTLS\u003C\u002Fcode>, then upgrades the connection. That proves the mail server is genuinely answering SMTP, not just accepting a bare TCP connection, which is the difference between \"mail works\" and \"the port is open but Postfix is wedged.\" For public web targets I validate certificates for real; for the internal mail hops I do not, because there I am only testing reachability over a private network.\u003C\u002Fp>\u003Ch2 id=\"rules\">Alert rules\u003C\u002Fh2>\u003Cp>These cover the host, the public endpoints, and the containers. 
The thresholds and the \u003Ccode>for:\u003C\u002Fcode> windows matter as much as the expressions, they are what separate a useful page from a pager that cries wolf.\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>config\u002Falerts.rules\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">groups:\n  - name: node\n    rules:\n      - alert: HighCPUUsage\n        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100) &gt; 85\n        for: 5m\n        labels: { severity: warning }\n        annotations:\n          summary: \"High CPU usage on {{ $labels.instance }}\"\n          description: \"CPU above 85% for 5 minutes (current: {{ $value }}%)\"\n\n      - alert: CriticalCPUUsage\n        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100) &gt; 95\n        for: 2m\n        labels: { severity: critical }\n        annotations:\n          summary: \"Critical CPU usage on {{ $labels.instance }}\"\n          description: \"CPU above 95% for 2 minutes (current: {{ $value }}%)\"\n\n      - alert: HighMemoryUsage\n        expr: (1 - (node_memory_MemAvailable_bytes \u002F node_memory_MemTotal_bytes)) * 100 &gt; 85\n        for: 5m\n        labels: { severity: warning }\n        annotations:\n          summary: \"High memory usage on {{ $labels.instance }}\"\n          description: \"Memory above 85% (current: {{ $value }}%)\"\n\n      - alert: LowDiskSpace\n        expr: (1 - (node_filesystem_avail_bytes{mountpoint=\"\u002F\"} \u002F node_filesystem_size_bytes{mountpoint=\"\u002F\"})) * 100 &gt; 80\n        for: 5m\n        labels: { severity: warning }\n        annotations:\n          summary: \"Low disk space on {{ $labels.instance }}\"\n          description: \"Disk above 80% on \u002F (current: {{ $value }}%)\"\n\n      - alert: CriticalDiskSpace\n        expr: (1 - (node_filesystem_avail_bytes{mountpoint=\"\u002F\"} \u002F node_filesystem_size_bytes{mountpoint=\"\u002F\"})) 
* 100 &gt; 90\n        for: 2m\n        labels: { severity: critical }\n        annotations:\n          summary: \"Critical disk space on {{ $labels.instance }}\"\n          description: \"Disk above 90% on \u002F (current: {{ $value }}%)\"\n\n  - name: blackbox\n    rules:\n      - alert: SiteDown\n        expr: probe_success == 0\n        for: 5m\n        labels: { severity: critical }\n        annotations:\n          summary: \"Site down: {{ $labels.instance }}\"\n          description: \"{{ $labels.instance }} unreachable for 5 minutes\"\n\n      - alert: SlowResponse\n        expr: probe_duration_seconds{job=\"blackbox\"} &gt; 2\n        for: 3m\n        labels: { severity: warning }\n        annotations:\n          summary: \"Slow response: {{ $labels.instance }}\"\n          description: \"{{ $labels.instance }} responding in {{ $value | printf \\\"%.2f\\\" }}s\"\n\n      - alert: SiteFlapping\n        expr: changes(probe_success{job=\"blackbox\"}[30m]) &gt; 4\n        for: 0m\n        labels: { severity: warning }\n        annotations:\n          summary: \"Site flapping: {{ $labels.instance }}\"\n          description: \"{{ $labels.instance }} changed state &gt;4 times in 30 minutes\"\n\n      - alert: SSLCertExpiringSoon\n        expr: (probe_ssl_earliest_cert_expiry - time()) \u002F 86400 &lt; 30\n        for: 1h\n        labels: { severity: warning }\n        annotations:\n          summary: \"SSL cert expiring: {{ $labels.instance }}\"\n          description: \"Certificate expires in {{ $value | printf \\\"%.0f\\\" }} days\"\n\n      - alert: SSLCertExpiryCritical\n        expr: (probe_ssl_earliest_cert_expiry - time()) \u002F 86400 &lt; 7\n        for: 1h\n        labels: { severity: critical }\n        annotations:\n          summary: \"SSL cert &lt; 7 days: {{ $labels.instance }}\"\n          description: \"Certificate expires in {{ $value | printf \\\"%.0f\\\" }} days\"\n\n  - name: containers\n    rules:\n      - alert: ContainerDown\n        
expr: absent(container_last_seen{name!=\"\"}) or time() - container_last_seen{name!=\"\"} &gt; 60\n        for: 5m\n        labels: { severity: critical }\n        annotations:\n          summary: \"Container {{ $labels.name }} is down\"\n          description: \"{{ $labels.name }} not seen for over a minute\"\n\n      - alert: ContainerHighMemory\n        expr: (container_memory_usage_bytes{name!=\"\"} \u002F container_spec_memory_limit_bytes{name!=\"\"}) * 100 &gt; 85 and container_spec_memory_limit_bytes{name!=\"\"} &gt; 0\n        for: 5m\n        labels: { severity: warning }\n        annotations:\n          summary: \"Container {{ $labels.name }} high memory\"\n          description: \"{{ $labels.name }} memory above 85% of its limit\"\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>Two lessons are baked into those rules:\u003C\u002Fp>\u003Cul>\u003Cli data-list-item-id=\"e828fa4df7cfa8052c2dd05d28ca97229\">\u003Cstrong>Scope the latency rule to web probes with \u003C\u002Fstrong>\u003Ccode>\u003Cstrong>job=\"blackbox\"\u003C\u002Fstrong>\u003C\u002Fcode>\u003Cstrong>.\u003C\u002Fstrong> Without that label, \u003Ccode>SlowResponse\u003C\u002Fcode> also fires on the mail TLS\u002FSMTP probes, which legitimately take two to three and a half seconds. Leaving the label off once produced hundreds of noise alerts over a few days and buried the alerts that actually mattered.\u003C\u002Fli>\u003Cli data-list-item-id=\"e8a09a361b5f9be19fd671c8a9f8b4a40\">\u003Ccode>\u003Cstrong>SiteDown\u003C\u002Fstrong>\u003C\u002Fcode>\u003Cstrong> uses \u003C\u002Fstrong>\u003Ccode>\u003Cstrong>for: 5m\u003C\u002Fstrong>\u003C\u002Fcode>\u003Cstrong>.\u003C\u002Fstrong> A single failed scrape, a momentary timeout or a reverse-proxy reload, should not page anyone. A real outage outlasts five minutes comfortably. 
The flapping rule catches the in-between case where a service is bouncing up and down.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2 id=\"route\">Routing alerts: two receivers, and injecting secrets safely\u003C\u002Fh2>\u003Cp>Alertmanager groups, deduplicates, and routes. The important move is the two receivers: email and a webhook. Email is the primary, human-readable channel; the webhook is where the issue tracking and push notification happen.\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>config\u002Falertmanager.yml.template\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">global:\n  smtp_smarthost: 'mailserver:587'\n  smtp_require_tls: true\n\nroute:\n  group_by: ['alertname', 'instance']\n  group_wait: 30s\n  group_interval: 5m\n  repeat_interval: 4h\n  receiver: 'default'\n  routes:\n    - match: { severity: critical }\n      receiver: 'default'\n      repeat_interval: 1h     # nag more often for criticals\n\nreceivers:\n  - name: 'default'\n    email_configs:\n      - to: '${ALERT_TO_EMAIL}'\n        from: '${ALERT_FROM_EMAIL}'\n        smarthost: 'mailserver:587'\n        auth_username: '${MAIL_SMTP_USER}'\n        auth_password: '${MAIL_SMTP_PASSWORD}'\n        require_tls: true\n        tls_config: { insecure_skip_verify: true }\n        send_resolved: true\n    webhook_configs:\n      - url: 'http:\u002F\u002Fwebhook-bridge:5001\u002Fwebhook'\n        send_resolved: true\n\ninhibit_rules:\n  - source_match: { severity: 'critical' }\n    target_match: { severity: 'warning' }\n    # Pair on instance only. The warning and critical rules use different\n    # alertnames (HighCPUUsage, CriticalCPUUsage), so requiring an equal\n    # alertname would never match. Trade-off: any critical alert mutes all\n    # warnings on the same instance while it fires.\n    equal: ['instance']\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>The \u003Ccode>inhibit_rules\u003C\u002Fcode> block is a small quality-of-life win: while a critical alert is firing for an instance, warning-level alerts for that instance are muted, so when CPU on a host goes critical you do not also get the CPU warning for it. 
One event, one notification.\u003C\u002Fp>\u003Ch3>Injecting secrets without baking them into the image\u003C\u002Fh3>\u003Cp>Alertmanager does not expand environment variables in its config file. Rather than commit credentials, the container runs a tiny entrypoint that substitutes them at startup. It uses \u003Ccode>awk\u003C\u002Fcode> with \u003Ccode>index\u003C\u002Fcode>\u002F\u003Ccode>substr\u003C\u002Fcode> (plain string replacement, not regex) and reads the values from \u003Ccode>ENVIRON\u003C\u002Fcode> rather than \u003Ccode>-v\u003C\u002Fcode> assignments, which would interpret backslash escapes, so a password containing special characters is handled correctly instead of being mangled.\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>config\u002Falertmanager-entrypoint.sh\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">#!\u002Fbin\u002Fsh\n# Substitute env vars into the Alertmanager config at startup.\n# Values come from ENVIRON, not -v: awk -v interprets backslash escapes.\n# index\u002Fsubstr (not regex) so passwords with $, &amp;, \\, | survive intact.\nawk '\n  function replace(line, placeholder, value,    idx, len) {\n    len = length(placeholder)\n    while ((idx = index(line, placeholder)) &gt; 0)\n      line = substr(line, 1, idx-1) value substr(line, idx+len)\n    return line\n  }\n  {\n    line = $0\n    line = replace(line, \"${ALERT_TO_EMAIL}\",     ENVIRON[\"ALERT_TO_EMAIL\"])\n    line = replace(line, \"${ALERT_FROM_EMAIL}\",    ENVIRON[\"ALERT_FROM_EMAIL\"])\n    line = replace(line, \"${MAIL_SMTP_USER}\",      ENVIRON[\"MAIL_SMTP_USER\"])\n    line = replace(line, \"${MAIL_SMTP_PASSWORD}\",  ENVIRON[\"MAIL_SMTP_PASSWORD\"])\n    print line\n  }\n  ' \u002Fetc\u002Falertmanager\u002Falertmanager.yml.template &gt; \u002Ftmp\u002Falertmanager-resolved.yml\n\nexec \u002Fbin\u002Falertmanager \\\n  --config.file=\u002Ftmp\u002Falertmanager-resolved.yml \\\n  --storage.path=\u002Falertmanager\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>The template with placeholders is what lives in Git. 
The resolved file, with real credentials in it, only ever exists at \u003Ccode>\u002Ftmp\u003C\u002Fcode> inside the running container and never touches disk outside it.\u003C\u002Fp>\u003Ch2 id=\"bridge\">The webhook bridge: alerts become self-closing issues\u003C\u002Fh2>\u003Cp>This is the piece that turns notifications into tracked work. It is a single dependency-free Python file, standard library only, which is why it can run on a bare \u003Ccode>python:3.11-alpine\u003C\u002Fcode> image with the script bind-mounted in, no build step. It does three things: it opens a GitHub Issue on a firing alert, closes the matching issue on a resolved alert, and for critical alerts also pushes to the self-hosted ntfy server.\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>services\u002Fwebhook-bridge\u002Fapp.py\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">#!\u002Fusr\u002Fbin\u002Fenv python3\n\"\"\"\nAlertmanager -&gt; GitHub Issues bridge.\nCreates\u002Fcloses GitHub issues automatically; critical alerts also push via ntfy.\nStandard library only; internal-only (not exposed via the reverse proxy).\n\"\"\"\nimport json, os, sys, threading, time\nimport urllib.request, urllib.error\nfrom http.server import HTTPServer, BaseHTTPRequestHandler\n\nGITHUB_TOKEN = os.environ.get('GITHUB_TOKEN', '')\nGITHUB_REPO  = os.environ.get('GITHUB_REPO', '')\nGITHUB_API   = 'https:\u002F\u002Fapi.github.com'\n\nNTFY_URL   = os.environ.get('NTFY_URL', 'http:\u002F\u002Fntfy').rstrip('\u002F')\nNTFY_TOPIC = os.environ.get('NTFY_TOPIC', 'infra-alerts')\nNTFY_TOKEN = os.environ.get('NTFY_TOKEN', '')\n\n_lock   = threading.Lock()\n_recent = {}  # title -&gt; timestamp, a 5-minute in-flight dedup window\n\n\ndef log(msg):\n    print(msg, flush=True)\n\n\ndef gh(method, path, payload=None):\n    data = json.dumps(payload).encode() if payload else None\n    req  = urllib.request.Request(\n        f'{GITHUB_API}{path}', data=data, method=method,\n        
headers={\n            'Authorization': f'Bearer {GITHUB_TOKEN}',\n            'Accept': 'application\u002Fvnd.github.v3+json',\n            'Content-Type': 'application\u002Fjson',\n            'X-GitHub-Api-Version': '2022-11-28',\n        })\n    try:\n        with urllib.request.urlopen(req) as r:\n            return json.loads(r.read())\n    except urllib.error.HTTPError as e:\n        log(f\"GitHub {method} {path} -&gt; HTTP {e.code}: {e.read().decode()}\")\n        return None\n    except Exception as e:\n        log(f\"GitHub {method} {path} error: {e}\")\n        return None\n\n\ndef ntfy(title, message, priority='5', tags='rotating_light'):\n    \"\"\"Push to self-hosted ntfy. Title\u002Ftags stay ASCII (HTTP headers are latin-1);\n    emoji come from tag names. Body may be UTF-8. Best-effort; never blocks.\"\"\"\n    if not NTFY_TOKEN:\n        log(\"ntfy skipped: NTFY_TOKEN not set\")\n        return\n    try:\n        req = urllib.request.Request(\n            f'{NTFY_URL}\u002F{NTFY_TOPIC}',\n            data=(message or title).encode('utf-8'), method='POST',\n            headers={\n                'Authorization': f'Bearer {NTFY_TOKEN}',\n                'Title': title, 'Priority': str(priority), 'Tags': tags,\n            })\n        urllib.request.urlopen(req, timeout=10)\n        log(f\"ntfy sent ({priority}): {title}\")\n    except Exception as e:\n        log(f\"ntfy publish error: {e}\")\n\n\ndef issue_title(alert):\n    labels   = alert.get('labels', {})\n    name     = labels.get('alertname', 'Alert')\n    severity = labels.get('severity', 'warning').upper()\n    instance = labels.get('instance', '')\n    title    = f\"[{severity}] {name}\"\n    if instance:\n        title += f\": {instance}\"\n    return title\n\n\ndef find_open_issue(title):\n    # List API, not search API: fine-grained PATs can't use the search endpoint.\n    page = 1\n    while True:\n        issues = gh('GET', 
f'\u002Frepos\u002F{GITHUB_REPO}\u002Fissues?state=open&amp;per_page=100&amp;page={page}')\n        if not issues:\n            return None\n        for issue in issues:\n            if issue.get('title') == title:\n                return issue\n        if len(issues) &lt; 100:\n            return None\n        page += 1\n\n\ndef handle_firing(alert):\n    title    = issue_title(alert)\n    labels   = alert.get('labels', {})\n    severity = labels.get('severity', 'warning')\n    instance = labels.get('instance', 'N\u002FA')\n    summary  = alert.get('annotations', {}).get('summary', '')\n    desc     = alert.get('annotations', {}).get('description', '')\n\n    with _lock:\n        now = time.time()\n        for k in list(_recent):              # evict stale dedup entries\n            if now - _recent[k] &gt; 300:\n                del _recent[k]\n        if title in _recent:\n            log(f\"Deduplicated (in-flight): {title}\")\n            return\n        if find_open_issue(title):\n            log(f\"Issue already open: {title}\")\n            return\n        _recent[title] = now\n\n    body = f\"\"\"## Alert Firing\n\n| | |\n|---|---|\n| **Severity** | `{severity}` |\n| **Instance** | `{instance}` |\n| **Summary** | {summary} |\n\n{desc}\n\n---\n*Auto-created by Alertmanager. 
Closes automatically when the alert resolves.*\n\"\"\"\n    # Try with labels first; fall back to no labels if they don't exist on the repo.\n    result = gh('POST', f'\u002Frepos\u002F{GITHUB_REPO}\u002Fissues',\n                {'title': title, 'body': body, 'labels': ['monitoring', severity]})\n    if not result:\n        result = gh('POST', f'\u002Frepos\u002F{GITHUB_REPO}\u002Fissues', {'title': title, 'body': body})\n    if result:\n        log(f\"Created issue #{result['number']}: {title}\")\n\n    if severity == 'critical':\n        ntfy(title, f\"{summary}\\n\\n{desc}\".strip() or instance,\n             priority='5', tags='rotating_light')\n\n\ndef handle_resolved(alert):\n    title    = issue_title(alert)\n    severity = alert.get('labels', {}).get('severity', 'warning')\n    issue = find_open_issue(title)\n    if not issue:\n        log(f\"No open issue to close for: {title}\")\n        return\n    number = issue['number']\n    gh('POST', f'\u002Frepos\u002F{GITHUB_REPO}\u002Fissues\u002F{number}\u002Fcomments',\n       {'body': ':green_circle: Alert resolved - closing automatically.'})\n    gh('PATCH', f'\u002Frepos\u002F{GITHUB_REPO}\u002Fissues\u002F{number}', {'state': 'closed'})\n    log(f\"Closed issue #{number}: {title}\")\n\n    if severity == 'critical':\n        ntfy(f\"Resolved: {title}\", \"Alert resolved - issue closed automatically.\",\n             priority='3', tags='white_check_mark')\n\n\nclass Handler(BaseHTTPRequestHandler):\n    def do_POST(self):\n        length = int(self.headers.get('Content-Length', 0))\n        try:\n            data = json.loads(self.rfile.read(length))\n            for alert in data.get('alerts', []):\n                if alert.get('status') == 'firing':\n                    handle_firing(alert)\n                elif alert.get('status') == 'resolved':\n                    handle_resolved(alert)\n        except Exception as e:\n            log(f\"Webhook error: {e}\")\n        self.send_response(200); 
self.end_headers(); self.wfile.write(b'ok')\n\n    def do_GET(self):\n        self.send_response(200); self.end_headers()\n        self.wfile.write(json.dumps({'status': 'ok', 'repo': GITHUB_REPO}).encode())\n\n    def log_message(self, *_):\n        pass  # suppress default request logging\n\n\nif __name__ == '__main__':\n    if not GITHUB_TOKEN:\n        log(\"ERROR: GITHUB_TOKEN not set - issues will not be created\")\n        sys.exit(1)\n    state = f\"ntfy -&gt; {NTFY_URL}\u002F{NTFY_TOPIC}\" if NTFY_TOKEN else \"ntfy DISABLED\"\n    log(f\"Webhook bridge on :5001 -&gt; {GITHUB_REPO} | critical -&gt; {state}\")\n    HTTPServer(('0.0.0.0', 5001), Handler).serve_forever()\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>Three details that are not obvious until you hit them:\u003C\u002Fp>\u003Cul>\u003Cli data-list-item-id=\"e88b5cc9b3929853cb7d2ec6c257421d9\">\u003Cstrong>Use the issues list API, not the search API.\u003C\u002Fstrong> GitHub's fine-grained personal access tokens do not have access to the search endpoint, so matching an open issue by title means paging through \u003Ccode>state=open\u003C\u002Fcode> issues a hundred at a time. The first time I built this with the search API it returned 403s that looked like a permissions bug and were not.\u003C\u002Fli>\u003Cli data-list-item-id=\"eee6604225bd745222591353b4dc7f78b\">\u003Cstrong>Two layers of deduplication.\u003C\u002Fstrong> An in-memory five-minute window catches in-flight bursts cheaply, and a \u003Ccode>find_open_issue\u003C\u002Fcode> check catches anything already filed. Without both, a flapping service fills the tracker with duplicates faster than you can close them.\u003C\u002Fli>\u003Cli data-list-item-id=\"eb4e74299b662db264ca4c0b49a200d07\">\u003Cstrong>The label fallback.\u003C\u002Fstrong> Creating an issue with labels fails if those labels do not exist on the repo, so the bridge retries without them. 
The issue still gets filed.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2 id=\"ntfy\">Why the second channel is the whole point\u003C\u002Fh2>\u003Cp>Here is the trap this design exists to avoid. On a self-hosted box, the monitoring stack sends alert email through your own mail server. That is fine for almost every alert, until the alert you most need to receive is \"the mail server is down.\" That alert is an email. It routes to the mail server. The mail server is down. You never hear about it. The monitoring did its job perfectly and you still found out from a user.\u003C\u002Fp>\u003Cp>ntfy breaks that circular dependency. It is a small self-hosted push server; you install its app on your phone, subscribe to a topic, and the webhook bridge POSTs critical alerts straight to it over a path that shares nothing with the mail stack. Mail can be completely dead and the push still lands. Because the ntfy server has to be reachable from the public internet, it is locked to deny-all and given an explicit publisher:\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>config\u002Fntfy-server.yml\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">base-url: https:\u002F\u002Fntfy.example.com\nauth-default-access: deny-all\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>Then create the publishing user and token once, on the server:\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>bash\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">docker exec -it &lt;ntfy-container&gt; ntfy user add --role=user bridge      # set a password\ndocker exec -it &lt;ntfy-container&gt; ntfy access bridge infra-alerts rw    # publish + read\ndocker exec -it &lt;ntfy-container&gt; ntfy token add bridge                 # prints tk_...\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>Put the printed token into \u003Ccode>.env\u003C\u002Fcode> as \u003Ccode>NTFY_TOKEN\u003C\u002Fcode> and redeploy so the bridge picks it up. 
On the phone: install the ntfy app, add the server, subscribe to the topic, sign in. Then prove the independent path actually works:\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>bash\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">docker exec &lt;ntfy-container&gt; ntfy publish --token tk_... infra-alerts \"test\"\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>If the phone buzzes, the channel that survives a mail outage is real. The token is never committed; it lives only in \u003Ccode>.env\u003C\u002Fcode>.\u003C\u002Fp>\u003Ch2 id=\"logs\">Logs: Loki and Promtail\u003C\u002Fh2>\u003Cp>Metrics tell you that something broke; logs tell you why. Promtail tails the host logs plus every container's stdout\u002Fstderr and ships them to Loki, which Grafana queries.\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>config\u002Fpromtail.yml\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">server:\n  http_listen_port: 9080\n  grpc_listen_port: 0\n\npositions:\n  filename: \u002Fvar\u002Flib\u002Fpromtail\u002Fpositions.yaml\n\nclients:\n  - url: http:\u002F\u002Floki:3100\u002Floki\u002Fapi\u002Fv1\u002Fpush\n\nscrape_configs:\n  - job_name: varlogs\n    static_configs:\n      - targets: [localhost]\n        # Quoted: braces and commas are not allowed in a plain scalar inside { }.\n        labels: { job: varlogs, host: server, __path__: '\u002Fvar\u002Flog\u002F{syslog,kern.log}' }\n\n  # Auth log - SSH logins, sudo, PAM. 
Contains IPs\u002Fusernames; gated by Grafana auth.\n  - job_name: auth\n    static_configs:\n      - targets: [localhost]\n        labels: { job: auth, host: server, __path__: \u002Fvar\u002Flog\u002Fauth.log }\n\n  # Reverse-proxy JSON access logs -&gt; web analytics.\n  # Keep cardinality low: only host + method become labels; everything else\n  # (path, status, duration, client IP) is parsed at query time via | json.\n  - job_name: proxy-access\n    pipeline_stages:\n      - json:\n          expressions:\n            request_host:   RequestHost\n            request_method: RequestMethod\n      - labels:\n          host:   request_host\n          method: request_method\n    static_configs:\n      - targets: [localhost]\n        labels: { job: proxy-access, __path__: \u002Fproxy-logs\u002Faccess.log }\n\n  - job_name: containers\n    pipeline_stages:\n      - docker: {}\n    static_configs:\n      - targets: [localhost]\n        labels: { job: containers, host: server, __path__: \u002Fvar\u002Flib\u002Fdocker\u002Fcontainers\u002F*\u002F*-json.log }\n    relabel_configs:\n      - source_labels: [__path__]\n        regex: '\u002Fvar\u002Flib\u002Fdocker\u002Fcontainers\u002F([^\u002F]+)\u002F.+'\n        target_label: container_id\n        replacement: '$1'\n        action: replace\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cdiv class=\"note\">\u003Cstrong>Discipline:\u003C\u002Fstrong> keep label cardinality low. Labels are an index. High-cardinality values (request paths, durations, client IPs, status codes) as labels will blow Loki up. Promote only low-cardinality fields (virtual host, HTTP method) to labels and parse everything else at query time with \u003Ccode>| json\u003C\u002Fcode>.\u003C\u002Fdiv>\u003Cp>The Loki config itself is mostly defaults: filesystem storage, single-binary mode, retention around 30 days. 
The one thing to plan for is that Loki's schema versions change between major releases, so add a new dated \u003Ccode>schema_config\u003C\u002Fcode> entry for each change instead of editing the old one; old data then stays readable after an upgrade without a migration.\u003C\u002Fp>\u003Ch2 id=\"dash\">Dashboards as code\u003C\u002Fh2>\u003Cp>Grafana provisions its datasources and dashboards from files on disk. No click-ops, and the entire observability layer comes back identically from Git on every redeploy.\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>provisioning\u002Fdatasources\u002Fprometheus.yml\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">apiVersion: 1\ndatasources:\n  - name: Prometheus\n    type: prometheus\n    uid: prometheus\n    access: proxy\n    url: http:\u002F\u002Fprometheus:9090\n    isDefault: true\n  - name: Loki\n    type: loki\n    uid: loki\n    access: proxy\n    url: http:\u002F\u002Floki:3100\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cfigure class=\"code\">\u003Cfigcaption>provisioning\u002Fdashboards\u002Fprovider.yml\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">apiVersion: 1\nproviders:\n  - name: 'default'\n    folder: ''\n    type: file\n    options:\n      path: \u002Fetc\u002Fgrafana\u002Fprovisioning\u002Fdashboards\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>Drop dashboard JSON files next to that provider and they appear on startup. The workflow is: build a dashboard in the UI, export its JSON, commit it. Every redeploy then comes up identical to the last. 
The dashboards I keep provisioned:\u003C\u002Fp>\u003Cul>\u003Cli data-list-item-id=\"e10bc1d39f5be1a3303abf7ded60ead66\">\u003Cstrong>Site Uptime\u003C\u002Fstrong>: a \u003Ccode>probe_success\u003C\u002Fcode> grid per target, HTTP response duration, uptime percentage over 24h \u002F 7d \u002F 30d, and TLS-expiry-in-days.\u003C\u002Fli>\u003Cli data-list-item-id=\"e0a3370a52a5fc5360007fc5e2382c256\">\u003Cstrong>Infrastructure\u003C\u002Fstrong>: host CPU, memory, disk gauges, network, container counts.\u003C\u002Fli>\u003Cli data-list-item-id=\"ec6306721c06c44e410883fa998b8d6e6\">\u003Cstrong>Logs\u003C\u002Fstrong>: Loki log rates, container error rates, auth log, syslog.\u003C\u002Fli>\u003Cli data-list-item-id=\"efc7dc48c7083331b3b52e49f92a9ecd6\">\u003Cstrong>Traefik \u002F proxy\u003C\u002Fstrong>: request rates, status-code breakdowns, p50\u002Fp95\u002Fp99 latency, 5xx rate.\u003C\u002Fli>\u003C\u002Ful>\u003Cp>Each UP\u002FDOWN tile carries a data link straight to the site it represents, so the dashboard doubles as a launcher.\u003C\u002Fp>\u003Ch2 id=\"deploy\">Deploying: manual, GitOps-style, no agents on the box\u003C\u002Fh2>\u003Cp>Deploys are deliberate. There is no auto-deploy on push; production only changes when I trigger it. A small \"Deploy to Production\" workflow dispatches an event to the orchestration repo, which runs on a self-hosted runner that SSHes into the VPS, syncs the named config files out of Git, pulls images, and brings the stack up. 
The pattern is identical for every project on the box, which is most of why a new project can be running end to end in about an hour.\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>.github\u002Fworkflows\u002Fdeploy-prod.yml\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">name: Deploy to Production\non:\n  workflow_dispatch:\n    inputs:\n      force_recreate:\n        description: 'Force recreate containers'\n        type: boolean\n        default: false\njobs:\n  trigger-deploy:\n    runs-on: ubuntu-latest\n    steps:\n      - name: Trigger deploy\n        uses: peter-evans\u002Frepository-dispatch@v4\n        with:\n          token: ${{ secrets.DEPLOY_TOKEN }}\n          repository: youruser\u002Finfra-repo\n          event-type: platform-deploy\n          client-payload: |\n            {\n              \"project_dir\": \"monitoring\",\n              \"project_type\": \"platform\",\n              \"sync_code\": \"true\",\n              \"sync_files\": \"docker-compose.yml config\u002Fprometheus.yml config\u002Falerts.rules config\u002Fblackbox.yml config\u002Floki.yml config\u002Fpromtail.yml config\u002Fntfy-server.yml services\u002Fwebhook-bridge\u002Fapp.py config\u002Falertmanager.yml.template config\u002Falertmanager-entrypoint.sh config\u002Fprovisioning\u002F...\",\n              \"force_recreate\": \"${{ inputs.force_recreate || false }}\"\n            }\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>The receiving deploy job, running on the self-hosted runner, does the boring-but-essential things in order: verify SSH, sync only the listed files, pull, bring up, then a health check that fails the run if any container is in an Exit \u002F unhealthy \u002F restarting state.\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>deploy job (skeleton)\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">jobs:\n  deploy:\n    runs-on: self-hosted\n    environment: production\n    
timeout-minutes: 30\n    steps:\n      - name: Sync config files from repo\n        if: steps.vars.outputs.sync_code == 'true'\n        run: |\n          ssh -i ~\u002F.ssh\u002Fid_rsa_vps -p ${{ secrets.SSH_PORT }} \\\n            ${{ secrets.SSH_USER }}@${{ secrets.VPS_IP }} \\\n            \"cd ~\u002Fdocker\u002F${{ steps.vars.outputs.project_dir }} &amp;&amp; \\\n             git fetch origin main --depth=1 &amp;&amp; \\\n             git checkout origin\u002Fmain -- ${{ steps.vars.outputs.sync_files }}\"\n\n      - name: Pull images and deploy\n        run: |\n          ssh -i ~\u002F.ssh\u002Fid_rsa_vps -p ${{ secrets.SSH_PORT }} \\\n            ${{ secrets.SSH_USER }}@${{ secrets.VPS_IP }} \\\n            \"cd ~\u002Fdocker\u002F${{ steps.vars.outputs.project_dir }} &amp;&amp; \\\n             docker compose pull &amp;&amp; \\\n             docker compose up -d --remove-orphans\"\n\n      - name: Health check\n        run: |\n          sleep 30\n          ssh -i ~\u002F.ssh\u002Fid_rsa_vps -p ${{ secrets.SSH_PORT }} \\\n            ${{ secrets.SSH_USER }}@${{ secrets.VPS_IP }} \\\n            \"cd ~\u002Fdocker\u002F${{ steps.vars.outputs.project_dir }} &amp;&amp; \\\n             STOPPED=\\$(docker compose ps | grep -ciE 'Exit|unhealthy|restarting' || true) &amp;&amp; \\\n             if [ \\\"\\$STOPPED\\\" -gt 0 ]; then docker compose ps &amp;&amp; exit 1; fi &amp;&amp; \\\n             echo 'All containers healthy'\"\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>Every host, port, and user in there is a GitHub secret, never a literal, and the key path points at a file that exists only on the runner. The health check matches case-insensitively because Docker reports the state as \u003Ccode>Restarting\u003C\u002Fcode>, with a capital R. The runner holds the SSH key; the box runs no deploy agent of its own and exposes no deploy webhook. The deployment surface is one SSH key and one short-lived token. 
There is also a separate local Compose file for working on dashboards: Grafana only, anonymous admin, no Traefik, live provisioning reload, on \u003Ccode>localhost:3000\u003C\u002Fcode>.\u003C\u002Fp>\u003Ch2 id=\"backup\">Backups, so a dead box is not a dead project\u003C\u002Fh2>\u003Cp>Monitoring tells you something broke; snapshots let you put it back. A scheduled GitHub Actions job takes a provider-level snapshot of the VPS every week during quiet hours, with a manual trigger for ad-hoc snapshots before a risky change. The provider here is Contabo via its API, but the shape is generic; any provider with a snapshot API works the same way.\u003C\u002Fp>\u003Cfigure class=\"code\">\u003Cfigcaption>.github\u002Fworkflows\u002Fsnapshots.yml\u003C\u002Ffigcaption>\u003Cpre>\u003Ccode class=\"language-plaintext\">name: Weekly VPS Snapshots\non:\n  schedule:\n    - cron: '0 2 * * 0'   # Sundays 02:00 UTC, low traffic\n  workflow_dispatch:\n    inputs:\n      target:\n        type: choice\n        options: [both, vps1, vps2]\n        default: both\njobs:\n  snapshot:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions\u002Fcheckout@v6\n      - run: sudo apt-get install -y jq uuid-runtime\n      - name: Create snapshots\n        env:\n          # all provider credentials are GitHub secrets, never literals\n          PROVIDER_CLIENT_ID:     ${{ secrets.PROVIDER_CLIENT_ID }}\n          PROVIDER_CLIENT_SECRET: ${{ secrets.PROVIDER_CLIENT_SECRET }}\n          VPS_INSTANCE_ID:        ${{ secrets.VPS_INSTANCE_ID }}\n        run: .\u002Fscripts\u002Fbackup\u002Fsnapshot.sh \"${{ github.event.inputs.target || 'both' }}\"\u003C\u002Fcode>\u003C\u002Fpre>\u003C\u002Ffigure>\u003Cp>Same discipline as everywhere else: every credential is a secret, the schedule is deliberate, and there is a manual escape hatch.\u003C\u002Fp>\u003Ch2 id=\"discipline\">The discipline that keeps it boring\u003C\u002Fh2>\u003Cp>A few habits matter more than any single 
component:\u003C\u002Fp>\u003Cul>\u003Cli data-list-item-id=\"e41b7f5b05ca27d306d5f0001c4107db8\">\u003Cstrong>Pin every image and upgrade on purpose.\u003C\u002Fstrong> cAdvisor is the cautionary tale: its \u003Ccode>latest\u003C\u002Fcode> tag has shipped builds with a constant CPU overhead of around 13%, so it is pinned to v0.47.2 and only moves after a tested bump. Dependabot proposes; nothing applies itself.\u003C\u002Fli>\u003Cli data-list-item-id=\"e23cbb1c4b07255b903e4017ef77bfe89\">\u003Cstrong>Verify the alert path end to end before trusting it.\u003C\u002Fstrong> Stop a non-critical container, watch a GitHub Issue open, start it again, watch the issue close. Then make the mail stack unreachable and confirm the ntfy push still lands. An alerting pipeline you have never seen fire is a guess, not a safety net.\u003C\u002Fli>\u003Cli data-list-item-id=\"efdbfcc16c3a5c069c4514f9efc756ad4\">\u003Cstrong>Keep secrets out of Git, always.\u003C\u002Fstrong> Templates and \u003Ccode>.env.example\u003C\u002Fcode> live in the repo; real values live only on the server and in Actions secrets. 
The entrypoint resolves credentials into \u003Ccode>\u002Ftmp\u003C\u002Fcode> at runtime so they never sit on disk in plaintext config.\u003C\u002Fli>\u003Cli data-list-item-id=\"ebea8afbe9fa8cb439d676cf010f91a14\">\u003Cstrong>Reproduce, do not repair.\u003C\u002Fstrong> Because the whole stack is one Compose file plus pinned images plus tracked config, recovering from a bad state is a redeploy, not a debugging session.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2 id=\"buys\">What this design buys\u003C\u002Fh2>\u003Cul>\u003Cli data-list-item-id=\"eab8563f708b356db66f3f5633c5f8132\">One \u003Ccode>docker compose up\u003C\u002Fcode> reproduces the entire observability stack, dashboards and all, from pinned images and tracked config.\u003C\u002Fli>\u003Cli data-list-item-id=\"e4565224d891ce4c960b6bdd778ba413d\">No secret is ever committed; everything sensitive is an environment variable injected at runtime.\u003C\u002Fli>\u003Cli data-list-item-id=\"edabfe8ae6d76d99ca0d5b8acea9e7108\">Alerts survive the failure of their own delivery channel, because criticals have a second path that shares no dependencies with email.\u003C\u002Fli>\u003Cli data-list-item-id=\"e5b08f5443b16a6af17eccc2b8bccaa40\">Every alert is a tracked, self-closing work item, so the issue tracker is a live picture of what is broken right now and goes quiet on its own when things recover.\u003C\u002Fli>\u003C\u002Ful>\u003Cp>None of the components are unusual. The value is in the wiring: the two-channel fan-out, the runtime secret injection, the cardinality-aware log labels, and the bridge that closes the loop between \"something fired\" and \"someone is tracking it.\" That is the part most tutorials skip, and it is the part that decides whether your monitoring actually wakes you up when it matters, and stops bothering you when it does not.\u003C\u002Fp>\u003Cp>The result, on one VPS I have been carrying forward since 2019, is uptime sitting at 97-100% month after month. 
Not because the hardware is special, but because the box tells me the moment it is not.\u003C\u002Fp>\u003Cp class=\"closing\">Open source throughout, version-pinned, and updated regularly via Dependabot and deliberate manual bumps.\u003C\u002Fp>\u003C\u002Fmain>\u003C\u002Farticle>","self-hosted-monitoring-single-vps","2026-05-31T12:51:38+00:00",{"id":11,"url":12,"alt":5,"width":13,"height":14},"fac79bca-eead-41ee-b19b-d88ded9ad22c","https:\u002F\u002Fdrupal.madsnorgaard.net\u002Fsites\u002Fdefault\u002Ffiles\u002F2026-05\u002FSelf-hosted%20monitoring%20on%20a%20single%20VPS.jpg",3165,1325,[16,19,22,25,28,31,34,37,40,43,46,49,52,55,58,61,64,67],{"id":17,"name":18,"slug":18},"d965487f-3728-4202-86de-5b2a421c60a4","Self-hosting",{"id":20,"name":21,"slug":21},"be2d7dd2-5cf1-4b35-89d8-a8e456b3ad50","monitoring",{"id":23,"name":24,"slug":24},"a0c4ac27-a9ae-4b64-a9cc-0c4b80ca5720","Observability",{"id":26,"name":27,"slug":27},"bfac572a-a9d6-48a5-9204-ae809f58e232","Prometheus",{"id":29,"name":30,"slug":30},"9b49265e-11c3-497d-9dc6-fcf72529d6d5","Grafana",{"id":32,"name":33,"slug":33},"115eecc0-9e4e-4017-882b-5e1268bf10f7","Loki",{"id":35,"name":36,"slug":36},"11825823-cfd7-4978-b59d-19890742488e","Alertmanager",{"id":38,"name":39,"slug":39},"cd4d8de4-8189-439a-92d1-52d8b9178351","Docker",{"id":41,"name":42,"slug":42},"3cfa8d37-7bd7-43db-9b2c-af4391cbc93f","Docker Compose",{"id":44,"name":45,"slug":45},"49f2de4c-0b5c-4736-9470-1ff09e7a57b5","DevOps",{"id":47,"name":48,"slug":48},"224e3734-c39c-46fe-9d62-82bc1f9c1ddf","Homelab",{"id":50,"name":51,"slug":51},"db8b4a3f-d438-4953-94ef-71a9e8a77cea","VPS",{"id":53,"name":54,"slug":54},"98c95e6b-edb6-4a31-a5a7-83531bb0b35b","Linux",{"id":56,"name":57,"slug":57},"b036641d-390a-413b-b911-5684b9cf2860","Traefik",{"id":59,"name":60,"slug":60},"bdabafd6-2d3a-45cd-8ea3-80474d2d3b24","ntfy",{"id":62,"name":63,"slug":63},"4f98c9bf-dee5-404a-8b6f-17c4c1e88da5","GitHub 
Actions",{"id":65,"name":66,"slug":66},"e0a7d67c-afe8-45f9-a35a-72eba4acaa70","Infrastructure as Code",{"id":68,"name":69,"slug":69},"c5d5cbb6-8f6a-4dc6-92bd-83cb64a9044e","Uptime"]