infra/skills/infra-ops/SKILL.md

name: infra-ops
description: Canonical conventions for Manohar's self-hosted infrastructure (Hetzner CX32 + Dokploy + Tailscale + Forgejo). Use whenever creating or editing a service, writing a Dokploy compose file, running SSH ops on the server, deploying via Forgejo, or touching networking/UFW. Encodes the script-first workflow, compose label requirements, overlay-vs-bridge networking rules, and the deploy loop so these directions never need restating.

Infra Ops — house style

Server

  • Host manohar-ubuntu: Hetzner CX32 (4 vCPU / 7.6 GB / 75 GB), Ubuntu 24, Docker 29, Helsinki.
  • SSH (Tailscale-only; user is always root):
    SSH_AUTH_SOCK=$(launchctl getenv SSH_AUTH_SOCK) ssh -i ~/.ssh/id_ed25519 root@100.75.128.45 'bash -s' < /local/script.sh
    
    • Tailscale IP 100.75.128.45 | public IPv4 77.42.82.225
    • NEVER use -t (no pseudo-TTY). NEVER heredoc over SSH.
    • A Tailscale node shown as idle is online, not down. A re-auth prompt is normal: approve it, then kill and restart any wedged session.
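The SSH pipe above can be wrapped in a small local helper so the flags are never retyped. This is a minimal sketch, not existing tooling: the function name remote_run and the INFRA_DRY_RUN switch are hypothetical, and the host and key path are copied from the command above.

```shell
#!/usr/bin/env bash
# Sketch only: remote_run and INFRA_DRY_RUN are hypothetical names.
INFRA_HOST="root@100.75.128.45"      # Tailscale IP from the Server section
INFRA_KEY="$HOME/.ssh/id_ed25519"

remote_run() {
  # $1 = path to a local script that should run on the server
  local script="$1"
  if [ ! -f "$script" ]; then
    echo "remote_run: no such script: $script" >&2
    return 1
  fi
  if [ -n "${INFRA_DRY_RUN:-}" ]; then
    # Print the command instead of running it.
    echo "ssh -i $INFRA_KEY $INFRA_HOST 'bash -s' < $script"
    return 0
  fi
  # No -t and no heredoc: the script file is piped on stdin to 'bash -s'.
  SSH_AUTH_SOCK="$(launchctl getenv SSH_AUTH_SOCK)" \
    ssh -i "$INFRA_KEY" "$INFRA_HOST" 'bash -s' < "$script"
}
```

Usage would look like `remote_run ~/MyProjects/some-script.sh` (the script path is illustrative), or `INFRA_DRY_RUN=1 remote_run ...` to inspect the command first.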

Script-first (never deviate)

  • Write scripts locally to ~/MyProjects/ via Desktop Commander write_file (NOT the sandbox).
  • Execute remotely via the ssh pipe above ('bash -s' < script.sh).
  • Never patch files in place on the server bypassing git.
  • Backup-before-change: write a rollback script to /opt/<service>/ before modifying configs.
  • Dead-man's-switch for risky ops: a verify step that proves success before the change is trusted.
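The backup-before-change and verify rules above can be combined in one script. The following is a sketch under assumptions: the helper names, the .bak timestamp naming, and the rollback.sh filename are all hypothetical, and the generated rollback script does not handle paths containing double quotes or `$`.

```shell
#!/usr/bin/env bash
# Sketch only: write_rollback, apply_with_verify, and rollback.sh are hypothetical names.

write_rollback() {
  # $1 = service dir (e.g. /opt/<service>), $2 = config file about to change
  local svc_dir="$1" config="$2"
  local backup
  backup="$config.bak.$(date +%Y%m%d%H%M%S)"
  cp -p "$config" "$backup" || return 1
  # Generate a script that restores the backup over the config.
  printf '#!/bin/sh\nset -eu\ncp -p "%s" "%s"\n' "$backup" "$config" > "$svc_dir/rollback.sh"
  chmod +x "$svc_dir/rollback.sh"
}

apply_with_verify() {
  # $1 = service dir, $2 = command that makes the change, $3 = command that proves it worked
  local svc_dir="$1" change_cmd="$2" verify_cmd="$3"
  if sh -c "$change_cmd" && sh -c "$verify_cmd"; then
    echo "change verified"
    return 0
  fi
  # The change is not trusted until verify passes; otherwise restore the backup.
  echo "change or verify failed, rolling back" >&2
  sh "$svc_dir/rollback.sh"
  return 1
}
```

The verify command should test the real outcome (a health endpoint, a config check), not just that the change command exited cleanly.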

Dokploy compose conventions

Dokploy deploys compose as a swarm stack, so Traefik routing needs BOTH label sets:

  • container-level labels: (docker provider) AND deploy: labels: (swarm provider) — mirror them exactly.
  • No container_name: (swarm assigns names).
  • Attach dokploy-network (external: true) for Traefik ingress.
  • Deploy only through the Dokploy UI (not docker stack deploy by hand).
  • /etc/dokploy/compose/*/code/ is OVERWRITTEN on every redeploy — never treat it as source of truth.
  • Standard Traefik labels (replace SVC / HOST / PORT):
    traefik.enable=true
    traefik.docker.network=dokploy-network
    traefik.http.routers.SVC.rule=Host(`HOST`)
    traefik.http.routers.SVC.entrypoints=websecure
    traefik.http.routers.SVC.tls.certresolver=letsencrypt
    traefik.http.services.SVC.loadbalancer.server.port=PORT
    
  • Scaffold to copy: templates/dokploy-service.compose.yml
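Because the two label sets must mirror each other exactly, generating both from one function avoids drift. The sketch below is illustrative and is not the contents of templates/dokploy-service.compose.yml: the function names and the overall compose layout are assumptions, and only the six labels, dokploy-network, and the no-container_name rule come from the list above.

```shell
#!/usr/bin/env bash
# Sketch only: traefik_labels and render_compose are hypothetical helpers.

traefik_labels() {
  # $1 = router/service name, $2 = hostname, $3 = container port, $4 = YAML indent
  local svc="$1" host="$2" port="$3" indent="${4:-}"
  printf '%s- "traefik.enable=true"\n' "$indent"
  printf '%s- "traefik.docker.network=dokploy-network"\n' "$indent"
  printf '%s- "traefik.http.routers.%s.rule=Host(`%s`)"\n' "$indent" "$svc" "$host"
  printf '%s- "traefik.http.routers.%s.entrypoints=websecure"\n' "$indent" "$svc"
  printf '%s- "traefik.http.routers.%s.tls.certresolver=letsencrypt"\n' "$indent" "$svc"
  printf '%s- "traefik.http.services.%s.loadbalancer.server.port=%s"\n' "$indent" "$svc" "$port"
}

render_compose() {
  # $1 = service name, $2 = image, $3 = hostname, $4 = container port
  local svc="$1" image="$2" host="$3" port="$4"
  printf 'services:\n'
  printf '  %s:\n' "$svc"
  printf '    image: %s\n' "$image"
  printf '    networks:\n      - dokploy-network\n'
  # Container-level labels (docker provider).
  printf '    labels:\n'
  traefik_labels "$svc" "$host" "$port" '      '
  # Identical labels under deploy: (swarm provider).
  printf '    deploy:\n      labels:\n'
  traefik_labels "$svc" "$host" "$port" '        '
  printf 'networks:\n  dokploy-network:\n    external: true\n'
}
```

For example, `render_compose whoami traefik/whoami whoami.example.com 80 > whoami.compose.yml` writes a file with both label sets; the service name, image, and hostname here are placeholders.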

Networking (the rules that bite)

  • dokploy-network is a swarm OVERLAY → containers on it CANNOT reach the host (not 10.0.1.1, not the Tailscale IP) and cannot cleanly egress to a tailnet peer.
  • To reach the host OR a tailnet peer from a container, give it a second bridge network; its gateway (172.x.0.1) is the host, which then routes/masquerades out. Precedents: n8n → 172.19.0.1; tiger-bridge tiger-net → 172.18.0.1; ha-proxy uses this for tailnet egress.
  • UFW: ufw allow covers bridge subnets (172.x). It does NOT expose docker-published ports; those need ufw-docker allow CONTAINER PORT (DOCKER-USER chain).
  • Always ufw reload after rule changes; verify with iptables -L ufw-user-input -n -v.
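The bridge and UFW rules above translate into a short server-side script. This is a sketch under assumptions: the function names and the RUN dry-run hook are hypothetical, and the subnet and port in the usage note are placeholders rather than values from this setup.

```shell
#!/usr/bin/env bash
# Sketch only: bridge_gateway, allow_bridge_to_host, and RUN are hypothetical.
# Set RUN=echo to print the commands instead of running them.

bridge_gateway() {
  # Prints the gateway of a docker network, i.e. the host as seen from that bridge.
  docker network inspect -f '{{(index .IPAM.Config 0).Gateway}}' "$1"
}

allow_bridge_to_host() {
  # $1 = bridge subnet in CIDR form, $2 = host TCP port the container needs
  local subnet="$1" port="$2"
  ${RUN:-} ufw allow from "$subnet" to any port "$port" proto tcp
  ${RUN:-} ufw reload
  # Verify the rule actually landed.
  ${RUN:-} iptables -L ufw-user-input -n -v
}
```

A call such as `RUN=echo allow_bridge_to_host 172.19.0.0/16 5432` prints the three commands for review before they are run for real.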

Deploy loop

  • Git-driven services: source in ~/MyProjects/<svc>/, Forgejo remote git.manohargupta.com/manohar/<svc>. Push → Forgejo webhook → Dokploy rebuild. No manual server steps.
  • infra repo = local ~/MyProjects/deployments/ (remote manohar/infra), pushes over HTTPS:443. Flat *.compose.yml files and per-service subfolders are both fine.
  • Manual (non-Dokploy) stacks — Tiger /opt/tiger/, LiteLLM, code-server — compose lives in the repo, deployed by hand.
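For git-driven services the whole deploy is a push, so the only local risk is pushing while work is still uncommitted. A minimal sketch of a guard, assuming a hypothetical helper name deploy_push, a remote named origin, and a default branch of main (none of which are stated above):

```shell
#!/usr/bin/env bash
# Sketch only: deploy_push is a hypothetical helper; origin/main are assumptions.

deploy_push() {
  # $1 = service name under ~/MyProjects/, $2 = branch (defaults to main)
  local svc="$1" branch="${2:-main}"
  local repo="$HOME/MyProjects/$svc"
  # Only committed work reaches Forgejo, so refuse to push a dirty tree.
  if [ -n "$(git -C "$repo" status --porcelain)" ]; then
    echo "deploy_push: uncommitted changes in $repo; commit first" >&2
    return 1
  fi
  # The push triggers the Forgejo webhook, which triggers the Dokploy rebuild.
  git -C "$repo" push origin "$branch"
}
```

After `deploy_push <svc>` returns, nothing further is run on the server; the rebuild is followed in the Dokploy UI.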

Working style

  • Root cause before fix; state tradeoffs between fix paths.
  • One mini-question / understanding check per major topic.
  • Explicit risk flag before any change touching security, stability, or data.
  • Token-efficient: batch ops, don't re-explain established context.
  • Don't redo security hardening (UFW/ufw-docker/fail2ban/SSH) — it's done.