Manohar 50a6520c20 docs: rewrite README + ARCHITECTURE for 2026-06-10 reality, extend TOOLS

ARCHITECTURE was last true on 2026-05-03 (pre-gateway, OpenRouter chains,
webhook mirror). Now documents: LiteLLM gateway routing, real spawning,
inbox loop, transcript mirror, audit trail, token rotation procedure,
RAM constraints. README no longer says 'Clawd Dashboard'.

2026-06-10 14:59:41 +00:00


Tiger Command Center — Architecture

Last updated: 2026-06-10. Covers the gateway migration, real sub-agent spawning, the TASKS.md inbox loop, the Telegram transcript mirror, and the unified audit trail.

1. System Overview

Self-hosted AI agent orchestration on a Hetzner VPS (8 GB RAM, Helsinki; Tailscale 100.75.128.45). Three host services plus one containerised AI runtime sit behind Traefik, and ALL model traffic is routed through a self-hosted LiteLLM gateway that calls each provider with its own key, so an exhausted balance at a third-party aggregator can no longer silently kill the system.

Internet/Manohar
     |  HTTPS 443
     v
dokploy-traefik (v3.6.7)
     |
     +-- agent.manohargupta.com --> tiger-dashboard (Next.js, :3100)
     |                                    | /api/* proxies (token server-side)
     |                                    v
     |                             tiger-bridge (Express+tsx, :3456, localhost)
     |                                    |  docker exec / volume reads
     |                                    v
     |                             tiger-openclaw (OpenClaw v2026.3.12)
     |                                    |
     +-- llm.manohargupta.com ----> litellm-gateway <-- ALL model calls
     |                                    |-- MiniMax API (own key): minimax-3 (primary),
     |                                    |   minimax-2.7, minimax-2.7-fast
     |                                    +-- Anthropic API (own key): claude-haiku, claude-sonnet
     |
     +-- angel.manohargupta.com --> position-tracker (standalone repo/deploy)
     |
Telegram @Tiger_4321_bot <--> OpenClaw native channel (long-polling, owns the bot)

2. Model Routing (post-OpenRouter)

OpenRouter was removed 2026-06-10 after its credits ran dry and silently broke both Tiger and the bridge's classifier. Everything now goes through the self-hosted gateway:

OpenClaw (openclaw.json): custom provider litellm (baseUrl: https://llm.manohargupta.com/v1, api: openai-completions). Primary litellm/minimax-3 (1M ctx), fallbacks litellm/minimax-2.7 → litellm/claude-haiku (cross-provider: survives a MiniMax outage).
Bridge (lib/llm.ts): slugs starting anthropic/ go to Anthropic direct; everything else goes to the gateway. Env: LLM_GATEWAY_URL, LLM_GATEWAY_KEY, TIGER_ROUTER_MODEL (default minimax-3).
Gateway config: /root/litellm/litellm_config.yaml (request_timeout: 300 to match the cron budget).
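
The bridge-side routing rule above can be sketched as a small pure function. This is an illustration under stated assumptions, not the actual lib/llm.ts: the names resolveRoute, Route and Backend are hypothetical, and whether the anthropic/ prefix is stripped before the direct call is assumed here. The env variable name and the minimax-3 default come from the text above.

```typescript
// Hypothetical sketch of the bridge routing rule; not the real lib/llm.ts.
type Backend = "anthropic-direct" | "gateway";

interface Route {
  backend: Backend;
  model: string; // model id sent to the chosen backend
}

const ANTHROPIC_PREFIX = "anthropic/";

// The environment is passed in so the function stays pure and easy to test.
function resolveRoute(
  slug: string | undefined,
  env: Record<string, string | undefined>
): Route {
  // No explicit slug: fall back to TIGER_ROUTER_MODEL, then to the documented default.
  const effective = slug ?? env["TIGER_ROUTER_MODEL"] ?? "minimax-3";
  if (effective.startsWith(ANTHROPIC_PREFIX)) {
    // Assumption: the prefix is dropped before calling Anthropic directly.
    return { backend: "anthropic-direct", model: effective.slice(ANTHROPIC_PREFIX.length) };
  }
  // Everything else goes through the LiteLLM gateway unchanged.
  return { backend: "gateway", model: effective };
}
```

In a real caller the gateway branch would read LLM_GATEWAY_URL and LLM_GATEWAY_KEY from the same env object; that part is omitted because the request shape is not described here.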

3. Sub-Agent Execution (the orchestration layer)

bridge/src/lib/agents.ts is the canonical specialist registry: cody (code), ethan (research), cathy (writing), elon (PM). Legacy ids coder/researcher/writer/pm are accepted as aliases.
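
A registry with legacy aliases could look roughly like the sketch below. The real shape of lib/agents.ts is not shown in this document, so AGENT_ROLES, LEGACY_ALIASES and resolveAgent are hypothetical names, and the alias pairing (coder to cody, and so on) is inferred from the order in which the roles are listed above.

```typescript
// Hypothetical sketch of a specialist registry with alias resolution.
type AgentId = "cody" | "ethan" | "cathy" | "elon";

const AGENT_ROLES: Record<AgentId, string> = {
  cody: "code",
  ethan: "research",
  cathy: "writing",
  elon: "PM",
};

// Assumed pairing: each legacy id maps to the specialist with the same role.
const LEGACY_ALIASES: Record<string, AgentId> = {
  coder: "cody",
  researcher: "ethan",
  writer: "cathy",
  pm: "elon",
};

function hasOwn(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Returns the canonical id, or undefined for an unknown agent.
function resolveAgent(id: string): AgentId | undefined {
  const key = id.trim().toLowerCase();
  if (hasOwn(AGENT_ROLES, key)) return key as AgentId;
  return hasOwn(LEGACY_ALIASES, key) ? LEGACY_ALIASES[key] : undefined;
}
```

Resolving aliases once at the edge keeps the rest of the code on canonical ids only.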

A spawn (POST /tiger/spawn) runs an isolated OpenClaw session (--session-id spawn-<agent>-<id>) with the specialist persona prepended. The message is written to a temp file and copied into the container with docker cp, which avoids shell-escaping problems. Runs are tracked in the executions table and serialized (MAX_CONCURRENT=1), because parallel turns push the 8 GB host into swap and everything times out. Completion fires a Telegram notification via /tiger/notify.
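
Serialized execution with MAX_CONCURRENT=1 can be illustrated with a small promise queue. This is a minimal sketch, not the code in spawn.ts: SpawnQueue and spawnSessionId are hypothetical names, and only the session-id pattern and the concurrency limit are taken from the paragraph above.

```typescript
// Builds the session id in the documented spawn-<agent>-<id> form.
function spawnSessionId(agent: string, id: string): string {
  return `spawn-${agent}-${id}`;
}

// Hypothetical queue: at most `limit` jobs run at once; extra jobs wait in FIFO order.
class SpawnQueue {
  private active = 0;
  private readonly waiting: Array<() => void> = [];
  private readonly limit: number;

  constructor(limit = 1) {
    this.limit = limit;
  }

  async run<T>(job: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // Wait until a finishing job hands its slot over.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await job();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // the slot passes directly to the next waiter
      } else {
        this.active--;
      }
    }
  }
}
```

Handing the slot straight to the next waiter, instead of decrementing and re-incrementing the counter, prevents a newly arriving job from slipping in between two queued ones.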

Upgrade path: define real per-agent entries in openclaw.json agents.list (own IDENTITY.md + workspace each), then change the --agent flag in spawn.ts. Documented in lib/agents.ts; deferred until the RAM situation is resolved.

4. TASKS.md Inbox Loop

workspace/TASKS.md has a ## 📥 INBOX section. bridge/src/lib/inbox.ts checks it every 30 min between 09:00 and 20:00 IST. Each check takes the first - [ ] line, classifies it (classifyAgent), spawns the matching specialist, and rewrites the line to - [⏳ run-id → agent] so it is not picked up again. Manual trigger: POST /tiger/inbox/drain. Because the schedule lives in the bridge rather than in an OpenClaw cron, empty checks burn zero model tokens and no bearer token has to be embedded in a cron prompt.
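
The claim step of the loop can be sketched as two pure functions. This is an illustration, not the actual inbox.ts: claimFirstInboxTask and withinInboxWindow are hypothetical names, the section is assumed to be recognised by the word INBOX in a level-2 heading, the task text is assumed to be kept after the marker, and the 20:00 boundary is assumed to be inclusive.

```typescript
interface Claim {
  task: string;    // text of the claimed task
  updated: string; // TASKS.md content with the claimed line rewritten
}

const OPEN_TASK = /^(\s*)- \[ \] (.*)$/;

// Finds the first unchecked task inside the INBOX section and marks it as claimed.
function claimFirstInboxTask(markdown: string, runId: string, agent: string): Claim | null {
  const lines = markdown.split("\n");
  let inInbox = false;
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i] ?? "";
    if (line.startsWith("## ")) {
      inInbox = line.includes("INBOX"); // assumption: heading text identifies the section
      continue;
    }
    if (!inInbox) continue;
    const m = OPEN_TASK.exec(line);
    if (!m) continue;
    const indent = m[1] ?? "";
    const task = m[2] ?? "";
    // Assumption: the original task text stays on the line after the marker.
    lines[i] = `${indent}- [⏳ ${runId} → ${agent}] ${task}`;
    return { task, updated: lines.join("\n") };
  }
  return null; // empty inbox: nothing to spawn, no model call needed
}

// True between 09:00 and 20:00 IST. IST is UTC+05:30 and has no daylight saving.
function withinInboxWindow(now: Date): boolean {
  const istMinutes = (now.getUTCHours() * 60 + now.getUTCMinutes() + 330) % 1440;
  return istMinutes >= 9 * 60 && istMinutes <= 20 * 60;
}
```

Because the claimed line no longer matches the unchecked pattern, a second pass over the same file moves on to the next task.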

5. Telegram

The bot is owned by OpenClaw's native channel (long-polling). The bridge's TelegramChannel, telegram-webhook.ts and chat-mirror.ts are legacy: Telegram does not allow a webhook and getUpdates polling on the same bot token, so the webhook design could never receive a message.
The dashboard mirror reads the native session transcript — routes/chat-telegram.ts resolves the telegram: session from sessions.json and serves the JSONL with cursor pagination and mtime caching. It filters to what Telegram actually saw: assistant messages carrying toolCall blocks (working narration) are skipped, thinking blocks ignored, injected metadata/system boilerplate stripped from user messages.
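
The filtering rules for the mirror can be sketched as follows. The transcript schema is not documented here, so the Block and TranscriptEntry shapes below are assumptions made for illustration (only the toolCall and thinking block names come from the text), and parseJsonl and toMirror are hypothetical names. Stripping injected metadata from user messages is left out because its format is not described.

```typescript
// Assumed block shapes; the real JSONL schema may differ.
type TextBlock = { type: "text"; text: string };
type Block = TextBlock | { type: "thinking"; thinking: string } | { type: "toolCall"; name: string };

interface TranscriptEntry {
  role: "user" | "assistant";
  content: Block[];
}

interface MirrorMessage {
  role: "user" | "assistant";
  text: string;
}

// One JSON object per non-empty line.
function parseJsonl(raw: string): TranscriptEntry[] {
  return raw
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as TranscriptEntry);
}

// Keeps only what Telegram would have shown.
function toMirror(entries: TranscriptEntry[]): MirrorMessage[] {
  const out: MirrorMessage[] = [];
  for (const entry of entries) {
    // Assistant turns carrying a tool call are working narration and are skipped whole.
    if (entry.role === "assistant" && entry.content.some((b) => b.type === "toolCall")) continue;
    // Only text blocks survive; thinking blocks are ignored.
    const text = entry.content
      .filter((b): b is TextBlock => b.type === "text")
      .map((b) => b.text)
      .join("\n")
      .trim();
    if (text.length > 0) out.push({ role: entry.role, text });
  }
  return out;
}
```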

6. Audit Trail

GET /tiger/activity/audit merges, at read time, every durable action store: executions (spawns), tasks (lifecycle), outputs (artifacts), and OpenClaw's cron run JSONL. Cursor-paginated (before=<ISO>), type filters. The dashboard /activity page adds recent file-modification events on the first page. Read-time merging means history is complete retroactively and no action can happen without its audit row.
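
Read-time merging with a before=<ISO> cursor can be sketched like this. It is a simplified in-memory illustration, not the route's actual code: AuditEvent, AuditPage and mergeAudit are hypothetical names, the event fields are assumed, and the default page size of 50 is an arbitrary choice for the example.

```typescript
interface AuditEvent {
  at: string;      // ISO-8601 timestamp
  type: string;    // e.g. "execution", "task", "output", "cron"
  summary: string;
}

interface AuditPage {
  events: AuditEvent[];
  nextBefore: string | null; // cursor for the next (older) page, null when exhausted
}

function mergeAudit(
  sources: AuditEvent[][],
  opts: { before?: string; types?: string[]; limit?: number } = {}
): AuditPage {
  const limit = opts.limit ?? 50;
  const types = opts.types;
  const beforeMs = opts.before === undefined ? Infinity : Date.parse(opts.before);

  const merged = ([] as AuditEvent[])
    .concat(...sources)
    .filter((e) => Date.parse(e.at) < beforeMs)
    .filter((e) => types === undefined || types.indexOf(e.type) !== -1)
    .sort((a, b) => Date.parse(b.at) - Date.parse(a.at)); // newest first

  const events = merged.slice(0, limit);
  const last = events[events.length - 1];
  const hasMore = merged.length > limit && last !== undefined;
  return { events, nextBefore: hasMore ? last.at : null };
}
```

One caveat of a timestamp-only cursor: two events with an identical timestamp that straddle a page boundary would lose the second one, so a production version would likely need a tie-breaker such as a row id.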

7. Crons (OpenClaw, tz Asia/Kolkata)

Job	Schedule	Timeout
Trade Baseline Reset	9:15 daily	60s
Trade P&L Monitor	every 2 min	60s
Hourly Trade Summary + News	hourly	90s
Hourly Task Check-in	0 9-21	300s
EOD Trade Summary	16:00 Mon–Fri	300s
Weekly Digest	Mon 9:00	300s

Timeout budget rationale: agent turns on this RAM-starved host can take minutes; 300s is the ceiling that made chronically-failing jobs pass.

8. Security Posture

Bridge: Bearer auth on all routes. The token lives in four places: bridge/.env, dashboard/.env.local, and twice in the cron payloads in jobs.json. Rotate all four together. Rotated 2026-06-10 after the old token leaked to the public GitHub mirror via a hardcode in agents-activity.ts. NEVER hardcode tokens in source: this repo mirrors publicly.
Git: Forgejo (origin, SSH port 2222, key id_ed25519_forgejo) + GitHub mirror. Push both.
position-tracker binds 127.0.0.1:3457; public access via Traefik at angel.manohargupta.com.
Known weak spots: litellm-db password, /opt/dashboard fossil with a stale token, dual Telegram pollers (bridge poller should be disabled).
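
As an illustration of the "token from the environment, never from source" rule, a bearer check might look like the sketch below. It is not the bridge's actual middleware: parseBearer, safeEqual and isAuthorized are hypothetical names, and BRIDGE_TOKEN is an assumed variable name, since the real key in bridge/.env is not given here.

```typescript
// Extracts the token from an "Authorization: Bearer <token>" header value.
function parseBearer(header: string | undefined): string | null {
  if (header === undefined) return null;
  const m = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return m ? (m[1] ?? null) : null;
}

// Compares without an early exit, so timing does not reveal where the strings differ.
function safeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i++) {
    // charCodeAt returns NaN past the end of a string; treat that as 0.
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

// The expected token comes from the environment object, never from a literal in source.
function isAuthorized(
  header: string | undefined,
  env: Record<string, string | undefined>
): boolean {
  const expected = env["BRIDGE_TOKEN"]; // hypothetical variable name
  const presented = parseBearer(header);
  if (!expected || presented === null) return false;
  return safeEqual(presented, expected);
}
```

In a Node service, crypto.timingSafeEqual on equal-length buffers is the usual choice for the comparison; the hand-written loop only keeps this example dependency-free.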

9. Known Constraints

RAM: ~13 GB workload on 8 GB physical; 6+ GB swap in steady state. This is the root cause of historical cron timeouts and the reason spawn concurrency is 1. Decision pending: evict homelab services vs upgrade.
OpenClaw v2026.3.12 predates MiniMax-M3, hence the explicit litellm/minimax-3 provider-prefixed model id.
