OpenClawDashboard/ARCHITECTURE.md
Manohar 50a6520c20 docs: rewrite README + ARCHITECTURE for 2026-06-10 reality, extend TOOLS
ARCHITECTURE was last true on 2026-05-03 (pre-gateway, OpenRouter chains,
webhook mirror). Now documents: LiteLLM gateway routing, real spawning,
inbox loop, transcript mirror, audit trail, token rotation procedure,
RAM constraints. README no longer says 'Clawd Dashboard'.
2026-06-10 14:59:41 +00:00


# Tiger Command Center — Architecture
*Last updated: 2026-06-10. Covers the gateway migration, real sub-agent
spawning, the TASKS.md inbox loop, the Telegram transcript mirror, and the
unified audit trail.*

---
## 1. System Overview
Self-hosted AI agent orchestration on a Hetzner VPS (8 GB RAM, Helsinki;
Tailscale 100.75.128.45). Three host services + one containerised AI
runtime behind Traefik, with ALL model traffic routed through a self-hosted
LiteLLM gateway — no third-party balance can silently kill the system.
```
Internet/Manohar
      | HTTPS 443
      v
dokploy-traefik (v3.6.7)
      |
      +-- agent.manohargupta.com --> tiger-dashboard (Next.js, :3100)
      |                                   | /api/* proxies (token server-side)
      |                                   v
      |                              tiger-bridge (Express+tsx, :3456, localhost)
      |                                   | docker exec / volume reads
      |                                   v
      |                              tiger-openclaw (OpenClaw v2026.3.12)
      |                                   |
      +-- llm.manohargupta.com ----> litellm-gateway <-- ALL model calls
      |                                   |-- MiniMax API (own key): minimax-3 (primary),
      |                                   |     minimax-2.7, minimax-2.7-fast
      |                                   +-- Anthropic API (own key): claude-haiku, claude-sonnet
      |
      +-- angel.manohargupta.com --> position-tracker (standalone repo/deploy)
      |
Telegram @Tiger_4321_bot <--> OpenClaw native channel (long-polling, owns the bot)
```
## 2. Model Routing (post-OpenRouter)
OpenRouter was removed 2026-06-10 after its credits ran dry and silently
broke both Tiger and the bridge's classifier. Everything now goes through
the self-hosted gateway:
- **OpenClaw** (`openclaw.json`): custom provider `litellm`
(`baseUrl: https://llm.manohargupta.com/v1`, `api: openai-completions`).
Primary `litellm/minimax-3` (1M ctx), fallbacks `litellm/minimax-2.7` and
`litellm/claude-haiku` (cross-provider: survives a MiniMax outage).
- **Bridge** (`lib/llm.ts`): slugs starting `anthropic/` go to Anthropic
direct; everything else goes to the gateway. Env: `LLM_GATEWAY_URL`,
`LLM_GATEWAY_KEY`, `TIGER_ROUTER_MODEL` (default `minimax-3`).
- **Gateway config**: `/root/litellm/litellm_config.yaml`
(`request_timeout: 300` to match the cron budget).
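
The bridge-side routing rule above can be sketched as a small pure function. This is an illustrative sketch, not the contents of `lib/llm.ts`: `routeModel`, `RouterEnv` and `RouteTarget` are hypothetical names, and stripping the `anthropic/` prefix before the direct call is an assumption.

```typescript
// Hypothetical sketch of the routing rule; not the actual lib/llm.ts.
interface RouterEnv {
  LLM_GATEWAY_URL: string;     // e.g. https://llm.manohargupta.com/v1
  TIGER_ROUTER_MODEL?: string; // defaults to "minimax-3" when unset
}

type RouteTarget =
  | { kind: "anthropic-direct"; model: string }
  | { kind: "gateway"; baseUrl: string; model: string };

const ANTHROPIC_PREFIX = "anthropic/";

function routeModel(slug: string | undefined, env: RouterEnv): RouteTarget {
  // No explicit slug: fall back to the configured router model.
  const model = slug ?? env.TIGER_ROUTER_MODEL ?? "minimax-3";
  if (model.startsWith(ANTHROPIC_PREFIX)) {
    // Assumption: the prefix is removed before calling Anthropic directly.
    return {
      kind: "anthropic-direct",
      model: model.slice(ANTHROPIC_PREFIX.length),
    };
  }
  // Everything else goes through the self-hosted gateway.
  return { kind: "gateway", baseUrl: env.LLM_GATEWAY_URL, model };
}
```

Keeping the decision in one function means a provider outage or a billing change touches a single place rather than every call site.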
## 3. Sub-Agent Execution (the orchestration layer)
`bridge/src/lib/agents.ts` is the canonical specialist registry:
**cody** (code), **ethan** (research), **cathy** (writing), **elon** (PM).
Legacy ids coder/researcher/writer/pm are accepted as aliases.
A spawn (`POST /tiger/spawn`) runs an isolated OpenClaw session
(`--session-id spawn-<agent>-<id>`) with the specialist persona prepended.
Message transport is docker-cp of a temp file (escaping-proof). Runs are
tracked in the `executions` table and serialized (`MAX_CONCURRENT=1`, because
parallel turns push the 8 GB host into swap and everything times out).
Completion fires a Telegram notification via `/tiger/notify`.
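
A minimal sketch of alias resolution and session naming, assuming the legacy ids map to the canonical specialists in the order they are listed above (coder to cody, and so on). `resolveAgent` and `spawnSessionId` are hypothetical helpers, not the actual exports of `lib/agents.ts`.

```typescript
// Hypothetical sketch; the real registry lives in bridge/src/lib/agents.ts.
const CANONICAL_AGENTS = ["cody", "ethan", "cathy", "elon"] as const;
type AgentId = (typeof CANONICAL_AGENTS)[number];

// Assumption: the alias pairing is inferred from the listing order in the text.
const LEGACY_ALIASES = new Map<string, AgentId>([
  ["coder", "cody"],
  ["researcher", "ethan"],
  ["writer", "cathy"],
  ["pm", "elon"],
]);

// Accepts a canonical or legacy id; returns undefined for unknown ids.
function resolveAgent(raw: string): AgentId | undefined {
  const id = raw.trim().toLowerCase();
  const canonical = CANONICAL_AGENTS.find((agent) => agent === id);
  return canonical ?? LEGACY_ALIASES.get(id);
}

// Builds the isolated session id in the spawn-<agent>-<id> form.
function spawnSessionId(agent: AgentId, runId: string): string {
  return `spawn-${agent}-${runId}`;
}
```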
Upgrade path: define real per-agent entries in `openclaw.json agents.list`
(own IDENTITY.md + workspace each), then change the `--agent` flag in
spawn.ts. Documented in lib/agents.ts; deferred until the RAM situation is
resolved.
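
The serialized execution described in this section can be illustrated with a small promise queue. This is a sketch under stated assumptions: `SpawnQueue` is a hypothetical class, and the real bridge may enforce `MAX_CONCURRENT=1` differently (for example via the `executions` table).

```typescript
// Hypothetical sketch of a concurrency-limited runner (limit 1 by default).
type Job<T> = () => Promise<T>;

class SpawnQueue {
  readonly maxConcurrent: number;
  active = 0;                       // jobs currently running
  pending: Array<() => void> = [];  // jobs waiting for a free slot

  constructor(maxConcurrent: number = 1) {
    this.maxConcurrent = maxConcurrent;
  }

  run<T>(job: Job<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const start = (): void => {
        this.active += 1;
        // Promise.resolve().then(job) also captures synchronous throws.
        Promise.resolve()
          .then(job)
          .then(resolve, reject)
          .finally(() => {
            this.active -= 1;
            const next = this.pending.shift();
            if (next) next(); // hand the free slot to the oldest waiter
          });
      };
      if (this.active < this.maxConcurrent) start();
      else this.pending.push(start);
    });
  }
}
```

With a limit of 1, a second spawn request waits in `pending` until the first turn settles, which trades latency for staying out of swap.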
## 4. TASKS.md Inbox Loop
`workspace/TASKS.md` has a `## 📥 INBOX` section. `bridge/src/lib/inbox.ts`
checks every 30 min (09:00-20:00 IST): takes the first `- [ ]` line,
classifies it (`classifyAgent`), spawns the specialist, rewrites the line to
`- [⏳ run-id → agent]`. Manual trigger: `POST /tiger/inbox/drain`.
Bridge-side scheduling means zero model tokens burned on empty checks and
no bearer tokens embedded in cron prompts.
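
The claim step of the loop can be sketched as a pure string transform. The helper names are hypothetical, and two details are assumptions: the task text is kept after the `[⏳ run-id → agent]` marker, and the 20:00 end of the window is inclusive.

```typescript
// Hypothetical sketch; not the actual bridge/src/lib/inbox.ts.
interface ClaimResult {
  task: string;    // text of the claimed inbox item
  updated: string; // TASKS.md content with the line rewritten
}

// Claims the first unchecked item inside the heading that contains "INBOX".
function claimFirstInboxTask(
  markdown: string,
  runId: string,
  agent: string,
): ClaimResult | null {
  const lines = markdown.split("\n");
  let inInbox = false;
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    if (line.startsWith("## ")) {
      inInbox = line.includes("INBOX"); // any other H2 ends the section
      continue;
    }
    if (!inInbox) continue;
    const match = /^- \[ \] (.+)$/.exec(line);
    if (!match) continue;
    // Assumption: the original task text is preserved after the marker.
    lines[i] = `- [⏳ ${runId} → ${agent}] ${match[1]}`;
    return { task: match[1], updated: lines.join("\n") };
  }
  return null; // empty inbox: nothing to spawn, no model tokens spent
}

// True when `now` falls within 09:00-20:00 IST (UTC+05:30).
function isWithinInboxWindow(now: Date): boolean {
  const utcMinutes = now.getUTCHours() * 60 + now.getUTCMinutes();
  const istMinutes = (utcMinutes + 330) % 1440;
  return istMinutes >= 9 * 60 && istMinutes <= 20 * 60;
}
```

Because the check is plain string handling on the bridge, an empty inbox costs no model call at all.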
## 5. Telegram
- **The bot is owned by OpenClaw's native channel** (long-polling). The
bridge's `TelegramChannel`, `telegram-webhook.ts` and `chat-mirror.ts`
are legacy: Telegram forbids webhook + getUpdates on one token, so the
webhook design could never receive a message.
- **The dashboard mirror reads the native session transcript** —
`routes/chat-telegram.ts` resolves the `telegram:` session from
`sessions.json` and serves the JSONL with cursor pagination and mtime
caching. It filters to what Telegram actually saw: assistant messages
carrying toolCall blocks (working narration) are skipped, thinking blocks
ignored, injected metadata/system boilerplate stripped from user messages.
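
The transcript filtering rules above might look roughly like this. The entry shape (`role`, `content` blocks with a `type` field) is an assumption about the JSONL format, `visibleOnTelegram` is a hypothetical name, and the stripping of injected metadata from user messages is omitted because its format is not described here.

```typescript
// Hypothetical transcript shape; the real OpenClaw JSONL may differ.
type Block =
  | { type: "text"; text: string }
  | { type: "thinking"; text: string }
  | { type: "toolCall"; name: string };

interface TranscriptEntry {
  role: "user" | "assistant" | "system";
  content: Block[];
}

interface VisibleMessage {
  role: "user" | "assistant";
  text: string;
}

function parseLine(raw: string): TranscriptEntry | null {
  try {
    return JSON.parse(raw) as TranscriptEntry;
  } catch {
    return null; // skip malformed lines instead of failing the whole read
  }
}

function visibleOnTelegram(jsonl: string): VisibleMessage[] {
  const out: VisibleMessage[] = [];
  for (const raw of jsonl.split("\n")) {
    if (raw.trim() === "") continue;
    const entry = parseLine(raw);
    if (entry === null || !Array.isArray(entry.content)) continue;
    const role = entry.role;
    if (role !== "user" && role !== "assistant") continue;
    // Working narration: assistant turns that carry tool calls are skipped.
    if (role === "assistant" && entry.content.some((b) => b.type === "toolCall")) {
      continue;
    }
    // Thinking blocks are ignored; only text blocks are kept.
    const text = entry.content
      .filter((b): b is Extract<Block, { type: "text" }> => b.type === "text")
      .map((b) => b.text)
      .join("\n")
      .trim();
    if (text !== "") out.push({ role, text });
  }
  return out;
}
```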
## 6. Audit Trail
`GET /tiger/activity/audit` merges, at read time, every durable action
store: `executions` (spawns), `tasks` (lifecycle), `outputs` (artifacts),
and OpenClaw's cron run JSONL. Cursor-paginated (`before=<ISO>`), type
filters. The dashboard `/activity` page adds recent file-modification
events on the first page. Read-time merging means history is complete
retroactively and no action can happen without its audit row.
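
Read-time merging with a `before=<ISO>` cursor can be sketched as follows. `AuditEvent`, `mergeAudit` and the default page size of 50 are assumptions for illustration, not the route's actual types or limits.

```typescript
// Hypothetical sketch of merging several stores into one audit page.
type AuditType = "spawn" | "task" | "output" | "cron";

interface AuditEvent {
  at: string; // ISO-8601 timestamp
  type: AuditType;
  summary: string;
}

interface AuditPage {
  events: AuditEvent[];
  nextBefore: string | null; // cursor for the next (older) page
}

function mergeAudit(
  sources: AuditEvent[][],
  opts: { before?: string; types?: AuditType[]; limit?: number } = {},
): AuditPage {
  const limit = opts.limit ?? 50; // assumed default page size
  const cutoff = opts.before ? Date.parse(opts.before) : Infinity;
  const merged = sources
    .flat()
    .filter((e) => Date.parse(e.at) < cutoff)
    .filter((e) => !opts.types || opts.types.includes(e.type))
    .sort((a, b) => Date.parse(b.at) - Date.parse(a.at)); // newest first
  const events = merged.slice(0, limit);
  const nextBefore =
    merged.length > limit ? events[events.length - 1].at : null;
  return { events, nextBefore };
}
```

A client pages backwards by passing the returned `nextBefore` as the next request's `before` value until it comes back as `null`.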
## 7. Crons (OpenClaw, tz Asia/Kolkata)
| Job | Schedule | Timeout |
|---|---|---|
| Trade Baseline Reset | 9:15 daily | 60s |
| Trade P&L Monitor | every 2 min | 60s |
| Hourly Trade Summary + News | hourly | 90s |
| Hourly Task Check-in | hourly, 9:00-21:00 (`0 9-21`) | 300s |
| EOD Trade Summary | 16:00 Mon-Fri | 300s |
| Weekly Digest | Mon 9:00 | 300s |

Timeout budget rationale: agent turns on this RAM-starved host can take
minutes; 300s is the ceiling that made chronically-failing jobs pass.
## 8. Security Posture
- Bridge: Bearer auth on all routes; token in `bridge/.env` +
`dashboard/.env.local` + embedded in cron payloads (rotate all four
together — `jobs.json` has it twice). Rotated 2026-06-10 after the old
token leaked via a hardcode in `agents-activity.ts` to the public GitHub
mirror. NEVER hardcode tokens in source: this repo mirrors publicly.
- Git: Forgejo (origin, SSH port 2222, key `id_ed25519_forgejo`) + GitHub
mirror. Push both.
- position-tracker binds 127.0.0.1:3457; public access via Traefik at
angel.manohargupta.com.
- Known weak spots: litellm-db password, `/opt/dashboard` fossil with a
stale token, dual Telegram pollers (bridge poller should be disabled).
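
As an illustration of the "never hardcode tokens" rule, a bearer check can take the expected token as a parameter that the caller reads from the environment. `isAuthorized` and the `TIGER_BRIDGE_TOKEN` variable name are hypothetical, and the constant-time comparison is a suggested hardening rather than something the source says the bridge does.

```typescript
import { timingSafeEqual } from "node:crypto";

// Hypothetical helper: validates an Authorization header against a token
// supplied by the caller (never a literal in source).
function isAuthorized(
  authorizationHeader: string | undefined,
  expectedToken: string | undefined,
): boolean {
  if (!authorizationHeader || !expectedToken) return false;
  const match = /^Bearer (.+)$/.exec(authorizationHeader);
  if (!match) return false;
  const encoder = new TextEncoder();
  const given = encoder.encode(match[1]);
  const expected = encoder.encode(expectedToken);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return given.length === expected.length && timingSafeEqual(given, expected);
}

// Usage sketch (variable name is an assumption):
// const ok = isAuthorized(req.headers.authorization, process.env.TIGER_BRIDGE_TOKEN);
```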
## 9. Known Constraints
- **RAM**: ~13GB workload on 8GB physical; 6+GB swap in steady state. This
is the root cause of historical cron timeouts and the reason spawn
concurrency is 1. Decision pending: evict homelab services vs upgrade.
- OpenClaw v2026.3.12 predates MiniMax-M3, hence the explicit
`litellm/minimax-3` provider-prefixed model id.