Keysat e5a751d4f4 Add Phase 3 Spark Control integration spec; mark Phase 2 done
docs/spark-control-integration.md: the SSH command contract (status via
docker inspect; restart via docker restart; update via git fetch + reset
--hard origin/master + docker compose up -d --build) plus the one-time
conversion of the Spark's ~/matrix-bridge to a Gitea clone. No bot code
change. Update source = git-pull-from-Gitea; rides Spark Control's existing
SSH into spark-32d0 (no new key). Corrected the infra note: Spark is on the
LAN with the Start9/Gitea host, so Spark->Gitea resolves directly.
2026-06-15 20:48:18 -05:00

# Phase 3 — Spark Control integration (spec for the Spark Control dev)
**Goal (ROADMAP Phase 3):** surface the matrix-bridge bot's container status on the Spark
Control dashboard, and add one-click **update** (pull + rebuild + restart) and **restart**,
wired the same SSH-behind-buttons way Spark Control already drives the Sparks.
**Exit (falsifiable):** bot status is visible on the panel, and the bot can be
updated/restarted from the panel.
This document is the **contract**: what to run, where, and what the output means. The
matrix-bridge side is fixed below; map the buttons onto Spark Control's existing
managed-service pattern however that codebase already models a Spark/service. No changes to
matrix-bridge are required for this.
---
## What the bot is
A single Docker container on the DGX Spark.
| Fact | Value |
|---|---|
| Host | `spark-32d0` (`10.59.211.6` on WireGuard), user **`modelo`** |
| Project dir | `/home/modelo/matrix-bridge` (`~/matrix-bridge` for modelo) |
| Compose service | `bot` |
| Container name | `matrix-bridge` (fixed via `container_name:`) |
| Image | `matrix-bridge-bot` |
| Lifecycle | host networking, `restart: unless-stopped` (survives Spark reboot) |
| Secrets | `.env`, `config.toml`: **gitignored**, live only on the Spark, never in git |

Spark Control already SSHes into `spark-32d0`, so these ride the existing channel — **no new
key needed.** All commands below assume they run **as `modelo`** (owner of the dir, member of
the `docker` group). If Spark Control's channel connects as a different user, wrap each command
in `sudo -iu modelo bash -lc '<command>'` — running `git` in modelo's repo as root trips git's
"dubious ownership" guard, so don't skip this.
---
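
A note on the `sudo -iu modelo bash -lc '<command>'` wrapper: several commands in this spec contain single quotes of their own (the Status command's `-f '...'` template, for one), so pasting them between the outer quotes as-is breaks the quoting. A minimal sketch of a helper that escapes them first, assuming a POSIX shell on the Spark Control side; `wrap_for_modelo` is a hypothetical name, not something Spark Control is known to provide:

```sh
# Hypothetical helper: print CMD wrapped so that it runs as modelo.
# Each embedded single quote is rewritten as '\'' so the outer quoting survives.
wrap_for_modelo() {
  escaped=$(printf '%s' "$1" | sed "s/'/'\\\\''/g")
  printf "sudo -iu modelo bash -lc '%s'\n" "$escaped"
}

# Example: wrap a command that itself contains single quotes.
wrap_for_modelo "docker inspect -f '{{.State.Status}}' matrix-bridge"
```

If the channel already connects as `modelo`, none of this is needed and the commands can be sent verbatim.
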
## One-time prerequisites (owner, not Spark Control dev)
The bot dir on the Spark was originally populated by `scp` of loose files. To make
git-pull-based updates work it must become a git clone of the Gitea repo **without disturbing
the gitignored secrets** (`.env`, `config.toml`). Because those two files are gitignored,
`git reset --hard` never touches them — so we can convert the existing dir in place.
**0a. Confirm the Spark can reach + authenticate to Gitea (fail loud here, not at first button press):**
```sh
git ls-remote ssh://git@immense-voyage.local:59916/grant/matrix-bridge.git >/dev/null \
&& echo "gitea reachable" || echo "FIX gitea access first"
```
The Spark is on the same LAN as the Start9 host running Gitea, so `immense-voyage.local`
resolves directly and this should just work. If it doesn't, the most likely gap is that
`modelo` has no SSH key with read access to the Gitea repo (add a deploy key or authorize an
existing key). Don't proceed until `git ls-remote` succeeds.
**0b. Convert `~/matrix-bridge` to a clone tracking `master` (run as `modelo`):**
```sh
cd /home/modelo/matrix-bridge
git init -b master
git remote add origin ssh://git@immense-voyage.local:59916/grant/matrix-bridge.git
git fetch origin
git reset --hard origin/master # secrets are gitignored → untouched
git branch --set-upstream-to=origin/master master
```
Verify the secrets survived and the container still comes up clean:
```sh
ls -la /home/modelo/matrix-bridge/.env /home/modelo/matrix-bridge/config.toml # both present
git -C /home/modelo/matrix-bridge status # .env/config.toml show as ignored, tree clean
cd /home/modelo/matrix-bridge && docker compose up -d --build && docker ps --filter name=^/matrix-bridge$
```
`master` is the release branch (today `master == phase-1`). Track whatever you treat as the
release line; the commands below assume `origin/master`.
---
## The contract — commands behind each control
Run from `/home/modelo/matrix-bridge` as `modelo`. Each is idempotent and fail-loud
(non-zero exit ⇒ surface it on the panel; don't swallow).
### Status (poll for the badge)
```sh
docker inspect -f '{{.State.Status}}|{{.State.StartedAt}}|{{.RestartCount}}' matrix-bridge
```
- Output e.g. `running|2026-06-15T18:02:11.4Z|0`. Parse field 1 for the badge:
  - `running` → green/up.
  - `exited` → stopped/crashed.
  - `restarting` → unhealthy / boot-looping.
- Field 3 (`RestartCount`) climbing while the status flips to `restarting` ⇒ **crash loop**.
  Show it; that's the most useful signal a dashboard gives here.
- **Non-zero exit** (`No such object: matrix-bridge`) ⇒ **not deployed**, distinct from
  "stopped". Show that state rather than erroring out.
Friendlier one-liner for a human-readable badge (empty string when not running):
```sh
docker ps --filter name=^/matrix-bridge$ --format '{{.Status}}' # e.g. "Up 2 hours"
```
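
The parsing rules for the pipe-delimited Status output can be sketched as a pair of small shell functions. This is only an illustration: `badge_state` and `restart_count` are hypothetical names, the `error` and `other:` labels are additions for cases the list above does not cover (Docker daemon unreachable, states such as `paused`), and the exact wording of Docker's "no such object" message is assumed, so it is matched case-insensitively.

```sh
# Sketch: turn one Status poll into a badge label.
# $1 = exit code of `docker inspect`, $2 = its stdout, $3 = its stderr
badge_state() {
  if [ "$1" -ne 0 ]; then
    # Assumption: docker's "no such object" wording, matched case-insensitively.
    case $(printf '%s' "$3" | tr '[:upper:]' '[:lower:]') in
      *"no such object"*) echo "not-deployed" ;;
      *)                  echo "error" ;;
    esac
    return 0
  fi
  status=${2%%|*}          # field 1 of status|startedAt|restartCount
  case $status in
    running|exited|restarting) echo "$status" ;;
    *)                         echo "other:$status" ;;
  esac
}

# Field 3, for the secondary crash-loop metric.
restart_count() {
  echo "${1##*|}"
}
```

The caller would capture the exit code, stdout and stderr of the Status command and pass all three in. Keeping `error` separate from `not-deployed` avoids reporting a missing container when the real problem is the SSH channel or the Docker daemon.
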
### Logs (optional "view logs" action — handy for diagnosing a red badge)
```sh
docker logs --tail 100 matrix-bridge
```
### Restart (no code change)
```sh
docker restart matrix-bridge
```
### Update (pull latest code + rebuild + recreate) — the headline button
```sh
cd /home/modelo/matrix-bridge \
&& git fetch origin \
&& git reset --hard origin/master \
&& docker compose up -d --build
```
- `git reset --hard origin/master` is the deploy-box "always match remote" semantic: never gets
stuck on divergence, and gitignored secrets are preserved. (If you'd rather detect divergence,
`git pull --ff-only` is the gentler alternative — but then a wedged tree needs manual help.)
- `docker compose up -d --build` rebuilds the image (reusing cached layers where it can) and
  recreates the container only when the image or the service configuration changed. The first
  build after a base-image bump is slow (minutes); subsequent builds hit the layer cache.
  **Treat update as long-running**: stream/await output, set a generous timeout (≥10 min),
  and don't block the dashboard on it.
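
One way to enforce the generous timeout is to run the whole update chain under `timeout` and end with a one-line verdict the panel can show. A minimal sketch, assuming GNU coreutils `timeout` is available on the machine that runs it; `run_with_deadline` and the verdict strings are hypothetical:

```sh
# Sketch: run a command under a time limit and print a one-line verdict last.
# $1 = limit in seconds, remaining args = the command to run
run_with_deadline() {
  limit=$1
  shift
  rc=0
  timeout "$limit" "$@" || rc=$?
  case $rc in
    0)   echo "update: ok" ;;
    124) echo "update: timed out after ${limit}s" ;;   # timeout's own exit code
    *)   echo "update: failed (exit $rc)" ;;
  esac
  return "$rc"
}

# Intended use (not executed here), with a 10-minute limit:
#   run_with_deadline 600 sh -c 'cd /home/modelo/matrix-bridge \
#     && git fetch origin && git reset --hard origin/master \
#     && docker compose up -d --build'
```

The command's own output still streams through unchanged; the verdict is simply the last line, and the function's exit code is that of the update itself.
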
### Stop / Start (optional)
```sh
docker stop matrix-bridge # stop
cd /home/modelo/matrix-bridge && docker compose up -d # start (recreates if needed)
```
---
## Spark Control-side wiring (for the dev)
Map the above onto however Spark Control already registers a managed Spark/service:
1. **Register `matrix-bridge`** as a managed service (a tile), targeting `spark-32d0` over the
existing SSH channel, commands run as `modelo`.
2. **Status badge** ← poll the *Status* command on the panel's normal refresh cadence; map the
four states above (running / exited / restarting / not-deployed) to your existing badge
vocabulary. Surface `RestartCount` if your tile can show a secondary metric — a climbing
count is the crash-loop tell.
3. **Buttons:** `Update`, `Restart` (required for the exit criterion); `Logs`, `Stop`/`Start`
(optional, nice-to-have).
4. **Fail-loud, surfaced.** Every command's non-zero exit + stderr must reach the panel, not a
silent failure — this mirrors matrix-bridge's own discipline (a bad launch reports back into
the room rather than hanging). Especially: a failed `git fetch` (Gitea unreachable) or a
failed build should show the error, not a stuck spinner.
5. **`Update` is long-running** — see the timeout/streaming note above.
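
The fail-loud requirement in item 4 can be sketched as a thin wrapper around whatever runs the command. This assumes nothing about Spark Control's actual command runner; `run_and_report` and the `FAILED ...` line format are hypothetical:

```sh
# Sketch: run a command, pass stdout through, and on failure emit one line
# carrying the exit code and the captured stderr for the panel to display.
run_and_report() {
  err_file=$(mktemp)
  rc=0
  "$@" 2>"$err_file" || rc=$?
  if [ "$rc" -ne 0 ]; then
    printf 'FAILED exit=%s stderr=%s\n' "$rc" "$(cat "$err_file")" >&2
  fi
  rm -f "$err_file"
  return "$rc"
}

# Intended use (not executed here):
#   run_and_report docker restart matrix-bridge
```

The wrapper returns the original exit code, so the caller can still branch on it while the error text travels with it.
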
What I deliberately left generic: the tile's exact place in Spark Control's code, its UI, and
its config schema — that's yours to fit to the existing pattern. If a precise drop-in matters,
share how a Spark is currently registered (config entry + the command-runner seam) and I'll
tailor steps 1-5 to it.
---
## Acceptance (maps to the ROADMAP exit)
- [ ] Status tile shows the bot's live state and flips correctly across a manual
`docker stop` / `docker start` on the Spark.
- [ ] `Restart` from the panel cycles the container (status returns to `running`).
- [ ] `Update` from the panel pulls a new commit, rebuilds, and recreates the container — and
surfaces a clear error if Gitea is unreachable or the build fails.
---
## Note — optional future enhancement (not required for Phase 3)
The *Status* command reports container liveness (process up), not Matrix connectivity — the bot
can be `running` yet disconnected from Synapse. A truer signal would need a Docker `HEALTHCHECK`
backed by a bot-side liveness signal (e.g. the bot touches a file or exposes a tiny endpoint on
each successful sync loop), after which Status could read `{{.State.Health.Status}}`. That's a
matrix-bridge-side change, out of scope here — flag it if/when "running but silent" actually bites.
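
For illustration only, the file-touch variant could be checked by a small script that a Docker `HEALTHCHECK` calls. Everything here is an assumption rather than existing matrix-bridge behavior: the bot would have to touch the heartbeat file on each successful sync, the path and age threshold are placeholders, and `stat -c %Y` assumes GNU coreutils inside the image.

```sh
# Sketch of a check a Docker HEALTHCHECK could call.
# $1 = heartbeat file (hypothetical path), $2 = max age in seconds
heartbeat_fresh() {
  [ -f "$1" ] || return 1                # no heartbeat yet: unhealthy
  mtime=$(stat -c %Y "$1") || return 1   # GNU stat: mtime as epoch seconds
  now=$(date +%s)
  [ $((now - mtime)) -le "$2" ]          # healthy only if touched recently
}

# Intended use (not executed here), e.g. as the HEALTHCHECK command:
#   heartbeat_fresh /tmp/matrix-bridge.heartbeat 120
```
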