Mark Phase 3 (Spark Control) done; trim spec to live command contract

Shipped in Spark Control v0.21.0: status badge + Update/Restart/Stop-Start/Logs
tile. All three exit criteria confirmed. matrix-bridge needed no code change.

- AGENTS.md: Current state + ROADMAP Phase 3 -> DONE; Deploy switched scp -> git
  pull (Update button); D10 stamped; new Infra fact for the Spark->Gitea path and
  the load-bearing IdentitiesOnly ssh-config pin the Update button depends on.
- spark-control-integration.md: trimmed from dev spec to live contract (dropped
  sudo -iu fallback and dev-side scaffolding; folded in direct-as-modelo, the
  Gitea key gotcha, restart cadence, and the LAN-only HTTP API).
- README: dropped stale "pre-Phase 0" status; Setup reframed for a fresh install.

Deferred follow-up: badge reflects container liveness only, not Matrix
connectivity; HEALTHCHECK + {{.State.Health.Status}} is the matrix-bridge-side fix.
Keysat
2026-06-15 23:19:30 -05:00
parent 843582ec03
commit 28c974fe1d
4 changed files with 123 additions and 139 deletions
+60 -107
@@ -1,16 +1,14 @@
# Phase 3 — Spark Control integration (spec for the Spark Control dev)
# Phase 3 — Spark Control integration (live command contract)
**Goal (ROADMAP Phase 3):** surface the matrix-bridge bot's container status on the Spark
Control dashboard, and add one-click **update** (pull + rebuild + restart) and **restart**,
wired the same SSH-behind-buttons way Spark Control already drives the Sparks.
**Status: DONE (2026-06-16), shipped in Spark Control v0.21.0.** The matrix-bridge bot has a
tile on the Spark Control dashboard under "Always-on services" — a live status badge plus
**Update**, **Restart**, **Stop/Start**, and **View logs** buttons. All three ROADMAP Phase 3
exit criteria are met (status visible + reflects the container; update works; restart works).
matrix-bridge needed no code change.
**Exit (falsifiable):** bot status is visible on the panel, and the bot can be
updated/restarted from the panel.
This document is the **contract**: what to run, where, and what the output means. The
matrix-bridge side is fixed below; map the buttons onto Spark Control's existing
managed-service pattern however that codebase already models a Spark/service. No changes to
matrix-bridge are required for this.
This document is the **contract**: what each control runs on the Spark, and what the output
means. Kept as the reference for what the buttons actually do — and to reproduce by hand if the
dashboard is ever unavailable.
---
@@ -21,41 +19,39 @@ A single Docker container on the DGX Spark.
| Fact | Value |
|---|---|
| Host | `spark-32d0` (`10.59.211.6` on WireGuard), user **`modelo`** |
| Project dir | `/home/modelo/matrix-bridge` (`~/matrix-bridge` for modelo) |
| Project dir | `/home/modelo/matrix-bridge` — a **Gitea clone tracking `master`** |
| Compose service | `bot` |
| Container name | `matrix-bridge` (fixed via `container_name:`) |
| Image | `matrix-bridge-bot` |
| Lifecycle | host networking, `restart: unless-stopped` (survives Spark reboot) |
| Secrets | `.env`, `config.toml`: **gitignored**, live only on the Spark, never in git |
Spark Control already SSHes into `spark-32d0`, so these ride the existing channel — **no new
key needed.** All commands below assume they run **as `modelo`** (owner of the dir, member of
the `docker` group). If Spark Control's channel connects as a different user, wrap each command
in `sudo -iu modelo bash -lc '<command>'` — running `git` in modelo's repo as root trips git's
"dubious ownership" guard, so don't skip this.
Spark Control SSHes into `spark-32d0` as **`modelo`** (the same login it already uses for Spark 2),
so these ride the existing channel — no new key, and **no `sudo` wrap**: this Spark has no
passwordless sudo, and since the channel is already `modelo` (owner of the dir, member of the
`docker` group) every command runs as the right user directly. (The original spec's
`sudo -iu modelo` different-user fallback therefore never applies here.)
Registration on the Spark Control side: the bot's SSH user is a config field (set to `modelo`),
the host reuses the existing Spark 2 connection, and container / dir / branch use the defaults
(`matrix-bridge` / `~/matrix-bridge` / `master`). The tile auto-hides when that user is blank or
the container is absent, so it stays out of the way on installs that don't run the bot.
---
## One-time prerequisites (owner, not Spark Control dev)
## One-time prerequisites — DONE
The bot dir on the Spark was originally populated by `scp` of loose files. To make
git-pull-based updates work it must become a git clone of the Gitea repo **without disturbing
the gitignored secrets** (`.env`, `config.toml`). Because those two files are gitignored,
`git reset --hard` never touches them — so we can convert the existing dir in place.
`~/matrix-bridge` was originally loose files from `scp`; it's now a git clone of the Gitea repo,
converted in place (the gitignored `.env`/`config.toml` were untouched, because `git reset --hard`
ignores them).
**0a. Confirm the Spark can reach + authenticate to Gitea (fail loud here, not at first button press):**
**Load-bearing gotcha that's now fixed:** on the Spark, git offered the wrong SSH key first and
Gitea rejected it (`Permission denied (publickey)`) even though the deploy key was correctly
registered. Fixed by pinning it in modelo's `~/.ssh/config` with `IdentitiesOnly yes` for the
Gitea host. **The Update button depends on that block staying in place — flag it if modelo's
account is ever rebuilt.**
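A quick way to confirm the pin is still in effect is to ask OpenSSH for the effective configuration it would use for the Gitea host. The sketch below is illustrative only: `has_identities_only` is a hypothetical helper name, and the host is the one used in the `git ls-remote` check shown in this diff.

```sh
# Hypothetical helper: read `ssh -G <host>` output on stdin and succeed only
# if the effective config contains "identitiesonly yes".
# (`ssh -G` prints the resolved options, one "key value" pair per line,
# with lowercase keys.)
has_identities_only() {
  grep -qi '^identitiesonly[[:space:]]\{1,\}yes$'
}

# Usage, run as modelo on the Spark (host assumed from the ls-remote check):
#   ssh -G immense-voyage.local | has_identities_only \
#     && echo "pin present" || echo "FIX: IdentitiesOnly pin missing"
```

Running this after any rebuild of modelo's account would catch a missing block before the first Update click does.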
```sh
git ls-remote ssh://git@immense-voyage.local:59916/grant/matrix-bridge.git >/dev/null \
&& echo "gitea reachable" || echo "FIX gitea access first"
```
The Spark is on the same LAN as the Start9 host running Gitea, so `immense-voyage.local`
resolves directly — this should just work. If it doesn't, the only likely gap is a key
authorized for read on the Gitea repo available to `modelo` (deploy key or existing key).
Don't proceed until `git ls-remote` succeeds.
**0b. Convert `~/matrix-bridge` to a clone tracking `master` (run as `modelo`):**
The conversion, for reference:
```sh
cd /home/modelo/matrix-bridge
@@ -66,23 +62,12 @@ git reset --hard origin/master # secrets are gitignored → untouched
git branch --set-upstream-to=origin/master master
```
Verify the secrets survived and the container still comes up clean:
```sh
ls -la /home/modelo/matrix-bridge/.env /home/modelo/matrix-bridge/config.toml # both present
git -C /home/modelo/matrix-bridge status # .env/config.toml show as ignored, tree clean
docker compose up -d --build && docker ps --filter name=^/matrix-bridge$
```
`master` is the release branch (today `master == phase-1`). Track whatever you treat as the
release line; the commands below assume `origin/master`.
---
## The contract — commands behind each control
Run from `/home/modelo/matrix-bridge` as `modelo`. Each is idempotent and fail-loud
(non-zero exit ⇒ surface it on the panel; don't swallow).
Run from `/home/modelo/matrix-bridge` as `modelo`. Each is idempotent and fail-loud: non-zero
exit + stderr is surfaced on the panel, not swallowed.
### Status (poll for the badge)
@@ -90,33 +75,28 @@ Run from `/home/modelo/matrix-bridge` as `modelo`. Each is idempotent and fail-l
docker inspect -f '{{.State.Status}}|{{.State.StartedAt}}|{{.RestartCount}}' matrix-bridge
```
- Output e.g. `running|2026-06-15T18:02:11.4Z|0`. Parse field 1 for the badge:
- `running` → green/up. Field 3 (`RestartCount`) climbing while status flips to
`restarting` ⇒ **crash loop**; show it, since that's the most useful signal a dashboard gives here.
- `exited` → stopped/crashed.
- `restarting` → unhealthy / boot-looping.
- **Non-zero exit** (`No such object: matrix-bridge`) ⇒ **not deployed** — distinct from
"stopped". Show that state rather than erroring out.
- `running` → up · `exited` → stopped/crashed · `restarting` → unhealthy/boot-looping ·
non-zero exit (`No such object: matrix-bridge`) → **not deployed** (tile hides). A climbing
`RestartCount` while status flips to `restarting` is the crash-loop tell.
- **Badge = container liveness only, not Matrix connectivity** — a bot that's `running` but
disconnected from Synapse still shows Healthy. See the HEALTHCHECK note below.
- *Cadence note:* a fast `docker restart` won't visibly flip the badge red — the panel re-checks
status only after the command returns, by which point the container is already back up. A full
`docker stop` turns it red within ~5s. Polling cadence, not a bug.
Friendlier one-liner for a human-readable badge (empty string when not running):
```sh
docker ps --filter name=^/matrix-bridge$ --format '{{.Status}}' # e.g. "Up 2 hours"
```
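To reproduce the badge logic by hand, the pipe-delimited Status output can be mapped to a state word in plain `sh`. This is a minimal sketch, not Spark Control's actual implementation: the function names and the output vocabulary (`up`, `stopped`, `boot-looping`, `not-deployed`) are assumptions chosen to mirror the states listed above.

```sh
# Hypothetical helper: map one line of Status output
# (State.Status|State.StartedAt|RestartCount) to "<badge> restarts=<n>".
badge_from_inspect() {
  # Split the single input line on '|' into three fields.
  IFS='|' read -r status started restarts <<EOF
$1
EOF
  case $status in
    running)    badge=up ;;
    exited)     badge=stopped ;;
    restarting) badge=boot-looping ;;
    *)          badge=unknown ;;
  esac
  printf '%s restarts=%s\n' "$badge" "${restarts:-0}"
}

# Hypothetical wrapper: a non-zero exit from `docker inspect`
# (e.g. "No such object") is reported as not-deployed rather than as an error.
badge_for_container() {
  if out=$(docker inspect -f '{{.State.Status}}|{{.State.StartedAt}}|{{.RestartCount}}' "$1" 2>/dev/null); then
    badge_from_inspect "$out"
  else
    echo "not-deployed"
  fi
}
```

For example, `badge_for_container matrix-bridge` on the Spark prints one line per call; watching `restarts=` climb across calls is the crash-loop tell.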
### Logs (optional "view logs" action — handy for diagnosing a red badge)
### Logs
```sh
docker logs --tail 100 matrix-bridge
```
### Restart (no code change)
### Restart
```sh
docker restart matrix-bridge
```
### Update (pull latest code + rebuild + recreate) — the headline button
### Update (pull + rebuild + recreate) — the headline button
```sh
cd /home/modelo/matrix-bridge \
@@ -125,15 +105,11 @@ cd /home/modelo/matrix-bridge \
&& docker compose up -d --build
```
- `git reset --hard origin/master` is the deploy-box "always match remote" semantic: never gets
stuck on divergence, and gitignored secrets are preserved. (If you'd rather detect divergence,
`git pull --ff-only` is the gentler alternative — but then a wedged tree needs manual help.)
- `docker compose up -d --build` rebuilds the image and recreates the container only if the
build changed. First build after a base-image bump is slow (minutes); subsequent builds hit
the layer cache. **Treat update as long-running**: stream/await output, set a generous
timeout (≥10 min), and don't block the dashboard on it.
`git reset --hard origin/master` is the deploy-box "always match remote" semantic: never stuck on
divergence, and gitignored secrets are preserved. Streamed live on the panel with a ~25-min
ceiling; non-zero exit + stderr surfaced. **Workflow: push to Gitea, then click Update.**
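When running a long step by hand, the same "time ceiling plus fail-loud" behaviour can be approximated with `timeout`. The sketch below assumes GNU coreutils `timeout` (which exits with 124 when the limit is hit); `run_with_ceiling` is a hypothetical name, and this is not how Spark Control itself implements the ceiling.

```sh
# Hypothetical wrapper: run a command under a time ceiling, let its own
# stdout/stderr pass through, and report any non-zero exit on stderr.
run_with_ceiling() {
  ceiling=$1
  shift
  timeout "$ceiling" "$@"
  rc=$?
  if [ "$rc" -eq 124 ]; then
    # 124 is the exit status GNU timeout uses when the limit was reached.
    echo "FAILED: exceeded ceiling of $ceiling" >&2
  elif [ "$rc" -ne 0 ]; then
    echo "FAILED: exit $rc" >&2
  fi
  return "$rc"
}

# Usage (the rebuild step alone, with the ~25-minute ceiling mentioned above):
#   run_with_ceiling 25m sh -c 'cd /home/modelo/matrix-bridge && docker compose up -d --build'
```

Because the wrapper returns the original exit status, it can be chained with `&&` without hiding failures.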
### Stop / Start (optional)
### Stop / Start
```sh
docker stop matrix-bridge # stop
@@ -142,45 +118,22 @@ cd /home/modelo/matrix-bridge && docker compose up -d # start (recreates
---
## Spark Control-side wiring (for the dev)
## Programmatic interface (LAN-only)
Map the above onto however Spark Control already registers a managed Spark/service:
The same controls are reachable over HTTP if scripting is ever wanted:
1. **Register `matrix-bridge`** as a managed service (a tile), targeting `spark-32d0` over the
existing SSH channel, commands run as `modelo`.
2. **Status badge** ← poll the *Status* command on the panel's normal refresh cadence; map the
four states above (running / exited / restarting / not-deployed) to your existing badge
vocabulary. Surface `RestartCount` if your tile can show a secondary metric — a climbing
count is the crash-loop tell.
3. **Buttons:** `Update`, `Restart` (required for the exit criterion); `Logs`, `Stop`/`Start`
(optional, nice-to-have).
4. **Fail-loud, surfaced.** Every command's non-zero exit + stderr must reach the panel, not a
silent failure — this mirrors matrix-bridge's own discipline (a bad launch reports back into
the room rather than hanging). Especially: a failed `git fetch` (Gitea unreachable) or a
failed build should show the error, not a stuck spinner.
5. **`Update` is long-running** — see the timeout/streaming note above.
What I deliberately left generic: the tile's exact place in Spark Control's code, its UI, and
its config schema — that's yours to fit to the existing pattern. If a precise drop-in matters,
share how a Spark is currently registered (config entry + the command-runner seam) and I'll
tailor steps 1-5 to it.
- `POST /api/matrix-bridge/update` → returns an id; `GET .../update/{id}` and
`.../update/{id}/stream` (SSE) for progress.
- `GET /api/matrix-bridge/logs?tail=N`
- status via `GET /api/services`
---
## Acceptance (maps to the ROADMAP exit)
## Future enhancement — truer status (not required; matrix-bridge-side)
- [ ] Status tile shows the bot's live state and flips correctly across a manual
`docker stop` / `docker start` on the Spark.
- [ ] `Restart` from the panel cycles the container (status returns to `running`).
- [ ] `Update` from the panel pulls a new commit, rebuilds, and recreates the container — and
surfaces a clear error if Gitea is unreachable or the build fails.
---
## Note — optional future enhancement (not required for Phase 3)
The *Status* command reports container liveness (process up), not Matrix connectivity — the bot
can be `running` yet disconnected from Synapse. A truer signal would need a Docker `HEALTHCHECK`
backed by a bot-side liveness signal (e.g. the bot touches a file or exposes a tiny endpoint on
each successful sync loop), after which Status could read `{{.State.Health.Status}}`. That's a
matrix-bridge-side change, out of scope here — flag it if/when "running but silent" actually bites.
Status reports container liveness, not Matrix connectivity — the bot can be `running` yet
disconnected from Synapse. A truer signal needs a Docker `HEALTHCHECK` backed by a bot-side
liveness signal (e.g. the bot touches a file or exposes a tiny endpoint on each successful sync
loop), after which Status could read `{{.State.Health.Status}}`. That's a matrix-bridge-side
change — do it if/when "running but silent" actually bites, then tell the Spark Control dev to
read the health field.
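As a rough illustration of the "bot touches a file" variant, a health probe only has to check that a heartbeat file was modified recently. Everything here is hypothetical: the bot does not write such a file today, and the function name, path, and two-minute threshold are assumptions.

```sh
# Hypothetical probe: succeed only if the heartbeat file exists and was
# modified within the last <max_age_min> minutes (default 2).
heartbeat_fresh() {
  file=$1
  max_age_min=${2:-2}
  # find prints the path only when the file exists and is newer than the cutoff,
  # so an empty result means "missing or stale".
  [ -n "$(find "$file" -mmin "-$max_age_min" 2>/dev/null)" ]
}

# Usage inside the container (path is an assumption):
#   heartbeat_fresh /tmp/matrix-bridge.heartbeat 2 || exit 1
```

A Dockerfile `HEALTHCHECK CMD` running a script like this would populate `{{.State.Health.Status}}`, which is the field the paragraph above suggests Status could read.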