matrix-bridge/docs/spark-control-integration.md
Keysat e5a751d4f4 Add Phase 3 Spark Control integration spec; mark Phase 2 done
docs/spark-control-integration.md: the SSH command contract (status via
docker inspect; restart via docker restart; update via git fetch + reset
--hard origin/master + docker compose up -d --build) plus the one-time
conversion of the Spark's ~/matrix-bridge to a Gitea clone. No bot code
change. Update source = git-pull-from-Gitea; rides Spark Control's existing
SSH into spark-32d0 (no new key). Corrected the infra note: Spark is on the
LAN with the Start9/Gitea host, so Spark->Gitea resolves directly.
2026-06-15 20:48:18 -05:00


Phase 3 — Spark Control integration (spec for the Spark Control dev)

Goal (ROADMAP Phase 3): surface the matrix-bridge bot's container status on the Spark Control dashboard, and add one-click update (pull + rebuild + restart) and restart, wired the same SSH-behind-buttons way Spark Control already drives the Sparks.

Exit (falsifiable): bot status is visible on the panel, and the bot can be updated/restarted from the panel.

This document is the contract: what to run, where, and what the output means. The matrix-bridge side is fixed below; map the buttons onto Spark Control's existing managed-service pattern however that codebase already models a Spark/service. No changes to matrix-bridge are required for this.


What the bot is

A single Docker container on the DGX Spark.

| Fact | Value |
| --- | --- |
| Host | spark-32d0 (10.59.211.6 on WireGuard), user modelo |
| Project dir | /home/modelo/matrix-bridge (~/matrix-bridge for modelo) |
| Compose service | bot |
| Container name | matrix-bridge (fixed via container_name:) |
| Image | matrix-bridge-bot |
| Lifecycle | host networking, restart: unless-stopped (survives Spark reboot) |
| Secrets | .env, config.toml: gitignored, live only on the Spark, never in git |

Spark Control already SSHes into spark-32d0, so these ride the existing channel — no new key needed. All commands below assume they run as modelo (owner of the dir, member of the docker group). If Spark Control's channel connects as a different user, wrap each command in sudo -iu modelo bash -lc '<command>' — running git in modelo's repo as root trips git's "dubious ownership" guard, so don't skip this.
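
The Status command further down contains single quotes of its own, so pasting it by hand between the quotes of bash -lc '<command>' would break the quoting. A minimal sketch of a wrapper that quotes any command safely; sq and as_modelo are hypothetical helper names, not part of Spark Control or matrix-bridge:

```shell
# Hypothetical helpers; only the "sudo -iu modelo bash -lc" form comes from this spec.

# sq: wrap a string in single quotes, escaping embedded single quotes as '\''.
sq() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# as_modelo: print a command line that runs "$1" as modelo in a login shell.
# It only prints the wrapped command; it does not execute it.
as_modelo() {
  printf 'sudo -iu modelo bash -lc %s\n' "$(sq "$1")"
}

# Example:
# as_modelo "docker inspect -f '{{.State.Status}}' matrix-bridge"
```

If the channel is plain ssh, the printed line can be handed over as a single argument, for example ssh spark-32d0 "$(as_modelo "$cmd")", so that the remote shell parses it exactly once. How Spark Control's command runner actually passes strings is not known here, so treat this as a starting point.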


One-time prerequisites (owner, not Spark Control dev)

The bot dir on the Spark was originally populated by scp of loose files. To make git-pull-based updates work it must become a git clone of the Gitea repo without disturbing the gitignored secrets (.env, config.toml). Because those two files are gitignored, git reset --hard never touches them — so we can convert the existing dir in place.

0a. Confirm the Spark can reach + authenticate to Gitea (fail loud here, not at first button press):

git ls-remote ssh://git@immense-voyage.local:59916/grant/matrix-bridge.git >/dev/null \
  && echo "gitea reachable" || echo "FIX gitea access first"

The Spark is on the same LAN as the Start9 host running Gitea, so immense-voyage.local resolves directly and this should just work. If it doesn't, the most likely gap is that modelo has no SSH key authorized to read the Gitea repo; add a deploy key or authorize an existing key. Don't proceed until git ls-remote succeeds.

0b. Convert ~/matrix-bridge to a clone tracking master (run as modelo):

cd /home/modelo/matrix-bridge
git init -b master
git remote add origin ssh://git@immense-voyage.local:59916/grant/matrix-bridge.git
git fetch origin
git reset --hard origin/master          # secrets are gitignored → untouched
git branch --set-upstream-to=origin/master master
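
The in-place conversion relies on .env and config.toml not being tracked in origin/master, because git reset --hard overwrites any working-tree file that the target commit tracks. A small pre-flight sketch, meant to run after git fetch origin and before the reset; tracks_secret is a hypothetical helper name:

```shell
# tracks_secret: read a list of tracked paths on stdin (one per line) and
# succeed (exit 0) if .env or config.toml at the repo root is among them.
tracks_secret() {
  grep -qxE '\.env|config\.toml'
}

# Intended use on the Spark, after `git fetch origin`:
# if git ls-tree -r --name-only origin/master | tracks_secret; then
#   echo "STOP: a secret file is tracked in origin/master" >&2
# fi
```

Copying the two files somewhere outside the project dir before the first reset is a cheap extra safeguard.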

Verify the secrets survived and the container still comes up clean:

ls -la /home/modelo/matrix-bridge/.env /home/modelo/matrix-bridge/config.toml   # both present
git -C /home/modelo/matrix-bridge status --ignored                              # .env/config.toml listed under "Ignored files"; nothing to commit
docker compose up -d --build && docker ps --filter name=^/matrix-bridge$

master is the release branch (today master == phase-1). Track whatever you treat as the release line; the commands below assume origin/master.


The contract — commands behind each control

Run from /home/modelo/matrix-bridge as modelo. Each is idempotent and fail-loud (non-zero exit ⇒ surface it on the panel; don't swallow).

Status (poll for the badge)

docker inspect -f '{{.State.Status}}|{{.State.StartedAt}}|{{.RestartCount}}' matrix-bridge
  • Output e.g. running|2026-06-15T18:02:11.4Z|0. Parse field 1 for the badge:
    • running → green/up. If field 3 (RestartCount) climbs while the status flips to restarting, that is a crash loop; show it, since that's the most useful signal a dashboard gives here.
    • exited → stopped/crashed.
    • restarting → unhealthy / boot-looping.
  • Non-zero exit (No such object: matrix-bridge) ⇒ not deployed — distinct from "stopped". Show that state rather than erroring out.
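
The parsing rules above can be sketched as a small shell function. The printed labels (up, stopped, crash-loop, not-deployed, unknown) and the function names are placeholders, not Spark Control's real badge vocabulary:

```shell
# badge_state: map a `docker inspect` result to a panel state.
#   $1 = exit code of the docker inspect command
#   $2 = its stdout, e.g. "running|2026-06-15T18:02:11.4Z|0"
badge_state() {
  rc=$1
  line=$2
  if [ "$rc" -ne 0 ]; then
    echo "not-deployed"
    return 0
  fi
  status=${line%%|*}          # field 1: text before the first "|"
  case "$status" in
    running)    echo "up" ;;
    exited)     echo "stopped" ;;
    restarting) echo "crash-loop" ;;
    *)          echo "unknown:$status" ;;   # e.g. created, paused, dead
  esac
}

# restart_count: print field 3 (RestartCount) of the same line.
restart_count() {
  echo "${1##*|}"
}
```

This sketch treats every non-zero exit as not-deployed, as the spec describes. A stricter version could also check stderr for "No such object" to tell a missing container apart from an unreachable Docker daemon.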

Friendlier one-liner for a human-readable badge (empty string when not running):

docker ps --filter name=^/matrix-bridge$ --format '{{.Status}}'   # e.g. "Up 2 hours"

Logs (optional "view logs" action — handy for diagnosing a red badge)

docker logs --tail 100 matrix-bridge

Restart (no code change)

docker restart matrix-bridge

Update (pull latest code + rebuild + recreate) — the headline button

cd /home/modelo/matrix-bridge \
  && git fetch origin \
  && git reset --hard origin/master \
  && docker compose up -d --build
  • git reset --hard origin/master is the deploy-box "always match remote" semantic: never gets stuck on divergence, and gitignored secrets are preserved. (If you'd rather detect divergence, git pull --ff-only is the gentler alternative — but then a wedged tree needs manual help.)
  • docker compose up -d --build rebuilds the image and recreates the container only if the image or the service configuration changed. The first build after a base-image bump is slow (minutes); subsequent builds hit the layer cache. Treat update as long-running: stream/await output, set a generous timeout (≥10 min), and don't block the dashboard on it.
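
To make a failed update say which step broke, the same sequence can be run one labeled step at a time. A minimal sketch; run_step and update_bot are hypothetical names, and the commands inside update_bot are the ones from this spec:

```shell
# run_step: run one labeled step; on failure, name the step on stderr and
# return the step's own exit code so the caller can surface it.
run_step() {
  label=$1
  shift
  echo "==> $label"
  rc=0
  "$@" || rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "FAILED: $label (exit $rc)" >&2
  fi
  return "$rc"
}

# update_bot: the update sequence from this spec, stopping at the first failure.
update_bot() {
  cd /home/modelo/matrix-bridge || return 1
  run_step "fetch"   git fetch origin               || return $?
  run_step "reset"   git reset --hard origin/master || return $?
  run_step "rebuild" docker compose up -d --build   || return $?
  echo "==> now at $(git rev-parse --short HEAD)"
}
```

The final line prints the deployed commit, which gives the panel something concrete to show after a successful update. On the calling side, GNU timeout (for example timeout 900 around the SSH invocation) is one way to enforce the generous limit mentioned above, assuming coreutils is available there.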

Stop / Start (optional)

docker stop matrix-bridge                                    # stop
cd /home/modelo/matrix-bridge && docker compose up -d        # start (recreates if needed)

Spark Control-side wiring (for the dev)

Map the above onto however Spark Control already registers a managed Spark/service:

  1. Register matrix-bridge as a managed service (a tile), targeting spark-32d0 over the existing SSH channel, commands run as modelo.
  2. Status badge ← poll the Status command on the panel's normal refresh cadence; map the four states above (running / exited / restarting / not-deployed) to your existing badge vocabulary. Surface RestartCount if your tile can show a secondary metric — a climbing count is the crash-loop tell.
  3. Buttons: Update, Restart (required for the exit criterion); Logs, Stop/Start (optional, nice-to-have).
  4. Fail-loud, surfaced. Every command's non-zero exit + stderr must reach the panel, not a silent failure — this mirrors matrix-bridge's own discipline (a bad launch reports back into the room rather than hanging). Especially: a failed git fetch (Gitea unreachable) or a failed build should show the error, not a stuck spinner.
  5. Update is long-running — see the timeout/streaming note above.
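
For the secondary metric in item 2, a climbing count is easiest to spot as the difference between two successive polls. A sketch under the assumption that the panel keeps the previous Status line around; restart_delta is a hypothetical name:

```shell
# restart_delta: print how much RestartCount grew between two polls.
#   $1 = previous status line, $2 = current status line
#   (both in the "status|startedAt|restartCount" form of the Status command)
# A recreated container starts counting from 0 again, so a negative
# difference is reported as 0 instead of a negative number.
restart_delta() {
  prev=${1##*|}
  curr=${2##*|}
  delta=$((curr - prev))
  if [ "$delta" -lt 0 ]; then
    delta=0
  fi
  echo "$delta"
}
```

A non-zero result on consecutive polls is the crash-loop tell; a single bump may just be one recovered crash.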

What I deliberately left generic: the tile's exact place in Spark Control's code, its UI, and its config schema. Those are yours to fit to the existing pattern. If a precise drop-in matters, share how a Spark is currently registered (config entry + the command-runner seam) and I'll tailor steps 1-5 to it.


Acceptance (maps to the ROADMAP exit)

  • Status tile shows the bot's live state and flips correctly across a manual docker stop / docker start on the Spark.
  • Restart from the panel cycles the container (status returns to running).
  • Update from the panel pulls a new commit, rebuilds, and recreates the container — and surfaces a clear error if Gitea is unreachable or the build fails.

Note — optional future enhancement (not required for Phase 3)

The Status command reports container liveness (process up), not Matrix connectivity — the bot can be running yet disconnected from Synapse. A truer signal would need a Docker HEALTHCHECK backed by a bot-side liveness signal (e.g. the bot touches a file or exposes a tiny endpoint on each successful sync loop), after which Status could read {{.State.Health.Status}}. That's a matrix-bridge-side change, out of scope here — flag it if/when "running but silent" actually bites.
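
If the file-touch variant were chosen, the probe behind a HEALTHCHECK could be as small as the sketch below. Everything here is an assumption: the bot does not write a heartbeat file today, and the path and the two-minute window are made up for illustration.

```shell
# heartbeat_fresh: succeed (exit 0) if file $1 exists and was modified less
# than $2 minutes ago; fail otherwise.
heartbeat_fresh() {
  [ -f "$1" ] || return 1
  [ -n "$(find "$1" -mmin "-$2")" ]
}

# Hypothetical probe call; the bot would have to touch this file on each
# successful sync loop for it to mean anything.
# heartbeat_fresh /tmp/matrix-bridge.heartbeat 2
```

Until a health check exists, {{.State.Health.Status}} has nothing to read; a template such as {{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}} is one way to keep the Status command working both before and after such a change.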