Add Phase 3 Spark Control integration spec; mark Phase 2 done

docs/spark-control-integration.md: the SSH command contract (status via
docker inspect; restart via docker restart; update via git fetch + reset
--hard origin/master + docker compose up -d --build) plus the one-time
conversion of the Spark's ~/matrix-bridge to a Gitea clone. No bot code
change. Update source = git-pull-from-Gitea; rides Spark Control's existing
SSH into spark-32d0 (no new key). Corrected the infra note: Spark is on the
LAN with the Start9/Gitea host, so Spark->Gitea resolves directly.
Keysat
2026-06-15 20:48:18 -05:00
parent ee8408d182
commit e5a751d4f4
3 changed files with 219 additions and 15 deletions
+26 -12
@@ -70,6 +70,9 @@ the full answer back into the room (ask mode, D12).
- `AGENTS.md` — this file (canonical; `CLAUDE.md` is a relative symlink to it).
- `ROADMAP.md` — Phases 1–4+ with falsifiable exits, plus deferred/future directions.
- `README.md` — human-facing intro.
- `docs/spark-control-integration.md` — Phase 3 spec for the Spark Control dev: the SSH
command contract (status / restart / git-pull update) the dashboard drives, plus the one-time
conversion of the Spark's `~/matrix-bridge` to a Gitea clone. matrix-bridge needs no code change.
- `scripts/launch-claude.sh` — the Mac-side launch wrapper (the only seam that knows the
Mac's environment).
- `config.example.toml` — room→repo mapping template; the real `config.toml` is gitignored.
@@ -164,12 +167,16 @@ once" is not done.
## Infra facts (proven — stable reference)
- **WireGuard (`starttunnel`), not LAN:** Mac `10.59.211.5`; Spark (`spark-32d0`, user `modelo`)
`10.59.211.6`. The Spark is not on the Mac's LAN subnet.
- **WireGuard (`starttunnel`) for Mac↔Spark:** Mac `10.59.211.5`; Spark (`spark-32d0`, user `modelo`)
`10.59.211.6`. The Mac↔Spark seam runs over WireGuard (not the Mac's LAN subnet). The Spark *is*
on the LAN, same as the Start9 host (`immense-voyage`) — so Spark→Gitea (`immense-voyage.local:59916`)
resolves and works directly.
- **Spark → Mac:** SSH alias `mac-bridge` → the Mac as user `macpro`, dedicated key
(`~/.ssh/id_ed25519` on the Spark, in the Mac's `authorized_keys`). The Spark host's `~/.ssh/config` needs `IdentitiesOnly yes` because a
`Host *` rule shadows the default key; the container regenerates a clean config from `config.toml [mac]`.
- **Mac → Spark:** no authorized key — Spark-side ops (deploy/restart) are owner-run until Phase 3.
- **Mac → Spark:** no authorized key — direct Mac-initiated Spark ops stay owner-run. (This is *not*
what Phase 3 closes: Spark Control already has its own SSH channel into `spark-32d0`, so its
status/update/restart buttons ride that, not a Mac→Spark key.)
- **Matrix:** homeserver `https://matrix.gilliam.ai` (StartOS Synapse), bot `@agent:matrix.gilliam.ai`,
device `matrix-bridge-bot`. The bot reuses the stored access token (`.env`) — never re-logs in
(avoids device churn). No E2EE (D9); bot↔Synapse is clearnet TLS, softening D9's WireGuard-only rationale.
@@ -190,12 +197,19 @@ once" is not done.
drivable session on the phone via Remote Control.
- **Ask mode** (`?`-prefixed message): `ssh mac-bridge → ask-claude.sh → claude -p`, full answer posted
back into the room (chunked, no truncation). See D12.
- **Phase 2 (multi-room routing)** is effectively satisfied: the bot is built multi-room and routes by
`room_id`; only a formal N=3 confirmation pass remains.
- **Next — Phase 3 (deferred to next session by owner):** Spark Control integration — bot container
status + one-click update/restart on the dashboard; also closes the Mac-has-no-key-into-Spark gap.
- **Open / risks:** (a) a `?`-ask in a repo `claude` has never opened may stall on the folder-trust gate
— add a trust flag to `ask-claude.sh` if/when hit, not preemptively; (b) owner TODO: clean up the
accidental MacBook docker deploy (`docker compose down` + `docker image rm matrix-bridge-bot`).
- **Repo:** tree clean; `master` == `phase-1` == `8ad1cd8`, pushed to Gitea. No test suite (pre-existing);
this session's changes were syntax/unit-checked locally, fresh-eyes reviewed, and proven live.
- **Phase 2 (multi-room routing) — DONE.** Owner confirmed the N=3 pass: routes by `room_id`,
correct repo, zero wrong-directory launches.
- **Phase 3 (Spark Control integration): spec drafted, handed to the Spark Control dev (2026-06-15).**
See `docs/spark-control-integration.md`: the SSH command contract (status via `docker inspect`;
restart via `docker restart`; update via `git fetch && git reset --hard origin/master &&
docker compose up -d --build`) plus a one-time conversion of the Spark's `~/matrix-bridge` from
scp'd loose files to a Gitea clone (secrets are gitignored, so `reset --hard` preserves them).
Decisions this session: update source = git-pull-from-Gitea (not scp-from-Mac); Spark Control
already SSHes into `spark-32d0`, so no new key. **matrix-bridge needs no code change** — the work
is now Spark Control-side (status tile + buttons) + the one-time Spark migration. Awaiting the dev.
- **Open / risks:** a `?`-ask in a repo `claude` has never opened may stall on the folder-trust gate
— add a trust flag to `ask-claude.sh` if/when hit, not preemptively. (Resolved this session: the
accidental MacBook docker deploy was cleaned up by the owner.)
- **Repo:** `master` == `phase-1` == `ee8408d` pushed to Gitea; this session adds the Phase 3 spec
doc + these AGENTS.md edits on top (uncommitted — propose committing as the handoff). No test suite
(pre-existing); the doc is a spec, no code changed.
+7 -3
@@ -26,19 +26,23 @@ after it.
- **Exit (falsifiable):** 3 consecutive real messages each correctly launch a drivable
session on the phone.
## Phase 2 — Multi-room routing
## Phase 2 — Multi-room routing — DONE (2026-06-15)
- Room → repo mapping table; the bot routes by `room_id` (config over code).
- **Exit (falsifiable):** 3 real uses across ≥2 rooms, correct repo every time, zero
wrong-directory launches.
wrong-directory launches. *Met — owner-confirmed N=3 pass.*
## Phase 3 — Spark Control integration
## Phase 3 — Spark Control integration — SPEC DRAFTED (2026-06-15), awaiting Spark Control dev
- Bot container status surfaced on the Spark Control dashboard.
- One-click update (pull + restart) wired the same way Spark Control drives the Sparks today
(SSH/commands behind a button).
- **Exit (falsifiable):** bot status is visible and the bot can be updated/restarted from the
panel.
- **Spec:** `docs/spark-control-integration.md` — the SSH command contract + one-time Spark
migration to a Gitea clone. Decided: update = git-pull-from-Gitea; Spark Control's existing
SSH into `spark-32d0` carries the buttons (no new key). matrix-bridge needs no code change;
remaining work is Spark Control-side + the one-time migration.
## Phase 4+ — Future direction (documented, not yet scoped to build)
+186
@@ -0,0 +1,186 @@
# Phase 3 — Spark Control integration (spec for the Spark Control dev)
**Goal (ROADMAP Phase 3):** surface the matrix-bridge bot's container status on the Spark
Control dashboard, and add one-click **update** (pull + rebuild + restart) and **restart**,
wired the same SSH-behind-buttons way Spark Control already drives the Sparks.
**Exit (falsifiable):** bot status is visible on the panel, and the bot can be
updated/restarted from the panel.
This document is the **contract**: what to run, where, and what the output means. The
matrix-bridge side is fixed below; map the buttons onto Spark Control's existing
managed-service pattern however that codebase already models a Spark/service. No changes to
matrix-bridge are required for this.
---
## What the bot is
A single Docker container on the DGX Spark.
| Fact | Value |
|---|---|
| Host | `spark-32d0` (`10.59.211.6` on WireGuard), user **`modelo`** |
| Project dir | `/home/modelo/matrix-bridge` (`~/matrix-bridge` for modelo) |
| Compose service | `bot` |
| Container name | `matrix-bridge` (fixed via `container_name:`) |
| Image | `matrix-bridge-bot` |
| Lifecycle | host networking, `restart: unless-stopped` (survives Spark reboot) |
| Secrets | `.env`, `config.toml`: **gitignored**, live only on the Spark, never in git |
Spark Control already SSHes into `spark-32d0`, so these ride the existing channel — **no new
key needed.** All commands below assume they run **as `modelo`** (owner of the dir, member of
the `docker` group). If Spark Control's channel connects as a different user, wrap each command
in `sudo -iu modelo bash -lc '<command>'` — running `git` in modelo's repo as root trips git's
"dubious ownership" guard, so don't skip this.
---
## One-time prerequisites (owner, not Spark Control dev)
The bot dir on the Spark was originally populated by `scp` of loose files. To make
git-pull-based updates work it must become a git clone of the Gitea repo **without disturbing
the gitignored secrets** (`.env`, `config.toml`). Because those two files are gitignored,
`git reset --hard` never touches them — so we can convert the existing dir in place.
**0a. Confirm the Spark can reach + authenticate to Gitea (fail loud here, not at first button press):**
```sh
git ls-remote ssh://git@immense-voyage.local:59916/grant/matrix-bridge.git >/dev/null \
&& echo "gitea reachable" || echo "FIX gitea access first"
```
The Spark is on the same LAN as the Start9 host running Gitea, so `immense-voyage.local`
resolves directly and this should just work. If it doesn't, the likely gap is that `modelo`
has no SSH key authorized to read the Gitea repo (add a deploy key or reuse an existing key).
Don't proceed until `git ls-remote` succeeds.
**0b. Convert `~/matrix-bridge` to a clone tracking `master` (run as `modelo`):**
```sh
cd /home/modelo/matrix-bridge
git init -b master
git remote add origin ssh://git@immense-voyage.local:59916/grant/matrix-bridge.git
git fetch origin
git reset --hard origin/master # secrets are gitignored → untouched
git branch --set-upstream-to=origin/master master
```
Verify the secrets survived and the container still comes up clean:
```sh
ls -la /home/modelo/matrix-bridge/.env /home/modelo/matrix-bridge/config.toml # both present
git -C /home/modelo/matrix-bridge status # .env/config.toml show as ignored, tree clean
cd /home/modelo/matrix-bridge && docker compose up -d --build && docker ps --filter 'name=^/matrix-bridge$'
```
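The "both present" check above can be made fail-loud with a small helper. This is a minimal sketch, not part of the contract: `secrets_present` is a hypothetical name, and it only assumes the two secret files named in this spec (`.env`, `config.toml`) should exist and be non-empty.

```sh
# Hypothetical helper: confirm both gitignored secrets survived the conversion.
# Prints a confirmation on success; names the first missing/empty file on stderr
# and returns non-zero otherwise.
secrets_present() {
  dir=$1
  for f in .env config.toml; do
    if [ ! -s "$dir/$f" ]; then
      echo "MISSING or empty: $dir/$f" >&2
      return 1
    fi
  done
  echo "secrets present"
}
```

Usage on the Spark would be `secrets_present /home/modelo/matrix-bridge`, run right after the `git reset --hard` step and before rebuilding.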
`master` is the release branch (today `master == phase-1`). Track whatever you treat as the
release line; the commands below assume `origin/master`.
---
## The contract — commands behind each control
Run from `/home/modelo/matrix-bridge` as `modelo`. Each is idempotent and fail-loud
(non-zero exit ⇒ surface it on the panel; don't swallow).
### Status (poll for the badge)
```sh
docker inspect -f '{{.State.Status}}|{{.State.StartedAt}}|{{.RestartCount}}' matrix-bridge
```
- Output e.g. `running|2026-06-15T18:02:11.4Z|0`. Parse field 1 for the badge:
- `running` → green/up. Field 3 (`RestartCount`) climbing while status flips to
`restarting` ⇒ **crash loop**. Show it; that's the most useful signal a dashboard gives here.
- `exited` → stopped/crashed.
- `restarting` → unhealthy / boot-looping.
- **Non-zero exit** (`No such object: matrix-bridge`) ⇒ **not deployed** — distinct from
"stopped". Show that state rather than erroring out.
Friendlier one-liner for a human-readable badge (empty string when not running):
```sh
docker ps --filter 'name=^/matrix-bridge$' --format '{{.Status}}'   # e.g. "Up 2 hours"
```
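The parsing rules for the *Status* command can be sketched as a pair of shell functions. This is illustrative only: `badge_from_inspect` and `poll_badge` are hypothetical names, the badge words are placeholders for Spark Control's own vocabulary, and any status other than the three listed falls through to `unknown`.

```sh
# Hypothetical helper: map one line of Status output to a badge word.
# Input format (from the Status command): status|startedAt|restartCount
badge_from_inspect() {
  line=$1
  status=${line%%|*}      # field 1: everything before the first '|'
  restarts=${line##*|}    # field 3: everything after the last '|'
  case $status in
    running)    echo "up" ;;
    exited)     echo "stopped" ;;
    restarting) echo "crash-loop (restarts: $restarts)" ;;
    *)          echo "unknown: $status" ;;
  esac
}

# Poll once. Per the contract, a non-zero exit from docker inspect is
# reported as "not-deployed" rather than as an error.
poll_badge() {
  if out=$(docker inspect -f '{{.State.Status}}|{{.State.StartedAt}}|{{.RestartCount}}' matrix-bridge 2>/dev/null); then
    badge_from_inspect "$out"
  else
    echo "not-deployed"
  fi
}
```

Splitting the parser from the poll keeps the mapping testable without Docker; only `poll_badge` touches the daemon.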
### Logs (optional "view logs" action — handy for diagnosing a red badge)
```sh
docker logs --tail 100 matrix-bridge
```
### Restart (no code change)
```sh
docker restart matrix-bridge
```
### Update (pull latest code + rebuild + recreate) — the headline button
```sh
cd /home/modelo/matrix-bridge \
&& git fetch origin \
&& git reset --hard origin/master \
&& docker compose up -d --build
```
- `git reset --hard origin/master` is the deploy-box "always match remote" semantic: never gets
stuck on divergence, and gitignored secrets are preserved. (If you'd rather detect divergence,
`git pull --ff-only` is the gentler alternative — but then a wedged tree needs manual help.)
- `docker compose up -d --build` rebuilds the image and recreates the container only if the
build changed. First build after a base-image bump is slow (minutes); subsequent builds hit
the layer cache. **Treat update as long-running**: stream/await output, set a generous
timeout (≥10 min), and don't block the dashboard on it.
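One way to enforce the generous timeout is GNU `timeout`, which exits with status 124 when the limit is hit. The sketch below assumes `timeout` is available on the Spark (not confirmed in this spec); `describe_update_exit` is a hypothetical helper and its messages are placeholders for the panel's own wording.

```sh
# Hypothetical helper: translate the exit code of a timeout-wrapped update
# into a short message for the panel.
describe_update_exit() {
  case $1 in
    0)   echo "updated" ;;
    124) echo "timed out" ;;          # GNU timeout's exit status on expiry
    *)   echo "failed (exit $1)" ;;
  esac
}

# Example wiring with a 10-minute limit (commented out so that sourcing this
# file has no side effects):
# timeout 600 sh -c 'cd /home/modelo/matrix-bridge && git fetch origin \
#   && git reset --hard origin/master && docker compose up -d --build'
# describe_update_exit "$?"
```

A timed-out build can simply be retried: the Update command is idempotent, so a second press picks up from the fetched state.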
### Stop / Start (optional)
```sh
docker stop matrix-bridge # stop
cd /home/modelo/matrix-bridge && docker compose up -d # start (recreates if needed)
```
---
## Spark Control-side wiring (for the dev)
Map the above onto however Spark Control already registers a managed Spark/service:
1. **Register `matrix-bridge`** as a managed service (a tile), targeting `spark-32d0` over the
existing SSH channel, commands run as `modelo`.
2. **Status badge** ← poll the *Status* command on the panel's normal refresh cadence; map the
four states above (running / exited / restarting / not-deployed) to your existing badge
vocabulary. Surface `RestartCount` if your tile can show a secondary metric — a climbing
count is the crash-loop tell.
3. **Buttons:** `Update`, `Restart` (required for the exit criterion); `Logs`, `Stop`/`Start`
(optional, nice-to-have).
4. **Fail-loud, surfaced.** Every command's non-zero exit + stderr must reach the panel, not a
silent failure — this mirrors matrix-bridge's own discipline (a bad launch reports back into
the room rather than hanging). Especially: a failed `git fetch` (Gitea unreachable) or a
failed build should show the error, not a stuck spinner.
5. **`Update` is long-running** — see the timeout/streaming note above.
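The fail-loud rule in item 4 can be sketched as a step runner that names the failing step and preserves its exit code. This is a sketch under assumptions: `run_step` and `update_bot` are hypothetical names, and how stderr reaches the panel depends on Spark Control's command-runner seam, which this spec does not describe.

```sh
# Hypothetical helper: run one named step; on failure, report which step
# failed on stderr and return that step's exit code unchanged.
run_step() {
  step_name=$1
  shift
  if "$@"; then
    echo "ok: $step_name"
  else
    rc=$?
    echo "FAILED: $step_name (exit $rc)" >&2
    return "$rc"
  fi
}

# The Update command from the contract, with each stage labelled so a failed
# fetch (Gitea unreachable) is distinguishable from a failed build.
update_bot() {
  cd /home/modelo/matrix-bridge || return 1
  run_step "git fetch" git fetch origin &&
    run_step "git reset" git reset --hard origin/master &&
    run_step "compose up" docker compose up -d --build
}
```

Because the chain uses `&&`, the first failing step stops the update and its exit code becomes the function's exit code, so the panel sees the same non-zero status the contract requires.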
What I deliberately left generic: the tile's exact place in Spark Control's code, its UI, and
its config schema — that's yours to fit to the existing pattern. If a precise drop-in matters,
share how a Spark is currently registered (config entry + the command-runner seam) and I'll
tailor steps 1–5 to it.
---
## Acceptance (maps to the ROADMAP exit)
- [ ] Status tile shows the bot's live state and flips correctly across a manual
`docker stop` / `docker start` on the Spark.
- [ ] `Restart` from the panel cycles the container (status returns to `running`).
- [ ] `Update` from the panel pulls a new commit, rebuilds, and recreates the container — and
surfaces a clear error if Gitea is unreachable or the build fails.
---
## Note — optional future enhancement (not required for Phase 3)
The *Status* command reports container liveness (process up), not Matrix connectivity — the bot
can be `running` yet disconnected from Synapse. A truer signal would need a Docker `HEALTHCHECK`
backed by a bot-side liveness signal (e.g. the bot touches a file or exposes a tiny endpoint on
each successful sync loop), after which Status could read `{{.State.Health.Status}}`. That's a
matrix-bridge-side change, out of scope here — flag it if/when "running but silent" actually bites.