9919fbf8f8
Two operators in a row hit the same crash-loop on upgrade:
    Error: running migrations
    Caused by:
        migration 9 was previously applied but has been modified
sqlx records a SHA-384 of each migration's bytes when first applied,
then verifies the on-disk bytes still match on every subsequent boot.
Cross-build drift (trailing newlines, line-ending normalization, etc.)
produces different bytes for semantically identical SQL, so the
checksum no longer matches and sqlx refuses to start. Recovery
required SSHing in and running:
    sqlite3 /data/keysat.db "DELETE FROM _sqlx_migrations WHERE version = 9;"
That's bad UX. Worse, every operator who upgrades through this
version range hits it once.
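A minimal sketch of why byte-level drift trips the check. Assumptions:
std's DefaultHasher stands in for the SHA-384 that sqlx records, and
`fingerprint` and `normalize` are hypothetical helpers used only for
illustration. Neither sqlx nor this commit normalizes migration
bytes; the point is only that two files with the same SQL can hash
differently.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in fingerprint over raw bytes. sqlx records SHA-384; DefaultHasher
/// is used here only so the sketch needs nothing beyond std.
fn fingerprint(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(bytes);
    hasher.finish()
}

/// Hypothetical normalization: unify line endings and drop trailing
/// whitespace, to show the two inputs carry the same SQL.
fn normalize(sql: &str) -> String {
    sql.replace("\r\n", "\n").trim_end().to_string()
}
```

Feeding `fingerprint` the same statement once with "\n" and once with
"\r\n" yields two different values, while the `normalize`d forms
agree. That is the situation the daemon saw on boot.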
Self-heal: db::init now wraps sqlx::migrate!().run() with detection
for MigrateError::VersionMismatch(N) on a constant allowlist of
migrations certified safe to re-run (IDEMPOTENT_MIGRATIONS, just [9]
for now). When triggered, the daemon clears the stale row, retries,
logs a WARN explaining what happened, and continues. No SSH dance.
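The control flow above can be modelled without a database. This is a
sketch under stated assumptions, not the code in db::init: the
`MigrateError` enum is a local stand-in for the sqlx variant, `Applied`
is an in-memory stand-in for the _sqlx_migrations table, checksums are
plain u64 values, and `run_migrations` / `init` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Local stand-in for the one sqlx MigrateError variant discussed here.
#[derive(Debug, PartialEq)]
enum MigrateError {
    VersionMismatch(i64),
}

/// Migrations certified safe to re-run (name and contents per the commit).
const IDEMPOTENT_MIGRATIONS: &[i64] = &[9];

/// In-memory model of _sqlx_migrations: version -> recorded checksum.
type Applied = BTreeMap<i64, u64>;

/// Model of the migrator: verify every recorded checksum first, then
/// record whatever has not been applied yet.
fn run_migrations(on_disk: &[(i64, u64)], applied: &mut Applied) -> Result<(), MigrateError> {
    for &(version, checksum) in on_disk {
        if let Some(&recorded) = applied.get(&version) {
            if recorded != checksum {
                return Err(MigrateError::VersionMismatch(version));
            }
        }
    }
    for &(version, checksum) in on_disk {
        applied.entry(version).or_insert(checksum);
    }
    Ok(())
}

/// Model of the self-healing wrapper: clear the stale row and retry once,
/// but only for versions on the allowlist. Anything else is passed through.
fn init(on_disk: &[(i64, u64)], applied: &mut Applied) -> Result<(), MigrateError> {
    match run_migrations(on_disk, applied) {
        Err(MigrateError::VersionMismatch(v)) if IDEMPOTENT_MIGRATIONS.contains(&v) => {
            eprintln!("WARN: checksum mismatch on idempotent migration {}; clearing row and retrying", v);
            applied.remove(&v);
            run_migrations(on_disk, applied)
        }
        other => other,
    }
}
```

With a poisoned row for version 9, `run_migrations` alone returns
VersionMismatch(9) while `init` succeeds and records both 9 and 10. A
poisoned row for version 10 is not on the allowlist, so `init` returns
the error and leaves the row untouched.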
The allowlist gate is critical: auto-clearing the checksum of an
additive ALTER TABLE migration like 0010 would make the retry fail,
because the column already exists and SQLite has no ADD COLUMN IF
NOT EXISTS. Only migrations explicitly designed as drop-and-rebuild
(like 0009) and tested via the `migration_NNNN_is_idempotent` pattern
in tests/migrations.rs qualify.
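A toy model of the distinction, to show why only drop-and-rebuild
migrations belong on the allowlist. Everything here is illustrative:
`Table`, `add_column`, `drop_and_rebuild`, and the column names are
hypothetical and say nothing about what 0009 or 0010 actually contain.

```rust
use std::collections::BTreeSet;

/// Toy schema: a single table tracked as a set of column names.
#[derive(Debug, Default)]
struct Table {
    columns: BTreeSet<String>,
}

/// Models an additive `ALTER TABLE ... ADD COLUMN`: it fails when the
/// column already exists, as SQLite does on a duplicate column name.
fn add_column(table: &mut Table, name: &str) -> Result<(), String> {
    if table.columns.insert(name.to_string()) {
        Ok(())
    } else {
        Err(format!("duplicate column name: {}", name))
    }
}

/// Models a drop-and-rebuild migration: the previous state is discarded
/// and the table is recreated, so running it again gives the same result.
fn drop_and_rebuild(table: &mut Table, columns: &[&str]) -> Result<(), String> {
    table.columns = columns.iter().map(|c| c.to_string()).collect();
    Ok(())
}
```

Calling `drop_and_rebuild` twice succeeds both times and leaves the
same columns. Calling `add_column` twice with the same name succeeds
once and then errors, which is the failure an ungated retry would hit.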
Regression test in tests/migrations.rs exactly simulates the
production incident:
1. apply all migrations cleanly
2. poison v9's recorded checksum with bogus bytes
3. confirm raw sqlx::migrate! bails (proves the poisoning works)
4. call db::init — must succeed by clearing + re-applying v9
5. confirm v9 + v10 are both recorded with non-poisoned checksums
Test count: 38 (was 37; +1 db_init_self_heals test).
For operators currently stuck on the :49 crash-loop: just upgrade
to :50 from the StartOS marketplace. The :50 daemon will see the
mismatch on first boot, auto-clear v9's row, re-apply (0009 is
idempotent by design), and continue to 0010. No manual sqlite3 needed.