v0.1.0:50 — auto-recover from sqlx checksum drift on idempotent migrations

Two operators in a row hit the same crash-loop on upgrade:

  Error: running migrations
  Caused by:
      migration 9 was previously applied but has been modified

sqlx records a SHA-384 of each migration's bytes when first applied,
then verifies the on-disk bytes still match on every subsequent boot.
Cross-build drift (trailing newlines, line-ending normalization, etc.)
produces different bytes for semantically-identical SQL — and sqlx
refuses to start. Recovery required SSHing in and running:

  sqlite3 /data/keysat.db "DELETE FROM _sqlx_migrations WHERE version = 9;"

That's bad UX. Worse, every operator going through this version
range hits it once.
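
A minimal sketch of why byte-level drift trips the check. sqlx uses
SHA-384; std's DefaultHasher is swapped in here only to keep the example
dependency-free, and `checksum`, `normalize`, and the sample SQL are
hypothetical names for illustration, not keysat or sqlx code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in checksum over the raw bytes (sqlx itself uses SHA-384).
fn checksum(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(bytes);
    hasher.finish()
}

/// Hypothetical normalization: unify line endings, drop trailing whitespace.
fn normalize(sql: &str) -> String {
    sql.replace("\r\n", "\n").trim_end().to_string()
}

fn main() {
    // Same statements, but one build emitted LF plus a trailing newline
    // and the other emitted CRLF with no trailing newline.
    let recorded = "DROP TABLE IF EXISTS t;\nCREATE TABLE t (id INTEGER);\n";
    let rebuilt = "DROP TABLE IF EXISTS t;\r\nCREATE TABLE t (id INTEGER);";

    // The SQL is equivalent once normalized...
    assert_eq!(normalize(recorded), normalize(rebuilt));
    // ...but a checksum over the raw bytes differs, which is all the
    // migration runner compares.
    assert_ne!(checksum(recorded.as_bytes()), checksum(rebuilt.as_bytes()));
}
```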

Self-heal: db::init now wraps sqlx::migrate!().run() and catches
MigrateError::VersionMismatch(N) when N is on a constant allowlist of
migrations certified safe to re-run (IDEMPOTENT_MIGRATIONS, just [9]
for now). When triggered, the daemon clears the stale row, retries,
logs a WARN explaining what happened, and continues. No SSH dance.
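
The control flow can be sketched roughly as follows. This is a std-only
model, not the actual db::init: the `MigrateError` enum, the in-memory
`Applied` map standing in for `_sqlx_migrations`, and the functions
`run_migrations` and `init_with_self_heal` are all hypothetical
stand-ins chosen for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for sqlx's error type; only the variant
/// discussed here is modelled.
#[derive(Debug, PartialEq)]
enum MigrateError {
    VersionMismatch(i64),
}

/// Migrations certified safe to re-run.
const IDEMPOTENT_MIGRATIONS: &[i64] = &[9];

/// In-memory model of `_sqlx_migrations`: version -> recorded checksum.
type Applied = BTreeMap<i64, u64>;

/// Simplified model of the migration runner: reject a recorded checksum
/// that differs from the on-disk one, apply anything not yet recorded.
fn run_migrations(applied: &mut Applied, on_disk: &[(i64, u64)]) -> Result<(), MigrateError> {
    for &(version, checksum) in on_disk {
        match applied.get(&version) {
            Some(&recorded) if recorded != checksum => {
                return Err(MigrateError::VersionMismatch(version));
            }
            Some(_) => {}
            None => {
                applied.insert(version, checksum);
            }
        }
    }
    Ok(())
}

/// Retry wrapper: clear the stale row only for allowlisted versions, and
/// heal each version at most once so the loop always terminates.
fn init_with_self_heal(
    applied: &mut Applied,
    on_disk: &[(i64, u64)],
) -> Result<Vec<i64>, MigrateError> {
    let mut healed = Vec::new();
    loop {
        match run_migrations(applied, on_disk) {
            Ok(()) => return Ok(healed),
            Err(MigrateError::VersionMismatch(v))
                if IDEMPOTENT_MIGRATIONS.contains(&v) && !healed.contains(&v) =>
            {
                eprintln!("WARN: checksum drift on idempotent migration {v}; clearing row and retrying");
                applied.remove(&v);
                healed.push(v);
            }
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    let on_disk: [(i64, u64); 2] = [(9, 0xAAAA), (10, 0xBBBB)];

    // Drift on v9 (allowlisted): healed and re-recorded.
    let mut applied = Applied::from([(9_i64, 0xDEAD_u64), (10, 0xBBBB)]);
    assert_eq!(init_with_self_heal(&mut applied, &on_disk), Ok(vec![9]));
    assert_eq!(applied.get(&9_i64), Some(&0xAAAA_u64));

    // Drift on v10 (not allowlisted): the error is propagated untouched.
    let mut applied = Applied::from([(9_i64, 0xAAAA_u64), (10, 0xDEAD)]);
    assert_eq!(
        init_with_self_heal(&mut applied, &on_disk),
        Err(MigrateError::VersionMismatch(10))
    );
}
```

Healing each version at most once is a design choice of this sketch: if
the retry hits the same mismatch again, the error surfaces instead of
looping forever.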

The allowlist gate is critical: auto-clearing the checksum row for an
additive ALTER TABLE migration like 0010 would make the retry fail with
a duplicate-column error (SQLite has no ADD COLUMN IF NOT EXISTS). Only
migrations explicitly designed as drop-and-rebuild (like 0009) and
tested via the `migration_NNNN_is_idempotent` pattern in
tests/migrations.rs qualify.
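
To make the distinction concrete, here is a toy model that treats a
table as a set of column names. `Columns`, `add_column`, `rebuild`, and
the `label` column are hypothetical; the sketch only mimics SQLite's
behaviour of rejecting a duplicate ADD COLUMN and is not the actual
migration SQL.

```rust
use std::collections::BTreeSet;

/// Toy model of a table: just its column names.
type Columns = BTreeSet<&'static str>;

/// Models an additive `ALTER TABLE ... ADD COLUMN`: fails if the column
/// already exists, as SQLite does.
fn add_column(cols: &mut Columns, name: &'static str) -> Result<(), String> {
    if cols.insert(name) {
        Ok(())
    } else {
        Err(format!("duplicate column name: {name}"))
    }
}

/// Models a drop-and-rebuild migration: discard whatever is there and
/// recreate the table from scratch, so running it twice is harmless.
fn rebuild(cols: &mut Columns, wanted: &[&'static str]) {
    cols.clear();
    cols.extend(wanted.iter().copied());
}

fn main() {
    // Additive migration: the first apply succeeds, a forced re-run fails.
    let mut products = Columns::from(["id", "price"]);
    assert!(add_column(&mut products, "price_currency").is_ok());
    assert!(add_column(&mut products, "price_currency").is_err());

    // Drop-and-rebuild: a second run leaves the same schema behind.
    let mut rebuilt = Columns::new();
    rebuild(&mut rebuilt, &["id", "label"]);
    rebuild(&mut rebuilt, &["id", "label"]);
    assert_eq!(rebuilt.len(), 2);
}
```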

Regression test in tests/migrations.rs reproduces the production
incident:
  1. apply all migrations cleanly
  2. poison v9's recorded checksum with bogus bytes
  3. confirm raw sqlx::migrate! bails (proves the poisoning works)
  4. call db::init, which must succeed by clearing + re-applying v9
  5. confirm v9 + v10 are both recorded and v9's checksum is no longer
     the poisoned value

Test count: 38 (was 37; +1 db_init_self_heals test).

For operators currently stuck on the :49 crash-loop: just upgrade
to :50 from the StartOS marketplace. The :50 daemon will see the
mismatch on first boot, auto-clear v9's row, re-apply (0009 is
idempotent by design), and continue to 0010. No manual sqlite3 needed.
Grant
2026-05-08 13:00:06 -05:00
parent 29be2405a8
commit 9919fbf8f8
2 changed files with 155 additions and 4 deletions
@@ -397,6 +397,85 @@ async fn migration_0009_is_idempotent() {
    assert_db_clean(&pool).await.expect("db clean after re-apply");
}
/// Regression for the v0.1.0:48 → :49 incident: the `_sqlx_migrations`
/// table records a checksum for each applied migration; on every
/// subsequent boot sqlx verifies the on-disk bytes still match.
/// Builds across versions can produce subtly different bytes
/// (trailing newlines, line-endings, build-host normalization) for
/// the same semantic SQL, which makes sqlx refuse to start with
/// "migration N was previously applied but has been modified" and
/// crashes the daemon.
///
/// `db::init` works around this by detecting the
/// `MigrateError::VersionMismatch` for migrations on the
/// `IDEMPOTENT_MIGRATIONS` allowlist (just `9` for now), clearing the
/// stale row, and retrying. This test simulates the exact scenario:
/// poison the recorded checksum for v9, run init, expect success.
#[tokio::test]
async fn db_init_self_heals_checksum_mismatch_on_idempotent_migrations() {
    let (pool, _tmp) = make_pool().await;

    // Step 1: apply all migrations cleanly to populate
    // _sqlx_migrations with current checksums.
    sqlx::migrate!("./migrations")
        .run(&pool)
        .await
        .expect("first apply");
    // Step 2: poison the recorded checksum for v9. This simulates
    // the cross-build drift that triggered the production incident.
    let bogus_checksum: Vec<u8> = vec![0xEF; 48]; // SHA-384 digest = 48 bytes
    let n = sqlx::query("UPDATE _sqlx_migrations SET checksum = ? WHERE version = 9")
        .bind(&bogus_checksum)
        .execute(&pool)
        .await
        .unwrap()
        .rows_affected();
    assert_eq!(n, 1, "_sqlx_migrations should have a row for v9");
    // Step 3: confirm sqlx::migrate! ALONE bails. This proves the
    // poisoning works and that without self-heal the daemon would
    // crash here.
    let ungated = sqlx::migrate!("./migrations").run(&pool).await;
    assert!(
        matches!(
            ungated,
            Err(sqlx::migrate::MigrateError::VersionMismatch(9))
        ),
        "raw sqlx::migrate! should reject the poisoned row: got {ungated:?}"
    );
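    // Step 4: drop the existing pool and call db::init on the same
    // file. The self-heal should clear v9's row, re-apply, succeed.
    // `_tmp` is deliberately kept alive until the end of the test:
    // assuming it is a temp-file guard that deletes on drop, dropping
    // it here would let db::init create a fresh database and pass
    // without exercising the self-heal path.
    let tmp_path = _tmp.path().to_path_buf();
    drop(pool);
    let healed = keysat::db::init(&tmp_path)
        .await
        .expect("db::init should self-heal the poisoned v9 row");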
    // Sanity check: v9 is back in _sqlx_migrations with a fresh
    // (correct) checksum, and v10 is still there from the original
    // apply.
    let count: i64 =
        sqlx::query_scalar("SELECT COUNT(*) FROM _sqlx_migrations WHERE version IN (9, 10)")
            .fetch_one(&healed)
            .await
            .unwrap();
    assert_eq!(count, 2, "both 9 and 10 should be recorded after self-heal");

    // The poisoned checksum was replaced with the real one.
    let new_checksum: Vec<u8> =
        sqlx::query_scalar("SELECT checksum FROM _sqlx_migrations WHERE version = 9")
            .fetch_one(&healed)
            .await
            .unwrap();
    assert_ne!(
        new_checksum, bogus_checksum,
        "self-heal must replace the poisoned checksum with the current one"
    );
}
/// Migration 0010 (multi-currency foundation): verifies that the
/// backfill correctly populates the new `price_currency` and
/// `price_value` columns against products that existed before the