v0.1.0:50 - auto-recover from sqlx checksum drift on idempotent migrations

Two operators in a row hit the same crash-loop on upgrade:

    Error: running migrations
    Caused by:
        migration 9 was previously applied but has been modified

sqlx records a SHA-384 of each migration's bytes when first applied,
then verifies the on-disk bytes still match on every subsequent boot.
Cross-build drift (trailing newlines, line-ending normalization, etc.)
produces different bytes for semantically identical SQL, and sqlx
then refuses to start. Recovery required SSHing in and running:

    sqlite3 /data/keysat.db "DELETE FROM _sqlx_migrations WHERE version = 9;"

That's bad UX. Worse, every operator going through this version
range hits it once.

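To make the drift concrete, here is a minimal std-only sketch. `canonical_sql` is a hypothetical helper written for this illustration (it is not part of sqlx or of this commit), and the SQL string is made up. It only shows that a checksum over raw bytes sees different inputs for what is the same SQL after normalization.

```rust
/// Hypothetical helper (not part of sqlx or this commit): canonicalize
/// line endings and trailing whitespace so that byte-level drift
/// between builds shows up as "same SQL, different bytes".
fn canonical_sql(raw: &str) -> String {
    raw.replace("\r\n", "\n").trim_end().to_string()
}

fn main() {
    // The same (made-up) statement as three different builds might ship it.
    let unix = "DROP TABLE IF EXISTS demo;\n";
    let windows = "DROP TABLE IF EXISTS demo;\r\n";
    let no_newline = "DROP TABLE IF EXISTS demo;";

    // A checksum over raw bytes is fed three distinct inputs ...
    assert_ne!(unix.as_bytes(), windows.as_bytes());
    assert_ne!(unix.as_bytes(), no_newline.as_bytes());

    // ... even though the SQL is identical once normalized.
    assert_eq!(canonical_sql(unix), canonical_sql(windows));
    assert_eq!(canonical_sql(unix), canonical_sql(no_newline));
}
```

Any hash of the raw file contents, SHA-384 included, inherits this sensitivity, which is why a rebuild alone can trip the check.
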
Self-heal: db::init now wraps sqlx::migrate!().run() and detects
MigrateError::VersionMismatch(N) for any N on a constant allowlist of
migrations certified safe to re-run (IDEMPOTENT_MIGRATIONS, just [9]
for now). When triggered, the daemon clears the stale row, retries,
logs a WARN explaining what happened, and continues. No SSH dance.

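The control flow can be sketched without a database. Everything below is a std-only model under stated assumptions: the `MigrateError` enum is a local stand-in for the sqlx type, `migrate_with_self_heal` and its closure parameters are hypothetical names, and the "heal each version at most once" guard is a choice made for this sketch, not something the commit message specifies.

```rust
use std::cell::Cell;
use std::collections::BTreeSet;

/// Local stand-in for the sqlx error type; only the variant named in
/// the commit message is modelled faithfully.
#[derive(Debug, PartialEq)]
enum MigrateError {
    VersionMismatch(i64),
    Other(String),
}

/// Migrations certified safe to re-run (the message lists only 9).
const IDEMPOTENT_MIGRATIONS: &[i64] = &[9];

/// Run `migrate`; on a checksum mismatch for an allowlisted version,
/// call `clear_row(version)` and retry. Each version is healed at most
/// once (an assumption of this sketch) so a persistent mismatch
/// cannot loop forever.
fn migrate_with_self_heal(
    mut migrate: impl FnMut() -> Result<(), MigrateError>,
    mut clear_row: impl FnMut(i64),
) -> Result<(), MigrateError> {
    let mut healed = BTreeSet::new();
    loop {
        match migrate() {
            Err(MigrateError::VersionMismatch(v))
                if IDEMPOTENT_MIGRATIONS.contains(&v) && healed.insert(v) =>
            {
                eprintln!("WARN: checksum drift on migration {v}; clearing row and retrying");
                clear_row(v);
            }
            // Success, non-allowlisted mismatches, and all other errors
            // are returned to the caller unchanged.
            other => return other,
        }
    }
}

fn main() {
    // Fake database state: is the recorded checksum for v9 stale?
    let stale = Cell::new(true);
    let result = migrate_with_self_heal(
        || if stale.get() { Err(MigrateError::VersionMismatch(9)) } else { Ok(()) },
        |_v| stale.set(false),
    );
    assert_eq!(result, Ok(()));

    // A version that is not on the allowlist is never auto-cleared.
    let result = migrate_with_self_heal(|| Err(MigrateError::VersionMismatch(10)), |_| {});
    assert_eq!(result, Err(MigrateError::VersionMismatch(10)));
}
```

In the real daemon the two closures would correspond to running the migrator and deleting the row from `_sqlx_migrations`; the model only captures the gate and the retry.
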
The allowlist gate is critical: auto-clearing the checksum row of an
additive ALTER TABLE migration like 0010 would make the retry fail,
because the column already exists and SQLite has no ADD COLUMN IF NOT
EXISTS. Only migrations explicitly designed as drop-and-rebuild (like
0009) and tested via the `migration_NNNN_is_idempotent` pattern in
tests/migrations.rs qualify.

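A toy model of why the two migration styles differ on re-run. The `Table` type and its methods are hypothetical and do not touch SQLite; `price_currency` is the column name from 0010, the other column names are made up. `add_column` imitates the failure an additive ALTER TABLE hits when the column is already present.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for a table's column set (not real SQLite).
#[derive(Default)]
struct Table {
    columns: BTreeSet<String>,
}

impl Table {
    /// Models `ALTER TABLE ... ADD COLUMN`: fails if the column already
    /// exists, so running the same migration twice is an error.
    fn add_column(&mut self, name: &str) -> Result<(), String> {
        if self.columns.insert(name.to_string()) {
            Ok(())
        } else {
            Err(format!("duplicate column name: {name}"))
        }
    }

    /// Models a drop-and-rebuild migration: the end state does not
    /// depend on what was there before, so re-running is harmless.
    fn rebuild(&mut self, names: &[&str]) -> Result<(), String> {
        self.columns = names.iter().map(|n| n.to_string()).collect();
        Ok(())
    }
}

fn main() {
    let mut t = Table::default();

    // Drop-and-rebuild (0009-style): second run succeeds.
    assert!(t.rebuild(&["id", "price"]).is_ok());
    assert!(t.rebuild(&["id", "price"]).is_ok());

    // Additive (0010-style): second run fails.
    assert!(t.add_column("price_currency").is_ok());
    assert!(t.add_column("price_currency").is_err());
}
```

This is the property the allowlist encodes: only migrations whose second run succeeds may have their checksum row cleared automatically.
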
Regression test in tests/migrations.rs exactly simulates the
production incident:

  1. apply all migrations cleanly
  2. poison v9's recorded checksum with bogus bytes
  3. confirm raw sqlx::migrate! bails (proves the poisoning works)
  4. call db::init, which must succeed by clearing + re-applying v9
  5. confirm v9 + v10 are both recorded and v9's checksum is no
     longer the poisoned value

Test count: 38 (was 37; +1 db_init_self_heals test).

For operators currently stuck on the :49 crash-loop: just upgrade
to :50 from the StartOS marketplace. The :50 daemon will see the
mismatch on first boot, auto-clear v9's row, re-apply (0009 is
idempotent by design), and continue to 0010. No manual sqlite3 needed.

@@ -397,6 +397,85 @@ async fn migration_0009_is_idempotent() {
     assert_db_clean(&pool).await.expect("db clean after re-apply");
 }
 
+/// Regression for the v0.1.0:48 → :49 incident: the `_sqlx_migrations`
+/// table records a checksum for each applied migration; on every
+/// subsequent boot sqlx verifies the on-disk bytes still match.
+/// Builds across versions can produce subtly different bytes
+/// (trailing newlines, line-endings, build-host normalization) for
+/// the same semantic SQL, which makes sqlx refuse to start with
+/// "migration N was previously applied but has been modified" and
+/// crashes the daemon.
+///
+/// `db::init` works around this by detecting the
+/// `MigrateError::VersionMismatch` for migrations on the
+/// `IDEMPOTENT_MIGRATIONS` allowlist (just `9` for now), clearing the
+/// stale row, and retrying. This test simulates the exact scenario:
+/// poison the recorded checksum for v9, run init, expect success.
+#[tokio::test]
+async fn db_init_self_heals_checksum_mismatch_on_idempotent_migrations() {
+    let (pool, _tmp) = make_pool().await;
+
+    // Step 1: apply all migrations cleanly to populate
+    // _sqlx_migrations with current checksums.
+    sqlx::migrate!("./migrations")
+        .run(&pool)
+        .await
+        .expect("first apply");
+
+    // Step 2: poison the recorded checksum for v9. This simulates
+    // the cross-build drift that triggered the production incident.
+    let bogus_checksum: Vec<u8> = (0..48).map(|_| 0xEF).collect(); // Sha384 = 48 bytes
+    let n = sqlx::query("UPDATE _sqlx_migrations SET checksum = ? WHERE version = 9")
+        .bind(&bogus_checksum)
+        .execute(&pool)
+        .await
+        .unwrap()
+        .rows_affected();
+    assert_eq!(n, 1, "_sqlx_migrations should have a row for v9");
+
+    // Step 3: confirm sqlx::migrate! ALONE bails — proves the
+    // poisoning works and that without self-heal the daemon would
+    // crash here.
+    let ungated = sqlx::migrate!("./migrations").run(&pool).await;
+    assert!(
+        matches!(
+            ungated,
+            Err(sqlx::migrate::MigrateError::VersionMismatch(9))
+        ),
+        "raw sqlx::migrate! should reject the poisoned row: got {ungated:?}"
+    );
+
+    // Step 4: drop the existing pool and call db::init on the same
+    // file. The self-heal should clear v9's row, re-apply, succeed.
+    let tmp_path = _tmp.path().to_path_buf();
+    drop(pool);
+    drop(_tmp);
+    let healed = keysat::db::init(&tmp_path)
+        .await
+        .expect("db::init should self-heal the poisoned v9 row");
+
+    // Sanity check: v9 is back in _sqlx_migrations with a fresh
+    // (correct) checksum, and v10 is still there from the original
+    // apply.
+    let count: i64 =
+        sqlx::query_scalar("SELECT COUNT(*) FROM _sqlx_migrations WHERE version IN (9, 10)")
+            .fetch_one(&healed)
+            .await
+            .unwrap();
+    assert_eq!(count, 2, "both 9 and 10 should be recorded after self-heal");
+
+    // The poisoned checksum was replaced with the real one.
+    let new_checksum: Vec<u8> =
+        sqlx::query_scalar("SELECT checksum FROM _sqlx_migrations WHERE version = 9")
+            .fetch_one(&healed)
+            .await
+            .unwrap();
+    assert_ne!(
+        new_checksum, bogus_checksum,
+        "self-heal must replace the poisoned checksum with the current one"
+    );
+}
+
 /// Migration 0010 (multi-currency foundation): verifies that the
 /// backfill correctly populates the new `price_currency` and
 /// `price_value` columns against products that existed before the