v0.1.0:50 — auto-recover from sqlx checksum drift on idempotent migrations

Two operators in a row hit the same crash-loop on upgrade:

  Error: running migrations
  Caused by:
      migration 9 was previously applied but has been modified

sqlx records a SHA-384 of each migration's bytes when first applied,
then verifies the on-disk bytes still match on every subsequent boot.
Cross-build drift (trailing newlines, line-ending normalization, etc.)
produces different bytes for semantically-identical SQL — and sqlx
refuses to start. Recovery required SSHing in and running:

  sqlite3 /data/keysat.db "DELETE FROM _sqlx_migrations WHERE version = 9;"

That's bad UX. Worse, every operator going through this version
range hits it once.
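
Why a stray newline is enough to trip the check can be sketched as
follows. This is a minimal illustration, not sqlx's code: it assumes a
hypothetical `byte_checksum` helper and uses the standard library's
`DefaultHasher` as a stand-in for SHA-384 so that no external crate is
needed. The point is only that the checksum covers raw bytes, not SQL
semantics.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the migration checksum. sqlx uses SHA-384;
/// DefaultHasher is NOT SHA-384 and is used here only to keep the sketch
/// dependency-free. Like the real checksum, it hashes the raw bytes.
fn byte_checksum(sql: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    sql.hash(&mut hasher);
    hasher.finish()
}

/// True when the bytes on disk no longer match what was recorded when the
/// migration was first applied.
fn has_drifted(recorded: u64, on_disk: &[u8]) -> bool {
    recorded != byte_checksum(on_disk)
}
```

Recording the checksum of a file ending in "\n" and then checking the
same SQL ending in "\r\n" reports drift, even though both files mean
the same thing to SQLite.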

Self-heal: db::init now wraps sqlx::migrate!().run() with detection
for MigrateError::VersionMismatch(N) on a constant allowlist of
migrations certified safe to re-run (IDEMPOTENT_MIGRATIONS, just [9]
for now). When triggered, the daemon clears the stale row, retries,
logs a WARN explaining what happened, and continues. No SSH dance.
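
The gating decision described above can be sketched in isolation. This
is a simplified model, not the daemon's code: `MigrationFailure` and
`Recovery` are hypothetical types standing in for sqlx's `MigrateError`
and for the two possible outcomes, and no database is involved.

```rust
/// Hypothetical, simplified stand-in for the relevant part of sqlx's
/// `MigrateError`.
#[derive(Debug, PartialEq)]
enum MigrationFailure {
    VersionMismatch(i64),
    Other(String),
}

/// What the wrapper decides to do with a failure (hypothetical type).
#[derive(Debug, PartialEq)]
enum Recovery {
    ClearRowAndRetry(i64),
    Propagate,
}

/// Versions certified safe to re-run, mirroring the allowlist in this commit.
const IDEMPOTENT_MIGRATIONS: &[i64] = &[9];

/// Only a checksum mismatch on an allowlisted version is healed; every
/// other failure is passed through unchanged.
fn plan_recovery(failure: &MigrationFailure) -> Recovery {
    match failure {
        MigrationFailure::VersionMismatch(v) if IDEMPOTENT_MIGRATIONS.contains(v) => {
            Recovery::ClearRowAndRetry(*v)
        }
        _ => Recovery::Propagate,
    }
}
```

A mismatch on version 9 maps to `ClearRowAndRetry(9)`; a mismatch on
version 10, or any other failure, maps to `Propagate`.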

The allowlist gate is critical: auto-clearing the recorded row for an
additive ALTER TABLE migration like 0010 would make the retry fail
with a duplicate-column error (SQLite has no ADD COLUMN IF NOT
EXISTS). Only migrations explicitly designed as drop-and-rebuild
(like 0009) and tested via the `migration_NNNN_is_idempotent` pattern
in tests/migrations.rs qualify.
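
The difference between the two kinds of migration can be shown with a
toy in-memory schema. This is an illustrative model only: `Schema`,
`add_column`, and `drop_and_rebuild` are hypothetical names, and the
model tracks column names but not row data (the real 0009 also stashes
and restores rows).

```rust
use std::collections::BTreeMap;

/// Toy schema (hypothetical): table name -> ordered column names.
type Schema = BTreeMap<String, Vec<String>>;

/// Models `ALTER TABLE <table> ADD COLUMN <column>`: fails when the
/// column already exists, as SQLite does on a second application.
fn add_column(schema: &mut Schema, table: &str, column: &str) -> Result<(), String> {
    let columns = schema
        .get_mut(table)
        .ok_or_else(|| format!("no such table: {table}"))?;
    if columns.iter().any(|c| c == column) {
        return Err(format!("duplicate column name: {column}"));
    }
    columns.push(column.to_string());
    Ok(())
}

/// Models a drop-and-rebuild migration: the end state depends only on the
/// target definition, never on what was there before.
fn drop_and_rebuild(schema: &mut Schema, table: &str, columns: &[&str]) {
    schema.remove(table);
    schema.insert(
        table.to_string(),
        columns.iter().map(|c| c.to_string()).collect(),
    );
}
```

Calling `drop_and_rebuild` twice leaves the schema identical to calling
it once, while the second `add_column` call with the same column
returns an error. That asymmetry is why only the former style belongs
on the allowlist.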

Regression test in tests/migrations.rs exactly simulates the
production incident:
  1. apply all migrations cleanly
  2. poison v9's recorded checksum with bogus bytes
  3. confirm raw sqlx::migrate! bails (proves the poisoning works)
  4. call db::init — must succeed by clearing + re-applying v9
  5. confirm v9 + v10 are both recorded with non-poisoned checksums
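
The five steps can be walked through against a toy ledger. This is a
sketch under stated assumptions, not the actual test in
tests/migrations.rs: `Ledger` stands in for the `_sqlx_migrations`
table, checksums are plain integers, and `run` and `init` are
hypothetical models of the raw migrator and of db::init.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for `_sqlx_migrations`: version -> recorded checksum.
type Ledger = BTreeMap<i64, u64>;

const IDEMPOTENT_MIGRATIONS: &[i64] = &[9];

/// Toy migrator. `source` maps version -> checksum of the on-disk file.
/// Returns Err(version) for the first recorded checksum that disagrees
/// with the source; otherwise records every version not yet applied.
fn run(ledger: &mut Ledger, source: &BTreeMap<i64, u64>) -> Result<(), i64> {
    for (version, checksum) in source {
        if let Some(recorded) = ledger.get(version) {
            if recorded != checksum {
                return Err(*version);
            }
        }
    }
    for (version, checksum) in source {
        ledger.entry(*version).or_insert(*checksum);
    }
    Ok(())
}

/// Toy `init`: on a mismatch for an allowlisted version, clear that row
/// and retry once. Any other error is returned unchanged.
fn init(ledger: &mut Ledger, source: &BTreeMap<i64, u64>) -> Result<(), i64> {
    match run(ledger, source) {
        Err(version) if IDEMPOTENT_MIGRATIONS.contains(&version) => {
            ledger.remove(&version);
            run(ledger, source)
        }
        other => other,
    }
}
```

In this model, applying cleanly, overwriting v9's recorded checksum,
and calling `run` yields `Err(9)`, while calling `init` on the same
poisoned ledger succeeds and leaves both v9 and v10 recorded with the
source checksums.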

Test count: 38 (was 37; +1 db_init_self_heals test).

For operators currently stuck on the :49 crash-loop: just upgrade
to :50 from the StartOS marketplace. The :50 daemon will see the
mismatch on first boot, auto-clear v9's row, re-apply (0009 is
idempotent by design), and continue to 0010. No manual sqlite3 needed.
Grant
2026-05-08 13:00:06 -05:00
parent 29be2405a8
commit 9919fbf8f8
2 changed files with 155 additions and 4 deletions
+76 -4
@@ -35,11 +35,83 @@ pub async fn init(path: &Path) -> Result<SqlitePool> {
         .await
         .with_context(|| format!("opening sqlite at {}", path.display()))?;

-    sqlx::migrate!("./migrations")
-        .run(&pool)
-        .await
-        .context("running migrations")?;
+    run_migrations_with_self_heal(&pool).await?;

     tracing::info!(path = %path.display(), "database ready");
     Ok(pool)
 }
+
+/// Migrations that have been certified safe to re-run from scratch. If
+/// sqlx complains about a checksum mismatch on one of these (which can
+/// happen when the file content shifts subtly between builds —
+/// trailing whitespace, line endings, build-host normalization), the
+/// daemon clears the row from `_sqlx_migrations` and retries instead
+/// of crash-looping.
+///
+/// Add a migration's version to this list ONLY when:
+///   - It's `CREATE TABLE IF NOT EXISTS` / `INSERT OR IGNORE` style
+///     OR a deliberate drop-and-rebuild that produces identical state
+///     regardless of starting point.
+///   - It does NOT include `ALTER TABLE ADD COLUMN` (that errors on
+///     re-apply — SQLite has no `ADD COLUMN IF NOT EXISTS`).
+///   - You've tested it via `migration_NNNN_is_idempotent` in
+///     `tests/migrations.rs`.
+const IDEMPOTENT_MIGRATIONS: &[i64] = &[
+    9, // see migrations/0009_discount_codes_set_price.sql — explicitly
+       // designed as a stash-drop-rebuild-restore that yields the same
+       // end state regardless of the starting state. Pinned by
+       // migration_0009_is_idempotent in tests/migrations.rs.
+];
+
+/// Run migrations with auto-recovery for the
+/// `MigrateError::VersionMismatch` case on idempotent migrations.
+///
+/// Why this exists: sqlx records a SHA-384 of each migration file's
+/// bytes when it's first applied, then verifies the on-disk bytes
+/// still match on every subsequent boot. The verification is too
+/// strict for our use case — a rebuild-from-clean-source can produce
+/// different bytes (trailing newlines, line endings, etc.) even when
+/// the SQL semantics are unchanged. Without this self-heal, every
+/// such drift requires the operator to SSH in and run
+/// `DELETE FROM _sqlx_migrations WHERE version = N` by hand.
+///
+/// The auto-clear is gated on `IDEMPOTENT_MIGRATIONS` so we only
+/// re-apply migrations we've explicitly certified as safe to re-run.
+/// Anything else still propagates the error and crashes the daemon —
+/// preventing accidental data corruption from re-running a destructive
+/// migration.
+async fn run_migrations_with_self_heal(pool: &SqlitePool) -> Result<()> {
+    use sqlx::migrate::MigrateError;
+    let migrator = sqlx::migrate!("./migrations");
+    match migrator.run(pool).await {
+        Ok(()) => Ok(()),
+        Err(MigrateError::VersionMismatch(version))
+            if IDEMPOTENT_MIGRATIONS.contains(&version) =>
+        {
+            tracing::warn!(
+                migration = version,
+                "migration {version} checksum mismatch on a known-idempotent migration; \
+                 clearing _sqlx_migrations row and retrying. This usually means the \
+                 migration file's bytes drifted subtly between builds (trailing \
+                 whitespace, line endings) without a semantic change."
+            );
+            sqlx::query("DELETE FROM _sqlx_migrations WHERE version = ?")
+                .bind(version)
+                .execute(pool)
+                .await
+                .with_context(|| {
+                    format!("clearing _sqlx_migrations row for self-heal of v{version}")
+                })?;
+            migrator
+                .run(pool)
+                .await
+                .with_context(|| format!("retry of migrations after self-heal of v{version}"))?;
+            tracing::info!(
+                migration = version,
+                "migration {version} re-applied successfully after checksum self-heal"
+            );
+            Ok(())
+        }
+        Err(e) => Err(e).context("running migrations"),
+    }
+}