redaction: \b after magnitude so amounts don't eat the next word (v0.1.0:57)

The currency-anchored amount regexes treated the single-letter magnitude
suffix (k/m/b) as optional but did not require a word boundary after it,
so "$5,000,000 but" scrubbed to "[AMOUNT_1]ut": the 'b' of "but" was
consumed as a 'billion' suffix. Add a word boundary after _MAG on the
three currency-anchored _AMOUNT_RES patterns (range, symbol, ISO-code);
the worded-amount pattern is unaffected. Money still tokenizes in every
case ($5m/$5b/$3-5M/USD 5,000,000); the only behavior change is that the
outbound (to-Claude) text no longer loses the leading letter of the
following word. Round-trips were already lossless.
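
A minimal sketch of the before/after behavior, assuming a simplified
symbol-anchored pattern. The names _NUM, OLD_RE, NEW_RE and scrub() are
hypothetical stand-ins; the real _MAG and _AMOUNT_RES definitions in
scrub.py are not shown in this commit and may differ.

```python
import re

# Hypothetical, simplified stand-in for the symbol-anchored amount pattern.
_NUM = r"\$\d[\d,]*(?:\.\d+)?"

# Before: an optional magnitude letter with nothing bounding it.
OLD_RE = re.compile(_NUM + r"(?:\s?[kmb])?", re.IGNORECASE)
# After: the magnitude letter must end at a word boundary.
NEW_RE = re.compile(_NUM + r"(?:\s?[kmb]\b)?", re.IGNORECASE)


def scrub(pattern, text):
    """Replace each match with a numbered [AMOUNT_n] token."""
    counter = 0

    def repl(_match):
        nonlocal counter
        counter += 1
        return f"[AMOUNT_{counter}]"

    return pattern.sub(repl, text)


print(scrub(OLD_RE, "$5,000,000 but"))  # the 'b' of "but" is consumed
print(scrub(NEW_RE, "$5,000,000 but"))  # the following word survives
print(scrub(NEW_RE, "$5m and $5b"))     # real suffixes still tokenize
```

In this sketch the boundary makes the optional group fail on " b" when a
word character follows, so the match backtracks to the bare number.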

Regression-locked by a round-5 section in test_scrub_leak.py; full redaction
suite (scrub_leak + reidentification + grounding_boundary) green. Packaged as
StartOS v0.1.0:57. Reported by the Spark gateway dev; gateway re-vendored
scrub.py verbatim for parity (same golden-file leak test gates both sides).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Keysat
2026-06-05 18:52:04 -05:00
parent 6d6f4bcc7e
commit c898ad8530
5 changed files with 43 additions and 7 deletions
+3 -2
@@ -17,8 +17,9 @@ import { v_0_1_0_53 } from './v0.1.0.53'
 import { v_0_1_0_54 } from './v0.1.0.54'
 import { v_0_1_0_55 } from './v0.1.0.55'
 import { v_0_1_0_56 } from './v0.1.0.56'
+import { v_0_1_0_57 } from './v0.1.0.57'
 export const versionGraph = VersionGraph.of({
-  current: v_0_1_0_56,
-  other: [v_0_1_0_39, v_0_1_0_40, v_0_1_0_41, v_0_1_0_42, v_0_1_0_43, v_0_1_0_44, v_0_1_0_45, v_0_1_0_46, v_0_1_0_47, v_0_1_0_48, v_0_1_0_49, v_0_1_0_50, v_0_1_0_51, v_0_1_0_52, v_0_1_0_53, v_0_1_0_54, v_0_1_0_55],
+  current: v_0_1_0_57,
+  other: [v_0_1_0_39, v_0_1_0_40, v_0_1_0_41, v_0_1_0_42, v_0_1_0_43, v_0_1_0_44, v_0_1_0_45, v_0_1_0_46, v_0_1_0_47, v_0_1_0_48, v_0_1_0_49, v_0_1_0_50, v_0_1_0_51, v_0_1_0_52, v_0_1_0_53, v_0_1_0_54, v_0_1_0_55, v_0_1_0_56],
 })
+18
@@ -0,0 +1,18 @@
import { VersionInfo } from '@start9labs/start-sdk'
// Redaction engine fix: the amount/magnitude regex no longer swallows the first
// letter of a following word. '$5,000,000 but' was scrubbing to '[AMOUNT_1]ut'
// because the single-letter 'b' (billion) suffix matched the 'b' of 'but'; a word
// boundary after the magnitude restores it. Round-trips were already lossless; this
// keeps the *outbound* (to-Claude) text from losing a word. No data migration.
export const v_0_1_0_57 = VersionInfo.of({
  version: '0.1.0:57',
  releaseNotes: {
    en_US: [
      'Redaction fix: a dollar amount immediately followed by a word (e.g. "$5,000,000',
      'but") no longer eats the first letter of that word when de-identifying text sent',
      'to the Architect. Real magnitude suffixes ($5m, $5b, $3-5M) still tokenize.',
    ].join(' '),
  },
  migrations: { up: async () => {}, down: async () => {} },
})