redaction: \b after magnitude so amounts don't eat the next word (v0.1.0:57)

The currency-anchored amount regexes treated a single-letter magnitude
suffix (k/m/b) as optional but unbounded, so "$5,000,000 but" scrubbed to
"[AMOUNT_1]ut" — the 'b' of "but" was consumed as a 'billion' suffix. Add a
word boundary after _MAG on the three currency-anchored _AMOUNT_RES patterns
(range, symbol, ISO-code); the worded-amount pattern is unaffected. Money
still tokenizes in every case ($5m/$5b/$3-5M/USD 5,000,000); only the OUTBOUND
to-Claude text stops losing the leading letter of the following word.
Round-trips were already lossless.

Regression-locked by a round-5 section in test_scrub_leak.py; full redaction
suite (scrub_leak + reidentification + grounding_boundary) green. Packaged as
StartOS v0.1.0:57. Reported by the Spark gateway dev; gateway re-vendored
scrub.py verbatim for parity (same golden-file leak test gates both sides).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Keysat
2026-06-05 18:52:04 -05:00
parent 6d6f4bcc7e
commit c898ad8530
5 changed files with 43 additions and 7 deletions
+3 -3
View File
@@ -70,9 +70,9 @@ _NUMWORD = (r"(?:one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|
r"sixty|seventy|eighty|ninety|hundred|couple|few|several|half|a)")
_MAG = r"(?:mm|bn|tn|thousand|million|billion|trillion|k|m|b)" # longest-first so 'MM' isn't split into 'M'
_AMOUNT_RES = [
re.compile(r"[$€£]\s?\d[\d,. ]*\d?\s?-\s?[$€£]?\s?\d[\d,. ]*\d?(?:\s?" + _MAG + r")?", re.IGNORECASE), # $3-5M range
re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d+)?(?:\s?" + _MAG + r")?", re.IGNORECASE), # $5,000,000 / $5m
re.compile(r"\b(?:USD|EUR|GBP|CHF|CAD|AUD)\s?[$€£]?\s?\d[\d,]*(?:\.\d+)?(?:\s?" + _MAG + r")?", re.IGNORECASE),
re.compile(r"[$€£]\s?\d[\d,. ]*\d?\s?-\s?[$€£]?\s?\d[\d,. ]*\d?(?:\s?" + _MAG + r"\b)?", re.IGNORECASE), # $3-5M range
re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d+)?(?:\s?" + _MAG + r"\b)?", re.IGNORECASE), # $5,000,000 / $5m
re.compile(r"\b(?:USD|EUR|GBP|CHF|CAD|AUD)\s?[$€£]?\s?\d[\d,]*(?:\.\d+)?(?:\s?" + _MAG + r"\b)?", re.IGNORECASE),
re.compile(r"\b\d[\d,]*(?:\.\d+)?\s?(?:dollars?|euros?|pounds?)\b", re.IGNORECASE), # 5,000,000 dollars
re.compile(r"(?i)\b(?:" + _NUMWORD + r"[\s\-]+){1,4}" + _MAG + r"\s+(?:dollars?|euros?|pounds?)\b"), # five million dollars
]