redaction: \b after magnitude so amounts don't eat the next word (v0.1.0:57)
The currency-anchored amount regexes treated a single-letter magnitude suffix (k/m/b) as optional but unbounded, so "$5,000,000 but" scrubbed to "[AMOUNT_1]ut": the 'b' of "but" was consumed as a 'billion' suffix.

Add a word boundary after _MAG on the three currency-anchored _AMOUNT_RES patterns (range, symbol, ISO-code); the worded-amount pattern is unaffected. Money still tokenizes in every case ($5m/$5b/$3-5M/USD 5,000,000); only the OUTBOUND to-Claude text stops losing the leading letter of the following word. Round-trips were already lossless.

Regression-locked by a round-5 section in test_scrub_leak.py; full redaction suite (scrub_leak + reidentification + grounding_boundary) green. Packaged as StartOS v0.1.0:57.

Reported by the Spark gateway dev; gateway re-vendored scrub.py verbatim for parity (same golden-file leak test gates both sides).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
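The behaviour described above can be reproduced with a minimal sketch. The patterns below are simplified, hypothetical stand-ins for the symbol-anchored entry in _AMOUNT_RES (the real _MAG and _AMOUNT_RES definitions in scrub.py are not shown in this commit view and may differ), and the `scrub` helper is invented for illustration only.

```python
import re

# Hypothetical, simplified stand-ins; the real _MAG / _AMOUNT_RES may differ.
NUM = r"\d[\d,]*(?:\.\d+)?"   # 5 / 5,000,000 / 2.5
MAG = r"(?:\s?[kmb])?"        # optional single-letter magnitude suffix

# Before the fix: nothing bounds the suffix, so " b" in " but" is consumed.
UNBOUNDED = re.compile(rf"\${NUM}{MAG}", re.IGNORECASE)

# After the fix: \b after the magnitude forces the match to end at a word
# boundary, so a letter that starts a longer word is not taken as a suffix.
BOUNDED = re.compile(rf"\${NUM}{MAG}\b", re.IGNORECASE)


def scrub(text: str, pattern: re.Pattern) -> str:
    """Replace each match with a numbered [AMOUNT_n] token (illustrative only)."""
    counter = 0

    def repl(_match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"[AMOUNT_{counter}]"

    return pattern.sub(repl, text)


if __name__ == "__main__":
    print(scrub("$5,000,000 but he hesitates", UNBOUNDED))  # [AMOUNT_1]ut he hesitates
    print(scrub("$5,000,000 but he hesitates", BOUNDED))    # [AMOUNT_1] but he hesitates
    print(scrub("raised $5m but later", BOUNDED))           # raised [AMOUNT_1] but later
```

With the boundary in place, the regex engine first tries to take " b" as a suffix, fails the \b check between 'b' and 'u', and backtracks to the bare number, where the boundary between the last digit and the space succeeds.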
@@ -157,6 +157,22 @@ def main():
     check("Reyes" not in out(zw, {"persons": ["Reyes"]}) and "Reyes" not in out(zw, {"persons": ["Reyes"]}),
           "zero-width-split known name does not leak")
 
+    print("\n[round-5 — magnitude suffix must not eat a following word]")
+    # A single-letter magnitude (k/m/b) immediately before a real word must NOT be
+    # consumed as a suffix: '$5,000,000 but' -> the 'b' of 'but' was being eaten,
+    # yielding '[AMOUNT_1]ut'. A \b after the magnitude fixes it. Money still vanishes,
+    # the following word survives intact, and legitimate suffixes still tokenize.
+    for raw, word in [("$5,000,000 but he hesitates", "but he hesitates"),
+                      ("committed $250,000 because timing", "because timing"),
+                      ("USD 5,000,000 but capped", "but capped"),
+                      ("between $3-5M but capped", "but capped")]:
+        o = out(raw)
+        check("[AMOUNT_1]ut" not in o and "[AMOUNT_1]ecause" not in o, f"magnitude does not bleed into next word: {raw!r}")
+        check(word in o, f"following word survives intact: {word!r}")
+        check("$" not in o and "USD 5" not in o, f"amount still tokenized: {raw!r}")
+    check(out("raised $5m but later") == "raised [AMOUNT_1] but later", "real 'm' suffix still tokenizes ($5m)")
+    check(out("about $5b in assets") == "about [AMOUNT_1] in assets", "real 'b' suffix still tokenizes ($5b)")
+
     conn.close()
     print()
     if FAILS: