Add refactor-scout agent — read-only technical-debt surveyor for source code

The janitor's source-code sibling: surveys existing code (not a diff) for smells,
dead code, duplication, and over-complexity, prioritizes by churn × complexity, and
recommends a disposition (refactor/delete/defer/accept) per finding designed to feed
/triage. Test-net status and risk-to-change are first-class so a refactor is only
recommended when behavior preservation can be proven. Read-only; the risky auto-apply
half is deliberately deferred and gated. ROADMAP item 11.
This commit is contained in:
Keysat
2026-06-19 22:57:32 -05:00
parent 4533cc1d85
commit f258f9fd3c
4 changed files with 245 additions and 9 deletions
+46
View File
@@ -245,3 +245,49 @@ owner ratifies instead of researching.
reversible + test-covered" class, once the owner has watched it make calls and trusts the verdicts
(deliberately deferred — recommend-only first to build trust); (b) a thin `/triage`-then-`/adjudicate`
combo if the two-command chaining friction proves real (YAGNI for now).
## 11. `refactor-scout` — read-only technical-debt surveyor for source code ✅ BUILT (2026-06-19)
Built and live: `guides/refactor-scout.md` + `adapters/claude/agents/refactor-scout.md` (the
`refactor-scout` subagent). The **janitor's source-code sibling**: where the janitor spring-cleans
docs/artifacts, this surveys *code* debt. It answers "where is the code getting bulky and tangled,
and what's actually worth touching?" — the spring-cleaning the owner (a non-coder) wanted, scoped so
the value is captured with none of the risk.
**The insight that scoped it:** separate *seeing* debt (read-only, zero risk — almost all the value)
from *changing* it (where 100% of the risk lives). This agent builds only the seeing half. It surveys
*existing* code (not a diff — that's `reviewer`; not docs — that's `janitor`; not a whole-repo grade —
that's `evaluator`), and is deliberately **opinionated and tiered** (≤~12 highest value-to-risk
findings, never a 200-item dump that paralyzes).
- **Targeted, not uniform.** Prioritizes by **churn × complexity** (the proven hotspot heuristic) so
attention lands where debt actually hurts, then inspects those hotspots for smells (DUPLICATION,
LONG, COMPLEX, LARGE, DEAD, COUPLING, INCONSISTENT, MAGIC).
- **Tool-backed where possible.** Runs the repo's *own* analyzers read-only (knip/ts-prune, vulture,
deadcode, clippy, jscpd…) for high-confidence dead-code/duplication signals, kept separate from
reasoned judgment; **never installs anything** — a missing analyzer becomes a recommendation.
- **Behavior preservation is sacred + test-net is first-class.** Every finding carries risk-to-change
(blast-radius, unclear ⇒ HIGH) and test-net status. A REFACTOR is only ever recommended when a green
test can prove behavior is unchanged before/after; no coverage ⇒ "write a characterization test
first," never "just refactor."
- **The disposition loop (the deliverable's whole point).** Each finding is bucketed into
**REFACTOR** (worth it + safe + covered, annotated with the specific refactoring) · **DELETE** (dead,
tool-confirmed) · **DEFER** (worth it, gated on a test net → ROADMAP) · **ACCEPT** (real debt not
worth the risk — a legitimate choice). Buckets are designed to feed straight into `/triage`. The
non-coder approves on the *contract* ("tests green before/after, behavior unchanged, small diff,
reviewer signed off"), not by reading the diff.
**Deliberately NOT built — the auto-apply half (deferred + heavily gated).** Actually editing working
code is where the risk lives; for now the owner acts on findings manually, one at a time, via a normal
gated agent session (the apply path already exists — hand a finding to `reviewer`-backed work under the
test gate). A dedicated `/refactor-apply` command is a *future* item, justified only after the survey
proves its worth, and only behind: existing-or-generated characterization tests, LOW risk-to-change,
small diff, and human approval.
**Remaining options:** (a) fold a `refactor-scout` pass into `/full-eval` for code repos; (b) the gated
auto-apply command above, once the survey has earned trust; (c) once item 1's `/harden` exists, the
churn×complexity hotspot list pairs naturally with the per-stack linter baseline.
**Untested on a real repo** — the first run (recap or keysat) should confirm it stays tiered and honest
about test coverage rather than over-recommending refactors. Tune the guide's tiering cap / risk rules
if it over-produces.