Files
standards/subagents-handbook.md
T
Keysat 5f7a51199c Add doc-auditor agent and wire it into full-eval
Read-only documentation drift auditor: checks every README/instruction/HTML claim against the code and reports what no longer matches, modeled on the janitor/reviewer wrappers. Added to the full-eval suite (always-run) and to the subagents handbook roster and length budgets.
2026-06-13 00:06:43 -05:00

338 lines
19 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Subagents Handbook
**Purpose:** reference for designing, using, and creating Claude Code subagents.
**Lives at:** `~/Projects/standards/subagents-handbook.md`, sibling of `portability.md`
and `retrofit-playbook.md`. The working parts follow the portability protocol: neutral
substance in `guides/<agent>.md` (one operating guide per agent, plus
`guides/full-eval.md` for the orchestrator); thin Claude wrappers in
`adapters/claude/agents/` and `adapters/claude/commands/`, reached via the standard
directory symlinks from `~/.claude/agents` and `~/.claude/commands`.
Verified against the official docs (June 2026): https://code.claude.com/docs/en/sub-agents
---
## 1. What a subagent is
A subagent is a helper Claude that runs in its **own context window**, does a chunk of
work, and hands back **only its final report**. Your file hierarchy controls what enters
context at session start; subagents control what stays out of context while work happens.
The trigger is always the same shape: *the task involves lots of reading or noisy output,
but I only need the conclusion.*
### Mechanics (the facts that matter)
- **Format.** A Markdown file with YAML frontmatter. Frontmatter = config; body = the
subagent's entire system prompt. Only `name` and `description` are required. Useful
optional fields: `tools` (allowlist), `disallowedTools`, `model` (`haiku`/`sonnet`/
`opus`/`inherit`), `effort` (`low`/`medium`/`high`/`xhigh`/`max` — reasoning depth,
default high), `maxTurns`, `memory`, `background`.
- **Wrapper/guide split (this setup).** Per the portability protocol, each definition
here is a *thin wrapper*: vendor syntax confined to the frontmatter, and a body that
does exactly three things — direct the agent to read its operating guide at
`~/Projects/standards/guides/<name>.md` before acting, give a fail-safe ("can't read
the guide → stop and say so, don't improvise"), and state a one-line safety floor
(read-only, etc.) that holds even guideless. The guide carries all substance —
mission, procedure, hard rules, report contract — as plain prose any agent could
follow; adopting another provider is a header swap.
- **Locations & precedence.** `.claude/agents/` (project) beats `~/.claude/agents/`
(user, all projects). Both are scanned recursively, so symlinked subfolders work.
- **Loading.** Subagent files (the wrappers) are read **at session start**. Add or edit
a wrapper on disk → restart the session. (Agents made via `/agents` load immediately.)
Guides are different: the agent reads its guide **at run time**, so guide edits take
effect on the very next run, no restart.
- **What the subagent sees.** Its own body, the delegation prompt the main agent writes,
your full CLAUDE.md/AGENTS.md hierarchy, and a git-status snapshot. It does **not** see
your conversation. The delegation prompt is the *only* channel in — paths, symptoms,
and scope must be stated there.
- **What comes back.** The subagent's final message, verbatim. Everything else it read
or did dies with its context window.
- **Built-ins.** `Explore` (Haiku, read-only — fast codebase search), `Plan`, and
`general-purpose` already exist. Don't rebuild them.
- **No nesting.** Subagents cannot spawn subagents. Fan-out happens from the main
thread (see §6, the orchestrator).
- **Foreground vs background.** Foreground blocks and passes permission prompts to you.
Background runs concurrently but **auto-denies** anything that would prompt — so
agents needing Bash approval (exerciser, auditor) should run foreground the first time.
- **Invocation.** Auto-delegation via the `description` field; by name in prose
("use the reviewer subagent"); or guaranteed via `@agent-<name>`.
- **Cost.** Every subagent is a fresh model run reading files from scratch. Heavy
fan-outs can burn several times the tokens of a single thread. Spend it where the
context savings or fresh eyes are worth it.
---
## 2. The catalog
Four properties a separate context window gives you. Every useful agent exploits at
least one.
| Property | What it buys you | Agents |
|---|---|---|
| **Compression** | Huge input → small conclusion | explorer, test-runner, researcher, docs-reader |
| **Independence** | Fresh eyes, unanchored by your session | reviewer, fresh-debugger, spec-checker, security-auditor |
| **Containment** | Messy trial-and-error stays out of your session | exerciser, reproducer, migrator |
| **Parallelism** | N investigations at once | any of the above, fanned out |
### Roster
| Agent | Property | One-liner | Status |
|---|---|---|---|
| explorer | compression | Map a codebase/subsystem, report how it works | **built-in** (`Explore`) |
| evaluator | independence | Whole-repo assessment: 6 lenses + intent match + alternatives | ✅ in kit |
| reviewer | independence | Fresh-eyes review of a diff before commit | ✅ in kit |
| security-auditor | independence | Hostile persona: vulns, secrets, dependency CVEs | ✅ in kit |
| spec-checker | independence | Compliance vs Start9 community registry requirements | ✅ in kit (`start9-spec-checker`) |
| portability-checker | independence | Compliance vs my vendor-neutral / hot-swap standard | ✅ in kit |
| exerciser | containment | Black-box QA: run it, feed it normal + hostile inputs | ✅ in kit |
| researcher | compression | Multi-source web research → cited brief | ✅ in kit |
| janitor | compression | Spring-clean docs/artifacts: report stale, orphaned, superseded files | ✅ in kit |
| doc-auditor | independence | Doc drift: every README/instruction/HTML claim checked against the code | ✅ in kit |
| test-runner | compression | Run the suite, return only failures + causes | build when a repo has a real suite |
| docs-reader | compression | Read a library's docs, return just the calls you need | build on 3rd manual-trawling task |
| fresh-debugger | independence | Gets symptoms only (never your theories), hunts the bug | build next time you're stuck >1hr |
| reproducer | containment | Bug report in → minimal repro steps out | build when triaging user reports |
| migrator | containment | Mechanical change across many files → "done, N anomalies" | build at first big rename/upgrade |
*CVE, since it'll keep coming up:* **Common Vulnerabilities and Exposures** — the public
catalog of known security flaws, each with an ID like `CVE-2026-12345`. "Dependency
CVEs" = known holes in the third-party libraries your software pulls in. Scanners
(`npm audit`, `cargo audit`, `pip-audit`, `osv-scanner`) check your lockfiles against
the catalog. The security-auditor runs them.
---
## 3. Rules of use
1. **Delegate reading; keep understanding.** The cardinal rule. A subagent returns
conclusions — its *reasoning* dies with its context. Never delegate work you'll
iterate on (implementation, design) because you'd be editing half-blind. Test: *if
the next step needs the understanding, do it in the main window.*
2. **Demand verifiable outputs.** Trustworthy reports cite things you can spot-check
without re-reading the inputs: `file:line`, repro commands, failing test names, URLs.
If an agent's output can't be made verifiable, it's a bad subagent task.
3. **The description field is the API.** Auto-delegation reads only `description`. Write
it as trigger conditions for the delegating model, not marketing for you. "Use
proactively when…" and "MUST BE USED after…" genuinely change behavior.
4. **Least-privilege tools.** Evaluative agents get read-only (`Read, Grep, Glob`,
maybe `Bash` for git/scanners). Only agents whose *job* is writing get `Write`/`Edit`.
This keeps reports honest (an agent that can "just fix it" stops reporting) and makes
background runs safe.
5. **One job per agent.** When a definition grows "and also…", split it. A muddled
mission produces a muddled report.
6. **Cap the report.** Every definition states a line budget. The main context is the
scarce resource the whole system exists to protect; an agent that returns 500 lines
defeats itself.
7. **Mandate a Surprises section.** Cheap insurance against tunnel vision — the best
findings are often outside the asked question. "None" is an acceptable answer;
omitting the section is not.
8. **Two dials per agent: model × effort.** `model` is who you hire — haiku for
mechanical work, sonnet default, opus where judgment is the product. `effort` is
how long they're allowed to think before answering — it scales reasoning depth, not
capability ceiling. Both live in the wrapper frontmatter; frontmatter effort
overrides the session level while that agent runs. Match the dial to where the cost
actually lives: evaluator and security-auditor are **opus/high** because the report
*is* the reasoning; reviewer, researcher, exerciser, spec-checker, and
portability-checker are **sonnet/medium** because their cost is dominated by frequency,
tool-call turns, or procedure-following — not thinking depth. The deep insight: a well-proceduralized
guide is *pre-computed reasoning*, so every hour spent tightening a guide lowers
the model+effort that task needs forever. `low` is for truly mechanical agents
(migrator, test-runner) once their guides are trusted; `xhigh` is the upgrade if an
opus agent's reports feel shallow on hard targets. Global overrides, strongest
first: `CLAUDE_CODE_EFFORT_LEVEL` and `CLAUDE_CODE_SUBAGENT_MODEL` env vars →
wrapper frontmatter → session setting — handy for "run everything cheap today."
9. **Fresh eyes need fresh inputs.** When delegating to reviewer/debugger types, give
symptoms and scope — *never your theories*. Anchoring them throws away the one thing
the separate context bought you.
10. **Parallelize the independent, serialize the dependent.** "Audit security, exercise
the app, and check spec compliance *in parallel*" is one sentence to the main agent.
But chained work (find issues → fix them) must flow back through the main thread.
11. **Iterate the guide, not the chat.** A disappointing report means the guide is
wrong. Fix `guides/<name>.md` (it's version-controlled) and rerun — guides load at
run time, so no restart. Touch the wrapper only for tools/model/description
changes, and restart the session when you do. Correcting the agent
conversationally fixes nothing for next time.
12. **Know when *not* to delegate.** Quick targeted changes; tasks needing your taste
mid-stream; reads under ~5 files (overhead exceeds benefit); anything where latency
matters more than context hygiene.
---
## 4. Shaping the report
The summary is the interface — "report back" gets you mush. Every *guide* in this kit
embeds a variant of the universal contract below; the agent reads its guide at startup,
so the contract reaches it without bloating the wrapper. The universal form here is
your template for designing new guides.
### Universal report contract
```
## Verdict — the answer, 13 sentences, no throat-clearing
## Findings — each: [severity] one-line title → evidence → so-what
## Coverage — what was examined and what wasn't (makes silence meaningful)
## Surprises — anything unexpected, on-mission or not ("none" allowed)
## Next actions — ranked, concrete, imperative mood
## Confidence — high/medium/low + the one thing that would raise it
```
### Severity scale (shared by all evaluative agents)
- **P0** exploitable, data loss, or crashes in normal use — fix before anything else
- **P1** wrong behavior or weak security on realistic paths — fix before release
- **P2** works, but fragile / hard to maintain / poor practice
- **P3** cosmetic, style, minor docs
- **P4** informational / matter of taste
### Evidence by agent type
Per-agent evidence requirements: code findings cite `file:line`; bugs include exact
repro commands plus expected-vs-actual; research claims carry a URL and are labeled
**Fact** (sourced) or **Inference** (the agent's own conclusion); compliance findings
cite the requirement's doc URL. A finding without its evidence type gets deleted, not
softened.
### Length budgets
Reviewer ≤ 70 lines · exerciser, spec-checker, portability-checker, janitor & doc-auditor
≤ 80 · researcher & security-auditor ≤ 100 · evaluator ≤ 120. Tighten freely; loosen only
with cause.
---
## 5. Creating a new agent
Three routes, in order of preference:
**Route 1 — copy a neighbor.** The fastest path to a *good* agent is copying the
closest existing pair: its guide in `guides/` and its wrapper in
`adapters/claude/agents/`. The contract and structure carry over; only the mission
changes. Scope rule applies: project-specific agents put the guide in
`<repo>/docs/guides/` and the wrapper in `<repo>/.claude/agents/` instead.
**Route 2 — `/agents` → "Generate with Claude".** Interactive, guided, loads without a
session restart. Good for quick experiments; once it earns permanence, split it —
substance to a guide, vendor header to a thin wrapper — at the right scope.
**Route 3 — the meta-prompt.** Paste this into any Claude session, filled in:
```
Create a Claude Code subagent definition file (YAML frontmatter + system-prompt body).
Role: [one sentence — what it does]
Trigger: [when the main agent should delegate; exact phrases that should fire it]
Inputs: [what the delegation prompt will contain — paths, scope, symptoms]
Tools: [least-privilege list; justify any Write/Edit/Bash]
Must never: [hard limits — e.g. never edits source, never returns raw logs]
Report: [sections, severity scale, evidence type, line cap]
Model: [haiku|sonnet|opus|inherit — and why]
Produce TWO files per my portability protocol:
1. The GUIDE — vendor-neutral, second person, self-contained: mission, procedure,
hard rules, then the report template verbatim. End with: "If blocked, report
exactly what blocked you — never guess or fabricate findings."
→ guides/<name>.md (standards repo if global; <repo>/docs/guides/ if project).
2. The WRAPPER — vendor syntax confined to the header: YAML frontmatter with name,
a description written FOR THE DELEGATING MODEL (explicit trigger conditions,
"use proactively when…" phrasing where wanted), least-privilege tools, model.
Body only: role one-liner, directive to read the guide first, stop-and-say-so
fail-safe if the guide is unreadable, one-line safety floor.
→ adapters/claude/agents/<name>.md (global) or <repo>/.claude/agents/ (project).
Then show one example invocation and the first 10 lines of an ideal report.
```
**Checklist** (every guide+wrapper pair, before it ships):
Guide — self-contained · embedded report template with line cap · Surprises section ·
blocked-means-say-so clause. Wrapper — trigger-rich description · least-privilege
`tools` · explicit `model` · guide pointer with fail-safe · safety floor. Then: save
both at the right scope, restart the session (new wrapper), run once on a real task,
tighten the guide where it disappointed.
---
## 6. The orchestrator question
You cannot build an orchestrator *as a subagent* — subagents can't spawn subagents.
Three real options:
1. **Slash command (this kit's choice): `/full-eval`.** A command file is just a stored
prompt for the *main thread* — and the main thread *can* fan out. `/full-eval` makes
it launch evaluator + security-auditor + exerciser + doc-auditor (+ start9-spec-checker
when relevant) in parallel, then synthesize one deduplicated, prioritized report. Zero new
machinery, works in any session.
2. **`claude --agent <name>` (advanced).** A session launched this way takes the agent's
system prompt as its main thread — and a main-thread agent *can* spawn subagents,
even restricted to a roster via `tools: Agent(evaluator, security-auditor, ...)`.
Build this only if `/full-eval` starts feeling too manual.
3. **Agent teams (experimental).** Multiple communicating sessions. Ignore until a task
genuinely exceeds one session's context even with subagents.
Synthesis is the orchestrator's real job, and it stays in the main thread on purpose:
cross-referencing ("the crash the exerciser found is the injection the auditor
flagged") is exactly the understanding you don't delegate.
---
## 7. Workflows
### A. The retrospective (the dozen existing repos)
Per repo, in order:
1. **Retrofit first** (per `retrofit-playbook.md`): AGENTS.md with Current state, the
CLAUDE.md symlink. Subagents load this file — the evaluator literally reads it to
judge the build against your stated intent. Eval before retrofit = grading against
a blank rubric. Confirm the retrofit's structure with **portability-checker** before
moving on — especially on repos retrofitted under an older playbook, where AGENTS.md
may still be a real CLAUDE.md or rules may not yet be symlinks into docs/guides.
2. **`/full-eval`** in a fresh session. Walk away; read one synthesized report.
3. **Triage** into the repo's AGENTS.md Current state: P0/P1 become the work queue,
P2 becomes a "known debt" list, P3+ gets deferred for later decision or done in bulk.
4. **Fix in the main session** (understanding work — yours), running **reviewer** on
each meaningful diff before commit.
5. **Close out** per the ritual; move to the next repo. Don't batch — one repo's
full cycle beats six repos' half-cycles.
### B. Steady state (everything built from now on)
- **While building:** built-in Explore for "how does X work" questions; researcher
before adopting any dependency; fresh-debugger discipline when stuck >1hr (symptoms
only, no theories).
- **Every meaningful diff:** reviewer, before commit. This is the habit that compounds.
- **Before any release:** exerciser + security-auditor in parallel.
- **Before registry submission:** start9-spec-checker.
- **After retrofitting or converting a repo:** portability-checker, to confirm it follows
the vendor-neutral / hot-swap standard before you rely on it being swappable.
- **Quarterly-ish:** `/full-eval` on actively maintained repos; drift accumulates
silently.
---
## 8. Install & maintenance
One-time install — the portability protocol's standard symlinks (if `~/.claude/agents`
or `~/.claude/commands` already exist as real directories, move their contents into the
adapters folders first):
```bash
ln -s ~/Projects/standards/adapters/claude/agents ~/.claude/agents
ln -s ~/Projects/standards/adapters/claude/commands ~/.claude/commands
```
Then restart any open Claude Code session (wrappers load at session start). First run:
`cd` into a repo, `claude`, then `/full-eval` — or invoke singles anytime
(`@agent-reviewer look at my uncommitted changes`).
- **Two-file edit model.** Substance → `guides/<name>.md`, live on the agent's next
run, no restart. Machinery (tools, model, description) → the wrapper, restart
required. When in doubt, you're editing the guide.
- Edits land via the normal close-out ritual → commit → push → backed up on the Start9
with the rest of the standards repo. The directory symlinks mean `git pull` updates
every machine-wide agent; there is no reinstall step.
- Sessions that can't see `~/Projects` (cloud, containers) simply have no global
agents; a wrapper that loads without its guide stops and says so rather than
improvising — that's the fail-safe working as designed.
- Prune ruthlessly. An agent unused for months is context-rot in the `/agents` list;
delete the pair — git remembers.