Files
standards/guides/full-eval.md
T
Keysat 5f7a51199c Add doc-auditor agent and wire it into full-eval
Read-only documentation drift auditor: checks every README/instruction/HTML claim against the code and reports what no longer matches, modeled on the janitor/reviewer wrappers. Added to the full-eval suite (always-run) and to the subagents handbook roster and length budgets.
2026-06-13 00:06:43 -05:00

79 lines
3.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Full evaluation — orchestration guide
*Substance file per the portability protocol. Vendor wrappers (e.g.
`adapters/claude/commands/full-eval.md`) point here; this guide is self-contained
and written as plain prose any orchestrating agent could follow.*
You are the orchestrator of a full independent evaluation of one repository. Your job
is fan-out, then synthesis. You do not perform the evaluations yourself, and you do not
fix anything you learn about.
## Phase 1 — Orient (2 minutes, no deep reading)
Read README and AGENTS.md/CLAUDE.md to capture the project's stated intent in one
sentence. Check the file listing for StartOS-wrapper markers (manifest.yaml, *.s9pk,
startos/, start-sdk usage). Check version-control status for uncommitted changes.
## Phase 2 — Fan out
Delegate to these role agents — in parallel where your tooling allows, otherwise
sequentially — each with a task prompt containing: the repo path, the one-sentence
intent, and any focus area you were given.
1. **evaluator** — full six-lens assessment of this repo.
2. **security-auditor** — adversarial audit of this repo.
3. **exerciser** — build, run, and black-box test this repo.
4. **doc-auditor** — claim-by-claim documentation drift audit (READMEs, instruction
files, public-facing HTML) against what the code actually does.
5. **start9-spec-checker** — only if Phase 1 found StartOS-wrapper markers.
6. **reviewer** — only if there are uncommitted changes; scope = the working diff.
Do not relay agents' reports to the user as they arrive; wait for all of them.
## Phase 3 — Synthesize
Produce ONE report, written to `EVALUATION.md` at the repo root (this file is your only
write), then show the user just the Verdict and Priority queue sections.
This path is stable and overwritten in place on each run, so if `EVALUATION.md` is
committed to version control, a subsequent run produces a reviewable diff of how the
evaluation changed over time. Whether to commit it is the user's call — don't commit it
yourself.
EVALUATION.md structure:
```
# Evaluation — <repo> — <date>
Intent: <the one sentence>
Agents run: <list, with any that failed/were skipped and why>
## Verdict
≤5 sentences synthesizing all reports: overall state, the headline risk, readiness.
## Cross-referenced findings
Where agents corroborate, merge into one finding and say so — e.g. a crash the
exerciser reproduced at the same code path the auditor flagged is ONE P0 with two
kinds of evidence. Deduplicate aggressively.
## Priority queue
P0 → P3, every finding from every agent exactly once, each one line:
[Px] finding — evidence pointer — source agent(s)
## Scorecard
The evaluator's lens table, adjusted only if another agent's evidence contradicts it —
e.g. the doc-auditor's drift findings sharpen the Documentation lens (note any adjustment).
## Disagreements & gaps
Where agents conflict, or where every agent's Coverage section shows the same blind
spot.
## Suggested order of work
37 steps, dependencies respected (e.g. "fix the build before trusting test results").
```
## Rules
- Preserve each agent's severity unless evidence from another agent changes it, and
say so when you do.
- Carry every agent's Surprises forward — don't drop them.
- If an agent failed or returned nothing useful, report that honestly rather than
papering over it.
- If blocked at any point, report exactly what blocked you — never guess or fabricate
findings.