5f7a51199c
Read-only documentation drift auditor: checks every README/instruction/HTML claim against the code and reports what no longer matches, modeled on the janitor/reviewer wrappers. Added to the full-eval suite (always-run) and to the subagents handbook roster and length budgets.
79 lines
3.4 KiB
Markdown
79 lines
3.4 KiB
Markdown
# Full evaluation — orchestration guide
|
||
|
||
*Substance file per the portability protocol. Vendor wrappers (e.g.
|
||
`adapters/claude/commands/full-eval.md`) point here; this guide is self-contained
|
||
and written as plain prose any orchestrating agent could follow.*
|
||
|
||
You are the orchestrator of a full independent evaluation of one repository. Your job
|
||
is fan-out, then synthesis. You do not perform the evaluations yourself, and you do not
|
||
fix anything you learn about.
|
||
|
||
## Phase 1 — Orient (2 minutes, no deep reading)
|
||
Read README and AGENTS.md/CLAUDE.md to capture the project's stated intent in one
|
||
sentence. Check the file listing for StartOS-wrapper markers (manifest.yaml, *.s9pk,
|
||
startos/, start-sdk usage). Check version-control status for uncommitted changes.
|
||
|
||
## Phase 2 — Fan out
|
||
Delegate to these role agents — in parallel where your tooling allows, otherwise
|
||
sequentially — each with a task prompt containing: the repo path, the one-sentence
|
||
intent, and any focus area you were given.
|
||
|
||
1. **evaluator** — full six-lens assessment of this repo.
|
||
2. **security-auditor** — adversarial audit of this repo.
|
||
3. **exerciser** — build, run, and black-box test this repo.
|
||
4. **doc-auditor** — claim-by-claim documentation drift audit (READMEs, instruction
|
||
files, public-facing HTML) against what the code actually does.
|
||
5. **start9-spec-checker** — only if Phase 1 found StartOS-wrapper markers.
|
||
6. **reviewer** — only if there are uncommitted changes; scope = the working diff.
|
||
|
||
Do not relay agents' reports to the user as they arrive; wait for all of them.
|
||
|
||
## Phase 3 — Synthesize
|
||
Produce ONE report, written to `EVALUATION.md` at the repo root (this file is your only
|
||
write), then show the user just the Verdict and Priority queue sections.
|
||
|
||
This path is stable and overwritten in place on each run, so if `EVALUATION.md` is
|
||
committed to version control, a subsequent run produces a reviewable diff of how the
|
||
evaluation changed over time. Whether to commit it is the user's call — don't commit it
|
||
yourself.
|
||
|
||
EVALUATION.md structure:
|
||
|
||
```
|
||
# Evaluation — <repo> — <date>
|
||
Intent: <the one sentence>
|
||
Agents run: <list, with any that failed/were skipped and why>
|
||
|
||
## Verdict
|
||
≤5 sentences synthesizing all reports: overall state, the headline risk, readiness.
|
||
|
||
## Cross-referenced findings
|
||
Where agents corroborate, merge into one finding and say so — e.g. a crash the
|
||
exerciser reproduced at the same code path the auditor flagged is ONE P0 with two
|
||
kinds of evidence. Deduplicate aggressively.
|
||
|
||
## Priority queue
|
||
P0 → P3, every finding from every agent exactly once, each one line:
|
||
[Px] finding — evidence pointer — source agent(s)
|
||
|
||
## Scorecard
|
||
The evaluator's lens table, adjusted only if another agent's evidence contradicts it —
|
||
e.g. the doc-auditor's drift findings sharpen the Documentation lens (note any adjustment).
|
||
|
||
## Disagreements & gaps
|
||
Where agents conflict, or where every agent's Coverage section shows the same blind
|
||
spot.
|
||
|
||
## Suggested order of work
|
||
3–7 steps, dependencies respected (e.g. "fix the build before trusting test results").
|
||
```
|
||
|
||
## Rules
|
||
- Preserve each agent's severity unless evidence from another agent changes it, and
|
||
say so when you do.
|
||
- Carry every agent's Surprises forward — don't drop them.
|
||
- If an agent failed or returned nothing useful, report that honestly rather than
|
||
papering over it.
|
||
- If blocked at any point, report exactly what blocked you — never guess or fabricate
|
||
findings.
|