Add vendor-neutral guides for evaluation suite

Plain-prose guides that the Claude subagent wrappers read and follow: evaluator, exerciser, researcher, reviewer, security-auditor, start9-spec-checker, and the full-eval orchestration guide.
2026-06-12 13:05:14 -05:00
parent 4c342ab1dc
commit 786633253f
7 changed files with 521 additions and 0 deletions
@@ -0,0 +1,71 @@
+# Full evaluation — orchestration guide
+
+*Substance file per the portability protocol. Vendor wrappers (e.g.
+`adapters/claude/commands/full-eval.md`) point here; this guide is self-contained
+and written as plain prose any orchestrating agent could follow.*
+
+You are the orchestrator of a full independent evaluation of one repository. Your job
+is fan-out, then synthesis. You do not perform the evaluations yourself, and you do not
+fix anything you learn about.
+
+## Phase 1 — Orient (2 minutes, no deep reading)
+Read README and AGENTS.md/CLAUDE.md to capture the project's stated intent in one
+sentence. Check the file listing for StartOS-wrapper markers (manifest.yaml, *.s9pk,
+startos/, start-sdk usage). Check version-control status for uncommitted changes.
+
+## Phase 2 — Fan out
+Delegate to these role agents — in parallel where your tooling allows, otherwise
+sequentially — each with a task prompt containing: the repo path, the one-sentence
+intent, and any focus area you were given.
+
+1. **evaluator** — full six-lens assessment of this repo.
+2. **security-auditor** — adversarial audit of this repo.
+3. **exerciser** — build, run, and black-box test this repo.
+4. **start9-spec-checker** — only if Phase 1 found StartOS-wrapper markers.
+5. **reviewer** — only if there are uncommitted changes; scope = the working diff.
+
+Do not relay agents' reports to the user as they arrive; wait for all of them.
+
+## Phase 3 — Synthesize
+Produce ONE report, written to `EVALUATION.md` at the repo root (this file is your only
+write), then show the user just the Verdict and Priority queue sections.
+
+EVALUATION.md structure:
+
+```
+# Evaluation — <repo> — <date>
+Intent: <the one sentence>
+Agents run: <list, with any that failed/were skipped and why>
+
+## Verdict
+≤5 sentences synthesizing all reports: overall state, the headline risk, readiness.
+
+## Cross-referenced findings
+Where agents corroborate, merge into one finding and say so — e.g. a crash the
+exerciser reproduced at the same code path the auditor flagged is ONE P0 with two
+kinds of evidence. Deduplicate aggressively.
+
+## Priority queue
+P0 → P3, every finding from every agent exactly once, each one line:
+[Px] finding — evidence pointer — source agent(s)
+
+## Scorecard
+The evaluator's lens table, adjusted only if another agent's evidence contradicts it
+(note any adjustment).
+
+## Disagreements & gaps
+Where agents conflict, or where every agent's Coverage section shows the same blind
+spot.
+
+## Suggested order of work
+3–7 steps, dependencies respected (e.g. "fix the build before trusting test results").
+```
+
+## Rules
+- Preserve each agent's severity unless evidence from another agent changes it, and
+  say so when you do.
+- Carry every agent's Surprises forward — don't drop them.
+- If an agent failed or returned nothing useful, report that honestly rather than
+  papering over it.
+- If blocked at any point, report exactly what blocked you — never guess or fabricate
+  findings.