standards/guides/full-eval.md
Keysat 786633253f Add vendor-neutral guides for evaluation suite
Plain-prose guides that the Claude subagent wrappers read and follow:
evaluator, exerciser, researcher, reviewer, security-auditor,
start9-spec-checker, and the full-eval orchestration guide.
2026-06-12 13:05:14 -05:00


Full evaluation — orchestration guide

Substance file per the portability protocol. Vendor wrappers (e.g. adapters/claude/commands/full-eval.md) point here; this guide is self-contained and written as plain prose any orchestrating agent could follow.

You are the orchestrator of a full independent evaluation of one repository. Your job is fan-out, then synthesis. You do not perform the evaluations yourself, and you do not fix any problem the evaluation surfaces.

Phase 1 — Orient (2 minutes, no deep reading)

Read README and AGENTS.md/CLAUDE.md to capture the project's stated intent in one sentence. Check the file listing for StartOS-wrapper markers (manifest.yaml, *.s9pk, startos/, start-sdk usage). Check version-control status for uncommitted changes.
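The two mechanical checks in this phase could be sketched as below. This is a minimal illustration, not part of the protocol: the function names are hypothetical, only the top level of the repo is scanned, git is assumed to be the version-control system, and "start-sdk usage" is not detected because the guide does not say how to recognise it.

```python
import subprocess
from pathlib import Path


def find_startos_markers(repo: Path) -> list[str]:
    """Return the StartOS-wrapper markers present at the top level of the repo."""
    found = []
    if (repo / "manifest.yaml").is_file():
        found.append("manifest.yaml")
    if (repo / "startos").is_dir():
        found.append("startos/")
    if any(repo.glob("*.s9pk")):
        found.append("*.s9pk")
    return found


def has_uncommitted_changes(repo: Path) -> bool:
    """True if `git status --porcelain` prints anything (assumes a git checkout)."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo,
        capture_output=True,
        text=True,
        check=True,
    )
    return bool(result.stdout.strip())
```

A non-empty marker list is what enables start9-spec-checker in Phase 2, and a true result from the second check is what enables reviewer.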

Phase 2 — Fan out

Delegate to these role agents — in parallel where your tooling allows, otherwise sequentially — each with a task prompt containing: the repo path, the one-sentence intent, and any focus area you were given.

  1. evaluator — full six-lens assessment of this repo.
  2. security-auditor — adversarial audit of this repo.
  3. exerciser — build, run, and black-box test this repo.
  4. start9-spec-checker — only if Phase 1 found StartOS-wrapper markers.
  5. reviewer — only if there are uncommitted changes; scope = the working diff.

Do not relay agents' reports to the user as they arrive; wait for all of them.
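As a rough sketch of the selection logic above, the function below turns the Phase 1 results into a list of task prompts. The `Task` type, the `plan_fan_out` name, and the prompt layout are assumptions made for illustration; the guide only fixes which agents run and what each prompt must contain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    agent: str
    prompt: str


def plan_fan_out(
    repo_path: str,
    intent: str,
    focus: Optional[str] = None,
    *,
    startos_markers: bool = False,
    uncommitted_changes: bool = False,
) -> list[Task]:
    """Pick the role agents to run and build one task prompt for each."""
    roles = [
        ("evaluator", "Full six-lens assessment of this repo."),
        ("security-auditor", "Adversarial audit of this repo."),
        ("exerciser", "Build, run, and black-box test this repo."),
    ]
    if startos_markers:
        # Conditional agent: only when Phase 1 found StartOS-wrapper markers.
        roles.append(("start9-spec-checker", "Run your checks on this repo."))
    if uncommitted_changes:
        # Conditional agent: scope is the working diff, not the whole repo.
        roles.append(("reviewer", "Review the working diff."))

    tasks = []
    for agent, instruction in roles:
        lines = [instruction, f"Repo: {repo_path}", f"Intent: {intent}"]
        if focus:
            lines.append(f"Focus: {focus}")
        tasks.append(Task(agent, "\n".join(lines)))
    return tasks
```

The returned list can be dispatched in parallel or walked sequentially, depending on what the orchestrator's tooling allows; in either case nothing is shown to the user until every task has returned.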

Phase 3 — Synthesize

Produce ONE report, written to EVALUATION.md at the repo root (this file is your only write), then show the user just the Verdict and Priority queue sections.

EVALUATION.md structure:

# Evaluation — <repo> — <date>
Intent: <the one sentence>
Agents run: <list, with any that failed/were skipped and why>

## Verdict
≤5 sentences synthesizing all reports: overall state, the headline risk, readiness.

## Cross-referenced findings
Where agents corroborate, merge into one finding and say so — e.g. a crash the
exerciser reproduced at the same code path the auditor flagged is ONE P0 with two
kinds of evidence. Deduplicate aggressively.

## Priority queue
P0 → P3, every finding from every agent exactly once, each one line:
[Px] finding — evidence pointer — source agent(s)

## Scorecard
The evaluator's lens table, adjusted only if another agent's evidence contradicts it
(note any adjustment).

## Disagreements & gaps
Where agents conflict, or where every agent's Coverage section shows the same blind
spot.

## Suggested order of work
3–7 steps, dependencies respected (e.g. "fix the build before trusting test results").
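The merge-then-list behaviour described under Cross-referenced findings and Priority queue might look like the sketch below. The `Finding` shape and the `key` field used to detect corroboration are hypothetical, and taking the most severe rating when two agents report the same finding is an assumption of this sketch; under the rules, any such change still has to be stated in the report.

```python
from dataclasses import dataclass

SEP = " \u2014 "  # the separator used in the "[Px] finding ..." template line


@dataclass
class Finding:
    key: str             # hypothetical dedup key, e.g. the affected code path
    severity: int        # 0 for P0 through 3 for P3
    summary: str
    evidence: list[str]
    agents: list[str]


def merge_findings(findings: list[Finding]) -> list[Finding]:
    """Merge findings that share a key, then order them P0 first."""
    merged: dict[str, Finding] = {}
    for f in findings:
        if f.key not in merged:
            merged[f.key] = Finding(
                f.key, f.severity, f.summary, list(f.evidence), list(f.agents)
            )
            continue
        m = merged[f.key]
        m.severity = min(m.severity, f.severity)  # assumption: keep the most severe
        m.evidence += [e for e in f.evidence if e not in m.evidence]
        m.agents += [a for a in f.agents if a not in m.agents]
    # sorted() is stable, so findings of equal severity keep their arrival order.
    return sorted(merged.values(), key=lambda f: f.severity)


def priority_queue_lines(findings: list[Finding]) -> list[str]:
    """Render one line per finding in the Priority queue format."""
    return [
        f"[P{f.severity}] {f.summary}{SEP}{'; '.join(f.evidence)}{SEP}{', '.join(f.agents)}"
        for f in findings
    ]
```

With this shape, a crash reported by both the auditor and the exerciser at the same key comes out as a single line carrying both evidence pointers and both agent names.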

Rules

  • Preserve each agent's severity unless evidence from another agent changes it, and say so when you do.
  • Carry every agent's Surprises forward — don't drop them.
  • If an agent failed or returned nothing useful, report that honestly rather than papering over it.
  • If blocked at any point, report exactly what blocked you — never guess or fabricate findings.
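One way to keep the honesty rule mechanical is to record an outcome for every agent and render the "Agents run" header line from that record, so a failed or skipped agent cannot silently disappear. The function name and the outcome mapping below are illustrative assumptions, not part of the guide.

```python
from typing import Optional


def agents_run_line(outcomes: dict[str, Optional[str]]) -> str:
    """Render the 'Agents run' line.

    `outcomes` maps each agent name to None if it ran normally, or to a short
    reason string if it failed or was skipped.
    """
    parts = []
    for agent, problem in outcomes.items():
        parts.append(agent if problem is None else f"{agent} ({problem})")
    return "Agents run: " + ", ".join(parts)


print(agents_run_line({
    "evaluator": None,
    "exerciser": "failed: build did not complete",
    "reviewer": "skipped: no uncommitted changes",
}))
```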