Writing Evals for Your Plugin
harness-kit ships an eval framework that measures skill quality across models. Official evals live in evals/tasks/. Plugin authors can ship their own evals alongside their skills (beta).
Quickstart
Add an evals/tasks/ directory inside your plugin:
plugins/my-plugin/
├── .claude-plugin/
│ └── plugin.json
├── skills/my-skill/
│ └── SKILL.md
└── evals/
└── tasks/
└── task-basic.yaml
Task Definition Format
name: my-skill-basic # unique identifier for this task
skill: my-skill # matches the skill directory name
version: 1
input:
user_message: "/my-skill example-input"
fixture: my-fixture # optional: directory in evals/fixtures/
expectations:
structure: # fast, free, deterministic
required_sections: # H2 headings that must appear in output
- Summary
- Details
min_words: 100
max_words: 2000
must_contain:
- "expected phrase"
must_not_contain:
- "I cannot"
quality: # model-based rubric (uses Haiku — costs tokens)
- criterion: "Summary accurately describes the input"
weight: 2
- criterion: "Output does not invent information not present in the fixture"
weight: 3
trials: 3
tags:
- my-tag
Fixtures
Fixtures are minimal synthetic codebases loaded as pre-context in the skill prompt — no real repos, no network calls.
Official fixtures live at evals/fixtures/{name}/. Each file is injected as:
--- FILE: relative/path ---
<file contents>
Plugin-authored fixtures should live alongside your task definitions:
plugins/my-plugin/evals/
├── fixtures/
│ └── my-fixture/
│ └── example.py
└── tasks/
└── task-basic.yaml
Note: Plugin fixtures are not yet resolved by the official runner. This is planned for a future release.
Running Evals
# Install deps (from repo root)
pip install -r evals/requirements.txt
# All tasks, all models
python evals/runner.py
# One skill only
python evals/runner.py --skill explain
# One model only
python evals/runner.py --model sonnet
# Code graders only — no API cost for quality grading
python evals/runner.py --structure-only
# Include plugin evals (beta)
python evals/runner.py --include-plugin-evals
# Dry run — see what would run
python evals/runner.py --dry-run
Grader Design
Code-based graders run instantly and for free:
| Grader | What it checks |
|---|---|
SectionGrader | Required ## headings exist in output |
WordCountGrader | min/max word count bounds |
ContainsGrader | must_contain / must_not_contain phrases |
Model-based graders use Haiku to evaluate quality criteria. Each criterion is one API call returning {"pass": true/false, "reason": "..."}. Use weight to indicate relative importance.
Metrics
| Metric | Meaning |
|---|---|
| pass@1 | Fraction of trials passing all structure checks |
| pass@N | Probability ≥1 of N trials passes (capability ceiling) |
| pass^N | Probability all N trials pass (reliability floor) |
| quality | Average weighted quality score across trials |
Tips
- Keep fixtures small (1–5 files, <500 lines total) — they're loaded verbatim into the prompt
- Write
must_not_containchecks for known failure modes (e.g., "I don't have access") - Use
--structure-onlyduring task authoring to iterate without API grading cost - Quality criteria should be binary and unambiguous — avoid "good" or "clear" without defining what that means