Skip to main content

Writing Evals for Your Plugin

harness-kit ships an eval framework that measures skill quality across models. Official evals live in evals/tasks/. Plugin authors can ship their own evals alongside their skills (beta).

Quickstart

Add an evals/tasks/ directory inside your plugin:

plugins/my-plugin/
├── .claude-plugin/
│ └── plugin.json
├── skills/my-skill/
│ └── SKILL.md
└── evals/
└── tasks/
└── task-basic.yaml

Task Definition Format

name: my-skill-basic          # unique identifier for this task
skill: my-skill # matches the skill directory name
version: 1

input:
user_message: "/my-skill example-input"
fixture: my-fixture # optional: directory in evals/fixtures/

expectations:
structure: # fast, free, deterministic
required_sections: # H2 headings that must appear in output
- Summary
- Details
min_words: 100
max_words: 2000
must_contain:
- "expected phrase"
must_not_contain:
- "I cannot"

quality: # model-based rubric (uses Haiku — costs tokens)
- criterion: "Summary accurately describes the input"
weight: 2
- criterion: "Output does not invent information not present in the fixture"
weight: 3

trials: 3
tags:
- my-tag

Fixtures

Fixtures are minimal synthetic codebases loaded as pre-context in the skill prompt — no real repos, no network calls.

Official fixtures live at evals/fixtures/{name}/. Each file is injected as:

--- FILE: relative/path ---
<file contents>

Plugin-authored fixtures should live alongside your task definitions:

plugins/my-plugin/evals/
├── fixtures/
│ └── my-fixture/
│ └── example.py
└── tasks/
└── task-basic.yaml

Note: Plugin fixtures are not yet resolved by the official runner. This is planned for a future release.

Running Evals

# Install deps (from repo root)
pip install -r evals/requirements.txt

# All tasks, all models
python evals/runner.py

# One skill only
python evals/runner.py --skill explain

# One model only
python evals/runner.py --model sonnet

# Code graders only — no API cost for quality grading
python evals/runner.py --structure-only

# Include plugin evals (beta)
python evals/runner.py --include-plugin-evals

# Dry run — see what would run
python evals/runner.py --dry-run

Grader Design

Code-based graders run instantly and for free:

GraderWhat it checks
SectionGraderRequired ## headings exist in output
WordCountGradermin/max word count bounds
ContainsGradermust_contain / must_not_contain phrases

Model-based graders use Haiku to evaluate quality criteria. Each criterion is one API call returning {"pass": true/false, "reason": "..."}. Use weight to indicate relative importance.

Metrics

MetricMeaning
pass@1Fraction of trials passing all structure checks
pass@NProbability ≥1 of N trials passes (capability ceiling)
pass^NProbability all N trials pass (reliability floor)
qualityAverage weighted quality score across trials

Tips

  • Keep fixtures small (1–5 files, <500 lines total) — they're loaded verbatim into the prompt
  • Write must_not_contain checks for known failure modes (e.g., "I don't have access")
  • Use --structure-only during task authoring to iterate without API grading cost
  • Quality criteria should be binary and unambiguous — avoid "good" or "clear" without defining what that means