Claude Code Evaluation Platform

Measure what each skill can do with run-level evidence.

Build tasksets, run Claude Code in a captured terminal session, and inspect completion, efficiency, and capability evidence in a single report flow.
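
As a sketch of what a taskset might contain, the snippet below models one with plain Python dataclasses; the class names and fields are illustrative assumptions, not the platform's actual schema.

    # Illustrative taskset model. All names below are assumptions,
    # not the platform's real schema.
    from dataclasses import dataclass, field

    @dataclass
    class Task:
        prompt: str                                # instruction given to Claude Code
        skill: str                                 # skill under evaluation
        success_criteria: list[str] = field(default_factory=list)

    @dataclass
    class Taskset:
        name: str
        tasks: list[Task] = field(default_factory=list)

    # Example taskset with a single task
    taskset = Taskset(
        name="refactoring-basics",
        tasks=[Task(
            prompt="Rename util.py to helpers.py and update all imports",
            skill="refactoring",
            success_criteria=["imports resolve", "tests pass"],
        )],
    )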

Platform highlights

High-fidelity Claude runs

Run Claude Code locally with terminal TUI capture, live streaming, and deterministic replay.
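
A minimal sketch of how such capture can work, using Python's standard pty module to run a command in a pseudo-terminal and record timestamped output chunks for later replay; the "claude" command name and the recording format are assumptions.

    # Minimal terminal-capture sketch: run a command under a pseudo-terminal
    # and record timestamped output chunks so the session can be replayed.
    import os, pty, time, json

    recording = []
    start = time.monotonic()

    def record(fd):
        try:
            data = os.read(fd, 1024)
        except OSError:        # child exited; treat as end of stream
            return b""
        recording.append({
            "t": round(time.monotonic() - start, 4),    # seconds since start
            "data": data.decode("utf-8", "replace"),
        })
        return data            # pty.spawn echoes the returned bytes to stdout

    # Blocks until the child exits; all output flows through record().
    # "claude" as the CLI name is an assumption for this sketch.
    pty.spawn(["claude"], master_read=record)

    with open("session.json", "w") as f:
        json.dump(recording, f)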

Session-grounded metrics

Completion status, duration, token usage, and tool behavior are parsed directly from the Claude session JSONL log.
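
A sketch of deriving these metrics from a session JSONL file; the field names (message.usage, tool_use content blocks) follow a common Claude session-log layout but should be treated as assumptions for any given version.

    # Aggregate token usage and tool-call counts from a session JSONL file.
    # Field names are assumptions based on a common session-log layout.
    import json
    from collections import Counter

    def summarize(path: str) -> dict:
        input_tokens = output_tokens = 0
        tools = Counter()
        with open(path) as f:
            for line in f:
                entry = json.loads(line)
                msg = entry.get("message") or {}
                usage = msg.get("usage") or {}
                input_tokens += usage.get("input_tokens", 0)
                output_tokens += usage.get("output_tokens", 0)
                # Count tool invocations recorded as tool_use content blocks
                for block in msg.get("content") or []:
                    if isinstance(block, dict) and block.get("type") == "tool_use":
                        tools[block.get("name", "unknown")] += 1
        return {
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "tool_calls": dict(tools),
        }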

Structured report output

Each run persists a report snapshot and a final report for review, download, and comparison.
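
A sketch of one way to persist per-run snapshots alongside a final report as JSON files; the directory layout and file names are assumptions, not the platform's actual storage scheme.

    # Persist a per-run report snapshot or final report as JSON.
    # The runs/<run_id>/ layout and file names are assumptions.
    import json, pathlib, datetime

    def persist_report(run_id: str, report: dict, final: bool = False) -> pathlib.Path:
        run_dir = pathlib.Path("runs") / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        name = "report-final.json" if final else f"report-snapshot-{stamp}.json"
        path = run_dir / name
        path.write_text(json.dumps(report, indent=2))
        return path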
