Claude Code Evaluation Platform

Measure what each skill can do with run-level evidence.

Build tasksets, run Claude Code in a captured terminal session, and inspect completion, efficiency, and capability evidence in a single report flow.
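
As a sketch of what a taskset might contain, the snippet below models one with plain Python dataclasses; the class names and fields are illustrative assumptions, not the platform's actual schema.

    # Illustrative taskset model. All names below are assumptions,
    # not the platform's real schema.
    from dataclasses import dataclass, field

    @dataclass
    class Task:
        prompt: str                                # instruction given to Claude Code
        skill: str                                 # skill under evaluation
        success_criteria: list[str] = field(default_factory=list)

    @dataclass
    class Taskset:
        name: str
        tasks: list[Task] = field(default_factory=list)

    # Example taskset with a single task
    taskset = Taskset(
        name="refactoring-basics",
        tasks=[Task(
            prompt="Rename util.py to helpers.py and update all imports",
            skill="refactoring",
            success_criteria=["imports resolve", "tests pass"],
        )],
    )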

Platform highlights

High-fidelity Claude runs

Run Claude Code locally with terminal TUI capture, live streaming, and deterministic replay.
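
A minimal sketch of how such capture can work, using Python's standard pty module to run a command in a pseudo-terminal and record timestamped output chunks for later replay; the "claude" command name and the recording format are assumptions.

    # Minimal terminal-capture sketch: run a command under a pseudo-terminal
    # and record timestamped output chunks so the session can be replayed.
    import os, pty, time, json

    recording = []
    start = time.monotonic()

    def record(fd):
        try:
            data = os.read(fd, 1024)
        except OSError:        # child exited; treat as end of stream
            return b""
        recording.append({
            "t": round(time.monotonic() - start, 4),    # seconds since start
            "data": data.decode("utf-8", "replace"),
        })
        return data            # pty.spawn echoes the returned bytes to stdout

    # Blocks until the child exits; all output flows through record().
    # "claude" as the CLI name is an assumption for this sketch.
    pty.spawn(["claude"], master_read=record)

    with open("session.json", "w") as f:
        json.dump(recording, f)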

Session-grounded metrics

Completion status, duration, token usage, and tool behavior are parsed directly from the Claude session JSONL log.
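
A sketch of deriving these metrics from a session JSONL file; the field names (message.usage, tool_use content blocks) follow a common Claude session-log layout but should be treated as assumptions for any given version.

    # Aggregate token usage and tool-call counts from a session JSONL file.
    # Field names are assumptions based on a common session-log layout.
    import json
    from collections import Counter

    def summarize(path: str) -> dict:
        input_tokens = output_tokens = 0
        tools = Counter()
        with open(path) as f:
            for line in f:
                entry = json.loads(line)
                msg = entry.get("message") or {}
                usage = msg.get("usage") or {}
                input_tokens += usage.get("input_tokens", 0)
                output_tokens += usage.get("output_tokens", 0)
                # Count tool invocations recorded as tool_use content blocks
                for block in msg.get("content") or []:
                    if isinstance(block, dict) and block.get("type") == "tool_use":
                        tools[block.get("name", "unknown")] += 1
        return {
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "tool_calls": dict(tools),
        }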

Structured report output

Each run persists a report snapshot and a final report for review, download, and comparison.
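
A sketch of one way to persist per-run snapshots alongside a final report as JSON files; the directory layout and file names are assumptions, not the platform's actual storage scheme.

    # Persist a per-run report snapshot or final report as JSON.
    # The runs/<run_id>/ layout and file names are assumptions.
    import json, pathlib, datetime

    def persist_report(run_id: str, report: dict, final: bool = False) -> pathlib.Path:
        run_dir = pathlib.Path("runs") / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        name = "report-final.json" if final else f"report-snapshot-{stamp}.json"
        path = run_dir / name
        path.write_text(json.dumps(report, indent=2))
        return path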
