Claude Code Evaluation Platform

Measure what each skill can do, backed by run-level evidence.

Build tasksets, run Claude in a captured terminal session, and inspect completion, efficiency, and capability evidence in one report flow.

CLI Workflow

$ npm i -g @skillscore/cli

$ ssc doctor

$ ssc start --setup

$ ssc publish --run-id <id>

Latest report: changelog-automation · 2026-02-15 06:57:36.653Z

Why skillscore.sh

Run With Full Evidence

Capture the live TUI, a replayable terminal cast, and the session JSONL so every score is backed by inspectable proof.
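To show how a terminal cast can serve as inspectable proof, here is a minimal sketch that reads a recording and collects the output written up to a chosen moment. It assumes the cast uses asciinema's asciicast v2 format: a JSON header on the first line, then one `[seconds, eventType, data]` array per line. This page does not confirm which cast format skillscore.sh uses, and `parseCast` and `outputUntil` are hypothetical helper names.

```typescript
// ASSUMPTION: the cast is in asciicast v2 format (not confirmed by skillscore.sh).
// Line 1 is a JSON header; every later line is a JSON array [seconds, eventType, data].
interface CastHeader {
  version: number;
  width: number;
  height: number;
}

type CastEvent = [number, string, string];

// Hypothetical helper: splits a cast file into its header and event list.
function parseCast(text: string): { header: CastHeader; events: CastEvent[] } {
  const lines = text.split("\n").filter((l) => l.trim().length > 0);
  if (lines.length === 0) throw new Error("empty cast");
  const header = JSON.parse(lines[0]) as CastHeader;
  if (header.version !== 2) throw new Error(`unsupported cast version ${header.version}`);
  const events = lines.slice(1).map((l) => JSON.parse(l) as CastEvent);
  return { header, events };
}

// Hypothetical helper: concatenates raw output ("o") events recorded at or before
// `seconds`. Rendering the actual screen would still require a terminal emulator.
function outputUntil(events: CastEvent[], seconds: number): string {
  return events
    .filter(([t, kind]) => kind === "o" && t <= seconds)
    .map(([, , data]) => data)
    .join("");
}
```

Stopping the output at an arbitrary timestamp is what lets a reviewer line up the point where a score was earned or lost with what the terminal actually showed.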

Track What Actually Matters

Completion, duration, token usage, and tool-call quality are computed from session events.
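As an illustration of computing these metrics from session events, here is a minimal sketch that parses JSONL (one JSON object per line) and derives completion, duration, token totals, and a tool-call error rate. The event shape is a hypothetical simplification, not the real Claude Code session schema. Using the tool-call error rate as a stand-in for "tool-call quality" is also an assumption; the platform's actual scoring may be richer.

```typescript
// HYPOTHETICAL, simplified event shape for illustration only;
// this is NOT the actual Claude Code session JSONL schema.
interface SessionEvent {
  timestamp: string; // ISO 8601, e.g. "2026-02-15T06:57:36.653Z"
  type: "message" | "tool_call" | "result";
  inputTokens?: number;
  outputTokens?: number;
  toolName?: string;
  isError?: boolean; // set on tool_call events that failed
  success?: boolean; // set on the final result event
}

interface RunMetrics {
  completed: boolean;
  durationMs: number;
  totalTokens: number;
  toolCalls: number;
  toolErrorRate: number; // a crude, assumed proxy for tool-call quality
}

// Parses JSONL: one JSON object per non-empty line.
function parseJsonl(text: string): SessionEvent[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as SessionEvent);
}

function computeMetrics(events: SessionEvent[]): RunMetrics {
  let totalTokens = 0;
  let toolCalls = 0;
  let toolErrors = 0;
  let completed = false;
  for (const e of events) {
    totalTokens += (e.inputTokens ?? 0) + (e.outputTokens ?? 0);
    if (e.type === "tool_call") {
      toolCalls += 1;
      if (e.isError) toolErrors += 1;
    }
    if (e.type === "result" && e.success === true) completed = true;
  }
  // Duration runs from the earliest event timestamp to the latest one.
  const times = events.map((e) => Date.parse(e.timestamp));
  const durationMs = times.length > 0 ? Math.max(...times) - Math.min(...times) : 0;
  return {
    completed,
    durationMs,
    totalTokens,
    toolCalls,
    toolErrorRate: toolCalls === 0 ? 0 : toolErrors / toolCalls,
  };
}
```

Deriving every number from the event log, rather than from a self-reported summary, means anyone holding the same JSONL can recompute the score and get the same result.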

Improve Skills Over Time

Use run and task trend views to find regressions, compare versions, and iterate on SKILL.md quality.
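Finding regressions across runs comes down to comparing per-task results between a baseline run and a candidate run. The sketch below flags tasks that stop completing or that become noticeably slower or more token-hungry. The `TaskResult` shape, the `findRegressions` name, and the default 20% tolerance are all hypothetical choices for illustration, not skillscore.sh's documented behavior.

```typescript
// HYPOTHETICAL per-task summary for one run of a taskset.
interface TaskResult {
  taskId: string;
  completed: boolean;
  durationMs: number;
  totalTokens: number;
}

interface Regression {
  taskId: string;
  reason: string;
}

// Hypothetical helper: flags tasks that got worse from baseline to candidate.
// `tolerance` is relative slack; 0.2 tolerates up to 20% more time or tokens.
function findRegressions(
  baseline: TaskResult[],
  candidate: TaskResult[],
  tolerance = 0.2,
): Regression[] {
  const base = new Map<string, TaskResult>();
  for (const t of baseline) base.set(t.taskId, t);

  const out: Regression[] = [];
  for (const c of candidate) {
    const b = base.get(c.taskId);
    if (b === undefined) continue; // new task: nothing to compare against
    if (b.completed && !c.completed) {
      out.push({ taskId: c.taskId, reason: "no longer completes" });
      continue; // efficiency comparisons mean little for a failed run
    }
    if (c.durationMs > b.durationMs * (1 + tolerance)) {
      out.push({ taskId: c.taskId, reason: "slower" });
    }
    if (c.totalTokens > b.totalTokens * (1 + tolerance)) {
      out.push({ taskId: c.taskId, reason: "more tokens" });
    }
  }
  return out;
}
```

The tolerance absorbs normal run-to-run noise. Without it, ordinary variance in model latency would flag nearly every task after each SKILL.md edit.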

Latest benchmark reports
