GitHub launches agent-skills-eval to benchmark AI agent extensions

Hacker News
The GitHub project agent-skills-eval fills a gap in the Anthropic-driven Agent Skills ecosystem by automatically measuring the impact of SKILL.md files. It runs each skill twice, once with the skill loaded into the prompt and once as a baseline without it, then hands both outputs to a judge model for side-by-side grading. Developers finally get concrete evidence of whether a skill actually improves performance.
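
To make the methodology concrete, a minimal TypeScript sketch of the same with-skill versus baseline comparison, written against the standard OpenAI Node client, might look like the following. The helper names, prompts, and model choices are illustrative assumptions, not the project's actual code:

```typescript
import OpenAI from "openai";
import { readFileSync } from "node:fs";

// Defaults to api.openai.com; pass { baseURL } for other OpenAI-compatible servers.
const client = new OpenAI();

async function complete(model: string, system: string | null, user: string): Promise<string> {
  const messages = [
    ...(system ? [{ role: "system" as const, content: system }] : []),
    { role: "user" as const, content: user },
  ];
  const res = await client.chat.completions.create({ model, messages });
  return res.choices[0].message.content ?? "";
}

async function compareSkill(skillPath: string, task: string): Promise<string> {
  const skill = readFileSync(skillPath, "utf8");

  // Same task, run twice: once with the SKILL.md injected, once as a plain baseline.
  const withSkill = await complete("gpt-4o-mini", skill, task);
  const baseline = await complete("gpt-4o-mini", null, task);

  // A judge model grades the two transcripts side by side.
  return complete(
    "gpt-4o-mini",
    "You are a strict grader. Decide whether answer A (skill loaded) or answer B (baseline) better completes the task, and explain why.",
    `Task:\n${task}\n\nAnswer A:\n${withSkill}\n\nAnswer B:\n${baseline}`,
  );
}

compareSkill("./skills/example/SKILL.md", "Draft release notes for the latest tag.")
  .then(console.log);
```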

Running the evaluator is a one‑liner: `npx agent-skills-eval ./skills --target gpt-4o-mini --judge gpt-4o-mini --baseline --strict`. It produces a workspace containing JSONL artifacts and a static HTML report that shows pass/fail rates, token usage, and tool‑call assertions. The tool works out of the box with any OpenAI‑compatible API, including Anthropic, Together, and local Llama servers, keeping it runtime‑agnostic.
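
Because the artifacts are newline-delimited JSON, they are easy to post-process for a custom dashboard. The sketch below assumes a `results.jsonl` file whose records carry a skill name, a boolean pass flag, and a token count; the file name and field names are assumptions, not the documented schema:

```typescript
import { readFileSync } from "node:fs";

// Assumed record shape; the real artifact schema may differ.
interface EvalRecord {
  skill: string;
  pass: boolean;
  tokens: number;
}

// Parse the newline-delimited JSON artifact into records.
const records: EvalRecord[] = readFileSync("workspace/results.jsonl", "utf8")
  .split("\n")
  .filter((line) => line.trim().length > 0)
  .map((line) => JSON.parse(line));

// Tally pass rate and token usage per skill.
const bySkill = new Map<string, { passed: number; total: number; tokens: number }>();
for (const r of records) {
  const agg = bySkill.get(r.skill) ?? { passed: 0, total: 0, tokens: 0 };
  agg.passed += r.pass ? 1 : 0;
  agg.total += 1;
  agg.tokens += r.tokens;
  bySkill.set(r.skill, agg);
}

for (const [skill, agg] of bySkill) {
  console.log(`${skill}: ${agg.passed}/${agg.total} passed, ${agg.tokens} tokens`);
}
```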

Because it implements the full agentskills.io specification, including SKILL.md validation, evals.json handling, and the official iteration-N artifact layout, teams can plug the evaluator into CI pipelines or custom dashboards via its TypeScript SDK. The result is a reproducible, portable benchmark that lets engineers iterate on agent knowledge without guessing and ship verifiable skill improvements to production.
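
For continuous integration, the least speculative route is to shell out to the published CLI and fail the job on a non-zero exit code, as in the sketch below; whether `--strict` gates the exit status this way is an assumption, and the TypeScript SDK may offer a richer programmatic API:

```typescript
import { execSync } from "node:child_process";

// execSync throws if the command exits non-zero, which fails the CI job.
try {
  execSync(
    "npx agent-skills-eval ./skills --target gpt-4o-mini --judge gpt-4o-mini --baseline --strict",
    { stdio: "inherit" },
  );
  console.log("Skill evaluations passed.");
} catch {
  console.error("Skill evaluations failed; see the HTML report in the workspace.");
  process.exit(1);
}
```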

Overall, agent-skills-eval provides a rigorous, language-model-agnostic framework that turns vague claims about agent knowledge into measurable results. By automating baseline comparisons, judge grading, and report generation, it lets teams ship SKILL.md files with confidence that each addition genuinely shifts model behavior, rather than relying on anecdotal testing.