HeadlinesBriefing.com

mdarena lets devs benchmark CLAUDE.md prompts

Hacker News

Open-source tool mdarena lets developers measure how their CLAUDE.md prompts affect AI-generated pull requests. By mining merged PRs from a repository, the utility builds a task set, auto-detects test commands from CI files, and runs Claude both with and without the prompt injected. Results appear in a side-by-side report that compares patch quality and token cost and indicates whether the differences are statistically significant.
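The summary does not say which statistical test mdarena applies, so the following is only a minimal sketch of how a with-prompt versus without-prompt comparison could be checked for significance. It assumes each mined task yields a pass/fail result in both runs and uses an exact two-sided sign test on the tasks where the two runs disagree. The function names and the sample results are hypothetical and are not taken from mdarena.

```python
from math import comb


def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks whose tests passed."""
    return sum(results) / len(results)


def paired_sign_test(baseline: list[bool], treated: list[bool]) -> float:
    """Two-sided exact sign test over tasks where the two runs disagree.

    Assumption: index i in both lists refers to the same mined PR task.
    """
    if len(baseline) != len(treated):
        raise ValueError("both runs must cover the same tasks")
    # Tasks that only the prompted run solved, and tasks only the baseline solved.
    wins = sum(1 for b, t in zip(baseline, treated) if t and not b)
    losses = sum(1 for b, t in zip(baseline, treated) if b and not t)
    n = wins + losses
    if n == 0:
        return 1.0  # the runs never disagree, so there is no evidence either way
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical results for ten tasks (True = tests passed).
baseline_run = [True, False, False, True, False, False, True, False, False, False]
prompted_run = [True, True, True, True, False, True, True, True, False, True]

p_value = paired_sign_test(baseline_run, prompted_run)
print(f"baseline {pass_rate(baseline_run):.0%}, "
      f"with prompt {pass_rate(prompted_run):.0%}, p = {p_value:.4f}")
```

A paired test fits this setup because both conditions run on the same tasks; with only a handful of disagreeing tasks, even a large gap in pass rate can fail to reach conventional significance thresholds.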

To get started, run `pip install mdarena`, then mine fifty recent PRs with `mdarena mine owner/repo --limit 50 --detect-tests`. Users can benchmark several CLAUDE.md variants side by side against a baseline that strips all prompts. In a monorepo trial, the existing CLAUDE.md raised test-pass rates by roughly 27% over the bare baseline, while a consolidated file performed no better than no prompt at all.

The tool is also compatible with SWE-bench: it can import existing benchmark tasks or export new ones as JSONL. Security warnings remind users to run mdarena only on trusted repositories, because it executes code from the repository, albeit in isolated temporary directories. By checking out a history-free snapshot, mdarena avoids leaking future commits into the run, so the benchmark reflects genuine past PR outcomes.
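For readers unfamiliar with the JSONL task format, here is a minimal sketch of writing and reading one task per line. The field names follow the public SWE-bench dataset (`instance_id`, `repo`, `base_commit`, `problem_statement`, `patch`, `FAIL_TO_PASS`); whether mdarena emits exactly these fields is an assumption, and the helper functions and sample values are hypothetical placeholders.

```python
import json
import tempfile
from pathlib import Path


def export_tasks(tasks: list[dict], path: Path) -> None:
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as fh:
        for task in tasks:
            fh.write(json.dumps(task, ensure_ascii=False) + "\n")


def import_tasks(path: Path) -> list[dict]:
    """Read a JSONL file back into a list of task dicts, skipping blank lines."""
    tasks = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                tasks.append(json.loads(line))
    return tasks


# Hypothetical task using SWE-bench-style field names; all values are placeholders.
example_task = {
    "instance_id": "owner__repo-1",
    "repo": "owner/repo",
    "base_commit": "<commit the PR was based on>",
    "problem_statement": "<PR title and description>",
    "patch": "<diff of the merged PR>",
    "FAIL_TO_PASS": ["<test that should pass once the patch is applied>"],
}

with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "tasks.jsonl"
    export_tasks([example_task], jsonl_path)
    round_tripped = import_tasks(jsonl_path)

print(round_tripped[0]["instance_id"])
```

Keeping each task on its own line means a large task set can be streamed or appended to without parsing the whole file.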