REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage
Smriti Jha, Matteo Paltenghi, Chandra Maddila, Vijayaraghavan Murali, Shubham Ugare, Satish Chandra

TL;DR
REAP is an automated pipeline that creates reliable, production-like benchmarks for AI coding agents from real developer sessions, enabling faster and more accurate evaluation.
Contribution
The paper introduces REAP, an automated curation method that constructs in-distribution benchmarks from production data without manual labeling, with the goal of producing trustworthy evaluation signals.
Findings
The Harvest benchmark covers more than four programming languages.
Solve rates range from 42.9% to 58.2% across models.
REAP improves evaluation reliability for AI coding agents.
Abstract
Production deployment of AI coding agents requires fast, reproducible evaluation signals. Existing industrial practices trade off speed and fidelity: online A/B testing takes weeks and risks user experience, shadow deployment yields signals that are not reproducible across runs, and public benchmarks diverge from production workloads in language distribution, prompt style, and codebase structure. This paper presents REAP (Relevance and Execution-Audited Pipeline), an automated curation pipeline that constructs production-derived benchmarks from real developer-agent sessions without manual labeling. Such curation, while in-distribution to production usage, runs into several challenges. Untestable prompts, misaligned tests, and test flakiness all compromise evaluation reliability. While tasks can be manually audited to ensure only high-quality tasks remain in the benchmark, this approach…
