REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage

Smriti Jha; Matteo Paltenghi; Chandra Maddila; Vijayaraghavan Murali; Shubham Ugare; Satish Chandra

arXiv:2604.01527 · cs.SE · May 12, 2026

TL;DR

REAP is an automated pipeline that creates reliable, production-like benchmarks for AI coding agents from real developer sessions, enabling faster and more accurate evaluation.

Contribution

The paper introduces REAP, an automated curation method that constructs in-distribution benchmarks from production data without manual labeling, with the aim of producing trustworthy evaluation signals.

Findings

01. The Harvest benchmark covers more than four programming languages.

02. Solve rates range from 42.9% to 58.2% across models.

03. REAP improves evaluation reliability for AI coding agents.

Abstract

Production deployment of AI coding agents requires fast, reproducible evaluation signals. Existing industrial practices trade off speed and fidelity: online A/B testing takes weeks and risks user experience, shadow deployment yields signals that are not reproducible across runs, and public benchmarks diverge from production workloads in language distribution, prompt style, and codebase structure. This paper presents REAP (Relevance and Execution-Audited Pipeline), an automated curation pipeline that constructs production-derived benchmarks from real developer-agent sessions without manual labeling. Such curation, while in-distribution to production usage, runs into several challenges. Untestable prompts, misaligned tests, and test flakiness all compromise evaluation reliability. While tasks can be manually audited to ensure only high-quality tasks remain in the benchmark, this approach…
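
To make the test flakiness concern in the abstract concrete, here is a minimal sketch of one generic way to screen out tasks whose tests do not give a stable verdict. It is not REAP's actual procedure, which this excerpt does not describe. The names `is_flaky` and `filter_stable_tasks`, the `repeats` parameter, and the representation of a task as a zero-argument callable returning pass/fail are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def is_flaky(run_test: Callable[[], bool], repeats: int = 5) -> bool:
    """Return True if repeated runs of the same test disagree with each other.

    `run_test` is a hypothetical zero-argument callable that executes a task's
    test and returns True (pass) or False (fail).
    """
    outcomes = {run_test() for _ in range(repeats)}
    return len(outcomes) > 1


def filter_stable_tasks(
    tasks: Dict[str, Callable[[], bool]], repeats: int = 5
) -> List[str]:
    """Keep only the task IDs whose test gives the same verdict on every run."""
    return [
        task_id
        for task_id, run_test in tasks.items()
        if not is_flaky(run_test, repeats)
    ]


def make_alternating() -> Callable[[], bool]:
    """Build a deterministic stand-in for a flaky test: fail, pass, fail, ..."""
    state = {"calls": 0}

    def run() -> bool:
        state["calls"] += 1
        return state["calls"] % 2 == 0

    return run


if __name__ == "__main__":
    candidate_tasks = {
        "always_pass": lambda: True,
        "always_fail": lambda: False,
        "alternating": make_alternating(),
    }
    # Only the two tasks with a consistent verdict survive the screen.
    print(filter_stable_tasks(candidate_tasks, repeats=4))
```

Repeated execution only detects flakiness that shows up within the chosen number of runs, so a real pipeline would likely pair a check like this with other signals. The other two issues the abstract names, untestable prompts and misaligned tests, need different checks and are not covered by this sketch.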
