A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation
Xianpeng (Simon) Sun, Haonan Sun, Tian Yu, Sheng Ma, Qincheng Zhang, Lifei Rao, Chen Tian

TL;DR
This paper introduces a time-consistent benchmark for evaluating repository-aware software engineering systems, emphasizing temporal validity and prompt construction effects on model performance.
Contribution
It proposes a benchmark methodology that enforces temporal consistency and holds prompt variables constant, so that measured differences between software engineering systems are not artifacts of contamination or prompt design.
Findings
File-level F1 reaches over 0.80 with guided prompts.
Prompt construction significantly impacts model performance.
Temporal consistency is crucial for valid repository-aware evaluation.
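The summary above does not say how file-level F1 is computed. A minimal sketch, assuming the common definition (the harmonic mean of precision and recall over the set of files an agent edits versus the set of files changed in the reference pull request), might look like the following. The function name `file_level_f1` and the handling of empty sets are illustrative assumptions, not the paper's stated procedure.

```python
from typing import Iterable


def file_level_f1(predicted: Iterable[str], actual: Iterable[str]) -> float:
    """F1 over file paths: predicted edits vs. files changed in the reference PR.

    Assumption: paths are compared as exact strings after de-duplication.
    """
    pred, gold = set(predicted), set(actual)
    if not pred and not gold:
        # Assumed convention: nothing to change and nothing changed is a perfect match.
        return 1.0
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, an agent that edits `a.py` and `b.py` when the reference pull request changed `a.py` and `c.py` has precision 0.5 and recall 0.5, so its file-level F1 is 0.5.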
Abstract
Evaluation of repository-aware software engineering systems is often confounded by synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. We present a time-consistent benchmark methodology that snapshots a repository at time T0, constructs repository-derived code knowledge using only artifacts available before T0, and evaluates on engineering tasks derived from pull requests merged in the future interval (T0, T1]. Each historical pull request is transformed into a natural-language task through an LLM-assisted prompt-generation pipeline, and the benchmark is formalized as a matched A/B comparison in which the same software engineering agent is evaluated with and without repository-derived code knowledge while all other variables are held constant. We also report a baseline characterization study on two open-source…
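The temporal split and the matched A/B comparison described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the `PullRequest` record, the `split_by_time` and `paired_deltas` helpers, and the choice to treat a pull request merged exactly at T0 as part of the snapshot are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical record; only the merge timestamp matters for the split."""
    number: int
    merged_at: datetime


def split_by_time(
    prs: List[PullRequest], t0: datetime, t1: datetime
) -> Tuple[List[PullRequest], List[PullRequest]]:
    """Separate PRs usable for knowledge construction from PRs used as tasks.

    Knowledge side: merged at or before T0 (assumption: T0 itself belongs
    to the snapshot, since the task interval is open at T0).
    Task side: merged in the half-open interval (T0, T1].
    PRs merged after T1 fall in neither group.
    """
    knowledge = [pr for pr in prs if pr.merged_at <= t0]
    tasks = [pr for pr in prs if t0 < pr.merged_at <= t1]
    return knowledge, tasks


def paired_deltas(
    with_knowledge: Dict[int, float], without_knowledge: Dict[int, float]
) -> Dict[int, float]:
    """Per-task score difference for the same agent under the two conditions.

    Only tasks scored in both conditions are compared, so each delta is a
    matched pair with everything else held constant.
    """
    shared = with_knowledge.keys() & without_knowledge.keys()
    return {task: with_knowledge[task] - without_knowledge[task] for task in sorted(shared)}
```

Keeping the interval open at T0 and closed at T1 means no pull request can appear on both the knowledge side and the task side, which is the property the benchmark relies on to rule out temporal contamination.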
