A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

Xianpeng (Simon) Sun; Haonan Sun; Tian Yu; Sheng Ma; Qincheng Zhang; Lifei Rao; Chen Tian

arXiv:2603.26137 · cs.SE · March 30, 2026

TL;DR

This paper introduces a time-consistent benchmark for evaluating repository-aware software engineering systems, emphasizing temporal validity and prompt construction effects on model performance.

Contribution

It proposes a novel benchmark methodology that ensures temporal consistency and controls prompt variables, enabling more valid evaluation of software engineering models.

Findings

01. File-level F1 reaches over 0.80 with guided prompts.

02. Prompt construction significantly impacts model performance.

03. Temporal consistency is crucial for valid repository-aware evaluation.
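
The summary does not define file-level F1. As a rough illustration, the sketch below assumes the common set-based definition: the harmonic mean of precision and recall over the files an agent changes, compared with the files changed in the original pull request. The function name and the example file paths are hypothetical.

```python
def file_level_f1(predicted_files, actual_files):
    """Set-based F1 over file paths (assumed definition, not taken from the paper)."""
    predicted = set(predicted_files)
    actual = set(actual_files)
    if not predicted and not actual:
        return 1.0  # nothing to change and nothing changed: treat as a perfect match
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the agent edits three files, two of which the real PR touched.
score = file_level_f1(
    ["src/a.py", "src/b.py", "src/c.py"],
    ["src/a.py", "src/b.py"],
)
print(round(score, 2))  # 0.8
```

Here precision is 2/3 and recall is 1, so the score is 0.8.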

Abstract

Evaluation of repository-aware software engineering systems is often confounded by synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. We present a time-consistent benchmark methodology that snapshots a repository at time T0, constructs repository-derived code knowledge using only artifacts available before T0, and evaluates on engineering tasks derived from pull requests merged in the future interval (T0, T1]. Each historical pull request is transformed into a natural-language task through an LLM-assisted prompt-generation pipeline, and the benchmark is formalized as a matched A/B comparison in which the same software engineering agent is evaluated with and without repository-derived code knowledge while all other variables are held constant. We also report a baseline characterization study on two open-source…
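
The temporal split and the matched A/B comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Artifact` and `PullRequest` types, the `split_by_time` and `matched_ab` functions, the `run_agent` callable, and the example dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Artifact:
    path: str
    created_at: datetime


@dataclass(frozen=True)
class PullRequest:
    number: int
    merged_at: datetime


def split_by_time(artifacts, pull_requests, t0, t1):
    """Return (knowledge, tasks) so that no evaluation task predates the knowledge cutoff."""
    if not t0 < t1:
        raise ValueError("t0 must be earlier than t1")
    # Knowledge uses only artifacts available strictly before T0.
    knowledge = [a for a in artifacts if a.created_at < t0]
    # Tasks come from pull requests merged in the half-open interval (T0, T1].
    tasks = [pr for pr in pull_requests if t0 < pr.merged_at <= t1]
    return knowledge, tasks


def matched_ab(tasks, run_agent, knowledge):
    """Run the same agent on every task without (A) and with (B) repository knowledge."""
    return [(task, run_agent(task, None), run_agent(task, knowledge)) for task in tasks]


# Hypothetical data: one artifact and one pull request on each side of T0.
t0, t1 = datetime(2025, 1, 1), datetime(2025, 4, 1)
artifacts = [
    Artifact("docs/design.md", datetime(2024, 11, 5)),
    Artifact("docs/new_api.md", datetime(2025, 2, 1)),
]
pull_requests = [
    PullRequest(101, datetime(2024, 12, 20)),
    PullRequest(102, datetime(2025, 3, 15)),
]
knowledge, tasks = split_by_time(artifacts, pull_requests, t0, t1)
print([a.path for a in knowledge])  # ['docs/design.md']
print([pr.number for pr in tasks])  # [102]
```

Passing the same `run_agent` callable to both arms is what keeps the comparison matched: only the presence of the repository-derived knowledge differs between the two runs.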
