RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing
Xuefeng Li, Nir Ben-Israel, Yotam Raz, Belal Ahmed, Doron Serebro, Antoine Raux

TL;DR
RepoMod-Bench introduces a comprehensive, implementation-agnostic benchmark for evaluating code repository modernization across multiple languages and repository sizes, revealing significant challenges in scaling autonomous code modernization.
Contribution
It presents a new benchmark framework with standardized interfaces and black-box testing for repository-level code modernization, addressing limitations of prior small-scale, language-specific benchmarks.
Findings
Pass rates drop from 91.3% on small projects to 15.3% on large ones.
The benchmark covers 21 repositories across 8 languages with 1.6M lines of code.
Scaling remains a major challenge for autonomous code modernization.
Abstract
The evolution of AI coding agents has shifted the frontier from simple snippet completion to autonomous repository-level engineering. However, evaluating these agents remains ill-posed in general code repository generation, where the lack of deterministic ground truth leads to ambiguous metrics. Code modernization via automated translation offers a more rigorous alternative by providing a fixed ground truth -- the source repository; yet existing benchmarks are limited to small-scale repositories and rely on language-specific unit tests visible to the agent, allowing test-driven overfitting. We address these limitations by introducing a benchmarking framework for repository-level code modernization built on an implementation-agnostic evaluation paradigm. This framework is instantiated through RepoMod-Bench: a benchmark of 21 real-world repositories with standardized interfaces,…
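The implementation-agnostic idea in the abstract can be illustrated with a minimal sketch. It assumes each implementation is reachable through a shared command-line interface and is judged only on its exit code and standard output. The names `Case`, `observe`, and `pass_rate` are hypothetical and are not taken from the paper, whose actual harness is not described in this excerpt.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One black-box test case (hypothetical shape): CLI arguments plus stdin text."""
    args: tuple = ()
    stdin: str = ""


def observe(command, case, timeout=10.0):
    """Run one implementation and return only its externally visible behavior."""
    result = subprocess.run(
        [*command, *case.args],
        input=case.stdin,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout


def pass_rate(reference_cmd, candidate_cmd, cases):
    """Fraction of cases where the candidate's behavior matches the reference's."""
    if not cases:
        raise ValueError("need at least one case")
    passed = sum(
        observe(reference_cmd, c) == observe(candidate_cmd, c) for c in cases
    )
    return passed / len(cases)


if __name__ == "__main__":
    # Stand-ins for a source repository and its translation (assumed, for illustration).
    reference = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
    candidate = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(pass_rate(reference, candidate, [Case(stdin="abc"), Case(stdin="XYZ")]))
```

Because the comparison never inspects the candidate's source or its language-specific unit tests, the same cases apply regardless of the target language, which is the property the abstract contrasts with agent-visible unit tests.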
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Scientific Computing and Data Management
