RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing

Xuefeng Li; Nir Ben-Israel; Yotam Raz; Belal Ahmed; Doron Serebro; Antoine Raux

arXiv:2602.22518 · cs.SE · February 27, 2026

Open Access

TL;DR

RepoMod-Bench is an implementation-agnostic benchmark for evaluating code repository modernization across multiple languages and repository sizes. Its results show that autonomous code modernization degrades sharply as repositories grow.

Contribution

The paper presents a benchmark framework with standardized interfaces and black-box testing for repository-level code modernization. It addresses two limitations of prior benchmarks: small-scale repositories and language-specific unit tests that are visible to the agent.
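To make the idea of black-box testing through a standardized interface concrete, the following is a minimal sketch, not the benchmark's actual harness. It assumes the interface is a command-line program whose observable behavior is its exit code and standard output; the names `Case`, `run_case`, and `pass_rate` are hypothetical and do not come from the paper.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Case:
    """One black-box test case (hypothetical structure, for illustration)."""
    args: list[str]        # command-line arguments passed to the program
    stdin: str             # text fed on standard input
    expected_stdout: str   # output recorded from the source implementation


def run_case(command: list[str], case: Case, timeout: float = 10.0) -> bool:
    """Return True if the program's observable output matches the expectation.

    Only the external behavior is inspected, so the same case applies to the
    source repository and to a translation written in any other language.
    """
    try:
        proc = subprocess.run(
            command + case.args,
            input=case.stdin,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and proc.stdout == case.expected_stdout


def pass_rate(command: list[str], cases: list[Case]) -> float:
    """Fraction of cases passed; defined as 0.0 for an empty suite."""
    if not cases:
        return 0.0
    passed = sum(run_case(command, c) for c in cases)
    return passed / len(cases)


if __name__ == "__main__":
    # Stand-in for a translated program: upper-cases its standard input.
    translated = [
        sys.executable,
        "-c",
        "import sys; print(sys.stdin.read().upper(), end='')",
    ]
    suite = [Case([], "abc", "ABC"), Case([], "x", "y")]
    print(pass_rate(translated, suite))
```

Because the harness never imports or inspects the implementation, the agent under evaluation cannot tailor its translation to language-specific unit tests, which is the overfitting risk the paper attributes to earlier benchmarks.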

Findings

01. Pass rates drop from 91.3% on small projects to 15.3% on large ones.

02. The benchmark covers 21 repositories across 8 languages with 1.6M lines of code.

03. Scaling remains a major challenge for autonomous code modernization.

Abstract

The evolution of AI coding agents has shifted the frontier from simple snippet completion to autonomous repository-level engineering. However, evaluating these agents remains ill-posed in general code repository generation, where the lack of deterministic ground truth leads to ambiguous metrics. Code modernization via automated translation offers a more rigorous alternative by providing a fixed ground truth -- the source repository; yet existing benchmarks are limited to small-scale repositories and rely on language-specific unit tests visible to the agent, allowing test-driven overfitting. We address these limitations by introducing a benchmarking framework for repository-level code modernization built on an implementation-agnostic evaluation paradigm. This framework is instantiated through RepoMod-Bench: a benchmark of 21 real-world repositories with standardized interfaces,…

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Scientific Computing and Data Management