Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction

Zhe Zhang; Runlin Liu; Aishan Liu; Xingyu Liu; Xiang Gao; Hailong Sun

arXiv:2508.07180·cs.SE·February 4, 2026

Open Access 3 Reviews

TL;DR

This paper introduces CODE2BENCH, a scalable, rigorous benchmark for evaluating code-generating LLMs by combining dynamic problem sources with high-coverage, property-based testing to reveal true model capabilities.

Contribution

It proposes Dual Scaling, a new philosophy for benchmark construction, and instantiates it in CODE2BENCH, integrating real-world code sources with automated high-coverage testing for improved evaluation.

Findings

01 Models perform better on API tasks than algorithmic synthesis.

02 Model performance is significantly influenced by the target language ecosystem.

03 Rigorous testing uncovers an illusion of correctness in simpler benchmarks.

Abstract

The evaluation of code-generating Large Language Models (LLMs) is fundamentally constrained by two intertwined challenges: a reliance on static, easily contaminated problem sources and the use of superficial, low-rigor testing. This paper introduces a new benchmark construction philosophy, Dual Scaling, designed to systematically address both limitations. Our approach involves continuously scaling the source of problems from dynamic, real-world code repositories and systematically scaling the rigor of tests via automated, high-coverage Property-Based Testing (PBT). We instantiate this philosophy in CODE2BENCH, an end-to-end framework that leverages Scope Graph analysis for principled dependency classification and a 100% branch coverage quality gate to ensure test suite integrity. Using this framework, we construct CODE2BENCH-2509, a new benchmark suite with native instances in both…
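The pairing of Property-Based Testing with a branch coverage gate can be sketched as follows. This is a minimal illustration under stated assumptions, not the CODE2BENCH pipeline: `reference_clamp`, `buggy_candidate`, `generate_inputs`, `coverage_gate`, and `pass_rate` are all hypothetical names, the branch instrumentation is done by hand, and inputs come from the standard-library `random` module rather than a dedicated PBT library.

```python
import random

ALL_BRANCHES = {"below", "above", "inside"}


def reference_clamp(x, lo, hi, hit=None):
    """Hypothetical ground-truth function, standing in for one mined from a repository.

    When `hit` is a set, the label of the executed branch is recorded in it.
    """
    if x < lo:
        if hit is not None:
            hit.add("below")
        return lo
    if x > hi:
        if hit is not None:
            hit.add("above")
        return hi
    if hit is not None:
        hit.add("inside")
    return x


def buggy_candidate(x, lo, hi):
    """Hypothetical model output that forgets the upper bound."""
    return max(lo, x)


def generate_inputs(n, seed=0):
    """Randomly generate n (x, lo, hi) triples with lo <= hi."""
    rng = random.Random(seed)
    inputs = []
    for _ in range(n):
        lo = rng.randint(-50, 50)
        hi = lo + rng.randint(0, 50)
        x = rng.randint(-100, 100)
        inputs.append((x, lo, hi))
    return inputs


def coverage_gate(inputs):
    """Accept a test suite only if it exercises every branch of the reference."""
    hit = set()
    for x, lo, hi in inputs:
        reference_clamp(x, lo, hi, hit)
    return hit == ALL_BRANCHES


def pass_rate(candidate, inputs):
    """Fraction of inputs on which the candidate agrees with the reference."""
    agree = sum(candidate(x, lo, hi) == reference_clamp(x, lo, hi)
                for x, lo, hi in inputs)
    return agree / len(inputs)


sparse_suite = [(5, 0, 10)]          # one hand-written example
random_suite = generate_inputs(200)  # generated suite

print(coverage_gate(sparse_suite), pass_rate(buggy_candidate, sparse_suite))
print(coverage_gate(random_suite), pass_rate(buggy_candidate, random_suite))
```

In this toy setup the single hand-written example fails the coverage gate yet lets the buggy candidate pass, while the generated suite clears the gate and exposes the bug, which is the kind of gap the abstract attributes to low-rigor testing.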

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

1. The benchmark designs rigorous and strong test cases. It not only accounts for edge cases but also ensures complete test coverage, substantially outperforming other benchmarks that rely on sparse test examples, which may lead to incorrectly judged “pass” cases.
2. The paper provides a carefully designed implementation in both Python and Java, addressing not only translation between languages but also their distinct type systems and library ecosystems. This enables meaningful cross-language c

Weaknesses

1. Although CODE2BENCH draws its source data from real repositories, the benchmark tasks remain function-level and isolated. This design simplifies testing but does not capture cross-function or module-level dependencies, which are prevalent in real-world software engineering. As such, the benchmark evaluates isolated reasoning rather than full software generation ability or collaborative code development.
2. As mentioned by the authors, real-world code often includes numerous defensive branche

Reviewer 02·Rating 2·Confidence 4

Strengths

1. Important problem: Addresses real limitations in LLM code evaluation, namely data contamination and superficial testing.
2. Solid engineering: Scope Graph analysis for dependency classification is technically sound. Property-Based Testing with 100% branch coverage demonstrates rigor. The framework automates benchmark construction.
3. Open artifacts: The authors release code, data, and results.

Weaknesses

1. Lack of direct comparison with existing benchmarks: For a benchmark paper, it is crucial to demonstrate how the new benchmark compares with existing ones when evaluating the same models. The paper only shows Table 1 comparing characteristics (Dynamic, Rigorous Test, etc.) but lacks a direct performance comparison with these baselines. Without evaluating the same 10 models on existing benchmarks, it is impossible to determine whether CODE2BENCH provides unique insights, whether the lower pass rate

Reviewer 03·Rating 4·Confidence 4

Strengths

- The 100% stringent branch coverage gate and large PBT-generated suites substantially reduce false positives and expose "near-perfect" failures that many benchmarks miss.
- The outcome spectrum and “diagnostic fingerprints” provide more granular failure analysis (SyntaxErr/RuntimeErr/LogicErr vs. partial pass bands), illuminating the algorithmic synthesis vs. API-application divide and the role of language typing in error suppression.
- WSC-Python spans >35 libraries; SC-Java demonstrates multi

Weaknesses

- Some of the figures in the paper are not very clear or visually polished. For example, in Figure 2, there is noticeable overlap between text elements and between text and icons, which affects readability. Improving the clarity and layout of the figures would make the presentation more professional and easier to interpret.
- In the Related Work section, the authors assert that existing live benchmarks rely on narrow or specific data sources. However, Code2Bench is also curated from specific Git

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software System Performance and Reliability