ScaleBox: Enabling High-Fidelity and Scalable Code Verification for Large Language Models

Jiasheng Zheng; Xin Zheng; Boxi Cao; Pengbo Wang; Zhengzhao Ma; Qiming Zhu; Jiazhen Jiang; Yaojie Lu; Hongyu Lin; Xianpei Han; Le Sun

arXiv:2604.27467·cs.SE·May 1, 2026

TL;DR

ScaleBox is a code sandbox system for large language models that targets accurate verification and efficient execution under high-concurrency workloads, supporting both RL training and evaluation.

Contribution

It introduces automated special-judge generation and management, fine-grained parallel execution across test cases with multi-node coordination, and a configuration-driven evaluation suite for reproducible benchmarking.

Findings

01

Significantly improves verification accuracy and efficiency.

02

Improves RLVR training stability and performance on LiveCodeBench.

03

Outperforms heuristic baselines in experiments.

Abstract

Code sandboxes have emerged as a critical infrastructure for advancing the coding capabilities of large language models, providing verifiable feedback for both RL training and evaluation. However, existing systems fail to provide accurate verification and efficiency under high-concurrency workloads. We present ScaleBox, a high-fidelity and scalable system designed to address these limitations in large-scale code training. ScaleBox introduces automated special-judge generation and management, fine-grained parallel execution across test cases with seamless multi-node coordination, and a configuration-driven evaluation suite for reproducible benchmarking. A series of experiments demonstrates that ScaleBox significantly enhances code verification accuracy and efficiency. Our further RLVR experiments show that ScaleBox substantially improves both performance on LiveCodeBench and training…
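To make two of the ideas in the abstract concrete, here is a minimal Python sketch of a special judge and per-test-case parallel execution. It is an illustration under stated assumptions, not ScaleBox's implementation or API. All names (`exact_judge`, `float_judge`, `verify`) are hypothetical, the "solution" is an in-process Python callable standing in for a sandboxed program, and a thread pool stands in for multi-node scheduling.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# Hypothetical signature for illustration: (input, expected, actual) -> verdict.
Judge = Callable[[str, str, str], bool]


def exact_judge(test_input: str, expected: str, actual: str) -> bool:
    """Default judge: compare whitespace-separated tokens exactly."""
    return expected.split() == actual.split()


def float_judge(test_input: str, expected: str, actual: str) -> bool:
    """Special judge: accept numeric answers within an absolute tolerance."""
    tol = 1e-6  # assumed tolerance; a real problem statement would define it
    exp, act = expected.split(), actual.split()
    if len(exp) != len(act):
        return False
    try:
        return all(abs(float(e) - float(a)) <= tol for e, a in zip(exp, act))
    except ValueError:
        return False  # non-numeric output is a wrong answer


def verify(
    solution: Callable[[str], str],
    tests: List[Tuple[str, str]],
    judge: Judge = exact_judge,
    workers: int = 4,
) -> List[bool]:
    """Run each test case as its own task and judge it independently."""

    def run_one(case: Tuple[str, str]) -> bool:
        test_input, expected = case
        try:
            actual = solution(test_input)
        except Exception:
            return False  # a crash counts as a failed test case
        return judge(test_input, expected, actual)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, tests))  # results keep test order


if __name__ == "__main__":
    tests = [("1", "0.333333"), ("3", "1.0")]
    solution = lambda s: str(int(s) / 3)
    print(verify(solution, tests))                     # [False, True]
    print(verify(solution, tests, judge=float_judge))  # [True, True]
```

The first test shows why a special judge matters: the solution's answer is correct to the printed precision, but exact string matching rejects it, which would turn a correct program into a false negative in the reward signal.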
