CodeScaler: Scaling Code LLM Training and Test-Time Inference via Reward Models
Xiao Zhu, Xinyu Zhou, Boyu Zhu, Hanxu Hu, Mingzhe Du, Haotian Zhang, Huiming Wang, Zhijiang Guo

TL;DR
CodeScaler is a reward model for code LLMs, trained on curated preference data and paired with syntax-aware code extraction. It improves reinforcement learning training and cuts test-time inference latency without relying on unit tests.
Contribution
It introduces a scalable reward model trained on curated preferences and synthetic data, improving code generation and inference efficiency without relying on test cases.
Findings
Outperforms execution-based RL by +1.55 points on Qwen3-8B-Base and +4.23 points on Qwen3-14B-Base across four coding benchmarks.
Scaling to 44K problems with additional synthetic data yields a +14.64-point improvement over base models.
Achieves performance similar to unit-test approaches at test time with a 10-fold reduction in latency.
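One common way to use a reward model at test time in place of unit tests is best-of-N reranking: sample several candidate programs, score each one, and keep the highest-scoring candidate. The sketch below illustrates that idea only and is not the paper's implementation. `best_of_n` and `toy_score` are hypothetical names, and `toy_score` is a toy stand-in for a learned reward model.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score.

    `score` is a hypothetical stand-in for a learned reward model.
    No unit tests are executed, which is where the latency saving comes from.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_score(code: str) -> float:
    # Illustration only: favor code that returns a value, lightly penalize length.
    return (1.0 if "return" in code else 0.0) - 0.001 * len(code)


if __name__ == "__main__":
    samples = ["def f(x):\n    pass", "def f(x):\n    return x + 1"]
    print(best_of_n(samples, toy_score))
```

In practice `score` would wrap a forward pass of the reward model over the problem statement and the candidate, so selection cost grows with N model calls rather than N sandboxed test runs.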
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, a reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across four coding benchmarks, CodeScaler consistently outperforms execution-based RL by +1.55 points on Qwen3-8B-Base and +4.23 points on Qwen3-14B-Base. By further scaling to 44K problems with additional synthetic data, CodeScaler yields +14.64 points…
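The abstract names syntax-aware code extraction and validity-preserving reward shaping but this summary gives no details. The sketch below shows one plausible reading, under loudly stated assumptions: code is taken from the last fenced block that parses as Python, responses with no parseable code get a fixed penalty, and reward-model scores for valid code are clamped to [0, 1] so that valid code always ranks above invalid code. The names `extract_code`, `shaped_reward`, and `INVALID_REWARD`, and the specific penalty and clamp values, are hypothetical and not taken from the paper.

```python
import ast
import re
from typing import Callable, Optional

# Matches fenced blocks with an optional language tag. Writing the fence as
# `{3} avoids placing a literal fence inside this example.
_BLOCK = re.compile(r"`{3}[\w+-]*[ \t]*\n(.*?)`{3}", re.DOTALL)

INVALID_REWARD = -1.0  # assumed penalty for responses with no parseable code


def extract_code(response: str) -> Optional[str]:
    """Return the last non-empty fenced block that parses as Python, else None."""
    for block in reversed(_BLOCK.findall(response)):
        if not block.strip():
            continue
        try:
            ast.parse(block)
        except (SyntaxError, ValueError):
            continue
        return block
    return None


def shaped_reward(response: str, score: Callable[[str], float]) -> float:
    """Score parseable code with the reward model and penalize everything else.

    `score` is a hypothetical stand-in for a learned reward model. Clamping
    valid scores to [0, 1] keeps every valid program strictly above
    INVALID_REWARD (an assumed design, not the paper's stated rule).
    """
    code = extract_code(response)
    if code is None:
        return INVALID_REWARD
    return min(1.0, max(0.0, score(code)))
```

Checking syntax before scoring means the policy cannot earn reward from text that merely looks like code, which is one way a shaping rule could keep optimization stable.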
Taxonomy
Topics: Topic Modeling · Software Testing and Debugging Techniques · Domain Adaptation and Few-Shot Learning
