CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning

Ji Shi; Peiming Guo; Meishan Zhang; Miao Zhang; Xuebo Liu; Min Zhang; Weili Guan

arXiv:2601.22803·cs.AI·February 2, 2026

TL;DR

CVeDRL introduces a reinforcement learning-based code verifier that uses difficulty-aware rewards and static analysis to improve unit test effectiveness and efficiency for large language model-generated code.

Contribution

The paper presents a novel RL framework with syntax- and difficulty-aware rewards, achieving state-of-the-art verification performance with fewer parameters and faster inference.

Findings

1. Up to 28.97% higher pass rate compared to GPT-3.5
2. 15.08% higher branch coverage
3. Over 20 times faster inference than baselines

Abstract

Code verifiers play a critical role in post-verification for LLM-based code generation, yet existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. While reinforcement learning (RL) offers a promising alternative by optimizing models through execution-driven rewards without labeled supervision, our preliminary results show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples. We first show theoretically that branch coverage, sample difficulty, and syntactic and functional correctness can be jointly modeled as RL rewards, and that optimizing these signals can improve the reliability of unit-test-based verification. Guided by this analysis, we design syntax- and functionality-aware rewards and further propose branch- and sample-difficulty-aware RL using…
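As a rough illustration of what a syntax- and functionality-aware reward could look like, the sketch below scores a generated unit test against a reference solution. It is not the paper's reward: the function names, the 0.2/0.8 weights, and the rule that an unparsable test earns zero are all assumptions made for this example, and `exec` is used only for brevity (a real pipeline would run tests in a sandbox).

```python
import ast


def syntax_reward(test_code: str) -> float:
    """Return 1.0 if the generated test parses as Python, else 0.0."""
    try:
        ast.parse(test_code)
    except (SyntaxError, ValueError):
        return 0.0
    return 1.0


def functionality_reward(solution_code: str, test_code: str) -> float:
    """Return 1.0 if the test runs without error against the solution.

    NOTE: exec() is unsafe on untrusted code; this is for illustration only.
    """
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the reference solution
        exec(test_code, namespace)      # run the generated assertions
    except Exception:
        return 0.0
    return 1.0


def combined_reward(solution_code: str, test_code: str,
                    w_syntax: float = 0.2, w_func: float = 0.8) -> float:
    """Hypothetical weighted sum; weights are assumptions, not from the paper."""
    s = syntax_reward(test_code)
    if s == 0.0:
        return 0.0  # an unparsable test cannot be executed at all
    f = functionality_reward(solution_code, test_code)
    return w_syntax * s + w_func * f


solution = "def add(a, b):\n    return a + b\n"
passing_test = "assert add(1, 2) == 3\n"
failing_test = "assert add(1, 2) == 4\n"
broken_test = "assert add(1, 2 ==\n"
```

Under these assumptions, a test that parses but fails still earns the syntax share of the reward, which gives the policy a gradient toward well-formed output before it learns correct assertions.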

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- CVeDRL 0.6B achieves good inference efficiency improvements (>20x throughput) compared to SFT-based models like CodeRM 8B, while maintaining competitive performance.
- The paper reports impressive performance across benchmarks, such as an 83.68% pass rate on MBPP+ despite its small size.
- The exponential reward shaping for branch coverage and the integration of static complexity metrics represent a creative approach to addressing boundary conditions.
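To make the idea of exponential reward shaping for branch coverage concrete, here is a minimal sketch. The exact shaping function used in the paper is not given in this summary, so the normalized form below, the name `shaped_coverage_reward`, and the default `alpha` are assumptions chosen only to show the qualitative effect: the last, hardest branches are worth more than the first, easy ones.

```python
import math


def shaped_coverage_reward(coverage: float, alpha: float = 3.0) -> float:
    """Map branch coverage in [0, 1] to a convex reward in [0, 1].

    Hypothetical form: (exp(alpha * c) - 1) / (exp(alpha) - 1).
    Larger alpha concentrates more reward on the final branches.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be in [0, 1]")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return math.expm1(alpha * coverage) / math.expm1(alpha)


# Reward gained by covering the first 20% vs. the last 20% of branches.
early_gain = shaped_coverage_reward(0.2) - shaped_coverage_reward(0.0)
late_gain = shaped_coverage_reward(1.0) - shaped_coverage_reward(0.8)
```

With a linear reward both gains would be equal; under this convex shaping `late_gain` exceeds `early_gain`, which is one way to push a policy toward rarely exercised branches.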

Weaknesses

- While the paper describes how HC and MI are integrated, it lacks rigorous justification for why these specific metrics improve upon existing dynamic reward approaches (no comparative studies, see point below). The geometric mean combination appears ad-hoc without theoretical grounding.
- The paper lacks comparisons with other RL-based verifiers of similar scale using dynamic rewards. This omission makes it difficult to isolate the contribution of the static difficulty metrics.
- The paper focu
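The geometric-mean combination questioned above can be illustrated with a small sketch. It assumes that HC and MI refer to the Halstead and maintainability metrics named by Reviewer 04, and that each has already been rescaled to a difficulty score in [0, 1]; the rescaling, the function names, and the comparison with an arithmetic mean are assumptions for this example, not the paper's formulation.

```python
import math


def geometric_difficulty(hc_norm: float, mi_norm: float) -> float:
    """Geometric mean of two normalized difficulty scores in [0, 1]."""
    for name, value in (("hc_norm", hc_norm), ("mi_norm", mi_norm)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return math.sqrt(hc_norm * mi_norm)


def arithmetic_difficulty(hc_norm: float, mi_norm: float) -> float:
    """Arithmetic mean, shown only as a point of comparison."""
    return 0.5 * (hc_norm + mi_norm)


# A sample that one metric rates as trivial and the other as hard.
geo = geometric_difficulty(0.0, 0.9)
ari = arithmetic_difficulty(0.0, 0.9)
```

The practical difference is that the geometric mean requires both metrics to agree that a sample is hard: if either score is zero the combined difficulty is zero, whereas the arithmetic mean still reports a moderate value.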

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The combination of software engineering quality metrics with the GRPO rewards is creative.
2. The motivation of this paper is convincing. Verification generated by LLMs is an important part of verifying LLM-generated code.
3. The experiments are thorough, spanning different models and benchmarks.

Weaknesses

1. The verification model has a narrow scope in the test cases it can generate. I'm not sure how it would generalize to larger problems in repo-level coding.
2. The training relies on code solutions that are appropriate and correct, which are not readily available for new coding domains.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- Novel reward design: Combines syntax–functionality rewards with branch-difficulty-aware and sample-difficulty-aware reinforcement learning, a formulation not seen in prior RL-for-testing literature.
- Theoretical grounding: Derives a quantitative bound linking test-case reliability, branch coverage, and verification confidence, providing rare analytical rigor for code-verification RL work.
- Empirical gains: Achieves superior pass rate, branch coverage, and efficiency (20× faster inference)

Weaknesses

- Limited novelty in RL backbone: Uses GRPO (a known variant of PPO) with mostly standard RL fine-tuning procedures; the conceptual innovation mainly lies in reward shaping rather than algorithmic foundations.
- Lack of evaluation across model architectures: The paper claims good performance for “0.6B”-scale models, but evaluates the approach only on the Qwen3-0.6B-base model. Testing on more model architectures with a similar number of parameters would help justify the claim.
- Lack of
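For readers unfamiliar with GRPO, its distinguishing step is that each sampled response is scored relative to the other responses drawn for the same prompt, so no learned value model is needed. The sketch below shows only that normalization step; the function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration and do not describe the paper's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of samples for the same prompt.

    advantage_i = (r_i - mean(group)) / (std(group) + eps)
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four unit tests sampled for one problem, two of which earned reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every sample gets the same reward, the group carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

The second case is why reward design matters in this setting: when all samples in a group receive identical rewards, every advantage is zero and the update for that prompt vanishes.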

Reviewer 04 · Rating 6 · Confidence 4

Strengths

1. Practical problem: improving verifiers directly benefits real code-gen systems via accuracy and cost/latency reductions.
2. Well-motivated design: syntax + functionality rewards are sensible; exponential shaping for rare branches is intuitive; difficulty-aware weighting is simple yet effective.
3. Clear deployment story: the verifier is RL-trained offline and then used at inference as a drop-in judge; the paper cleanly separates training vs usage.
4. Efficiency: empirical results suggest compa

Weaknesses

1. External validity of difficulty metrics: Halstead and maintainability indices may not generalize across languages or code styles; broader language coverage is limited.
2. Theory scope: the majority-vote reliability bound is helpful but presented at a high level; assumptions (e.g., independence of test outcomes across groups) are not tested.
3. Ablations depth: more analysis on shaping schedules, learned difficulty predictors vs static metrics, cross-language generalization, and sensitivity to
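The independence assumption raised in the second point can be made concrete with a generic calculation. The sketch below is not the paper's bound (which is not reproduced in this summary); it assumes `n` unit tests that are each correct independently with probability `p`, and compares the exact probability that a strict majority is correct with the standard Hoeffding lower bound. The function names are hypothetical.

```python
import math


def majority_correct_prob(n: int, p: float) -> float:
    """Exact P(strict majority of n independent tests is correct)."""
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n > 0 and p in [0, 1]")
    threshold = n // 2 + 1  # smallest count that is a strict majority
    return sum(math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
               for k in range(threshold, n + 1))


def hoeffding_lower_bound(n: int, p: float) -> float:
    """Lower bound 1 - exp(-2 n (p - 1/2)^2), valid for p > 1/2."""
    if p <= 0.5:
        raise ValueError("bound is only informative for p > 0.5")
    return 1.0 - math.exp(-2.0 * n * (p - 0.5) ** 2)


exact_3 = majority_correct_prob(3, 0.7)
exact_11 = majority_correct_prob(11, 0.7)
bound_11 = hoeffding_lower_bound(11, 0.7)
```

Both quantities improve with `n` only because the tests are assumed independent; if generated tests share the same blind spots, adding more of them yields less reliability than this calculation suggests, which is the reviewer's concern.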

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Adversarial Robustness in Machine Learning