Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Kongcheng Zhang, Jiale Zhao, Jingwen Yang, Yihe Zhou, Jianwei Lv, Tongya Zheng, Hengtong Lu, Wei Chen, Yan Xie, Mingli Song

TL;DR
This paper introduces RuscaRL, a reinforcement learning framework that uses checklist-style rubrics as scaffolding to improve exploration and reasoning in large language models. It targets the exploration bottleneck, in which a model cannot learn from high-quality samples that it never generates.
Contribution
RuscaRL employs explicit rubric-guided exploration and verifiable rewards to improve reasoning capabilities in LLMs, representing a new approach to RL training for general reasoning tasks.
Findings
RuscaRL outperforms baselines, including rubric-based RL, on multiple benchmarks; reviewers cite HealthBench, MedQA, MMLU-Pro, Creative Writing, and IFBench.
Rubric-guided exploration increases diversity and quality of responses.
Decaying guidance helps models internalize reasoning patterns.
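
The decaying-guidance finding can be illustrated with a minimal sketch. The text visible here does not specify the schedule, so the linear decay, the `decay_steps` parameter, and both function names below are assumptions made for illustration only, not the authors' implementation.

```python
def scaffold_ratio(step: int, decay_steps: int, start: float = 1.0) -> float:
    """Fraction of rubric items shown to the model at a given training step.

    Assumption: linear decay from `start` to 0 over `decay_steps` steps.
    The paper may use a different schedule.
    """
    if decay_steps <= 0:
        raise ValueError("decay_steps must be positive")
    remaining = max(0.0, 1.0 - step / decay_steps)
    return start * remaining


def visible_rubric(rubric: list[str], ratio: float) -> list[str]:
    """Keep only the leading share of rubric items given by `ratio`."""
    keep = int(ratio * len(rubric) + 1e-9)  # small epsilon guards float error
    return rubric[:keep]


if __name__ == "__main__":
    rubric = ["States the diagnosis", "Cites evidence", "Notes risks", "Suggests follow-up"]
    for step in (0, 50, 100):
        ratio = scaffold_ratio(step, decay_steps=100)
        print(step, visible_rubric(rubric, ratio))
```

As the ratio shrinks, the model sees fewer hints and has to produce the rubric-satisfying behavior on its own, which is the intuition behind internalizing the reasoning patterns.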
Abstract
Recent advances in Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. Despite the encouraging results, a fundamental dilemma persists as RL improvement relies on learning from high-quality samples, yet the exploration for such samples remains bounded by the inherent limitations of LLMs. This, in effect, creates an undesirable cycle in which what cannot be explored cannot be learned. In this work, we propose Rubric-Scaffolded Reinforcement Learning (RuscaRL), a novel instructional scaffolding framework designed to break the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, where different rubrics are provided as external guidance within task instructions to steer…
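
The first use of rubrics described in the abstract, providing different rubrics as guidance within the task instructions during rollout generation, could look roughly like the sketch below. The prompt wording, the linear spread of guidance across the group, and the name `build_group_prompts` are hypothetical; the truncated abstract does not give these details.

```python
def build_group_prompts(task: str, rubric: list[str], group_size: int,
                        max_ratio: float = 1.0) -> list[str]:
    """Build one prompt per rollout, each carrying a different share of the rubric.

    Assumption: rollout 0 receives the most guidance and the last rollout
    receives none, with a linear spread in between.
    """
    prompts = []
    for i in range(group_size):
        if group_size > 1:
            frac = max_ratio * (1 - i / (group_size - 1))
        else:
            frac = max_ratio
        keep = int(frac * len(rubric) + 1e-9)  # number of rubric items to reveal
        hints = rubric[:keep]
        if hints:
            checklist = "\n".join(f"- {item}" for item in hints)
            prompts.append(f"{task}\n\nA good answer should satisfy:\n{checklist}")
        else:
            prompts.append(task)  # unscaffolded rollout: the bare task only
    return prompts


if __name__ == "__main__":
    for p in build_group_prompts("Explain the treatment options.",
                                 ["Mentions side effects", "Recommends follow-up"],
                                 group_size=3):
        print(p, end="\n---\n")
```

Varying the guidance within a group means some rollouts are steered toward high-quality answers while others remain close to what the model produces unaided, so the sampled responses differ in quality and the policy has a signal to learn from.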
Peer Reviews
Decision: Submitted to ICLR 2026
1. Integrating instructional scaffolding into RL for LLMs is a strong contribution.
2. The paper demonstrates consistent gains across diverse benchmarks (HealthBench, MedQA, MMLU-Pro, Creative Writing, IFBench, etc.).
3. RuscaRL shows better sample efficiency (a steeper Best-of-N curve) and avoids entropy collapse, which is evidence of better exploration control.
1. The method’s success hinges on well-designed, domain-specific rubrics. Poorly constructed rubrics may bias training or limit diversity.
2. The paper acknowledges this but does not propose automated rubric construction or robustness checks.
3. Using rubric-guided multi-sample rollouts with LLM-as-a-Judge evaluations is computationally expensive, especially for large-scale training (many rollouts × multiple criteria × grader calls).
4. While results are strong across domains, the framework’
1. The empirical results show improvements across several domains and model sizes.
2. The motivation (mitigating exploration bottlenecks in RL for LLMs) is interesting and relevant.
3. Different ablation studies are included.
4. The results are overall interesting.
1. Overstated novelty: Rubric rewards, scaffolded prompting, and gradual hint removal resemble prior work in rubric-guided RL, chain-of-thought prompting, and curriculum learning. The related work section cites almost exclusively 2025 works; the absence of foundational literature (curriculum RL, reward shaping, structured exploration strategies) is problematic.
2. Related-work coverage: Classical literature on exploration, curriculum RL, and scaffolding is missing.
3. Insufficient exploration
1. Motivation is clear and reasonable: the paper directly targets the exploration bottleneck of RLVR in open-ended domains by integrating external guidance into task instructions to improve rollout diversity and quality, enabling controllable, scheduled exploration via rubric scaffolds.
2. Strong empirical evidence: across multiple models and diverse tasks, the method consistently outperforms strong baselines, including Rubric-based RL.
3. Insightful ablations on scaffolding: the ablation stud
1. Baseline collapse issue (Figure 5): The Rubric-based RL baseline appears to collapse around 200 training steps, with entropy exploding, which raises concerns about the validity of the comparison. It is unclear whether this collapse is caused by non-robust experimental settings or run-specific instability, or whether it reflects an intrinsic weakness of the baseline. Since RuscaRL’s main gains only emerge after 200 steps (with no obvious improvements before that point in Fig. 5b), a substantia
Taxonomy
Topics: Software Engineering Research · Natural Language Processing Techniques
