Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective
Deyang Kong, Qi Guo, Xiangyu Xi, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye

TL;DR
This paper proposes CDAS, a novel sampling method for reinforcement learning in large language models that improves reasoning accuracy and efficiency by aligning problem difficulty with model competence using historical performance data.
Contribution
It introduces a competence-difficulty alignment sampling method that provides stable difficulty estimation and adaptive problem selection based on model competence.
Findings
CDAS achieves higher accuracy than baseline methods.
CDAS significantly improves training efficiency.
CDAS is 2.33 times faster than the Dynamic Sampling strategy.
Abstract
Reinforcement learning exhibits potential in enhancing the reasoning abilities of large language models, yet it is hard to scale because of low sample efficiency during the rollout phase. Existing methods attempt to improve efficiency by scheduling problems based on problem difficulty. However, these approaches suffer from unstable and biased estimates of problem difficulty and fail to capture the alignment between model competence and problem difficulty in RL training, leading to suboptimal results. To tackle these limitations, this paper introduces Competence-Difficulty Alignment Sampling (CDAS), which enables accurate and stable estimation of problem difficulties by aggregating historical performance discrepancies of problems. Then the model competence is quantified to adaptively select problems whose difficulty is in alignment…
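The sampling loop described in the abstract can be sketched roughly as follows. This is only one possible reading, not the authors' implementation: the class name `AlignmentSampler`, the logistic form of the expected pass rate, the running-mean difficulty, and the definition of competence as the negative mean difficulty are all assumptions made for illustration, since the abstract shown here is truncated and does not give the formulas.

```python
import math


def sigmoid(z):
    """Logistic function, used here as an assumed model of expected pass rate."""
    return 1.0 / (1.0 + math.exp(-z))


class AlignmentSampler:
    """Hypothetical sketch of competence-difficulty aligned sampling."""

    def __init__(self, num_problems):
        # Per problem: summed (expected - observed) pass-rate gaps and visit count.
        self.gap_sum = [0.0] * num_problems
        self.visits = [0] * num_problems
        self.competence = 0.0

    def difficulty(self, i):
        # Assumption: difficulty is the mean historical gap, so it changes
        # slowly as more rollouts for problem i accumulate.
        return self.gap_sum[i] / self.visits[i] if self.visits[i] else 0.0

    def update(self, i, observed_pass_rate):
        # Assumption: the expected pass rate depends on competence minus difficulty.
        expected = sigmoid(self.competence - self.difficulty(i))
        self.gap_sum[i] += expected - observed_pass_rate
        self.visits[i] += 1

    def refresh_competence(self):
        # Assumption: competence is the negative mean difficulty over all problems.
        n = len(self.visits)
        self.competence = -sum(self.difficulty(i) for i in range(n)) / n

    def select(self, batch_size):
        # Pick the problems whose difficulty is closest to current competence.
        order = sorted(
            range(len(self.visits)),
            key=lambda i: abs(self.competence - self.difficulty(i)),
        )
        return order[:batch_size]


sampler = AlignmentSampler(num_problems=3)
sampler.update(0, observed_pass_rate=1.0)  # solved every rollout: looks easy
sampler.update(1, observed_pass_rate=0.0)  # never solved: looks hard
sampler.update(2, observed_pass_rate=0.5)  # matches expectation
sampler.refresh_competence()
selected = sampler.select(batch_size=1)
```

In this toy run the third problem is the one selected, because its estimated difficulty sits closest to the estimated competence, while the always-solved and never-solved problems are pushed down the ranking.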
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Dialogue-Adaptive Pre-training Objective
