Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models
Yixiu Mao, Yun Qu, Qi Wang, Heming Zou, Xiangyang Ji

TL;DR
This paper introduces Dynamics-Predictive Sampling (DPS), a method that models each prompt's solving progress as a dynamical system in order to select training data efficiently, reducing computational cost and improving the reasoning abilities of large language models during RL finetuning.
Contribution
DPS models each prompt's solving progress as a dynamical system and uses Bayesian inference to predict which prompts are informative, enabling efficient prompt selection without extensive rollouts in RL finetuning of large language models.
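The idea of predicting a prompt's state instead of rolling it out can be illustrated with a minimal sketch. It assumes the three states named in the reviews (unsolved, partially solved, solved), a fixed transition matrix with made-up values, and top-B selection on the predicted probability of the partially solved state. The function names, the matrix `TRANSITION`, and the one-hot update are hypothetical and are not taken from the paper; the authors' actual inference procedure may differ.

```python
from typing import List, Sequence

# States are 0-indexed here; PARTIAL corresponds to "State 2" in 1-indexed notation.
UNSOLVED, PARTIAL, SOLVED = 0, 1, 2


def observe_state(rewards: Sequence[int]) -> int:
    """Map one prompt's binary rollout rewards to a coarse solving state."""
    correct = sum(rewards)
    if correct == 0:
        return UNSOLVED
    if correct == len(rewards):
        return SOLVED
    return PARTIAL


def predict(belief: Sequence[float], transition: Sequence[Sequence[float]]) -> List[float]:
    """Propagate a belief one step: next[j] = sum_i belief[i] * transition[i][j]."""
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]


def one_hot(state: int, n: int = 3) -> List[float]:
    """Belief after a prompt is rolled out and its state is observed directly."""
    return [1.0 if s == state else 0.0 for s in range(n)]


def select_top_b(predicted: Sequence[Sequence[float]], b: int) -> List[int]:
    """Return indices of the b prompts with the highest predicted P(PARTIAL)."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i][PARTIAL], reverse=True)
    return order[:b]


# Hypothetical transition matrix (rows sum to 1); illustrative values only.
TRANSITION = [
    [0.8, 0.2, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.1, 0.9],
]

# Last observed rollouts for three prompts: none correct, mixed, all correct.
last_rewards = [[0, 0, 0, 0], [1, 0, 1, 0], [1, 1, 1, 1]]
beliefs = [one_hot(observe_state(r)) for r in last_rewards]
predicted = [predict(bel, TRANSITION) for bel in beliefs]
chosen = select_top_b(predicted, b=2)  # only these prompts would get new rollouts
```

In this toy setup the selection step costs a few small matrix-vector products per prompt, while LLM rollouts are spent only on the `chosen` prompts; that is the cost shift the reviews describe.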
Findings
DPS significantly reduces the number of rollouts needed for training; the peer-review summary reports about 70% rollout savings.
DPS accelerates RL finetuning compared to existing prompt selection methods.
DPS achieves superior reasoning performance across diverse tasks.
Abstract
Reinforcement learning (RL) finetuning has become a key technique for enhancing the reasoning abilities of large language models (LLMs). However, its effectiveness critically depends on the selection of training data. Recent advances underscore the importance of online prompt selection methods, which typically concentrate training on partially solved or moderately challenging examples under the current policy, thereby yielding more effective model updates. While these methods significantly accelerate RL finetuning in terms of training steps, they also incur substantial computational overhead by requiring extensive LLM rollouts over large candidate batches to identify informative samples, an expense that can outweigh the finetuning process itself. To address this challenge, this work proposes Dynamics-Predictive Sampling (DPS), which predicts and selects informative prompts online by inferring their…
Peer Reviews
Decision: ICLR 2026 Poster
The paper offers a new dynamical-systems perspective on data selection in RL finetuning, bridging prompt-level reward evolution with state-space modeling. DPS eliminates the need for redundant LLM rollouts by predicting informative prompts, yielding 2–3× runtime reduction and 70% rollout savings while maintaining accuracy parity with DAPO. Comprehensive experiments across three reasoning domains and multiple model scales. Evaluation includes accuracy curves, confusion-matrix analyses of predi…
Reward formulation dependency: The approach currently depends on binary correctness rewards, which are straightforward in math-style benchmarks but not directly transferable to open-ended or process-based domains (e.g., code synthesis with partial correctness). The authors acknowledge this and suggest extending DPS to dense or step-wise rewards.
Simplistic selection criterion: DPS uses top-B selection on predicted State 2 probabilities. More nuanced criteria (e.g., entropy-based uncertainty,…
- The core contribution lies in shifting the sample selection overhead from LLM inference costs, which are proportional to batch size, to computationally negligible matrix operations. Experiments convincingly show that DPS can match or exceed the performance of baselines with substantially lower computational resources, a critical factor for the practical adoption of RL fine-tuning.
- The study features a comprehensive experimental design across diverse and complex reasoning tasks (mathematics,…
- The proposed three-state model (unsolved, partially solved, solved) is highly effective for binary reward settings. It would be beneficial to discuss the framework's extensibility to tasks with denser reward signals, such as continuous scores from process supervision. Addressing questions like how the state space could be partitioned (e.g., fixed vs. dynamic boundaries) would provide valuable insight into the method's potential applicability to a wider range of problems.
- The model maintains…
The paper's primary strength is its direct and effective focus on improving the computational efficiency of active prompt selection for RL finetuning. The work is presented with good clarity, first identifying the high cost of existing online sampling methods (like DS) that rely on extensive rollouts, and then proposing a clear, lightweight alternative. In terms of originality, the paper offers a practical methodological refinement; rather than introducing a new sampling goal, it iteratively imp…
* The necessity of the HMM framework is not fully justified, as the paper omits comparisons to simpler predictive baselines. A non-probabilistic heuristic, such as tracking an exponential moving average of reward variance for each prompt, might also identify "partially solved" items with similar efficiency gains and less modeling overhead.
* The claim of "negligible" computational overhead is only validated on relatively small datasets (e.g., 7.5k for MATH). The method's costs scale linearly wit…
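The simpler baseline this reviewer suggests can be sketched as follows. This is an illustration of the reviewer's proposed heuristic, not something evaluated in the paper; the class name `EmaVarianceTracker`, the smoothing factor `alpha`, and the zero initial score are assumptions. For binary rewards with success rate p, the per-prompt reward variance is p(1 - p), which peaks at p = 0.5 and vanishes for fully solved or fully unsolved prompts.

```python
from typing import Dict, Sequence


def binary_reward_variance(rewards: Sequence[int]) -> float:
    """Variance of 0/1 rewards: p * (1 - p), where p is the success rate."""
    p = sum(rewards) / len(rewards)
    return p * (1.0 - p)


class EmaVarianceTracker:
    """Keeps one exponentially smoothed reward-variance score per prompt."""

    def __init__(self, alpha: float = 0.5, initial: float = 0.0) -> None:
        self.alpha = alpha      # hypothetical smoothing factor
        self.initial = initial  # score assumed for prompts never rolled out
        self.scores: Dict[str, float] = {}

    def update(self, prompt_id: str, rewards: Sequence[int]) -> float:
        """Blend the newest rollout variance into the prompt's running score."""
        previous = self.scores.get(prompt_id, self.initial)
        current = binary_reward_variance(rewards)
        self.scores[prompt_id] = (1.0 - self.alpha) * previous + self.alpha * current
        return self.scores[prompt_id]
```

Prompts with the highest running score would be the candidates for the next batch. Unlike a state-space model, this score is only refreshed for prompts that are actually rolled out.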
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
