Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR
Hao Yi, Yulan Hu, Xin Li, Sheng Ouyang, Lizhong Ding, Yong Liu

TL;DR
This paper introduces an uncertainty consistency metric for active learning in RLVR, enabling fewer queries to achieve comparable or better reasoning performance with reduced annotation costs.
Contribution
It proposes a novel uncertainty consistency metric and an online variant, improving query selection in RLVR and reducing data requirements.
Findings
The method outperforms random and classic AL baselines.
It achieves full-dataset performance with only 30% of the data.
It significantly reduces annotation costs in reasoning tasks.
Abstract
Large Language Models (LLMs) have recently improved mathematical reasoning through Reinforcement Learning with Verifiable Reward (RLVR). However, existing RLVR algorithms require large query budgets, making annotation costly. We investigate whether fewer but more informative queries can yield similar or superior performance, introducing active learning (AL) into RLVR. We identify that classic AL sampling strategies fail to outperform random selection in this setting, because selecting only by subjective uncertainty ignores objective uncertainty. This work proposes an uncertainty consistency metric to evaluate how well subjective uncertainty aligns with objective uncertainty. In the offline setting, this alignment is measured using the Point-Biserial Correlation Coefficient (PBC). For online training, because of limited sampling and dynamically shifting output distributions, PBC…
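To make the offline measurement concrete, here is a minimal sketch of the textbook point-biserial correlation between a continuous score and a binary label. It assumes the continuous score is the mean token log-probability of each sampled response and the binary label is whether the verifier marked that response correct; the paper's exact definition, sign convention, and handling of degenerate cases are not given in this summary, so the function name `point_biserial`, its inputs, and the `0.0` fallback are illustrative assumptions rather than the authors' implementation.

```python
import math


def point_biserial(values, labels):
    """Textbook point-biserial correlation between a continuous
    variable (values) and a 0/1 variable (labels).

    Returns 0.0 when the coefficient is undefined (only one label
    present, or zero variance in values). That fallback is an
    assumption made for this sketch, not something from the paper.
    """
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")

    n = len(values)
    group1 = [v for v, y in zip(values, labels) if y == 1]
    group0 = [v for v, y in zip(values, labels) if y == 0]
    n1, n0 = len(group1), len(group0)
    if n1 == 0 or n0 == 0:
        return 0.0

    # Population variance of the continuous variable over all samples.
    mean_all = sum(values) / n
    variance = sum((v - mean_all) ** 2 for v in values) / n
    if variance == 0:
        return 0.0

    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    return (mean1 - mean0) / math.sqrt(variance) * math.sqrt(n1 * n0 / (n * n))


# Hypothetical data for one query: mean log-probability of four sampled
# responses and whether a verifier marked each one correct.
mean_logprobs = [-0.2, -0.4, -1.5, -1.8]
is_correct = [1, 1, 0, 0]
consistency = point_biserial(mean_logprobs, is_correct)
```

In this toy case the confident responses are the correct ones, so the coefficient is positive. A value near zero or below would suggest that the model's confidence does not track correctness for that query.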
Peer Reviews
Decision: ICLR 2026 Poster
The paper identifies a practical limitation in current RL-based reasoning training pipelines: query selection methods that rely solely on subjective uncertainty (e.g., perplexity) often select examples that are uncertain but uninformative, leading to unstable gradients and inefficient learning. This motivation is clearly articulated and supported by empirical evidence. This insight is both intuitive and impactful—valuable training samples are those where uncertainty meaningfully reflects correct
The consistency metric assumes that the examples used for selection reflect the distributions encountered during RL optimization. If the underlying data distribution shifts over training (which is common in RLVR), the effectiveness of selection may degrade unless the scores are frequently recomputed. The evaluation is limited to math reasoning. For query selection methods, it would be valuable to show whether the approach generalizes beyond math reasoning tasks.
1. The problem is well-motivated and relevant to current challenges in RLVR.
2. The paper offers an interesting empirical observation that inconsistent samples can lead to extreme gradients, which explains why standard AL can underperform random sampling.
3. The introduction of two alignment metrics (one for offline and one for online settings) is insightful, and the accompanying theoretical analysis provides some grounding.
4. Experiments are extensive and demonstrate strong results, achieving co
While the paper is promising, several points could benefit from deeper clarification or justification:
1. The link between sample inconsistency and extreme gradient behavior is intuitively explained but lacks theoretical support or formal analysis.
2. It is unclear why the offline setting cannot also leverage the online metric $r_{pb}^{online}$, which appears to yield stronger performance in experiments.
3. In some cases, training on the full dataset leads to worse results than using only 30% o
1. The writing is clear, makes the authors' idea easy to follow, and revisits the use of AL for an emerging field.
2. The paper highlights that a query selection metric should not rely only on the LLM itself but should also be consistent with the reward model's evaluation.
1. **My concern about using uncertainty.** After reviewing the Eqs. (2) and (3), IIUC, the definition of the subjective uncertainty is the low average probability of a policy model's responses $\log \pi_\mathrm{ref}(y_{k, t}^{(i)} \vert x^{(i)}, y_{k, <t}^{(i)})$ and the objective uncertainty is the low accuracy of a reward model's evaluation of a model's response, respectively. However, why can we call these two terms uncertainty? For example, if an LLM's response gives a response with lower pr
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Machine Learning in Materials Science
