Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR

Hao Yi; Yulan Hu; Xin Li; Sheng Ouyang; Lizhong Ding; Yong Liu

arXiv:2601.22595·cs.AI·February 2, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces an uncertainty consistency metric for active learning in RLVR, enabling fewer queries to achieve comparable or better reasoning performance with reduced annotation costs.

Contribution

It proposes a novel uncertainty consistency metric and an online variant, improving query selection in RLVR and reducing data requirements.

Findings

01 · The proposed method outperforms random selection and classic AL baselines.

02 · It matches full-dataset performance using only 30% of the data.

03 · It significantly reduces annotation costs in reasoning tasks.

Abstract

Large Language Models (LLMs) have recently improved mathematical reasoning through Reinforcement Learning with Verifiable Reward (RLVR). However, existing RLVR algorithms require large query budgets, making annotation costly. We investigate whether fewer but more informative queries can yield similar or superior performance, introducing active learning (AL) into RLVR. We identify that classic AL sampling strategies fail to outperform random selection in this setting because they select by subjective uncertainty alone and ignore objective uncertainty. This work proposes an uncertainty consistency metric to evaluate how well subjective uncertainty aligns with objective uncertainty. In the offline setting, this alignment is measured using the Point-Biserial Correlation Coefficient (PBC). For online training, because of limited sampling and dynamically shifting output distributions, PBC…
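As a rough illustration of the offline metric named in the abstract, the sketch below computes a standard point-biserial correlation between a continuous score and a binary label. It assumes, purely for illustration, that each sampled response to a query carries a subjective-uncertainty score (for example, a negative mean token log-probability) and a 0/1 verifier reward. The function name `point_biserial`, the example values, and the way scores are paired with rewards are hypothetical; the paper's exact definitions are not reproduced in this listing.

```python
import math


def point_biserial(scores, labels):
    """Point-biserial correlation between continuous scores and 0/1 labels.

    Uses the textbook form r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2),
    where s_n is the population standard deviation of all scores.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n = len(scores)
    group1 = [s for s, b in zip(scores, labels) if b == 1]
    group0 = [s for s, b in zip(scores, labels) if b == 0]
    n1, n0 = len(group1), len(group0)
    if n1 == 0 or n0 == 0:
        # Undefined when every response has the same label.
        return float("nan")
    mean_all = sum(scores) / n
    std = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    if std == 0.0:
        # Undefined when every score is identical.
        return float("nan")
    m1 = sum(group1) / n1
    m0 = sum(group0) / n0
    return (m1 - m0) / std * math.sqrt(n1 * n0 / n**2)


# Hypothetical values for one query with four sampled responses.
uncertainty = [0.2, 0.3, 0.9, 1.1]  # assumed subjective-uncertainty scores
correct = [1, 1, 0, 0]              # assumed binary verifier rewards
r = point_biserial(uncertainty, correct)
print(f"r_pb = {r:.3f}")
```

In this toy example the correct responses are also the low-uncertainty ones, so the coefficient is strongly negative. Under the assumptions above, that is the sense in which subjective and objective uncertainty would agree. How the paper turns this coefficient into a selection score is not stated here.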

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The paper identifies a practical limitation in current RL-based reasoning training pipelines: query selection methods that rely solely on subjective uncertainty (e.g., perplexity) often select examples that are uncertain but uninformative, leading to unstable gradients and inefficient learning. This motivation is clearly articulated and supported by empirical evidence. This insight is both intuitive and impactful—valuable training samples are those where uncertainty meaningfully reflects correct

Weaknesses

The consistency metric assumes that the examples used for selection reflect the distributions seen during RL optimization. If the underlying data distribution shifts over training (which is common in RLVR), the effectiveness of selection may degrade unless the scoring is frequently recomputed. The evaluation is limited to math reasoning. For query selection methods, it would be valuable to show whether the approach generalizes beyond math reasoning tasks.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The problem is well-motivated and relevant to current challenges in RLVR.
2. The paper offers an interesting empirical observation that inconsistent samples can lead to extreme gradients, which explains why standard AL can underperform random sampling.
3. The introduction of two alignment metrics (one for offline and one for online settings) is insightful, and the accompanying theoretical analysis provides some grounding.
4. Experiments are extensive and demonstrate strong results, achieving co

Weaknesses

While the paper is promising, several points could benefit from deeper clarification or justification:
1. The link between sample inconsistency and extreme gradient behavior is intuitively explained but lacks theoretical support or formal analysis.
2. It is unclear why the offline setting cannot also leverage the online metric $r_{pb}^{online}$, which appears to yield stronger performance in experiments.
3. In some cases, training on the full dataset leads to worse results than using only 30% o

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The writing is clear, making the authors' idea easy to follow, and it revisits the use of AL in an emerging field.
2. The paper highlights that a query selection metric should not rely on the LLM alone but should also be consistent with the reward model's evaluation.

Weaknesses

1. **My concern about using uncertainty.** After reviewing Eqs. (2) and (3), if I understand correctly, subjective uncertainty is defined as the low average probability of a policy model's responses, $\log \pi_\mathrm{ref}(y_{k, t}^{(i)} \vert x^{(i)}, y_{k, <t}^{(i)})$, and objective uncertainty as the low accuracy of a reward model's evaluation of a model's response. However, why can we call these two terms uncertainty? For example, if an LLM's response gives a response with lower pr

Taxonomy

Topics: Advanced Graph Neural Networks · Topic Modeling · Machine Learning in Materials Science