HeaPA: Difficulty-Aware Heap Sampling and On-Policy Query Augmentation for LLM Reinforcement Learning

Weiqi Wang; Xin Liu; Binxuan Huang; Hejie Cui; Rongzhi Zhang; Changlong Yu; Shuowei Jin; Jingfeng Yang; Qingyu Yin; Zhengyang Wang; Zheng Li; Yifan Gao; Priyanka Nigam; Bing Yin; Lihong Li; Yangqiu Song

arXiv:2601.22448 · cs.LG · February 2, 2026

TL;DR

HeaPA introduces a dynamic, frontier-aware sampling and pool augmentation method for RL training of LLMs, improving efficiency and accuracy by focusing on challenging prompts and expanding the prompt pool adaptively.

Contribution

The paper presents HeaPA, a novel heap-based sampling and on-policy prompt augmentation technique that maintains an evolving prompt pool for more efficient LLM reinforcement learning.

Findings

01 HeaPA reduces computation while maintaining or improving accuracy.

02 HeaPA's benefits increase with larger model scales.

03 HeaPA outperforms existing methods across multiple benchmarks.

Abstract

RLVR (reinforcement learning with verifiable rewards) is now a standard way to train LLMs on reasoning tasks with verifiable outcomes, but when rollout generation dominates the cost, efficiency depends heavily on which prompts you sample and when. In practice, prompt pools are often static or only loosely tied to the model's learning progress, so uniform sampling can't keep up with the shifting capability frontier and ends up wasting rollouts on prompts that are already solved or still out of reach. Existing approaches improve efficiency through filtering, curricula, adaptive rollout allocation, or teacher guidance, but they typically assume a fixed pool, which makes it hard to support stable on-policy pool growth, or they add extra teacher cost and latency. We introduce HeaPA (Heap Sampling and On-Policy Query Augmentation), which maintains a bounded, evolving pool, tracks the frontier using heap-based boundary sampling, expands the…
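To make the idea of frontier-aware sampling from a bounded pool concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract is truncated and does not specify the heap layout, the difficulty signal, or the eviction rule. The sketch assumes that difficulty is measured by an empirical pass rate per prompt, that the "frontier" sits near a pass rate of 0.5, and that the pool evicts whichever prompt is farthest from that target when full. The class name `FrontierPool` and its parameters are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class FrontierPool:
    """Hypothetical bounded prompt pool; an illustration, not HeaPA itself."""

    capacity: int
    target: float = 0.5  # assumed frontier: prompts solved about half the time
    pass_rate: dict = field(default_factory=dict)

    def _distance(self, prompt: str) -> float:
        # Distance of a prompt's pass rate from the assumed frontier.
        return abs(self.pass_rate[prompt] - self.target)

    def update(self, prompt: str, rate: float) -> None:
        """Record the latest pass rate; evict the farthest prompt if over capacity."""
        self.pass_rate[prompt] = rate
        if len(self.pass_rate) > self.capacity:
            worst = max(self.pass_rate, key=self._distance)
            del self.pass_rate[worst]

    def sample(self, k: int) -> list:
        """Return up to k prompts closest to the frontier, nearest first."""
        return heapq.nsmallest(k, self.pass_rate, key=self._distance)


pool = FrontierPool(capacity=3)
pool.update("already solved", 1.0)
pool.update("out of reach", 0.1)
pool.update("on the frontier", 0.5)
pool.update("nearly there", 0.6)  # pool is over capacity, so one prompt is evicted
batch = pool.sample(2)
```

In this toy run the always-solved prompt is evicted and the two sampled prompts are those with pass rates closest to 0.5, which mirrors the abstract's point that rollouts spent on solved or unreachable prompts are wasted. Uniform sampling over the same pool would ignore the pass rates entirely.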

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques