HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

Zhanyu Liu; Qingguo Hu; Ante Wang; Chenqing Liu; Zhishang Xiang; Hui Li; Delai Qiu; Jinsong Su

arXiv:2604.17928 · cs.LG · April 21, 2026

TL;DR

This paper introduces HEAL, a framework for few-shot RLVR that mitigates entropy collapse by adding selected general-domain data and aligning trajectory-level entropy dynamics between the target and general domains, which improves reasoning when target-domain data is scarce.

Contribution

HEAL combines selective incorporation of high-value general-domain data with Entropy Dynamics Alignment (EDA) to enhance exploration and reasoning in low-resource RLVR settings.

Findings

01. HEAL improves few-shot RLVR performance across multiple domains.

02. Using only 32 target samples, HEAL matches or surpasses models trained with 1K samples.

03. Entropy Dynamics Alignment effectively mitigates entropy collapse and promotes diverse exploration.

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) has proven effective for training reasoning-oriented large language models, but existing methods largely assume high-resource settings with abundant training data. In low-resource scenarios, RLVR is prone to more severe entropy collapse, which substantially limits exploration and degrades reasoning performance. To address this issue, we propose Hybrid-domain Entropy dynamics ALignment (HEAL), a framework tailored for few-shot RLVR. HEAL first selectively incorporates high-value general-domain data to promote more diverse exploration. Then, we introduce Entropy Dynamics Alignment (EDA), a reward mechanism that aligns trajectory-level entropy dynamics between the target and general domains, capturing both entropy magnitude and fine-grained variation. Through this alignment, EDA not only further mitigates entropy collapse but also…
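
The abstract describes EDA as comparing trajectory-level entropy dynamics across domains in terms of both magnitude and fine-grained variation, but the excerpt does not give the formula. The sketch below is only an illustration of that idea under loud assumptions: the function names, the linear resampling step, the two gap terms, and the `1 / (1 + gap)` scoring are all hypothetical choices and are not taken from the paper.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_trajectory(step_probs: Sequence[Sequence[float]]) -> List[float]:
    """Per-step entropy along one sampled trajectory."""
    return [token_entropy(p) for p in step_probs]


def resample(series: Sequence[float], n: int) -> List[float]:
    """Linearly interpolate a series to n >= 2 points so that
    trajectories of different lengths can be compared (assumed step)."""
    if len(series) == 1:
        return [series[0]] * n
    out = []
    for i in range(n):
        pos = i * (len(series) - 1) / (n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(series) - 1)
        frac = pos - lo
        out.append(series[lo] * (1.0 - frac) + series[hi] * frac)
    return out


def alignment_score(target: Sequence[float],
                    general: Sequence[float],
                    n: int = 8) -> float:
    """Hypothetical alignment score in (0, 1]; 1 means identical dynamics.

    NOT the paper's EDA reward. It only mirrors the two ingredients the
    abstract names: entropy magnitude and step-to-step variation.
    """
    t, g = resample(target, n), resample(general, n)
    # Magnitude term: mean absolute gap between entropy levels.
    mag_gap = sum(abs(a - b) for a, b in zip(t, g)) / n
    # Variation term: mean absolute gap between first differences.
    dt = [t[i + 1] - t[i] for i in range(n - 1)]
    dg = [g[i + 1] - g[i] for i in range(n - 1)]
    var_gap = sum(abs(a - b) for a, b in zip(dt, dg)) / (n - 1)
    return 1.0 / (1.0 + mag_gap + var_gap)


if __name__ == "__main__":
    # Toy distributions over a 4-token vocabulary (made-up values).
    general = entropy_trajectory([[0.25, 0.25, 0.25, 0.25]] * 5)
    collapsed = entropy_trajectory([[0.97, 0.01, 0.01, 0.01]] * 5)
    print(alignment_score(general, general))
    print(alignment_score(collapsed, general))
```

In this toy setup, a target trajectory whose per-step entropy has collapsed toward zero scores lower against a high-entropy general-domain trajectory than a trajectory with matching entropy does, which is the qualitative behavior an alignment-style reward would need in order to discourage collapse.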
