DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay
Long Li, Zhijian Zhou, Tianyi Wang, Weidi Xu, Zuming Huang, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi

TL;DR
DyJR introduces a dynamic replay buffer and Jensen-Shannon divergence regularization to enhance diversity and efficiency in reinforcement learning for language models, outperforming existing methods on reasoning and Text-to-SQL tasks.
Contribution
The paper proposes DyJR, a novel regularization framework with a dynamic buffer and distributional constraint to preserve diversity and improve training efficiency in RL for language models.
Findings
DyJR outperforms GRPO, RLEP, and Ex-GRPO on reasoning and Text-to-SQL benchmarks.
DyJR maintains training efficiency comparable to GRPO.
DyJR enhances diversity and reduces over-reliance on top tokens.
Abstract
While Reinforcement Learning (RL) enhances Large Language Model reasoning, on-policy algorithms like GRPO are sample-inefficient as they discard past rollouts. Existing experience replay methods address this by reusing accurate samples for direct policy updates, but this often incurs high computational costs and causes mode collapse via overfitting. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. To this end, we propose Dynamic Jensen-Shannon Replay (DyJR), a simple yet effective regularization framework using a dynamic reference distribution from recent trajectories. DyJR introduces two innovations: (1) A Time-Sensitive Dynamic Buffer that uses FIFO and adaptive sizing to retain only temporally proximal samples, synchronizing with model evolution; and (2) Jensen-Shannon Divergence Regularization, which replaces direct…
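The two components named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class `FifoReplayBuffer`, its `resize` method, and the helper functions are hypothetical names chosen for illustration. The abstract does not state the adaptive sizing rule or how the divergence enters the training loss, so the sketch only shows first-in-first-out eviction and the standard Jensen-Shannon divergence between two discrete distributions.

```python
import math
from collections import deque


class FifoReplayBuffer:
    """Hypothetical FIFO buffer: the oldest rollouts are evicted first."""

    def __init__(self, capacity):
        # A deque with maxlen drops items from the left once it is full.
        self._items = deque(maxlen=capacity)

    def add(self, rollout):
        self._items.append(rollout)

    def resize(self, capacity):
        # Assumption: the caller supplies the new size. The paper's adaptive
        # sizing rule is not given in the abstract. Rebuilding the deque
        # keeps only the most recent `capacity` rollouts.
        self._items = deque(self._items, maxlen=capacity)

    def items(self):
        return list(self._items)


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Standard Jensen-Shannon divergence: symmetric, bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Usage: keep the three most recent rollouts, then shrink the buffer to two.
buffer = FifoReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(step)
recent = buffer.items()
buffer.resize(2)
shrunk = buffer.items()

# Compare a current next-token distribution with a reference distribution.
policy_probs = [0.7, 0.2, 0.1]
reference_probs = [0.4, 0.4, 0.2]
penalty = js_divergence(policy_probs, reference_probs)
```

Unlike the KL divergence, the Jensen-Shannon divergence is symmetric in its arguments and stays finite even when the two distributions put mass on different tokens, which is one plausible reason to prefer it as a regularizer toward a reference built from recent trajectories.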
Taxonomy
Topics: Topic Modeling · Reinforcement Learning in Robotics · Machine Learning and Data Classification
