DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay

Long Li; Zhijian Zhou; Tianyi Wang; Weidi Xu; Zuming Huang; Wei Chu; Zhe Wang; Shirui Pan; Chao Qu; Yuan Qi

arXiv:2603.16157 · cs.LG · March 18, 2026

Open Access

TL;DR

DyJR introduces a dynamic replay buffer and Jensen-Shannon divergence regularization to enhance diversity and efficiency in reinforcement learning for language models, outperforming existing methods on reasoning and Text-to-SQL tasks.

Contribution

The paper proposes DyJR, a novel regularization framework with a dynamic buffer and distributional constraint to preserve diversity and improve training efficiency in RL for language models.

Findings

1. DyJR outperforms GRPO, RLEP, and Ex-GRPO on reasoning and Text-to-SQL benchmarks.

2. DyJR maintains training efficiency comparable to GRPO.

3. DyJR enhances diversity and reduces over-reliance on top tokens.

Abstract

While Reinforcement Learning (RL) enhances Large Language Model reasoning, on-policy algorithms like GRPO are sample-inefficient as they discard past rollouts. Existing experience replay methods address this by reusing accurate samples for direct policy updates, but this often incurs high computational costs and causes mode collapse via overfitting. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. To this end, we propose Dynamic Jensen-Shannon Replay (DyJR), a simple yet effective regularization framework using a dynamic reference distribution from recent trajectories. DyJR introduces two innovations: (1) A Time-Sensitive Dynamic Buffer that uses FIFO and adaptive sizing to retain only temporally proximal samples, synchronizing with model evolution; and (2) Jensen-Shannon Divergence Regularization, which replaces direct…
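
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract is truncated before it states the actual objective, so the code only shows the standard Jensen-Shannon divergence between two discrete distributions and a generic FIFO buffer whose capacity can be changed. The names `js_divergence`, `FIFOBuffer`, and `resize` are hypothetical, and the rule DyJR uses to adapt the buffer size is not given here, so the caller sets the capacity by hand.

```python
import math
from collections import deque


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions on the same support."""
    # Terms with p_i == 0 contribute nothing, so they are skipped.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Standard Jensen-Shannon divergence with the mixture m = (p + q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


class FIFOBuffer:
    """Hypothetical FIFO buffer: the oldest items are evicted first."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)

    def add(self, item):
        # Appending to a full deque drops the oldest item automatically.
        self._items.append(item)

    def resize(self, capacity):
        # Assumption: on shrinking, keep only the most recent items.
        self._items = deque(self._items, maxlen=capacity)

    def items(self):
        return list(self._items)
```

Unlike the KL divergence, the Jensen-Shannon divergence is symmetric and bounded above by ln 2 (in nats), so it stays finite even when the two distributions place mass on different tokens.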

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics · Machine Learning and Data Classification