DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning

Batuhan K. Karaman; Aditya Rawal; Suhaila Shakiah; Mohammad Ghavamzadeh; Mingyi Hong; Arijit Biswas; Ruida Zhou

arXiv:2602.00983 · cs.CL · February 3, 2026

Open Access

TL;DR

DISPO is a reinforcement learning algorithm for large language model mathematical reasoning. It decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, which improves training stability and efficiency and keeps exploration controllable.

Contribution

This paper introduces DISPO, a REINFORCE-style method whose importance sampling clipping bounds are set separately for correct and incorrect responses, giving four controllable policy update regimes for large language model reinforcement learning on math reasoning.

Findings

01. DISPO achieves 61.04% on AIME'24, outperforming CISPO and DAPO.

02. Decoupling clipping parameters balances exploration and distillation.

03. Proper tuning prevents catastrophic training failures.

Abstract

Reinforcement learning with verifiable rewards has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models, particularly in mathematics. Current approaches in this domain present a clear trade-off: PPO-style methods (e.g., GRPO/DAPO) offer training stability but exhibit slow learning trajectories due to their trust-region constraints on policy updates, while REINFORCE-style approaches (e.g., CISPO) demonstrate improved learning efficiency but suffer from performance instability as they clip importance sampling weights while still permitting non-zero gradients outside the trust region. To address these limitations, we introduce DISPO, a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, yielding four controllable policy update regimes.…
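
The trade-off described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the function names, the `ClipBounds` type, the numeric bounds, and the use of the advantage sign as a stand-in for "correct" versus "incorrect" responses are all assumptions made for illustration. The sketch compares the per-token coefficient that multiplies the gradient of log π under a PPO-style clipped surrogate with the coefficient under a REINFORCE-style clipped importance sampling weight whose bounds are chosen separately for correct and incorrect responses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipBounds:
    """Hypothetical lower/upper clip bounds for an importance sampling weight."""
    low: float
    high: float


def clip(value: float, low: float, high: float) -> float:
    """Clamp value into the closed interval [low, high]."""
    return max(low, min(high, value))


def ppo_style_coeff(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Coefficient on grad log pi for the standard PPO clipped surrogate.

    The gradient is zero once the ratio has left the trust region in the
    direction the advantage is pushing it.
    """
    if advantage > 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return ratio * advantage


def decoupled_coeff(
    ratio: float,
    advantage: float,
    correct_bounds: ClipBounds,
    incorrect_bounds: ClipBounds,
) -> float:
    """Coefficient on grad log pi for a REINFORCE-style clipped weight.

    The weight is clipped (and treated as a constant), so the gradient stays
    non-zero outside the trust region. Assumption: a positive advantage marks
    a correct response and a non-positive one an incorrect response; each
    case gets its own lower (down-clip) and upper (up-clip) bound.
    """
    bounds = correct_bounds if advantage > 0 else incorrect_bounds
    return clip(ratio, bounds.low, bounds.high) * advantage


if __name__ == "__main__":
    # Illustrative bounds only; these values are not taken from the paper.
    correct = ClipBounds(low=0.0, high=1.3)
    incorrect = ClipBounds(low=0.5, high=2.0)
    for ratio, adv in [(1.5, 1.0), (0.1, 1.0), (3.0, -1.0), (0.1, -1.0)]:
        print(ratio, adv,
              ppo_style_coeff(ratio, adv),
              decoupled_coeff(ratio, adv, correct, incorrect))
```

Under this reading, the four bounds (down-clip and up-clip for correct responses, down-clip and up-clip for incorrect responses) are the four knobs that would select among the update regimes mentioned in the abstract; the paper itself should be consulted for the exact objective and parameter values.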

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics