TL;DR
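MASPO (Mass-Adaptive Soft Policy Optimization) is a unified RLVR framework for LLMs that targets three weaknesses of GRPO-style trust regions: gradients wasted by hard clipping, ratio constraints that ignore token probability mass, and the unequal reliability of positive and negative samples. The stated result is improved robustness and sample efficiency.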
Contribution
The paper proposes MASPO, a framework that combines differentiable soft Gaussian gating, adaptive limiting, and asymmetric risk control for LLM reinforcement learning.
Findings
MASPO outperforms existing methods in robustness and sample efficiency.
The framework effectively balances exploration and exploitation in LLM training.
Code is publicly available at https://github.com/FlyTune/MASPO-RL.
Abstract
Existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics of Large Language Models (LLMs). In this paper, we identify three critical challenges in these methods: (1) inefficient gradient utilization caused by the binary cutoff of hard clipping, (2) insensitive probability mass arising from uniform ratio constraints that ignore the token distribution, and (3) asymmetric signal reliability stemming from the disparate credit assignment ambiguity between positive and negative samples. To bridge these gaps, we propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive…
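The first challenge in the abstract, the binary cutoff of hard clipping, can be illustrated with a small sketch. The code below compares the per-token gradient weight of a PPO/GRPO-style clipped surrogate with a Gaussian gate centered at a ratio of 1. The gate formula, the function names, and the `eps` and `sigma` values are assumptions chosen for illustration only; the excerpt above does not give MASPO's actual gating function, so this should not be read as the authors' method.

```python
import math


def hard_clip_weight(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Gradient weight under a PPO/GRPO-style clipped surrogate.

    The surrogate min(r * A, clip(r, 1 - eps, 1 + eps) * A) has zero
    gradient with respect to r once the clipped branch is selected.
    """
    if advantage > 0 and ratio > 1 + eps:
        return 0.0  # positive sample pushed too far up: gradient cut off
    if advantage < 0 and ratio < 1 - eps:
        return 0.0  # negative sample pushed too far down: gradient cut off
    return 1.0


def gaussian_gate_weight(ratio: float, sigma: float = 0.2) -> float:
    """Hypothetical soft gate (an assumption, not the paper's formula).

    The weight decays smoothly as the ratio moves away from 1, so tokens
    outside the usual clip range still contribute a small gradient.
    """
    return math.exp(-((ratio - 1.0) ** 2) / (2.0 * sigma ** 2))


if __name__ == "__main__":
    # Positive-advantage tokens at increasing importance ratios.
    for r in (1.0, 1.1, 1.2, 1.3, 1.5):
        hard = hard_clip_weight(r, advantage=1.0)
        soft = gaussian_gate_weight(r)
        print(f"ratio={r:.1f}  hard={hard:.1f}  soft={soft:.3f}")
```

Under these assumptions, the hard-clipped weight drops from 1 to 0 as soon as the ratio passes `1 + eps`, while the gated weight shrinks continuously. That contrast is the kind of gradient-utilization gap the abstract attributes to hard clipping.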
