MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning

Xiaoliang Fu; Jiaye Lin; Yangyi Fang; Binbin Zheng; Chaowen Hu; Zekai Shao; Cong Qin; Lu Pan; Ke Zeng; Xunliang Cai

arXiv:2602.17550 · cs.LG · April 21, 2026

TL;DR

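MASPO is a unified RLVR framework for LLMs that targets three weaknesses of GRPO-style trust regions: it replaces hard clipping with a differentiable soft gate so more gradient signal is used, adapts ratio limits to the token probability distribution, and treats positive and negative samples asymmetrically according to how reliable their signal is. The stated goal is more robust and sample-efficient training.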

Contribution

The paper proposes Mass-Adaptive Soft Policy Optimization (MASPO), a framework that combines differentiable gating, adaptive limiting, and asymmetric risk control for reinforcement learning of LLMs.

Findings

01. MASPO outperforms existing methods in robustness and sample efficiency.

02. The framework effectively balances exploration and exploitation in LLM training.

03. Code is publicly available at https://github.com/FlyTune/MASPO-RL.

Abstract

Existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics of Large Language Models (LLMs). In this paper, we identify three critical challenges in these methods: (1) inefficient gradient utilization caused by the binary cutoff of hard clipping, (2) insensitive probability mass arising from uniform ratio constraints that ignore the token distribution, and (3) asymmetric signal reliability stemming from the disparate credit assignment ambiguity between positive and negative samples. To bridge these gaps, we propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive…
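The first challenge named in the abstract, the binary cutoff of hard clipping, can be illustrated with a small sketch. The clipped surrogate below is the standard PPO-style objective that GRPO also uses. The Gaussian weight is only a hypothetical stand-in for a soft gate, since the abstract does not give MASPO's actual formula. The function names and the `eps` and `sigma` values are assumptions chosen for illustration.

```python
import math


def clipped_surrogate_grad(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Derivative w.r.t. the ratio of min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps) * advantage
    # When the unclipped term is the minimum, the derivative is A.
    # Otherwise the clipped term is active and constant in the ratio,
    # so the token contributes no gradient at all.
    return advantage if unclipped <= clipped else 0.0


def gaussian_gate_weight(ratio: float, sigma: float = 0.2) -> float:
    """Hypothetical soft gate (NOT the paper's formula): decays smoothly away from ratio = 1."""
    return math.exp(-((ratio - 1.0) ** 2) / (2.0 * sigma ** 2))


if __name__ == "__main__":
    for r in (0.5, 0.9, 1.0, 1.1, 1.5):
        hard = clipped_surrogate_grad(r, advantage=1.0)
        soft = gaussian_gate_weight(r)
        print(f"ratio={r:.1f}  hard-clip grad={hard:.1f}  soft weight={soft:.4f}")
```

For a positive-advantage token, hard clipping passes the full gradient up to `1 + eps` and nothing beyond it, whereas the illustrative Gaussian weight shrinks continuously and never reaches exactly zero.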

Code & Models

Repositories

FlyTune/MASPO-RL (GitHub): https://github.com/FlyTune/MASPO-RL