Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Fanfan Liu; Youyang Yin; Peng Shi; Siqi Yang; Zhixiong Zeng; Haibo Qiu

arXiv:2602.05261 · cs.CL · February 6, 2026

Open Access

TL;DR

This paper analyzes response length variation in RLVR for LLMs and VLMs, introduces a length-unbiased optimization method, and demonstrates its effectiveness in improving reasoning performance across multiple benchmarks.

Contribution

The paper provides a theoretical analysis of response length dynamics in RLVR and proposes LUSPO, a length-unbiased optimization algorithm that addresses length collapse.

Findings

1. LUSPO outperforms existing methods like GSPO and GRPO.
2. Theoretical analysis explains response length variation patterns.
3. Extensive experiments validate LUSPO's superior reasoning capabilities.

Abstract

Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO)…
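The abstract attributes differences in response-length behavior to the components of RLVR algorithms. As a minimal illustration of one such component, the sketch below compares how much weight a single token carries when a group loss is averaged per response and then across the group (the per-sequence length normalization used in the GRPO objective) versus when all tokens in the group are pooled. This is an assumption-laden toy, not the paper's analysis and not the LUSPO objective; the function name `per_token_weights` and the example lengths are hypothetical.

```python
def per_token_weights(lengths, normalize_by_length=True):
    """Weight that one token of each response carries in the group loss.

    lengths: token counts of the sampled responses in one group.
    normalize_by_length=True  -> mean over each response, then mean over the group.
    normalize_by_length=False -> mean over all tokens pooled across the group.
    Illustrative helper only (hypothetical name, not from the paper).
    """
    group_size = len(lengths)
    if normalize_by_length:
        # Each response contributes 1/group_size in total, split over its tokens.
        return [1.0 / (group_size * n) for n in lengths]
    # Every token in the group shares the same weight.
    total_tokens = sum(lengths)
    return [1.0 / total_tokens for _ in lengths]


# Hypothetical group: one short and one long response.
lengths = [10, 100]
seq_mean = per_token_weights(lengths, normalize_by_length=True)
tok_mean = per_token_weights(lengths, normalize_by_length=False)

# Under per-sequence normalization a token in the short response
# weighs 10x more than a token in the long response.
ratio = seq_mean[0] / seq_mean[1]
print(seq_mean, tok_mean, ratio)
```

In this toy setting the per-token weight depends on response length only under per-sequence normalization, which is one way a loss aggregation choice can interact with length; whether and how this matches the paper's theoretical account is not established by the truncated abstract.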

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning