Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction

Fei Ding; Yongkang Zhang; Runhao Liu; Yuhao Liao; Zijian Zeng; Huiming Yang; Sibo Wang; Linglin Liao

arXiv:2604.17328·cs.LG·April 21, 2026

TL;DR

This paper addresses the length problem in sequence-level reinforcement learning by proposing a new training framework that constructs equal-length, comparable training segments to improve training stability and effectiveness.

Contribution

It reframes the length problem as a comparison-unit construction issue and proposes EqLen, a method that proactively constructs equal-length samples during training.

Findings

01. EqLen effectively constructs equal-length training segments.

02. The framework improves training stability for sequence-level RL methods.

03. EqLen is applicable to algorithms like GRPO, GSPO, and RLOO.

Abstract

This paper investigates the length problem in sequence-level relative reinforcement learning. We observe that, although existing methods partially alleviate length-related phenomena, a more fundamental issue remains insufficiently characterized: the comparison units used during training lack inherent comparability. Building on this observation, we propose a new perspective: the length problem should not be viewed merely as a loss-scaling or normalization bias, but rather as a comparison unit construction problem. We further establish a sample-construction-based training framework that, instead of applying post-hoc corrections to unequal-length responses, proactively constructs equal-length, alignable, and comparable training segments during generation. Within this framework, we propose EqLen, a concrete method applicable to group-relative comparison algorithms such as GRPO, GSPO,…
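To make the "loss-scaling or normalization bias" mentioned in the abstract concrete, the following is a minimal Python sketch, not the paper's method. It assumes the commonly used GRPO-style group-relative advantage (reward standardized within the group) and a per-response loss averaged over that response's own token count. The function names, rewards, and lengths are hypothetical and chosen only for illustration; the sketch does not show how EqLen constructs its segments, which this listing does not specify.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (GRPO-style, assumed here)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_token_weights(advantages, lengths):
    """Weight each token receives when a response's loss is averaged
    over its own length (an assumed normalization, for illustration)."""
    return [a / n for a, n in zip(advantages, lengths)]


# Hypothetical group of four responses: two rewarded, two not.
rewards = [1.0, 1.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)

# Unequal lengths: equally rewarded responses get different per-token weights.
w_unequal = per_token_weights(adv, [10, 40, 10, 40])

# Equal lengths: per-token weight magnitudes coincide across the group.
w_equal = per_token_weights(adv, [20, 20, 20, 20])

print(w_unequal)
print(w_equal)
```

With unequal lengths, the two rewarded responses share the same advantage yet the shorter one carries four times the per-token weight. With equal lengths, that discrepancy disappears without any post-hoc correction, which is the kind of comparability the abstract argues for.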
