Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy
Guangchen Lan, Lian Xiong, Xin Zhou, Hejie Cui, Yuwei Zhang, Mao Li, Zhenyu Shi, Besnik Fetahu, Lihong Li, Xian Li

TL;DR
This paper introduces ARL-RR, a reinforcement learning framework that avoids aggregating rubric rewards into a single scalar by optimizing one semantic rubric meta-class at a time, improving both model performance and training efficiency.
Contribution
The work proposes an alternating optimization approach for rubric rewards, removing fixed scalarization and dynamically emphasizing objectives, with theoretical and empirical validation.
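To make the contrast with fixed scalarization concrete, here is a minimal sketch in Python. The meta-class names, the weights, and the round-robin schedule are all illustrative assumptions; the summary says only that objectives are emphasized dynamically, so the paper's actual switching rule may differ.

```python
from itertools import cycle
from typing import Dict, Iterator, List

# Hypothetical meta-class names and weights, chosen only for illustration.
META_CLASSES: List[str] = ["accuracy", "completeness", "style"]
FIXED_WEIGHTS: Dict[str, float] = {"accuracy": 0.5, "completeness": 0.3, "style": 0.2}


def scalarized_reward(scores: Dict[str, float]) -> float:
    """Baseline: compress the rubric vector into one scalar with fixed weights."""
    return sum(FIXED_WEIGHTS[name] * scores[name] for name in META_CLASSES)


def alternating_schedule(steps_per_phase: int) -> Iterator[str]:
    """Yield the active meta-class for each training step.

    Assumption: a plain round-robin with a fixed phase length.
    """
    for name in cycle(META_CLASSES):
        for _ in range(steps_per_phase):
            yield name


def alternating_reward(scores: Dict[str, float], active: str) -> float:
    """Alternating variant: only the active meta-class score is used as the reward."""
    return scores[active]


# Usage: one response scored on all three hypothetical meta-classes.
scores = {"accuracy": 1.0, "completeness": 0.5, "style": 0.0}
schedule = alternating_schedule(steps_per_phase=2)
active_per_step = [next(schedule) for _ in range(6)]
rewards_per_step = [alternating_reward(scores, name) for name in active_per_step]
```

In this sketch the scalarized baseline always sees the same blended number for a given response, whereas the alternating variant exposes each meta-class score unmixed during its own phase, so no weight vector has to be chosen up front.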
Findings
ARL-RR outperforms scalarized methods in model performance.
ARL-RR improves training efficiency across multiple model scales.
Reward aggregation induces a variance contraction effect.
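One way to build intuition for the variance contraction finding is a small numerical check. The sketch below relies on a general fact rather than the paper's own theorem, whose exact statement is not reproduced here: for non-negative weights that sum to 1, the variance of a weighted sum never exceeds the weighted average of the per-dimension variances. The binary rubric scores, their independence, and the weights are all assumptions made for illustration.

```python
import random
import statistics

random.seed(0)

# Hypothetical setup: three independent rubric dimensions, each scored 0 or 1.
NUM_SAMPLES = 10_000
WEIGHTS = [0.5, 0.3, 0.2]  # assumed fixed convex weights (non-negative, sum to 1)

# Each sample is one rubric score vector.
samples = [
    [float(random.random() < 0.5) for _ in WEIGHTS]
    for _ in range(NUM_SAMPLES)
]

# Variance of each rubric dimension taken on its own.
per_dim_var = [
    statistics.pvariance([s[k] for s in samples]) for k in range(len(WEIGHTS))
]

# Variance of the scalarized (aggregated) reward.
aggregated = [sum(w * x for w, x in zip(WEIGHTS, s)) for s in samples]
aggregated_var = statistics.pvariance(aggregated)

# Weighted average of the per-dimension variances, for comparison.
weighted_mean_var = sum(w * v for w, v in zip(WEIGHTS, per_dim_var))

print(f"per-dimension variances: {per_dim_var}")
print(f"aggregated variance:     {aggregated_var:.4f}")
print(f"weighted mean variance:  {weighted_mean_var:.4f}")
```

Under these independence assumptions the aggregated variance is in theory the per-dimension variance scaled by the sum of squared weights (0.25 + 0.09 + 0.04 = 0.38), so the scalar reward carries a noticeably weaker spread than any single rubric dimension.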
Abstract
Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing approaches in RLRR are limited to linearly compressing vector rewards into a scalar reward with fixed weightings, which is sensitive to artificial score design and fails to capture correlations among reward dimensions. To overcome the limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for a fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretically, we show that reward aggregation induces a variance contraction effect, which helps explain the performance…
