Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy

Guangchen Lan; Lian Xiong; Xin Zhou; Hejie Cui; Yuwei Zhang; Mao Li; Zhenyu Shi; Besnik Fetahu; Lihong Li; Xian Li

arXiv:2603.15646 · cs.LG · May 8, 2026

TL;DR

This paper introduces ARL-RR, a reinforcement learning framework that avoids fixed scalar reward aggregation by optimizing one semantic rubric meta-class at a time, improving model performance and training efficiency over scalarized methods.

Contribution

The work proposes an alternating optimization approach for rubric rewards, removing fixed scalarization and dynamically emphasizing objectives, with theoretical and empirical validation.

Findings

01. ARL-RR outperforms scalarized methods in model performance.

02. ARL-RR improves training efficiency across multiple model scales.

03. Reward aggregation induces a variance contraction effect.

Abstract

Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing RLRR approaches are limited to linearly compressing vector rewards into a scalar reward with fixed weights, which is sensitive to artificial score design and fails to capture correlations among reward dimensions. To overcome the limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for a fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretically, we show that reward aggregation induces a variance contraction effect, which helps explain the performance…
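The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the round-robin schedule, and the independent unit-variance Gaussian rubric scores are all assumptions made for illustration, and the paper's actual schedule for choosing a meta-class may be dynamic. The sketch shows a fixed linear scalarization, a reward that exposes one rubric meta-class at a time, and one plausible reading of variance contraction: a uniform average of independent scores has lower variance than any single score.

```python
import random
import statistics


def scalarize(reward_vec, weights):
    """Fixed linear scalarization: weighted sum of the rubric scores."""
    return sum(w * r for w, r in zip(weights, reward_vec))


def alternating_reward(reward_vec, step, period=1):
    """Return the score of a single rubric meta-class.

    Hypothetical round-robin schedule: the active meta-class
    changes every `period` training steps.
    """
    active = (step // period) % len(reward_vec)
    return reward_vec[active]


def variance_demo(num_samples=10_000, num_dims=4, seed=0):
    """Compare the variance of an aggregated reward with a single dimension.

    Assumes independent, unit-variance Gaussian rubric scores, which is a
    simplification; real rubric dimensions are typically correlated.
    """
    rng = random.Random(seed)
    samples = [
        [rng.gauss(0.0, 1.0) for _ in range(num_dims)]
        for _ in range(num_samples)
    ]
    weights = [1.0 / num_dims] * num_dims
    var_aggregated = statistics.pvariance(
        [scalarize(s, weights) for s in samples]
    )
    var_single = statistics.pvariance([s[0] for s in samples])
    return var_aggregated, var_single


if __name__ == "__main__":
    rubric_scores = [0.2, 0.9, 0.5]  # hypothetical scores for three meta-classes
    print(scalarize(rubric_scores, [1 / 3, 1 / 3, 1 / 3]))
    print([alternating_reward(rubric_scores, t) for t in range(6)])
    print(variance_demo())
```

Under these assumptions the aggregated reward's variance is roughly the single-dimension variance divided by the number of dimensions, so the scalar signal separates responses less sharply than any one rubric score does.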
