Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training

Ran Xu; Tianci Liu; Zihan Dong; Tony Yu; Ilgee Hong; Carl Yang; Linjun Zhang; Tao Zhao; Haoyu Wang

arXiv:2602.01511 · cs.CL · February 13, 2026

TL;DR

This paper introduces Rubric-ARM, a reinforcement learning framework that jointly optimizes a rubric generator and a judge to better evaluate complex, non-verifiable responses from large language models, improving judgment accuracy and downstream policy alignment.

Contribution

We propose Rubric-ARM, an alternating reinforcement learning method that jointly trains a rubric generator and a judge, improving response evaluation in non-verifiable domains.

Findings

01 Achieves state-of-the-art performance among baselines on multiple benchmarks.

02 Significantly improves downstream policy alignment.

03 Theoretical analysis indicates that the alternating optimization schedule reduces gradient variance during training.

Abstract

Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates, providing theoretical analysis that demonstrates how this schedule reduces gradient variance during training. Extensive experiments show that Rubric-ARM achieves state-of-the-art performance among baselines on multiple benchmarks…
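
The alternating schedule described in the abstract can be illustrated with a deliberately small sketch. This is not the paper's implementation: the bandit-style task, the `RUBRIC_QUALITY` values, the REINFORCE update, the moving-average baseline, and every function name below are assumptions made for illustration. A "generator" picks one of three candidate rubrics, a "judge" decides whether to follow or invert the rubric-guided signal, and the reward is 1 when the verdict agrees with a simulated preference label. In each phase only one of the two policies is updated while the other stays frozen.

```python
import math
import random

# Hypothetical toy setup: how often a judge that follows rubric r agrees with
# the preference label. These numbers are made up for illustration.
RUBRIC_QUALITY = [0.5, 0.7, 0.9]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    """Draw an index from a categorical distribution."""
    threshold, acc = rng.random(), 0.0
    for index, p in enumerate(probs):
        acc += p
        if threshold < acc:
            return index
    return len(probs) - 1


def reinforce_step(logits, action, advantage, lr):
    """In-place REINFORCE update for a softmax policy over discrete actions."""
    probs = softmax(logits)
    for i in range(len(logits)):
        grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += lr * advantage * grad_log_prob


def expected_accuracy(gen_logits, judge_logits):
    """Exact agreement rate of the current generator/judge pair in the toy task."""
    total = 0.0
    for r, p_rubric in enumerate(softmax(gen_logits)):
        p_follow = softmax(judge_logits[r])[0]
        q = RUBRIC_QUALITY[r]
        total += p_rubric * (p_follow * q + (1.0 - p_follow) * (1.0 - q))
    return total


def train_alternating(phases=20, steps_per_phase=200, lr=0.2, seed=0):
    rng = random.Random(seed)
    gen_logits = [0.0] * len(RUBRIC_QUALITY)             # rubric generator policy
    judge_logits = [[0.0, 0.0] for _ in RUBRIC_QUALITY]  # per rubric: follow / invert
    baseline = 0.5                                       # moving-average reward baseline
    for phase in range(phases):
        update_judge = phase % 2 == 0  # even phases train the judge, odd the generator
        for _ in range(steps_per_phase):
            rubric = sample(softmax(gen_logits), rng)
            verdict = sample(softmax(judge_logits[rubric]), rng)
            q = RUBRIC_QUALITY[rubric]
            p_correct = q if verdict == 0 else 1.0 - q
            reward = 1.0 if rng.random() < p_correct else 0.0
            advantage = reward - baseline
            baseline = 0.9 * baseline + 0.1 * reward
            if update_judge:
                reinforce_step(judge_logits[rubric], verdict, advantage, lr)
            else:
                reinforce_step(gen_logits, rubric, advantage, lr)
    return gen_logits, judge_logits


if __name__ == "__main__":
    gen, judge = train_alternating()
    print("rubric probabilities:", [round(p, 3) for p in softmax(gen)])
    print("expected accuracy:", round(expected_accuracy(gen, judge), 3))
```

Freezing one policy per phase means each learner faces a fixed partner while it updates, which is the intuition behind the non-stationarity argument in the abstract. The sketch makes no attempt to reproduce the paper's variance analysis or its training setup.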

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling