Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models

Weijie Qiu; Dai Guan; Junxin Wang; Zhihang Li; Yongbo Gai; Mengyu Zhou; Erchao Zhao; Xiaoxi Jiang; Guanjun Jiang

arXiv:2603.16600·cs.CV·March 19, 2026

TL;DR

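This paper introduces Proxy-GRM, a reinforcement learning approach for VLM generative reward models. Lightweight proxy agents check whether a candidate rubric alone is enough to recover the correct preference ordering, and their prediction accuracy is used as a reward for rubric quality. The authors report improved reward accuracy and rubrics that transfer to unseen evaluators.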

Contribution

The paper proposes Proxy-GRM, a method that explicitly optimizes rubrics during RL training via proxy-guided verification, with the aim of improving rubric quality and transferability in VLM reward models.

Findings

01 Proxy-GRM achieves state-of-the-art results on multiple reward benchmarks.

02 Proxy rubrics transfer effectively to unseen evaluators, improving test-time reward accuracy.

03 Proxy-SFT outperforms Proxy-RL as a verifier in rubric quality.

Abstract

Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Reinforcement Learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy's prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce…
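The rubric-quality reward described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `PreferencePair`, `rubric_reward`, and `keyword_proxy` are hypothetical, the reward is assumed to be a binary 0/1 match per rubric, and the keyword-overlap proxy is a toy stand-in for the trained Proxy-SFT or Proxy-RL agent.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical container for one preference example (names are assumptions).
@dataclass(frozen=True)
class PreferencePair:
    query: str
    response_a: str
    response_b: str
    preferred: str  # "A" or "B"

# A proxy maps (query, response_a, response_b, rubric) to a predicted label.
ProxyFn = Callable[[str, str, str, str], str]

def rubric_reward(proxy: ProxyFn, pair: PreferencePair, rubric: str) -> float:
    """Return 1.0 if the proxy recovers the gold preference from the rubric.

    Assumption: a binary per-rubric reward. The abstract only says that the
    proxy's prediction accuracy serves as the rubric-quality reward.
    """
    predicted = proxy(pair.query, pair.response_a, pair.response_b, rubric)
    return 1.0 if predicted == pair.preferred else 0.0

def keyword_proxy(query: str, response_a: str, response_b: str, rubric: str) -> str:
    """Toy proxy: prefer the response that mentions more rubric terms.

    This stands in for a trained proxy model and is purely illustrative.
    """
    terms = {w.strip(".,;:").lower() for w in rubric.split() if len(w) > 3}
    score_a = sum(term in response_a.lower() for term in terms)
    score_b = sum(term in response_b.lower() for term in terms)
    return "A" if score_a >= score_b else "B"

pair = PreferencePair(
    query="How many dogs are in the image?",
    response_a="There are two dogs on the grass.",
    response_b="It is a nice picture.",
    preferred="A",
)
good_rubric = "The answer must state the number of dogs."
bad_rubric = "Prefer the response that praises the picture."

good_reward = rubric_reward(keyword_proxy, pair, good_rubric)
bad_reward = rubric_reward(keyword_proxy, pair, bad_rubric)
print(good_reward, bad_reward)
```

In this toy setup the rubric that points at the relevant criterion lets the proxy recover the gold preference, while the misleading rubric does not. In an RL loop, a signal of this kind would be combined with whatever other rewards the training recipe uses; how Proxy-GRM combines them is not specified in the text above.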

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)