TL;DR
This paper introduces Auto-Rubric as Reward (ARR) and RPO, which externalize a VLM's implicit preference knowledge into explicit, interpretable rubrics to better align multimodal generative models with human preferences.
Contribution
It proposes a novel approach to reward modeling that converts implicit preferences into explicit criteria, enhancing reliability and data efficiency in multimodal alignment.
Findings
ARR outperforms pairwise reward models on benchmarks.
Rubric-based evaluation reduces bias and improves interpretability.
Explicit rubrics enable zero-shot and few-shot deployment.
Abstract
Aligning multimodal generative models with human preferences demands reward signals that respect the compositional, multi-dimensional structure of human judgment. Prevailing RLHF approaches reduce this structure to scalar or pairwise labels, collapsing nuanced preferences into opaque parametric proxies and exposing vulnerabilities to reward hacking. While recent Rubrics-as-Reward (RaR) methods attempt to recover this structure through explicit criteria, generating rubrics that are simultaneously reliable, scalable, and data-efficient remains an open problem. We introduce Auto-Rubric as Reward (ARR), a framework that reframes reward modeling from implicit weight optimization to explicit, criteria-based decomposition. Before any pairwise comparison, ARR externalizes a VLM's internalized preference knowledge as prompt-specific rubrics, translating holistic intent into independently…
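The general idea of comparing two outputs against an explicit rubric, rather than asking for a single holistic preference label, can be sketched as follows. This is a minimal illustration and not the paper's implementation: the `Criterion` class, the weights, the `Judge` signature, and the keyword-based `keyword_judge` are all hypothetical stand-ins. In a real system the judge would presumably be a VLM that checks each criterion independently.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    """One independently checkable rubric item (hypothetical structure)."""
    name: str
    weight: float


# Hypothetical judge signature: (criterion name, candidate) -> True if satisfied.
Judge = Callable[[str, str], bool]


def rubric_score(rubric: List[Criterion], candidate: str, judge: Judge) -> float:
    """Return the weighted fraction of rubric criteria the candidate satisfies."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c.weight for c in rubric if judge(c.name, candidate))
    return earned / total


def prefer(rubric: List[Criterion], a: str, b: str, judge: Judge) -> str:
    """Compare two candidates by rubric score; returns 'a', 'b', or 'tie'."""
    score_a = rubric_score(rubric, a, judge)
    score_b = rubric_score(rubric, b, judge)
    if score_a == score_b:
        return "tie"
    return "a" if score_a > score_b else "b"


def keyword_judge(criterion: str, candidate: str) -> bool:
    """Toy judge for illustration only: case-insensitive substring match."""
    return criterion.lower() in candidate.lower()


# Assumed example: a prompt-specific rubric for "a red car on a rainy street at night".
rubric = [
    Criterion("red car", 2.0),
    Criterion("rainy street", 1.0),
    Criterion("night", 1.0),
]
caption_a = "A red car parked on a rainy street at night"
caption_b = "A blue car on a sunny road"

print(prefer(rubric, caption_a, caption_b, keyword_judge))
```

Because each criterion is scored separately, the per-criterion results can be inspected to see why one candidate was preferred, which a single scalar or pairwise label does not allow.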
