TL;DR
C2 is a cooperative framework for reward modeling: a rubric generator trained only from binary preferences proposes rubrics, and the reward model critically assesses them before using them, improving judgment reliability without costly rubric annotations.
Contribution
The paper presents a novel cooperative approach to rubric-augmented reward modeling that reduces reliance on external annotations and mitigates rubric quality issues.
Findings
C2 outperforms reasoning reward models on RM-Bench.
C2 achieves significant gains on AlpacaEval 2.0.
An 8B reward model matches larger models' performance without external rubrics.
Abstract
Rubric-augmented verification guides reward models with explicit evaluation criteria, yielding more reliable judgments than single-model verification. However, most existing methods require costly rubric annotations, limiting scalability. Moreover, we find that rubric generation is vulnerable to a failure of cooperation; low-quality rubrics actively mislead reward models rather than help. Inspired by the principle of cooperative communication, we propose Cooperative yet Critical reward modeling (C2), a framework that significantly improves reward model judgments by having the reward model critically collaborate with a rubric generator trained solely from binary preferences. In C2, we synthesize helpful and misleading rubric pairs by measuring how each rubric shifts the reward model toward or away from the correct preference. Using these contrastive pairs, we train a cooperative rubric…
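The pair-synthesis step described in the abstract (labeling a rubric as helpful or misleading by how it shifts the reward model toward or away from the correct preference) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `judge` callable, the `ScoredRubric` type, the `margin` threshold, and the use of a probability difference as the shift measure.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: judge(prompt, rubric) returns the probability that
# the reward model picks the known-correct response. rubric=None means the
# model judges without any rubric. This signature is an assumption.
JudgeFn = Callable[[str, Optional[str]], float]


@dataclass
class ScoredRubric:
    rubric: str
    shift: float  # P(correct | rubric) - P(correct | no rubric)


def score_rubrics(prompt: str, rubrics: List[str], judge: JudgeFn) -> List[ScoredRubric]:
    """Measure how far each candidate rubric moves the judgment."""
    base = judge(prompt, None)
    return [ScoredRubric(r, judge(prompt, r) - base) for r in rubrics]


def contrastive_pairs(scored: List[ScoredRubric], margin: float = 0.1) -> List[Tuple[str, str]]:
    """Pair every helpful rubric with every misleading one.

    A rubric counts as helpful if it raises the probability of the correct
    preference by at least `margin`, and misleading if it lowers it by at
    least `margin`. Rubrics with a smaller effect are discarded. The margin
    is an illustrative choice, not a value from the paper.
    """
    helpful = [s.rubric for s in scored if s.shift >= margin]
    misleading = [s.rubric for s in scored if s.shift <= -margin]
    return list(product(helpful, misleading))
```

Each resulting `(helpful, misleading)` tuple could then serve as a chosen/rejected example for preference-style training of a rubric generator. The abstract is truncated before it describes the training objective, so that last step is left open here.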
