TL;DR
ConsistRM is a self-training framework that trains generative reward models (GRMs) with consistency-aware rewards instead of human annotations, improving training stability and output consistency.
Contribution
The paper introduces a self-training method whose consistency-aware rewards stabilize GRM training and improve alignment without human-labeled data.
Findings
Outperforms vanilla RFT by 1.5% on benchmark datasets.
Enhances output consistency and reduces position bias.
Provides stable pseudo-labels through temporal consistency.
Abstract
Generative reward models (GRMs) have emerged as a promising approach for aligning Large Language Models (LLMs) with human preferences by offering greater representational capacity and flexibility than traditional scalar reward models. However, GRMs face two major challenges: reliance on costly human-annotated data restricts scalability, and self-training approaches often suffer from instability and vulnerability to reward hacking. To address these issues, we propose ConsistRM, a self-training framework that enables effective and stable GRM training without human annotations. ConsistRM incorporates the Consistency-Aware Answer Reward, which produces reliable pseudo-labels with temporal consistency, thereby providing more stable model optimization. Moreover, the Consistency-Aware Critique Reward is introduced to assess semantic consistency across multiple critiques and allocates…
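The abstract does not spell out how the Consistency-Aware Answer Reward is computed, so the following is only a minimal sketch of one plausible reading: pool the preference votes a GRM produced for the same pair at several training steps, take the majority as the pseudo-label, and reward a new prediction only when it agrees with a sufficiently consistent label. The function names, the `min_agreement` threshold, and the 0/1 reward values are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_label(votes):
    """Return (label, agreement ratio) for a non-empty list of votes such as 'A' or 'B'."""
    counts = Counter(votes)
    label, count = counts.most_common(1)[0]
    return label, count / len(votes)


def temporal_pseudo_label(votes_by_step, min_agreement=0.6):
    """Pool votes from several training steps into one pseudo-label.

    Assumption: 'temporal consistency' is approximated here by agreement
    across steps. Returns (None, agreement) when the votes are too mixed.
    """
    pooled = [vote for step_votes in votes_by_step for vote in step_votes]
    label, agreement = majority_label(pooled)
    return (label if agreement >= min_agreement else None), agreement


def answer_reward(prediction, pseudo_label):
    """Hypothetical 0/1 reward: 1.0 only if the prediction matches a trusted pseudo-label."""
    if pseudo_label is None:
        return 0.0
    return 1.0 if prediction == pseudo_label else 0.0


# Usage: two earlier steps each sampled three judgments for the same response pair.
history = [["A", "A", "B"], ["A", "B", "A"]]
label, agreement = temporal_pseudo_label(history)
reward = answer_reward("A", label)
```

Withholding the reward when agreement is low is one simple way to avoid reinforcing noisy pseudo-labels; the actual framework may handle such cases differently.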
