Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models

Xin Zhou; Yiwen Guo; Ruotian Ma; Tao Gui; Qi Zhang; Xuanjing Huang

arXiv:2502.08922·cs.AI·February 14, 2025

TL;DR

This paper introduces Self-Consistent Internal Rewards (SCIR), a framework that enhances the consistency of internal reward models in language models, leading to improved alignment with human preferences and more reliable self-generated preference data.

Contribution

The paper proposes a novel SCIR framework that enforces consistency among internal reward models during training, significantly improving alignment performance of language models.

Findings

01. SCIR improves alignment performance over baseline methods.

02. Enforcing internal reward consistency enhances reward modeling capability.

03. Selective use of consistent preference data boosts reliability.

Abstract

Aligning Large Language Models (LLMs) with human preferences is crucial for their deployment in real-world applications. Recent advancements in Self-Rewarding Language Models suggest that an LLM can use its internal reward models (such as LLM-as-a-Judge; Yuan et al., 2024) to generate preference data, improving alignment performance without costly human annotation. However, we find that different internal reward models within the same LLM often generate inconsistent preferences. This inconsistency raises concerns about the reliability of self-generated preference data, hinders overall alignment performance, and highlights the need for further research to ensure reliable and coherent alignment with human preferences. To address this limitation, we propose Self-Consistent Internal Rewards (SCIR), a novel framework designed to enhance consistency among internal reward models during training.…
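The idea of keeping only preference data on which the internal reward models agree can be illustrated with a small sketch. This is not the SCIR training procedure, which the abstract describes only at a high level; it is a minimal filter written under stated assumptions. The names `RewardFn`, `PreferencePair`, `consistent_pair`, and `filter_consistent` are hypothetical, and each reward model is assumed to be a plain function that scores a (prompt, response) pair.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Assumption: a reward model is any callable scoring (prompt, response).
# In practice these would be two views of the same LLM; here they are stand-ins.
RewardFn = Callable[[str, str], float]


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def consistent_pair(
    prompt: str,
    resp_a: str,
    resp_b: str,
    reward_models: Sequence[RewardFn],
) -> Optional[PreferencePair]:
    """Return a pair only if all reward models rank the responses the same way."""
    if not reward_models:
        raise ValueError("at least one reward model is required")
    votes = []
    for rm in reward_models:
        diff = rm(prompt, resp_a) - rm(prompt, resp_b)
        if diff == 0:
            return None  # a tie gives no usable preference
        votes.append(diff > 0)
    if all(votes):
        return PreferencePair(prompt, resp_a, resp_b)
    if not any(votes):
        return PreferencePair(prompt, resp_b, resp_a)
    return None  # the reward models disagree, so the pair is dropped


def filter_consistent(
    candidates: Sequence[Tuple[str, str, str]],
    reward_models: Sequence[RewardFn],
) -> List[PreferencePair]:
    """Keep only the candidates on which every reward model agrees."""
    pairs = []
    for prompt, resp_a, resp_b in candidates:
        pair = consistent_pair(prompt, resp_a, resp_b, reward_models)
        if pair is not None:
            pairs.append(pair)
    return pairs
```

Dropping disagreeing pairs trades data volume for reliability: fewer pairs survive, but each surviving pair is backed by every reward signal considered.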

Taxonomy

Topics: Cognitive Functions and Memory