PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation
Yongfu Xue

TL;DR
PIRA is a novel training paradigm for reward models that leverages preference instruction-following, aggregates diverse rewards, and stabilizes estimates to improve alignment of LLMs with human preferences and reduce overoptimization.
Contribution
PIRA introduces a new training approach combining preference instruction reformulation, reward aggregation, and output stabilization to address key limitations of existing reward models.
Findings
Significantly improves reward model performance.
Enhances generalization across tasks.
Reduces reward overoptimization.
Abstract
Reward models are pivotal for aligning Large Language Models (LLMs) with human preferences. Existing approaches face two key limitations. First, discriminative reward models require large-scale annotated data, because they cannot exploit the preference instruction-following capability of LLMs that generative reward models can draw on. Second, reward models are particularly prone to reward overoptimization, where LLMs exploit weaknesses in the reward function instead of improving true alignment. We introduce PIRA, a training paradigm that integrates three complementary strategies to address these challenges: (1) reformulating question-answer pairs into preference-task instructions to explicitly leverage LLMs' preference instruction-following capability, (2) averaging the rewards aggregated from diverse preference-task instructions for each sample, which mitigates task-specific bias and…
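Strategies (1) and (2) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the template wording, the names `TEMPLATES`, `build_prompts`, and `aggregated_reward`, and the `score_fn` callable standing in for a trained reward model are all assumptions made for illustration.

```python
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical preference-task instruction templates; the wording used in
# the paper is not given in this summary, so these are illustrative only.
TEMPLATES: Sequence[str] = (
    "Rate how helpful the answer is.\nQuestion: {question}\nAnswer: {answer}",
    "Judge whether a human would prefer this answer.\nQuestion: {question}\nAnswer: {answer}",
    "Assess the overall quality of the response.\nQuestion: {question}\nAnswer: {answer}",
)


def build_prompts(question: str, answer: str,
                  templates: Sequence[str] = TEMPLATES) -> List[str]:
    """Strategy (1): recast one question-answer pair as several
    preference-task instructions, one per template."""
    return [t.format(question=question, answer=answer) for t in templates]


def aggregated_reward(question: str, answer: str,
                      score_fn: Callable[[str], float],
                      templates: Sequence[str] = TEMPLATES) -> float:
    """Strategy (2): score every instruction variant and average the
    rewards. `score_fn` is an assumed stand-in for a reward model that
    maps a prompt string to a scalar reward."""
    prompts = build_prompts(question, answer, templates)
    return mean(score_fn(p) for p in prompts)
```

In this sketch, a reward model that is biased toward one phrasing of the task contributes only one term of the average, which is the intuition behind using aggregation to reduce task-specific bias.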
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Sentiment Analysis and Opinion Mining
