Aligning Language Models with Human Preferences via a Bayesian Approach
Jiashuo Wang, Haozhao Wang, Shichao Sun, Wenjie Li

TL;DR
This paper introduces a Bayesian approach to better align language models with human preferences by modeling preference disagreements, improving performance over existing methods in human-centric NLG tasks.
Contribution
It proposes a novel Bayesian framework (d-PM) to capture human preference disagreements and uses contrastive learning for efficient NLG training, surpassing prior methods.
Findings
Outperforms previous SOTA models in automatic evaluations
Achieves higher human satisfaction scores
Demonstrates robustness across multiple NLG tasks
Abstract
In the quest to advance human-centric natural language generation (NLG) systems, ensuring alignment between NLG models and human preferences is crucial. For this alignment, current popular methods leverage a reinforcement learning (RL) approach with a reward model trained on feedback from humans. However, inherent disagreements due to the subjective nature of human preferences pose a significant challenge for training the reward model, resulting in a deterioration of the NLG performance. To tackle this issue, previous approaches typically rely on majority voting or averaging to consolidate multiple inconsistent preferences into a merged one. Although straightforward to understand and execute, such methods suffer from an inability to capture the nuanced degrees of disagreement among humans and may only represent a specialized subset of individuals, thereby lacking the ability to…
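The aggregation problem described in the abstract can be illustrated with a small sketch. This is not the paper's d-PM method; it only shows, under assumed toy data, how majority voting and averaging can give identical merged labels for annotation sets with very different levels of disagreement. The function names and annotation values are hypothetical.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def label_distribution(labels):
    """Return the fraction of annotators behind each label, which keeps disagreement visible."""
    counts = Counter(labels)
    return {label: counts[label] / len(labels) for label in sorted(counts)}


# Hypothetical binary votes: 1 = annotator prefers the response, 0 = does not.
unanimous_votes = [1, 1, 1, 1, 1]
contested_votes = [1, 1, 1, 0, 0]

# Hypothetical 1-5 ratings for the same response from four annotators.
agreed_ratings = [3, 3, 3, 3]
polarized_ratings = [1, 5, 1, 5]

# Majority voting merges both vote sets into the same label.
print(majority_vote(unanimous_votes), majority_vote(contested_votes))

# Averaging merges both rating sets into the same score.
print(mean(agreed_ratings), mean(polarized_ratings))

# The full distributions still differ, so the disagreement is not lost.
print(label_distribution(contested_votes))
print(label_distribution(polarized_ratings))
```

In both pairs the merged value is the same even though one set is unanimous and the other is split, which is the information loss the abstract attributes to majority voting and averaging.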
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques
Methods: Contrastive Learning
