Aligning LLMs with Domain Invariant Reward Models
David Wu, Sanjiban Choudhury

TL;DR
This paper introduces \method, a framework for training domain-invariant reward models to align large language models with human preferences across diverse, data-scarce domains by leveraging source domain feedback.
Contribution
The paper proposes \method, an approach that learns domain-invariant reward models by optimizing a dual loss (a domain loss plus a source preference loss), enabling preference alignment in target domains that lack direct preference data.
Findings
Effective transfer across multiple domains including cross-lingual and noisy data
Improved accuracy and correlation in preference modeling
General applicability demonstrated across four distinct settings
Abstract
Aligning large language models (LLMs) to human preferences is challenging in domains where preference data is unavailable. We address the problem of learning reward models for such target domains by leveraging feedback collected from simpler source domains, where human preferences are easier to obtain. Our key insight is that, while domains may differ significantly, human preferences convey domain-agnostic concepts that can be effectively captured by a reward model. We propose \method, a framework that trains domain-invariant reward models by optimizing a dual loss: a domain loss that minimizes the divergence between source and target distribution, and a source loss that optimizes preferences on the source domain. We show \method is a general approach that we evaluate and analyze across 4 distinct settings: (1) Cross-lingual transfer (accuracy: ), (2)…
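The dual loss described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the form of either term, so the sketch assumes a Bradley-Terry negative log-likelihood for the source loss and a squared distance between mean feature vectors as a simple stand-in for the domain divergence. The function names and the weight `lam` are hypothetical.

```python
import math

def source_preference_loss(r_chosen, r_rejected):
    """Assumed source loss: Bradley-Terry NLL, mean of -log sigmoid(r_c - r_r)."""
    total = 0.0
    for c, r in zip(r_chosen, r_rejected):
        z = -(c - r)
        # Numerically stable softplus(z) = log(1 + exp(z)).
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / len(r_chosen)

def mean_vector(feats):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(dim)]

def domain_loss(source_feats, target_feats):
    """Assumed domain loss: squared distance between source and target feature means.

    A stand-in for the divergence between source and target distributions;
    the paper's actual divergence may differ.
    """
    mean_s = mean_vector(source_feats)
    mean_t = mean_vector(target_feats)
    return sum((a - b) ** 2 for a, b in zip(mean_s, mean_t))

def dual_loss(r_chosen, r_rejected, source_feats, target_feats, lam=1.0):
    """Source preference loss plus lam times the domain loss (lam is hypothetical)."""
    return (source_preference_loss(r_chosen, r_rejected)
            + lam * domain_loss(source_feats, target_feats))
```

In this sketch only the source domain contributes preference pairs, while the target domain contributes unlabeled features to the domain term, which matches the setting where the target domain has no preference data.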
Taxonomy
Topics: Natural Language Processing Techniques · Digital Rights Management and Security
