Aligning LLMs with Domain Invariant Reward Models

David Wu; Sanjiban Choudhury

arXiv:2501.00911 · cs.LG · January 3, 2025

TL;DR

This paper introduces DIAL, a framework for training domain-invariant reward models. It uses feedback from a source domain to align large language models with human preferences in target domains where preference data is scarce.

Contribution

The paper proposes DIAL, an approach that learns domain-invariant reward models by optimizing a dual loss (a domain loss plus a source preference loss), enabling preference alignment in target domains that lack direct preference data.

Findings

01 Effective transfer across multiple domains, including cross-lingual and noisy data

02 Improved accuracy and correlation in preference modeling

03 General applicability demonstrated across four distinct settings

Abstract

Aligning large language models (LLMs) to human preferences is challenging in domains where preference data is unavailable. We address the problem of learning reward models for such target domains by leveraging feedback collected from simpler source domains, where human preferences are easier to obtain. Our key insight is that, while domains may differ significantly, human preferences convey domain-agnostic concepts that can be effectively captured by a reward model. We propose DIAL, a framework that trains domain-invariant reward models by optimizing a dual loss: a domain loss that minimizes the divergence between the source and target distributions, and a source loss that optimizes preferences on the source domain. We show that DIAL is a general approach, which we evaluate and analyze across 4 distinct settings: (1) cross-lingual transfer (accuracy: 0.621 → 0.661), (2)…
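
The dual loss described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `domain_weight` parameter, and the choice of divergence are all assumptions made for illustration. The source loss is written as a standard Bradley-Terry preference loss, and the domain loss uses the squared distance between mean feature vectors as a simple stand-in, since the abstract does not say which divergence is used.

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """Source loss (assumed form): mean of -log sigmoid(r_chosen - r_rejected)."""
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed with log1p for accuracy.
    losses = [math.log1p(math.exp(-(c - r))) for c, r in zip(r_chosen, r_rejected)]
    return sum(losses) / len(losses)


def mean_embedding_gap(source_feats, target_feats):
    """Domain loss stand-in: squared distance between mean feature vectors.

    This is a deliberately simple proxy for a divergence between the source
    and target feature distributions, not the divergence used in the paper.
    """
    dim = len(source_feats[0])
    mu_s = [sum(f[i] for f in source_feats) / len(source_feats) for i in range(dim)]
    mu_t = [sum(f[i] for f in target_feats) / len(target_feats) for i in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


def dual_loss(r_chosen, r_rejected, source_feats, target_feats, domain_weight=1.0):
    """Combine the source preference loss with the weighted domain loss."""
    source = bradley_terry_loss(r_chosen, r_rejected)
    domain = mean_embedding_gap(source_feats, target_feats)
    return source + domain_weight * domain


if __name__ == "__main__":
    # Toy inputs: two labeled source preference pairs, unlabeled target features.
    loss = dual_loss(
        r_chosen=[1.2, 0.4],
        r_rejected=[0.3, 0.9],
        source_feats=[[1.0, 0.0], [3.0, 0.0]],
        target_feats=[[0.0, 0.0], [0.0, 0.0]],
        domain_weight=0.5,
    )
    print(f"dual loss: {loss:.4f}")
```

Only the source domain needs preference labels here; the target domain contributes unlabeled features to the domain term, which matches the setting the abstract describes. In a trainable version, both terms would be computed from a shared encoder so that minimizing the domain term pushes source and target features together.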

Code & Models

Repositories

portal-cornell/dial
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Digital Rights Management and Security