Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models

Xiang Ji; Sanjeev Kulkarni; Mengdi Wang; Tengyang Xie

arXiv:2406.04274·cs.LG·June 7, 2024

Open Access

TL;DR

This paper introduces SPAC, a scalable offline preference optimization method with self-play for aligning large language models. The method comes with a convergence guarantee and achieves competitive empirical results.

Contribution

The paper presents SPAC, described by the authors as the first offline alignment method for LLMs that is both provable and scalable.

Findings

01. SPAC converges under single-policy concentrability.

02. SPAC performs competitively on a 7B Mistral model.

03. Theoretical analysis supports its effectiveness in large-scale settings.

Abstract

This work studies the challenge of aligning large language models (LLMs) with offline preference data. We focus on alignment by Reinforcement Learning from Human Feedback (RLHF) in particular. While popular preference optimization methods exhibit good empirical performance in practice, they are not theoretically guaranteed to converge to the optimal policy and can provably fail when the data coverage is sparse by classical offline reinforcement learning (RL) results. On the other hand, a recent line of work has focused on theoretically motivated preference optimization methods with provable guarantees, but these are not computationally efficient for large-scale applications like LLM alignment. To bridge this gap, we propose SPAC, a new offline preference optimization method with self-play, inspired by the on-average pessimism technique from the offline RL literature, to be the first…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Focus