Clone-Robust AI Alignment
Ariel D. Procaccia, Benjamin Schiffer, Shirley Zhang

TL;DR
This paper addresses the challenge of aligning large language models with human preferences. Drawing on social choice theory, it introduces an RLHF algorithm that remains reliable when the dataset of alternatives is unbalanced, for example when it contains many near-duplicate answers.
Contribution
The paper proposes a weighted MLE algorithm for RLHF that is robust to approximate clones (near-duplicate alternatives), improving alignment reliability on unbalanced datasets.
Findings
The standard RLHF algorithm, based on regularized MLE, is not robust to approximate clones.
Weighted MLE guarantees robustness to approximate clones.
The new method preserves theoretical properties of RLHF.
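The first finding can be illustrated with a toy example. The sketch below is not the paper's construction or its weighted MLE. It assumes a Bradley-Terry preference model with an L2 penalty on the rewards, fitted by plain gradient ascent, and all names (`fit_regularized_bt`, `base_wins`, `clone_wins`) and comparison counts are hypothetical. It fits rewards for two alternatives, then refits after adding an exact duplicate of the first alternative with identical comparison data, so the learned rewards before and after can be compared.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_regularized_bt(wins, n_items, lam=1.0, lr=0.05, steps=5000):
    """Fit Bradley-Terry rewards by gradient ascent on an L2-regularized log-likelihood.

    wins[(i, j)] is the number of comparisons in which item i was preferred to item j.
    Objective (an assumption of this sketch):
        sum_{(i, j)} wins[(i, j)] * log sigmoid(r[i] - r[j]) - (lam / 2) * sum_i r[i]**2
    """
    r = [0.0] * n_items
    for _ in range(steps):
        # Gradient of the L2 penalty.
        grad = [-lam * ri for ri in r]
        # Gradient of the log-likelihood, one term per ordered pair.
        for (i, j), count in wins.items():
            g = count * (1.0 - sigmoid(r[i] - r[j]))
            grad[i] += g
            grad[j] -= g
        r = [ri + lr * gi for ri, gi in zip(r, grad)]
    return r


# Hypothetical data: item 0 is preferred to item 1 in 7 of 10 comparisons.
base_wins = {(0, 1): 7, (1, 0): 3}

# Item 2 is an exact clone of item 0: same record against item 1,
# and an even split when compared with item 0 itself.
clone_wins = {
    (0, 1): 7, (1, 0): 3,
    (2, 1): 7, (1, 2): 3,
    (0, 2): 5, (2, 0): 5,
}

base = fit_regularized_bt(base_wins, n_items=2)
cloned = fit_regularized_bt(clone_wins, n_items=3)

print("rewards without clone:", [round(x, 3) for x in base])
print("rewards with clone:   ", [round(x, 3) for x in cloned])
```

In this toy setup the clone receives the same reward as the item it duplicates, but the rewards of both original items move even though no annotator's preference between them changed. That shift is the kind of sensitivity to duplicated alternatives that clone-robustness is meant to rule out.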
Abstract
A key challenge in training Large Language Models (LLMs) is properly aligning them with human preferences. Reinforcement Learning with Human Feedback (RLHF) uses pairwise comparisons from human annotators to train reward functions and has emerged as a popular alignment method. However, input datasets in RLHF are not necessarily balanced in the types of questions and answers that are included. Therefore, we want RLHF algorithms to perform well even when the set of alternatives is not uniformly distributed. Drawing on insights from social choice theory, we introduce robustness to approximate clones, a desirable property of RLHF algorithms which requires that adding near-duplicate alternatives does not significantly change the learned reward function. We first demonstrate that the standard RLHF algorithm based on regularized maximum likelihood estimation (MLE) fails to satisfy this…
Taxonomy
Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications · Multi-Agent Systems and Negotiation
Methods: Sparse Evolutionary Training
