On the Exponential Convergence for Offline RLHF with Pairwise Comparisons
Zhirui Chen, Vincent Y. F. Tan

TL;DR
This paper introduces RL-LOW, an algorithm for offline reinforcement learning from human feedback with pairwise comparisons, achieving exponential convergence in simple regret and matching lower bounds, with extensions to privacy-preserving settings.
Contribution
The paper proposes RL-LOW, the first algorithm with exponential convergence guarantees for offline RLHF with pairwise comparisons, and establishes matching instance-dependent lower bounds.
Findings
RL-LOW achieves exponential simple regret decay of exp(-Ω(n/H)).
Lower bounds match the upper bounds order-wise, proving optimality.
The hardness parameter H remains unchanged under differential privacy constraints.
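To make the rate in the first finding concrete, the sketch below evaluates a bound of the form exp(-c·n/H). The constant c is hidden by the Ω notation and is not stated on this page, so c = 1 is purely an illustrative assumption. The function names `regret_bound` and `samples_needed` are likewise hypothetical.

```python
import math


def regret_bound(n: int, hardness: float, c: float = 1.0) -> float:
    """Evaluate exp(-c * n / H). The constant c is an assumed placeholder."""
    if hardness <= 0:
        raise ValueError("hardness H must be positive")
    return math.exp(-c * n / hardness)


def samples_needed(target: float, hardness: float, c: float = 1.0) -> int:
    """Smallest n with exp(-c * n / H) <= target, for 0 < target < 1."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie in (0, 1)")
    return math.ceil(hardness * math.log(1.0 / target) / c)
```

Under this form, doubling n squares the bound, and the sample size needed for a fixed target grows linearly in H.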
Abstract
We consider the problem of offline reinforcement learning from human feedback (RLHF) with pairwise comparisons proposed by Zhu et al. (2023), where the implicit reward is a linear function of an unknown parameter. Given an offline dataset, our objective consists in ascertaining the optimal action for each state, with the ultimate goal of minimizing the simple regret. We propose an algorithm, RL with Locally Optimal Weights or RL-LOW, which yields an exponential form of simple regret of exp(-Ω(n/H)), where n is the number of data samples and H denotes an instance-dependent hardness quantity that depends explicitly on the suboptimality gap of each action. Furthermore, we derive a first-of-its-kind instance-dependent lower bound in offline RLHF with pairwise comparisons. Interestingly, we observe that the lower and…
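The setting in the abstract can be sketched as follows. This is not RL-LOW, whose details are not given on this page. It only illustrates a linear reward, pairwise comparison data, and the simple regret of a candidate policy. The Bradley-Terry-Luce preference model, uniform sampling of states and action pairs, the state weights `rho`, and all function names are assumptions made for illustration.

```python
import math
import random


def reward(features, theta):
    """Linear reward: inner product of a feature vector with theta."""
    return sum(f * t for f, t in zip(features, theta))


def sample_comparisons(phi, theta, n, rng):
    """Draw n comparisons (state, a0, a1, label); label is 1 if a0 is preferred.

    phi[s][a] is the feature vector of action a in state s.
    Preferences follow an assumed Bradley-Terry-Luce model.
    """
    data = []
    for _ in range(n):
        s = rng.randrange(len(phi))
        a0, a1 = rng.sample(range(len(phi[s])), 2)  # two distinct actions
        diff = reward(phi[s][a0], theta) - reward(phi[s][a1], theta)
        p_first = 1.0 / (1.0 + math.exp(-diff))
        data.append((s, a0, a1, 1 if rng.random() < p_first else 0))
    return data


def simple_regret(phi, theta, policy, rho):
    """Weighted gap between the best reward and the policy's reward per state."""
    total = 0.0
    for s, actions in enumerate(phi):
        values = [reward(f, theta) for f in actions]
        total += rho[s] * (max(values) - values[policy[s]])
    return total
```

A policy that picks the reward-maximizing action in every state has zero simple regret under this definition. Any estimator built from the comparison data can be scored the same way.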
Taxonomy
Topics: Supply Chain and Inventory Management · Auction Theory and Applications
