On the Exponential Convergence for Offline RLHF with Pairwise Comparisons

Zhirui Chen; Vincent Y. F. Tan

arXiv:2406.12205 · cs.LG · January 23, 2026


TL;DR

This paper introduces RL-LOW, an algorithm for offline reinforcement learning from human feedback with pairwise comparisons. Its simple regret decays exponentially in the number of samples, the upper bound matches an instance-dependent lower bound order-wise, and the analysis extends to differentially private settings.

Contribution

The paper proposes RL-LOW, the first algorithm with exponential convergence guarantees for offline RLHF with pairwise comparisons, and establishes matching instance-dependent lower bounds.

Findings

01 RL-LOW achieves exponential simple regret decay of exp(-Ω(n/H)), where n is the number of data samples and H is an instance-dependent hardness quantity.

02 Lower bounds match the upper bounds order-wise, proving optimality.

03 The hardness parameter H remains unchanged under differential privacy constraints.

Abstract

We consider the problem of offline reinforcement learning from human feedback (RLHF) with pairwise comparisons proposed by Zhu et al. (2023), where the implicit reward is a linear function of an unknown parameter. Given an offline dataset, our objective consists in ascertaining the optimal action for each state, with the ultimate goal of minimizing the simple regret. We propose an algorithm, RL with Locally Optimal Weights or RL-LOW, which yields an exponential form of simple regret of $\exp(-\Omega(n/H))$ where $n$ is the number of data samples and $H$ denotes an instance-dependent hardness quantity that depends explicitly on the suboptimality gap of each action. Furthermore, we derive a first-of-its-kind instance-dependent lower bound in offline RLHF with pairwise comparisons. Interestingly, we observe that the lower and…
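To make the setting in the abstract concrete, here is a minimal sketch of an offline pairwise-comparison dataset with a linear reward, and of how simple regret could be measured for a learned policy. Everything beyond "linear reward" and "pairwise comparisons" is an assumption made for illustration: the Bradley-Terry (logistic) preference model, the uniform state distribution `rho`, the random features `phi`, and all function names are hypothetical. The estimator is a plain maximum-likelihood plug-in baseline, not RL-LOW, whose weighting scheme is not described on this page.

```python
import math
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulate_comparisons(phi, theta, n, rng):
    """Draw n comparisons; label 1.0 means the first action was preferred.

    Assumes a Bradley-Terry model on the linear reward <phi[s][a], theta>.
    """
    num_states, num_actions = len(phi), len(phi[0])
    data = []
    for _ in range(n):
        s = rng.randrange(num_states)
        a0, a1 = rng.sample(range(num_actions), 2)  # two distinct actions
        diff = [x - y for x, y in zip(phi[s][a0], phi[s][a1])]
        label = 1.0 if rng.random() < sigmoid(dot(diff, theta)) else 0.0
        data.append((diff, label))
    return data


def fit_mle(data, dim, steps=200, lr=0.5):
    """Gradient ascent on the logistic log-likelihood (baseline, not RL-LOW)."""
    theta_hat = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for diff, label in data:
            err = label - sigmoid(dot(diff, theta_hat))
            for j in range(dim):
                grad[j] += err * diff[j]
        theta_hat = [t + lr * g / len(data) for t, g in zip(theta_hat, grad)]
    return theta_hat


def simple_regret(phi, theta, theta_hat, rho):
    """State-weighted gap between the best action and the greedy plug-in action."""
    total = 0.0
    for s, weight in enumerate(rho):
        true_r = [dot(f, theta) for f in phi[s]]
        est_r = [dot(f, theta_hat) for f in phi[s]]
        chosen = max(range(len(est_r)), key=est_r.__getitem__)
        total += weight * (max(true_r) - true_r[chosen])
    return total


rng = random.Random(0)
num_states, num_actions, dim = 3, 4, 2
phi = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_actions)]
       for _ in range(num_states)]
theta = [1.0, -0.5]                      # unknown parameter (known only to the simulator)
rho = [1.0 / num_states] * num_states    # assumed uniform state distribution
data = simulate_comparisons(phi, theta, n=500, rng=rng)
theta_hat = fit_mle(data, dim)
regret = simple_regret(phi, theta, theta_hat, rho)
```

Rerunning the last few lines with larger `n` and averaging `regret` over many seeds is one way to observe how quickly the regret of such a baseline shrinks; the exponential rate quoted in the abstract is a claim about RL-LOW, not about this sketch.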


Taxonomy

Topics: Supply Chain and Inventory Management · Auction Theory and Applications

Methods: Focus