TL;DR
This paper analyzes the convergence and efficiency of Distributed Direct Preference Optimization (DPO) in federated and decentralized reinforcement learning, providing theoretical guarantees and empirical validation.
Contribution
The paper offers the first convergence and complexity analysis of DPO in distributed settings, accounting for preference heterogeneity across users and communication constraints.
Findings
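Derived convergence rates for federated DPO that quantify the impact of client drift, communication frequency, and preference heterogeneity.
Established convergence for decentralized DPO over general communication graphs, with rates governed by the graph's spectral connectivity.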
Empirically validated theoretical insights on standard alignment benchmarks.
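To make the "spectral connectivity" finding concrete, here is a minimal sketch of gossip averaging on a ring graph. It is an illustration only, not the paper's algorithm: the ring topology, the uniform 1/3 mixing weights, the scalar node values (standing in for model parameters), and all function names are assumptions chosen for this example. For a symmetric, doubly stochastic mixing matrix, the disagreement between nodes shrinks each round by at least the second-largest eigenvalue magnitude, which is the quantity a spectral-connectivity condition typically controls.

```python
import math

def ring_mixing_matrix(n):
    """Hypothetical mixing matrix W for a ring of n >= 3 nodes.

    Each node averages itself and its two neighbors with weight 1/3,
    so W is symmetric and doubly stochastic.
    """
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] += 1.0 / 3.0
    return W

def ring_second_eigenvalue(n):
    """Largest eigenvalue magnitude of W other than 1.

    W is circulant, so its eigenvalues are (1 + 2*cos(2*pi*k/n)) / 3.
    """
    return max(abs(1 + 2 * math.cos(2 * math.pi * k / n)) / 3
               for k in range(1, n))

def gossip_step(W, x):
    """One communication round: every node mixes its neighbors' values."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def consensus_error(x):
    """Euclidean distance of the node values from their common mean."""
    mean = sum(x) / len(x)
    return math.sqrt(sum((v - mean) ** 2 for v in x))

if __name__ == "__main__":
    n = 5
    W = ring_mixing_matrix(n)
    lam = ring_second_eigenvalue(n)
    x = [1.0, 0.0, 0.0, 0.0, 0.0]  # toy per-node "parameters"
    for t in range(5):
        print(t, round(consensus_error(x), 6))
        x = gossip_step(W, x)
    print("contraction factor per round is at most", round(lam, 6))
```

A better-connected graph has a smaller second eigenvalue magnitude, so fewer communication rounds are needed to reach a given level of agreement.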
Abstract
Preference-based reinforcement learning (RL) is a key paradigm for aligning policies with human judgments, yet its theoretical behavior in distributed settings where preference data are fragmented across heterogeneous users remains poorly understood. Direct Preference Optimization (DPO) avoids explicit reward modeling but lacks convergence guarantees under federated and decentralized training, where communication constraints and non-IID preferences fundamentally alter optimization dynamics. We provide the first convergence and time-complexity analysis of DPO in distributed environments. Modeling personalized offline RL with user-specific preference distributions, we characterize the induced global optimization landscape. For federated DPO, we derive convergence rates that quantify the impact of client drift, communication frequency, and preference heterogeneity; for decentralized DPO,…
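The objects the abstract refers to can be sketched in a few lines. The per-pair loss below is the standard DPO objective, -log σ(β[(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))]). Everything else is an assumption made for illustration and is not taken from the paper: the scalar log-linear toy policy (with the reference policy at θ = 0, so the margin reduces to θ·d for a feature difference d), the FedAvg-style round with several local steps, and all function and parameter names.

```python
import math

def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x)), i.e. softplus(-x)."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given log-probabilities."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return neg_log_sigmoid(beta * margin)

def local_dpo_steps(theta, diffs, beta, lr, steps):
    """Toy client update (assumption): full-batch gradient steps on the
    DPO loss of a scalar log-linear policy whose margin is theta * d."""
    for _ in range(steps):
        grad = 0.0
        for d in diffs:
            x = beta * theta * d
            sig_neg = 1.0 / (1.0 + math.exp(x))  # sigmoid(-x)
            grad += -beta * d * sig_neg          # d/dtheta of softplus(-x)
        theta -= lr * grad / len(diffs)
    return theta

def federated_round(theta, clients, beta=1.0, lr=0.5, local_steps=5):
    """FedAvg-style round (assumption): each client starts from the global
    parameter, runs local steps, and the server averages the results."""
    updates = [local_dpo_steps(theta, diffs, beta, lr, local_steps)
               for diffs in clients]
    return sum(updates) / len(updates)

if __name__ == "__main__":
    # Heterogeneous clients: each holds its own preference feature differences.
    clients = [[1.0, 2.0], [0.5, -0.2]]
    theta = 0.0
    for r in range(3):
        theta = federated_round(theta, clients)
        print("round", r, "theta", round(theta, 6))
```

In this toy setup, raising `local_steps` lets each client move further toward its own optimum between averaging rounds, which is the effect usually called client drift; communicating more often (fewer local steps per round) reduces it at the cost of more rounds.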
