Distributed Direct Preference Optimization

Zhanhong Jiang

arXiv:2605.20696·cs.LG·May 21, 2026

TL;DR

This paper analyzes the convergence and time complexity of Direct Preference Optimization (DPO) under federated and decentralized training for preference-based reinforcement learning, providing theoretical guarantees and empirical validation.

Contribution

The paper offers the first convergence and time-complexity analysis of DPO in distributed settings, accounting for preference heterogeneity across users and communication constraints.

Findings

01

For federated DPO, derived convergence rates that quantify the impact of client drift, communication frequency, and preference heterogeneity.

02

Established convergence over general communication graphs with spectral connectivity.

03

Empirically validated theoretical insights on standard alignment benchmarks.

Abstract

Preference-based reinforcement learning (RL) is a key paradigm for aligning policies with human judgments, yet its theoretical behavior in distributed settings where preference data are fragmented across heterogeneous users remains poorly understood. Direct Preference Optimization (DPO) avoids explicit reward modeling but lacks convergence guarantees under federated and decentralized training, where communication constraints and non-IID preferences fundamentally alter optimization dynamics. We provide the first convergence and time-complexity analysis of DPO in distributed environments. Modeling personalized offline RL with user-specific preference distributions, we characterize the induced global optimization landscape. For federated DPO, we derive convergence rates that quantify the impact of client drift, communication frequency, and preference heterogeneity; for decentralized DPO,…
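To make the federated setting in the abstract concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it combines the standard DPO loss with a plain FedAvg-style loop (several local gradient steps per client, then an unweighted server average), and it assumes a toy log-linear policy so that the DPO margin reduces to beta * theta . (phi(x, y_w) - phi(x, y_l)). The function names, the synthetic data generator, and all hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def dpo_loss_and_grad(theta, diffs, beta):
    """Mean DPO loss and gradient for a toy log-linear policy (assumption).

    diffs[i] = phi(x_i, y_w) - phi(x_i, y_l), the feature difference between
    the preferred and the rejected response. Under the log-linear assumption
    the per-pair DPO loss is -log sigmoid(beta * theta . diffs[i]).
    """
    margins = beta * (diffs @ theta)
    # -log sigmoid(m) = log(1 + exp(-m)), computed stably.
    loss = np.mean(np.logaddexp(0.0, -margins))
    # sigmoid(-m) = exp(-log(1 + exp(m))), also computed stably.
    sig_neg = np.exp(-np.logaddexp(0.0, margins))
    grad = -beta * (sig_neg[:, None] * diffs).mean(axis=0)
    return loss, grad


def federated_dpo(client_diffs, beta=0.1, lr=0.5, local_steps=5, rounds=50):
    """FedAvg-style loop: local gradient steps, then an unweighted average.

    More local steps per round means less communication but more client
    drift when clients hold different preference distributions.
    """
    theta = np.zeros(client_diffs[0].shape[1])
    for _ in range(rounds):
        local_models = []
        for diffs in client_diffs:
            local = theta.copy()
            for _ in range(local_steps):
                _, grad = dpo_loss_and_grad(local, diffs, beta)
                local -= lr * grad
            local_models.append(local)
        # Server step: average the client models (equal weights assumed).
        theta = np.mean(local_models, axis=0)
    return theta


def make_toy_clients(num_clients=4, pairs_per_client=200, dim=3,
                     hetero=0.2, seed=0):
    """Hypothetical synthetic data: each client prefers the response that
    scores higher under its own perturbed copy of a shared direction."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=dim)
    clients = []
    for _ in range(num_clients):
        w_k = shared + hetero * rng.normal(size=dim)  # non-IID preferences
        diffs = rng.normal(size=(pairs_per_client, dim))
        signs = np.where(diffs @ w_k >= 0, 1.0, -1.0)
        clients.append(diffs * signs[:, None])  # orient as winner minus loser
    return clients
```

Calling federated_dpo(make_toy_clients()) returns a single shared parameter vector. Raising local_steps or hetero in this toy setup is one way to observe the tension between communication frequency and client drift that the abstract describes, although the sketch says nothing about the rates derived in the paper.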
