Efficient Federated RLHF via Zeroth-Order Policy Optimization

Deyi Wang; Qining Zhang; Lei Ying

arXiv:2604.17747 · cs.LG · April 21, 2026

TL;DR

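This paper introduces Par-S²ZPO, a federated RLHF algorithm for resource-constrained agents. It uses zeroth-order optimization with binary perturbation to keep communication, computation, and memory costs low. It matches its centralized counterpart in sample complexity and converges in fewer policy update iterations.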

Contribution

The paper presents a federated RLHF algorithm built on zeroth-order optimization, establishes an upper bound on its convergence rate, and reports that it outperforms a FedAvg-based RLHF baseline on four MuJoCo RL tasks.

Findings

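01 Outperforms a FedAvg-based RLHF baseline on four MuJoCo RL tasks.

02 Achieves low communication, computation, and memory complexity by design, through zeroth-order optimization with binary perturbation.

03 Matches its centralized counterpart in sample complexity while converging faster in terms of policy update iterations.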

Abstract

This paper considers reinforcement learning from human feedback in a federated learning setting with resource-constrained agents, such as edge devices. We propose an efficient federated RLHF algorithm, named Partitioned, Sign-based Stochastic Zeroth-order Policy Optimization (Par-S$^{2}$ZPO). The algorithm is built on zeroth-order optimization with binary perturbation, resulting in low communication, computation, and memory complexity by design. Our theoretical analysis establishes an upper bound on the convergence rate of Par-S$^{2}$ZPO, revealing that it is as efficient as its centralized counterpart in terms of sample complexity but converges faster in terms of policy update iterations. Our experimental results show that it outperforms a FedAvg-based RLHF on four MuJoCo RL tasks.
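
To make the ingredients named in the abstract concrete, the following is a minimal Python sketch of a sign-based zeroth-order update with a binary (plus or minus one) perturbation and one-bit preference feedback aggregated by majority vote. It is not the authors' Par-S$^{2}$ZPO: the partitioning step is omitted, and the toy objective `toy_return`, the Bradley-Terry style `preference_bit` oracle, the seed-sharing scheme, and all hyperparameters are assumptions chosen only for illustration.

```python
import math
import random


def toy_return(theta):
    # Hypothetical stand-in for a policy's expected return (assumption);
    # it is maximized at theta = (1, ..., 1).
    return -sum((t - 1.0) ** 2 for t in theta)


def preference_bit(theta_a, theta_b, rng):
    # Simulated human feedback (assumption): a Bradley-Terry style comparison
    # that returns +1 if policy a is preferred and -1 otherwise.
    diff = toy_return(theta_a) - toy_return(theta_b)
    p_a = 1.0 / (1.0 + math.exp(-diff))
    return 1 if rng.random() < p_a else -1


def binary_perturbation(dim, seed):
    # Binary perturbation: every coordinate is -1 or +1. Because the vector is
    # a function of the seed, agents could regenerate it locally instead of
    # receiving it over the network.
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def federated_round(theta, num_agents, mu, lr, round_seed, rng):
    v = binary_perturbation(len(theta), round_seed)
    plus = [t + mu * vi for t, vi in zip(theta, v)]
    minus = [t - mu * vi for t, vi in zip(theta, v)]
    # Each simulated agent contributes a single preference bit.
    votes = sum(preference_bit(plus, minus, rng) for _ in range(num_agents))
    # Sign-based step: only the sign of the majority vote is used.
    direction = 1.0 if votes >= 0 else -1.0
    return [t + lr * direction * vi for t, vi in zip(theta, v)]


def run(dim=8, num_agents=15, rounds=400, mu=0.5, lr=0.05, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * dim
    for k in range(rounds):
        theta = federated_round(
            theta, num_agents, mu, lr, round_seed=seed * 100003 + k, rng=rng
        )
    return theta
```

In this sketch no gradients are computed or stored, and each agent uploads one bit per round, which is the kind of saving the abstract attributes to the zeroth-order, sign-based design. Calling `toy_return(run())` and comparing it with `toy_return([0.0] * 8)` shows the improvement on the toy objective.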
