Distributed Zeroth-Order Policy Gradient for Networked Multi-agent Reinforcement Learning from Human Feedback
Pengcheng Dai, He Wang, Dongming Wang, Jian Qin, and Wenwu Yu

TL;DR
This paper introduces a scalable distributed zeroth-order policy gradient method for multi-agent reinforcement learning using human feedback, suitable for large networked systems with localized interactions.
Contribution
It proposes a novel human feedback mechanism based on spatiotemporally truncated trajectories and develops a fully distributed algorithm with convergence guarantees.
Findings
Algorithm converges to an $\epsilon$-stationary point with polynomial sample complexity.
Simulation results demonstrate effectiveness in GridWorld and predator-prey environments.
Method enables collaborative optimization solely from human preference feedback.
Abstract
We study a networked multi-agent reinforcement learning (NMARL) problem with human feedback in an infinite-horizon setting, where agents interact over an underlying network with localized state dependencies and aim to collaboratively maximize the average discounted return. Existing approaches with preference feedback are primarily developed for single-agent settings and rely on centralized training, which limits their scalability and applicability to large-scale networked multi-agent systems. To address this, we introduce a novel human feedback mechanism based on spatiotemporally truncated trajectories, defined as finite-horizon trajectory pairs aggregated over each agent's bounded-hop neighborhood. Building on this, we develop a distributed zeroth-order policy gradient algorithm, where each agent estimates its local policy gradient using human preference feedback generated from both the…
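The idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the line graph, the quadratic per-agent reward, the Bradley-Terry preference model, the logit inversion of the vote share, and every name and hyperparameter below (`k_hop_neighborhood`, `distributed_zo_step`, `mu`, `lr`, `n_dirs`, `n_queries`) are assumptions made for illustration. Each truncated trajectory is collapsed into a single reward value, so only the spatial (neighborhood) part of the truncation is modeled.

```python
import numpy as np

def k_hop_neighborhood(adj, agent, k):
    """Return the set of agents within k hops of `agent` (breadth-first search)."""
    seen, frontier = {agent}, {agent}
    for _ in range(k):
        frontier = {j for i in frontier for j in adj[i]} - seen
        seen |= frontier
    return seen

def preference_prob(ret_a, ret_b):
    """Assumed Bradley-Terry model: probability that option a is preferred to b."""
    return 1.0 / (1.0 + np.exp(-(ret_a - ret_b)))

def distributed_zo_step(theta, targets, nbhds, rng,
                        mu=0.1, lr=0.05, n_dirs=32, n_queries=200):
    """One ascent step in which agent i only sees preference votes on returns
    aggregated over its own neighborhood nbhds[i] (hypothetical toy setup)."""
    n = theta.size
    grad = np.zeros(n)
    r_base = -(theta - targets) ** 2               # per-agent reward, current policy
    for _ in range(n_dirs):
        u = rng.standard_normal(n)
        u /= np.linalg.norm(u)                     # random direction on the unit sphere
        r_pert = -(theta + mu * u - targets) ** 2  # per-agent reward, perturbed policy
        for i, nbrs in enumerate(nbhds):
            # Simulated human: binary votes comparing neighborhood-level returns.
            p = preference_prob(r_pert[nbrs].sum(), r_base[nbrs].sum())
            votes = rng.random(n_queries) < p
            p_hat = np.clip(votes.mean(), 0.01, 0.99)
            # Invert the preference model to recover an estimated return gap.
            diff_hat = np.log(p_hat / (1.0 - p_hat))
            # Zeroth-order estimate of agent i's partial derivative.
            grad[i] += (n / mu) * diff_hat * u[i]
    return theta + lr * grad / n_dirs

def run_demo(steps=60, seed=0):
    rng = np.random.default_rng(seed)
    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # four agents on a line
    targets = np.array([1.0, -1.0, 0.5, -0.5])     # each agent's optimal parameter
    nbhds = [sorted(k_hop_neighborhood(adj, i, 1)) for i in range(4)]
    theta = np.zeros(4)
    for _ in range(steps):
        theta = distributed_zo_step(theta, targets, nbhds, rng)
    return theta, targets

if __name__ == "__main__":
    theta, targets = run_demo()
    print("distance to optimum:", np.linalg.norm(theta - targets))
```

In this sketch no agent ever observes a numeric reward. Each one sees only binary votes on neighborhood-level comparisons, which is the sense in which the optimization runs solely from preference feedback. Because the toy rewards are decoupled across agents, the neighbor terms cancel in expectation and each agent's estimate targets its own partial derivative.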
