Distributed Zeroth-Order Policy Gradient for Networked Multi-agent Reinforcement Learning from Human Feedback

Pengcheng Dai, He Wang, Dongming Wang, Jian Qin, and Wenwu Yu

arXiv:2605.15697·cs.MA·May 18, 2026

TL;DR

This paper introduces a scalable distributed zeroth-order policy gradient method for multi-agent reinforcement learning using human feedback, suitable for large networked systems with localized interactions.

Contribution

It proposes a novel human feedback mechanism based on spatiotemporally truncated trajectories and develops a fully distributed algorithm with convergence guarantees.
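As a rough illustration of what "spatiotemporally truncated" can mean, the sketch below restricts a trajectory in space to an agent's κ-hop neighborhood and in time to the first H steps, following the definition given in the abstract. The function names, the line-graph topology, the per-agent reward table, and the choice of a discounted sum as the aggregate are all assumptions made for this example; the paper's actual construction may differ, and in the preference setting such a score would only be implicit in the human's judgment rather than observed by the agents.

```python
from collections import deque


def k_hop_neighborhood(adjacency, agent, kappa):
    """Return the set of agents within kappa hops of `agent` (itself included)."""
    seen = {agent}
    frontier = deque([(agent, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == kappa:
            continue  # do not expand beyond kappa hops
        for nbr in adjacency[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen


def truncated_score(rewards, neighborhood, horizon, gamma):
    """Discounted sum over the first `horizon` steps and the given agents.

    rewards[t][j] is a hypothetical per-step quantity for agent j at step t.
    """
    total = 0.0
    for t, step in enumerate(rewards[:horizon]):
        total += (gamma ** t) * sum(step[j] for j in neighborhood)
    return total


# Hypothetical 4-agent line graph: 0 - 1 - 2 - 3
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
hood = k_hop_neighborhood(adjacency, agent=0, kappa=2)

# Five steps in which every agent receives 1.0 per step.
rewards = [[1.0, 1.0, 1.0, 1.0] for _ in range(5)]
score = truncated_score(rewards, hood, horizon=2, gamma=0.5)
```

With κ = 2, agent 0 only looks at agents 0, 1, and 2, and with H = 2 only the first two steps contribute, so the amount of information each comparison needs stays bounded as the network grows.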

Findings

01

Algorithm converges to an $\epsilon$-stationary point with polynomial sample complexity.

02

Simulation results demonstrate effectiveness in GridWorld and predator-prey environments.

03

Method enables collaborative optimization solely from human preference feedback.
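For readers unfamiliar with the term in the first finding, one common convention for $\epsilon$-stationarity is sketched below. The symbols $J$, $\theta$, and $N(\epsilon)$ are notation assumed for this note, with $J(\theta)$ standing for the average discounted return under joint policy parameters $\theta$; the paper's exact definition, norm, and complexity exponents are not given on this page.

```latex
\[
  \theta \text{ is } \epsilon\text{-stationary if } \;
  \bigl\| \nabla_\theta J(\theta) \bigr\| \le \epsilon ,
  \qquad
  N(\epsilon) = O\!\bigl(\mathrm{poly}(1/\epsilon)\bigr),
\]
```

Here $N(\epsilon)$ denotes the number of samples used to reach such a point. Some authors state the condition on the squared norm instead, which changes the exponent but not the polynomial character of the bound.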

Abstract

We study a networked multi-agent reinforcement learning (NMARL) problem with human feedback in an infinite-horizon setting, where agents interact over an underlying network with localized state dependencies and aim to collaboratively maximize the average discounted return. Existing approaches with preference feedback are primarily developed for single-agent settings and rely on centralized training, which limits their scalability and applicability to large-scale networked multi-agent systems. To address this, we introduce a novel human feedback mechanism based on spatiotemporally truncated trajectories, defined as $H$-horizon trajectory pairs aggregated over each agent's $\kappa$-hop neighborhood. Building on this, we develop a distributed zeroth-order policy gradient algorithm, where each agent estimates its local policy gradient using human preference feedback generated from both the…
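The general idea of a zeroth-order gradient estimate driven by preferences can be sketched for a single agent as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the simulated human follows a Bradley-Terry model, the agent perturbs its own parameters along a random unit direction, the fraction of queries preferring the perturbed policy is mapped back to a score difference with a logit, and `local_score` is a hypothetical stand-in for the hidden quality of a truncated trajectory. All function and parameter names are invented for this example.

```python
import math
import random


def sample_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def preference_rate(score_a, score_b, num_queries, rng):
    """Simulated human (assumption): prefers A with probability sigmoid(a - b)."""
    p = 1.0 / (1.0 + math.exp(-(score_a - score_b)))
    wins = sum(rng.random() < p for _ in range(num_queries))
    return wins / num_queries


def logit(p, clip=1e-3):
    """Inverse sigmoid, clipped so that rates of 0 or 1 stay finite."""
    p = min(max(p, clip), 1.0 - clip)
    return math.log(p / (1.0 - p))


def zeroth_order_gradient(theta, local_score, mu, num_queries, rng):
    """One-direction gradient estimate built only from preference outcomes."""
    dim = len(theta)
    u = sample_unit_vector(dim, rng)
    perturbed = [t + mu * x for t, x in zip(theta, u)]
    rate = preference_rate(
        local_score(perturbed), local_score(theta), num_queries, rng
    )
    diff = logit(rate)  # estimated score difference, perturbed minus current
    return [(dim / mu) * diff * x for x in u]


def toy_score(theta):
    """Hypothetical hidden objective, maximized at theta = (1, 1)."""
    return -sum((t - 1.0) ** 2 for t in theta)


rng = random.Random(0)
theta = [0.0, 0.0]
grad = zeroth_order_gradient(theta, toy_score, mu=0.5, num_queries=200, rng=rng)
```

The agent never sees `toy_score` directly; it only sees how often the perturbed behavior was preferred. In a distributed variant each agent would run such an estimate on its own parameters using feedback on its own neighborhood, though how the paper couples the agents is not recoverable from the truncated abstract.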
