Simulation-Based Optimistic Policy Iteration For Multi-Agent MDPs with Kullback-Leibler Control Cost
Khaled Nakhleh, Ceyhun Eksin, Sabit Ekin

TL;DR
This paper introduces a simulation-based optimistic policy iteration method for multi-agent MDPs with KL control costs, enabling agents to independently compute optimal policies with proven convergence.
Contribution
The paper presents an agent-based OPI scheme that handles KL control costs and shows convergence for both synchronous and asynchronous policy evaluation.
Findings
The OPI scheme converges asymptotically to the optimal value function and policy.
Agents can compute the improved joint policy independently, because the policy improvement step follows a Boltzmann distribution.
The scheme is validated on a multi-agent game with KL control costs.
Abstract
This paper proposes an agent-based optimistic policy iteration (OPI) scheme for learning stationary optimal stochastic policies in multi-agent Markov Decision Processes (MDPs), in which agents incur a Kullback-Leibler (KL) divergence cost for their control efforts and an additional cost for the joint state. The proposed scheme consists of a greedy policy improvement step followed by an m-step temporal difference (TD) policy evaluation step. We use the separable structure of the instantaneous cost to show that the policy improvement step follows a Boltzmann distribution that depends on the current value function estimate and the uncontrolled transition probabilities. This allows agents to compute the improved joint policy independently. We show that both the synchronous (entire state space evaluation) and asynchronous (a uniformly sampled set of substates) versions of the OPI scheme with…
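To make the two steps named in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes a single joint state space, a discounted cost of the form q(s) + KL(pi(.|s) || P0(.|s)), and exact expected m-step backups in place of the sampled TD updates the paper uses. The names `P0`, `q`, `gamma`, `m`, and the toy numbers are all hypothetical. Under these assumptions, minimizing the one-step cost over pi gives pi(s'|s) proportional to P0(s'|s) * exp(-gamma * V(s')), which is the Boltzmann form the abstract refers to.

```python
import numpy as np

def boltzmann_improvement(P0, V, gamma):
    """Greedy step: pi(s'|s) proportional to P0[s, s'] * exp(-gamma * V[s'])."""
    W = P0 * np.exp(-gamma * V)[None, :]
    return W / W.sum(axis=1, keepdims=True)

def kl_rows(Pi, P0):
    """Row-wise KL(Pi[s] || P0[s]); entries with Pi == 0 contribute 0."""
    safe_P0 = np.where(P0 > 0, P0, 1.0)          # avoid division by zero
    ratio = np.where(Pi > 0, Pi / safe_P0, 1.0)  # log(1) = 0 where Pi == 0
    return np.sum(Pi * np.log(ratio), axis=1)

def m_step_evaluation(Pi, P0, q, V, gamma, m):
    """Apply the fixed-policy Bellman operator m times (exact expectation,
    an assumption standing in for sampled m-step TD)."""
    cost = q + kl_rows(Pi, P0)
    for _ in range(m):
        V = cost + gamma * Pi @ V
    return V

def optimistic_policy_iteration(P0, q, gamma, m=3, iters=200):
    # Assumed initialization: a constant upper bound on the value function,
    # which makes the iterates decrease monotonically in this sketch.
    V = np.full(len(q), q.max() / (1.0 - gamma))
    for _ in range(iters):
        Pi = boltzmann_improvement(P0, V, gamma)
        V = m_step_evaluation(Pi, P0, q, V, gamma, m)
    return V, boltzmann_improvement(P0, V, gamma)

# Hypothetical 3-state example: uncontrolled dynamics P0 and state cost q.
gamma = 0.9
P0 = np.array([[0.50, 0.50, 0.00],
               [0.25, 0.50, 0.25],
               [0.00, 0.50, 0.50]])
q = np.array([0.0, 1.0, 2.0])

V, Pi = optimistic_policy_iteration(P0, q, gamma)
print("value estimate:", V)
print("policy:\n", Pi)
```

The improved policy never places mass on transitions that are impossible under `P0`, so the KL cost stays finite. In the multi-agent setting described in the abstract, the separable cost lets each agent carry out its part of this normalization independently; the sketch does not model that decomposition.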
Taxonomy
Topics: Auction Theory and Applications
Methods: Sparse Evolutionary Training
