Learning Partial Action Replacement in Offline MARL
Yue Jin, Giovanni Montana

TL;DR
This paper introduces PLCQL, an adaptive, efficient framework for partial action replacement in offline multi-agent reinforcement learning, improving performance and reducing computational costs.
Contribution
PLCQL formulates PAR subset selection as a contextual bandit problem, enabling dynamic, state-dependent agent replacement with theoretical error bounds and improved efficiency.
Findings
PLCQL outperforms previous methods on multiple benchmarks.
It reduces Q-function evaluations from n to 1 per iteration.
It achieves the highest score on 66% of tasks across benchmarks.
Abstract
Offline multi-agent reinforcement learning (MARL) faces a critical challenge: the joint action space grows exponentially with the number of agents, making dataset coverage exponentially sparse and out-of-distribution (OOD) joint actions unavoidable. Partial Action Replacement (PAR) mitigates this by anchoring a subset of agents to dataset actions, but the existing approach relies on enumerating multiple subset configurations at high computational cost and cannot adapt to varying states. We introduce PLCQL, a framework that formulates PAR subset selection as a contextual bandit problem and learns a state-dependent PAR policy using Proximal Policy Optimisation with an uncertainty-weighted reward. This adaptive policy dynamically determines how many agents to replace at each update step, balancing policy improvement against conservative value estimation. We prove a value-error bound showing…
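To make the core PAR operation concrete, here is a minimal sketch of how a joint action could be assembled when k agents are replaced and the rest stay anchored to the dataset. It is not the authors' implementation. The function name `partial_action_replacement`, the array shapes, and the uniform random choice of which agents to replace are all assumptions made for illustration; in PLCQL the value of k would come from the learned state-dependent PAR policy, which is not modelled here.

```python
import numpy as np


def partial_action_replacement(dataset_actions, policy_actions, k, rng):
    """Replace the actions of k agents per sample with policy actions.

    Hypothetical helper for illustration only.
    dataset_actions, policy_actions: arrays of shape (batch, n_agents, action_dim).
    The remaining n_agents - k agents keep their dataset actions.
    """
    batch, n_agents, _ = dataset_actions.shape
    if not 0 <= k <= n_agents:
        raise ValueError("k must be between 0 and n_agents")
    # Assumption: agents are picked uniformly at random per sample.
    # A double argsort turns random scores into a per-row ranking 0..n_agents-1.
    ranks = rng.random((batch, n_agents)).argsort(axis=1).argsort(axis=1)
    replace_mask = ranks < k  # exactly k True entries in every row
    # Broadcast the mask over the action dimension and mix the two sources.
    mixed = np.where(replace_mask[..., None], policy_actions, dataset_actions)
    return mixed, replace_mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a_data = np.zeros((4, 3, 2))  # stand-in for dataset actions
    a_pi = np.ones((4, 3, 2))     # stand-in for current policy actions
    mixed, mask = partial_action_replacement(a_data, a_pi, k=2, rng=rng)
    print(mask.sum(axis=1))       # number of replaced agents per sample
```

Setting k = 0 returns the dataset joint action unchanged, and k equal to the number of agents returns the full policy joint action. Intermediate values of k give the partially anchored joint actions between these two extremes, the range over which the abstract's trade-off between policy improvement and conservative value estimation is made.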
