Action Dependency Graphs for Globally Optimal Coordinated Reinforcement Learning
Jianglin Ding, Jingcheng Tang, Gangshan Jing

TL;DR
This paper introduces action dependency graphs (ADGs) for multi-agent reinforcement learning, enabling scalable, globally optimal policies without auto-regressive constraints, validated through theoretical proofs and experiments.
Contribution
The paper generalizes action-dependent policies using ADGs, proves conditions under which they are globally optimal, and develops a scalable policy iteration algorithm.
Findings
Sparse ADGs can achieve global optimality under certain conditions.
The proposed framework improves scalability over auto-regressive policies.
Experimental results demonstrate robustness and effectiveness in complex environments.
Abstract
Action-dependent individual policies, which incorporate both environmental states and the actions of other agents in decision-making, have emerged as a promising paradigm for achieving global optimality in multi-agent reinforcement learning (MARL). However, the existing literature often adopts auto-regressive action-dependent policies, where each agent's policy depends on the actions of all preceding agents. This formulation incurs substantial computational complexity as the number of agents increases, thereby limiting scalability. In this work, we consider a more generalized class of action-dependent policies, which do not necessarily follow the auto-regressive form. We propose to use the 'action dependency graph (ADG)' to model the inter-agent action dependencies. Within the context of MARL problems structured by coordination graphs, we prove that an action-dependent policy with a…
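The contrast between auto-regressive dependencies and a sparser ADG can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the chain-shaped sparse graph, and the toy policy are all assumptions made for illustration. An ADG is represented here as a mapping from each agent to the set of agents whose actions it observes, and a joint action is sampled by visiting agents in a topological order of that graph.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def autoregressive_adg(n):
    """Auto-regressive form: agent i depends on all preceding agents 0..i-1."""
    return {i: set(range(i)) for i in range(n)}


def chain_adg(n):
    """Hypothetical sparse ADG: agent i depends only on agent i-1."""
    return {i: ({i - 1} if i > 0 else set()) for i in range(n)}


def num_dependencies(adg):
    """Total number of action-dependency edges in the graph."""
    return sum(len(parents) for parents in adg.values())


def sample_joint_action(adg, policies, state):
    """Pick actions in topological order so each agent sees its parents' actions."""
    actions = {}
    for agent in TopologicalSorter(adg).static_order():
        parent_actions = {p: actions[p] for p in adg[agent]}
        actions[agent] = policies[agent](state, parent_actions)
    return actions


def toy_policy(state, parent_actions):
    """Illustrative stand-in for a learned policy (an assumption, not from the paper).

    Copies the smallest parent action if any parent exists, else uses state parity.
    """
    if parent_actions:
        return min(parent_actions.values())
    return state % 2


if __name__ == "__main__":
    n = 5
    print(num_dependencies(autoregressive_adg(n)))  # n * (n - 1) / 2 edges
    print(num_dependencies(chain_adg(n)))           # n - 1 edges
    policies = {i: toy_policy for i in range(n)}
    print(sample_joint_action(chain_adg(n), policies, state=1))
```

In the auto-regressive graph the number of dependencies grows quadratically with the number of agents, while in the chain it grows linearly. That gap is the scalability concern the abstract raises; whether a particular sparse graph preserves global optimality depends on the conditions proved in the paper, which this sketch does not check.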
Peer Reviews
Decision: Submitted to ICLR 2026
- The authors make a relevant contribution by filling the gap, both theoretically and empirically, on the usage of action-dependent policies in MARL, which is an open and relevant setting in the community.
- The paper is clearly written and technically sound. It is relatively easy to follow.
- The authors propose a method that is scalable because it can be integrated into existing MARL algorithms.
- The experimental evaluation is limited to simple toy cooperative MARL tasks. I believe the paper could provide a much more relevant contribution to the field if tested on well-known benchmarks such as SMACv2 [1] or MaMuJoCo [2].
[1] Ellis, Benjamin, et al. "SMACv2: An improved benchmark for cooperative multi-agent reinforcement learning." Advances in Neural Information Processing Systems 36 (2023): 37567-37593.
[2] Peng, Bei, et al. "FACMAC: Factored multi-agent centralised policy gradients." A
- The paper provides a clear and rigorous theoretical analysis establishing when $G_d$-local optimality implies global optimality.
- The theoretical results are technically sound, with well-defined assumptions and consistent notation across sections.
- The paper is well written and organized, with smooth logical flow from definitions to theorems and experiments.
- How to utilize a sparse coordination structure is an important problem in multi-agent reinforcement learning.
- The experimental evaluation is limited to small-scale and largely toy environments, which do not convincingly demonstrate the framework’s scalability or practical relevance.
- No comparisons are made against standard MARL baselines (e.g., QMIX [1], MAPPO [2], and other sequential or Bayesian network-based methods [3,4]), making it unclear whether ADG-based methods provide empirical advantages beyond theoretical guarantees.
- The experiments focus on verifying theorems rather than exploring perf
1. The formulation of action-dependent policies and action dependency graphs is clear.
2. Algorithm 1 is clear and understandable.
3. The example in Section 4.2 is helpful in understanding the relationship between independent policy learning and local equilibrium convergence.
4. The coordination polymatrix games clearly elucidate why ADGs are preferable to CGs.
5. The proofs in the Appendix and main text appear correct and complete. The proof by induction in Appendix D is particularly well-expl
1. Lack of baselines: only the proposed method was benchmarked in this work. Why were other forms of action dependency not benchmarked [1]? The viability of ADGs needs to be framed within the context of action-dependent policies and of MARL in general. While the latter can be achieved by a simple ablation against regular MAPPO, the paper should clearly demonstrate why the proposed form of action dependency is superior to message passing, autoregressive action dependency, or action memory.
2. Lack of environments
Taxonomy
Topics: Reinforcement Learning in Robotics
