Multi-Agent Cross-Entropy Method with Monotonic Nonlinear Critic Decomposition

Yan Wang; Ke Deng; Yongli Ren

arXiv:2511.18671·cs.LG·November 27, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces MCEM with monotonic nonlinear critic decomposition, enhancing cooperative multi-agent reinforcement learning by improving policy updates and addressing the centralized-decentralized mismatch issue.

Contribution

It proposes a novel multi-agent cross-entropy method combined with monotonic nonlinear critic decomposition to improve policy learning in MARL.
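As a rough illustration of what "monotonic nonlinear" decomposition can mean, the sketch below mixes per-agent values through a small nonlinear network whose weights are forced to be nonnegative, in the spirit of QMIX-style mixers. It is not the paper's NCD architecture, which is not shown on this page. The function names, the fixed weights, and the ELU activation are all assumptions chosen for the example; in practice such weights are usually produced by a network conditioned on the global state.

```python
import math


def elu(x):
    # ELU is strictly increasing, so it preserves monotonicity.
    return x if x > 0 else math.exp(x) - 1.0


def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """Nonlinear mixer that is non-decreasing in every per-agent value.

    agent_qs: list of per-agent values.
    w1: hidden-by-agents weight rows, b1: hidden biases (hypothetical constants).
    w2: hidden-to-output weights, b2: output bias (hypothetical constants).
    Taking abs() of every weight keeps the partial derivatives nonnegative.
    """
    hidden = [
        elu(sum(abs(w) * q for w, q in zip(row, agent_qs)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(abs(w) * h for w, h in zip(w2, hidden)) + b2


# Arbitrary example weights (assumption); signs are removed by abs().
w1 = [[0.5, -1.2], [-0.3, 0.8]]
b1 = [0.1, -0.4]
w2 = [-0.7, 1.5]
b2 = 0.2

low = monotonic_mix([0.0, 1.0], w1, b1, w2, b2)
high = monotonic_mix([0.5, 1.0], w1, b1, w2, b2)  # agent 0's value raised
```

Because raising any single agent's value can never lower the mixed value, each agent's greedy action under its own value is consistent with the greedy joint action, which is the property that monotonic decompositions rely on.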

Findings

01

MCEM outperforms state-of-the-art methods on various benchmarks.

02

The approach effectively addresses the centralized-decentralized mismatch.

03

Modified off-policy learning techniques improve sample efficiency.

Abstract

Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution (CTDE), where centralized critics leverage global information to guide decentralized actors. However, centralized-decentralized mismatch (CDM) arises when the suboptimal behavior of one agent degrades others' learning. Prior approaches mitigate CDM through value decomposition, but linear decompositions allow per-agent gradients at the cost of limited expressiveness, while nonlinear decompositions improve representation but require centralized gradients, reintroducing CDM. To overcome this trade-off, we propose the multi-agent cross-entropy method (MCEM), combined with monotonic nonlinear critic decomposition (NCD). MCEM updates policies by increasing the probability of high-value joint actions, thereby excluding suboptimal behaviors. For sample efficiency, we extend…
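The idea of "increasing the probability of high-value joint actions" can be sketched with a generic cross-entropy-method step. This is a minimal tabular illustration, not the authors' implementation: the real method uses learned policies and a learned critic, whereas here `cem_policy_step`, `toy_q`, the elite fraction, and the step size are all hypothetical choices made for the example.

```python
import random


def cem_policy_step(policies, q_fn, n_samples=200, elite_frac=0.2, lr=0.5, rng=None):
    """One CEM-style update for independent per-agent categorical policies.

    policies: list of per-agent probability lists.
    q_fn: hypothetical joint-action value function (tuple of actions -> float).
    """
    rng = rng or random.Random(0)
    # Each agent samples from its own policy; together they form a joint action.
    samples = [
        tuple(rng.choices(range(len(p)), weights=p)[0] for p in policies)
        for _ in range(n_samples)
    ]
    # Keep the top fraction of sampled joint actions, ranked by joint value.
    samples.sort(key=q_fn, reverse=True)
    elites = samples[: max(1, int(elite_frac * n_samples))]
    new_policies = []
    for i, p in enumerate(policies):
        # Empirical frequency of agent i's actions among the elite joint actions.
        counts = [0] * len(p)
        for joint in elites:
            counts[joint[i]] += 1
        target = [c / len(elites) for c in counts]
        # Move the policy part of the way toward the elite distribution.
        new_policies.append([(1 - lr) * pj + lr * tj for pj, tj in zip(p, target)])
    return new_policies


def toy_q(joint):
    # Toy cooperative payoff (assumption): reward only if both agents pick action 2.
    return 1.0 if joint == (2, 2) else 0.0


policies = [[1 / 3] * 3, [1 / 3] * 3]
rng = random.Random(0)
for _ in range(20):
    policies = cem_policy_step(policies, toy_q, rng=rng)
```

Each agent's update depends only on which of its own actions appear in the elite set, so low-value joint actions caused by another agent's poor choice are filtered out before they influence the update, rather than being propagated through a centralized gradient.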

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

This paper uses CEM to "filter" suboptimal actions to avoid centralized gradients, serving as an alternative to traditional centralized critic policy gradient methods and offering insights for large-scale agent tasks. The combination of MCEM and NCD is seamless, preserving both the excellent expressive power of nonlinear decomposition and the high quality of updated data. Furthermore, the improvements to the heterogeneous policy critic learning part (Sarsa + Retrace) are well-considered, enhanc

Weaknesses

W1: Theorem 5.1 and its proof in Appendix A form the core theoretical analysis of this paper. However, the proof is overly simplistic and intuitive, lacking rigorous mathematical form. It reads more like a description of the algorithm's design intent—to improve expected returns by selecting actions with high Q values—than a rigorous mathematical proof. Why does the first inequality $E_{\pi_g}[Q_{\pi_g}^{tot}(\tau, u)] \le E_{\pi_\rho}[Q_{\pi_g}^{tot}(\tau, u)] $ hold? Subsequent recursive expans

Reviewer 02 · Rating 2 · Confidence 4

Strengths

Simulations are conducted on standard MARL benchmarks, and advanced baselines are compared.

Weaknesses

1. Many policy gradient formulas in this paper are incorrect.
2. The use of the auxiliary proposal policies is unclear.
3. The convergence of the proposed algorithm cannot be guaranteed from a theoretical perspective.
4. For the continuous action setting, the authors are recommended to evaluate their proposed algorithm on benchmarks with high-dimensional action spaces, and to discuss the hyperparameter settings related to the cross-entropy method in their algorithm.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The extension of the CEM from the single-agent to the MARL setting is natural and well-formulated as a percentile-greedy policy.
2. The computational details of the experiments are sufficient.
3. The results in SMAC against related VD methods are somewhat convincing, as MCEM NCD performs better or the same as other baselines (see below).

Weaknesses

1. The primary implementation contribution seems minimal - the only difference appears to be in how the actions are selected for the fit of the network. While Theorem 5.1 appears to demonstrate that the MCEM method should perform at least as well as baseline, this doesn't guarantee improvement in the general setting. Why was no equilibrium analysis or spectral analysis of the game dynamics with factorization performed to motivate the method further? Indeed, in the results, MCEM sometimes perform

Reviewer 04 · Rating 2 · Confidence 4

Strengths

- Extending CEM to multi-agent settings and pairing it with a monotonic nonlinear decomposition is an interesting and novel direction.
- The approach demonstrates promising preliminary results and could inspire future work on cross-entropy–based policy optimization in cooperative MARL.

Weaknesses

1. The theoretical analysis is informal and lacks rigor. In particular, I do not see why the first inequality in Appendix A, $E_{\pi_g}[Q^{\pi_g}]\le E_{\pi_\rho}[Q^{\pi_g}]$, should hold under the description provided. The authors should formally define both $\pi_g$ and $\pi_\rho$. Are they final converged policies, or intermediate policies after one policy improvement step? The proof also implicitly assumes access to the exact $Q^\pi$, whereas in practice $Q^\pi$ is estimated by a QMIX-style n

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Adversarial Robustness in Machine Learning