High entropy leads to symmetry equivariant policies in Dec-POMDPs
Johannes Forkel, Constantin Ruhdorfer, Michael Beukman, Andreas Bulling, Jakob Foerster

TL;DR
This paper proves that sufficiently high entropy regularization in Dec-POMDPs guarantees that tabular softmax policy gradient flow converges to a symmetry-equivariant joint policy. It also shows that the entropy coefficient has a large practical impact on multi-agent training outcomes.
Contribution
It establishes a theoretical link between high entropy regularization and symmetry-equivariant policies in Dec-POMDPs, supported by empirical evaluations of independent PPO in Hanabi, Overcooked and Yokai.
Findings
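Sufficiently high entropy regularization ensures that the policy gradient flow with tabular softmax parametrization converges, from any initialization, to the same symmetry-equivariant joint policy.
The entropy coefficient strongly affects the cross-play returns between independently trained policies, and raising it can lower self-play returns.
Considering higher entropy coefficients during hyperparameter tuning can improve policy robustness and performance.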
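A minimal sketch of the first two findings, under loudly stated assumptions: the game, the initial logits, the step size and the two entropy coefficients below are hypothetical choices, not taken from the paper. It runs independent, exact policy-gradient ascent with tabular softmax parametrization on a one-shot two-player matching game, starting from two mirror-image initializations that stand in for different seeds. For this particular game, a short fixed-point calculation shows that any entropy coefficient tau > 1/2 leaves the uniform (symmetry-equivariant) joint policy as the only stationary point, while smaller values allow two symmetry-breaking conventions.

```python
import math

# Hypothetical toy Dec-POMDP (not from the paper): a one-shot, two-player
# matching game. Each agent picks action 0 or 1 and both receive reward 1
# if the actions match, else 0. Swapping the labels 0 <-> 1 is a symmetry,
# so the only symmetry-equivariant policy is the uniform one.
REWARD = [[1.0, 0.0],
          [0.0, 1.0]]  # symmetric, so both agents can index it as REWARD[a][b]
N = len(REWARD)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_return(pi1, pi2):
    """Expected reward of the joint policy (pi1, pi2), without the entropy bonus."""
    return sum(pi1[a] * pi2[b] * REWARD[a][b] for a in range(N) for b in range(N))


def policy_gradient(own_pi, partner_pi, tau):
    """Exact gradient of E[reward] + tau * H(own_pi) w.r.t. the agent's own logits."""
    # Soft action values: expected reward of each action against the partner,
    # minus tau * log pi(a), which is how the entropy bonus enters the gradient.
    soft_q = [sum(REWARD[a][b] * partner_pi[b] for b in range(N))
              - tau * math.log(own_pi[a]) for a in range(N)]
    baseline = sum(p * q for p, q in zip(own_pi, soft_q))
    return [p * (q - baseline) for p, q in zip(own_pi, soft_q)]


def train(init_logits, tau, lr=0.5, steps=5000):
    """Independent, simultaneous gradient ascent: Euler steps of the gradient flow."""
    logits = [list(row) for row in init_logits]
    for _ in range(steps):
        pis = [softmax(row) for row in logits]
        grads = [policy_gradient(pis[i], pis[1 - i], tau) for i in range(2)]
        logits = [[x + lr * g for x, g in zip(logits[i], grads[i])] for i in range(2)]
    return [softmax(row) for row in logits]


# Two stand-ins for independent seeds: run B starts from the mirror image of run A.
INIT_A = [[1.0, 0.0], [0.5, 0.0]]
INIT_B = [[0.0, 1.0], [0.0, 0.5]]

for tau in (0.1, 1.0):
    run_a, run_b = train(INIT_A, tau), train(INIT_B, tau)
    self_play = expected_return(run_a[0], run_a[1])
    cross_play = expected_return(run_a[0], run_b[1])  # agent 1 of A with agent 2 of B
    print(f"tau={tau}: self-play={self_play:.3f}, cross-play={cross_play:.3f}")
```

With tau=1.0 both runs should meet at the uniform policy, so self-play and cross-play both come out at 0.5. With tau=0.1 the runs lock into opposite conventions: self-play is close to 1 while cross-play is close to 0.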
Abstract
We prove that in any Dec-POMDP, sufficiently high entropy regularization ensures that the policy gradient flow with tabular softmax parametrization always converges, for any initialization, to the same joint policy, and that this joint policy is equivariant w.r.t. all symmetries of the Dec-POMDP. In particular, policies coming from different initializations will be fully compatible, in that their cross-play returns are equal to their self-play returns. Through extensive evaluation of independent PPO, arguably the standard baseline deep multi-agent policy gradient algorithm, in the Hanabi, Overcooked and Yokai environments, we find that the entropy coefficient has a massive influence on the cross-play returns between independently trained policies, and that the decrease in self-play returns coming from increased entropy regularization can often be counteracted by greedifying the learned…
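The abstract's final point, that greedifying can offset the self-play return lost to entropy regularization, can be illustrated on a slightly richer toy game. This is a sketch under stated assumptions, not the paper's setup: the 3-action payoff matrix, the coefficients tau=0.1 and tau=2.0, the step size, and the hypothetical `greedify` helper (which simply takes the argmax action, one plausible reading of "greedifying") are all illustrative choices. It trains two mirrored runs with entropy-regularized tabular softmax policy gradient and compares self-play (SP) and cross-play (XP) returns before and after greedification.

```python
import math

# Hypothetical 3-action game (an illustrative assumption, not a benchmark from
# the paper): matching on action 0 or on action 1 pays 1, and any joint action
# involving action 2 pays 0.6. Swapping 0 <-> 1 is a symmetry; 2 maps to itself.
REWARD = [[1.0, 0.0, 0.6],
          [0.0, 1.0, 0.6],
          [0.6, 0.6, 0.6]]  # symmetric, so both agents index it as REWARD[a][b]
N = len(REWARD)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_return(pi1, pi2):
    """Expected reward of the joint policy, without the entropy bonus."""
    return sum(pi1[a] * pi2[b] * REWARD[a][b] for a in range(N) for b in range(N))


def train(init_logits, tau, lr=0.5, steps=5000):
    """Independent tabular softmax policy gradient on E[reward] + tau * entropy."""
    logits = [list(row) for row in init_logits]
    for _ in range(steps):
        pis = [softmax(row) for row in logits]
        updated = []
        for i in range(2):
            own, partner = pis[i], pis[1 - i]
            # Soft action values; the -tau * log pi(a) term comes from the entropy bonus.
            soft_q = [sum(REWARD[a][b] * partner[b] for b in range(N))
                      - tau * math.log(own[a]) for a in range(N)]
            baseline = sum(p * q for p, q in zip(own, soft_q))
            updated.append([x + lr * p * (q - baseline)
                            for x, p, q in zip(logits[i], own, soft_q)])
        logits = updated
    return [softmax(row) for row in logits]


def greedify(pi):
    """Hypothetical helper: put all probability on the most likely action."""
    best = max(range(len(pi)), key=lambda a: pi[a])
    return [1.0 if a == best else 0.0 for a in range(len(pi))]


INIT_A = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]  # both agents lean towards action 0
INIT_B = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]  # mirror image: lean towards action 1

for tau in (0.1, 2.0):
    a, b = train(INIT_A, tau), train(INIT_B, tau)
    ga, gb = [greedify(p) for p in a], [greedify(p) for p in b]
    print(f"tau={tau}: SP={expected_return(a[0], a[1]):.3f} "
          f"XP={expected_return(a[0], b[1]):.3f} "
          f"greedy SP={expected_return(ga[0], ga[1]):.3f} "
          f"greedy XP={expected_return(ga[0], gb[1]):.3f}")
```

With tau=2.0 both runs should reach the same equivariant policy, whose SP and XP are equal but a little below 0.6; greedifying picks action 2 for every agent, which lifts both to 0.6 without breaking compatibility. With tau=0.1 the runs settle on opposite conventions, so greedified SP is 1 but greedified XP is 0.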
