Stabilizing Q Learning Via Soft Mellowmax Operator
Yaozhong Gan, Zhe Zhang, Xiaoyang Tan

TL;DR
This paper introduces SM2 (Soft Mellowmax), an enhanced Mellowmax operator for reinforcement learning that targets more stable and reliable value learning with a performance guarantee, especially in high-dimensional and multi-agent scenarios.
Contribution
The paper proposes SM2, an enhanced Mellowmax operator with proven performance guarantees, addressing oversmoothing and parameter sensitivity issues in existing methods.
Findings
SM2 provides stable value function approximation in high-dimensional spaces.
Application of SM2 achieves state-of-the-art results in multi-agent reinforcement learning.
SM2 is reliable, easy to implement, and preserves the advantages of Mellowmax.
Abstract
Learning complicated value functions in high-dimensional state spaces by function approximation is a challenging task, partly because the max operator used in temporal difference updates can theoretically cause instability for most linear or non-linear approximation schemes. Mellowmax is a recently proposed softmax operator that is differentiable and a non-expansion, which allows convergent behavior in learning and planning. Unfortunately, the performance bound for the fixed point it converges to remains unclear, and in practice its parameter is sensitive to the domain and has to be tuned case by case. Finally, the Mellowmax operator may suffer from oversmoothing, as it ignores the probability with which each action is taken when aggregating action values. In this paper, we address all the above issues with an enhanced Mellowmax operator, named SM2 (Soft Mellowmax). Particularly, the proposed…
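For orientation, the baseline Mellowmax operator that the abstract refers to is commonly written as mm_ω(x) = (1/ω) log((1/n) Σ_i exp(ω x_i)). The sketch below implements only that baseline, not SM2 itself, whose definition is not given in the text above. The function name `mellowmax` and the default `omega` are illustrative choices, not taken from the paper.

```python
import numpy as np


def mellowmax(q_values, omega=5.0):
    """Baseline Mellowmax over a vector of action values (illustrative sketch).

    Computes (1/omega) * log(mean(exp(omega * q))) for omega > 0.
    The name and default omega are assumptions made for this example.
    """
    q = np.asarray(q_values, dtype=float)
    # Subtract the maximum before exponentiating to avoid overflow;
    # it is added back outside the logarithm, so the result is unchanged.
    m = q.max()
    return m + np.log(np.mean(np.exp(omega * (q - m)))) / omega
```

For ω > 0 the result lies between the mean and the maximum of the action values: a large ω pushes it toward the max, and a small ω pushes it toward the plain average. Every action is weighted equally inside the mean, which is the behavior the abstract describes as a source of oversmoothing.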
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Adversarial Robustness in Machine Learning
Methods: Softmax
