TL;DR
This paper introduces S2Q, a novel multi-agent reinforcement learning method that maintains multiple sub-value functions to adapt to shifting optima, enhancing exploration and performance.
Contribution
S2Q is the first approach to learn multiple sub-value functions for better adaptability to changing value functions in MARL.
Findings
S2Q outperforms existing MARL algorithms on benchmark tasks.
S2Q enables quick adjustment to changing optima during training.
S2Q promotes persistent exploration through a Softmax-based policy.
Abstract
Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value Q-learning (S2Q), which learns multiple sub-value functions to retain alternative high-value actions. By incorporating these sub-value functions into a Softmax-based behavior policy, S2Q encourages persistent exploration and enables quick adjustment to changing optima. Experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance. Our code is available at https://github.com/hyeon1996/S2Q.
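To make the Softmax-based behavior policy more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that each of k sub-value functions has proposed one candidate joint action with an estimated value, and that the behavior policy samples which candidate to execute from a softmax over those values. The names `softmax` and `sample_candidate`, the `temperature` parameter, and the example values are all hypothetical.

```python
import math
import random


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_candidate(candidate_values, temperature=1.0, rng=random):
    """Sample the index of the candidate action to execute.

    candidate_values[i] is the (assumed) value estimate of the action
    tracked by the i-th sub-value function. Higher-valued candidates are
    chosen more often, but every candidate keeps nonzero probability.
    """
    probs = softmax(candidate_values, temperature)
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs


# Hypothetical value estimates for three tracked candidate actions.
values = [8.0, 7.5, 6.0]
rng = random.Random(0)  # seeded only to make the sketch reproducible
idx, probs = sample_candidate(values, temperature=1.0, rng=rng)
```

Because the lower-valued candidates keep nonzero probability, they continue to be visited, which is one way to read the "persistent exploration" claim. Lowering `temperature` in this sketch concentrates probability on the highest-valued candidate.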
Peer Reviews
Decision·ICLR 2026 Poster
- Dynamic value functions: S2Q overcomes the limitation that conventional methods do not explicitly track suboptimal actions. When the optimal action changes, S2Q can immediately leverage the corresponding sub-value function and guide Q^{tot} to adapt.
- Introducing communication during training: S2Q explicitly executes tracked suboptimal actions with priority determined by a Softmax distribution P_t over their Q^{∗} values, thereby enabling exploration of a wider range of spaces than
- Old benchmarks: The StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al., 2019) is an old benchmark. The authors are advised to report experimental results on the recently proposed SMAC-Hard benchmark [1].
[1] SMAC-Hard: Enabling Mixed Opponent Strategy Script and Self-play on SMAC, arXiv:2412.17707.
1. The toy matrix game illustrates why retaining information about nearby high-value actions can help when the optimum shifts. S2Q operationalizes this via a suppressed TD objective and a Softmax-guided behavior policy. The algorithmic presentation (Alg. 1; eqs. (B.2–B.5)) is easy to follow.
2. Results span SMAC Hard+, GRF, SMAC-Comm (with a “-Comm” variant), and SMACv2, showing consistent gains and faster learning, not only final win rates. The compute table quantifies overhead.
3. Removing So
1. The paper claims theoretical/empirical analyses, but no formal result is provided to justify that minimizing the modified TD with the suppression term reliably extracts distinct top-k modes under the IGM constraint or preserves contraction/stability properties. A small lemma would strengthen the case.
2. S2Q learns an encoder–decoder to approximate $P_t$ and reconstruct $s$, which provides additional supervision such as cross-entropy and reconstruction. Several non-communication baselines do
This paper is well written and provides a very extensive set of experiments and ablations. The results are consistently strong across very diverse environments. While apparently simple, I find the idea of suppressing the optimal actions when computing subsequent value functions clever, and the results show large improvements across a range of tasks.
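The suppression idea highlighted above can be illustrated on a small payoff matrix. This is a toy sketch under stated assumptions, not the paper's suppressed TD objective: it operates on a known payoff table rather than learned value functions, and "suppression" here is simply a hard mask over joint actions that were already selected. The function name `successive_top_k` and the payoff values are hypothetical.

```python
def successive_top_k(payoff, k):
    """Return up to k joint actions (row, col) in decreasing payoff order.

    Each pass takes the argmax over joint actions not yet selected,
    loosely mimicking how a later sub-value function would ignore the
    optima already captured by earlier ones.
    """
    suppressed = set()
    picks = []
    for _ in range(k):
        best = None
        for i, row in enumerate(payoff):
            for j, value in enumerate(row):
                if (i, j) in suppressed:
                    continue  # skip actions found in earlier passes
                if best is None or value > best[0]:
                    best = (value, (i, j))
        if best is None:
            break  # fewer than k joint actions exist
        picks.append(best[1])
        suppressed.add(best[1])
    return picks


# Hypothetical two-agent cooperative game with three actions each.
payoff = [
    [8.0, -12.0, -12.0],
    [-12.0, 6.0, 0.0],
    [-12.0, 0.0, 7.0],
]
top3 = successive_top_k(payoff, 3)  # [(0, 0), (2, 2), (1, 1)]
```

If the payoff at `(0, 0)` later drops below 7.0, the best joint action becomes `(2, 2)`, which the tracked list already contains. That is the intuition behind keeping alternative high-value actions available when the optimum shifts.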
The authors could have provided a deeper analysis of the scalability of the proposed method, since it requires sequential computations using sub-networks. That is, since there is a mixing network for each Q, how far can k scale? For the communication encoder-decoder module in Figure 3, the authors could have provided a better description of the architecture. Please find below some more specific questions.
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Adversarial Robustness in Machine Learning
