Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

Yonghyeon Jo; Sunwoo Lee; Seungyul Han

arXiv:2602.17062·cs.AI·May 21, 2026

TL;DR

This paper introduces S2Q, a novel multi-agent reinforcement learning method that maintains multiple sub-value functions to adapt to shifting optima, enhancing exploration and performance.

Contribution

S2Q is the first approach to learn multiple sub-value functions for better adaptability to changing value functions in MARL.

Findings

01. S2Q outperforms existing MARL algorithms on benchmark tasks.

02. S2Q enables quick adjustment to changing optima during training.

03. S2Q promotes persistent exploration through a Softmax-based policy.

Abstract

Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value Q-learning (S2Q), which learns multiple sub-value functions to retain alternative high-value actions. Incorporating these sub-value functions into a Softmax-based behavior policy, S2Q encourages persistent exploration and enables $Q^{tot}$ to adjust quickly to the changing optima. Experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance. Our code is available at https://github.com/hyeon1996/S2Q.
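As a rough illustration of the Softmax-based behavior policy mentioned in the abstract, the sketch below turns the values of several retained actions into a sampling distribution. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `softmax`, `sample_sub_value_index`, the `temperature` parameter, and the example values in `q_maxima` are all hypothetical.

```python
import math
import random


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sub_value_index(sub_value_maxima, temperature=1.0, rng=random):
    """Sample which retained action to execute.

    sub_value_maxima[i] is assumed to be the value that the i-th
    sub-value function assigns to its own greedy joint action.
    """
    probs = softmax(sub_value_maxima, temperature)
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


# Hypothetical values for three retained actions (not from the paper).
q_maxima = [8.0, 7.5, 6.0]
index, probs = sample_sub_value_index(q_maxima, rng=random.Random(0))
```

Because every retained action keeps a nonzero probability, lower-valued alternatives are still executed occasionally, which is the sense in which such a policy keeps exploration persistent.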

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Dynamic value functions: S2Q overcomes the limitation that conventional methods do not explicitly track suboptimal actions. When the optimal action changes, S2Q can immediately leverage the corresponding sub-value function and guide $Q^{tot}$ to adapt.
- Introducing communication during training: S2Q explicitly executes tracked suboptimal actions with priority determined by a Softmax distribution $P_t$ over their $Q^{*}$ values, thereby enabling exploration of a wider range of spaces than

Weaknesses

- Old benchmarks: The StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al., 2019) is an old benchmark. It is advised to report the experimental results on the recently proposed SMAC-Hard benchmark [1].

[1] SMAC-Hard: Enabling Mixed Opponent Strategy Script and Self-play on SMAC, arXiv:2412.17707.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The toy matrix game illustrates why retaining information about nearby high-value actions can help when the optimum shifts. S2Q operationalizes this via a suppressed TD objective and a Softmax-guided behavior policy. The algorithmic presentation (Alg. 1; eqs. (B.2–B.5)) is easy to follow.
2. Results span SMAC Hard+, GRF, SMAC-Comm (with a “-Comm” variant), and SMACv2, showing consistent gains and faster learning, not only final win rates. The compute table quantifies overhead.
3. Removing So

Weaknesses

1. The paper claims theoretical/empirical analyses, but no formal result is provided to justify that minimizing the modified TD with the suppression term reliably extracts distinct top-k modes under the IGM constraint or preserves contraction/stability properties. A small lemma would strengthen the case.
2. S2Q learns an encoder–decoder to approximate $P_t$ and reconstruct $s$, which provides additional supervision such as cross-entropy and reconstruction. Several non-communication baselines do

Reviewer 03 · Rating 8 · Confidence 4

Strengths

This paper is well written and provides a very extensive set of experiments and ablations. The results are consistently strong across very diverse environments. While apparently simple, I find the idea of suppressing the optimal actions in the calculation of subsequent value functions clever, and the results show large improvements across a range of tasks.
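The suppression idea the reviewers describe can be pictured on a toy payoff matrix: take the best joint action, suppress it, and repeat to recover the next-best alternatives. The sketch below is a tabular stand-in for intuition only and is not the paper's suppressed TD objective, which is learned with value decomposition; the function name `successive_top_actions` and the payoff values are hypothetical.

```python
def successive_top_actions(payoff, k):
    """Return up to k (joint_action, value) pairs from a 2-agent payoff matrix.

    Each round takes the argmax over joint actions that have not been
    suppressed yet, then suppresses the action it just found.
    """
    suppressed = set()
    found = []
    for _ in range(k):
        best, best_value = None, float("-inf")
        for i, row in enumerate(payoff):
            for j, value in enumerate(row):
                if (i, j) in suppressed:
                    continue  # skip actions extracted in earlier rounds
                if value > best_value:
                    best, best_value = (i, j), value
        if best is None:  # fewer than k joint actions exist
            break
        found.append((best, best_value))
        suppressed.add(best)
    return found


# Hypothetical 3x3 cooperative payoff matrix (not taken from the paper).
payoff = [
    [8.0, -12.0, -12.0],
    [-12.0, 6.0, 0.0],
    [-12.0, 0.0, 7.0],
]
top = successive_top_actions(payoff, k=3)
# top == [((0, 0), 8.0), ((2, 2), 7.0), ((1, 1), 6.0)]
```

If the payoff at (0, 0) later dropped below 7.0, the second retained action (2, 2) would already be known as the new optimum, which is the adaptability argument made in the reviews.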

Weaknesses

The authors could have provided a deeper analysis of the scalability of the proposed method, since it requires sequential computations using sub-networks. That is, since there is a mixing for each Q, how far can k scale? For the communication encoder-decoder module in Figure 3, the authors could have provided a better description of the architecture of these modules. Please find below some more specific questions.

Code & Models

Repositories

hyeon1996/S2Q
github

Videos

Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning· slideslive

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Adversarial Robustness in Machine Learning