Beyond Shallow Behavior: Task-Efficient Value-Based Multi-Task Offline MARL via Skill Discovery
Xun Wang, Zhuoran Li, Hai Zhong, Longbo Huang

TL;DR
This paper introduces SD-CQL, an offline multi-task multi-agent reinforcement learning (MARL) algorithm that discovers skills in a latent space, enabling generalization across multiple tasks from limited data. It achieves the best performance on 13 of 14 StarCraft II task sets.
Contribution
The paper proposes SD-CQL, a task-efficient multi-task offline MARL method that discovers skills in latent space and improves generalization without retraining for new tasks.
Findings
Achieves top performance on 13 out of 14 StarCraft II task sets.
Up to 68.9% improvement on individual tasks.
Demonstrates strong multi-task generalization from limited data.
Abstract
As a data-driven approach, offline MARL learns superior policies solely from offline datasets, ideal for domains rich in historical data but with high interaction costs and risks. However, most existing methods are task-specific, requiring retraining for new tasks, leading to redundancy and inefficiency. To address this issue, we propose a task-efficient value-based multi-task offline MARL algorithm, Skill-Discovery Conservative Q-Learning (SD-CQL). Unlike existing methods decoding actions from skills via behavior cloning, SD-CQL discovers skills in a latent space by reconstructing the next observation, evaluates fixed and variable actions separately, and uses conservative Q-learning with local value calibration to select the optimal action for each skill. It eliminates the need for local-global alignment and enables strong multi-task generalization from limited, small-scale source…
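The two ingredients named in the abstract, skill discovery by next-observation reconstruction and conservative Q-learning, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear encoder/decoder, the function names, the shapes, and the `alpha` weight are all assumptions made for illustration, and the conservative term is the standard discrete-action CQL regularizer (log-sum-exp of Q minus Q at the dataset action). The paper's local value calibration and its separate handling of fixed and variable actions are not modeled here.

```python
import numpy as np


def next_obs_reconstruction_loss(obs, next_obs, w_enc, w_dec):
    """Encode o_t into a latent z and score how well z predicts o_{t+1}.

    Hypothetical linear encoder/decoder; the paper's networks are not shown here.
    obs, next_obs: (batch, obs_dim); w_enc: (obs_dim, skill_dim); w_dec: (skill_dim, obs_dim)
    """
    z = np.tanh(obs @ w_enc)                      # latent "skill", shape (batch, skill_dim)
    pred_next = z @ w_dec                         # predicted next observation
    loss = np.mean((pred_next - next_obs) ** 2)   # mean squared reconstruction error
    return z, loss


def cql_loss(q_values, actions, td_targets, alpha=1.0):
    """Standard discrete-action CQL objective: TD error plus a conservative gap.

    q_values: (batch, n_actions) Q estimates, assumed already conditioned on (o, z)
    actions: (batch,) integer actions taken in the dataset
    td_targets: (batch,) bootstrapped targets, treated as constants
    """
    q_data = q_values[np.arange(len(actions)), actions]
    # Numerically stable log-sum-exp over the action dimension.
    q_max = q_values.max(axis=1)
    lse = q_max + np.log(np.exp(q_values - q_max[:, None]).sum(axis=1))
    conservative_gap = np.mean(lse - q_data)      # non-negative by construction
    td_error = 0.5 * np.mean((q_data - td_targets) ** 2)
    return td_error + alpha * conservative_gap, conservative_gap


# Usage with random data (all sizes are arbitrary).
rng = np.random.default_rng(0)
obs = rng.normal(size=(8, 6))
next_obs = rng.normal(size=(8, 6))
z, recon = next_obs_reconstruction_loss(
    obs, next_obs, rng.normal(size=(6, 3)), rng.normal(size=(3, 6))
)
total, gap = cql_loss(
    rng.normal(size=(8, 5)), rng.integers(0, 5, size=8), rng.normal(size=8)
)
print(z.shape, recon, total, gap)
```

The conservative gap pushes down Q-values for actions not supported by the dataset while pushing up the Q-value of the logged action, which is what keeps value estimates pessimistic in the offline setting.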
Peer Reviews
Decision: Submitted to ICLR 2026
The paper is well-written and easy to follow, presenting a clear framework that integrates skill discovery with conservative value learning for offline MARL. It demonstrates consistent performance improvements on SMAC benchmarks and provides quantitative analysis of the learned skill representations.
1. The skill discovery mechanism relies solely on local observation, without access to global state or task-specific rewards to learn latent skills. In this way, the learned skills mainly capture individual behavioral modes rather than genuinely cooperative or team-level strategies. This local-only formulation fundamentally limits the framework's capacity to represent coordinated multi-agent behaviors.
2. The method essentially performs multiple instances of single-agent skill discovery rather
- The paper addresses the critical and practical problem of zero-shot generalization in offline MARL, a key step for real-world applications with high data collection costs.
- The empirical evaluation is comprehensive, testing on 14 SMAC task sets across various data qualities and two transfer settings. The SOTA results (13/14 sets) are strong and supported by ablation studies in the appendix.
- The paper is well-structured. The method is presented logically.
1. Clarity of Notation/Figures: The paper contains ambiguities.
   1. Figure 1.a plots an undefined "CFCQL" baseline.
   2. The Q-function notation in Sec 3.2.1, such as $Q(a|o,z)$, is non-standard. The `|` (conditional) notation is incorrectly used and should be replaced with standard Q-function notation, e.g., $Q(o, z, a)$.
   3. It is recommended that bolded entries in all result tables indicate only the best result. The meaning of bolded entries differs between the main text and appendix tab
- The paper explores the under-investigated yet practical problem of multi-task offline MARL. While offline MARL itself has recently gained traction, the extension to an offline multi-task formulation, where agents must generalize across tasks with varying numbers of agents and configurations, remains relatively unexplored.
- From an originality standpoint, the paper contributes a skill-based framework that moves beyond behavior-cloning-based action decoding. Prior skill-discovery methods in RL
1. Questionable definition of skill via next-observation prediction: The authors claim z represents "skills" (high-level decision patterns), but the learning mechanism is simply next-observation reconstruction. This is not skill learning; it is observation representation learning.
   - Traditional skill-based RL defines skills as temporally extended action sequences or sub-policies.
   - SD-CQL's z is extracted by predicting o_{t+1} from o_t, which is a standard technique for learning task-relevant obse
Taxonomy
Topics: Reinforcement Learning in Robotics · Scheduling and Optimization Algorithms
Methods: Q-Learning
