From Solo to Symphony: Orchestrating Multi-Agent Collaboration with Single-Agent Demos
Xun Wang, Zhuoran Li, Yanshan Lin, Hai Zhong, Longbo Huang

TL;DR
This paper introduces SoCo, a framework that leverages solo demonstrations to improve the efficiency and effectiveness of multi-agent reinforcement learning by transferring solo knowledge into cooperative multi-agent policies.
Contribution
The paper proposes a novel Solo-to-Collaborative RL (SoCo) framework that pretrains solo policies and adapts them for multi-agent cooperation using a policy fusion mechanism.
Findings
SoCo significantly improves training efficiency.
SoCo enhances performance across various cooperative tasks.
Solo demonstrations effectively complement multi-agent data.
Abstract
Training a team of agents from scratch in multi-agent reinforcement learning (MARL) is highly inefficient, much like asking beginners to play a symphony together without first practicing solo. Existing methods, such as offline or transferable MARL, can ease this burden, but they still rely on costly multi-agent data, which often becomes the bottleneck. In contrast, solo experiences are far easier to obtain in many important scenarios, e.g., collaborative coding, household cooperation, and search-and-rescue. To unlock their potential, we propose Solo-to-Collaborative RL (SoCo), a framework that transfers solo knowledge into cooperative learning. SoCo first pretrains a shared solo policy from solo demonstrations, then adapts it for cooperation during multi-agent training through a policy fusion mechanism that combines an MoE-like gating selector and an action editor. Experiments across…
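The fusion step described above can be pictured with a small sketch. The abstract does not give the exact form of the gating selector or the action editor, so everything here is an assumption for illustration: the names `solo_policy`, `fuse_action`, `gate_logits`, and `edit` are hypothetical, the solo policy is a hand-written stand-in rather than a pretrained network, and the editor is modeled as a simple additive correction. In the paper these components would be learned during multi-agent training.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def solo_policy(obs, goal):
    """Hypothetical stand-in for a frozen, pretrained solo policy:
    returns a unit-length action pointing from obs toward goal."""
    direction = goal - obs
    norm = np.linalg.norm(direction)
    return direction / norm if norm > 0 else np.zeros_like(direction)

def fuse_action(obs, candidate_goals, gate_logits, edit, max_action=1.0):
    """Assumed fusion rule (not taken from the paper):
    1. a gating selector weights the solo policy's proposals, one per candidate goal;
    2. an action editor adds a correction to the weighted proposal;
    3. the result is clipped to the action bounds."""
    weights = softmax(gate_logits)                       # MoE-like gate
    proposals = np.stack([solo_policy(obs, g) for g in candidate_goals])
    base = weights @ proposals                           # weighted solo action
    action = np.clip(base + edit, -max_action, max_action)
    return action, weights

# Toy example: one agent, two candidate goals, gate favoring the first goal.
obs = np.array([0.0, 0.0])
goals = np.array([[1.0, 0.0], [0.0, 1.0]])
action, weights = fuse_action(
    obs,
    goals,
    gate_logits=np.array([2.0, 0.0]),   # would come from a learned selector
    edit=np.array([0.0, -0.1]),         # would come from a learned editor
)
```

In this toy setup the gate resolves which goal the agent should pursue, while the edit term nudges the solo action to account for the presence of teammates; both roles are simplified here to fixed numbers.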
Peer Reviews
Decision: Submitted to ICLR 2026
The paper takes a novel perspective, proposing to use a more general single-agent task corresponding to a multi-agent task as a kind of initialization for the multi-agent problem. It also shows good empirical performance, requiring fewer training steps within the multi-agent environments.
**Shortcomings in the comparative experiments:** 1. Since the method is positioned as a “plug-in,” it should be combined with more backbones to demonstrate extensibility. For the authors’ continuous-action settings, baselines such as MADDPG and SAC should be included. 2. The paper claims the method can extend to discrete-action environments, yet all reported experiments are on continuous control. The authors should add concrete results on discrete-action tasks. 3. I would like to see ablation
1. The paper addresses a practically relevant scenario where solo demonstrations are easier to collect than coordinated multi-agent trajectories in domains like collaborative coding, household robotics, and search-and-rescue. 2. The paper is generally well-written and easy to follow. 3. The policy fusion mechanism is intuitive, addressing goal ambiguity through the gating selector and domain shift through the action editor, with a modular design that allows flexibility for different scenarios.
1. The paper cites PegMARL (Yu et al., 2025), which is incorrectly characterized. According to its abstract, PegMARL utilizes "personalized expert demonstrations" that are "tailored for each individual agent" and "solely pertain to single-agent behaviors without encompassing any cooperative elements," which is functionally identical to what this paper calls "solo demonstrations." This is an important baseline that should be compared experimentally. 2. The approach requires "well-defined, struct
- The empirical results clearly show improvement with the SoCo framework on top of two existing MARL algorithms. - The paper addresses a well-motivated and under-studied problem, and the approach seems sound. - The idea of the action editor is interesting and could perhaps find applications beyond the solo-to-multi setting addressed in this paper. The key challenge addressed in this paper is domain shift, and the action editor idea could be used in other settings where domain shift is observed.
My main reservation is the lack of comparisons with other baseline methods and limited evaluation. For example, the paper claims that PegMARL (Yu et al., 2025) "assume sufficient multi-agent data" but that is not true. Most of the experiments in the PegMARL paper are with single agent demonstrations (see results in Sec 5.1). In fact, it seems that PegMARL would be a more general version since it can handle both single agent as well as multi-agent demonstrations (Sec 5.2). Given that, it seems th
Taxonomy
Topics: Reinforcement Learning in Robotics · Mobile Crowdsensing and Crowdsourcing · Robot Manipulation and Learning
