COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
Hongxin Zhang, Zeyuan Wang, Qiushi Lyu, Zheyuan Zhang, Sunli Chen, Tianmin Shu, Behzad Dariush, Kwonjoon Lee, Yilun Du, Chuang Gan

TL;DR
This paper introduces COMBO, a compositional world model for embodied multi-agent cooperation that enables decentralized agents to plan and cooperate effectively using egocentric views and a tree search approach.
Contribution
The paper proposes a novel compositional world model that factorizes joint actions and integrates vision-language models for multi-agent cooperation under partial observability.
Findings
Effective in multi-agent benchmarks with 2-4 agents
Enables online cooperative planning with an arbitrary number of agents
Improves cooperation efficiency across various tasks
Abstract
In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only egocentric views of the world. To effectively plan in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics conditioned on an arbitrary number of agents' actions given only partial egocentric visual observations of the world. To address this issue of partial observability, we first train generative models to estimate the overall world state given partial egocentric observations. To enable accurate simulation of multiple sets of actions on this world state, we then propose to learn a compositional world model for multi-agent cooperation by factorizing the naturally composable joint actions of multiple agents and compositionally generating the video conditioned on the world state. By leveraging this…
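The factorization described in the abstract can be illustrated with a small sketch. It assumes a diffusion-style video model and uses one common way of composing conditional generative models: start from an unconditional noise prediction and add one guidance term per agent action. The abstract does not state that COMBO uses exactly this rule, so the function name `compose_noise_predictions`, the `weight` parameter, and the toy arrays are all hypothetical and stand in for real model outputs.

```python
import numpy as np


def compose_noise_predictions(eps_uncond, eps_per_agent, weight=1.0):
    """Combine per-agent conditional noise predictions into one joint prediction.

    eps_uncond:    prediction with no action conditioning
    eps_per_agent: list of predictions, each conditioned on one agent's action
    weight:        hypothetical guidance weight (an assumption, not from the paper)
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    composed = eps_uncond.copy()
    for eps_i in eps_per_agent:
        # Each agent contributes only the change its action causes
        # relative to the unconditional prediction.
        composed += weight * (np.asarray(eps_i, dtype=float) - eps_uncond)
    return composed


# Toy 2x2 "frame": agent 1 affects the left column, agent 2 the right column.
eps_uncond = np.zeros((2, 2))
eps_agent_1 = np.array([[1.0, 0.0], [1.0, 0.0]])
eps_agent_2 = np.array([[0.0, 2.0], [0.0, 2.0]])

joint = compose_noise_predictions(eps_uncond, [eps_agent_1, eps_agent_2])
print(joint)
```

Because the per-agent terms are summed, the same function accepts any number of agents without retraining a joint-action model, which is the property the abstract attributes to the compositional design.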
Peer Reviews
Decision·ICLR 2025 Poster
**Novelty.** The suggested framework, named COMBO, offers a unique solution to the multi-agent planning problem by utilizing compositional world modeling for accurate simulation. **Clear Framework Explanation.** The framework is presented clearly and is easy to follow. Each setting and procedure is understandable through Figure 3 and Algorithm 1. The roles of each module are well-explained in the text and formulation. **Well-designed Experiments And Clarified Implications of Results.** The experim
**Lack of Figure Clarity and Interpretability.** The figures in the paper are unclear and difficult to interpret. Illustrations should enhance understanding, but these require reading the text to decipher them. For instance, Figure 1-(b) displays a seemingly random assortment of images without any labels. I believe this can be resolved by adding explicit labels that indicate which step of the sequential process each frame depicts. Similarly, Figure 4 presents consecutive frames without explanations. Adding annotations, e.g. sta
- The problem of joint cooperation among several agents is interesting, and the approach to world modeling by merging observations from different agents seems plausible.
- Decomposition of the full state into regions that can be affected by each agent. This is not correct in general (e.g. turning on a light would change the whole image), but it is a reasonable assumption that the overall scene is affected by the agents mostly independently.
- Scaling the loss with reachability assumes that reachability is provided externally; for real-world agents, reachability should additionally be estimated or discovered from the exploration data. It would be great if the authors covered in more detail how to discover the reachability regions when they are not provided. Also, what about regions that are not reachable by any agent? In the current formulation, it is not clear whether those regions are modeled in the world model.
- Fine-tuning of VLM o
- They designed an explicit compositional world model. It is one of the most distinctive designs in their modeling, it supports look-ahead planning (tree search planning), and they showed it is beneficial through the empirical evaluation results. As part of this, the world state estimation is a good basis for building the world model in the multi-agent setting.
- It is a well-written paper. I could easily follow their discussions without unnecessary questions.
- The room for improvement on the evaluated benchmarks is too small to show the effectiveness of their proposed modeling. In Table 1, COMBO outperformed previous works. However, compared with LLaVA, it shows a comparable success rate except in the TDW-Cook Cooperator 1 setting. Its efficiency in solving the tasks is clearly better than LLaVA's, but we felt this is not enough to evaluate the effectiveness of their modeling.
- The generalization performance evaluation is too weak to show that in lines 521-524 an
Taxonomy
Topics: Multi-Agent Systems and Negotiation
