From Pixels to Cooperation Multi Agent Reinforcement Learning based on Multimodal World Models

Sureyya Akin; Kavita Srivastava; Prateek B. Kapoor; Pradeep G. Sethi; Sunita Q. Patel; Rahu Srivastava

arXiv:2511.01310 · cs.MA · November 12, 2025

TL;DR

This paper introduces a multimodal world model that enables sample-efficient cooperative multi-agent reinforcement learning from high-dimensional sensory inputs like pixels and audio, by learning environment dynamics in a latent space.

Contribution

It presents a novel shared multimodal world model that fuses observations and acts as an imagined simulator for efficient policy training in multi-agent settings.

Findings

1. Achieves orders-of-magnitude better sample efficiency than baselines.
2. Multimodal fusion is crucial for task success in sensory-asymmetric environments.
3. Provides robustness to sensor dropout, aiding real-world deployment.

Abstract

Learning cooperative multi-agent policies directly from high-dimensional, multimodal sensory inputs such as pixels and audio (that is, learning "from pixels") is notoriously sample-inefficient. Model-free Multi-Agent Reinforcement Learning (MARL) algorithms struggle with the joint challenges of representation learning, partial observability, and credit assignment. To address this, we propose a novel framework based on a shared, generative Multimodal World Model (MWM). Our MWM is trained to learn a compressed latent representation of the environment's dynamics by fusing distributed, multimodal observations from all agents using a scalable attention-based mechanism. We then use the learned MWM as a fast, "imagined" simulator to train cooperative MARL policies (e.g., MAPPO) entirely within its latent space, decoupling representation learning from policy learning. We introduce a new set of…
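
The two ideas named in the abstract, attention-based fusion of per-agent multimodal observations and policy rollouts "imagined" in latent space, can be illustrated with a minimal NumPy sketch. This is not the paper's architecture: the function names (`fuse_observations`, `imagine_rollout`), the single-query attention pooling, the random projection matrices, and the `tanh` stand-ins for the policy and latent dynamics are all assumptions chosen for illustration. The mask shows one plausible way a fusion step could ignore a dropped sensor; it assumes at least one reading is available.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def fuse_observations(tokens, mask, query, w_k, w_v):
    """Attention-pool per-(agent, modality) embeddings into one latent.

    tokens: (n_tokens, d_in) embeddings, one per agent/modality pair
    mask:   (n_tokens,) bool, False where a sensor reading is missing
    query, w_k, w_v: hypothetical learned parameters (random here)
    """
    keys = tokens @ w_k                                # (n_tokens, d_k)
    values = tokens @ w_v                              # (n_tokens, d_z)
    scores = keys @ query / np.sqrt(keys.shape[-1])    # (n_tokens,)
    scores = np.where(mask, scores, -np.inf)           # dropped sensors get zero weight
    weights = softmax(scores)
    return weights @ values, weights

def imagine_rollout(z0, policy, dynamics, horizon):
    # Roll the policy forward inside the latent model, with no environment calls.
    z, trajectory = z0, []
    for _ in range(horizon):
        action = policy(z)
        z = dynamics(z, action)
        trajectory.append((z, action))
    return trajectory

# Illustrative sizes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
n_agents, n_modalities, d_in, d_k, d_z, d_a = 3, 2, 8, 4, 5, 2
tokens = rng.normal(size=(n_agents * n_modalities, d_in))
mask = np.ones(n_agents * n_modalities, dtype=bool)
mask[1] = False  # simulate one dropped sensor

query = rng.normal(size=d_k)
w_k = rng.normal(size=(d_in, d_k))
w_v = rng.normal(size=(d_in, d_z))
z0, weights = fuse_observations(tokens, mask, query, w_k, w_v)

# Stand-in joint policy and latent dynamics (random linear maps with tanh).
w_pi = rng.normal(size=(d_z, n_agents * d_a))
a_mat = rng.normal(size=(d_z, d_z))
b_mat = rng.normal(size=(n_agents * d_a, d_z))
policy = lambda z: np.tanh(z @ w_pi)
dynamics = lambda z, a: np.tanh(z @ a_mat + a @ b_mat)
trajectory = imagine_rollout(z0, policy, dynamics, horizon=5)
```

In a trained system the random matrices would be learned parameters, and the imagined trajectory would feed a policy-gradient update such as MAPPO rather than being discarded.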

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning