Mixture-of-World Models: Scaling Multi-Task Reinforcement Learning with Modular Latent Dynamics
Boxuan Zhang, Weipu Zhang, Zhaohan Feng, Wei Xiao, Jian Sun, Jie Chen, Gang Wang

TL;DR
The paper introduces Mixture-of-World Models, a scalable, modular architecture for multi-task reinforcement learning that improves sample efficiency and achieves state-of-the-art results on Atari and Meta-World benchmarks.
Contribution
It presents a novel modular architecture combining variational autoencoders, Transformer-based dynamics models, and task clustering for efficient multi-task RL.
Findings
Achieves a mean human-normalized score of 110.4% on Atari 100k with a single agent trained on 26 games, using fewer parameters.
Sets a new state-of-the-art success rate of 74.5% on Meta-World.
Uses 50% fewer parameters than ensemble models while maintaining performance.
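For readers unfamiliar with the Atari metric quoted above, the human-normalized score is conventionally computed per game as (agent - random) / (human - random) and then averaged across games. The snippet below is a minimal sketch of that arithmetic. The function names and the raw scores are hypothetical placeholders for illustration, not values from the paper.

```python
def human_normalized_score(agent: float, random: float, human: float) -> float:
    """Return the human-normalized score of one game, as a percentage."""
    if human == random:
        raise ValueError("human and random reference scores must differ")
    return 100.0 * (agent - random) / (human - random)


def mean_human_normalized_score(raw_scores: dict) -> float:
    """Average the per-game human-normalized scores.

    raw_scores maps a game name to an (agent, random, human) tuple.
    """
    values = [
        human_normalized_score(agent, random, human)
        for agent, random, human in raw_scores.values()
    ]
    return sum(values) / len(values)


# Hypothetical raw scores (agent, random, human); NOT results from the paper.
example_scores = {
    "game_a": (150.0, 50.0, 250.0),   # halfway between random and human
    "game_b": (900.0, 100.0, 500.0),  # twice the human margin over random
}

mean_score = mean_human_normalized_score(example_scores)
print(f"mean human-normalized score: {mean_score:.1f}%")
```

A mean above 100% therefore means the agent beats the human reference on average, although a few high-scoring games can dominate the mean.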
Abstract
A fundamental challenge in multi-task reinforcement learning (MTRL) is achieving sample efficiency in visual domains where tasks exhibit substantial heterogeneity in both observations and dynamics. Model-based reinforcement learning offers a promising path to improved sample efficiency through world models, but standard monolithic architectures struggle to capture diverse task dynamics, resulting in poor reconstruction and prediction accuracy. We introduce Mixture-of-World Models (MoW), a scalable architecture that combines modular variational autoencoders for task-adaptive visual compression, a hybrid Transformer-based dynamics model with task-conditioned experts and a shared backbone, and a gradient-based task clustering strategy for efficient parameter allocation. On the Atari 100k benchmark, a single MoW agent trained once on 26 Atari games achieves a mean human-normalized score of…
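The abstract does not spell out how the gradient-based task clustering works. As a rough illustration of the general idea only, the sketch below groups tasks whose gradient vectors point in similar directions, using cosine similarity and a greedy assignment. The similarity measure, the threshold, the greedy rule, and the toy gradient vectors are all assumptions made for this example and should not be read as the paper's algorithm.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_tasks(task_grads, threshold=0.5):
    """Greedily group tasks by gradient direction (assumed rule, not the paper's).

    Each task joins the first cluster whose first member has cosine
    similarity >= threshold with it; otherwise it starts a new cluster.
    """
    clusters = []
    for name, grad in task_grads.items():
        for members in clusters:
            representative = task_grads[members[0]]
            if cosine_similarity(grad, representative) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Toy flattened gradients, e.g. collected during a warmup stage (hypothetical).
toy_grads = {
    "task_a": [1.0, 0.1, 0.0],
    "task_b": [0.9, 0.2, 0.0],
    "task_c": [0.0, 0.1, 1.0],
}

clusters = cluster_tasks(toy_grads)
print(clusters)
```

In this toy setup, tasks with aligned gradients share a cluster (and so could share a VAE), while a task with a nearly orthogonal gradient gets its own.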
Peer Reviews
Decision: ICLR 2026 Poster
1. Tailored MoE specifically for World Model Dynamics. Standard MoE in Large Language Models often uses token-level routing. The authors rightly identify that for dynamics modeling, token-level routing can lead to "fragmented learning," where sequential temporal dependencies are broken if consecutive tokens route to different experts. 2. End-to-end MTRL often suffers from gradient conflicts and tasks dominating the loss landscape. MoW integrates distinct mechanisms to stabilize this. 3. Paramete
1. Dependency on Warmup Quality. The gradient-based clustering relies heavily on a "warmup stage" where a single VAE/predictor set is trained. If the initial warmup yields noisy gradients (common in early RL from pixels), the resulting clusters might be suboptimal and fixed for the rest of training. The paper does not deeply analyze the sensitivity of final performance to the duration or stability of this warmup phase. 2. Architectural Complexity and Tuning. The system is highly complex, involvi
1. To the best of my knowledge, modular architectures for world models are still underexplored, making this paper’s direction novel and valuable. 2. The design choices—such as task-level routing, the combination of fixed task clusters for VAEs with dynamic expert routing for transformers—are conceptually interesting and insightful. 3. The qualitative comparison in Figure 2 shows that MoW produces notably higher-quality imagined rollouts than the multi-task STORM baseline.
1. **The ablation study is severely incomplete.** The method introduces many components—such as cluster-specific VAEs, cascaded expert-shared transformers, an auxiliary task predictor head, balanced loss, and gradient-based task clustering—but none of these components are ablated to clarify their individual contributions to performance. Without such analysis, it is hard to assess which design elements are truly essential. 2. The overall architecture is highly complex, introducing a large number
**Clear modular design**. MoW uses a reasonable modular architecture. **Promising parameter scalability**. MoW demonstrates promising parameter scaling. However, it would be interesting to see whether scaling the parameters further yields continued (possibly exponential) performance gains.
**Notation clarity**. It would be beneficial to clarify the notation in the figures, for example Fig. 1, which could significantly enhance the readability of the paper. It would also help to briefly explain each piece of notation when it first appears. See the questions for details. **Lack of experimental evidence**. The authors do not provide the standard Atari 100k performance table with mainstream baselines. **Insufficient baselines.** STORM is actually not a multi-task RL b
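The first review's contrast between token-level and task-level routing can be made concrete with a toy example. In the sketch below, token-level routing picks an expert independently for each token, so consecutive timesteps of one trajectory may land on different experts. Task-level routing sends the whole sequence to a single expert. The fixed task-to-expert lookup table is a simplifying assumption standing in for whatever learned task-conditioned router the paper uses, and all names and scores are hypothetical.

```python
def token_level_route(token_scores):
    """Pick an expert independently per token (argmax over expert scores)."""
    return [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in token_scores
    ]


def task_level_route(task_id, task_to_expert, seq_len):
    """Send every token of a sequence to the expert assigned to its task."""
    return [task_to_expert[task_id]] * seq_len


def expert_switches(assignment):
    """Count how often consecutive tokens are handled by different experts."""
    return sum(1 for a, b in zip(assignment, assignment[1:]) if a != b)


# Hypothetical router scores for 4 consecutive tokens over 2 experts.
scores = [[0.6, 0.4], [0.3, 0.7], [0.8, 0.2], [0.45, 0.55]]

token_assignment = token_level_route(scores)
task_assignment = task_level_route(
    "task_a", {"task_a": 0, "task_b": 1}, seq_len=len(scores)
)

print("token-level:", token_assignment, "switches:", expert_switches(token_assignment))
print("task-level: ", task_assignment, "switches:", expert_switches(task_assignment))
```

The switch count is a crude proxy for the "fragmented learning" concern: with task-level routing it is zero by construction, so one expert sees the full temporal context of a trajectory.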
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
