Sparse Imagination for Efficient Visual World Model Planning
Junha Chun, Youngjoon Jeong, Taesup Kim

TL;DR
This paper introduces a sparse imagination technique for visual world models that reduces computational load during planning, enabling real-time decision-making in resource-constrained robotic systems without sacrificing task performance.
Contribution
It proposes a novel sparse imagination method using a transformer-based vision model with randomized grouped attention, improving efficiency in visual world model planning.
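As a rough illustration of what "randomized grouped attention" could look like, the sketch below randomly partitions patch tokens into groups and restricts attention to tokens within the same group. This is an assumption about the mechanism, not the authors' implementation: the function names `random_group_mask` and `masked_attention`, the near-equal group sizes, and the single-head NumPy attention are all hypothetical choices made for the example.

```python
import numpy as np

def random_group_mask(num_tokens, num_groups, rng):
    """Return a boolean (num_tokens, num_tokens) mask; True means attention is allowed.

    Assumption for this sketch: tokens are split uniformly at random into
    groups of near-equal size, and a token may only attend within its group.
    """
    perm = rng.permutation(num_tokens)
    group_id = np.empty(num_tokens, dtype=int)
    # Deal the shuffled tokens round-robin into the groups.
    group_id[perm] = np.arange(num_tokens) % num_groups
    return group_id[:, None] == group_id[None, :]

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed pairs get -inf so their softmax weight is exactly zero.
    scores = np.where(mask, scores, -np.inf)
    # Each row keeps its diagonal entry, so the row maximum is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))            # 8 patch tokens, 4 features each
mask = random_group_mask(8, num_groups=2, rng=rng)
out = masked_attention(tokens, tokens, tokens, mask)
```

Because each token only ever sees a random subset of the others during training, a model trained this way would plausibly tolerate receiving only a subset of tokens at inference time, which is the property the summary attributes to the method.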
Findings
Significantly accelerates planning inference.
Maintains high control fidelity with reduced computation.
Applicable to both simple and complex real-world tasks.
Abstract
World model based planning has significantly improved decision-making in complex environments by enabling agents to simulate future states and make informed choices. The computational burden of this simulation, however, is particularly restrictive in robotics, where resources are severely constrained. To address this limitation, we propose Sparse Imagination for Efficient Visual World Model Planning, which improves computational efficiency by reducing the number of tokens processed during forward prediction. Our method leverages a sparsely trained, transformer-based visual world model with a randomized grouped attention strategy, allowing the model to flexibly adjust the number of tokens it processes to the available computational resources. By enabling sparse imagination during latent rollout, our approach significantly accelerates planning while maintaining high control fidelity. Experimental results…
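The idea of processing fewer tokens during latent rollout can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's planner: `toy_world_model` is a hypothetical stand-in for the learned transformer, the planner is plain random shooting rather than whatever optimizer the authors use, and the cost is a mean squared error against goal tokens at the kept positions. The one detail taken from the page is that the kept token subset is drawn at random and resampled at each planning step.

```python
import numpy as np

def sample_keep_indices(num_tokens, drop_ratio, rng):
    """Pick a random subset of token positions to keep for this planning step."""
    num_keep = max(1, int(round(num_tokens * (1.0 - drop_ratio))))
    return np.sort(rng.choice(num_tokens, size=num_keep, replace=False))

def toy_world_model(tokens, action):
    """Hypothetical dynamics: every token is shifted by the action vector.

    A real world model would be a learned transformer over the kept tokens.
    """
    return tokens + action

def sparse_rollout_cost(tokens, goal_tokens, actions, keep):
    """Roll out only the kept tokens and compare them with the goal tokens."""
    z = tokens[keep]
    for a in actions:
        z = toy_world_model(z, a)
    return float(np.mean((z - goal_tokens[keep]) ** 2))

def plan_step(tokens, goal_tokens, horizon, num_candidates, drop_ratio, rng):
    """One MPC step using random shooting over sparse rollouts."""
    # A fresh subset is sampled at every call, so successive MPC steps
    # see different parts of the observation.
    keep = sample_keep_indices(tokens.shape[0], drop_ratio, rng)
    candidates = rng.normal(size=(num_candidates, horizon, tokens.shape[1]))
    costs = [sparse_rollout_cost(tokens, goal_tokens, c, keep) for c in candidates]
    best = candidates[int(np.argmin(costs))]
    return best[0], keep  # execute only the first action, then replan

rng = np.random.default_rng(0)
obs = rng.normal(size=(196, 3))   # e.g. 14 x 14 patch tokens with 3 features
goal = obs + 1.0
action, kept = plan_step(obs, goal, horizon=4, num_candidates=16, drop_ratio=0.5, rng=rng)
```

With `drop_ratio=0.5`, each rollout touches half of the tokens, so the per-step cost of a transformer world model (which grows faster than linearly in the token count) would shrink accordingly.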
Peer Reviews
Decision: ICLR 2026 Poster
- The idea is well-motivated, as planning time and inference cost are major concerns for patch token-based world models that encode rich information about the environment.
- The reduction in inference and planning time is significant, while achieving comparable or even better planning success rates across various benchmarks.
- The analysis of token information in Section 5.3 is quite interesting, providing insights into the information content and redundancy of patch features.

- The application of the proposed method seems somewhat limited, as it only applies to world models with patch tokens, a transformer backbone, and MPC as the planning algorithm. However, the idea of randomly dropping tokens during training and inference appears more general. Could this approach be extended to other use cases?
- At planning time, the method relies on resampling different tokens across MPC iterations to capture the full task information. For open-loop CEM planning, however, inform

1. The paper is generally well-written and easy to follow, with a clear problem formulation, motivation, and method description. The overall narrative is coherent and the technical contributions are communicated effectively.
2. The core idea of training a transformer-based world model with randomized grouped attention such that it can perform test-time planning by randomly dropping visual patch tokens is conceptually simple, architecture-agnostic, and broadly applicable to visual world-modeling

1. The writing, particularly in the Experiments section, needs improvement for readability. Important details and baselines are placed in Appendix B.6 instead of the main text, which makes it difficult to follow comparisons and understand experimental context. Additionally, some methods (e.g., Latin Hypercube Sampling, McKay et al., 2000) are referenced without explanation, making it harder for readers unfamiliar with these techniques to interpret results.
2. The paper argues that training the

1. The paper introduces Sparse Imagination, a surprisingly simple yet highly effective approach for accelerating world models. Counter-intuitively, such a straightforward method outperforms importance-based sampling strategies, offering both practical value and novel insight that the full patch tokens are redundant.
2. The authors conduct extensive experiments to demonstrate the superiority of random dropout-based Sparse Imagination. Comparisons across multiple dropout ratios and different drop

1. The real-world evaluation is limited to only a single task, which raises concerns about the robustness of the reported results. In general, I would suggest using at least three tasks, or at a minimum two different embodiments. This is particularly important because the workspace of the LeRobot arm is quite small, meaning that the effective action chunk workspace is very limited.
2. I have some concerns regarding the comparative experimental settings. In the case of random sampling, does fixe
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Advanced Vision and Imaging
Methods: Softmax · Attention Is All You Need
