Imagination Helps Visual Reasoning, But Not Yet in Latent Space
You Li, Chi Chen, Yanghao Li, Fanhu Zeng, Kaiyu Huang, Jinan Xu, Maosong Sun

TL;DR
This paper investigates the effectiveness of latent visual reasoning in multimodal models, revealing limited causal influence of latent tokens and proposing a simpler explicit imagination method that outperforms complex latent-space approaches.
Contribution
The study uncovers key disconnections in latent reasoning and introduces CapImagine, a straightforward method that explicitly teaches models to imagine visually using text, improving performance.
Findings
Latent tokens have limited causal impact on final answers.
Large perturbations of the input barely change the latent tokens, and perturbations of the latent tokens barely change the final answer.
CapImagine outperforms complex latent-space baselines on vision benchmarks.
Abstract
Latent visual reasoning aims to mimic the human imagination process by meditating through hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations on the input result in negligible changes to the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations on the latent tokens yield minimal impact on the final answer,…
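The two disconnections described in the abstract can be illustrated with a toy probe. The sketch below is not the authors' code or their exact metrics: the function names, the cosine-based sensitivity score, the flip-rate measure, and both toy "models" are assumptions made for illustration only. It contrasts a pipeline in which the latents mediate the answer with one in which they do not.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened arrays."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def input_latent_sensitivity(latent_fn, x, x_perturbed):
    """1 - cos(latents(x), latents(x_perturbed)).
    A value near 0 means the latents barely react to the input change."""
    return 1.0 - cosine_similarity(latent_fn(x), latent_fn(x_perturbed))

def latent_answer_flip_rate(latent_fn, answer_fn, inputs, perturb_latent):
    """Fraction of inputs whose answer changes when the latents are perturbed."""
    flips = 0
    for x in inputs:
        z = latent_fn(x)
        if answer_fn(x, z) != answer_fn(x, perturb_latent(z)):
            flips += 1
    return flips / len(inputs)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))          # hypothetical input-to-latent map
z_const = rng.normal(size=8)         # latents that ignore the input
inputs = [rng.normal(size=8) for _ in range(20)]
x, x_noise = inputs[0], rng.normal(size=8)
negate = lambda z: -z                # a drastic latent perturbation

# Toy "connected" pipeline: latents depend on the input, answer depends on latents.
connected_latent = lambda x: W @ x
connected_answer = lambda x, z: int(np.argmax(z[:3]))

# Toy "disconnected" pipeline: constant latents, answer read straight from the input.
disconnected_latent = lambda x: z_const
disconnected_answer = lambda x, z: int(np.argmax(x[:3]))

sens_connected = input_latent_sensitivity(connected_latent, x, x_noise)
sens_disconnected = input_latent_sensitivity(disconnected_latent, x, x_noise)
flip_connected = latent_answer_flip_rate(connected_latent, connected_answer, inputs, negate)
flip_disconnected = latent_answer_flip_rate(disconnected_latent, disconnected_answer, inputs, negate)

print("input-latent sensitivity:", sens_connected, sens_disconnected)
print("latent-answer flip rate:", flip_connected, flip_disconnected)
```

In this toy setting, the disconnected pipeline shows the pattern the abstract reports for latent reasoning models: the latents do not move when the input is replaced, and the answer does not move when the latents are perturbed. A real analysis would substitute actual model hidden states and generated answers for the toy functions.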
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Action Observation and Synchronization
