Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

Zeyuan Yang; Xueyang Yu; Delin Chen; Maohao Shen; Chuang Gan

arXiv:2506.17218·cs.CV·June 23, 2025

TL;DR

Mirage introduces a novel framework enabling vision-language models to perform visual reasoning internally with latent visual tokens, improving multimodal reasoning without generating explicit images.

Contribution

The paper proposes a new method, Mirage, that allows VLMs to reason visually through latent tokens, bypassing the need for explicit image generation, and enhances reasoning capabilities.

Findings

01

Improved performance on multimodal reasoning benchmarks.

02

Effective internal visual reasoning without explicit image rendering.

03

Enhanced reasoning through training with distillation and reinforcement learning.

Abstract

Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders the reasoning ability. Inspired by the way humans reason with mental imagery (the internal construction and manipulation of visual cues), we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed as Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating…
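The decoding idea in the abstract, where hidden states are fed back as the next inputs instead of being projected to a text token, can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the authors' implementation: `toy_step` is a contrived stand-in for a VLM forward pass, and the `<latent>` trigger token, the fixed `latent_len`, the vocabulary, and the identity embedding table are all hypothetical choices made so the example runs on its own.

```python
import math

DIM = 5
# Hypothetical toy vocabulary; "<latent>" is an assumed trigger for visual thinking.
VOCAB = ["<latent>", "up", "down", "done", "<question>"]
# Identity embedding table: token i maps to the i-th unit vector.
EMBED = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]


def toy_step(inputs):
    """Stand-in for a model forward pass (NOT a real VLM).

    Looks only at the last input vector, shifts it by one position and
    squashes it with tanh, which is enough to make the loop deterministic.
    """
    last = inputs[-1]
    return [math.tanh(last[(d - 1) % DIM]) for d in range(DIM)]


def text_logits(hidden):
    """Score each vocabulary entry by a dot product with its embedding."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in EMBED]


def decode(prompt_ids, max_steps=6, latent_len=2):
    """Interleave text tokens and latent tokens in one trajectory."""
    inputs = [EMBED[i] for i in prompt_ids]
    trajectory = []
    latent_remaining = 0
    for _ in range(max_steps):
        hidden = toy_step(inputs)
        if latent_remaining > 0:
            # Latent mode: the hidden state itself becomes the next input,
            # so no text token and no image is produced at this step.
            inputs.append(hidden)
            trajectory.append(("latent", hidden))
            latent_remaining -= 1
            continue
        # Text mode: pick the highest-scoring token and embed it as usual.
        scores = text_logits(hidden)
        token_id = scores.index(max(scores))
        inputs.append(EMBED[token_id])
        trajectory.append(("text", VOCAB[token_id]))
        if VOCAB[token_id] == "<latent>":
            latent_remaining = latent_len
        elif VOCAB[token_id] == "done":
            break
    return trajectory


trajectory = decode([VOCAB.index("<question>")])
print([kind for kind, _ in trajectory])
```

In this toy run the model emits `<latent>`, continues for two steps by appending its own hidden states, and then returns to ordinary text decoding. How a real model decides when to enter and leave latent mode is not specified in the excerpt above.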

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The paper proposes a novel multimodal reasoning method that replaces full pixel-level decoding with compressed latent embeddings, achieving considerable performance gains.
- Clear experimental and qualitative results.
- Valid ablation study validating the necessity and effectiveness of the proposed two-stage training framework.

Weaknesses

- The process for compressing latent tokens (line 215) is not clearly explained. It is unclear how these features are divided and compressed with uniform pooling, and similarly how random pooling is applied.
- There are several minor writing issues, such as line 133: urther -> further and inconsistent citation formatting (e.g., Anole (Chern et al., 2024) vs. Aurora Bigverdi et al. (2025) in lines 306–312). The authors are advised to carefully proofread the manuscript and standardize the citatio
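For readers unfamiliar with the term, one plausible reading of "uniform pooling" is sketched here: split a sequence of patch features into k contiguous, near-equal groups and average each group. This is an assumption made for illustration; the review states that the paper does not spell out the procedure, so the function name `uniform_pool`, the contiguous grouping, and mean aggregation are all hypothetical. The value k = 8 is the one another review on this page cites.

```python
def uniform_pool(features, k):
    """Average contiguous, near-equal groups of feature vectors into k vectors.

    Hypothetical illustration only; not taken from the paper or its code.
    """
    n = len(features)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of features")
    dim = len(features[0])
    pooled = []
    for g in range(k):
        # Integer boundaries keep group sizes within one element of each other.
        start = g * n // k
        end = (g + 1) * n // k
        group = features[start:end]
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled


# 16 toy patch features of dimension 2, compressed to k = 8 latent tokens.
patch_features = [[float(i), 1.0] for i in range(16)]
latent_tokens = uniform_pool(patch_features, 8)
print(len(latent_tokens), latent_tokens[0], latent_tokens[-1])
```

A random-pooling variant would presumably draw the groups or the kept features at random instead, but the excerpt gives no detail on that either.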

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. The idea of latent visual tokens as a condition for Chain-of-Thought sounds interesting, which avoids the pitfalls of explicit image generation.
2. The two-stage training is simple and easy to implement, which can effectively balance visual grounding and reasoning flexibility. Besides, the use of RL for further refinement is a modern and justified addition.
3. Experiments conducted on spatial planning tasks present consistent improvements across multiple challenging benchmarks (e.g., VSP,

Weaknesses

1. The idea of generating latent visual tokens for Chain-of-Thought is not new, as it has already been used in many autonomous driving applications [1, 2, 3]. Therefore, the authors need to discuss the distinctions between their approach and these existing works.
2. There is a lack of a clear description regarding how to effectively encode the helper image. The approach for compressing an image into $k$=8 tokens requires a clear explanation.
3. It is necessary to analyze the attention maps bet

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- Relevance of the problem: The paper tackles an important limitation of current vision–language models—their predominantly text-centric reasoning, which prevents them from integrating visual and textual information as humans naturally do. This is a timely and significant research direction.

Weaknesses

- The paper is not explicit about which baselines have access to helper images during training, making it difficult to interpret the reported gains. Although the authors mention that “unified models (Anole, MVoT) use the same multimodal supervision” (l. 308), it remains unclear whether these baselines (or the others in Tab 1) were trained with helper images or only with text-based reasoning traces that reference them. My understanding is that only Mirage directly used the helper images, but thi

Reviewer 04 · Rating 2 · Confidence 3

Strengths

- The paper is clearly written and generally accessible, with the main design choices well-motivated and logically presented.
- The proposed idea of enhancing visual reasoning through latent visual tokens is both interesting and, to the best of my knowledge, novel.

Weaknesses

- My primary concern is the limited generalizability of the proposed approach. As currently presented, each task appears to require separate fine-tuning on a specifically synthesized dataset. It remains unclear whether the capabilities learned on one task can transfer to other domains. Moreover, some visual reasoning tasks may not admit a straightforward procedure for generating the synthetic 'helper image' used during training. Finally, results are reported exclusively on the Qwen2.5-VL family;

Code & Models

Repositories

umass-embodied-agi/mirage
jax · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Generative Adversarial Networks and Image Synthesis

Methods: ALIGN