Multimodal Reasoning via Latent Refocusing

Jizheng Ma; Xiaofei Zhou; Geyuan Zhang; Yanlong Song; Han Yan

arXiv:2511.02360 · cs.CV · January 21, 2026

Open Access

TL;DR

This paper introduces LaRe, a multimodal reasoning method that refocuses on visual inputs within a rich latent space, improving accuracy and efficiency in reasoning tasks involving images and language.

Contribution

LaRe combines visual refocusing with latent space reasoning and a semantic augmentation training strategy, advancing multimodal reasoning capabilities and interpretability.

Findings

01 LaRe improves average accuracy by 9.4% over baselines.

02 LaRe reduces inference token usage by 16.5%.

03 LaRe achieves performance competitive with larger models.

Abstract

Chain of Thought (CoT) reasoning enhances logical performance by decomposing complex tasks, yet its multimodal extension faces a trade-off. The existing Thinking with Images paradigm is limited by the modality gap between vision and language, which hinders reliable extraction of reasoning-relevant information from high-dimensional visual data. Recent latent-space reasoning methods provide stronger multimodal representations, but they often lack the ability to refocus on visual inputs and suffer from limited interpretability. To address these issues, we propose Latent Refocusing (LaRe), a novel multimodal reasoning paradigm that combines visual refocusing with rich latent representations, enabling iterative reasoning within the latent space. We further design a semantic augmentation training strategy that enhances the semantic structure of the latent space through…

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Neurobiology of Language and Bilingualism