Multimodal Reasoning via Latent Refocusing
Jizheng Ma, Xiaofei Zhou, Geyuan Zhang, Yanlong Song, Han Yan

TL;DR
This paper introduces LaRe, a multimodal reasoning method that refocuses on visual inputs within a rich latent space, improving accuracy and efficiency in reasoning tasks involving images and language.
Contribution
LaRe combines visual refocusing with latent space reasoning and a semantic augmentation training strategy, advancing multimodal reasoning capabilities and interpretability.
Findings
LaRe improves average accuracy by 9.4% over baselines.
LaRe reduces inference token usage by 16.5%.
LaRe achieves performance competitive with larger models.
Abstract
Chain-of-Thought (CoT) reasoning enhances logical performance by decomposing complex tasks, yet its multimodal extension faces a trade-off. The existing Thinking with Images paradigm is limited by the modality gap between vision and language, which hinders reliable extraction of reasoning-relevant information from high-dimensional visual data. Recent latent-space reasoning methods provide stronger multimodal representations, but they often lack the ability to refocus on visual inputs and suffer from limited interpretability. To address these issues, we propose Latent Refocusing (LaRe), a novel multimodal reasoning paradigm that combines visual refocusing with rich latent representations, enabling iterative reasoning within the latent space. We further design a semantic augmentation training strategy that enhances the semantic structure of the latent space through…
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Neurobiology of Language and Bilingualism
