Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
Jin Cui, Xinyue Long, Xunyong Zhang, Yadong Zhang, Chuanchang Su, Jingye Gan, Boran Zhao, Pengju Ren

TL;DR
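This paper introduces RIS (Retrieve, Integrate, and Synthesize), a spatial-semantic grounded framework for latent visual reasoning in multimodal large language models (MLLMs). It targets two problems: the information bottleneck created by compressing visual evidence into discrete textual thoughts, and the insufficient manifold compatibility of existing latent reasoning methods.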
Contribution
RIS develops latent reasoning as a compatible extension of pretrained MLLM computation, grounding it in spatial evidence (bounding boxes) and semantic evidence (region-specific descriptions) to improve interpretability and reasoning fidelity.
Findings
RIS achieves consistent improvements on multiple benchmarks.
RIS learns diverse and interpretable latent trajectories.
RIS enhances internal visual reasoning in MLLMs.
Abstract
Multimodal Large Language Models (MLLMs) have made remarkable progress on vision-language reasoning, yet most methods still compress visual evidence into discrete textual thoughts, creating an information bottleneck for fine-grained perception. Recent latent visual reasoning methods attempt to reason in continuous hidden states, but we find that they suffer from insufficient manifold compatibility: latent trajectories drift away from pretrained reasoning circuits, collapse into instance-agnostic patterns, and are often bypassed during answer generation. To address these issues, we propose RIS (Retrieve, Integrate, and Synthesize), a spatial-semantic grounded framework that develops latent reasoning as a compatible extension of pretrained MLLM computation. We first construct a step-wise grounded reasoning dataset with bounding boxes and region-specific semantic descriptions. Built on…
