Multimodal Latent Reasoning via Hierarchical Visual Cues Injection
Yiming Zhang, Qiangyu Yan, Borui Jiang, Kai Han

TL;DR
This paper introduces HIVE, a framework for multimodal latent reasoning that integrates hierarchical visual cues into the model's internal representations, enabling more grounded and efficient multi-step inference.
Contribution
HIVE is a novel recursive transformer-based approach that injects hierarchical visual cues into latent space for improved multimodal reasoning without relying on explicit textual rationales.
Findings
Hierarchical visual cues improve scene understanding.
Test-time scaling with vision knowledge enhances reasoning.
Latent space reasoning reduces reliance on verbose explanations.
Abstract
The advancement of multimodal large language models (MLLMs) has enabled impressive perception capabilities. However, their reasoning process often remains a "fast thinking" paradigm, reliant on end-to-end generation or explicit, language-centric chains of thought (CoT), which can be inefficient, verbose, and prone to hallucination. This work posits that robust reasoning should evolve within a latent space, integrating multimodal signals seamlessly. We propose multimodal latent reasoning via HIerarchical Visual cuEs injection (HIVE), a novel framework that instills deliberate, "slow thinking" without depending on superficial textual rationales. Our method recursively extends transformer blocks, creating an internal loop for iterative reasoning refinement. Crucially, it injectively grounds this process with hierarchical visual cues from global scene context to fine-grained regional…
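To make the idea of an internal loop with cue injection concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the names `shared_block`, `inject`, and `latent_reasoning_loop`, the additive gating, and the use of `tanh` as a stand-in for the reused transformer blocks are all assumptions made for illustration. It only shows the control flow the abstract describes: the same block is applied repeatedly to a latent state, and a visual cue from the next level of a coarse-to-fine hierarchy is mixed in before each pass.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def shared_block(h: Vector) -> Vector:
    """Hypothetical stand-in for one pass through the reused transformer blocks."""
    return [math.tanh(x) for x in h]


def inject(h: Vector, cue: Vector, gate: float = 0.5) -> Vector:
    """Mix a visual cue into the latent state (additive gating is an assumption)."""
    return [x + gate * c for x, c in zip(h, cue)]


def latent_reasoning_loop(
    h0: Vector,
    cues: Sequence[Vector],
    block: Callable[[Vector], Vector] = shared_block,
    gate: float = 0.5,
) -> List[Vector]:
    """Run one refinement step per cue, ordered from global to regional.

    Returns every intermediate latent state, starting with the initial one.
    """
    h = list(h0)
    trace = [h]
    for cue in cues:
        # Inject the cue for this hierarchy level, then reuse the same block.
        h = block(inject(h, cue, gate))
        trace.append(h)
    return trace


# Toy inputs: a 2-dimensional latent state and two hierarchy levels.
global_cue = [1.0, 0.0]    # hypothetical scene-level feature
regional_cue = [0.0, 1.0]  # hypothetical region-level feature
trace = latent_reasoning_loop([0.0, 0.0], [global_cue, regional_cue])
final_state = trace[-1]
```

In this sketch, adding more entries to `cues` simply runs more refinement steps, which loosely mirrors the test-time scaling mentioned in the findings. No text is generated between steps, so the intermediate reasoning stays in the latent state.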
