Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
Yanting Miao, Yutao Sun, Dexin Wang, Mengyu Zhou, Pascal Poupart, Lei Lv, Qi Zhao, Li Wang, Hao Li, Xiaoxi Jiang, Guanjun Jiang

TL;DR
This paper introduces GAP, a new paradigm for visual latent reasoning in multimodal large language models, addressing feature-space mismatch issues to improve stability and performance.
Contribution
GAP aligns visual latent reasoning at multiple levels, enhancing stability and performance in multimodal large language models without external tools.
Findings
Achieves the best mean aggregate perception and reasoning performance on Qwen2.5-VL 7B.
Latent signals provide task-relevant visual information beyond token slots.
Addresses feature-space mismatch to improve latent feedback reliability.
Abstract
Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume \citep{xie2025mhc,li2026siamesenorm,team2026attention}. This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose GAP, a Granular Alignment Paradigm for visual latent modeling. GAP aligns visual latent reasoning at…
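The norm-regime mismatch described in the abstract can be illustrated with a toy pre-norm residual stack. The sketch below is not GAP and not the authors' code: the layer function (a single random linear map), the RMS normalization, the dimensions, and the initialization scales are all assumptions chosen only to show why a hidden state taken from the residual stream of a pre-norm decoder tends to have a much larger norm than the input embeddings.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Normalize each token vector to unit root-mean-square (no learned gain).
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def prenorm_stack(x, weights):
    # Pre-norm residual update: x <- x + f(norm(x)).
    # Here f is a random linear map (a stand-in for attention/MLP blocks).
    for w in weights:
        x = x + rms_norm(x) @ w
    return x

rng = np.random.default_rng(0)
d, n_layers, n_tokens = 64, 24, 16  # hypothetical toy sizes

# Input embeddings with a small initialization scale (assumed).
embeddings = rng.normal(0.0, 0.02, size=(n_tokens, d))
# One random weight matrix per layer, scaled to roughly preserve norm.
weights = [rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d)) for _ in range(n_layers)]

hidden = prenorm_stack(embeddings, weights)

emb_norm = np.linalg.norm(embeddings, axis=-1).mean()
hid_norm = np.linalg.norm(hidden, axis=-1).mean()
print(f"mean embedding norm:    {emb_norm:.3f}")
print(f"mean hidden-state norm: {hid_norm:.3f}")
print(f"ratio hidden/embedding: {hid_norm / emb_norm:.1f}")
```

Because every layer adds an un-normalized update to the residual stream, the final hidden states in this toy setting sit far outside the norm range of the input embeddings. Feeding such a state back in as the next input, as the output-as-input paradigm does, presents the model with inputs unlike those it consumed during training.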
