Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs
Xin Zhang, Qiqi Tao, Jiawei Du, Moyun Liu, Joey Tianyi Zhou

TL;DR
The paper introduces a method that strengthens visual latent reasoning in multimodal LLMs by optimizing latent representations at inference time, counteracting the suppression of latent contributions that it identifies in existing methods.
Contribution
It proposes a two-stage inference-time latent optimization approach that improves reasoning capacity without updating model parameters.
Findings
Latent optimization significantly improves reasoning across eight benchmarks.
Disentangling the two conflicting objectives prevents latent collapse and improves the semantic quality of the latents.
Inference-time optimization unlocks latent reasoning without additional training.
Abstract
Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in existing latent visual reasoning methods: although visual latents become semantically enriched during training, their contribution to final answer prediction is systematically suppressed. Within the shared parameter space, the autoregressive objective favors shortcut reliance on direct visual input, driving latent tokens toward transition-like states rather than informative reasoning content. We term this phenomenon Silenced Visual Latents. To address it, we disentangle the two conflicting objectives by directly optimizing the latent reasoning at inference time, keeping backbone parameters frozen. In Stage I,…
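The core idea in the abstract, optimizing a latent at inference time while backbone parameters stay frozen, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the linear `forward`, the squared-error `loss`, and the values of `WEIGHTS` and `TARGET` are hypothetical stand-ins, and the paper's actual stage objectives are not reproduced.

```python
# Toy illustration only: a frozen linear "backbone" and an optimizable latent.
# The objective is a placeholder, NOT the paper's Stage I or Stage II loss.

def forward(weights, latent):
    """Frozen backbone: matrix-vector product mapping a latent to outputs."""
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]

def loss(weights, latent, target):
    """Placeholder objective: squared error between outputs and a target."""
    return sum((o - t) ** 2 for o, t in zip(forward(weights, latent), target))

def grad_latent(weights, latent, target):
    """Gradient of the loss with respect to the latent only."""
    residual = [o - t for o, t in zip(forward(weights, latent), target)]
    return [
        2 * sum(residual[i] * weights[i][j] for i in range(len(weights)))
        for j in range(len(latent))
    ]

def optimize_latent(weights, latent, target, lr=0.05, steps=200):
    """Gradient descent that updates a copy of the latent; weights are never written."""
    z = list(latent)
    for _ in range(steps):
        g = grad_latent(weights, z, target)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z

WEIGHTS = [[1.0, 0.5], [-0.5, 2.0]]  # hypothetical frozen parameters
TARGET = [1.0, -1.0]                 # hypothetical target output
z0 = [0.0, 0.0]                      # initial latent
z_opt = optimize_latent(WEIGHTS, z0, TARGET)
```

The point of the sketch is the separation of roles: only the latent receives gradient updates, so the same frozen model can be steered per input without any training step. In a real multimodal model the latent would be a sequence of hidden-state vectors and the gradient would come from automatic differentiation rather than a hand-written formula.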
