Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

Xin Zhang; Qiqi Tao; Jiawei Du; Moyun Liu; Joey Tianyi Zhou

arXiv:2605.02735·cs.LG·May 5, 2026

TL;DR

The paper introduces a method that strengthens visual latent reasoning in multimodal large language models (MLLMs) by optimizing latent representations at inference time, counteracting the training-time suppression of the latents' contribution to the final answer.

Contribution

It proposes a two-stage inference-time latent optimization approach that improves reasoning capacity without updating model parameters.
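The core idea of optimizing a latent while the model stays frozen can be illustrated with a toy sketch. This is not the paper's procedure: the frozen matrix `FROZEN_W`, the squared-error loss, and the function names below are all hypothetical stand-ins, chosen only to show gradient updates flowing into a latent vector while the "backbone" weights are never touched.

```python
# Toy illustration (assumed setup, not the paper's method): a frozen linear
# map plays the role of the backbone, and only the latent vector z is updated.

def matvec(W, z):
    """Apply the frozen map W to the latent z."""
    return [sum(w * x for w, x in zip(row, z)) for row in W]

def loss_and_grad(W, z, target):
    """Squared error ||W z - target||^2 and its gradient with respect to z."""
    residual = [p - t for p, t in zip(matvec(W, z), target)]
    loss = sum(r * r for r in residual)
    # d/dz ||W z - t||^2 = 2 * W^T (W z - t)
    grad = [
        2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
        for j in range(len(z))
    ]
    return loss, grad

def optimize_latent(W, z_init, target, steps=200, lr=0.05):
    """Gradient descent on z only; W is read but never written."""
    z = list(z_init)
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(W, z, target)
        history.append(loss)
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    final_loss, _ = loss_and_grad(W, z, target)
    history.append(final_loss)
    return z, history

# Hypothetical frozen "backbone" weights (2 outputs, 3 latent dimensions).
FROZEN_W = [[1.0, 0.5, -0.3], [0.2, -1.0, 0.8]]

if __name__ == "__main__":
    z, history = optimize_latent(FROZEN_W, [0.0, 0.0, 0.0], [1.0, -1.0])
    print("initial loss:", history[0])
    print("final loss:", history[-1])
```

In a real MLLM the loss would come from the model's own forward pass and the gradient from automatic differentiation, but the division of labor is the same: parameters are held fixed and only the latent is adjusted.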

Findings

01. Latent optimization significantly improves reasoning across eight benchmarks.

02. Disentangling objectives prevents latent collapse and enhances semantic quality.

03. Inference-time optimization unlocks latent reasoning without additional training.

Abstract

Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in existing latent visual reasoning methods: although visual latents become semantically enriched during training, their contribution to final answer prediction is systematically suppressed. Within the shared parameter space, the autoregressive objective favors shortcut reliance on direct visual input, driving latent tokens toward transition-like states rather than informative reasoning content. We term this phenomenon Silenced Visual Latents. To address it, we disentangle the two conflicting objectives by directly optimizing the latent reasoning at inference time, keeping backbone parameters frozen. In Stage I,…
