Leveraging Latent Visual Reasoning in Silence

Dongyao Zhu; Zhen Wang; Xi Xiao; Han Jiang; Saeed Vahidian; Wei-Lun Chao; Tanya Berger-Wolf; Yu Su; Raju Vatsavai; Jianyang Gu

arXiv:2605.18641·cs.CV·May 19, 2026

TL;DR

This paper investigates the role of latent visual reasoning tokens in multimodal reasoning. It shows that these tokens influence learning but are often unnecessary at inference, and it proposes a reward that makes them more useful.

Contribution

The paper introduces an attention-based reward that encourages latent tokens to interact with text, improving reasoning performance without requiring latent tokens to persist at inference.
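To make the idea of an attention-based reward concrete, here is a minimal sketch. It is not the paper's formulation: the function name `latent_attention_reward`, the choice to average the attention mass that text-token queries place on latent-token keys, and the toy attention matrix are all assumptions made for illustration.

```python
from typing import Sequence


def latent_attention_reward(
    attn: Sequence[Sequence[float]],
    latent_idx: Sequence[int],
    text_idx: Sequence[int],
) -> float:
    """Mean attention mass that text-token queries place on latent-token keys.

    attn[q][k] is the attention weight from query position q to key position k.
    Each row is assumed to be post-softmax (it sums to 1).
    ASSUMPTION: this definition is illustrative, not the paper's reward.
    """
    if not latent_idx or not text_idx:
        return 0.0  # nothing to reward without both token types
    total = 0.0
    for q in text_idx:
        # Attention from one text token to all latent tokens.
        total += sum(attn[q][k] for k in latent_idx)
    return total / len(text_idx)


# Toy causal attention over 4 positions: 0 = image, 1-2 = latent, 3 = text.
toy_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.25, 0.5, 0.0],
    [0.125, 0.25, 0.5, 0.125],
]
print(latent_attention_reward(toy_attn, latent_idx=[1, 2], text_idx=[3]))  # 0.75
```

Under this assumed definition, the reward is higher when generated text attends to the latent tokens and is zero when it ignores them, which is one plausible way to encourage interaction between the two.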

Findings

01

Replacing latent tokens with random noise, or removing them entirely, causes little performance degradation on spatial reasoning benchmarks.

02

Reinforcement learning post-training further diminishes latent token generation.

03

The proposed reward improves performance across benchmarks.
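The first finding rests on an ablation that swaps latent tokens for random noise or drops them. A minimal sketch of such an ablation follows. The function name `ablate_latents`, the use of standard Gaussian noise, and the list-of-vectors representation are assumptions for illustration; the source only says "random noise" and does not specify the distribution.

```python
import random
from typing import List, Sequence

Vector = List[float]


def ablate_latents(
    embeddings: Sequence[Vector],
    latent_idx: Sequence[int],
    mode: str = "noise",
    seed: int = 0,
) -> List[Vector]:
    """Return a copy of the sequence with latent positions ablated.

    mode="noise":  replace each latent vector with Gaussian noise
                   (ASSUMPTION: zero mean, unit variance).
    mode="remove": drop latent positions from the sequence.
    """
    if mode not in ("noise", "remove"):
        raise ValueError(f"unknown mode: {mode!r}")
    latent = set(latent_idx)
    rng = random.Random(seed)  # seeded so the ablation is reproducible
    out: List[Vector] = []
    for pos, vec in enumerate(embeddings):
        if pos not in latent:
            out.append(list(vec))  # keep non-latent tokens unchanged
        elif mode == "noise":
            out.append([rng.gauss(0.0, 1.0) for _ in vec])
        # mode == "remove": skip the latent position entirely
    return out


# Toy sequence of three 2-dimensional embeddings; position 1 is a latent token.
sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(ablate_latents(sequence, latent_idx=[1], mode="remove"))
```

Comparing task accuracy on the original sequence against both ablated variants is the kind of check that would show whether the latent tokens carry information the model uses at inference.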

Abstract

Latent visual reasoning involves visual evidence more directly in multimodal reasoning by inserting continuous latent tokens before textual generation. However, the necessity of these latent tokens at inference remains ambiguous. We show that replacing latent tokens with random noise or removing them completely causes little performance degradation across spatial reasoning benchmarks. Reinforcement learning further diminishes the latent generation behavior after post-training. These observations raise a central question: Is latent visual reasoning still meaningful? We argue that its value should be measured by how effectively latent tokens guide learning, rather than whether they persist as an inference-time format. Our analysis shows that latent reasoning is unevenly favorable across question types, yet hard task-level routing for applying latent generation is brittle. Motivated by…

Code & Models

Repositories

ddydyd32/silent-lvr/tree/master
github
