TL;DR
This paper investigates the role of latent visual reasoning tokens in multimodal tasks, showing they influence learning but are often unnecessary at inference, and proposes a reward to enhance their utility.
Contribution
It introduces an attention-based reward mechanism that encourages latent tokens to interact with text, improving reasoning performance without relying on persistent latent tokens at inference.
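The summary does not say how the reward is computed, so the following is only a minimal sketch of one plausible reading: score a rollout by the share of attention that text positions place on latent positions. The function name `latent_attention_reward`, the single pre-averaged attention matrix, and the index lists are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def latent_attention_reward(
    attn: Sequence[Sequence[float]],
    latent_idx: Sequence[int],
    text_idx: Sequence[int],
) -> float:
    """Mean fraction of each text token's attention that lands on latent tokens.

    attn[q][k] is the attention weight from query position q to key position k.
    ASSUMPTION: heads and layers have already been averaged into one matrix;
    the paper may aggregate attention differently.
    """
    if not latent_idx or not text_idx:
        return 0.0
    total = 0.0
    for q in text_idx:
        row = attn[q]
        row_sum = sum(row)
        if row_sum <= 0:
            continue  # a row with no attention mass contributes nothing
        total += sum(row[k] for k in latent_idx) / row_sum
    return total / len(text_idx)


# Toy causal example: positions 0-1 are latent tokens, 2-3 are text tokens.
toy_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.25, 0.5, 0.0],
    [0.1, 0.1, 0.4, 0.4],
]
reward = latent_attention_reward(toy_attn, latent_idx=[0, 1], text_idx=[2, 3])
```

In this toy case the two text positions place 0.5 and 0.2 of their attention on the latent positions, so the reward is their mean. A real reward would probably also need a task-correctness term, which this sketch omits.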
Findings
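Replacing latent tokens with random noise, or removing them entirely, causes little performance degradation at inference on spatial reasoning benchmarks.
Reinforcement learning during post-training further diminishes latent token generation.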
The proposed reward improves performance across benchmarks.
Abstract
Latent visual reasoning involves visual evidence more directly in multimodal reasoning by inserting continuous latent tokens before textual generation. However, the necessity of these latent tokens at inference remains ambiguous. We show that replacing latent tokens with random noise or removing them completely causes little performance degradation across spatial reasoning benchmarks. Reinforcement learning further diminishes the latent generation behavior after post-training. These observations raise a central question: Is latent visual reasoning still meaningful? We argue that its value should be measured by how effectively latent tokens guide learning, rather than whether they persist as an inference-time format. Our analysis shows that latent reasoning is unevenly favorable across question types, yet hard task-level routing for applying latent generation is brittle. Motivated by…
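The noise-replacement check described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' protocol: the helper `replace_latents_with_noise` is hypothetical, embeddings are plain lists of floats, and the noise is Gaussian matched to the mean and standard deviation of the original latent values (the abstract says only "random noise").

```python
import random
import statistics
from typing import List, Sequence


def replace_latents_with_noise(
    embeddings: Sequence[Sequence[float]],
    latent_idx: Sequence[int],
    seed: int = 0,
) -> List[List[float]]:
    """Return a copy of `embeddings` with latent positions overwritten by noise.

    ASSUMPTION: Gaussian noise matched to the pooled mean/std of the latent
    values; the paper's noise distribution is not given in this summary.
    """
    latent_set = set(latent_idx)
    if not latent_set:
        return [list(vec) for vec in embeddings]
    rng = random.Random(seed)
    values = [v for i in latent_set for v in embeddings[i]]
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    noised = []
    for i, vec in enumerate(embeddings):
        if i in latent_set:
            noised.append([rng.gauss(mu, sigma) for _ in vec])
        else:
            noised.append(list(vec))  # text/image positions stay untouched
    return noised


# Positions 1 and 2 are latent tokens in this toy sequence.
sequence = [[0.1, 0.2], [1.0, -1.0], [0.5, 2.0], [0.3, 0.4]]
ablated = replace_latents_with_noise(sequence, latent_idx=[1, 2])
```

Running the same evaluation on `sequence` and on `ablated` would then show how much accuracy depends on the latent content itself rather than on the mere presence of those positions.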
