What's Holding Back Latent Visual Reasoning?
André G. Viveiros, Nuno Gonçalves, André F. T. Martins, Matthias Lindemann

TL;DR
This paper investigates why latent visual reasoning models often ignore their intermediate latent tokens, and traces the problem to two causes: training datasets whose intermediate steps add little information, and low-quality latent tokens predicted at inference time.
Contribution
The study shows that current models make little effective use of latent tokens, attributes this to dataset limitations and latent token quality, and provides insights for future improvements.
Findings
Latent tokens are often ignored because they add limited information in existing datasets.
Models can rely on latent tokens when trained on diagnostic datasets with informative intermediate steps.
Inference-time latent tokens tend to collapse, reducing their usefulness for reasoning.
Abstract
Humans can approach complex visual problems by mentally simulating intermediate visual steps, rather than reasoning through language alone. Inspired by this, several works on Vision-Language Models have recently explored chain-of-thought reasoning with continuous latent tokens as intermediate visual imagination steps. In this work, we investigate how recent models leverage such latent tokens. Surprisingly, we find that model accuracy is unaffected when latent tokens are replaced by uninformative dummy tokens. This indicates that latent tokens play a minimal causal role in the model's final prediction. To better understand this phenomenon, we analyze both the training signal provided by oracle latent representations and the quality of the latent tokens generated at inference time. Our experiments reveal two crucial issues holding back latent visual reasoning: First, in most existing…
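The dummy-token ablation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the names `latent_ablation_gap`, `dummy_latents`, `predict`, and `generate_latents` are hypothetical, the dummy tokens are taken to be all-zero vectors, and the "models" are toy functions rather than Vision-Language Models. The idea is only to show the comparison: evaluate once with the model's own latent tokens, once with uninformative ones, and look at the accuracy gap.

```python
from typing import Callable, List, Sequence, Tuple

Latents = List[List[float]]  # a sequence of latent token vectors


def dummy_latents(num_tokens: int, dim: int) -> Latents:
    """Uninformative latent tokens (assumption: all-zero vectors)."""
    return [[0.0] * dim for _ in range(num_tokens)]


def accuracy(
    predict: Callable[[int, Latents], int],
    examples: Sequence[Tuple[int, int]],
    latents_fn: Callable[[int], Latents],
) -> float:
    """Fraction of examples answered correctly with the given latents."""
    correct = sum(
        1 for inputs, label in examples
        if predict(inputs, latents_fn(inputs)) == label
    )
    return correct / len(examples)


def latent_ablation_gap(
    predict: Callable[[int, Latents], int],
    generate_latents: Callable[[int], Latents],
    examples: Sequence[Tuple[int, int]],
    num_tokens: int,
    dim: int,
) -> Tuple[float, float, float]:
    """Return (accuracy with real latents, accuracy with dummies, gap)."""
    acc_real = accuracy(predict, examples, generate_latents)
    acc_dummy = accuracy(
        predict, examples, lambda _inputs: dummy_latents(num_tokens, dim)
    )
    return acc_real, acc_dummy, acc_real - acc_dummy


# Toy task: the label is the parity of an integer input.
examples = [(0, 0), (1, 1), (2, 0), (3, 1)]

# Toy latent generator: a single 1-d latent token that encodes the parity.
def generate_latents(x: int) -> Latents:
    return [[float(x % 2)]]

# A toy "model" that ignores its latents and answers from the input alone.
def predict_ignoring_latents(x: int, latents: Latents) -> int:
    return x % 2

# A toy "model" that reads its answer off the latent token.
def predict_using_latents(x: int, latents: Latents) -> int:
    return int(latents[0][0])

gap_ignoring = latent_ablation_gap(
    predict_ignoring_latents, generate_latents, examples, 1, 1
)
gap_using = latent_ablation_gap(
    predict_using_latents, generate_latents, examples, 1, 1
)
```

In this toy setup, the model that ignores its latents shows no accuracy gap, while the model that depends on them drops to chance under dummy tokens. A gap near zero is the pattern the abstract reports for recent models, which is what motivates the conclusion that latent tokens play a minimal causal role.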
