Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
Chenfeng Wang, Wei He, Xuhan Zhu, Chunpeng Zhou, Qizhen Li, Song Yan, Yufei Zheng, Chengjun Yu, Fan Lu, Wei Zhai, Yang Cao, Pengfei Yu, Zheng-Jun Zha

TL;DR
This paper shows that existing latent visual reasoning methods degrade as the latent sequence grows longer, and proposes SCOLAR, an approach that extends latent reasoning sequence length and improves performance on reasoning benchmarks.
Contribution
The paper introduces SCOLAR, a lightweight detransformer leveraging full-sequence hidden states to enhance latent reasoning length and accuracy in vision-language models.
Findings
SCOLAR extends latent reasoning sequences by over 30 times.
SCOLAR achieves a +14.12% improvement over the baseline on reasoning benchmarks.
SCOLAR demonstrates strong out-of-distribution generalization.
Abstract
In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We reveal the root cause: Information Gain Collapse -- autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further identify that heavily pooled image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that leverages the LLM's full-sequence hidden states to…
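The intuition behind Information Gain Collapse can be illustrated with a toy example. The sketch below is not the paper's metric or method: the function names `novelty_per_step` and `toy_sequence`, the `dependence` parameter, and the residual-norm "novelty" score are all assumptions made for illustration. It generates a sequence of vectors in which each step is a scaled copy of the previous one plus noise, then measures how much of each vector cannot be explained by the vectors before it.

```python
import numpy as np


def novelty_per_step(latents: np.ndarray) -> np.ndarray:
    """Fraction of each vector's norm NOT explained by earlier vectors.

    Illustrative proxy only (hypothetical, not the paper's definition):
    1.0 means the step is entirely new, 0.0 means it is fully redundant.
    """
    latents = np.asarray(latents, dtype=float)
    scores = np.ones(len(latents))  # the first step has nothing before it
    for t in range(1, len(latents)):
        prev = latents[:t].T  # shape (dim, t): earlier steps as columns
        # Least-squares projection of step t onto the span of earlier steps.
        coeffs, *_ = np.linalg.lstsq(prev, latents[t], rcond=None)
        residual = latents[t] - prev @ coeffs
        norm = np.linalg.norm(latents[t])
        scores[t] = np.linalg.norm(residual) / norm if norm > 0 else 0.0
    return scores


def toy_sequence(steps: int, dim: int, dependence: float, seed: int = 0) -> np.ndarray:
    """Toy autoregressive sequence: x_t = a * x_{t-1} + sqrt(1 - a^2) * noise.

    `dependence` (a, assumed in [0, 1]) controls how strongly each step
    relies on the previous output.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)
    out = [x]
    for _ in range(steps - 1):
        noise = rng.standard_normal(dim)
        x = dependence * x + np.sqrt(1.0 - dependence**2) * noise
        out.append(x)
    return np.stack(out)


if __name__ == "__main__":
    independent = novelty_per_step(toy_sequence(steps=8, dim=64, dependence=0.0))
    dependent = novelty_per_step(toy_sequence(steps=8, dim=64, dependence=0.99))
    print("novelty, independent steps:", np.round(independent, 2))
    print("novelty, highly dependent steps:", np.round(dependent, 2))
```

In this toy setting, the strongly dependent sequence shows much lower novelty after the first step than the independent one, which mirrors the abstract's claim that later tokens can barely introduce new information when each step relies heavily on prior outputs.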
