Latent Chain-of-Thought for Visual Reasoning

Guohao Sun; Hang Hua; Jian Wang; Jiebo Luo; Sohail Dianat; Majid Rabbani; Raghuveer Rao; Zhiqiang Tao

arXiv:2510.23925 · cs.AI · October 31, 2025

TL;DR

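This paper introduces a latent chain-of-thought approach to visual reasoning in large vision-language models (LVLMs). It treats reasoning as posterior inference, trains with amortized variational inference and diversity-seeking reinforcement learning, and uses a marginal-likelihood-based inference-scaling strategy in place of Best-of-N and Beam Search, with the aim of improving reasoning accuracy and interpretability.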

Contribution

The paper reformulates reasoning in large vision-language models (LVLMs) as posterior inference and proposes a scalable training algorithm based on amortized variational inference, combined with diversity-seeking RL, to improve generalization and interpretability.

Findings

01. Improves performance on seven reasoning benchmarks

02. Enhances model generalization across unseen tasks

03. Increases interpretability of reasoning processes

Abstract

Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank…
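The idea of scoring with a marginal likelihood rather than with a reward model can be illustrated with a toy sketch. It is not the paper's estimator: it assumes that, for each candidate answer, we already have log-likelihoods log p(answer | input, rationale) under a handful of sampled latent rationales, and it approximates the marginal by a plain Monte Carlo average over those samples. The function names and the numbers in `toy` are hypothetical.

```python
import math


def log_mean_exp(log_values):
    """Numerically stable log of the mean of exp(log_values)."""
    m = max(log_values)
    total = sum(math.exp(v - m) for v in log_values)
    return m + math.log(total / len(log_values))


def rank_by_marginal_likelihood(candidates):
    """Rank answers by an estimated log marginal likelihood.

    `candidates` maps each answer to a list of log p(answer | input, z_k),
    one entry per sampled rationale z_k (assumed to be given).
    """
    scores = {answer: log_mean_exp(lls) for answer, lls in candidates.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Hypothetical log-likelihoods for two candidate answers, three rationales each.
toy = {
    "A": [-0.2, -0.4, -3.0],
    "B": [-1.5, -1.6, -1.4],
}
ranking, scores = rank_by_marginal_likelihood(toy)
print(ranking)
```

Averaging in probability space (via log-mean-exp) rather than averaging the log-likelihoods directly is what makes the score an estimate of a marginal: an answer supported strongly by some rationales is not penalized as heavily for one poor rationale.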
