Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models
Tan-Hanh Pham, Chris Ngo

TL;DR
This paper introduces Multimodal Chain of Continuous Thought (MCOUT), a novel reasoning approach in a joint latent space for vision-language models, improving multimodal reasoning accuracy and alignment over traditional language-based methods.
Contribution
The paper proposes MCOUT, a new continuous latent space reasoning paradigm for multimodal models, with two variants enhancing cross-modal alignment and reasoning performance.
Findings
MCOUT improves accuracy by up to 8.23% on benchmarks.
MCOUT enhances BLEU scores by up to 8.27%.
Latent continuous reasoning outperforms traditional language-based methods.
Abstract
Many reasoning techniques for large multimodal models adapt language model approaches, such as Chain-of-Thought (CoT) prompting, which express reasoning as word sequences. While effective for text, these methods are suboptimal for multimodal contexts, struggling to align audio, visual, and textual information dynamically. To explore an alternative paradigm, we propose the Multimodal Chain of Continuous Thought (MCOUT), which enables reasoning directly in a joint latent space rather than in natural language. In MCOUT, the reasoning state is represented as a continuous hidden vector, iteratively refined and aligned with visual and textual embeddings, inspired by human reflective cognition. We develop two variants: MCOUT-Base, which reuses the language model's last hidden state as the continuous thought for iterative reasoning, and MCOUT-Multi, which integrates multimodal latent attention…
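The feedback loop the abstract describes for MCOUT-Base (reuse the last hidden state as a continuous thought, then reason again) can be illustrated with a toy sketch. This is not the authors' implementation: `toy_last_hidden_state`, `continuous_thought_loop`, the mean-pooling "model", and the random weights are all hypothetical stand-ins, and a real system would run a vision-language model forward pass instead.

```python
import math
import random


def toy_last_hidden_state(sequence, weights):
    """Hypothetical stand-in for a language model forward pass.

    Mean-pools the input embeddings, applies a fixed linear map, then tanh.
    A real model would return the final-layer hidden state at the last position.
    """
    dim = len(sequence[0])
    pooled = [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]
    return [
        math.tanh(sum(weights[r][c] * pooled[c] for c in range(dim)))
        for r in range(dim)
    ]


def continuous_thought_loop(input_embeddings, weights, num_thoughts):
    """Append `num_thoughts` continuous thoughts to the input sequence.

    Each thought is the (toy) last hidden state, fed back as an input
    embedding without being decoded into a token.
    """
    sequence = list(input_embeddings)
    thoughts = []
    for _ in range(num_thoughts):
        thought = toy_last_hidden_state(sequence, weights)
        thoughts.append(thought)
        sequence.append(thought)  # the thought becomes part of the next input
    return sequence, thoughts


random.seed(0)
dim = 4  # assumed toy embedding size
weights = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
inputs = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]

sequence, thoughts = continuous_thought_loop(inputs, weights, num_thoughts=5)
print(len(sequence), len(thoughts))  # 3 inputs + 5 thoughts, and 5 thoughts
```

The point of the sketch is only the control flow: the sequence grows by one vector per reasoning step, and no intermediate text is generated between steps.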
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. The paper articulates a compelling case for continuous latent reasoning over token-based CoT. By generating N_t=5-10 hidden state vectors instead of 50-100 reasoning tokens, the method achieves computational efficiency while maintaining interpretable iteration counts. 2. The paper is generally well written with a clear motivation, good writing, and informative figures. 3. Section 4.3's analysis of modality collapse in MCOUT-Multi is commendable. The authors identify low-variance mixed embed
1. **MCOUT-Base is essentially COCONUT applied to VLMs**: The paper positions MCOUT-Base as a novel contribution, but it appears methodologically identical to COCONUT (Hao et al., 2024) - taking the last hidden state and feeding it back iteratively. The only difference is applying it to a VLM architecture (which is just an LLM with a vision encoder). This should be acknowledged more directly. The true novel contribution is MCOUT-Multi, but...
The work makes a pioneering effort to shift reasoning for multimodal tasks from discrete natural language (traditional CoT) to a continuous latent space, representing an innovative exploration of latent-space reasoning in multimodal scenarios.
1. Limited Novelty and Inadequate Comparison: The core method largely transfers latent CoT from COCONUT to a multimodal context, lacking sufficient novelty. Additionally, it fails to adequately compare against relevant existing methods (e.g., Mirage) and omits key baselines (e.g., COCONUT adapted to multimodal scenarios), making its distinct advantages unclear. 2. Ambiguity in Method and Presentation: The paper lacks clarity: the introduction fails to clearly articulate specific contributions, a
1. The paper introduces a novel paradigm for vision language model reasoning, shifting from discrete natural language sequences to iterative refinements within a continuous latent space. This approach effectively challenges the inherent limitations of language centric Chain of Thought methods, presenting a promising avenue for more intrinsic and efficient reasoning. 2. MCOUT achieves substantial performance improvements on several multimodal benchmarks using only a one billion parameter vision
1. The paper proposes multimodal latent attention to enhance cross-modal alignment. However, the conducted latent space analysis reveals that the mixed embeddings exhibit remarkably low variance across iterations. This observation strongly suggests that the multimodal attention mechanism fails to extract and integrate discriminative visual context effectively, leading to MCOUT-Multi’s performance being almost indistinguishable from MCOUT-Base. Consequently, the claimed advantage in dynamic multi
N/A
* The comparison of results is very unclear. For example, MCOUT-Base (Nt = 10) is written to have "up 4.79%" with 58.86% accuracy, while the baseline has 56.17%. It is unclear what the method is comparing against. In addition, the comparison with other baseline models is unfair because all the MCOUT models are finetuned on the benchmarks. * It is unclear what the purpose of the Multimodal Latent Attention module is. The discussion in Section 4.3 introduces many statistical results but are largely
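As a quick check on the figures quoted in the review above, the arithmetic below compares the absolute and relative gains implied by 56.17 and 58.86. The function name `relative_gain_percent` is hypothetical, and the reading that "up 4.79%" denotes a relative gain over the 56.17 baseline is an inference from the numbers, not something the review or the paper excerpt confirms.

```python
def relative_gain_percent(baseline, result):
    """Relative improvement of `result` over `baseline`, in percent."""
    return (result - baseline) / baseline * 100.0


baseline_acc = 56.17  # baseline accuracy quoted in the review
mcout_acc = 58.86     # MCOUT-Base (Nt = 10) accuracy quoted in the review

absolute_gain = mcout_acc - baseline_acc  # percentage points
relative_gain = relative_gain_percent(baseline_acc, mcout_acc)

print(round(absolute_gain, 2), round(relative_gain, 2))  # 2.69 4.79
```

Under this reading the quoted "4.79%" matches the relative gain, while the absolute gain is 2.69 percentage points.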
Taxonomy
Topics: Multimodal Machine Learning Applications · Semantic Web and Ontologies · Constraint Satisfaction and Optimization
