Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

Chengzhi Liu; Yuzhe Yang; Yue Fan; Qingyue Wei; Sheng Liu; Xin Eric Wang

arXiv:2512.12623 · cs.CV · April 10, 2026

TL;DR

This paper introduces DMLR, a novel framework that dynamically interleaves reasoning and perception in latent space, improving multimodal reasoning efficiency and accuracy without relying on explicit step-by-step processes.

Contribution

The paper proposes DMLR, a test-time dynamic latent reasoning framework that combines confidence-guided optimization with visual injection strategies for multimodal reasoning.

Findings

1. DMLR significantly improves reasoning accuracy across seven benchmarks.

2. The approach enhances perception and reasoning performance while maintaining high efficiency.

3. Dynamic visual-textual interleaving outperforms static methods in multimodal tasks.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning and suffer from unstable perception-reasoning interaction and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine…
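The abstract is cut off before it describes the optimization procedure, so the following is not DMLR's algorithm. It is only a toy sketch of the general idea the abstract names: using a confidence signal as the reward for a policy-gradient update on a latent vector at test time. Everything in it is an assumption made for illustration: the linear `head` standing in for a frozen model, max softmax probability as the confidence score, the Gaussian perturbation policy, and all hyperparameters.

```python
import numpy as np


def confidence(logits):
    """Max softmax probability, used here as a stand-in confidence score."""
    shifted = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return float(probs.max())


def refine_latent(z0, head, steps=50, n_samples=32, sigma=0.1, lr=0.5, seed=0):
    """REINFORCE-style ascent on confidence over a latent vector.

    Hypothetical toy setup: `head` is a fixed linear map from the latent
    vector to answer logits; only the latent vector is updated.
    """
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(steps):
        # Sample perturbed latents from a Gaussian policy N(z, sigma^2 I).
        eps = rng.standard_normal((n_samples, z.size))
        rewards = np.array([confidence(head @ (z + sigma * e)) for e in eps])
        # Subtracting the mean reward is a simple variance-reducing baseline.
        advantages = rewards - rewards.mean()
        # Score-function gradient: grad_z log N(z + sigma*eps; z) = eps / sigma.
        grad = (advantages[:, None] * eps).mean(axis=0) / sigma
        z += lr * grad
    return z


# Made-up sizes: 4 candidate answers, 8-dimensional latent.
head = np.random.default_rng(1).standard_normal((4, 8))
z_start = np.zeros(8)                        # uniform logits, lowest confidence
z_final = refine_latent(z_start, head)
conf_before = confidence(head @ z_start)
conf_after = confidence(head @ z_final)
```

The update touches only the latent vector and leaves the weights of `head` unchanged, which is what makes this kind of refinement a test-time procedure rather than training. Starting from the zero vector is a convenience of the toy: it gives uniform logits, so any movement raises the confidence score.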
