Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

Yudong Han; Yong Wang; Zaiquan Yang; Zhen Qu; Liyuan Pan; Xiangxiang Chu

arXiv:2604.10500 · cs.CV · May 13, 2026

TL;DR

This paper proposes a visual replay module and routing depth scaling for multimodal latent reasoning. Together they aim to strengthen visual perception and give complex latents deeper refinement, and the paper reports state-of-the-art results.

Contribution

The paper proposes a visual replay module and routing depth scaling to counter the under-optimization of visual tokens and to refine complex latents in multimodal latent reasoning models.

Findings

01 Achieves state-of-the-art performance on multiple benchmarks.

02 Provides significant inference speedups over explicit CoT methods.

03 Addresses visual under-optimization and token complexity issues.

Abstract

Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, simultaneously enhancing representation informativeness and reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly smaller gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability constrained by fixed architectural depths. To address these limitations, we propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. The former module…
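The first observation in the abstract compares gradient norms between visual and textual tokens. The sketch below shows one way such a diagnostic could be computed once per-token gradients are available. It is not the authors' procedure: the function name `mean_grad_norm_by_modality`, the boolean modality mask, and the toy gradient values are all assumptions made for illustration, and the numbers are not results from the paper.

```python
import math


def l2_norm(vec):
    """Euclidean norm of one gradient vector."""
    return math.sqrt(sum(v * v for v in vec))


def mean_grad_norm_by_modality(token_grads, is_visual):
    """Average per-token gradient L2 norm, split by modality.

    token_grads: list of per-token gradient vectors (lists of floats).
    is_visual:   list of bools, True where the token is a visual token.
    Both arguments are hypothetical inputs; a real pipeline would read
    the gradients from its training framework.
    """
    if len(token_grads) != len(is_visual):
        raise ValueError("token_grads and is_visual must have equal length")
    sums = {"visual": 0.0, "text": 0.0}
    counts = {"visual": 0, "text": 0}
    for grad, visual in zip(token_grads, is_visual):
        key = "visual" if visual else "text"
        sums[key] += l2_norm(grad)
        counts[key] += 1
    # NaN marks a modality with no tokens in the batch.
    return {
        key: (sums[key] / counts[key] if counts[key] else float("nan"))
        for key in sums
    }


# Toy gradients, invented for illustration only (not measured values).
token_grads = [[0.03, 0.04], [0.0, 0.05], [3.0, 4.0], [0.6, 0.8]]
is_visual = [True, True, False, False]

stats = mean_grad_norm_by_modality(token_grads, is_visual)
print(stats)
```

Tracking these two averages over training steps would show whether the visual group's norm stays well below the textual one, which is the pattern the abstract describes as visual under-optimization.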
