Interleaved Latent Visual Reasoning with Selective Perceptual Modeling

Shuai Dong; Siyuan Wang; Xingyu Liu; Chenglin Li; Haowen Hou; Zhongyu Wei

arXiv:2512.05665·cs.CL·January 22, 2026

Open Access

TL;DR

ILVR is a multimodal reasoning framework that interleaves text generation with evolving latent visual states while keeping perceptual modeling precise. It aims to improve reasoning performance while avoiding the cost of re-encoding pixel-dense images.

Contribution

ILVR unifies latent visual reasoning with interleaved textual generation. A self-supervised strategy, in which a momentum teacher model selectively distills relevant features, supplies the latent visual cues used in subsequent reasoning steps.

Findings

01. ILVR outperforms existing methods on multimodal reasoning benchmarks.

02. The approach effectively balances perceptual detail and computational efficiency.

03. Extensive experiments validate the superiority of ILVR in dynamic visual reasoning.

Abstract

Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet faces limitations: methods either fail to capture intermediate state evolution due to single-step, non-interleaved structures, or sacrifice precise perceptual modeling by over-compressing features. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. Specifically, we employ a self-supervision strategy where a momentum teacher model selectively distills relevant features from ground-truth intermediate…
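The abstract is cut off before it describes the distillation procedure, so the following is only a minimal sketch of the two generic ingredients it names: a momentum (exponential moving average) teacher, and selective feature distillation. The selection rule (top-k by cosine similarity to a query vector), the loss (one minus cosine similarity), and every function name and parameter here are assumptions made for illustration. They are not taken from the paper.

```python
import math


def ema_update(teacher, student, momentum=0.99):
    """Momentum update: teacher <- m * teacher + (1 - m) * student.

    Weights are plain lists of floats here; a real model would apply
    the same rule to every parameter tensor.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def cosine(a, b):
    """Cosine similarity between two vectors, with a small epsilon for stability."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def select_top_k(features, query, k):
    """ASSUMED selection rule: keep the k teacher features most similar to the query."""
    ranked = sorted(features, key=lambda f: cosine(f, query), reverse=True)
    return ranked[:k]


def distill_loss(student_latents, teacher_targets):
    """ASSUMED loss: mean (1 - cosine similarity) over paired latents and targets."""
    pairs = list(zip(student_latents, teacher_targets))
    return sum(1.0 - cosine(s, t) for s, t in pairs) / len(pairs)


# Toy usage with 2-dimensional "features".
teacher_weights = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
teacher_features = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
targets = select_top_k(teacher_features, query=[1.0, 0.0], k=2)
loss = distill_loss([[1.0, 0.0], [1.0, 0.0]], targets)
```

In this toy setup the student is trained only against the selected subset of teacher features, which is one way to read "selectively distills"; the actual criterion used by ILVR may differ.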

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling