LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

Linquan Wu; Tianxiang Jiang; Yifei Dong; Haoyu Yang; Fengji Zhang; Shichaang Meng; Ai Xuan; Linqi Song; Jacky Keung

arXiv:2601.10129·cs.CV·January 16, 2026

TL;DR

LaViT is a distillation framework that aligns latent visual thoughts rather than static embeddings: the student reconstructs the teacher's visual semantics and attention trajectories before generating text, which yields gains of up to +16.9% on complex reasoning tasks.

Contribution

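The paper identifies a Perception Gap in multimodal distillation, where a student mimics the teacher's text while attending to different visual regions. To close it, LaViT aligns latent visual thoughts instead of static embeddings and uses a curriculum sensory gating mechanism to prevent shortcut learning and improve grounding.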

Findings

01. Achieves up to +16.9% gains on reasoning tasks.

02. Enables a 3B model to outperform larger models like GPT-4o.

03. Significantly improves visual grounding in multimodal reasoning.

Abstract

Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B…
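The Perception Gap described above can be illustrated with a small sketch. It is not the paper's metric or loss: it assumes, purely for illustration, that teacher and student attention over image patches are available as lists of non-negative weights, and it uses KL divergence as a stand-in measure of how far the student's attention departs from the teacher's. The names `attention_gap`, `normalize`, and `kl_divergence` are hypothetical.

```python
import math


def normalize(weights):
    """Scale non-negative attention weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    return [w / total for w in weights]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    eps guards against log(0) when the student assigns zero weight
    to a patch the teacher attends to.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def attention_gap(teacher_attn, student_attn):
    """Hypothetical proxy for the Perception Gap (an assumption, not the
    paper's definition): divergence between teacher and student attention
    over the same set of visual patches."""
    if len(teacher_attn) != len(student_attn):
        raise ValueError("attention maps must cover the same patches")
    return kl_divergence(normalize(teacher_attn), normalize(student_attn))


# The teacher focuses on the first patch; one student matches, one does not.
teacher = [0.9, 0.05, 0.05]
aligned_student = [0.9, 0.05, 0.05]
divergent_student = [0.05, 0.05, 0.9]

print(attention_gap(teacher, aligned_student))
print(attention_gap(teacher, divergent_student))
```

Under this proxy, two students could produce the same text while only the divergent one shows a large gap, which is the failure mode the abstract attributes to reliance on language priors.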

Code & Models

Models

Svard/LaViT-3B
model · 14 dl · ♡ 5

Datasets

Svard/LaViT-15k
dataset · 24k dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Advanced Graph Neural Networks