Vision-aligned Latent Reasoning for Multi-modal Large Language Model
Byungwoo Jeon, Yoonwoo Jeong, Hyunseok Lee, Minsu Cho, Jinwoo Shin

TL;DR
This paper introduces VaLR, a reasoning framework that aligns visual information in latent space to improve multi-modal large language models' performance on complex, long-context tasks.
Contribution
VaLR dynamically generates vision-aligned latent tokens to enhance reasoning, addressing visual information dilution in existing multi-modal models.
Findings
VaLR outperforms existing methods on long-context understanding benchmarks.
Performance on VSI-Bench improves from 33.0% to 52.9%.
Achieves a gain of 19.9 percentage points over Qwen2.5-VL on VSI-Bench.
Abstract
Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems that require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling. To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain of Thought reasoning step, guiding the model to reason based on perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of the MLLM with those from vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across…
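The abstract says VaLR aligns intermediate MLLM embeddings with vision-encoder embeddings but, as quoted here, does not give the objective. The sketch below is one plausible form of such an alignment term, not the authors' loss: a mean cosine distance between linearly projected latent-token states and target vision features. The function name, the linear projection, the array shapes, and the choice of cosine distance are all assumptions made for illustration.

```python
import numpy as np


def cosine_alignment_loss(latent_states, vision_features, projection, eps=1e-8):
    """Mean (1 - cosine similarity) between projected latent states and vision features.

    All arguments are hypothetical stand-ins, not names from the paper:
      latent_states:   (num_tokens, d_model) hidden states of the latent tokens
      vision_features: (num_tokens, d_vision) target features from a vision encoder
      projection:      (d_model, d_vision) linear map into the vision feature space
    """
    # Map MLLM hidden states into the vision encoder's feature space.
    projected = latent_states @ projection

    # Normalise each row to unit length (eps guards against division by zero).
    projected_unit = projected / (np.linalg.norm(projected, axis=-1, keepdims=True) + eps)
    target_unit = vision_features / (np.linalg.norm(vision_features, axis=-1, keepdims=True) + eps)

    # Row-wise cosine similarity, then average the distance over tokens.
    cosine = np.sum(projected_unit * target_unit, axis=-1)
    return float(np.mean(1.0 - cosine))
```

Under these assumptions the term is near 0 when the projected latent states point in the same direction as the vision features and near 2 when they point in opposite directions. In training it would presumably be added to the usual next-token loss with some weighting, which the summary does not specify.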
