UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
Houcheng Jiang, Jiajun Fu, Junfeng Fang, Chen Gao, Xiang Wang, Xiangnan He, Yong Li

TL;DR
UniVLR is a unified visual latent reasoning framework that merges textual and visual reasoning into a shared visual workspace, improving the efficiency and performance of multimodal large language models.
Contribution
The paper proposes a unified visual latent reasoning approach. Rather than keeping text chain-of-thought as a separate inference-time path, it renders reasoning traces together with auxiliary images and learns to compress them into compact visual latent tokens.
Findings
Outperforms prior visual latent reasoning methods on perception and visual reasoning tasks.
Uses fewer reasoning tokens while maintaining or improving accuracy.
At inference time, reasoning runs only through visual latents, with no external tool calls or verbose text reasoning.
Abstract
Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reasoning fragmented across separate text and vision channels. We propose UniVLR, a unified visual latent reasoning framework that treats textual reasoning and auxiliary visual evidence as a shared visual workspace. Instead of preserving text CoT as an independent inference-time path, UniVLR renders reasoning traces together with auxiliary images and learns to compress this unified representation into compact visual latent tokens. At inference time, the model reasons only through visual latents and directly decodes the final answer, avoiding both external tool calls and verbose text reasoning. Experiments on…
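The pipeline the abstract describes (render the reasoning trace next to auxiliary images, then compress the result into a few latent tokens) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names `render_text`, `build_workspace`, and `compress` are hypothetical, the bit-column rasterizer stands in for real text rendering, and mean pooling stands in for the learned compressor, whose actual form the abstract does not specify.

```python
from typing import List

Image = List[List[float]]  # row-major grayscale grid, values in [0, 1]


def render_text(trace: str, width: int) -> Image:
    """Toy rasterizer (assumption): each character becomes an 8-pixel column of its low bits."""
    cols = [[float((ord(ch) >> bit) & 1) for bit in range(8)] for ch in trace]
    rows: Image = []
    # Wrap characters into text lines of `width` columns, zero-padding the last line.
    for start in range(0, len(cols), width):
        chunk = cols[start:start + width]
        chunk += [[0.0] * 8] * (width - len(chunk))
        for bit in range(8):
            rows.append([col[bit] for col in chunk])
    return rows


def build_workspace(trace: str, aux: Image) -> Image:
    """Stack the rendered trace above the auxiliary image on one shared canvas."""
    width = len(aux[0])
    return render_text(trace, width) + [row[:] for row in aux]


def compress(workspace: Image, num_latents: int) -> List[List[float]]:
    """Stand-in for a learned compressor: mean-pool horizontal bands into latent vectors."""
    height = len(workspace)
    latents = []
    for k in range(num_latents):
        lo = k * height // num_latents
        hi = max(lo + 1, (k + 1) * height // num_latents)
        band = workspace[lo:hi]
        latents.append([sum(col) / len(band) for col in zip(*band)])
    return latents


# Usage: a 3-character trace plus a 4x4 auxiliary image, compressed to 3 latents.
aux = [[0.5] * 4 for _ in range(4)]
workspace = build_workspace("x>3", aux)
latents = compress(workspace, num_latents=3)
```

The point of the sketch is only the data flow: text and image evidence share one canvas, and the number of latent vectors is fixed in advance regardless of how long the trace is. In the paper's setting the compressor is learned and the latents feed the model's answer decoding, neither of which this toy version attempts.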
