LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu, Bozhou Li, Yuran Wang, Chengzhuo Tong, Hao Liang, Xiaochen Ma, Junbo Niu, Tianyu Guo, Yang Shi, Yue Ding, Yiyan Ji, Bingyin Mei, Yushuo Guan, Yuanxing Zhang, Pengfei Wan, Fangcheng Fu, Wentao Zhang

TL;DR
LatentOmni introduces a unified latent space for audio-visual reasoning, enhancing temporal grounding and outperforming existing models in omnimodal understanding tasks.
Contribution
The paper proposes LatentOmni, a novel framework that interleaves textual reasoning with audio-visual latent states, and introduces a new dataset for training and evaluation.
Findings
LatentOmni achieves state-of-the-art performance on multiple benchmarks.
LatentOmni outperforms explicit text chain-of-thought baselines.
The approach preserves dense sensory information for better reasoning.
Abstract
Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses…
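To make the idea of feature-level supervision concrete, the sketch below shows one plausible way to score how well a sequence of latent reasoning states matches a set of target audio-visual features. It is an illustration only: the abstract does not state the actual loss, so the cosine-distance objective and the names `cosine_similarity`, `feature_alignment_loss`, `latent_states`, and `target_features` are all assumptions, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat a zero vector as carrying no directional information.
        return 0.0
    return dot / (norm_a * norm_b)


def feature_alignment_loss(latent_states, target_features):
    """Mean cosine distance between latent states and their target features.

    ASSUMPTION: a cosine-distance objective is used here purely for
    illustration; the loss used by LatentOmni is not given in this excerpt.
    """
    if len(latent_states) != len(target_features):
        raise ValueError("each latent state needs exactly one target feature")
    if not latent_states:
        return 0.0
    total = sum(
        1.0 - cosine_similarity(h, f)
        for h, f in zip(latent_states, target_features)
    )
    return total / len(latent_states)


# Hypothetical toy inputs: two latent steps, each paired with one target feature.
latent_states = [[1.0, 0.0], [0.0, 2.0]]
target_features = [[2.0, 0.0], [0.0, 1.0]]
loss = feature_alignment_loss(latent_states, target_features)
```

In this toy case each latent state points in the same direction as its target, so the loss is zero up to floating-point error; orthogonal pairs would give a loss of 1 and opposite pairs a loss of 2. In a real training setup, a term of this kind would typically be added to the standard next-token objective, though how the terms are weighted is not described here.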
