Audio-Visual Intelligence in Large Foundation Models

You Qin; Kai Liu; Shengqiong Wu; Kai Wang; Shijian Deng; Yapeng Tian; Junbin Xiao; Yazhou Xing; Yinghao Ma; Bobo Li; Roger Zimmermann; Lei Cui; Furu Wei; Jiebo Luo; Hao Fei

arXiv:2605.04045 · cs.CV · May 6, 2026

TL;DR

This survey comprehensively reviews audio-visual intelligence in large foundation models, covering tasks, methods, datasets, and challenges to unify the fragmented research landscape.

Contribution

The survey provides the first unified taxonomy, methodological synthesis, and structured comparison of AVI tasks, datasets, and evaluation practices in the context of large models.

Findings

01. Unified taxonomy of AVI tasks, from understanding to generation

02. Methodological overview, including fusion and generation techniques

03. Identification of open challenges such as synchronization and safety

Abstract

Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that can perceive, generate, and interact in the multimodal real world. In the era of large foundation models, joint modeling of audio and vision has become increasingly crucial, not only for understanding but also for controllable generation and reasoning across dynamic, temporally grounded signals. Recent advances, such as Meta MovieGen and Google Veo-3, highlight the growing industrial and academic focus on unified audio-vision architectures that learn from massive multimodal data. Despite this rapid progress, the literature remains fragmented: it spans diverse tasks, inconsistent taxonomies, and heterogeneous evaluation practices, all of which impede systematic comparison and knowledge integration. This survey provides the first…

Code & Models

Repositories

javisverse/Awesome-AVI (GitHub)
