AV-Unified: A Unified Framework for Audio-visual Scene Understanding

Guangyao Li; Xin Wang; Wenwu Zhu

arXiv:2603.06530·cs.CV·March 9, 2026

TL;DR

AV-Unified introduces a comprehensive framework that jointly learns multiple audio-visual scene understanding tasks by standardizing inputs and outputs, capturing multi-scale cues, and modeling audio-visual associations for improved scene comprehension.

Contribution

The paper presents a novel unified architecture that integrates diverse audio-visual tasks into a single model with shared representations and multi-scale perception modules.

Findings

1. Effective across various benchmark datasets
2. Improves performance on temporal, spatial, and spatiotemporal tasks
3. Demonstrates versatility and robustness of the unified approach

Abstract

When humans perceive the world, they naturally integrate multiple audio-visual tasks within dynamic, real-world scenes. However, tasks such as event localization, parsing, segmentation, and question answering are mostly studied in isolation, which makes it difficult to understand complex audio-visual scenes comprehensively or to explore inter-task relationships. Hence, we propose AV-Unified, a unified framework that enables joint learning across a wide range of audio-visual scene understanding tasks. AV-Unified standardizes the diverse input-output formats of each task and incorporates a multi-scale spatiotemporal perception network to effectively capture audio-visual associations. Specifically, we unify the inputs and outputs of all supported tasks by converting them into sequences of discrete tokens, establishing a shared representation that allows a single architecture to…
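
To make the idea of a shared discrete-token output format concrete, here is a minimal sketch of how two different task outputs (timed events and a free-text answer) could be serialized into one token vocabulary. It is not the paper's tokenizer: the token spellings (`<task:...>`, `<t_k>`, `<cls:...>`, `<eos>`), the bin count `NUM_TIME_BINS`, and the functions `time_to_token`, `encode_events`, and `encode_answer` are all hypothetical names chosen for illustration.

```python
from typing import List, Tuple

NUM_TIME_BINS = 10  # hypothetical: number of discrete time bins per clip


def time_to_token(t: float, duration: float, num_bins: int = NUM_TIME_BINS) -> str:
    """Quantize a timestamp in [0, duration] into one of num_bins time tokens."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    frac = min(max(t / duration, 0.0), 1.0)        # clamp to [0, 1]
    idx = min(int(frac * num_bins), num_bins - 1)  # the clip end maps to the last bin
    return f"<t_{idx}>"


def encode_events(
    task: str, events: List[Tuple[float, float, str]], duration: float
) -> List[str]:
    """Serialize (start, end, label) events as a flat token sequence."""
    tokens = [f"<task:{task}>"]  # hypothetical task-prefix token
    for start, end, label in events:
        tokens += [
            time_to_token(start, duration),
            time_to_token(end, duration),
            f"<cls:{label}>",
        ]
    tokens.append("<eos>")
    return tokens


def encode_answer(task: str, answer: str) -> List[str]:
    """Serialize a free-text answer with the same prefix and terminator."""
    return [f"<task:{task}>", *answer.lower().split(), "<eos>"]


if __name__ == "__main__":
    print(encode_events("localization", [(2.0, 5.0, "dog_bark")], duration=10.0))
    print(encode_answer("qa", "Two guitars"))
```

Because both encoders emit sequences over the same kind of vocabulary, a single sequence decoder could in principle be trained on either target; how AV-Unified actually defines its tokens is not stated in the excerpt above.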

Taxonomy

Topics: Speech and Audio Processing · Multimodal Machine Learning Applications · Music and Audio Processing