AV-Unified: A Unified Framework for Audio-visual Scene Understanding
Guangyao Li, Xin Wang, Wenwu Zhu

TL;DR
AV-Unified introduces a comprehensive framework that jointly learns multiple audio-visual scene understanding tasks by standardizing inputs and outputs, capturing multi-scale cues, and modeling audio-visual associations for improved scene comprehension.
Contribution
The paper presents a novel unified architecture that integrates diverse audio-visual tasks into a single model with shared representations and multi-scale perception modules.
Findings
Effective across various benchmark datasets
Improves performance on temporal, spatial, and spatiotemporal tasks
Demonstrates versatility and robustness of the unified approach
Abstract
When humans perceive the world, they naturally integrate multiple audio-visual tasks within dynamic, real-world scenes. However, current works such as event localization, parsing, segmentation and question answering are mostly explored individually, making it challenging to comprehensively understand complex audio-visual scenes and explore inter-task relationships. Hence, we propose AV-Unified, a unified framework that enables joint learning across a wide range of audio-visual scene understanding tasks. AV-Unified standardizes the diverse input-output formats of each task and incorporates a multi-scale spatiotemporal perception network to effectively capture audio-visual associations. Specifically, we unify the inputs and outputs of all supported tasks by converting them into sequences of discrete tokens, establishing a shared representation that allows a single architecture to…
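To make the idea of casting task outputs as sequences of discrete tokens concrete, here is a minimal Python sketch for one temporal task (event localization). It is an illustration under stated assumptions, not the paper's tokenizer: the token spellings (`<task:...>`, `<t_k>`, `<cls:...>`, `<eos>`), the 100-bin time quantization, and every function name are hypothetical choices made for this example.

```python
from dataclasses import dataclass

# Hypothetical quantization resolution; the abstract does not state one.
NUM_TIME_BINS = 100


@dataclass(frozen=True)
class Event:
    label: str
    start: float  # seconds
    end: float    # seconds


def time_to_token(t, duration, num_bins=NUM_TIME_BINS):
    """Quantize a timestamp into a discrete time token such as '<t_20>'."""
    clamped = min(max(t, 0.0), duration)
    idx = min(int(clamped * num_bins / duration), num_bins - 1)
    return f"<t_{idx}>"


def encode_events(task, events, duration):
    """Serialize a task name and its events into one flat token sequence."""
    tokens = [f"<task:{task}>"]
    for ev in events:
        tokens.append(time_to_token(ev.start, duration))
        tokens.append(time_to_token(ev.end, duration))
        tokens.append(f"<cls:{ev.label}>")
    tokens.append("<eos>")
    return tokens


def decode_events(tokens, duration, num_bins=NUM_TIME_BINS):
    """Invert encode_events; times come back as the left edge of their bin."""
    task = tokens[0][len("<task:"):-1]
    body = tokens[1:-1]  # drop the task token and '<eos>'
    events = []
    for i in range(0, len(body), 3):
        start_tok, end_tok, cls_tok = body[i:i + 3]
        start = int(start_tok[len("<t_"):-1]) * duration / num_bins
        end = int(end_tok[len("<t_"):-1]) * duration / num_bins
        events.append(Event(cls_tok[len("<cls:"):-1], start, end))
    return task, events


tokens = encode_events("localization", [Event("dog_bark", 2.0, 5.0)], duration=10.0)
print(tokens)
```

Because every task's target becomes the same kind of object (a token list), a single sequence decoder could in principle be trained on all of them; a spatial task would need its own token types (for example, quantized coordinates), which this sketch does not cover.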
Taxonomy
Topics: Speech and Audio Processing · Multimodal Machine Learning Applications · Music and Audio Processing
