From Waveforms to Pixels: A Survey on Audio-Visual Segmentation
Jia Li, Yapeng Tian

TL;DR
This survey reviews Audio-Visual Segmentation (AVS), the task of identifying and segmenting sound-producing objects in videos by combining audio and visual data. It covers methodologies, benchmarks, challenges, and future directions.
Contribution
The survey provides an extensive overview of AVS approaches, benchmarks, and challenges, and proposes future research directions.
Findings
Comparison of AVS methods across benchmarks
Impact of architectural choices and fusion strategies
Identification of current challenges and future directions
Abstract
Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained object-level understanding. In this survey, we present a comprehensive overview of the AVS field, covering its problem formulation, benchmark datasets, evaluation metrics, and the progression of methodologies. We analyze a wide range of approaches, including architectures for unimodal and multimodal encoding, key strategies for audio-visual fusion, and various decoder designs. Furthermore, we examine major training paradigms, from fully supervised learning to weakly supervised and training-free methods. Notably, we provide an extensive comparison of AVS methods across standard benchmarks, highlighting the impact of different architectural choices, fusion…
