From Waveforms to Pixels: A Survey on Audio-Visual Segmentation

Jia Li; Yapeng Tian

arXiv:2508.03724·cs.CV·August 7, 2025

TL;DR

This survey comprehensively reviews Audio-Visual Segmentation (AVS), covering methodologies, benchmarks, challenges, and future directions for integrating audio and visual data to identify sound-producing objects in videos.

Contribution

It provides an extensive overview of AVS approaches, benchmarks, and challenges, and proposes future research directions to advance the field.

Findings

01 Comparison of AVS methods across benchmarks

02 Impact of architectural choices and fusion strategies

03 Identification of current challenges and future directions

Abstract

Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained object-level understanding. In this survey, we present a comprehensive overview of the AVS field, covering its problem formulation, benchmark datasets, evaluation metrics, and the progression of methodologies. We analyze a wide range of approaches, including architectures for unimodal and multimodal encoding, key strategies for audio-visual fusion, and various decoder designs. Furthermore, we examine major training paradigms, from fully supervised learning to weakly supervised and training-free methods. Notably, we provide an extensive comparison of AVS methods across standard benchmarks, highlighting the impact of different architectural choices, fusion…
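To make the task description concrete, the toy sketch below shows one way an audio embedding could be matched against per-pixel visual features to produce a binary mask of "sounding" pixels, and how such a mask could be scored against a reference. It is an illustration only, not a method taken from the survey: the function names, the cosine-similarity fusion, the 0.5 threshold, and the use of intersection-over-union as the score are all assumptions chosen for simplicity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def segment_sounding_pixels(visual_feats, audio_feat, threshold=0.5):
    """Hypothetical fusion step: mark a pixel 1 when its visual feature is
    similar to the audio embedding. visual_feats is an H x W x C nested list."""
    return [
        [1 if cosine(pixel, audio_feat) > threshold else 0 for pixel in row]
        for row in visual_feats
    ]


def iou(pred, target):
    """Intersection-over-union of two binary masks with the same shape."""
    pairs = [(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    inter = sum(p & t for p, t in pairs)
    union = sum(p | t for p, t in pairs)
    return inter / union if union else 1.0


# Toy 2x2 frame with 2-dimensional features (values are made up).
visual = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9]],
]
audio = [1.0, 0.0]  # assumed audio embedding in the same feature space

mask = segment_sounding_pixels(visual, audio)
reference = [[1, 0], [1, 0]]
score = iou(mask, reference)
```

Real AVS systems replace each piece with learned components (audio and visual encoders, a fusion module, and a mask decoder); the sketch only fixes the input and output shapes of the problem.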
