3D Audio-Visual Segmentation

Artem Sokolov; Swapnil Bhosale; Xiatian Zhu

arXiv:2411.02236 · cs.CV · October 22, 2025

TL;DR

This paper introduces 3D Audio-Visual Segmentation, a new task that extends AVS to a 3D output space, together with a simulation-based benchmark and a model that segments sounding objects in 3D scenes for embodied AI applications.

Contribution

The paper presents the first 3D AVS benchmark and a new model, EchoSegnet, which integrates pretrained foundation models to segment sounding objects in 3D.

Findings

01. EchoSegnet outperforms existing methods on the new benchmark.

02. The benchmark includes 34 scenes and 7 object categories with photorealistic 3D annotations.

03. The approach demonstrates robustness to occlusions and acoustic variations.

Abstract

Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR. To that end, Audio-Visual Segmentation (AVS), which takes an audio signal as a condition to identify the masks of the target sounding objects in an input image captured with synchronous camera and microphone sensors, has recently been advanced. However, this paradigm is still insufficient for real-world operation, as the mapping from 2D images to 3D scenes is missing. To address this fundamental limitation, we introduce a novel research problem, 3D Audio-Visual Segmentation, extending the existing AVS to the 3D output space. This problem poses more challenges due to variations in camera extrinsics, audio scattering, occlusions, and diverse acoustics across sounding object categories. To facilitate this research, we create the very first simulation-based benchmark,…

Taxonomy

Topics: Music and Audio Processing · 3D Surveying and Cultural Heritage · Digital Media Forensic Detection