Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu

TL;DR
This paper introduces Ref-AVS, a new task and benchmark for segmenting objects in visual scenes based on natural language expressions enriched with audio and visual cues, advancing multimodal perception research.
Contribution
The work presents the first Ref-AVS benchmark with pixel-level annotations and a novel method leveraging multimodal cues for precise object segmentation.
Findings
The proposed method outperforms existing approaches in accuracy.
The Ref-AVS benchmark enables new research in multimodal object segmentation.
Experiments validate the effectiveness of multimodal cues in segmentation tasks.
Abstract
Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues. Such expressions are articulated in natural language forms but are enriched with multimodal cues, including audio and visual descriptions. To facilitate this research, we construct the first Ref-AVS benchmark, which provides pixel-level annotations for objects described in corresponding multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method that adequately utilizes multimodal cues to offer precise segmentation guidance. Finally, we conduct quantitative and qualitative experiments on…
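To make the task's inputs and outputs concrete, here is a minimal Python sketch of what one Ref-AVS example might look like and how a predicted mask could be compared with its pixel-level annotation. Everything in it is an assumption made for illustration: the `RefAVSSample` record, its field names, the file paths, and the example expression are hypothetical, and the abstract does not say which metric the authors report. Intersection-over-union (the Jaccard index) is used only because it is a common overlap measure for segmentation masks.

```python
from dataclasses import dataclass
from typing import List

# A binary mask as nested lists: 1 marks a pixel of the referred object.
Mask = List[List[int]]


@dataclass
class RefAVSSample:
    """Hypothetical record layout; field names are illustrative only."""
    frames: List[str]   # paths to the video frames (assumed storage format)
    audio: str          # path to the accompanying audio track
    expression: str     # natural-language expression with audio/visual cues
    masks: List[Mask]   # one pixel-level annotation per frame


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two equally sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly, so treat that case as IoU = 1.
    return 1.0 if union == 0 else inter / union


# Illustrative usage with a made-up 2x2 frame.
sample = RefAVSSample(
    frames=["frame_000.jpg"],
    audio="clip.wav",
    expression="the object making the loudest sound",
    masks=[[[1, 0], [1, 0]]],
)
predicted = [[1, 1], [0, 0]]
score = mask_iou(predicted, sample.masks[0])
```

In this toy case the prediction and the annotation share one pixel and together cover three, so `score` is one third. A real evaluation would average such a score over frames and samples, but the exact protocol should be taken from the paper rather than from this sketch.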
Taxonomy
Topics: Music and Audio Processing
