Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Yaoting Wang; Peiwen Sun; Dongzhan Zhou; Guangyao Li; Honggang Zhang; Di Hu

arXiv:2407.10957·cs.CV·July 16, 2024

Open Access

TL;DR

This paper introduces Ref-AVS, a new task and benchmark for segmenting objects in visual scenes based on natural language expressions enriched with audio and visual cues, advancing multimodal perception research.

Contribution

The work presents the first Ref-AVS benchmark with pixel-level annotations and a novel method leveraging multimodal cues for precise object segmentation.

Findings

01

The proposed method is reported to outperform existing approaches on the Ref-AVS benchmark.

02

Ref-AVS benchmark enables new research in multimodal object segmentation.

03

Experiments validate the effectiveness of multimodal cues in segmentation tasks.

Abstract

Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues. Such expressions are articulated in natural language forms but are enriched with multimodal cues, including audio and visual descriptions. To facilitate this research, we construct the first Ref-AVS benchmark, which provides pixel-level annotations for objects described in corresponding multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method that adequately utilizes multimodal cues to offer precise segmentation guidance. Finally, we conduct quantitative and qualitative experiments on…

Taxonomy

Topics: Music and Audio Processing