Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation

Shaofei Huang; Han Li; Yuqing Wang; Hongji Zhu; Jiao Dai; Jizhong Han; Wenge Rong; Si Liu

arXiv:2309.09501·cs.CV·September 19, 2023

Open Access

TL;DR

This paper introduces AQFormer, a novel audio-queried transformer for audio visual segmentation that explicitly models object-level correspondence and temporal interactions, significantly improving performance on AVS benchmarks.

Contribution

The paper proposes AQFormer, which uses audio-conditioned object queries and an audio-bridged temporal interaction module to improve the accuracy of sounding-object segmentation.

Findings

01 Achieves state-of-the-art results on AVS benchmarks.

02 Improves the M_J and M_F metrics by over 7% on the MS3 setting.

03 Demonstrates the effectiveness of explicit object-level correspondence.
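
This page does not define the reported metrics. On the assumption that M_J follows the usual segmentation convention of a mean Jaccard index (intersection over union between predicted and ground-truth masks, averaged over frames), a minimal sketch of that computation could look like the following. The function names and the choice to score an empty prediction against an empty target as 1.0 are illustrative assumptions, not details taken from the paper.

```python
def jaccard_index(pred, target):
    """Intersection over union of two flattened binary masks (assumed metric)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_jaccard(preds, targets):
    """Average the per-frame Jaccard index over a sequence of frames."""
    scores = [jaccard_index(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Two toy frames, each mask flattened to a list of 0/1 pixels.
frame_preds = [[1, 1, 0, 0], [0, 0, 0, 0]]
frame_targets = [[1, 0, 0, 0], [0, 0, 0, 0]]
score = mean_jaccard(frame_preds, frame_targets)
```

Under this reading, a gain of several points in M_J corresponds to a larger average overlap between predicted and annotated sounding-object masks.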

Abstract

Audio visual segmentation (AVS) aims to segment the sounding objects for each frame of a given video. To distinguish the sounding objects from silent ones, both audio-visual semantic correspondence and temporal interaction are required. The previous method applies multi-frame cross-modal attention to conduct pixel-level interactions between audio features and visual features of multiple frames simultaneously, which is both redundant and implicit. In this paper, we propose an Audio-Queried Transformer architecture, AQFormer, where we define a set of object queries conditioned on audio information and associate each of them to particular sounding objects. Explicit object-level semantic correspondence between audio and visual modalities is established by gathering object information from visual features with predefined audio queries. Besides, an Audio-Bridged Temporal Interaction module is…
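
The idea of audio-conditioned object queries gathering object information from visual features can be illustrated with a heavily simplified, single-head cross-attention sketch. This is not the AQFormer implementation: conditioning by adding the audio feature to each query embedding, the absence of learned projections, and all names below are assumptions made purely for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def audio_conditioned_queries(query_embeds, audio_feat):
    """Assumed conditioning: add the audio feature to every query embedding."""
    return [[q + a for q, a in zip(embed, audio_feat)] for embed in query_embeds]


def gather_object_features(queries, visual_feats):
    """Each query attends over all pixel features and returns a weighted sum."""
    dim = len(visual_feats[0])
    scale = 1.0 / math.sqrt(dim)
    gathered = []
    for q in queries:
        weights = softmax([dot(q, v) * scale for v in visual_feats])
        gathered.append(
            [sum(w * v[k] for w, v in zip(weights, visual_feats)) for k in range(dim)]
        )
    return gathered


# Toy example: 2 queries, 3 pixel features, feature dimension 2.
query_embeds = [[0.5, -0.5], [0.0, 1.0]]   # hypothetical query embeddings
audio_feat = [0.1, 0.2]                    # hypothetical audio feature
visual_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

queries = audio_conditioned_queries(query_embeds, audio_feat)
object_feats = gather_object_features(queries, visual_feats)
```

Each row of `object_feats` is one object-level summary of the frame, which is the sense in which the correspondence is established per object rather than per pixel.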

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization

Methods: Attention Is All You Need · Softmax · Dense Connections · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Linear Layer · Residual Connection · Adam · Multi-Head Attention · Layer Normalization