Multi-scale Multi-instance Visual Sound Localization and Segmentation

Shentong Mo; Haofan Wang

arXiv:2409.00486 · cs.CV · September 4, 2024

Open Access

TL;DR

This paper introduces M2VSL, a multi-scale multi-instance framework that improves visual sound localization by leveraging multi-scale features and a transformer to achieve state-of-the-art results in localization and segmentation.

Contribution

The paper proposes a novel multi-scale multi-instance framework with a transformer for enhanced visual sound localization and segmentation.

Findings

01. Achieves state-of-the-art performance on multiple benchmarks.

02. Effectively leverages multi-scale features for better localization.

03. Outperforms previous methods in sound source segmentation.

Abstract

Visual sound localization is a typical and challenging problem that predicts the location of objects corresponding to the sound source in a video. Previous methods mainly used the audio-visual association between global audio and one-scale visual features to localize sounding objects in each image. Despite their promising performance, they omitted multi-scale visual features of the corresponding image, and they cannot learn discriminative regions compared to ground truths. To address this issue, we propose a novel multi-scale multi-instance visual sound localization framework, namely M2VSL, that can directly learn multi-scale semantic features associated with sound sources from the input image to localize sounding objects. Specifically, our M2VSL leverages learnable multi-scale visual features to align audio-visual representations at multi-level locations of the corresponding image. We…
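The alignment idea described in the abstract can be illustrated with a small sketch. The abstract is truncated here and does not specify the similarity function, the pooling rule, or the transformer design, so everything below is an assumption made for illustration: cosine similarity between one global audio vector and each visual location, max pooling over locations as the multiple-instance step, and a plain average across scales. The function names (`cosine`, `similarity_map`, `multi_scale_localize`) are hypothetical and do not come from the paper, and the transformer component is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_map(audio, feature_map):
    """Compare one audio vector with every location of an H x W grid of visual vectors."""
    return [[cosine(audio, location) for location in row] for row in feature_map]


def multi_scale_localize(audio, feature_maps):
    """Return one similarity map per scale plus a pooled audio-visual score.

    Assumption (not from the paper): each spatial location is treated as an
    instance, the best-matching instance represents its scale (max pooling),
    and the per-scale scores are averaged.
    """
    maps = [similarity_map(audio, fm) for fm in feature_maps]
    per_scale = [max(max(row) for row in m) for m in maps]
    return maps, sum(per_scale) / len(per_scale)


# Toy input: a 2-dimensional audio embedding and two visual scales.
audio = [1.0, 0.0]
coarse = [[[1.0, 0.0]]]                      # 1 x 1 grid
fine = [[[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 1.0], [0.0, 1.0]]]            # 2 x 2 grid
maps, score = multi_scale_localize(audio, [coarse, fine])
```

In this toy input only the top-left location of the finer grid matches the audio vector, so its similarity map peaks there; a real system would presumably upsample and combine such maps to produce the final localization, which this sketch does not attempt.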

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies

Methods: ALIGN