Multi-scale Multi-instance Visual Sound Localization and Segmentation
Shentong Mo, Haofan Wang

TL;DR
This paper introduces M2VSL, a multi-scale multi-instance framework that improves visual sound localization by leveraging multi-scale features and a transformer to achieve state-of-the-art results in localization and segmentation.
Contribution
The paper proposes a novel multi-scale multi-instance framework with a transformer for enhanced visual sound localization and segmentation.
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Effectively leverages multi-scale features for better localization.
Outperforms previous methods in sound source segmentation.
Abstract
Visual sound localization is a typical and challenging problem that predicts the location of objects corresponding to the sound source in a video. Previous methods mainly used the audio-visual association between global audio and one-scale visual features to localize sounding objects in each image. Despite their promising performance, they omitted multi-scale visual features of the corresponding image, and they cannot learn discriminative regions compared to ground truths. To address this issue, we propose a novel multi-scale multi-instance visual sound localization framework, namely M2VSL, that can directly learn multi-scale semantic features associated with sound sources from the input image to localize sounding objects. Specifically, our M2VSL leverages learnable multi-scale visual features to align audio-visual representations at multi-level locations of the corresponding image. We…
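The idea of aligning one global audio embedding with visual features at several scales can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (similarity_map, multi_scale_localization, mil_score), the nearest-neighbor upsampling, the averaging across scales, and the max-pooling used as a multiple-instance score are all assumptions chosen for illustration. The abstract does not specify how M2VSL fuses scales or aggregates instances.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors along `axis` to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def similarity_map(audio, visual):
    # Cosine similarity between a global audio embedding of shape (D,)
    # and a visual feature map of shape (D, H, W); returns shape (H, W).
    a = l2_normalize(audio, axis=0)
    v = l2_normalize(visual, axis=0)
    return np.einsum("d,dhw->hw", a, v)

def upsample_nearest(sim, size):
    # Nearest-neighbor resize of an (h, w) map to `size` = (H, W).
    h, w = sim.shape
    H, W = size
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return sim[np.ix_(rows, cols)]

def multi_scale_localization(audio, visual_pyramid, out_size):
    # Hypothetical fusion: compute one similarity map per scale,
    # resize each to a common resolution, then average them.
    maps = [upsample_nearest(similarity_map(audio, v), out_size)
            for v in visual_pyramid]
    return np.mean(maps, axis=0)

def mil_score(sim):
    # Hypothetical multiple-instance aggregation: treat every location
    # as an instance and score the image by its best-matching location.
    return sim.max()

# Toy usage with random features at three assumed scales.
rng = np.random.default_rng(0)
audio = rng.normal(size=8)
pyramid = [rng.normal(size=(8, s, s)) for s in (4, 8, 16)]
heatmap = multi_scale_localization(audio, pyramid, out_size=(16, 16))
print(heatmap.shape, mil_score(heatmap))
```

In a sketch like this, the coarse scales contribute broad, object-level evidence while the fine scales sharpen boundaries, which is the intuition behind using more than one visual scale for localization.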
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
Methods: ALIGN
