V-SlowFast Network for Efficient Visual Sound Separation
Lingyu Zhu, Esa Rahtu

TL;DR
This paper introduces V-SlowFast, a lightweight three-stream network for visual sound separation. It operates on a visual frame and on spectrograms at two temporal resolutions, adds contrastive objectives and an audio-visual global attention module, and outperforms previous single-frame methods on MUSIC-21, AVE, and VGG-Sound.
Contribution
The paper proposes a new V-SlowFast framework with multi-resolution spectrograms, contrastive learning, and an attention module, achieving state-of-the-art results with fewer parameters.
Findings
Outperforms the previous state of the art in single-frame visual sound separation on the MUSIC-21, AVE, and VGG-Sound datasets.
Achieves significant reduction in model size and computational complexity.
Demonstrates effectiveness of multi-resolution spectrograms and contrastive learning in sound separation.
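To make the contrastive-learning finding more concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss in NumPy. It is not the paper's formulation: the summary does not specify the two objectives, so the pairing (each visual embedding matched with the embedding of its own sound), the function name `info_nce`, the temperature, and the embedding sizes are all assumptions made for illustration.

```python
import numpy as np

def info_nce(visual, audio, temperature=0.1):
    """Generic InfoNCE loss (assumed form, not taken from the paper).

    Row i of `visual` and row i of `audio` are treated as a positive pair;
    every other row in the batch serves as a negative.
    """
    # L2-normalise so the dot product is a cosine similarity.
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    logits = v @ a.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-likelihood of the positive (diagonal) pairs.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 16))       # hypothetical batch of visual embeddings
aligned_loss = info_nce(visual, visual)  # perfectly matched pairs
shuffled_loss = info_nce(visual, np.roll(visual, 1, axis=0))  # mismatched pairs
```

Under these assumptions, matched pairs yield a lower loss than mismatched ones, which is the pressure a contrastive objective places on the visual features.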
Abstract
The objective of this paper is to perform visual sound separation: i) we study visual sound separation on spectrograms of different temporal resolutions; ii) we propose a new light yet efficient three-stream framework V-SlowFast that operates on Visual frame, Slow spectrogram, and Fast spectrogram. The Slow spectrogram captures the coarse temporal resolution while the Fast spectrogram contains the fine-grained temporal resolution; iii) we introduce two contrastive objectives to encourage the network to learn discriminative visual features for separating sounds; iv) we propose an audio-visual global attention module for audio and visual feature fusion; v) the introduced V-SlowFast model outperforms previous state-of-the-art in single-frame based visual sound separation on small- and large-scale datasets: MUSIC-21, AVE, and VGG-Sound. We also propose a small V-SlowFast architecture…
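The Slow and Fast spectrograms described in the abstract can be illustrated with a short NumPy sketch that computes two magnitude spectrograms of the same signal at different temporal resolutions. The window length, hop sizes, sample rate, and function names are assumptions chosen for illustration; the summary does not state the settings used in the paper.

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft, hop):
    """Magnitude STFT with a Hann window; returns (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T

def slow_fast_spectrograms(signal, n_fft=1024, slow_hop=512, fast_hop=128):
    """Two views of one signal: a large hop gives few frames (coarse in time),
    a small hop gives many frames (fine in time). Values are assumed."""
    slow = magnitude_spectrogram(signal, n_fft, slow_hop)
    fast = magnitude_spectrogram(signal, n_fft, fast_hop)
    return slow, fast

sample_rate = 11025                      # assumed sample rate
t = np.arange(2 * sample_rate) / sample_rate
# Synthetic two-tone "mixture" standing in for real audio.
mixture = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)
slow, fast = slow_fast_spectrograms(mixture)
```

Both outputs share the same number of frequency bins; only the number of time frames differs, so the Fast view carries roughly four times as many frames as the Slow view with these hop sizes.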
Videos
V-SlowFast Network for Efficient Visual Sound Separation · YouTube
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Adaptive Filtering Techniques
