Leveraging Category Information for Single-Frame Visual Sound Source Separation
Lingyu Zhu, Esa Rahtu

TL;DR
This paper introduces simple, efficient models for visual sound source separation that use only a single video frame and exploit sound source category information. On the MUSIC dataset, they achieve comparable or better results than several recent, more complex methods.
Contribution
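The paper proposes two models for single-frame visual sound separation that exploit category information during training: one assumes category labels are available, and the other assumes only that it is known whether a pair of training samples comes from the same or a different category.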
Findings
Models achieve comparable or better performance than several recent baselines on the MUSIC dataset
Single-frame approach reduces complexity compared to multi-stage architectures
Category information enhances separation performance
Abstract
Visual sound source separation aims at identifying sound components in a given sound mixture in the presence of visual cues. Prior works have demonstrated impressive results, but at the expense of large multi-stage architectures and complex data representations (e.g. optical flow trajectories). In contrast, we study simple yet efficient models for visual sound separation using only a single video frame. Furthermore, our models are able to exploit the information of the sound source category in the separation process. To this end, we propose two models where we assume that i) the category labels are available at training time, or ii) we know whether the training sample pairs are from the same or different categories. The experiments with the MUSIC dataset show that our models obtain comparable or better performance than several recent baseline methods. The code is available at…
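The two training assumptions in the abstract can be illustrated with two generic supervision signals. The abstract does not state which loss functions the paper uses, so the following is only a sketch under assumptions: a softmax cross-entropy stands in for the case where category labels are available, and a margin-based contrastive loss stands in for the case where only same/different pair information is known. The function names, the margin value, and the use of embedding vectors are hypothetical and not taken from the paper.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (assumed stand-in for case i).

    logits: list of per-category scores; label: index of the true category.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[label]


def pair_contrastive(emb_a, emb_b, same_category, margin=1.0):
    """Contrastive loss for one pair (assumed stand-in for case ii).

    Same-category pairs are pulled together; different-category pairs are
    pushed apart until their distance reaches the margin.
    """
    dist = math.dist(emb_a, emb_b)  # Euclidean distance between embeddings
    if same_category:
        return dist ** 2
    return max(0.0, margin - dist) ** 2


# Case i: three categories with uniform scores, true category is index 1.
label_loss = cross_entropy([0.0, 0.0, 0.0], 1)

# Case ii: two embeddings known only to come from different categories.
pair_loss = pair_contrastive([0.0, 0.0], [0.0, 0.0], same_category=False)
```

The second function never needs category identities, only a same/different flag per pair, which is why a pairwise formulation suits the weaker supervision setting described in the abstract.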
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Adaptive Filtering Techniques
