Leveraging Category Information for Single-Frame Visual Sound Source Separation

Lingyu Zhu; Esa Rahtu

arXiv:2007.07984·cs.CV·April 19, 2021

TL;DR

This paper introduces simple, efficient models for visual sound source separation using only a single video frame, leveraging category information to improve performance, and achieves comparable or better results than complex existing methods.

Contribution

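The paper proposes two models for single-frame visual sound separation that exploit category information during training: one assumes category labels are available, and the other assumes only that each training pair is known to come from the same or different categories.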

Findings

01 The models achieve comparable or better performance than several recent baselines on the MUSIC dataset

02 The single-frame approach reduces complexity compared to multi-stage architectures

03 Category information enhances separation performance

Abstract

Visual sound source separation aims at identifying sound components from a given sound mixture in the presence of visual cues. Prior works have demonstrated impressive results, but at the expense of large multi-stage architectures and complex data representations (e.g. optical flow trajectories). In contrast, we study simple yet efficient models for visual sound separation using only a single video frame. Furthermore, our models are able to exploit the information of the sound source category in the separation process. To this end, we propose two models where we assume that i) the category labels are available at training time, or ii) we know whether the training sample pairs are from the same or different categories. The experiments with the MUSIC dataset show that our model obtains comparable or better performance compared to several recent baseline methods. The code is available at…
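
The two training assumptions in the abstract correspond to two different kinds of supervision signal. The sketch below is only an illustration of that difference, not the authors' implementation: the abstract does not say which losses the models use, so a softmax cross-entropy (for known category labels) and a margin-based contrastive loss (for same/different pair information) are swapped in as common stand-ins. The function names, the `margin` parameter, and the array shapes are all hypothetical.

```python
import numpy as np


def category_cross_entropy(logits, labels):
    """Assumed stand-in for case i): category labels are known at training time.

    logits: (batch, num_categories) float array of unnormalized scores.
    labels: (batch,) integer array of category indices.
    """
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Average negative log-likelihood of the true category.
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def pair_contrastive_loss(emb_a, emb_b, same_category, margin=1.0):
    """Assumed stand-in for case ii): only same/different pair info is known.

    emb_a, emb_b: (batch, dim) float arrays of paired embeddings.
    same_category: (batch,) float array, 1.0 for same category, 0.0 otherwise.
    """
    dist = np.linalg.norm(emb_a - emb_b, axis=1)
    # Same-category pairs are pulled together.
    pull = same_category * dist ** 2
    # Different-category pairs are pushed apart until they exceed the margin.
    push = (1.0 - same_category) * np.maximum(0.0, margin - dist) ** 2
    return float((pull + push).mean())
```

The design point this illustrates is that case ii) needs strictly weaker annotation than case i): the pairwise loss never sees a category index, only a binary same/different flag per training pair.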

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Adaptive Filtering Techniques