A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation

Karn N. Watcharasupat, Chih-Wei Wu, Yiwei Ding, Iroro Orife, Aaron J. Hipple, Phillip A. Williams, Scott Kramer, Alexander Lerch, and William Wolcott

arXiv:2309.02539 · eess.AS · August 27, 2024 · 1 citation

TL;DR

This paper introduces a generalized neural network model for cinematic audio source separation that leverages psychoacoustic frequency scales, a novel loss function, and a flexible architecture to improve separation quality and computational efficiency.

Contribution

The work extends the Bandsplit RNN to handle various frequency partitions using psychoacoustic scales, introduces a new loss function, and employs a common-encoder setup for better performance and flexibility.
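To make the idea of an overcomplete frequency partition concrete, the following is a minimal sketch, not the authors' band definition. It assumes the mel scale as the psychoacoustic scale (the summary does not say which scales were used), and the function names, band count, sample rate, and FFT size are all hypothetical. Each band spans two steps on the mel axis, so neighboring bands overlap and every frequency bin is covered, most of them more than once.

```python
import numpy as np


def hz_to_mel(f):
    # Common mel-scale formula (an assumed choice of psychoacoustic scale).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def overlapping_bands(n_bands, sample_rate, n_fft):
    """Return (start_bin, end_bin) pairs with an exclusive end.

    Band b runs from mel edge b to mel edge b + 2, so consecutive
    bands share roughly half of their bins (a redundant partition).
    """
    n_bins = n_fft // 2 + 1
    nyquist = sample_rate / 2.0
    # n_bands + 2 edges, equally spaced on the mel axis from 0 Hz to Nyquist.
    edges_mel = np.linspace(0.0, hz_to_mel(nyquist), n_bands + 2)
    edges_hz = mel_to_hz(edges_mel)
    edges_bin = np.round(edges_hz / nyquist * (n_bins - 1)).astype(int)

    bands = []
    for b in range(n_bands):
        start = int(edges_bin[b])
        end = int(min(max(edges_bin[b + 2] + 1, start + 1), n_bins))
        bands.append((start, end))
    return bands


# Hypothetical settings, chosen only for illustration.
bands = overlapping_bands(n_bands=8, sample_rate=44100, n_fft=2048)
```

Because the edges are uniform on the mel axis, the low-frequency bands are narrow in hertz and the high-frequency bands are wide, which is the usual motivation for using a perceptual scale to define the bands.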

Findings

01. Achieved state-of-the-art results on the Divide and Remaster dataset.

02. Outperformed the ideal ratio mask on dialogue separation.

03. Reduced computational complexity with a shared encoder architecture.

Abstract

Cinematic audio source separation is a relatively new subtask of audio source separation, with the aim of extracting the dialogue, music, and effects stems from their mixture. In this work, we developed a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis. Psychoacoustically motivated frequency scales were used to inform the band definitions which are now defined with redundancy for more reliable feature extraction. A loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm was proposed. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility during inference time with detachable decoders. Our best…
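The abstract describes the loss only as motivated by the signal-to-noise ratio and the sparsity-promoting 1-norm. One way such a loss could look, offered as a sketch under that reading rather than as the paper's exact definition, is a log-ratio of the 1-norm of the error to the 1-norm of the target. The function name `l1_snr_loss` and the stabilizer `eps` are assumptions introduced here.

```python
import numpy as np


def l1_snr_loss(estimate, target, eps=1e-8):
    """SNR-style loss in dB built from 1-norms (illustrative assumption).

    Lower is better: the value is 0 dB when the error is as large as the
    target in 1-norm, and becomes more negative as the estimate improves.
    """
    estimate = np.asarray(estimate, dtype=float)
    target = np.asarray(target, dtype=float)
    error_l1 = np.sum(np.abs(estimate - target))
    target_l1 = np.sum(np.abs(target))
    # eps keeps the ratio finite for a perfect estimate or a silent target.
    return 10.0 * np.log10((error_l1 + eps) / (target_l1 + eps))


# Toy signals, for illustration only.
target = np.array([1.0, -1.0, 0.5, 0.0])
noisy = target + np.array([0.1, -0.1, 0.0, 0.05])
silent = np.zeros_like(target)

loss_noisy = l1_snr_loss(noisy, target)
loss_silent = l1_snr_loss(silent, target)
loss_perfect = l1_snr_loss(target, target)
```

In this sketch an all-zero estimate scores exactly 0 dB, and any estimate closer to the target scores below that, so minimizing the loss rewards a higher signal-to-error ratio.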

Code & Models

Repositories

karnwatcharasupat/bandit (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Music and Audio Processing