DnR-nonverbal: Cinematic Audio Source Separation Dataset Containing Non-Verbal Sounds
Takuya Hasumi, Yusuke Fujita

TL;DR
This paper introduces DnR-nonverbal, a new cinematic audio source separation dataset that includes non-verbal sounds like laughter and screams, addressing limitations of existing datasets and improving model performance on real movie audio.
Contribution
The paper presents DnR-nonverbal, a new dataset for cinematic audio source separation whose speech stem includes non-verbal sounds such as laughter and screams, so that models trained on it are less likely to separate such sounds as effects.
Findings
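Current CASS models tend to separate non-verbal sounds such as laughter and screams as effects rather than speech.
Training on the new dataset addresses this issue on both synthetic and actual movie audio.
Including non-verbal sounds in the speech stem brings training data closer to the acted-out voices found in real movies.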
Abstract
We propose a new dataset for cinematic audio source separation (CASS) that handles non-verbal sounds. Existing CASS datasets contain only read-style speech in the speech stem. These datasets differ from actual movie audio, which is more likely to include acted-out voices. Consequently, models trained on conventional datasets tend to separate emotionally heightened voices, such as laughter and screams, as effects rather than speech. To address this problem, we build a new dataset, DnR-nonverbal. The proposed dataset includes non-verbal sounds like laughter and screams in the speech stem. Through experiments, we reveal the non-verbal sound extraction issue in the current CASS model and show that our dataset effectively addresses it on both synthetic and actual movie audio. Our dataset is available at https://zenodo.org/records/15470640.
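The core idea in the abstract, placing non-verbal sounds in the speech stem rather than the effects stem, can be illustrated with a minimal sketch. This is not the authors' dataset pipeline: the function names `fit_length` and `mix_with_nonverbal`, the three-stem layout (speech, music, effects), the `offset` parameter, and the absence of any gain or loudness normalization are all assumptions made for illustration.

```python
import numpy as np

def fit_length(x, n):
    """Truncate or zero-pad a mono signal to exactly n samples (hypothetical helper)."""
    x = np.asarray(x, dtype=np.float64)[:n]
    return np.pad(x, (0, n - len(x)))

def mix_with_nonverbal(speech, nonverbal, music, effects, offset=0):
    """Build a mixture whose speech stem also contains a non-verbal clip.

    Assumption: the training target for the speech stem is speech + non-verbal
    audio, so a separator learns to keep laughter or screams with speech.
    """
    n = len(speech)
    speech_stem = fit_length(speech, n)
    if 0 <= offset < n:
        end = min(n, offset + len(nonverbal))
        # Overlay the non-verbal clip onto the speech stem, not the effects stem.
        speech_stem[offset:end] += np.asarray(nonverbal, dtype=np.float64)[: end - offset]
    stems = {
        "speech": speech_stem,
        "music": fit_length(music, n),
        "effects": fit_length(effects, n),
    }
    # The mixture is the plain sum of the stems (no loudness normalization here).
    mixture = stems["speech"] + stems["music"] + stems["effects"]
    return mixture, stems
```

A separator trained on pairs like `(mixture, stems)` would see laughter and screams labeled as speech, which is the labeling choice the abstract describes.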
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis
