DnR-nonverbal: Cinematic Audio Source Separation Dataset Containing Non-Verbal Sounds

Takuya Hasumi; Yusuke Fujita

arXiv:2506.02499 · cs.SD · June 10, 2025

TL;DR

This paper introduces DnR-nonverbal, a new cinematic audio source separation dataset that includes non-verbal sounds like laughter and screams, addressing limitations of existing datasets and improving model performance on real movie audio.

Contribution

The paper presents DnR-nonverbal, a cinematic audio source separation dataset whose speech stem includes non-verbal sounds such as laughter and screams, so that models trained on it extract these sounds as speech rather than as effects.

Findings

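01. Models trained on conventional CASS datasets tend to separate emotionally heightened voices, such as laughter and screams, as effects rather than as speech.

02. Training on DnR-nonverbal addresses this issue on both synthetic and actual movie audio.

03. Placing non-verbal sounds in the speech stem brings the training data closer to actual movie audio, which is more likely to contain acted-out voices.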

Abstract

We propose a new dataset for cinematic audio source separation (CASS) that handles non-verbal sounds. Existing CASS datasets contain only reading-style speech in the speech stem. These datasets differ from actual movie audio, which is more likely to include acted-out voices. Consequently, models trained on conventional datasets tend to separate emotionally heightened voices, such as laughter and screams, as effects rather than as speech. To address this problem, we build a new dataset, DnR-nonverbal. The proposed dataset includes non-verbal sounds like laughter and screams in the speech stem. Our experiments reveal the issue of non-verbal sound extraction in the current CASS model and show that our dataset effectively addresses it on both synthetic and actual movie audio. Our dataset is available at https://zenodo.org/records/15470640.
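
The central idea in the abstract, that non-verbal sounds belong to the speech stem rather than the effects stem, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `overlay` and `build_example`, the plain-list signal representation, and the simple additive mixing are all assumptions made for illustration, and the abstract does not describe source corpora, loudness normalization, or sampling rates.

```python
from typing import Dict, List

Signal = List[float]  # hypothetical mono signal as a list of samples


def overlay(base: Signal, clip: Signal, offset: int) -> Signal:
    """Return a copy of `base` with `clip` added starting at sample `offset`."""
    out = list(base)
    for i, sample in enumerate(clip):
        j = offset + i
        if 0 <= j < len(out):  # samples falling outside `base` are dropped
            out[j] += sample
    return out


def build_example(speech: Signal, nonverbal: Signal, music: Signal,
                  effects: Signal, offset: int) -> Dict[str, Signal]:
    """Build one training example (illustrative assumption, not the paper's code).

    The non-verbal clip (e.g. laughter) is added to the speech stem, so the
    separation target for "speech" contains it and the "effects" target does not.
    """
    n = len(speech)
    if not (len(music) == len(effects) == n):
        raise ValueError("all stems must have the same length")
    speech_stem = overlay(speech, nonverbal, offset)
    # The mixture is the plain sum of the three stems.
    mixture = [s + m + e for s, m, e in zip(speech_stem, music, effects)]
    return {"speech": speech_stem, "music": list(music),
            "effects": list(effects), "mixture": mixture}


example = build_example(
    speech=[0.25, 0.0, 0.0, 0.25],
    nonverbal=[0.5, 0.5],       # stand-in for a laughter or scream clip
    music=[0.125, 0.125, 0.125, 0.125],
    effects=[0.0, 0.0, 0.0, 0.5],
    offset=1,
)
print(example["speech"])   # speech target now contains the non-verbal clip
print(example["mixture"])  # model input: sum of all three stems
```

In this sketch the effects stem is left untouched, so a model trained on such examples is supervised to route the non-verbal clip to the speech output.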

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis