A Knowledge-Driven Approach to Target Speech Extraction in the Presence of Background Sound Effects for Cinematic Audio Source Separation (CASS)
Chun-wei Ho, Sabato Marco Siniscalchi, Kai Li, Chin-Hui Lee

TL;DR
This paper introduces a knowledge-driven method for extracting target speech from cinematic audio with background effects, leveraging manners of articulation to improve separation quality.
Contribution
It presents a novel approach that incorporates articulator-aware knowledge sources into speech separation, enhancing extraction in complex cinematic sound environments.
Findings
Knowledge-driven features improve separation results on the Sound Demixing Challenge data for CASS.
Articulator-aware methods outperform knowledge-agnostic approaches.
The improvement is most pronounced for speech segments buried in unspecified background sound events.
Abstract
We propose a knowledge-driven approach to target speech extraction in the presence of background sound effects already recorded in cinematic audio. The knowledge sources studied are manners of articulation, which are detected in speech frames and used to form a knowledge vector that serves as part of the features. This vector is intended to enhance speech separation and target speech extraction, because some short speech segments are often difficult to separate from mixed background sounds. Testing on the recent Sound Demixing Challenge data for cinematic audio source separation (CASS) shows that utilizing articulator-aware knowledge sources produces better separation results than those obtained without using any knowledge, especially for speech segments buried in unspecified background sound events.
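The abstract's idea of a per-frame knowledge vector used as part of the features can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration and not taken from the paper: the manner class list, the feature dimensions, the use of a softmax over detector scores, and the helper name `append_knowledge_vector` are all hypothetical. The paper may combine the knowledge vector with the separator's features in a different way.

```python
import numpy as np

# Hypothetical manner-of-articulation inventory; the paper's class set is not given here.
MANNER_CLASSES = ["vowel", "stop", "fricative", "nasal", "approximant", "silence"]


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def append_knowledge_vector(spectral_frames, manner_logits):
    """Concatenate per-frame manner posteriors to per-frame spectral features.

    spectral_frames: array of shape (num_frames, num_bins)
    manner_logits:   array of shape (num_frames, num_manner_classes),
                     assumed to come from some manner detector (not shown)
    """
    if spectral_frames.shape[0] != manner_logits.shape[0]:
        raise ValueError("spectral frames and manner scores must have the same frame count")
    knowledge = softmax(manner_logits, axis=-1)  # one knowledge vector per frame
    return np.concatenate([spectral_frames, knowledge], axis=-1)


# Random stand-ins for real features and detector outputs (assumed sizes).
rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 257))
logits = rng.standard_normal((100, len(MANNER_CLASSES)))
augmented = append_knowledge_vector(frames, logits)
print(augmented.shape)  # (100, 263)
```

Concatenation along the feature axis keeps the original spectral features untouched, so a knowledge-agnostic baseline corresponds to simply omitting the appended columns.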
