X-DC: Explainable Deep Clustering based on Learnable Spectrogram Templates
Chihiro Watanabe, Hirokazu Kameoka

TL;DR
This paper introduces X-DC, an interpretable deep clustering method for speech separation that uses learnable spectrogram templates, enhancing model transparency and adaptability while maintaining high separation performance.
Contribution
The paper proposes an explainable deep clustering framework with interpretable spectrogram templates, enabling better understanding and adaptation in speech separation tasks.
Findings
X-DC achieves speech separation performance comparable to that of conventional DC.
The model's interpretability allows visualization of the separation process.
Incorporating model adaptation improves robustness to mismatch between training and test conditions.
Abstract
Deep neural networks (DNNs) have achieved substantial predictive performance in various speech processing tasks. Particularly, it has been shown that a monaural speech separation task can be successfully solved with a DNN-based method called deep clustering (DC), which uses a DNN to describe the process of assigning a continuous vector to each time-frequency (TF) bin and measure how likely each pair of TF bins is to be dominated by the same speaker. In DC, the DNN is trained so that the embedding vectors for the TF bins dominated by the same speaker are forced to get close to each other. One concern regarding DC is that the embedding process described by a DNN has a black-box structure, which is usually very hard to interpret. The potential weakness owing to the non-interpretable black-box structure is that it lacks the flexibility of addressing the mismatch between training and test…
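As a rough illustration of the conventional DC objective described in the abstract (not the template-based X-DC model itself), the sketch below computes the commonly used affinity loss ||V V^T - Y Y^T||_F^2, where V holds one embedding per TF bin and Y holds one-hot dominant-speaker labels. The function name `dc_affinity_loss`, the array shapes, and the random toy data are assumptions made for this example; in practice a DNN would produce the embeddings.

```python
import numpy as np

def dc_affinity_loss(V, Y):
    """Affinity loss ||V V^T - Y Y^T||_F^2 for deep clustering.

    V: (N, D) array, one embedding vector per TF bin (hypothetical DNN output).
    Y: (N, C) array, one-hot label of the dominant speaker per TF bin.
    The expansion below avoids building the N x N affinity matrices.
    """
    VtV = V.T @ V  # (D, D)
    VtY = V.T @ Y  # (D, C)
    YtY = Y.T @ Y  # (C, C)
    return np.sum(VtV ** 2) - 2.0 * np.sum(VtY ** 2) + np.sum(YtY ** 2)

# Toy data: N TF bins, D-dimensional embeddings, C speakers (all assumed values).
rng = np.random.default_rng(0)
N, D, C = 200, 20, 2
labels = rng.integers(0, C, size=N)
Y = np.eye(C)[labels]

# Random unit-norm embeddings stand in for an untrained network.
V = rng.standard_normal((N, D))
V /= np.linalg.norm(V, axis=1, keepdims=True)

loss_random = dc_affinity_loss(V, Y)
# Embeddings identical to the one-hot labels reproduce the target affinities.
loss_ideal = dc_affinity_loss(Y, Y)
```

The loss is low when TF bins dominated by the same speaker have similar embeddings and bins dominated by different speakers have near-orthogonal ones, which is the "forced to get close to each other" behaviour the abstract refers to.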
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
