CUHK-EE Systems for the vTAD Challenge at NCMMSC 2025
Aemon Yat Fei Chiu, Jingyu Li, Yusheng Tian, Guangyan Zhang, Tan Lee

TL;DR
This paper describes voice timbre attribute detection (vTAD) systems built on WavLM-Large embeddings and two Diff-Net variants for the NCMMSC 2025 challenge. The FFN variant reaches 77.96% accuracy and 21.79% EER on unseen speakers, and the paper offers insights into the factors that affect performance.
Contribution
The paper presents timbre attribute detection systems that pair WavLM-Large speaker embeddings with two Diff-Net architectures (FFN and SE-ResFFN), and compares how well each generalises from seen to unseen speakers.
Findings
WavLM-Large+FFN generalises better to unseen speakers, achieving 77.96% accuracy and 21.79% EER.
WavLM-Large+SE-ResFFN excels in the 'Seen' setting with 94.42% accuracy.
Model complexity influences generalization and robustness in timbre detection.
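The findings above are reported as accuracy and equal error rate (EER). As a reminder of what the second metric measures, here is a minimal sketch of the standard EER definition: the operating point where the false-acceptance rate equals the false-rejection rate. This page does not say how the challenge scores trials, so the function name `equal_error_rate`, the score convention (higher means "positive"), and the toy data are all assumptions for illustration.

```python
def equal_error_rate(scores, labels):
    """Return the EER for binary labels (1 = positive, 0 = negative).

    Assumed convention: a trial is accepted when its score is >= threshold.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative trial")

    best_gap, eer = float("inf"), None
    # Candidate thresholds: every observed score, plus one above the maximum.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    for t in thresholds:
        far = sum(s >= t for s in negatives) / len(negatives)  # false accepts
        frr = sum(s < t for s in positives) / len(positives)   # false rejects
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of FAR and FRR where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical toy trials, not data from the paper.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

A lower EER is better, which is why the two headline numbers move in opposite directions: accuracy should be high and EER should be low.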
Abstract
This paper presents the Voice Timbre Attribute Detection (vTAD) systems developed by the Digital Signal Processing & Speech Technology Laboratory (DSP&STL) of the Department of Electronic Engineering (EE) at The Chinese University of Hong Kong (CUHK) for the 20th National Conference on Human-Computer Speech Communication (NCMMSC 2025) vTAD Challenge. The proposed systems leverage WavLM-Large embeddings with attentive statistical pooling (ASTP) to extract robust speaker representations, followed by two variants of Diff-Net, i.e., Feed-Forward Neural Network (FFN) and Squeeze-and-Excitation-enhanced Residual FFN (SE-ResFFN), to compare timbre attribute intensities between utterance pairs. Experimental results demonstrate that the WavLM-Large+FFN system generalises better to unseen speakers, achieving 77.96% accuracy and 21.79% equal error rate (EER), while the WavLM-Large+SE-ResFFN model…
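The pooling step named in the abstract can be illustrated with a minimal NumPy sketch of attentive statistics pooling in its commonly used channel-wise form. It collapses a variable-length sequence of frame-level features into one fixed-size vector by concatenating an attention-weighted mean and standard deviation. The abstract does not give the authors' exact configuration, so the layer sizes, the random weights, and the function name `attentive_stats_pooling` are assumptions. Random features stand in for WavLM-Large outputs, whose real dimension is 1024.

```python
import numpy as np


def attentive_stats_pooling(frames, w1, b1, w2, b2):
    """Pool frame-level features of shape (T, D) into a (2 * D,) vector."""
    hidden = np.tanh(frames @ w1 + b1)             # (T, H) attention hidden layer
    logits = hidden @ w2 + b2                      # (T, D) per-channel attention logits
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)  # softmax over time
    mean = (weights * frames).sum(axis=0)          # attention-weighted mean
    var = (weights * frames ** 2).sum(axis=0) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-8, None))        # attention-weighted std
    return np.concatenate([mean, std])


# Hypothetical sizes: 50 frames, 16-dim features, 8-dim attention bottleneck.
rng = np.random.default_rng(0)
T, D, H = 50, 16, 8
frames = rng.standard_normal((T, D))
w1, b1 = rng.standard_normal((D, H)) * 0.1, np.zeros(H)
w2, b2 = rng.standard_normal((H, D)) * 0.1, np.zeros(D)

pooled = attentive_stats_pooling(frames, w1, b1, w2, b2)  # shape (32,)
```

When the attention weights are uniform, the output reduces to the plain mean and standard deviation over time; the learned attention lets the pooling emphasise frames that carry more speaker information.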
