CUHK-EE Systems for the vTAD Challenge at NCMMSC 2025

Aemon Yat Fei Chiu; Jingyu Li; Yusheng Tian; Guangyan Zhang; Tan Lee

arXiv:2507.23266·eess.AS·February 16, 2026

TL;DR

This paper describes voice timbre attribute detection systems for the NCMMSC 2025 challenge that combine WavLM-Large embeddings with two Diff-Net variants. The WavLM-Large+FFN system reaches 77.96% accuracy and 21.79% EER on unseen speakers, and the paper offers insights into the factors affecting performance.

Contribution

The paper presents timbre attribute detection systems that pair WavLM-Large embeddings with two Diff-Net architectures (FFN and SE-ResFFN), and compares how well each generalises to unseen speakers.

Findings

01. The WavLM-Large+FFN system achieves 77.96% accuracy and 21.79% EER on unseen speakers.

02. The WavLM-Large+SE-ResFFN system excels in the 'Seen' setting, with 94.42% accuracy.

03. Model complexity influences generalization and robustness in timbre detection.

Abstract

This paper presents the Voice Timbre Attribute Detection (vTAD) systems developed by the Digital Signal Processing & Speech Technology Laboratory (DSP&STL) of the Department of Electronic Engineering (EE) at The Chinese University of Hong Kong (CUHK) for the 20th National Conference on Human-Computer Speech Communication (NCMMSC 2025) vTAD Challenge. The proposed systems leverage WavLM-Large embeddings with attentive statistical pooling (ASTP) to extract robust speaker representations, followed by two variants of Diff-Net, i.e., Feed-Forward Neural Network (FFN) and Squeeze-and-Excitation-enhanced Residual FFN (SE-ResFFN), to compare timbre attribute intensities between utterance pairs. Experimental results demonstrate that the WavLM-Large+FFN system generalises better to unseen speakers, achieving 77.96% accuracy and 21.79% equal error rate (EER), while the WavLM-Large+SE-ResFFN model…
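The abstract reports accuracy and equal error rate (EER) for pairwise comparisons. As a rough illustration of how these two metrics relate, the sketch below computes both from a list of pair scores. It assumes each utterance pair receives a score in [0, 1] (the predicted probability that the first utterance is stronger in a given attribute) and a binary label. The function names, the 0.5 decision threshold, and the toy data are hypothetical and are not taken from the paper or the challenge's scoring tools.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of pairs whose thresholded score matches the label.

    The 0.5 threshold is an assumption made for this sketch.
    """
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def equal_error_rate(scores, labels):
    """Approximate EER: the error rate at the threshold where the false
    acceptance rate (FAR) and false rejection rate (FRR) are closest."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative pair")

    best_gap, eer = float("inf"), None
    # Candidate thresholds: every observed score, plus +inf so that
    # "reject everything" is also considered.
    for t in sorted(set(scores)) + [float("inf")]:
        far = sum(s >= t for s in negatives) / len(negatives)
        frr = sum(s < t for s in positives) / len(positives)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy data, invented purely for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(f"accuracy = {accuracy(scores, labels):.2%}")
print(f"EER      = {equal_error_rate(scores, labels):.2%}")
```

A lower EER is better, whereas a higher accuracy is better, so the two reported numbers move in opposite directions as a system improves.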
