Introducing voice timbre attribute detection

Jinghao He; Zhengyan Sheng; Liping Chen; Kong Aik Lee; Zhen-Hua Ling

arXiv:2505.09661 · cs.SD · June 24, 2025

TL;DR

This paper introduces voice timbre attribute detection (vTAD), a new task for analyzing speech timbre using sensory attributes, and proposes a framework based on speaker embeddings tested on the VCTK-RVA dataset.

Contribution

It presents a new task, voice timbre attribute detection, and a framework built on speaker embeddings, with experimental validation of two speaker encoders (ECAPA-TDNN and FACodec) on the VCTK-RVA dataset.

Findings

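01. The ECAPA-TDNN speaker encoder performs better in the seen scenario, where the testing speakers are included in the training set.

02. The FACodec speaker encoder performs better in the unseen scenario, where the testing speakers are not part of the training set.

03. Open-source code and dataset are provided.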
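
The seen and unseen scenarios differ only in whether the testing speakers also appear in training. A minimal sketch of such a speaker-level split follows; the speaker IDs, the `split_speakers` helper, and the number of held-out speakers are hypothetical and are not taken from the paper or its repository.

```python
import random


def split_speakers(speakers, n_unseen, seed=0):
    """Hold out n_unseen speakers that never appear in training (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = sorted(speakers)
    rng.shuffle(shuffled)
    unseen_test = set(shuffled[:n_unseen])   # unseen scenario: disjoint from training
    train = set(shuffled[n_unseen:])
    seen_test = set(train)                   # seen scenario: test speakers are training speakers
    return train, seen_test, unseen_test


# Hypothetical speaker IDs, for illustration only.
speakers = [f"spk{i:02d}" for i in range(10)]
train, seen_test, unseen_test = split_speakers(speakers, n_unseen=3)
```

In this sketch the seen test set reuses the training speakers; which utterances or utterance pairs are reserved for testing is left open, since the page does not specify it.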

Abstract

This paper focuses on explaining the timbre conveyed by speech signals and introduces a task termed voice timbre attribute detection (vTAD). In this task, voice timbre is explained with a set of sensory attributes describing its human perception. A pair of speech utterances is processed, and their intensity is compared in a designated timbre descriptor. Moreover, a framework is proposed, which is built upon the speaker embeddings extracted from the speech utterances. The investigation is conducted on the VCTK-RVA dataset. Experimental examinations on the ECAPA-TDNN and FACodec speaker encoders demonstrated that: 1) the ECAPA-TDNN speaker encoder was more capable in the seen scenario, where the testing speakers were included in the training set; 2) the FACodec speaker encoder was superior in the unseen scenario, where the testing speakers were not part of the training, indicating…
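
To make the task concrete, here is a minimal sketch of a pairwise comparison built on speaker embeddings. It assumes one linear scoring head per descriptor applied to the concatenated embeddings of the two utterances. The descriptor names, the embedding size, the random weights, and the `compare` function are all hypothetical stand-ins and should not be read as the paper's actual architecture.

```python
import math
import random

# Hypothetical descriptor names and a toy embedding size, for illustration only.
ATTRIBUTES = ["bright", "hoarse"]
EMB_DIM = 4


def make_head(dim, seed):
    """Random weights for one linear head (a stand-in for trained parameters)."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(2 * dim)]
    return weights, 0.0  # (weights, bias)


HEADS = {name: make_head(EMB_DIM, seed=i) for i, name in enumerate(ATTRIBUTES)}


def compare(emb_a, emb_b, attribute):
    """Score in (0, 1): how likely utterance B is stronger than A in `attribute`."""
    weights, bias = HEADS[attribute]
    features = list(emb_a) + list(emb_b)  # concatenate the two speaker embeddings
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


# Toy embeddings; in practice these would come from a speaker encoder.
emb_a = [0.1, -0.3, 0.5, 0.2]
emb_b = [0.4, 0.0, -0.2, 0.9]
score = compare(emb_a, emb_b, "bright")
```

A score above 0.5 would be read as "utterance B is stronger in the designated descriptor"; with untrained random weights, as here, the value carries no meaning beyond showing the data flow.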

Code & Models

Repositories

vtad2025-challenge/vtad
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Sparse Evolutionary Training