Introducing voice timbre attribute detection
Jinghao He, Zhengyan Sheng, Liping Chen, Kong Aik Lee, Zhen-Hua Ling

TL;DR
This paper introduces voice timbre attribute detection (vTAD), a new task for analyzing speech timbre using sensory attributes, and proposes a framework based on speaker embeddings tested on the VCTK-RVA dataset.
Contribution
The paper presents a novel task, voice timbre attribute detection, and a framework built on speaker embeddings, with experimental validation of two speaker encoders (ECAPA-TDNN and FACodec) on the VCTK-RVA dataset.
Findings
ECAPA-TDNN performs better on seen speakers.
FACodec generalizes better to unseen speakers.
Open-source code and dataset are provided.
Abstract
This paper focuses on explaining the timbre conveyed by speech signals and introduces a task termed voice timbre attribute detection (vTAD). In this task, voice timbre is explained with a set of sensory attributes describing its human perception. A pair of speech utterances is processed, and their intensities are compared with respect to a designated timbre descriptor. Moreover, a framework is proposed, which is built upon the speaker embeddings extracted from the speech utterances. The investigation is conducted on the VCTK-RVA dataset. Experimental examinations on the ECAPA-TDNN and FACodec speaker encoders demonstrated that: 1) the ECAPA-TDNN speaker encoder was more capable in the seen scenario, where the testing speakers were included in the training set; 2) the FACodec speaker encoder was superior in the unseen scenario, where the testing speakers were not part of the training set, indicating…
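The pairwise formulation in the abstract can be illustrated with a small sketch. It is not the authors' architecture: the function name compare_timbre, the single linear layer, the toy dimensions, and the random stand-in embeddings are all assumptions made for illustration. It only shows the general shape of the task, in which two speaker embeddings go in and, for each timbre attribute, an estimate comes out of how likely the second utterance is to be stronger than the first.

```python
import math
import random

def compare_timbre(emb_a, emb_b, weights, bias):
    """Hypothetical comparison head (not the paper's model).

    Concatenates two speaker embeddings and applies one linear layer
    followed by a sigmoid, giving one probability per timbre attribute
    that utterance B is stronger than utterance A in that attribute.
    """
    pair = list(emb_a) + list(emb_b)  # length 2 * DIM
    probs = []
    for w_row, b in zip(weights, bias):
        logit = sum(w * x for w, x in zip(w_row, pair)) + b
        probs.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
    return probs

# Toy sizes and random values stand in for real speaker embeddings
# (e.g. from a speaker encoder) and for trained parameters.
random.seed(0)
DIM, NUM_ATTRS = 8, 3
emb_a = [random.gauss(0, 1) for _ in range(DIM)]
emb_b = [random.gauss(0, 1) for _ in range(DIM)]
weights = [[random.gauss(0, 1) for _ in range(2 * DIM)] for _ in range(NUM_ATTRS)]
bias = [0.0] * NUM_ATTRS

probs = compare_timbre(emb_a, emb_b, weights, bias)
for i, p in enumerate(probs):
    print(f"attribute {i}: P(B stronger than A) = {p:.3f}")
```

In a real system the weights would be learned from annotated utterance pairs, and a prediction for the designated descriptor would be read off by thresholding the corresponding probability (0.5 is a natural but assumed choice).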
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Sparse Evolutionary Training
