Towards an Interpretable Representation of Speaker Identity via Perceptual Voice Qualities
Robin Netzorg, Bohan Yu, Andrea Guzman, Peter Wu, Luna McNulty, Gopala Anumanchipalli

TL;DR
This paper introduces an interpretable representation of speaker identity based on perceptual voice qualities (PQs), which sits between high-level demographics and low-level acoustic features. It shows that ensembles of non-experts can hear these PQs and that PQ-based representations are predictable from various speech features.
Contribution
It proposes a novel PQ-based representation of speaker identity that is interpretable and accessible to non-experts, expanding understanding of speech perception.
Findings
Ensembles of non-experts can hear PQs, contrary to prior belief.
PQ-based representations are predictable from various speech features.
Adding gendered PQs enhances interpretability of speaker identity.
Abstract
Unlike other data modalities such as text and vision, speech does not lend itself to easy interpretation. While lay people can understand how to describe an image or sentence via perception, non-expert descriptions of speech often end at high-level demographic information, such as gender or age. In this paper, we propose a possible interpretable representation of speaker identity based on perceptual voice qualities (PQs). By adding gendered PQs to the pathology-focused Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) protocol, our PQ-based approach provides a perceptual latent space of the character of adult voices that is an intermediary of abstraction between high-level demographics and low-level acoustic, physical, or learned representations. Contrary to prior belief, we demonstrate that these PQs are hearable by ensembles of non-experts, and further demonstrate that the…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
