Towards an Interpretable Representation of Speaker Identity via Perceptual Voice Qualities

Robin Netzorg; Bohan Yu; Andrea Guzman; Peter Wu; Luna McNulty; Gopala Anumanchipalli

arXiv:2310.02497 · cs.SD · October 5, 2023

TL;DR

This paper introduces an interpretable voice representation based on perceptual voice qualities (PQs) that sits between high-level demographics and low-level acoustic features, and demonstrates that ensembles of non-experts can hear these PQs and that the PQs can be predicted from speech features.

Contribution

It proposes a novel PQ-based representation of speaker identity that is interpretable and accessible to non-experts, expanding understanding of speech perception.

Findings

1. Non-experts can perceive PQs reliably.

2. PQ-based representations are predictable from various speech features.

3. Adding gendered PQs enhances interpretability of speaker identity.

Abstract

Unlike other data modalities such as text and vision, speech does not lend itself to easy interpretation. While lay people can understand how to describe an image or sentence via perception, non-expert descriptions of speech often end at high-level demographic information, such as gender or age. In this paper, we propose a possible interpretable representation of speaker identity based on perceptual voice qualities (PQs). By adding gendered PQs to the pathology-focused Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) protocol, our PQ-based approach provides a perceptual latent space of the character of adult voices that is an intermediary of abstraction between high-level demographics and low-level acoustic, physical, or learned representations. Contrary to prior belief, we demonstrate that these PQs are hearable by ensembles of non-experts, and further demonstrate that the…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing