Low-Cost Detection of Degraded Voice Clones via Source-Output Acoustic Consistency
Jana Shokr, Minos Papadopoulos, Jeremy Cooperstock, Pavo Orepic

TL;DR
This paper shows that simple, interpretable acoustic features such as fundamental frequency (f0) and Harmonics-to-Noise Ratio (HNR) can detect degraded synthetic voices, enabling quick rejection of failed outputs in sensitive applications.
Contribution
It introduces a lightweight, threshold-based detection method using source-output acoustic features for identifying failed voice synthesis outputs.
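As an illustration of what a threshold-based source-output check could look like, the following is a minimal sketch and not the authors' implementation. It assumes frame-level f0 tracks in Hz have already been extracted for the source recording and the cloned output, with unvoiced frames marked as 0 or NaN. The function names, the semitone scale, and the 2-semitone default threshold are all hypothetical choices made for this example.

```python
import numpy as np

def median_f0(f0_track):
    """Median f0 in Hz over voiced frames (unvoiced frames are 0 or NaN)."""
    f0 = np.asarray(f0_track, dtype=float)
    voiced = f0[np.isfinite(f0) & (f0 > 0)]
    if voiced.size == 0:
        raise ValueError("no voiced frames in f0 track")
    return float(np.median(voiced))

def f0_shift_semitones(source_f0, output_f0):
    """Signed median-f0 difference between output and source, in semitones."""
    return 12.0 * float(np.log2(median_f0(output_f0) / median_f0(source_f0)))

def flag_degraded(source_f0, output_f0, threshold_semitones=2.0):
    """Flag the output when its median f0 drifts too far from the source.

    The threshold is a hypothetical placeholder; in practice it would be
    tuned on human-labeled samples.
    """
    shift = f0_shift_semitones(source_f0, output_f0)
    return bool(abs(shift) > threshold_semitones)
```

The same pattern would extend to other descriptors such as HNR or VTL by swapping the per-signal statistic and choosing a separate threshold for each feature.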
Findings
f0 and HNR achieved over 85% accuracy for WaveRNN
HNR outperformed other features for HiFi-GAN detection
source-output features capture distinct failure patterns
Abstract
Recent advances in generative speech have increased the need for automatic detection of obviously failed synthetic outputs. This is particularly important in clinical settings such as AVATAR therapy, in which schizophrenia patients engage with a computer-generated representation of their hallucinated voices and degraded synthesis may disrupt immersion and therapeutic engagement. We investigate whether low-dimensional, interpretable source-output acoustic features can provide a lightweight first-pass detector of degraded voice-cloning outputs. Motivated by source-filter models of speech, we first test median fundamental frequency (f0) as a source-related consistency measure, and compare it with vocal tract length (VTL) as a filter-related measure and Harmonics-to-Noise Ratio (HNR) as a noise-related descriptor. Human-labeled voice-cloning samples generated with two vocoder families,…
