Pitch Imperfect: Detecting Audio Deepfakes Through Acoustic Prosodic Analysis
Kevin Warren, Daniel Olszewski, Seth Layton, Kevin Butler, Carrie Gates, Patrick Traynor

TL;DR
This paper proposes a prosody-based method for detecting audio deepfakes. It reports accuracy comparable to existing baseline models, with better robustness and explainability because detection rests on linguistic features.
Contribution
It introduces a prosody-focused detection approach using classical linguistic features, improving robustness and interpretability over traditional low-level audio feature models.
Findings
The detector achieves 93% accuracy and a 24.7% equal error rate (EER) in deepfake detection.
Prosody features like jitter, shimmer, and fundamental frequency are key indicators.
Prosody-based models are more robust against adversarial attacks.
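To make the jitter, shimmer, and fundamental frequency features above concrete, here is a minimal sketch using the common "local" definitions: the mean absolute difference between consecutive glottal cycles, divided by the mean. This is an assumption for illustration, since this summary does not describe the paper's own extraction pipeline. The function names and the sample values are hypothetical, and the sketch assumes cycle periods and peak amplitudes have already been measured.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference between consecutive values, divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


def local_jitter(periods_s):
    """Cycle-to-cycle variation in pitch period (periods in seconds)."""
    return local_perturbation(periods_s)


def local_shimmer(peak_amplitudes):
    """Cycle-to-cycle variation in peak amplitude."""
    return local_perturbation(peak_amplitudes)


def mean_f0(periods_s):
    """Mean fundamental frequency in Hz, taken as the mean of 1 / period."""
    return mean(1.0 / p for p in periods_s)


# Hypothetical measurements for four consecutive cycles (not data from the paper).
periods = [0.0100, 0.0102, 0.0098, 0.0101]
amplitudes = [0.80, 0.82, 0.79, 0.81]

print(f"jitter:  {local_jitter(periods):.4f}")
print(f"shimmer: {local_shimmer(amplitudes):.4f}")
print(f"mean F0: {mean_f0(periods):.1f} Hz")
```

A perfectly periodic signal yields zero jitter and shimmer under these definitions, while natural speech shows small nonzero values. That difference is one intuition for why such features might help separate organic from synthetic audio.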
Abstract
Audio deepfakes are increasingly indistinguishable from organic speech, often fooling both authentication systems and human listeners. While many techniques use low-level audio features or black-box model optimization, focusing on the features that humans use to recognize speech will likely be a more robust long-term approach to detection. We explore the use of prosody, or the high-level linguistic features of human speech (e.g., pitch, intonation, jitter), as a more foundational means of detecting audio deepfakes. We develop a detector based on six classical prosodic features and demonstrate that our model performs as well as other baseline models used by the community to detect audio deepfakes, with an accuracy of 93% and an EER of 24.7%. More importantly, we demonstrate the benefits of using a linguistic features-based approach over existing models by applying an adaptive…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing
