Pitch Imperfect: Detecting Audio Deepfakes Through Acoustic Prosodic Analysis

Kevin Warren; Daniel Olszewski; Seth Layton; Kevin Butler; Carrie Gates; Patrick Traynor

arXiv:2502.14726 · cs.SD · February 21, 2025

Open Access

TL;DR

This paper proposes a prosody-based method for detecting audio deepfakes. The detector matches the accuracy of existing baseline models while offering greater robustness and explainability, because it relies on linguistic features rather than low-level audio features.

Contribution

It introduces a prosody-focused detection approach using classical linguistic features, improving robustness and interpretability over traditional low-level audio feature models.

Findings

01. The detector achieves 93% accuracy and an equal error rate (EER) of 24.7% in deepfake detection.

02. Prosodic features such as jitter, shimmer, and fundamental frequency are key indicators.

03. Prosody-based models are more robust against adversarial attacks.
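To make the features named in the findings concrete, here is a minimal sketch of the common "local" definitions of jitter and shimmer: the mean absolute difference between consecutive glottal cycles, divided by the mean value. It assumes the per-cycle periods and peak amplitudes have already been extracted by some pitch tracker. The function names and the input format are hypothetical, and this is not necessarily how the paper computes its features.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference between consecutive values, relative to their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


def jitter_local(periods_s):
    """Cycle-to-cycle variation in pitch period (periods given in seconds)."""
    return local_perturbation(periods_s)


def shimmer_local(peak_amplitudes):
    """Cycle-to-cycle variation in peak amplitude (any linear amplitude unit)."""
    return local_perturbation(peak_amplitudes)


def mean_f0_hz(periods_s):
    """Mean fundamental frequency, taken as the reciprocal of the mean period."""
    return 1.0 / mean(periods_s)


if __name__ == "__main__":
    # Hypothetical measurements for a few consecutive voiced cycles.
    periods = [0.0100, 0.0102, 0.0099, 0.0101]
    amplitudes = [0.80, 0.78, 0.82, 0.79]
    print("jitter:", jitter_local(periods))
    print("shimmer:", shimmer_local(amplitudes))
    print("mean F0 (Hz):", mean_f0_hz(periods))
```

Both measures are ratios, so they are dimensionless and are usually reported as percentages. A perfectly periodic signal with constant amplitude yields zero for both.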

Abstract

Audio deepfakes are increasingly indistinguishable from organic speech, often fooling both authentication systems and human listeners. While many techniques rely on low-level audio features or on optimizing black-box models, focusing on the features that humans use to recognize speech will likely be a more robust long-term approach to detection. We explore the use of prosody, or the high-level linguistic features of human speech (e.g., pitch, intonation, jitter), as a more foundational means of detecting audio deepfakes. We develop a detector based on six classical prosodic features and demonstrate that our model performs as well as other baseline models used by the community to detect audio deepfakes, with an accuracy of 93% and an EER of 24.7%. More importantly, we demonstrate the benefits of using a linguistic-feature-based approach over existing models by applying an adaptive…
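The abstract reports both accuracy and equal error rate (EER). As a reminder of what EER measures, the sketch below sweeps a decision threshold over detector scores and returns the error rate at the point where the false acceptance rate and false rejection rate are closest. The label convention (1 = deepfake, higher score = more likely deepfake) and the function name are assumptions made for illustration. This is not the paper's evaluation code.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumed convention: label 1 = deepfake, label 0 = organic speech,
    and a sample is flagged as a deepfake when score >= threshold.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")

    best_gap, eer = float("inf"), None
    # The extra infinite threshold covers the "flag nothing" operating point.
    for t in sorted(set(scores)) + [float("inf")]:
        far = sum(s >= t for s in neg) / len(neg)  # organic flagged as fake
        frr = sum(s < t for s in pos) / len(pos)   # fakes that slip through
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Hypothetical detector scores for two organic and two deepfake clips.
    print(equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1]))
```

Accuracy is measured at a single threshold and depends on the class balance of the test set, whereas EER describes the operating point where the two error types are equal, so the two numbers need not move together.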

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing