PhonemeDF: A Synthetic Speech Dataset for Audio Deepfake Detection and Naturalness Evaluation
Vamshi Nallaguntla, Aishwarya Fursule, Shruti Kshirsagar, Anderson R. Avila

TL;DR
This paper introduces PhonemeDF, a phoneme-level synthetic speech dataset for evaluating audio deepfake detection and naturalness, using phoneme distribution divergence to assess fidelity and improve detection methods.
Contribution
The work provides a new phoneme-aligned dataset with real and synthetic speech, and demonstrates how phoneme distribution divergence correlates with deepfake detection performance.
Findings
Kullback-Leibler divergence (KLD) between real and synthetic phoneme distributions correlates with detection accuracy.
PhonemeDF enables evaluation of naturalness at the phoneme level.
KLD can identify the most discriminative phonemes for deepfake detection.
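The findings above rest on comparing, phoneme by phoneme, a distribution estimated from real speech with the corresponding distribution from synthetic speech. The sketch below illustrates one way such a comparison could be set up using the Kullback-Leibler divergence. It is not the authors' pipeline: the scalar per-segment feature, the histogram binning, the smoothing constant `eps`, the direction of the divergence (real relative to synthetic), and the helper names `kl_divergence`, `histogram`, and `per_phoneme_kld` are all assumptions made for illustration.

```python
import math

def kl_divergence(p_counts, q_counts, eps=1e-10):
    """D_KL(P || Q) in nats for two histograms over the same bins.

    eps is an assumed smoothing constant that keeps empty bins from
    producing log(0) or division by zero.
    """
    p = [c + eps for c in p_counts]
    q = [c + eps for c in q_counts]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (pi / p_total) * math.log((pi / p_total) / (qi / q_total))
        for pi, qi in zip(p, q)
    )

def histogram(values, lo, hi, bins):
    """Count values into `bins` equal-width bins spanning [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp v == hi into last bin
        counts[idx] += 1
    return counts

def per_phoneme_kld(real, synthetic, bins=20):
    """Rank phonemes by KLD between real and synthetic feature values.

    real, synthetic: dict mapping a phoneme label to a list of scalar
    feature values, one per segment (hypothetical input format).
    Returns a dict ordered from highest to lowest divergence.
    """
    scores = {}
    for ph in sorted(real.keys() & synthetic.keys()):
        values = list(real[ph]) + list(synthetic[ph])
        lo, hi = min(values), max(values)
        if hi == lo:  # degenerate case: all values identical
            hi = lo + 1.0
        p = histogram(real[ph], lo, hi, bins)
        q = histogram(synthetic[ph], lo, hi, bins)
        scores[ph] = kl_divergence(p, q)
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Under these assumptions, phonemes at the top of the returned ranking are those whose synthetic distribution departs most from the real one, which is the sense in which a divergence score could flag candidate discriminative phonemes.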
Abstract
The growing sophistication of speech generated by Artificial Intelligence (AI) has introduced new challenges in audio deepfake detection. Text-to-speech (TTS) and voice conversion (VC) technologies can create highly convincing synthetic speech with high naturalness and intelligibility. This poses serious threats to voice biometric security and to systems designed to combat the spread of spoken misinformation, where synthetic voices may be used to disseminate false or malicious content. While interest in AI-generated speech has increased, resources for evaluating naturalness at the phoneme level remain limited. In this work, we address this gap by presenting the Phoneme-Level DeepFake dataset (PhonemeDF), comprising parallel real and synthetic speech segmented at the phoneme level. Real speech samples are derived from a subset of LibriSpeech, while synthetic samples are generated using four…
Taxonomy
Topics: Speech Recognition and Synthesis · Digital Media Forensic Detection · Voice and Speech Disorders
