Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes
Kuiyuan Zhang, Zhongyun Hua, Rushi Lan, Yushu Zhang, Yifang Guo

TL;DR
This paper introduces a phoneme-level feature discrepancy method for detecting speech deepfakes. It uses adaptive phoneme pooling and graph attention networks to identify temporal inconsistencies across the phoneme sequence, and it reports higher detection accuracy than existing methods.
Contribution
The paper presents a new mechanism for detecting speech deepfakes that models phoneme-level inconsistencies with adaptive pooling and graph attention networks, and it outperforms existing methods.
Findings
Outperforms state-of-the-art deepfake detection methods on four benchmarks.
Effectively detects phoneme-level inconsistencies in synthetic speech.
Enhances detection robustness with a novel augmentation technique.
Abstract
Recent advancements in text-to-speech and speech conversion technologies have enabled the creation of highly convincing synthetic speech. While these innovations offer numerous practical benefits, they also pose significant security challenges when misused. There is therefore an urgent need to detect such synthetic speech signals. Phoneme features provide a powerful speech representation for deepfake detection. However, previous phoneme-based detection approaches typically focused on specific phonemes, overlooking temporal inconsistencies across the entire phoneme sequence. In this paper, we develop a new mechanism for detecting speech deepfakes by identifying the inconsistencies of phoneme-level speech features. We design an adaptive phoneme pooling technique that extracts sample-specific phoneme-level features from frame-level speech data. By applying this technique to…
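The idea of turning frame-level features into phoneme-level features can be illustrated with a minimal sketch. This is not the paper's adaptive phoneme pooling, whose details are not given in the text above. It is a simplified stand-in that assumes each frame already carries a predicted phoneme id and averages every run of consecutive frames sharing the same id. The function name `pool_frames_by_phoneme` and its inputs are hypothetical.

```python
import numpy as np

def pool_frames_by_phoneme(frames, labels):
    """Average consecutive frames that share a phoneme label.

    Illustrative assumption only: a plain run-length mean pooling,
    not the adaptive pooling technique described in the paper.

    frames: (T, D) array of frame-level features.
    labels: length-T sequence of per-frame phoneme ids (hypothetical
            input, e.g. the output of a phoneme recognizer).
    Returns (segments, segment_labels); segments has shape (S, D).
    """
    frames = np.asarray(frames, dtype=float)
    labels = np.asarray(labels)
    if frames.ndim != 2 or frames.shape[0] != labels.shape[0]:
        raise ValueError("frames must be (T, D) and labels must have length T")
    if labels.size == 0:
        return np.empty((0, frames.shape[1])), labels
    # A new segment starts at frame 0 and wherever the label changes.
    starts = np.flatnonzero(np.r_[True, labels[1:] != labels[:-1]])
    # Sum each run of frames, then divide by the run length to get the mean.
    sums = np.add.reduceat(frames, starts, axis=0)
    lengths = np.diff(np.r_[starts, labels.size])
    return sums / lengths[:, None], labels[starts]


# Six frames with 2-dimensional features; phoneme 5 occurs in two separate runs.
frames = [[1, 1], [3, 3], [10, 0], [2, 4], [4, 6], [6, 8]]
labels = [5, 5, 7, 5, 5, 5]
segments, segment_labels = pool_frames_by_phoneme(frames, labels)
print(segments)        # one row per phoneme segment
print(segment_labels)  # phoneme id of each segment
```

Because pooling follows runs rather than label identity, a phoneme that recurs later in the utterance produces a separate segment, which preserves the temporal order that a sequence-level consistency check would need.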
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Softmax · Attention Is All You Need
