Frame-level Temporal Difference Learning for Partial Deepfake Speech Detection
Menglu Li, Xiao-Ping Zhang, Lian Zhao

TL;DR
This paper introduces a novel deepfake speech detection method that analyzes frame-level temporal differences to identify unnatural variations, achieving state-of-the-art results without needing costly frame-level annotations.
Contribution
It proposes a Temporal Difference Attention Module (TDAM) that detects partial deepfakes by modeling temporal irregularities at multiple scales without explicit boundary labels.
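To make "temporal irregularities at multiple scales" concrete, here is a minimal sketch that computes frame-to-frame differences at several time lags. It is an illustration only, not the authors' TDAM: the function name `multiscale_differences`, the choice of lags, and the toy input are all assumptions.

```python
import numpy as np

def multiscale_differences(features, lags=(1, 2, 4)):
    """Return {lag: array of shape (T - lag, D)} holding features[t + lag] - features[t].

    `features` is a (T, D) array of frame-level features. The lags are a
    hypothetical choice of scales, not values taken from the paper.
    """
    features = np.asarray(features, dtype=float)
    return {lag: features[lag:] - features[:-lag] for lag in lags}

# Toy input: 6 frames with 2 feature dimensions each (made-up values).
frames = np.arange(12, dtype=float).reshape(6, 2)
diffs = multiscale_differences(frames)
```

Small lags capture local transitions between neighbouring frames, while larger lags capture slower trends. A detector could then look for difference patterns that are unusual at any of these scales.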
Findings
Achieves an equal error rate (EER) of 0.59% on the PartialSpoof dataset
Achieves an EER of 0.03% on the HAD dataset
Significantly outperforms existing methods
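For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it by sweeping thresholds over the observed scores. It assumes that a higher score means "more likely bonafide"; the function name and the toy scores are hypothetical, and the paper's own evaluation tooling may differ.

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumed convention: higher score = more likely bonafide.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bonafide, spoof]))
    # False rejection rate: bonafide utterances scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed utterances scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest and average them.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Made-up scores: one bonafide and one spoofed utterance fall on the wrong side.
eer_toy = equal_error_rate([0.9, 0.8, 0.3, 0.6], [0.1, 0.2, 0.7, 0.4])
# Made-up scores that are perfectly separable.
eer_separable = equal_error_rate([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.4])
```

A lower EER is better. Values such as those reported above mean that only a small fraction of utterances is misclassified at the balanced operating point.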
Abstract
Detecting partial deepfake speech is essential due to its potential for subtle misinformation. However, existing methods depend on costly frame-level annotations during training, limiting real-world scalability. Also, they focus on detecting transition artifacts between bonafide and deepfake segments. As deepfake generation techniques increasingly smooth these transitions, detection has become more challenging. To address this, our work introduces a new perspective by analyzing frame-level temporal differences and reveals that deepfake speech exhibits erratic directional changes and unnatural local transitions compared to bonafide speech. Based on this finding, we propose a Temporal Difference Attention Module (TDAM) that redefines partial deepfake detection as identifying unnatural temporal variations, without relying on explicit boundary annotations. A dual-level hierarchical…
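The "erratic directional changes" mentioned in the abstract can be illustrated with a simple measurement: the cosine similarity between consecutive frame-difference vectors. This is a sketch under stated assumptions, not the paper's formulation; the function name `direction_consistency`, the `eps` stabiliser, and the toy trajectories are all hypothetical.

```python
import numpy as np

def direction_consistency(features, eps=1e-8):
    """Cosine similarity between consecutive frame-difference vectors.

    `features` is a (T, D) array; the result has shape (T - 2,).
    Values near +1 mean the trajectory keeps its direction; values near -1
    mean it reverses from one step to the next.
    """
    features = np.asarray(features, dtype=float)
    diffs = np.diff(features, axis=0)               # (T - 1, D) first-order differences
    norms = np.linalg.norm(diffs, axis=1)           # (T - 1,) step magnitudes
    dots = np.sum(diffs[1:] * diffs[:-1], axis=1)   # (T - 2,) dot products of adjacent steps
    return dots / (norms[1:] * norms[:-1] + eps)    # eps avoids division by zero

# Made-up trajectories: one moves steadily, one flips direction every frame.
smooth = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
zigzag = np.array([[0, 0], [1, 1], [0, 0], [1, 1]])
smooth_consistency = direction_consistency(smooth)
zigzag_consistency = direction_consistency(zigzag)
```

In this toy setting the steady trajectory gives similarities close to +1 and the zigzag gives values close to -1. Real speech features would fall between these extremes, and the abstract's claim is that deepfake segments behave more erratically than bonafide ones.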
