Frame-level Temporal Difference Learning for Partial Deepfake Speech Detection

Menglu Li; Xiao-Ping Zhang; Lian Zhao

arXiv:2507.15101·cs.SD·July 28, 2025

TL;DR

This paper introduces a partial deepfake speech detection method that analyzes frame-level temporal differences to identify unnatural variations, achieving state-of-the-art results without costly frame-level annotations.

Contribution

The paper proposes a Temporal Difference Attention Module (TDAM) that detects partial deepfakes by modeling temporal irregularities at multiple scales, without explicit boundary labels.

Findings

01. Achieves an equal error rate (EER) of 0.59% on the PartialSpoof dataset

02. Achieves an EER of 0.03% on the HAD dataset

03. Significantly outperforms existing methods
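
For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoofed audio accepted as bonafide) equals the false rejection rate (bonafide audio rejected). The following is a minimal sketch of how it can be approximated from utterance-level scores. It is not the paper's evaluation code: the function name `equal_error_rate`, the convention that a higher score means "more likely bonafide", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumed convention: a higher score means "more likely bonafide".
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: bonafide utterances scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed utterances scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (invented for illustration, not taken from the paper).
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(bonafide, spoof))
```

With only a handful of scores, the sweep yields a coarse estimate; evaluation toolkits typically interpolate between thresholds on much larger score sets.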

Abstract

Detecting partial deepfake speech is essential due to its potential for subtle misinformation. However, existing methods depend on costly frame-level annotations during training, limiting real-world scalability. Also, they focus on detecting transition artifacts between bonafide and deepfake segments. As deepfake generation techniques increasingly smooth these transitions, detection has become more challenging. To address this, our work introduces a new perspective by analyzing frame-level temporal differences and reveals that deepfake speech exhibits erratic directional changes and unnatural local transitions compared to bonafide speech. Based on this finding, we propose a Temporal Difference Attention Module (TDAM) that redefines partial deepfake detection as identifying unnatural temporal variations, without relying on explicit boundary annotations. A dual-level hierarchical…
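
The idea of frame-level temporal differences and "directional changes" can be illustrated with a small sketch. This is not the authors' TDAM, whose details are not given in this summary. It assumes a sequence of per-frame feature vectors, takes first-order differences between consecutive frames, and uses cosine similarity between consecutive difference vectors as one possible (assumed) measure of directional consistency. All function names are hypothetical.

```python
import math


def temporal_differences(frames):
    """First-order differences d_t = x_{t+1} - x_t between consecutive frames."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(frames, frames[1:])]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def direction_consistency(frames):
    """Cosine similarity between each pair of consecutive difference vectors.

    Values near 1 mean the features keep moving in the same direction;
    values near -1 mean the direction flips from one step to the next.
    """
    diffs = temporal_differences(frames)
    return [cosine(d0, d1) for d0, d1 in zip(diffs, diffs[1:])]


# Toy 2-dimensional feature sequences (invented for illustration).
smooth = [[0, 0], [1, 1], [2, 2], [3, 3]]    # steady drift in one direction
erratic = [[0, 0], [1, 1], [0, 0], [1, 1]]   # direction flips every frame
print(direction_consistency(smooth))
print(direction_consistency(erratic))
```

In this toy setup, the smooth sequence gives similarities close to 1 and the erratic one gives values close to -1, which is the kind of contrast the abstract attributes to bonafide versus deepfake speech.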
