Robust Localization of Partially Fake Speech: Metrics and Out-of-Domain Evaluation
Hieu-Thi Luong, Inbal Rimon, Haim Permuter, Kong Aik Lee, Eng Siong Chng

TL;DR
This paper critically examines how partial audio deepfake localization is evaluated, highlighting the limitations of current metrics such as Equal Error Rate (EER), and proposes a more comprehensive evaluation framework that emphasizes out-of-domain performance and real-world applicability.
Contribution
It reframes the localization task as sequential anomaly detection and advocates for threshold-dependent metrics, revealing the poor generalization of current models to out-of-domain data.
Findings
Current models perform well in-domain but poorly out-of-domain.
Over-optimizing for in-domain EER can harm real-world performance.
Adding partially fake data to training improves detection accuracy.
Abstract
Partial audio deepfake localization poses unique challenges and remains underexplored compared to full-utterance spoofing detection. While recent methods report strong in-domain performance, their real-world utility remains unclear. In this analysis, we critically examine the limitations of current evaluation practices, particularly the widespread use of Equal Error Rate (EER), which often obscures generalization and deployment readiness. We propose reframing the localization task as a sequential anomaly detection problem and advocate for the use of threshold-dependent metrics such as accuracy, precision, recall, and F1-score, which better reflect real-world behavior. Specifically, we analyze the performance of the open-source Coarse-to-Fine Proposal Refinement Framework (CFPRF), which achieves a 20-ms EER of 7.61% on the in-domain PartialSpoof evaluation set, but 43.25% and 27.59% on…
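The contrast the abstract draws between EER and threshold-dependent metrics can be illustrated with a minimal sketch. It assumes frame-level scores where a higher value means "more likely fake" and binary labels where 1 marks a fake frame. The function names, the toy data, and the simple threshold sweep used to approximate EER are illustrative assumptions, not the paper's or CFPRF's evaluation code.

```python
from typing import Dict, Sequence, Tuple


def threshold_metrics(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 at one fixed decision threshold.

    Assumption: label 1 = fake frame (positive class), and a frame is
    predicted fake when its score is >= threshold.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_fake = score >= threshold
        if predicted_fake and label == 1:
            tp += 1
        elif predicted_fake and label == 0:
            fp += 1
        elif not predicted_fake and label == 1:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def equal_error_rate(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[float, float]:
    """Approximate EER by sweeping every observed score as a threshold.

    Returns (eer, threshold) at the sweep point where the false-alarm rate
    (bona fide frames flagged as fake) and the miss rate (fake frames not
    flagged) are closest. The threshold is chosen on the data being
    evaluated, which is why EER says little about a fixed deployed threshold.
    """
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("EER needs at least one fake and one bona fide frame")

    best_gap, best_eer, best_threshold = None, 0.0, 0.0
    for threshold in sorted(set(scores)) + [float("inf")]:
        false_alarms = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        misses = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s < threshold
        )
        far = false_alarms / n_neg
        frr = misses / n_pos
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_threshold = gap, (far + frr) / 2, threshold
    return best_eer, best_threshold


if __name__ == "__main__":
    # Toy frame-level scores and labels (hypothetical, not from the paper).
    toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
    toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]

    eer, eer_threshold = equal_error_rate(toy_scores, toy_labels)
    print("EER:", eer, "at threshold", eer_threshold)

    # The same scores judged at thresholds fixed in advance.
    for fixed_threshold in (0.5, 0.85):
        print(fixed_threshold, threshold_metrics(toy_scores, toy_labels, fixed_threshold))
```

EER is read off at whichever threshold balances the two error rates on the evaluation data itself, whereas the threshold-dependent metrics describe what happens at a threshold fixed beforehand, which is closer to how a deployed detector behaves.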
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Topic Modeling
