HQ-MPSD: A Multilingual Artifact-Controlled Benchmark for Partial Deepfake Speech Detection
Menglu Li, Majd Alber, Ramtin Asgarianamiri, Lian Zhao, Xiao-Ping Zhang

TL;DR
This paper introduces HQ-MPSD, a high-quality, multilingual dataset for partial deepfake speech detection, and shows that current models generalize poorly to realistic, artifact-controlled manipulations.
Contribution
The creation of HQ-MPSD, a large-scale, linguistically coherent, and naturalistic partial deepfake speech dataset that addresses limitations of previous datasets and provides a challenging benchmark.
Findings
State-of-the-art models degrade sharply on HQ-MPSD, with a performance drop of over 80%.
The dataset reveals significant generalization challenges in current detection methods.
HQ-MPSD's diversity and realism make it a more effective benchmark for future research.
Abstract
Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated synthesis systems and generation procedures that introduce dataset-specific artifacts rather than realistic manipulation cues. To address this gap, we introduce HQ-MPSD, a high-quality multilingual partial deepfake speech dataset. HQ-MPSD is constructed using linguistically coherent splice points derived from fine-grained forced alignment, preserving prosodic and semantic continuity and minimizing audible and visual boundary artifacts. The dataset contains 350.8 hours of speech across eight languages and 550 speakers, with background effects added to better reflect real-world acoustic conditions. MOS evaluations…
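The splice-point idea in the abstract can be illustrated with a minimal sketch. It assumes a word-level forced alignment is already available as a list of (word, start, end) entries in seconds, and that a synthetic replacement segment has been generated separately. The names `WordSpan` and `splice_at_word_boundaries`, the default sample rate, and the per-sample labeling are hypothetical choices for illustration; this is not the authors' pipeline, which this excerpt does not describe in detail (aligner, synthesis system, boundary smoothing, and background effects are all omitted here).

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class WordSpan:
    """One word from a forced alignment, times in seconds (hypothetical structure)."""
    word: str
    start: float
    end: float


def splice_at_word_boundaries(
    real: Sequence[float],
    fake_segment: Sequence[float],
    alignment: List[WordSpan],
    first_word: int,
    last_word: int,
    sample_rate: int = 16000,  # assumed value, not taken from the paper
) -> Tuple[List[float], List[int]]:
    """Replace words first_word..last_word (inclusive) of `real` with `fake_segment`.

    Returns the spliced waveform and a per-sample label list (0 = real, 1 = fake).
    """
    if not 0 <= first_word <= last_word < len(alignment):
        raise ValueError("word indices out of range")

    # Cut on aligned word boundaries rather than at arbitrary sample offsets,
    # clamping to the waveform length so audio and labels stay the same size.
    cut_in = min(round(alignment[first_word].start * sample_rate), len(real))
    cut_out = min(round(alignment[last_word].end * sample_rate), len(real))

    spliced = list(real[:cut_in]) + list(fake_segment) + list(real[cut_out:])
    labels = (
        [0] * cut_in
        + [1] * len(fake_segment)
        + [0] * (len(real) - cut_out)
    )
    return spliced, labels
```

Because the replacement segment need not match the duration of the words it replaces, the labels are built from the actual segment lengths instead of the original timestamps. A realistic pipeline would also need to match speaker, prosody, and loudness across the boundary, which this sketch does not attempt.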
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis
