Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization
Tung Vu, Yen Nguyen, Hai Nguyen, Cuong Pham, Cong Tran

TL;DR
This paper introduces a new dataset, a detection framework, and a metric for localizing multiple manipulated speech segments, addressing a gap in deepfake detection for partial inpainting.
Contribution
The authors present MIST, a multilingual dataset with multi-region inpainting; ISA, a novel detection framework; and SF1@tau, a new evaluation metric for localization accuracy.
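The visible text does not define SF1@tau. As a rough illustration of what a thresholded localization metric for multiple segments can look like, the sketch below computes a segment-level F1 in which a predicted segment counts as correct only if its temporal IoU with an unmatched ground-truth segment is at least tau. This is an assumption for illustration, not the paper's definition: the names `temporal_iou` and `segment_f1`, the greedy one-to-one matching, and the default `tau=0.5` are all hypothetical.

```python
from typing import List, Tuple

# A segment is (start_sec, end_sec). Hypothetical representation.
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred: List[Segment], truth: List[Segment], tau: float = 0.5) -> float:
    """Segment-level F1 with greedy one-to-one matching at IoU >= tau.

    Assumed formulation for illustration; the paper's SF1@tau may differ.
    """
    # Nothing to find and nothing predicted: treat as a perfect score.
    if not pred and not truth:
        return 1.0

    # Score every (prediction, ground truth) pair, best IoU first.
    pairs = sorted(
        ((temporal_iou(p, t), i, j)
         for i, p in enumerate(pred)
         for j, t in enumerate(truth)),
        reverse=True,
    )

    used_pred, used_truth = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < tau:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_truth:
            continue  # each segment may be matched at most once
        used_pred.add(i)
        used_truth.add(j)
        true_pos += 1

    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    truth = [(1.0, 1.5), (3.0, 3.4)]   # two inpainted words
    pred = [(1.05, 1.5), (5.0, 5.3)]   # one good hit, one false alarm
    print(segment_f1(pred, truth, tau=0.5))
```

Greedy matching keeps the sketch short; an optimal assignment (for example, the Hungarian algorithm) could match more pairs when segments overlap heavily. Because the number of tampered regions is unknown in advance, both missed regions and spurious predictions lower the score.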
Findings
Existing detectors fail on word-level partial inpainting.
ISA outperforms non-iterative baselines in localization tasks.
The dataset and tools are publicly available for research.
Abstract
Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation - where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity - an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative…
