Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

Tung Vu; Yen Nguyen; Hai Nguyen; Cuong Pham; Cong Tran

arXiv:2605.02223 · cs.SD · May 20, 2026

TL;DR

This paper introduces a new dataset, a detection framework, and a metric for localizing multiple manipulated speech segments, addressing a gap in deepfake detection for partial inpainting.

Contribution

The authors present MIST, a multilingual dataset with multi-region inpainting; ISA, a novel detection framework; and SF1@tau, a new evaluation metric for localization accuracy.

Findings

1. Existing detectors fail on word-level partial inpainting.
2. ISA outperforms non-iterative baselines on localization tasks.
3. The dataset and tools are publicly available for research.

Abstract

Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation, in which an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity, an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative…

Code & Models

Datasets

tung2308/MIST_SpeechInpaintingDataset
dataset · 62 downloads
