Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits

Sung-Feng Huang; Heng-Cheng Kuo; Zhehuai Chen; Xuesong Yang; Chao-Han Huck Yang; Yu Tsao; Yu-Chiang Frank Wang; Hung-yi Lee; Szu-Wei Fu

arXiv:2501.03805·cs.SD·January 8, 2025

TL;DR

This paper evaluates the effectiveness of current spoof detection methods against advanced seamless speech edits created with Voicebox, introducing a new challenging dataset and demonstrating the potential of self-supervised detectors.

Contribution

It introduces the SINE dataset for seamless speech edits, re-implements Voicebox training, and shows that self-supervised detectors can effectively identify sophisticated speech manipulations.

Findings

01. Speech edited with Voicebox is harder to detect than speech edited with conventional cut-and-paste methods.

02. Self-supervised detectors perform well in detection, localization, and generalization.

03. The SINE dataset and models will be publicly released.

Abstract

Advances in neural speech editing have raised concerns about their misuse in spoofing attacks. Traditional partially edited speech corpora focus primarily on cut-and-paste edits, which maintain speaker consistency but often introduce detectable discontinuities. Recent methods, such as A³T and Voicebox, improve transitions by leveraging contextual information. To foster spoofing detection research, we introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. We detail the process of re-implementing Voicebox training and creating the dataset. Subjective evaluations confirm that speech edited using this novel technique is more challenging to detect than speech edited with conventional cut-and-paste methods. Despite human difficulty, experimental results demonstrate that self-supervised-based detectors can achieve remarkable performance in detection, localization, and…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Speech Recognition and Synthesis