Speech-Forensics: Towards Comprehensive Synthetic Speech Dataset Establishment and Analysis
Zhoulin Ji, Chenhao Lin, Hang Wang, Chao Shen

TL;DR
This paper introduces a comprehensive synthetic speech dataset and a novel TEmporal Speech LocalizaTion network, TEST, that jointly detects, localizes, and recognizes synthetic speech segments with high accuracy, advancing research and applications in speech forensics.
Contribution
The paper presents a new, extensive dataset covering authentic, fully synthetic, and partially forged speech, and proposes TEST, a model that combines LSTM and Transformer layers to detect, localize, and recognize synthetic speech simultaneously.
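To make the localization task concrete, the sketch below turns per-frame class predictions into timed segments by grouping consecutive frames that share a label. It is only an illustration of what dense, frame-level output can look like once decoded, not the TEST architecture or the authors' code. The function name `frames_to_segments`, the 0.02-second frame hop, and the label ids (0 for real speech, positive integers for synthesis algorithms) are all assumptions made for this example.

```python
def frames_to_segments(frame_labels, frame_sec=0.02, real_label=0):
    """Group consecutive non-real frames into (start_sec, end_sec, label) tuples.

    frame_labels: one predicted class id per frame (hypothetical encoding:
                  real_label for authentic speech, any other id for a
                  synthesis algorithm).
    frame_sec:    assumed duration of one frame in seconds.
    """
    segments = []
    start = None
    current = real_label
    # A trailing real-label sentinel closes a fake segment that runs to the end.
    for i, label in enumerate(list(frame_labels) + [real_label]):
        if label != current:
            if current != real_label:
                segments.append((start * frame_sec, i * frame_sec, current))
            start, current = i, label
    return segments


if __name__ == "__main__":
    # Two fake segments from two different (hypothetical) algorithms.
    preds = [0, 0, 1, 1, 1, 0, 2, 2]
    print(frames_to_segments(preds, frame_sec=1.0))
```

Under this encoding, an utterance counts as fully authentic when the returned list is empty, so detection, localization, and algorithm recognition can all be read off the same frame-level output.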
Findings
Achieved an average mAP of 83.55% and EER of 5.25% at the utterance level.
Attained an EER of 1.07% and an F1 score of 92.19% at the segment level.
Demonstrated robust performance for comprehensive synthetic speech analysis.
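For readers unfamiliar with the equal error rate (EER) reported above, here is a minimal sketch of how it is commonly approximated: sweep a decision threshold and find the point where the false acceptance rate and the false rejection rate are closest. The function name `equal_error_rate` and the convention that a higher score means "more likely synthetic" (label 1) are assumptions for this example. The paper's own evaluation code may interpolate differently.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by a threshold sweep.

    scores: detector outputs, higher = more likely synthetic (assumed).
    labels: 1 for synthetic, 0 for real (assumed encoding).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")

    best_gap, eer = float("inf"), None
    # Candidate thresholds: every observed score, plus one above the maximum.
    for t in sorted(set(scores)) + [max(scores) + 1.0]:
        far = sum(s >= t for s in neg) / len(neg)  # real flagged as synthetic
        frr = sum(s < t for s in pos) / len(pos)   # synthetic that was missed
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    print(equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1]))
```

The same function applies at either granularity: pass one score per utterance for the utterance-level figure, or one score per frame or segment for the segment-level figure.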
Abstract
Detecting synthetic from real speech is increasingly crucial due to the risks of misinformation and identity impersonation. While various datasets for synthetic speech analysis have been developed, they often focus on specific areas, limiting their utility for comprehensive research. To fill this gap, we propose the Speech-Forensics dataset by extensively covering authentic, synthetic, and partially forged speech samples that include multiple segments synthesized by different high-quality algorithms. Moreover, we propose a TEmporal Speech LocalizaTion network, called TEST, aiming at simultaneously performing authenticity detection, multiple fake segments localization, and synthesis algorithms recognition, without any complex post-processing. TEST effectively integrates LSTM and Transformer to extract more powerful temporal speech representations and utilizes dense prediction on…
Taxonomy
Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing
