Speech-Forensics: Towards Comprehensive Synthetic Speech Dataset Establishment and Analysis

Zhoulin Ji; Chenhao Lin; Hang Wang; Chao Shen

arXiv:2412.09032 · cs.SD · July 18, 2025

TL;DR

This paper introduces a comprehensive synthetic speech dataset and a novel TEmporal Speech LocalizaTion network, TEST, that jointly detects, localizes, and recognizes synthetic speech segments with high accuracy, advancing research and applications in speech forensics.

Contribution

The paper presents a new extensive dataset covering authentic and synthetic speech, and proposes TEST, a model combining LSTM and Transformer for simultaneous detection, localization, and recognition of synthetic speech.
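To make the idea of joint localization and recognition concrete, the following is a minimal sketch of how per-frame class predictions could be turned into labeled time segments. It is an illustration only, not the paper's method: the function name `frames_to_segments`, the 20 ms frame duration, and the convention that label 0 means authentic speech while any other integer identifies a synthesis algorithm are all assumptions made for this example.

```python
def frames_to_segments(frame_labels, frame_dur=0.02, real_label=0):
    """Group consecutive identical non-real frame labels into segments.

    frame_labels: per-frame class ids (hypothetical convention:
                  real_label = authentic, other ids = synthesis algorithms).
    frame_dur:    assumed frame duration in seconds (not from the paper).
    Returns a list of (start_sec, end_sec, label) tuples.
    """
    segments = []
    start = None
    current = real_label
    # A trailing real_label sentinel flushes a segment that runs to the end.
    for i, lab in enumerate(list(frame_labels) + [real_label]):
        if lab != current:
            if current != real_label:
                segments.append((start * frame_dur, i * frame_dur, current))
            start, current = i, lab
    return segments


# Usage: frames 2-4 come from algorithm 1, frames 6-7 from algorithm 2.
print(frames_to_segments([0, 0, 1, 1, 1, 0, 2, 2], frame_dur=1.0))
```

An utterance with no returned segments would be treated as fully authentic under this convention, so the same frame-level output covers detection, localization, and recognition in one pass.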

Findings

01. Achieved an average mAP of 83.55% and EER of 5.25% at the utterance level.

02. Attained an EER of 1.07% and 92.19% F1 score at the segment level.

03. Demonstrated robust performance for comprehensive synthetic speech analysis.
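For readers unfamiliar with the equal error rate (EER) reported above, here is a minimal sketch of how it is commonly computed from detection scores. It assumes that higher scores mean "more likely synthetic" and that labels are 1 for synthetic and 0 for authentic; the function name `compute_eer` and the simple threshold sweep are illustrative choices, not the paper's evaluation code.

```python
def compute_eer(scores, labels):
    """Approximate the equal error rate by sweeping decision thresholds.

    scores: detector outputs, higher = more likely synthetic (assumption).
    labels: 1 for synthetic, 0 for authentic (assumption).
    Returns the mean of the false positive and false negative rates at the
    threshold where the two are closest.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")

    best_gap, eer = None, None
    # Each observed score is a candidate threshold; +inf rejects everything.
    for t in sorted(set(scores)) + [float("inf")]:
        fpr = sum(s >= t for s in neg) / len(neg)  # authentic flagged as synthetic
        fnr = sum(s < t for s in pos) / len(pos)   # synthetic missed
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


# Usage: perfectly separated scores give an EER of 0.
print(compute_eer([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))
```

A lower EER means the detector can find an operating point with few errors of both kinds. The same routine applies at the segment level if each frame or segment is treated as one scored sample.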

Abstract

Distinguishing synthetic from real speech is increasingly crucial due to the risks of misinformation and identity impersonation. While various datasets for synthetic speech analysis have been developed, they often focus on specific areas, limiting their utility for comprehensive research. To fill this gap, we propose the Speech-Forensics dataset, which extensively covers authentic, synthetic, and partially forged speech samples that include multiple segments synthesized by different high-quality algorithms. Moreover, we propose a TEmporal Speech LocalizaTion network, called TEST, which simultaneously performs authenticity detection, localization of multiple fake segments, and recognition of synthesis algorithms, without any complex post-processing. TEST effectively integrates LSTM and Transformer to extract more powerful temporal speech representations and utilizes dense prediction on…

Taxonomy

Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing