Compression Robust Synthetic Speech Detection Using Patched Spectrogram Transformer
Amit Kumar Singh Yadav, Ziyue Xiang, Kratika Bhagtani, Paolo Bestagini, Stefano Tubaro, Edward J. Delp

TL;DR
This paper introduces PS3DT, a transformer-based synthetic speech detector that processes mel-spectrogram patches. It outperforms existing spectrogram-based methods on ASVspoof2019, generalizes to the out-of-distribution In-the-Wild dataset, and remains robust to compressed and telephone-quality speech.
Contribution
The paper proposes PS3DT, a novel patch-based transformer model for synthetic speech detection that enhances generalization and robustness over prior spectrogram-based approaches.
Findings
PS3DT outperforms existing spectrogram-based methods on ASVspoof2019.
PS3DT generalizes well to out-of-distribution datasets like In-the-Wild.
PS3DT is robust against speech compression and telephone-quality synthetic speech.
Abstract
Many deep learning synthetic speech generation tools are readily available. The use of synthetic speech has enabled financial fraud, impersonation, and the spread of misinformation. For this reason, forensic methods that can detect synthetic speech have been proposed. Existing methods often overfit on one dataset, and their performance drops substantially in practical scenarios such as detecting synthetic speech shared on social platforms. In this paper, we propose the Patched Spectrogram Synthetic Speech Detection Transformer (PS3DT), a synthetic speech detector that converts a time-domain speech signal to a mel-spectrogram and processes it in patches using a transformer neural network. We evaluate the detection performance of PS3DT on the ASVspoof2019 dataset. Our experiments show that PS3DT performs well on the ASVspoof2019 dataset compared to other approaches using spectrogram for…
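The patch-based processing mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `patchify`, the 16×16 patch size, the 128×100 input shape, and the random stand-in for a mel-spectrogram are all assumptions chosen for illustration. The sketch only shows how a 2-D spectrogram might be cut into non-overlapping patches and flattened into one vector per patch, which is the kind of token sequence a transformer consumes.

```python
import numpy as np

def patchify(spec, patch_h=16, patch_w=16):
    """Split a (n_mels, n_frames) array into flattened non-overlapping patches.

    Patch size is an illustrative assumption, not a value from the paper.
    """
    n_mels, n_frames = spec.shape
    # Drop trailing rows/columns that do not fill a whole patch.
    rows = n_mels // patch_h
    cols = n_frames // patch_w
    spec = spec[: rows * patch_h, : cols * patch_w]
    # (rows, patch_h, cols, patch_w) -> (rows, cols, patch_h, patch_w)
    patches = spec.reshape(rows, patch_h, cols, patch_w).transpose(0, 2, 1, 3)
    # One flattened vector ("token") per patch, in row-major patch order.
    return patches.reshape(rows * cols, patch_h * patch_w)

# Stand-in for a mel-spectrogram: 128 mel bins x 100 frames of random values.
rng = np.random.default_rng(0)
mel = rng.standard_normal((128, 100))
tokens = patchify(mel)
print(tokens.shape)  # (48, 256): 8 x 6 patches, each 16 * 16 values
```

In a full detector, each flattened patch would typically be passed through a learned linear projection and combined with a position encoding before entering the transformer; those steps are omitted here.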
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Adam · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection · Absolute Position Encodings
