Synthesized Speech Detection Using Convolutional Transformer-Based Spectrogram Analysis
Emily R. Bartusiak, Edward J. Delp

TL;DR
This paper detects synthesized speech by analyzing spectrograms with a Compact Convolutional Transformer (CCT), addressing security concerns related to voice forgery and virtual assistant misuse.
Contribution
The paper introduces a Compact Convolutional Transformer architecture for synthesized speech detection, leveraging convolutional inductive biases and attention mechanisms for improved accuracy with limited data.
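To make the idea of a convolutional inductive bias concrete, the following is a minimal sketch of convolutional tokenization, not the authors' implementation. It assumes a toy 6x6 spectrogram, two hand-written 3x3 kernels, and 2x2 max pooling; the helper names (`conv2d_relu`, `max_pool`, `tokenize`) and all sizes are hypothetical. The point it illustrates is weight sharing: the same small kernel is applied at every time-frequency location, so far fewer parameters must be learned than with a dense projection of the whole image.

```python
def conv2d_relu(image, kernel):
    """Slide one shared kernel over a 2D array (valid padding, stride 1), then apply ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(max(acc, 0.0))  # ReLU keeps only positive responses
        out.append(row)
    return out


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


def tokenize(spectrogram, kernels):
    """Return one token per pooled location, with one feature per kernel."""
    maps = [max_pool(conv2d_relu(spectrogram, k)) for k in kernels]
    height, width = len(maps[0]), len(maps[0][0])
    return [[m[i][j] for m in maps] for i in range(height) for j in range(width)]


# Toy spectrogram (assumption): rows are frequency bins, columns are time frames.
spectrogram = [[float((f + t) % 3) for t in range(6)] for f in range(6)]

# Hand-written kernels (assumption); in a trained network these weights are learned.
kernels = [
    [[1.0, 0.0, -1.0], [1.0, 0.0, -1.0], [1.0, 0.0, -1.0]],   # change along time
    [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [-1.0, -1.0, -1.0]],   # change along frequency
]

tokens = tokenize(spectrogram, kernels)
print(len(tokens), "tokens of dimension", len(tokens[0]))
```

Each resulting token summarizes a local time-frequency patch, and the token sequence is what a transformer encoder would then consume.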
Findings
Successfully differentiates genuine and synthesized speech signals
Outperforms traditional methods in detection accuracy
Effective with smaller training datasets
Abstract
Synthesized speech is common today due to the prevalence of virtual assistants, easy-to-use tools for generating and modifying speech signals, and remote work practices. Synthesized speech can also be used for nefarious purposes, including creating a purported speech signal and attributing it to someone who did not speak the content of the signal. We need methods to detect if a speech signal is synthesized. In this paper, we analyze speech signals in the form of spectrograms with a Compact Convolutional Transformer (CCT) for synthesized speech detection. A CCT utilizes a convolutional layer that introduces inductive biases and shared weights into a network, allowing a transformer architecture to perform well with fewer data samples used for training. The CCT uses an attention mechanism to incorporate information from all parts of a signal under analysis. Trained on both genuine human…
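The claim that attention incorporates information from all parts of a signal can be illustrated with a small sketch of single-head scaled dot-product self-attention. This is a simplification under stated assumptions, not the paper's model: the query, key, and value projections are taken to be the identity (a real transformer learns them), and the three 2-dimensional tokens are made-up values standing in for spectrogram patches.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections (assumption)."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Compare this token with every token in the sequence, including itself.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # The output is a weighted average over all tokens.
        outputs.append([sum(row[n] * tokens[n][c] for n in range(len(tokens)))
                        for c in range(dim)])
    return outputs, weights


# Made-up tokens (assumption), e.g. features of three spectrogram patches.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Every row of `weights` is strictly positive and sums to 1, so each output token draws on every position in the sequence rather than only on its local neighborhood.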
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Multi-Head Attention · Residual Connection · Softmax · Label Smoothing · Adam · Position-Wise Feed-Forward Layer · Layer Normalization
