Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features
Yuxiang Zhang, Zhuo Li, Jingze Lu, Wenchao Wang, Pengyuan Zhang

TL;DR
This paper introduces a synthetic speech detection method that leverages temporal consistency and distribution analysis of speaker features, addressing robustness and interpretability issues in existing approaches.
Contribution
The paper analyzes the flaws in speaker features that are inherent in the text-to-speech (TTS) process and proposes a novel synthetic speech detection (SSD) method based on the temporal consistency and distribution of speaker features.
Findings
Effective in cross-dataset scenarios
Low computational complexity
Performs well with silence trimming
Abstract
Current synthetic speech detection (SSD) methods perform well on certain datasets but still face issues of robustness and interpretability. A possible reason is that these methods do not analyze the deficiencies of synthetic speech. In this paper, the flaws of the speaker features inherent in the text-to-speech (TTS) process are analyzed. Differences in the temporal consistency of intra-utterance speaker features arise due to the lack of fine-grained control over speaker features in TTS. Since the speaker representations in TTS are based on speaker embeddings extracted by encoders, the distribution of inter-utterance speaker features differs between synthetic and bonafide speech. Based on these analyses, an SSD method based on temporal consistency and distribution of speaker features is proposed. On one hand, modeling the temporal consistency of intra-utterance speaker features can aid…
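To make the notion of intra-utterance temporal consistency concrete, here is a minimal sketch. It is not the paper's model, which the abstract does not describe in detail. It assumes an utterance has already been split into segments and that some speaker encoder has produced one embedding per segment. The function names, the adjacent-segment cosine score, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def temporal_consistency(segment_embeddings):
    """Mean cosine similarity between consecutive segment embeddings.

    Hypothetical score for illustration only; an assumption, not the
    method proposed in the paper.
    """
    pairs = list(zip(segment_embeddings, segment_embeddings[1:]))
    if not pairs:
        raise ValueError("need at least two segment embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy, made-up embeddings: one utterance whose speaker features stay fixed
# across segments, and one whose features change from segment to segment.
steady = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
drifting = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

print(temporal_consistency(steady))
print(temporal_consistency(drifting))
```

A score like this only summarizes how much the speaker features move within one utterance. The abstract states that synthetic and bonafide speech differ in this respect but does not say in which direction, so the sketch makes no claim about which value indicates synthetic speech.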
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Convolution · Non Maximum Suppression · 1x1 Convolution · SSD
