Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features

Yuxiang Zhang; Zhuo Li; Jingze Lu; Wenchao Wang; Pengyuan Zhang

arXiv:2309.16954·eess.AS·October 2, 2023

TL;DR

This paper introduces a synthetic speech detection method that leverages temporal consistency and distribution analysis of speaker features, addressing robustness and interpretability issues in existing approaches.

Contribution

It analyzes the inherent flaws in speaker features from TTS and proposes a novel SSD method based on temporal and distributional speaker feature analysis.

Findings

01 Effective in cross-dataset scenarios

02 Low computational complexity

03 Performs well with silence trimming

Abstract

Current synthetic speech detection (SSD) methods perform well on certain datasets but still face issues of robustness and interpretability. A possible reason is that these methods do not analyze the deficiencies of synthetic speech. In this paper, the flaws of the speaker features inherent in the text-to-speech (TTS) process are analyzed. Differences in the temporal consistency of intra-utterance speaker features arise from the lack of fine-grained control over speaker features in TTS. Since the speaker representations in TTS are based on speaker embeddings extracted by encoders, the distribution of inter-utterance speaker features differs between synthetic and bonafide speech. Based on these analyses, an SSD method based on temporal consistency and distribution of speaker features is proposed. On one hand, modeling the temporal consistency of intra-utterance speaker features can aid…
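
To make the intra-utterance idea concrete, here is a minimal sketch of one possible temporal-consistency statistic: the mean cosine similarity between consecutive segment-level speaker embeddings of a single utterance. It is not the paper's model, which this excerpt does not describe. The function name `temporal_consistency` is hypothetical, and the embeddings are assumed to come from some speaker encoder applied to short segments; the excerpt does not say which direction of the score indicates synthetic speech, so the sketch makes no claim about that either.

```python
import numpy as np


def temporal_consistency(segment_embeddings: np.ndarray) -> float:
    """Mean cosine similarity between consecutive segment embeddings.

    segment_embeddings: array of shape (T, D), one speaker embedding per
    short segment of the same utterance (assumed to be precomputed).
    """
    emb = np.asarray(segment_embeddings, dtype=float)
    if emb.ndim != 2 or emb.shape[0] < 2:
        raise ValueError("expected shape (T, D) with T >= 2")

    # Normalise each embedding to unit length; guard against zero vectors.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)

    # Cosine similarity of each segment with the one that follows it.
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    return float(sims.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Illustrative random data only, not real speaker embeddings.
    utterance = rng.normal(size=(10, 192))
    print(temporal_consistency(utterance))
```

A score like this could be compared between bonafide and synthetic utterances to see whether the two groups separate; a learned sequence model over the same embeddings would be a natural next step.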

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Convolution · Non Maximum Suppression · 1x1 Convolution · SSD