Towards End-to-End Synthetic Speech Detection

Guang Hua; Andrew Beng Jin Teoh; Haijian Zhang

arXiv:2106.06341 · eess.AS · July 13, 2021 · IEEE Signal Processing Letters

TL;DR

This paper demonstrates that end-to-end deep neural networks operating directly on the time-domain waveform can detect synthetic speech effectively, outperforming pipelines built on hand-crafted features, and that the resulting model generalizes across datasets, including ASVspoof2015.

Contribution

The study introduces TSSDNet, a lightweight end-to-end neural network that outperforms state-of-the-art methods on the ASVspoof2019 challenge without a hand-crafted pre-transform or hand-crafted features.
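To make "end-to-end, time-domain, ResNet-style" concrete, the NumPy sketch below runs a forward pass only: a raw waveform goes through a convolutional stem, one residual block with an identity skip connection, max pooling, and a two-class softmax. It is not TSSDNet. The channel width, kernel size, pooling factor, single block, random untrained weights, class order, and the helper names `conv1d_same`, `residual_block`, `init_params`, and `forward` are all hypothetical choices for illustration. It only shows how standard components can stand in for a hand-crafted front end.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def relu(x):
    return np.maximum(x, 0.0)


def conv1d_same(x, w, b):
    """1D cross-correlation (as in DL conv layers) with zero 'same' padding.

    x: (C_in, T), w: (C_out, C_in, K) with odd K, b: (C_out,) -> (C_out, T)
    """
    k = w.shape[2]
    xp = np.pad(x, ((0, 0), (k // 2, k // 2)))
    windows = sliding_window_view(xp, k, axis=1)  # (C_in, T, K)
    return np.einsum("oik,itk->ot", w, windows) + b[:, None]


def residual_block(x, p):
    """ResNet-style block: two convolutions plus an identity skip connection."""
    h = relu(conv1d_same(x, p["w1"], p["b1"]))
    h = conv1d_same(h, p["w2"], p["b2"])
    return relu(x + h)


def max_pool1d(x, size):
    """Non-overlapping max pooling along time; drops any ragged tail."""
    t = (x.shape[1] // size) * size
    return x[:, :t].reshape(x.shape[0], -1, size).max(axis=2)


def init_params(rng, channels=16, kernel=7):
    """Random He-style initialization; all sizes are hypothetical."""
    def conv(c_out, c_in):
        std = np.sqrt(2.0 / (c_in * kernel))
        return rng.normal(0.0, std, (c_out, c_in, kernel)), np.zeros(c_out)

    p = {}
    p["w0"], p["b0"] = conv(channels, 1)           # stem: waveform -> feature maps
    p["w1"], p["b1"] = conv(channels, channels)
    p["w2"], p["b2"] = conv(channels, channels)
    p["wf"] = rng.normal(0.0, 0.1, (2, channels))  # two-class output layer
    p["bf"] = np.zeros(2)
    return p


def forward(waveform, p):
    """Raw 1D waveform -> softmax over two classes (class order is hypothetical)."""
    x = relu(conv1d_same(waveform[None, :], p["w0"], p["b0"]))
    x = residual_block(x, p)
    x = max_pool1d(x, 4)
    x = x.max(axis=1)                  # global max pooling over time -> (channels,)
    logits = p["wf"] @ x + p["bf"]
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()


rng = np.random.default_rng(0)
params = init_params(rng)
wave = rng.standard_normal(16000)  # stand-in for 1 s of 16 kHz audio
probs = forward(wave, params)
print(probs)  # two probabilities summing to 1; values depend on the random weights
```

Because the skip path adds `x` back unchanged, a block whose convolutions output zeros reduces to passing its (non-negative) input through, which is what lets residual networks stack many such blocks. Training, for example with a cross-entropy loss, is outside the scope of this sketch.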

Findings

1. TSSDNet outperforms existing methods on ASVspoof2019.
2. The model generalizes well to ASVspoof2015.
3. End-to-end DNNs have great potential for synthetic speech detection.
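ASVspoof2019 scored countermeasures with the tandem detection cost function (t-DCF) and the equal error rate (EER), so comparisons on that challenge are usually stated in those terms. The EER is the operating point where the false rejection rate on bona fide speech equals the false acceptance rate on spoofed speech. The sketch below computes it from two score lists, assuming higher scores mean "more likely bona fide". It is not the official ASVspoof evaluation script, `compute_eer` is a hypothetical name, and when no threshold makes the two rates exactly equal it averages them at the closest point.

```python
import numpy as np


def compute_eer(bona_scores, spoof_scores):
    """Equal error rate, assuming higher scores mean 'more likely bona fide'."""
    bona = np.asarray(bona_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    scores = np.concatenate([bona, spoof])
    labels = np.concatenate([np.ones(bona.size), np.zeros(spoof.size)])
    labels = labels[np.argsort(scores, kind="mergesort")]

    # Sweep the threshold upward past each sorted score; entry i rejects the
    # i lowest-scoring trials. Tied scores are not handled specially.
    frr = np.concatenate([[0.0], np.cumsum(labels) / bona.size])               # bona fide rejected
    far = np.concatenate([[1.0], 1.0 - np.cumsum(1.0 - labels) / spoof.size])  # spoof accepted
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2.0


print(compute_eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # 0.0 (perfectly separated)
print(compute_eer([0.4, 0.6], [0.5, 0.3]))            # 0.5
```

Lower is better: 0.0 means some threshold separates the two classes perfectly, while 0.5 corresponds to chance-level discrimination.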

Abstract

The constant Q transform (CQT) has been shown to be one of the most effective speech signal pre-transforms to facilitate synthetic speech detection, followed by either hand-crafted (subband) constant Q cepstral coefficient (CQCC) feature extraction and a back-end binary classifier, or a deep neural network (DNN) directly for further feature extraction and classification. Despite the rich literature on such a pipeline, we show in this paper that the pre-transform and hand-crafted features could simply be replaced by end-to-end DNNs. Specifically, we experimentally verify that by only using standard components, a light-weight neural network could outperform the state-of-the-art methods for the ASVspoof2019 challenge. The proposed model is termed Time-domain Synthetic Speech Detection Net (TSSDNet), having ResNet- or Inception-style structures. We further demonstrate that the proposed…
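For contrast, the CQT pre-transform that TSSDNet does away with maps a frame onto geometrically spaced frequency bins that all share one quality factor Q, so low-frequency bins get long windows and high-frequency bins get short ones. The direct-form sketch below computes one frame of it under assumed parameters (`f_min`, `bins_per_octave`, Hann windows, normalization by window length). `naive_cqt` is a hypothetical helper, far slower than the FFT-based CQT implementations used in practice, and it is not the front end of any specific baseline in the paper. CQCC extraction would further take the log power of these coefficients, resample it uniformly, and apply a DCT; those steps are omitted.

```python
import numpy as np


def naive_cqt(x, fs, f_min, n_bins, bins_per_octave):
    """Direct constant Q transform of one frame starting at x[0].

    Bin k is centred on f_min * 2**(k / bins_per_octave); every bin uses the
    same quality factor Q, so its window holds about Q cycles of its own frequency.
    """
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    longest = int(np.ceil(q * fs / f_min))  # the lowest bin needs the longest window
    if len(x) < longest:
        raise ValueError(f"need at least {longest} samples, got {len(x)}")

    coeffs = np.zeros(n_bins, dtype=complex)
    for k in range(n_bins):
        f_k = f_min * 2.0 ** (k / bins_per_octave)
        n_k = int(np.ceil(q * fs / f_k))           # window shrinks as f_k rises
        n = np.arange(n_k)
        kernel = np.hanning(n_k) * np.exp(-2j * np.pi * q * n / n_k)
        coeffs[k] = np.dot(x[:n_k], kernel) / n_k  # length-normalized correlation
    return coeffs


fs = 16000
t = np.arange(4000) / fs
tone = np.sin(2 * np.pi * 400.0 * t)  # 400 Hz = f_min * 2**(24/12)
coeffs = naive_cqt(tone, fs, f_min=100.0, n_bins=48, bins_per_octave=12)
print(int(np.argmax(np.abs(coeffs))))  # 24
```

The 400 Hz tone peaks in bin 24 because 400 = 100 × 2^(24/12); neighbouring bins see it only through window leakage. An end-to-end model such as TSSDNet instead learns its own filters from the raw samples.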
