An End-to-End Multi-Module Audio Deepfake Generation System for ADD Challenge 2023

Sheng Zhao; Qilong Yuan; Yibo Duan; Zhuoyue Chen

arXiv:2307.00729 · cs.SD · July 4, 2023

TL;DR

This paper presents an end-to-end multi-module synthetic speech generation system that combines a speaker encoder, a Tacotron2-based synthesizer, and a WaveRNN-based vocoder. The system placed first in Track 1.1 of the ADD 2023 challenge with a weighted deception success rate (WDSR) of 44.97%.

Contribution

It introduces an integrated multi-module model for synthetic speech generation and demonstrates its effectiveness through comparative experiments across datasets and model structures and a first-place finish in Track 1.1 of the ADD 2023 challenge.

Findings

01

Achieved a WDSR of 44.97% in Track 1.1 of the ADD 2023 challenge

02

Ran extensive comparative experiments across different datasets and model structures

03

Ranked first among ADD 2023 Track 1.1 entries by weighted deception success rate

Abstract

The task of synthetic speech generation is to generate language content from a given text and then simulate a fake human voice. The key factors that determine the quality of synthetic speech generation include generation speed, word segmentation accuracy, and the naturalness of the synthesized speech. This paper builds an end-to-end multi-module synthetic speech generation model consisting of a speaker encoder, a Tacotron2-based synthesizer, and a WaveRNN-based vocoder. In addition, we perform extensive comparative experiments on different datasets and various model structures. Finally, we won first place in Track 1.1 of the ADD 2023 challenge with a weighted deception success rate (WDSR) of 44.97%.
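
The three-module data flow named in the abstract (speaker encoder, then synthesizer, then vocoder) can be sketched roughly as follows. This is only a toy illustration of how the stages hand data to one another, not the authors' implementation: every function body is a placeholder computation standing in for a trained network, and the names `speaker_encoder`, `synthesizer`, `vocoder`, `generate`, and the constants `EMBED_DIM`, `N_MELS`, and `HOP` are hypothetical and do not come from the paper.

```python
import math
from typing import List

# All sizes below are hypothetical, chosen only to keep the sketch small.
EMBED_DIM = 8   # length of the speaker embedding
N_MELS = 4      # number of mel bands per spectrogram frame
HOP = 5         # waveform samples produced per mel frame


def speaker_encoder(reference_audio: List[float]) -> List[float]:
    """Placeholder for a speaker encoder: reference audio -> unit-norm embedding."""
    chunk = max(1, len(reference_audio) // EMBED_DIM)
    raw = [sum(reference_audio[i * chunk:(i + 1) * chunk]) / chunk
           for i in range(EMBED_DIM)]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]


def synthesizer(text: str, speaker_embedding: List[float]) -> List[List[float]]:
    """Placeholder for a Tacotron2-style synthesizer: text + embedding -> mel frames.

    Emits one frame per character; a real model predicts a variable number
    of frames with attention.
    """
    bias = sum(speaker_embedding) / len(speaker_embedding)
    frames = []
    for ch in text:
        code = (ord(ch) % 32) / 32.0
        frames.append([math.tanh(code * (band + 1) + bias)
                       for band in range(N_MELS)])
    return frames


def vocoder(mel: List[List[float]]) -> List[float]:
    """Placeholder for a WaveRNN-style vocoder: mel frames -> waveform samples in [-1, 1]."""
    wave: List[float] = []
    for frame in mel:
        level = sum(frame) / len(frame)
        wave.extend(math.sin(level * (k + 1)) for k in range(HOP))
    return wave


def generate(text: str, reference_audio: List[float]) -> List[float]:
    """Chain the three modules: the embedding conditions the synthesizer, the mel feeds the vocoder."""
    embedding = speaker_encoder(reference_audio)
    mel = synthesizer(text, embedding)
    return vocoder(mel)


if __name__ == "__main__":
    reference = [math.sin(0.1 * i) for i in range(160)]  # fake reference audio
    waveform = generate("hello", reference)
    print(len(waveform), "samples generated")
```

The point of the sketch is the interface between stages: the speaker embedding is computed once from reference audio and passed as conditioning to the synthesizer, and the vocoder sees only the mel frames, which is what allows each module to be swapped or retrained separately in comparative experiments.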

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Softmax · Tanh Activation · Sigmoid Activation · WaveRNN