The MSXF TTS System for ICASSP 2022 ADD Challenge

Chunyong Yang; Pengfei Liu; Yanli Chen; Hongbin Wang; Min Liu

arXiv:2201.11400·cs.SD·January 28, 2022

TL;DR

This paper introduces an end-to-end TTS system using VITS and wav2vec 2.0 for Task 3.1 of the ICASSP 2022 ADD Challenge, examines how speech speed and volume affect its ability to fool the detector, and reports a fourth-place finish.

Contribution

The paper presents a TTS system trained with an added constraint loss and analyzes how speech speed and volume influence the spoofing of audio deep synthesis detectors.

Findings

01

Faster speech contains less silence, making it easier to fool the detector.

02

Lower volume improves spoofing ability, though volume was normalized for the submission.

03

Achieved fourth place in the ADD Challenge.

Abstract

This paper presents our MSXF TTS system for Task 3.1 of the Audio Deep Synthesis Detection (ADD) Challenge 2022. We use an end-to-end text-to-speech system and add a constraint loss to it during the training stage. The end-to-end TTS system is VITS, and the pre-trained self-supervised model is wav2vec 2.0. We also explore the influence of speech speed and volume on spoofing. Faster speech means less silence in the audio, which makes it easier to fool the detector. We also find that the lower the volume, the better the spoofing ability, though we normalize the volume for submission. Our team is identified as C2, and we took fourth place in the challenge.
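
The two audio attributes discussed in the abstract, silence proportion and volume, can be measured with a few lines of NumPy. The sketch below is illustrative only: the abstract does not say how the authors measured silence or normalized volume, so the function names, the 25 ms frame size, the -40 dB threshold, and the RMS target are all assumptions rather than the paper's method.

```python
import numpy as np


def silence_fraction(wave, sample_rate, frame_ms=25.0, threshold_db=-40.0):
    """Fraction of frames whose RMS is more than |threshold_db| below the loudest frame.

    Hypothetical helper: frame size and threshold are illustrative choices.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000.0))
    n_frames = len(wave) // frame_len
    if n_frames == 0:
        return 0.0
    # Split the signal into non-overlapping frames and compute per-frame RMS.
    frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    peak = rms.max()
    if peak == 0.0:
        return 1.0  # an all-zero signal is entirely silent
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-12) / peak)
    return float(np.mean(level_db < threshold_db))


def normalize_rms(wave, target_rms=0.05):
    """Scale a waveform to a target RMS level (one common way to normalize volume)."""
    rms = np.sqrt(np.mean(wave ** 2))
    if rms == 0.0:
        return wave.copy()
    # Clip to the valid floating-point audio range after scaling.
    return np.clip(wave * (target_rms / rms), -1.0, 1.0)


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr // 2) / sr
    speech_like = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)  # stand-in for voiced audio
    padded = np.concatenate([speech_like, np.zeros(sr // 2)])  # half signal, half silence
    print("silence fraction:", silence_fraction(padded, sr))
    print("RMS after normalization:", np.sqrt(np.mean(normalize_rms(padded) ** 2)))
```

Under these assumptions, a faster utterance of the same text would typically show a lower silence fraction, and two copies of a clip that differ only in gain become identical after normalization.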

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: VITS · wav2vec 2.0