Time Domain Adversarial Voice Conversion for ADD 2022
Cheng Wen, Tingwei Guo, Xingjun Tan, Rui Yan, Shuran Zhou, Chuandong Xie, Wei Zou, Xiangang Li

TL;DR
This paper presents a time domain adversarial voice conversion system for the ADD 2022 challenge, capable of generating convincing fake speech that can deceive anti-spoofing detectors while maintaining good audio quality.
Contribution
The authors introduce a novel time domain post-processing step for voice conversion that enhances deception ability against anti-spoofing detectors.
Findings
System ranks top in ADD 2022 Track 3.1
Demonstrates strong adversarial ability against anti-spoofing detectors
Maintains acceptable audio quality and speaker similarity
Abstract
In this paper, we describe our speech generation system for the first Audio Deep Synthesis Detection Challenge (ADD 2022). First, we build an any-to-many voice conversion (VC) system to convert source speech with arbitrary language content into the target speaker's fake speech. The converted speech generated by the VC system is then post-processed in the time domain to improve its deception ability. The experimental results show that our system has adversarial ability against anti-spoofing detectors at the cost of a small compromise in audio quality and speaker similarity. This system ranks top in Track 3.1 of ADD 2022, showing that our method also generalizes well across different detectors.
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
