Exposing AI-Synthesized Human Voices Using Neural Vocoder Artifacts
Chengzhe Sun, Shan Jia, Shuwei Hou, Ehab AlBadawy, Siwei Lyu

TL;DR
This paper presents a novel detection method for AI-synthesized human voices by identifying neural vocoder artifacts using a multi-task learning framework, significantly improving classification accuracy.
Contribution
It introduces a multi-task learning approach that leverages vocoder artifact detection to enhance synthetic voice detection performance.
Findings
The proposed model achieves high classification accuracy.
Vocoder artifact detection improves synthetic voice discrimination.
Multi-task learning constrains feature extraction for better results.
Abstract
The advancements of AI-synthesized human voices have introduced a growing threat of impersonation and disinformation. It is therefore of practical importance to develop detection methods for synthetic human voices. This work proposes a new approach to detect synthetic human voices based on identifying artifacts of neural vocoders in audio signals. A neural vocoder is a specially designed neural network that synthesizes waveforms from temporal-frequency representations, e.g., mel-spectrograms. The neural vocoder is a core component in most DeepFake audio synthesis models. Hence, the identification of neural vocoder processing implies that an audio sample may have been synthesized. To take advantage of the vocoder artifacts for synthetic human voice detection, we introduce a multi-task learning framework for a binary-class RawNet2 model that shares the front-end feature extractor with a…
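As a rough illustration of the multi-task idea in the abstract, the Python sketch below combines a binary real/synthetic loss with an auxiliary vocoder-identification loss, both computed from outputs that would sit on top of one shared feature extractor. The function names, the aux_weight parameter, and the weighted sum of two cross-entropy terms are assumptions made for illustration; the truncated abstract does not state the paper's actual loss or architecture details.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class index."""
    return -math.log(softmax(logits)[target])


def multitask_loss(binary_logits, is_synthetic, vocoder_logits, vocoder_id,
                   aux_weight=0.5):
    """Hypothetical joint objective: main binary loss plus a weighted
    auxiliary vocoder-identification loss.

    binary_logits  : two scores (real, synthetic) from the binary head
    is_synthetic   : 0 for real, 1 for synthetic
    vocoder_logits : one score per candidate vocoder from the auxiliary head
    vocoder_id     : index of the vocoder label for this sample
    aux_weight     : assumed trade-off weight, not taken from the paper
    """
    main = cross_entropy(binary_logits, is_synthetic)
    aux = cross_entropy(vocoder_logits, vocoder_id)
    return main + aux_weight * aux


# Example call with made-up head outputs for a single synthetic sample.
example_loss = multitask_loss([0.2, 1.5], 1, [0.1, 2.0, -1.0], 1)
```

Because both terms depend on the same shared features, minimizing the sum pushes the feature extractor toward representations that also separate vocoders, which is the sense in which the auxiliary task constrains feature extraction. Setting aux_weight to 0 reduces the sketch to an ordinary binary classifier loss.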
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
