Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation

Kuiyuan Zhang, Zhongyun Hua, Yushu Zhang, Yifang Guo, and Tao Xiang

arXiv:2411.09167·cs.SD·November 15, 2024

TL;DR

This paper introduces a robust deepfake speech detection method that uses feature decomposition and synthesizer feature augmentation to improve detection accuracy across unseen synthesizers.

Contribution

The paper proposes a dual-stream feature decomposition learning framework with a synthesizer stream and a content stream, using pseudo-labeling and adversarial training so that the content stream learns synthesizer-independent features.
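As a rough illustration of how a dual-stream objective with an adversarial term can be composed, here is a minimal sketch in plain Python. It is not the authors' implementation: the layer sizes, the names `dual_stream_losses`, `adv_weight`, and the `params` keys, and the exact way the losses are combined are all assumptions made for this example, and the pseudo-labeling step is not modeled because this page does not describe it.

```python
import math
import random

def linear(x, w, b):
    """Dense layer: w holds one weight row per output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])

def init(n_out, n_in, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def dual_stream_losses(h, syn_label, is_fake, params, adv_weight=0.5):
    """Compose the losses for one utterance representation h (assumed layout)."""
    # Two streams decompose the shared representation h.
    f_syn = linear(h, *params["syn_stream"])  # synthesizer features
    f_con = linear(h, *params["con_stream"])  # content features

    # Supervised: the synthesizer stream should predict which synthesizer was used.
    loss_syn = cross_entropy(linear(f_syn, *params["syn_head"]), syn_label)

    # Adversarial: a head tries to recover the synthesizer from content features.
    # The head itself would minimize loss_adv; the content stream is trained to
    # increase it, which is why it enters the encoder objective with a minus sign.
    loss_adv = cross_entropy(linear(f_con, *params["adv_head"]), syn_label)

    # Detection: real/fake decision from the concatenated features.
    loss_det = cross_entropy(linear(f_syn + f_con, *params["det_head"]), is_fake)

    encoder_objective = loss_det + loss_syn - adv_weight * loss_adv
    return {"syn": loss_syn, "adv": loss_adv, "det": loss_det,
            "encoder_objective": encoder_objective}

rng = random.Random(0)
params = {
    "syn_stream": init(4, 8, rng),  # 8-dim representation -> 4-dim features
    "con_stream": init(4, 8, rng),
    "syn_head": init(3, 4, rng),    # 3 hypothetical synthesizer classes
    "adv_head": init(3, 4, rng),
    "det_head": init(2, 8, rng),    # real vs. fake on concatenated features
}
h = [rng.uniform(-1.0, 1.0) for _ in range(8)]
losses = dual_stream_losses(h, syn_label=1, is_fake=1, params=params)
print(losses)
```

In an autograd framework the minus sign is usually realized with a gradient reversal layer or alternating updates, so that the adversarial head and the content stream pull in opposite directions; which of these the paper uses is not stated here.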

Findings

01. Enhanced detection robustness against unseen synthesizers

02. Effective feature augmentation improves model generalization

03. Combines synthesizer and content features for better classification

Abstract

AI-synthesized speech, also known as deepfake speech, has recently raised significant concerns due to the rapid advancement of speech synthesis and speech conversion techniques. Previous works often rely on distinguishing synthesizer artifacts to identify deepfake speech. However, excessive reliance on these specific synthesizer artifacts may result in unsatisfactory performance when addressing speech signals created by unseen synthesizers. In this paper, we propose a robust deepfake speech detection method that employs feature decomposition to learn synthesizer-independent content features as complementary for detection. Specifically, we propose a dual-stream feature decomposition learning strategy that decomposes the learned speech representation using a synthesizer stream and a content stream. The synthesizer stream specializes in learning synthesizer features through supervised…

Taxonomy

Topics: Speech Recognition and Synthesis