Towards Generalized Source Tracing for Codec-Based Deepfake Speech

Xuanjun Chen; I-Ming Lin; Lin Zhang; Haibin Wu; Hung-yi Lee; Jyh-Shing Roger Jang

arXiv:2506.07294 · cs.SD · August 19, 2025

TL;DR

This paper introduces SASTNet, a model that combines semantic and acoustic features to improve source tracing of codec-based deepfake speech, achieving state-of-the-art results on the CoSG test set of the CodecFake+ dataset.

Contribution

The paper proposes SASTNet, a source tracing model that is trained on simulated CoSG data yet generalizes well to real CoSG-generated deepfake speech.

Findings

01

SASTNet outperforms previous methods on the CoSG test set of the CodecFake+ dataset.

02

Joint semantic and acoustic features enhance generalization.

03

Model maintains high accuracy on unseen deepfake audio.

Abstract

Recent attempts at source tracing for codec-based deepfake speech (CodecFake), generated by neural audio codec-based speech generation (CoSG) models, have exhibited suboptimal performance. In particular, training source tracing models on simulated CoSG data while maintaining strong performance on real CoSG-generated audio remains an open challenge. In this paper, we show that models trained solely on codec-resynthesized data tend to overfit to non-speech regions and struggle to generalize to unseen content. To mitigate these challenges, we introduce the Semantic-Acoustic Source Tracing Network (SASTNet), which jointly leverages Whisper for semantic feature encoding and Wav2vec2 with AudioMAE for acoustic feature encoding. Our proposed SASTNet achieves state-of-the-art performance on the CoSG test set of the CodecFake+ dataset, demonstrating its effectiveness for reliable source tracing.
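
The abstract names three encoders but does not say how their outputs are combined. As a rough illustration only, the sketch below assumes late fusion by concatenation: frame-level features from each encoder are mean-pooled, concatenated, and passed to a linear softmax head over candidate source classes. Random vectors stand in for real Whisper, Wav2vec2, and AudioMAE outputs, and every name, dimension, and class count here is hypothetical rather than taken from the paper.

```python
import math
import random


def mean_pool(frames):
    """Average a list of frame vectors (time x dim) into one utterance vector."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def fuse(semantic_frames, acoustic_frames_a, acoustic_frames_b):
    """Concatenate pooled embeddings (late fusion; an assumption, not the paper's design)."""
    return (mean_pool(semantic_frames)
            + mean_pool(acoustic_frames_a)
            + mean_pool(acoustic_frames_b))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, biases):
    """Linear head: one weight row and one bias per candidate source class."""
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


def random_frames(rng, n_frames, dim):
    """Placeholder features; a real system would call the pretrained encoders."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_frames)]


rng = random.Random(0)
semantic = random_frames(rng, 5, 4)    # stand-in for Whisper encoder frames
acoustic_a = random_frames(rng, 7, 3)  # stand-in for Wav2vec2 frames
acoustic_b = random_frames(rng, 6, 2)  # stand-in for AudioMAE frames

fused = fuse(semantic, acoustic_a, acoustic_b)

num_classes = 3  # hypothetical number of codec source classes
weights = [[rng.uniform(-0.5, 0.5) for _ in range(len(fused))]
           for _ in range(num_classes)]  # untrained, random head
biases = [0.0] * num_classes

probs = classify(fused, weights, biases)
predicted = max(range(num_classes), key=lambda k: probs[k])
```

Pooling before concatenation lets the three streams have different frame rates and feature widths, which is why the stand-in inputs above use different lengths and dimensions. The head is untrained, so the predicted class is meaningless; the sketch shows only the data flow.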

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques

Methods: Sparse Evolutionary Training