TL;DR
This paper introduces RTCFake, a large-scale speech deepfake dataset for real-time communication, and proposes a phoneme-guided consistency learning strategy to improve detection robustness across platforms and noise conditions.
Contribution
It presents the first RTC-specific speech deepfake dataset and a novel phoneme-guided consistency learning (PCL) method that improves cross-platform and noise-robust detection performance.
Findings
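The RTCFake dataset contains approximately 600 hours of speech, collected by transmitting audio through multiple mainstream social media and conferencing platforms.
The PCL strategy significantly improves the cross-platform generalization of deepfake detectors.
The approach also improves robustness to complex noise and to unknown speech enhancement processes such as noise suppression.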
Abstract
With the rapid advancement of speech generation technologies, the threat posed by speech deepfakes in real-time communication (RTC) scenarios has intensified. However, existing detection studies mainly focus on offline simulations and struggle to cope with the complex distortions introduced during RTC transmission, including unknown speech enhancement processes (e.g., noise suppression) and codec compression. To address this challenge, we present the first large-scale speech deepfake dataset tailored for RTC scenarios, termed RTCFake, totaling approximately 600 hours. The dataset is constructed by transmitting speech through multiple mainstream social media and conferencing platforms (e.g., Zoom), enabling precise pairing between offline and online speech. In addition, we propose a phoneme-guided consistency learning (PCL) strategy that enforces models to learn…
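The abstract is cut off before it states the PCL objective, so the sketch below is not the paper's method. It is only a generic illustration of what a consistency term over paired offline/online speech could look like: frame features are averaged within each phoneme segment, and the loss penalizes disagreement between the offline and online versions of the same phoneme. The function names, the cosine-distance choice, the assumption that both recordings are frame-aligned, and the assumption that phoneme boundaries are available as (start, end) frame indices are all hypothetical.

```python
import math


def pool_by_phoneme(frames, segments):
    """Average the frame vectors inside each (start, end) phoneme segment.

    Assumes every segment is non-empty and lies within the frame range.
    """
    pooled = []
    for start, end in segments:
        chunk = frames[start:end]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def consistency_loss(offline_frames, online_frames, segments):
    """Mean cosine distance between phoneme-pooled offline and online features.

    Hypothetical objective for illustration only; not taken from the paper.
    """
    off = pool_by_phoneme(offline_frames, segments)
    on = pool_by_phoneme(online_frames, segments)
    return sum(1.0 - cosine(u, v) for u, v in zip(off, on)) / len(segments)


# Toy example: four 2-dimensional frames covering two phoneme segments.
offline = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
segments = [(0, 2), (2, 4)]
same_loss = consistency_loss(offline, offline, segments)
```

In a training setup of this kind, such a term would typically be added to the usual real/fake classification loss so that features stay stable after transmission; whether the paper combines its losses this way is not stated in the truncated abstract.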
