TL;DR
This paper introduces the WST-X series, a family of feature extractors for speech deepfake detection built on the wavelet scattering transform (WST). It aims to combine the interpretability of hand-crafted features with the higher-level information captured by self-supervised learning (SSL) features, and it outperforms existing front-ends on multiple benchmarks.
Contribution
The WST-X series is a novel family of feature extractors that uses the wavelet scattering transform to merge the advantages of hand-crafted and self-supervised learning (SSL) features for speech deepfake detection.
Findings
WST-X outperforms existing front-ends on the Deepfake-Eval-2024 benchmark.
A small averaging scale ($J$) and high-frequency resolutions ($Q$, $L$) are key for detecting subtle artifacts.
Stable, translation-invariant features are crucial for effective speech deepfake detection.
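For readers unfamiliar with the symbols above, the scattering coefficients can be written in the standard form below. This is a sketch that assumes the usual conventions of the scattering literature rather than the paper's exact notation: $\phi_J$ is a low-pass filter averaging over roughly $2^J$ samples, $\psi_{\lambda}$ are band-pass wavelets with $Q$ wavelets per octave, and $L$ is assumed to denote the number of wavelet orientations when the transform is applied in 2D (for example, to a time-frequency representation).

$$
\begin{aligned}
S_0 x(t) &= (x * \phi_J)(t), \\
S_1 x(t, \lambda_1) &= \big(\,|x * \psi_{\lambda_1}| * \phi_J\big)(t), \\
S_2 x(t, \lambda_1, \lambda_2) &= \big(\,\big|\,|x * \psi_{\lambda_1}| * \psi_{\lambda_2}\big| * \phi_J\big)(t).
\end{aligned}
$$

Under these conventions, a smaller $J$ means a shorter averaging window, so less local detail is smoothed away, while larger $Q$ (and $L$ in 2D) means a finer tiling of frequency (and orientation).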
Abstract
In this work, we focus on front-end design for speech deepfake detectors, the component that determines the discriminative acoustic cues provided to the classifier. Existing approaches are primarily categorized into two types. Hand-crafted filterbank features are transparent but limited in capturing higher-level information. Self-supervised learning (SSL) features, in turn, lack interpretability and may overlook fine-grained spectral anomalies. We propose the WST-X series, a novel family of feature extractors that combines the best of both worlds via the wavelet scattering transform (WST), which cascades wavelet convolutions with modulus nonlinearities to produce deformation-stable, multi-scale features. Experiments on the recent Deepfake-Eval-2024 benchmark, together with cross-dataset evaluations on SpoofCeleb and In-the-Wild, show that WST-X outperforms existing front-ends by a wide margin. Our analysis…
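The cascade described in the abstract (wavelet convolution, modulus, then averaging) can be illustrated with a minimal NumPy sketch of a 1D scattering transform up to second order. This is not the WST-X implementation: the Gaussian band-pass filters, the constants `0.4` and the bandwidth rule, and the helper names `filter_bank` and `scattering` are all assumptions chosen for illustration, and the second-order path selection is simplified.

```python
import numpy as np

def filter_bank(n, J, Q):
    """Hypothetical helper: frequency-domain filters, Q band-pass filters per octave over J octaves."""
    f = np.fft.fftfreq(n)                       # normalized frequency in cycles/sample
    xi = 0.4 * 2.0 ** (-np.arange(J * Q) / Q)   # geometrically spaced centre frequencies (assumed)
    sigma = xi * (1.0 - 2.0 ** (-1.0 / Q)) / 2  # bandwidth proportional to centre frequency (assumed)
    psi = np.exp(-0.5 * ((f[None, :] - xi[:, None]) / sigma[:, None]) ** 2)
    phi = np.exp(-0.5 * (f / (0.4 / 2 ** J)) ** 2)  # low-pass filter, averaging scale about 2**J samples
    return psi, phi

def scattering(x, J=6, Q=4):
    """Zeroth-, first- and second-order scattering coefficients of a 1D signal."""
    psi, phi = filter_bank(len(x), J, Q)

    def conv(u, h):                             # circular convolution via the FFT
        return np.fft.ifft(np.fft.fft(u, axis=-1) * h, axis=-1)

    def average(u):                             # low-pass filter, then subsample by 2**J
        return np.real(conv(u, phi))[..., :: 2 ** J]

    s0 = average(x[None, :])                    # order 0: local average of the signal
    u1 = np.abs(conv(x[None, :], psi))          # order 1: wavelet convolution + modulus
    s1 = average(u1)
    # Order 2: re-filter each envelope with lower-frequency wavelets only (a simplification).
    u2 = [np.abs(conv(u1[k][None, :], psi[k + 1:])) for k in range(len(psi) - 1)]
    s2 = average(np.concatenate(u2, axis=0))
    return np.concatenate([s0, s1, s2], axis=0)  # shape: (paths, time frames)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
feats = scattering(x)
shifted = scattering(np.roll(x, 4))             # same signal, circularly shifted by 4 samples

rel_feat = np.linalg.norm(feats - shifted) / np.linalg.norm(feats)
rel_raw = np.linalg.norm(x - np.roll(x, 4)) / np.linalg.norm(x)
print(feats.shape, rel_feat, rel_raw)
```

The final comparison shows the stability property in miniature: a shift much smaller than the averaging window $2^J$ changes the raw waveform substantially but changes the scattering features only slightly. For real experiments, an established library such as Kymatio would be a more sensible starting point than this sketch.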
