Self-supervised pre-training with acoustic configurations for replay spoofing detection
Hye-jin Shim, Hee-Soo Heo, Jee-weon Jung, and Ha-Jin Yu

TL;DR
This paper introduces a self-supervised pretraining method for replay spoofing detection that leverages acoustic configurations from existing datasets, improving detection accuracy without extensive physical data collection.
Contribution
The paper proposes a self-supervised framework that learns acoustic configurations, enabling pretraining for replay spoofing detection on datasets published for other tasks, such as speaker verification.
Findings
Outperforms the baseline by 30% in detection accuracy on the ASVspoof 2019 physical access dataset
Effective use of existing datasets for pretraining
Improves robustness in replay spoofing detection
Abstract
Constructing a dataset for replay spoofing detection requires physically playing back an utterance and re-recording it, which makes large-scale data collection difficult. In this study, we propose a self-supervised framework for pretraining on acoustic configurations using datasets published for other tasks, such as speaker verification. Here, acoustic configurations refer to the environmental factors introduced during voice recording, as opposed to the voice itself, including microphone type, recording location, and ambient noise level. Specifically, we select pairs of segments from utterances and train deep neural networks to determine whether the acoustic configurations of the two segments are identical. We validate the effectiveness of the proposed method on the ASVspoof 2019 physical access dataset using two well-performing systems. The experimental results…
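The pair-selection step described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that two segments cut from the same utterance share an acoustic configuration (label 1) and that segments from different utterances do not (label 0). That labeling rule, the 50/50 sampling ratio, the fixed segment length, and all names (`sample_segment`, `sample_pair`, `utt_lengths`) are hypothetical choices made for the example.

```python
import random


def sample_segment(utt_len, seg_len, rng):
    """Return (start, end) frame indices of a random fixed-length segment."""
    start = rng.randrange(utt_len - seg_len + 1)
    return start, start + seg_len


def sample_pair(utt_lengths, seg_len, rng):
    """Sample one training pair of segments.

    Returns ((utt_a, span_a), (utt_b, span_b), label). The label is 1 when
    both segments come from the same utterance and 0 otherwise. Treating
    "same utterance" as "same acoustic configuration" is an assumption of
    this sketch.
    """
    utt_ids = sorted(utt_lengths)
    same = rng.random() < 0.5  # assumed balanced positive/negative sampling
    utt_a = rng.choice(utt_ids)
    if same:
        utt_b = utt_a
    else:
        utt_b = rng.choice([u for u in utt_ids if u != utt_a])
    span_a = sample_segment(utt_lengths[utt_a], seg_len, rng)
    span_b = sample_segment(utt_lengths[utt_b], seg_len, rng)
    return (utt_a, span_a), (utt_b, span_b), int(same)


# Toy corpus: utterance id -> length in frames (made-up values).
rng = random.Random(0)
utt_lengths = {"utt1": 400, "utt2": 350, "utt3": 500}
pairs = [sample_pair(utt_lengths, seg_len=200, rng=rng) for _ in range(8)]
```

Each pair would then be fed to a network trained with a binary objective on the label. The features and model architecture are outside the scope of this sketch.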
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
