TL;DR
This paper introduces a deep-learning system for voice type discrimination (VTD) whose initial layer consists of learnable spectro-temporal receptive fields (STRFs). The system improves on a competitive baseline for VTD and also performs strongly on the ASVspoof 2019 spoofing detection task.
Contribution
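The paper proposes a deep-learning VTD system with an initial layer of learnable STRFs and compares learnable STRFs against static STRFs and unconstrained kernels on a new standardized VTD database. The approach is also evaluated on the ASVspoof 2019 spoofing detection task.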
Findings
Learnable STRFs outperform static STRFs in VTD.
The system consistently improves on a competitive baseline across a range of signal-to-noise ratios (SNRs).
Effective in spoofing detection with distractor noise.
Abstract
Voice Type Discrimination (VTD) refers to distinguishing regions in a recording where speech was produced by speakers physically near the recording device ("Live Speech") from regions containing speech and other types of audio that were played back, such as traffic noise and television broadcasts ("Distractor Audio"). In this work, we propose a deep-learning-based VTD system that features an initial layer of learnable spectro-temporal receptive fields (STRFs). Our approach is also shown to provide very strong performance on a similar spoofing detection task in the ASVspoof 2019 challenge. We evaluate our approach on a new standardized VTD database that was collected to support research in this area. In particular, we study the effect of using learnable STRFs compared to static STRFs or unconstrained kernels. We also show that our system consistently improves a competitive…
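The abstract does not say how the STRF kernels are parameterized, so the following is only a rough illustration of what an initial STRF layer might look like, not the authors' implementation. It assumes a Gabor-style kernel (a Gaussian envelope times a cosine carrier) controlled by a temporal modulation rate in Hz and a spectral modulation scale in cycles per octave. The function names `gabor_strf` and `apply_strfs`, the kernel size, the frame rate, and the bins-per-octave value are all hypothetical choices made for this sketch.

```python
import numpy as np

def gabor_strf(rate_hz, scale_cpo, n_time=21, n_freq=11,
               frame_rate=100.0, bins_per_octave=12.0):
    """Build one 2-D spectro-temporal kernel of shape (n_freq, n_time).

    Assumed parameterization (not taken from the paper):
    rate_hz   - temporal modulation rate in Hz
    scale_cpo - spectral modulation scale in cycles per octave
    """
    # Time axis in seconds and frequency axis in octaves, centered on zero.
    t = (np.arange(n_time) - n_time // 2) / frame_rate
    f = (np.arange(n_freq) - n_freq // 2) / bins_per_octave
    ff, tt = np.meshgrid(f, t, indexing="ij")
    # Gaussian envelope whose widths are tied to the kernel support
    # (an illustrative choice).
    sigma_t = t.max() / 2.0
    sigma_f = f.max() / 2.0
    envelope = np.exp(-0.5 * ((tt / sigma_t) ** 2 + (ff / sigma_f) ** 2))
    # Cosine carrier that oscillates along both time and frequency.
    carrier = np.cos(2.0 * np.pi * (rate_hz * tt + scale_cpo * ff))
    kernel = envelope * carrier
    return kernel / np.linalg.norm(kernel)

def apply_strfs(spectrogram, params):
    """Correlate a (freq, time) spectrogram with one kernel per (rate, scale) pair.

    Returns an array of shape (len(params), n_freq_bins, n_frames).
    """
    outputs = []
    for rate_hz, scale_cpo in params:
        k = gabor_strf(rate_hz, scale_cpo)
        kf, kt = k.shape
        # Zero-pad so the output has the same shape as the input.
        padded = np.pad(spectrogram, ((kf // 2, kf // 2), (kt // 2, kt // 2)))
        windows = np.lib.stride_tricks.sliding_window_view(padded, (kf, kt))
        outputs.append(np.einsum("ftij,ij->ft", windows, k))
    return np.stack(outputs)

# Example: two kernels applied to a random stand-in for a log-spectrogram.
rng = np.random.default_rng(0)
spec = rng.random((40, 100))
features = apply_strfs(spec, [(4.0, 1.0), (8.0, 0.5)])
```

Under this assumed parameterization, a static STRF layer would keep each (rate, scale) pair fixed, a learnable STRF layer would treat those pairs as trainable parameters updated by gradient descent in a deep-learning framework, and an unconstrained layer would learn every kernel entry directly, as in an ordinary 2-D convolution.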
