A Multi-Resolution Front-End for End-to-End Speech Anti-Spoofing
Wei Liu, Meng Sun, Xiongwei Zhang, Hugo Van hamme, Thomas Fang Zheng

TL;DR
This paper introduces a multi-resolution front-end that automatically learns optimal combinations of time-frequency resolutions for speech anti-spoofing, improving classification performance while reducing model complexity.
Contribution
It proposes a learnable multi-resolution feature extraction method with automatic weighting and pruning, enhancing end-to-end speech anti-spoofing systems.
Findings
Outperforms baseline methods on the ASVspoof 2019 dataset
Automatically learns optimal time-frequency resolution combinations
Reduces model complexity through pruning
Abstract
The choice of an optimal time-frequency resolution is usually a difficult but important step in tasks involving speech signal classification, e.g., speech anti-spoofing. The variation in performance across different choices of time-frequency resolution can be as large as that across different model architectures, which makes it difficult to judge where an improvement actually comes from when a new network architecture is introduced as the classifier. In this paper, we propose a multi-resolution front-end for feature extraction in an end-to-end classification framework. Optimal weighted combinations of multiple time-frequency resolutions will be learned automatically given the objective of a classification task. Features extracted with different time-frequency resolutions are weighted and concatenated as inputs to the successive networks, where the weights are predicted by…
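To make the "weighted and concatenated" step concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the window lengths, hop size, FFT size, concatenation axis, and the helper names `stft_magnitude` and `multi_resolution_features` are all assumptions chosen for illustration. Because the abstract is cut off before it says how the weights are predicted, the sketch simply uses a softmax over fixed, hypothetical logits in place of a learned predictor.

```python
import numpy as np

def stft_magnitude(signal, win_length, hop_length, n_fft):
    """Magnitude spectrogram with a Hann window, shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    # Frames shorter than n_fft are zero-padded, so every resolution
    # yields the same number of frequency bins.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def multi_resolution_features(signal, win_lengths, weights,
                              hop_length=160, n_fft=1024):
    """Scale each resolution by its weight and concatenate along frequency."""
    specs = [stft_magnitude(signal, w, hop_length, n_fft) for w in win_lengths]
    # A shared hop keeps frames aligned; trim to the shortest branch.
    n_frames = min(s.shape[0] for s in specs)
    weighted = [wt * s[:n_frames] for wt, s in zip(weights, specs)]
    return np.concatenate(weighted, axis=1)

# Hypothetical setup: 1 s of noise at 16 kHz and three window lengths.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)
win_lengths = [256, 512, 1024]           # short windows: finer time resolution
logits = np.array([0.2, 1.0, -0.5])      # stand-in for the learned predictor
weights = softmax(logits)

features = multi_resolution_features(signal, win_lengths, weights)
print(features.shape)
```

In this sketch a resolution whose weight is close to zero contributes almost nothing to the concatenated input, which is one way to see why a low-weight branch might be a candidate for removal. The criterion the paper actually uses for pruning is not stated in the text above.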
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Infant Health and Development
